Welcome to V3ALab (Vision, Ask, Answer, Act)! Artificial Intelligence (AI) already powers many tasks in our everyday lives, but there is still a long way to go before we reach human-level AI. One of the biggest obstacles is enabling humans to communicate with AI effectively so as to elicit an appropriate response, whether a textual answer or an action. To this end, V3ALab aims to develop AI agents that communicate with humans on the basis of visual input and can complete sequences of actions in their environments. Our members work mainly on four research themes that correspond to basic human abilities: Vision receives visual information from the environment, akin to human perception; Ask and Answer form the basic unit of human communication; and Act maps to human movement and manipulation abilities. These themes cover a wide range of tasks and applications, including Image Captioning, Visual Question Answering, Referring Expression, and Vision-Language Navigation, among others.