Welcome to V3ALab (Vision, Ask, Answer, Act)! Artificial Intelligence (AI) already powers many tasks in our everyday lives, but there is still a long way to go before we reach human-level AI. One of the biggest obstacles is enabling humans to communicate with AI effectively so as to elicit an appropriate response, whether a textual answer or an action. To this end, V3ALab aims to develop AI agents that communicate with humans on the basis of visual input and can complete sequences of actions in their environments. Our members work on four research themes that correspond to basic human abilities: vision receives visual information from the environment, akin to human perception; ask and answer form the basic unit of human communication; and act maps to human movement and manipulation. These themes cover a wide range of tasks and applications, including Image Captioning, Visual Question Answering, Referring Expression, and Vision-and-Language Navigation.



Two papers are accepted by NeurIPS 2021!

  • Debiased Visual Question Answering
  • Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision

We are hosting the 2nd REVERIE Challenge at an ICCV 2021 workshop!

🌟 More details here

One paper is accepted by ICCV 2021!

  • Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation link

Two papers are accepted by ACM MM 2021!


Three papers are accepted by IJCAI 2021!


We will host the 1st Vision-and-Language Navigation Tutorial at CVPR 2021.

  • More details here

Six papers are accepted by CVPR 2021!

  • A Recurrent Vision-and-Language BERT for Navigation
  • Jo-SRC: A Contrastive Approach for Combating Noisy Labels
  • Sketch, Ground, and Refine: Top-Down Dense Video Captioning
  • Non-Salient Region Object Mining for Weakly Supervised Semantic Segmentation
  • Towards Accurate Text-based Image Captioning with Content Diversity Exploration
  • Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression

One paper is accepted by TMM!


Three papers are accepted by AAAI 2021!

  • How to Train Your Agent to Read and Write?
  • Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps
  • Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Qi Wu will serve as an SPC of IJCAI 2021!


One paper is accepted by NeurIPS!

  • Language and Visual Entity Relationship Graph for Agent Navigation

One long paper is accepted to the EMNLP conference proceedings!

  • Sub-Instruction Aware Vision-and-Language Navigation

One paper is accepted by IEEE TPAMI!

  • Visual Grounding via Accumulated Attention

Qi Wu will serve as an SPC (Area Chair) of AAAI 2021!


Four papers are accepted by ACM MM 2020!

  • Cascade Reasoning Network for Text-based Visual Question Answering
  • Visual-Semantic Graph Matching for Visual Grounding
  • Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge
  • Data-driven Meta-set Based Fine-Grained Visual Classification

Four papers are accepted by ECCV 2020!

  • Soft Expert Reward Learning for Vision-and-Language Navigation
  • Object-and-Action Aware Model for Visual Language Navigation
  • Length Controllable Image Captioning
  • Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering

We won first place in the Medical VQA Challenge 2020!

  • See the leaderboard here

We won first place in the TextVQA Challenge 2020!

  • See the leaderboard here