Welcome to V3ALab: Vision, Ask, Answer, Act! Artificial Intelligence (AI) already powers many aspects of our daily lives, yet achieving human-level intelligence remains a major challenge, particularly in how effectively humans and AI can communicate to elicit meaningful responses, whether textual answers or physical actions. At V3ALab, we aim to develop intelligent agents that see, communicate, and act: interpreting visual inputs, engaging in natural ask-answer interactions, and executing purposeful actions in real or simulated environments. Our research is organised around four human-inspired abilities (vision for perception, ask and answer for communication, and act for movement and manipulation) and spans a wide range of tasks, including Image Captioning, Visual Question Answering, Referring Expressions, and Vision-Language Navigation. Through these efforts, we strive to advance embodied and multimodal AI toward systems that truly understand and collaborate with humans.

News

2025-11: Two papers accepted by AAAI 2026.
2025-10: We were awarded an ARC Discovery Project.
2025-09: One paper accepted by NeurIPS 2025.
2025-07: Two papers accepted by ICCV 2025.
2025-06: We won the Social Mobile Manipulation challenge at CVPR 2025.
2025-06: Two papers accepted by IROS 2025.
2025-05: Our MiniVLN paper was selected as a Best Conference Paper Finalist.
2025-02: Three papers accepted by CVPR 2025.
2025-01: Two papers accepted by ICLR 2025.
2025-01: Three papers accepted by ICRA 2025.