Welcome to V3ALab: Vision, Ask, Answer, Act! Artificial Intelligence (AI) already powers many aspects of our daily lives, yet achieving human-level intelligence remains a major challenge, particularly in how effectively humans and AI can communicate to elicit meaningful responses, whether textual answers or physical actions.

At V3ALab, we aim to develop intelligent agents that see, communicate, and act: interpreting visual inputs, engaging in natural ask–answer interactions, and executing purposeful actions in real or simulated environments. Our research is organised around four human-inspired abilities: vision for perception, asking and answering for communication, and acting for movement and manipulation. It spans a wide range of tasks, including Image Captioning, Visual Question Answering, Referring Expressions, and Vision-and-Language Navigation. Through these efforts, we strive to advance embodied and multimodal AI toward systems that truly understand and collaborate with humans.