Selected Publications

Debiased Visual Question Answering from Feature and Sample Perspectives
Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision
Neighbor-view Enhanced Model for Vision and Language Navigation
R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks
The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation
CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation
Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention
A Recurrent Vision-and-Language BERT for Navigation
Jo-SRC: A Contrastive Approach for Combating Noisy Labels
Non-Salient Region Object Mining for Weakly Supervised Semantic Segmentation
Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression
Sketch, Ground, and Refine: Top-Down Dense Video Captioning
Towards Accurate Text-based Image Captioning with Content Diversity Exploration
How to Train Your Agent to Read and Write?
Optimistic Agent: Accurate Graph-Based Value Estimation for More Successful Visual Navigation
Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps
Language and Visual Entity Relationship Graph for Agent Navigation
Learning Dual Encoding Model for Adaptive Visual Understanding in Visual Dialogue
Object-and-Action Aware Model for Visual Language Navigation
Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-modal Retrieval
REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments
Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension
DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue
DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue
Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
Length Controllable Image Captioning
Modular Graph Attention Network for Complex Visual Relational Reasoning
Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering
Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering
Soft Expert Reward Learning for Vision-and-Language Navigation
Sub-Instruction Aware Vision-and-Language Navigation
Attend and Imagine: Multi-label Image Classification with Visual Attention and Recurrent Neural Networks
Cascade Reasoning Network for Text-based Visual Question Answering
FVQA: Fact-based visual question answering
Medical Data Inquiry Using a Question Answering Model
Visual-Semantic Graph Matching for Visual Grounding
Visual Grounding via Accumulated Attention
Data-driven Meta-set Based Fine-Grained Visual Classification
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Gold Seeker: Information Gain from Policy Distributions for Goal-oriented Vision-and-Language Reasoning
Image and Sentence Matching via Semantic Concepts and Order Learning
Intelligent Home 3D: Automatic 3D-House Design from Linguistic Descriptions Only
Scripted Video Generation with a Bottom-up Generative Adversarial Network
Referring Expression Comprehension: A Survey of Methods and Datasets
Overcoming Language Priors in VQA via Decomposed Linguistic Representations
Medical image classification using synergic deep learning
Heritage Image Annotation via Collective Knowledge
Multi-Label Image Classification with Regional Latent Semantic Dependencies
Watch, Reason and Code: Learning to Represent Videos Using Program
Mind Your Neighbours: Image Annotation with Metadata Neighbourhood Graph Co-Attention Networks
Image Captioning and Visual Question Answering Based on Attributes and Their Related External Knowledge
Visual Question Answering: A Survey of Models and Datasets
Visual Question Answering: A Tutorial
Skin Lesion Classification in Dermoscopy Images Using Synergic Deep Learning
Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning
Asking the Difficult Questions: Goal-Oriented Visual Question Generation via Intermediate Rewards
Learning Semantic Concepts and Order for Image and Sentence Matching
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Visual Grounding via Accumulated Attention
Visual Question Answering with Memory-Augmented Networks
HCVRD: a benchmark for large-scale Human-Centered Visual Relationship Detection
Kill Two Birds With One Stone: Weakly-Supervised Neural Network for Image Annotation and Tag Refinement
Explicit Knowledge-based Reasoning for Visual Question Answering
The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions
Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
Beyond Photo-Domain Object Recognition: Benchmarks for the Cross-Depiction Problem
Learning Graphs to Model Visual Objects across Different Depictive Styles
Modelling Visual Objects Invariant to Depictive Style