Selected Publications

A Recurrent Vision-and-Language BERT for Navigation
Jo-SRC: A Contrastive Approach for Combating Noisy Labels
Non-Salient Region Object Mining for Weakly Supervised Semantic Segmentation
Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression
Sketch, Ground, and Refine: Top-Down Dense Video Captioning
Towards Accurate Text-based Image Captioning with Content Diversity Exploration
How to Train Your Agent to Read and Write?
Optimistic Agent: Accurate Graph-Based Value Estimation for More Successful Visual Navigation
Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps
Language and Visual Entity Relationship Graph for Agent Navigation
Learning Dual Encoding Model for Adaptive Visual Understanding in Visual Dialogue
Object-and-Action Aware Model for Visual Language Navigation
Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-modal Retrieval
REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments
Soft Expert Reward Learning for Vision-and-Language Navigation
DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue
Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension
Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
Length Controllable Image Captioning
Modular Graph Attention Network for Complex Visual Relational Reasoning
Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering
Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs
Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering
DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue
Sub-Instruction Aware Vision-and-Language Navigation
Attend and Imagine: Multi-label Image Classification with Visual Attention and Recurrent Neural Networks
Cascade Reasoning Network for Text-based Visual Question Answering
FVQA: Fact-based visual question answering
Medical Data Inquiry Using a Question Answering Model
Visual-Semantic Graph Matching for Visual Grounding
Visual Grounding via Accumulated Attention
Scripted Video Generation with a Bottom-up Generative Adversarial Network
Data-driven Meta-set Based Fine-Grained Visual Classification
Gold Seeker: Information Gain from Policy Distributions for Goal-oriented Vision-and-Language Reasoning
Image and Sentence Matching via Semantic Concepts and Order Learning
Intelligent Home 3D: Automatic 3D-House Design from Linguistic Descriptions Only
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Referring Expression Comprehension: A Survey of Methods and Datasets
Overcoming Language Priors in VQA via Decomposed Linguistic Representations
Medical image classification using synergic deep learning
Heritage Image Annotation via Collective Knowledge
Multi-Label Image Classification with Regional Latent Semantic Dependencies
Watch, Reason and Code: Learning to Represent Videos Using Program
Mind Your Neighbours: Image Annotation with Metadata Neighbourhood Graph Co-Attention Networks
Image Captioning and Visual Question Answering Based on Attributes and Their Related External Knowledge
Visual Question Answering: A Survey of Models and Datasets
Visual Question Answering: A Tutorial
Skin Lesion Classification in Dermoscopy Images Using Synergic Deep Learning
Visual Question Answering with Memory-Augmented Networks
Asking the Difficult Questions: Goal-Oriented Visual Question Generation via Intermediate Rewards
Learning Semantic Concepts and Order for Image and Sentence Matching
Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Visual Grounding via Accumulated Attention
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
HCVRD: a benchmark for large-scale Human-Centered Visual Relationship Detection
Kill Two Birds With One Stone: Weakly-Supervised Neural Network for Image Annotation and Tag Refinement
Explicit Knowledge-based Reasoning for Visual Question Answering
The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions
Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
Beyond Photo-Domain Object Recognition: Benchmarks for the Cross-Depiction Problem
Learning Graphs to Model Visual Objects across Different Depictive Styles
Modelling Visual Objects Invariant to Depictive Style
Learning Graphs to Model Visual Objects across Different Depictive Styles