Visual Question Answering: From Theory to Application

Visual Question Answering (VQA) takes a visual input, such as an image or a video, together with a natural language question about that input, and generates a natural language answer as output. It is by nature a multi-disciplinary research problem, involving computer vision (CV), natural language processing (NLP), knowledge representation and reasoning (KR), and related fields.

Further, VQA is an ambitious undertaking: it must overcome the challenges of both general image understanding and question answering, as well as the difficulties of working with large-scale datasets of mixed quality. However, with the advent of deep learning (DL), advanced techniques in both CV and NLP, and the availability of relevant large-scale datasets, VQA has recently made enormous strides, with more systems and promising results emerging.
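As a concrete illustration of the classical formulation surveyed in Part II, the sketch below shows a minimal joint-embedding VQA model under the classification answering policy: an LSTM encodes the question, precomputed image features are projected into the same space, the two embeddings are fused, and a classifier scores a fixed answer vocabulary. All names, dimensions, and the element-wise fusion are illustrative assumptions for exposition, not code from the book.

```python
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    """Minimal joint-embedding VQA sketch (hypothetical example):
    encode the question, project image features, fuse, then classify
    over a fixed answer vocabulary."""

    def __init__(self, vocab_size, num_answers,
                 embed_dim=300, hidden_dim=512, img_feat_dim=2048):
        super().__init__()
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Image features are assumed precomputed (e.g., pooled CNN features).
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Fusion output is classified over the answer vocabulary.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, img_feats, question_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))  # h: (1, B, H)
        q = h[-1]                                           # question embedding (B, H)
        v = torch.relu(self.img_proj(img_feats))            # image embedding (B, H)
        fused = q * v                                       # simple element-wise fusion
        return self.classifier(fused)                       # answer logits (B, num_answers)

# Toy usage with random inputs.
model = JointEmbeddingVQA(vocab_size=10000, num_answers=3000)
img = torch.randn(2, 2048)               # two pooled image feature vectors
qs = torch.randint(1, 10000, (2, 12))    # two tokenized questions of length 12
logits = model(img, qs)                  # shape: (2, 3000)
```

Later chapters of the book refine exactly this pipeline, replacing the simple fusion with bilinear encoding (Sect. 4.4.2), attention mechanisms (Sect. 4.5), memory networks (Sect. 4.6), graph neural networks (Sect. 4.8), and pretrained vision-and-language transformers (Chap. 6).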

This book provides a comprehensive overview of VQA, covering fundamental theories, models, datasets, and promising future directions. Given its scope, it can be used as a textbook on computer vision and natural language processing, especially for researchers and students in the area of visual question answering. It also highlights the key models used in VQA.

Author(s): Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Series: Advances in Computer Vision and Pattern Recognition
Publisher: Springer
Year: 2022

Language: English
Pages: 237
City: Singapore

Preface
Contents
1 Introduction
1.1 Motivation
1.2 Visual Question Answering in AI Tasks
1.3 Categorization of VQA
1.3.1 Classification Based on Data Settings
1.3.2 Classification by Task Settings
1.3.3 Others
1.4 Book Overview
References
Part I Preliminaries
2 Deep Learning Basics
2.1 Neural Networks
2.2 Convolutional Neural Networks
2.3 Recurrent Neural Networks and Variants
2.4 Encoder/Decoder Structure
2.5 Attention Mechanism
2.6 Memory Networks
2.7 Transformer Networks and BERT
2.8 Graph Neural Networks
References
3 Question Answering (QA) Basics
3.1 Rule-Based Methods
3.2 Information Retrieval-Based Methods
3.3 Neural Semantic Parsing for QA
3.4 Knowledge Base for QA
References
Part II Image-Based VQA
4 Classical Visual Question Answering
4.1 Introduction
4.2 Datasets
4.3 Generation Versus Classification: Two Answering Policies
4.4 Joint Embedding Methods
4.4.1 Sequence-to-Sequence Encoder/Decoder Models
4.4.2 Bilinear Encoding for VQA
4.5 Attention Mechanisms
4.5.1 Stacked Attention Networks
4.5.2 Hierarchical Question-Image Co-Attention
4.5.3 Bottom-Up and Top-Down Attention
4.6 Memory Networks for VQA
4.6.1 Improved Dynamic Memory Networks
4.6.2 Memory-Augmented Networks
4.7 Compositional Reasoning for VQA
4.7.1 Neural Modular Networks
4.7.2 Dynamic Neural Module Networks
4.8 Graph Neural Networks for VQA
4.8.1 Graph Convolutional Networks
4.8.2 Graph Attention Networks
4.8.3 Graph Convolutional Networks for VQA
4.8.4 Graph Attention Networks for VQA
References
5 Knowledge-Based VQA
5.1 Introduction
5.2 Datasets
5.3 Knowledge Bases
5.3.1 DBpedia
5.3.2 ConceptNet
5.4 Knowledge Embedding Methods
5.4.1 Word-to-Vector Representation
5.4.2 BERT-Based Representation
5.5 Question-to-Query Translation
5.5.1 Query-Mapping-Based Methods
5.5.2 Learning-Based Methods
5.6 Methods to Query Knowledge Bases
5.6.1 RDF Query
5.6.2 Memory Network Query
References
6 Vision-and-Language Pretraining for VQA
6.1 Introduction
6.2 General Pretraining Models
6.2.1 Embeddings from Language Models
6.2.2 Generative Pretraining Model
6.2.3 Bidirectional Encoder Representations from Transformers
6.3 Commonly Used Methods for Vision-and-Language Pretraining
6.3.1 Single-Stream Methods
6.3.2 Two-Stream Methods
6.4 Finetuning on VQA and Other Downstream Tasks
References
Part III Video-Based VQA
7 Video Representation Learning
7.1 Handcrafted Local Video Descriptors
7.2 Data-Driven Deep Learning Features for Video Representation
7.3 Self-supervised Learning for Video Representation
References
8 Video Question Answering
8.1 Introduction
8.2 Datasets
8.2.1 Multistep Reasoning Dataset
8.2.2 Single-Step Reasoning Dataset
8.3 Traditional Video Spatiotemporal Reasoning Using an Encoder-Decoder Framework
References
9 Advanced Models for Video Question Answering
9.1 Attention on Spatiotemporal Features
9.2 Memory Networks
9.3 Spatiotemporal Graph Neural Networks
References
Part IV Advanced Topics in VQA
10 Embodied VQA
10.1 Introduction
10.2 Simulators, Datasets and Evaluation Criteria
10.2.1 Simulators
10.2.2 Datasets
10.2.3 Evaluations
10.3 Language-Guided Visual Navigation
10.3.1 Vision-and-Language Navigation
10.3.2 Remote Object Localization
10.4 Embodied QA
10.5 Interactive QA
References
11 Medical VQA
11.1 Introduction
11.2 Datasets
11.3 Classical VQA Methods for Medical VQA
11.4 Meta-Learning Methods for Medical VQA
11.5 BERT-Based Methods for Medical VQA
References
12 Text-Based VQA
12.1 Introduction
12.2 Datasets
12.2.1 TextVQA
12.2.2 ST-VQA
12.2.3 OCR-VQA
12.3 OCR Token Representation
12.4 Simple Fusion Models
12.4.1 LoRRA: Look, Read, Reason & Answer
12.5 Transformer-Based Models
12.5.1 Multimodal Multicopy Mesh Model
12.6 Graph-Based Models
12.6.1 Structured Multimodal Attentions for TextVQA
References
13 Visual Question Generation
13.1 Introduction
13.2 VQG as Data Augmentation
13.2.1 Generating Questions from Answers
13.2.2 Generating Questions from Images
13.2.3 Adversarial Learning
13.3 VQG as Visual Understanding
References
14 Visual Dialogue
14.1 Introduction
14.2 Datasets
14.3 Attention Mechanism
14.3.1 Hierarchical Recurrent Encoder with Attention (HREA) and Memory Network (MN)
14.3.2 History-Conditioned Image Attentive Encoder (HCIAE)
14.3.3 Sequential Co-Attention Generative Model (CoAtt)
14.3.4 Synergistic Network
14.4 Visual Coreference Resolution
14.5 Graph-Based Methods
14.5.1 Scene Graph for Visual Representations
14.5.2 GNN for Visual and Dialogue Representations
14.6 Pretrained Models
14.6.1 VD-BERT
14.6.2 Visual-Dialog BERT
References
15 Referring Expression Comprehension
15.1 Introduction
15.2 Datasets
15.3 Two-Stage Models
15.3.1 Joint Embedding
15.3.2 Co-Attention Models
15.3.3 Graph-Based Models
15.4 One-Stage Models
15.5 Reasoning Process Comprehension
References
Part V Summary and Outlook
16 Summary and Outlook
16.1 Summary
16.2 Future Directions
16.2.1 Explainable VQA
16.2.2 Bias Elimination
16.2.3 Additional Settings and Applications
References
Index