This book presents deep learning techniques for video understanding. For deep learning basics, the authors cover machine learning pipelines and notation, along with 2D and 3D convolutional neural networks for spatial and temporal feature learning. For action recognition, they introduce classical frameworks for image classification and then detail both image-based and clip-based 2D/3D CNN approaches. For action detection, they discuss sliding windows, proposal-based detection methods, single-stage and two-stage approaches, and spatial and temporal action localization, followed by an introduction to common datasets. For video captioning, they present language-based models and show how to perform sequence-to-sequence learning. For unsupervised feature learning, they discuss the need to shift from supervised to unsupervised learning and explain how to design better surrogate (pretext) training tasks for learning video representations. Finally, the book covers recent self-supervised learning pipelines such as contrastive learning and masked image/video modeling with transformers, and closes with promising directions intended to promote future research in video understanding with deep learning.
Authors: Zuxuan Wu; Yu-Gang Jiang
Publisher: Springer
Year: 2024
Language: English
Pages: 197
Preface
Contents
1 Overview of Video Understanding
1.1 Video and Video Understanding
1.2 Video Understanding Tasks and Definitions
1.3 The Timeline of Video Understanding Techniques
2 Deep Learning Basics for Video Understanding
2.1 Convolutional Neural Networks (CNNs)
2.1.1 Convolution
2.1.2 Pooling
2.1.3 Classic Convolutional Neural Networks
2.2 Recurrent Neural Networks (RNNs)
2.3 Transformer
2.3.1 Self-attention
2.3.2 Transformer
2.3.3 Vision Transformer
2.3.4 Swin Transformer
2.4 Summary
3 Deep Learning for Action Recognition
3.1 Action Recognition with Convolutional Neural Networks
3.2 Feature Aggregation for Long-Range Temporal Modeling
3.2.1 Bag of Visual Words and its Variants
3.2.2 Temporal Aggregation with Recurrent Neural Networks
3.3 Action Recognition with Transformer Networks
3.3.1 Temporal Modeling Transformer
3.3.2 Vanilla Vision Transformer Variants
3.3.3 Video Transformer with Convolutional Advantages
3.3.4 Lightweight Video Transformer
3.3.5 CLIP for Video Transformer
3.4 Datasets
3.5 Summary
4 Deep Learning for Video Localization
4.1 Action Localization
4.1.1 Introduction to the Action Localization Task
4.1.2 Supervised Action Localization
4.1.2.1 Two-Stage Methods
4.1.2.2 Single-Stage Methods
4.1.3 Weakly Supervised Action Localization
4.1.3.1 Class-Specific Attention Methods
4.1.3.2 Class-Agnostic Attention Methods
4.1.4 Unsupervised Action Localization
4.1.5 Spatial–Temporal Action Localization
4.1.6 Datasets
4.1.7 Summary
4.2 Temporal Video Grounding
4.2.1 Introduction to the Temporal Video Grounding Task
4.2.2 Supervised Temporal Video Grounding Methods
4.2.2.1 Proposal-Based Methods
4.2.2.2 Proposal-Free Methods
4.2.3 Weakly Supervised Temporal Video Grounding Methods
4.2.3.1 Reconstruction-Based Methods
4.2.3.2 Multi-Instance Learning (MIL) Methods
4.2.4 Unsupervised and Zero-shot Temporal Video Grounding Methods
4.2.5 Datasets
4.2.6 Summary
5 Deep Learning for Video Captioning
5.1 Introduction to the Video Captioning Task
5.1.1 Problem Formulation
5.1.2 A Common Encoder–Decoder Framework for Video Captioning
5.2 Video Captioning Methods
5.2.1 Template-Based Language Methods
5.2.2 Sequence Learning Methods
5.2.2.1 Multimodal Fusion-Based Methods
5.2.2.2 Spatial/Temporal Structure-Based Methods
5.2.2.3 Semantic/Syntactic Guidance-Based Methods
5.2.2.4 Other Methods
5.3 Datasets and Measures
5.3.1 Benchmark Datasets
5.3.2 Evaluation Metrics
5.4 Summary
6 Unsupervised Feature Learning for Video Understanding
6.1 Learning with Unlabeled Videos
6.2 Pretext Tasks Based on Self-prediction
6.2.1 Predicting Spatial–Temporal Transformations
6.2.2 Predicting by Generation
6.2.3 Predicting from Cross-modal Signals
6.2.4 Predicting with Multiple Tasks
6.3 Contrastive Self-supervised Learning
6.3.1 Spatial Domain
6.3.2 Temporal Domain
6.3.3 Spatial–Temporal Association Domain
6.3.4 Clustering
6.3.5 Multimodal Self-supervision
6.4 Masked Video Modeling on ViTs
6.4.1 Preliminary
6.4.2 Masked Video Prediction on ViTs
6.5 Summary
7 Efficient Video Understanding
7.1 Design Choices for Compact Neural Networks
7.1.1 3D-CNN-Based Approaches
7.1.2 Channel Separation
7.1.3 Decompose 3D into Spatial and Temporal Learning
7.1.4 Shift-Based Approaches
7.2 Training Strategies for Efficient Video Understanding
7.2.1 Time-Efficient Methods
7.2.1.1 Weight Initialization from Image Models
7.2.1.2 Advanced Training Strategies
7.2.2 Memory-Efficient Methods
7.2.2.1 CLIP Transfer Learning
7.2.2.2 Efficient Gradient Backpropagation
7.2.2.3 Input Pruning
7.3 Dynamic Inference for Video Understanding
7.3.1 Reducing Input Redundancy
7.3.1.1 Temporal Redundancy Reducing
7.3.1.2 Spatial Redundancy Reducing
7.3.1.3 Unified Spatial–Temporal Redundancy Reduction
7.3.1.4 Modality Redundancy Reducing
7.3.2 Adaptive Network Pruning
7.3.2.1 Dynamic Computational Resource Allocation
7.3.2.2 Adaptive Network Quantization
7.4 Summary
8 Conclusion and Future Directions
8.1 Concluding Remarks
8.2 Future Research Directions
Reference
Index