Neural networks were developed to simulate the human nervous system for machine learning tasks by treating the computational units in a learning model in a manner similar to human neurons. The grand vision of neural networks is to create artificial intelligence by building machines whose architecture simulates the computations in the human nervous system. Although the biological model of neural networks is an exciting one and evokes comparisons with science fiction, neural networks have a much simpler and more mundane mathematical basis than a complex biological system. The neural network abstraction can be viewed as a modular approach to enabling learning algorithms that are based on continuous optimization on a computational graph of mathematical dependencies between the input and output. These ideas are strikingly similar to classical optimization methods in control theory, which historically preceded the development of neural network algorithms.
Neural networks were developed soon after the advent of computers in the fifties and sixties. Rosenblatt’s perceptron algorithm was seen as a fundamental cornerstone of neural networks, which caused an initial period of euphoria; it was soon followed by disappointment as the initial successes were somewhat limited. Eventually, at the turn of the century, greater data availability and increasing computational power led to increased successes of neural networks, and this area was reborn under the new label of “Deep Learning.” Although we are still far from the day when Artificial Intelligence (AI) is close to human performance, there are specific domains like image recognition, self-driving cars, and game playing, where AI has matched or exceeded human performance. It is also hard to predict what AI might be able to do in the future. For example, few computer vision experts would have thought two decades ago that any automated system could ever perform an intuitive task like categorizing an image more accurately than a human. The large amounts of data available in recent years together with increased computational power have enabled experimentation with more sophisticated and deep neural architectures than was previously possible. The resulting success has changed the broader perception of the potential of Deep Learning. This book discusses neural networks from this modern perspective.
The chapters of the book are organized as follows:
1. The basics of neural networks: Chapters 1, 2, and 3 discuss the basics of neural network design and the backpropagation algorithm. Many traditional machine learning models can be understood as special cases of neural learning. Understanding the relationship between traditional machine learning and neural networks is the first step to understanding the latter. The simulation of various machine learning models with neural networks is provided in Chapter 3. This will give the analyst a feel for how neural networks push the envelope of traditional machine learning algorithms.
2. Fundamentals of neural networks: Although Chapters 1, 2, and 3 provide an overview of the training methods for neural networks, a more detailed understanding of the training challenges is provided in Chapters 4 and 5. Chapters 6 and 7 present radial-basis function (RBF) networks and restricted Boltzmann machines.
3. Advanced topics in neural networks: Much of the recent success of deep learning is a result of the specialized architectures for various domains, such as recurrent neural networks and convolutional neural networks. Chapters 8 and 9 discuss recurrent and convolutional neural networks. Graph neural networks are discussed in Chapter 10. Several advanced topics like deep reinforcement learning, attention mechanisms, neural Turing machines, and generative adversarial networks are discussed in Chapters 11 and 12.
Author(s): Charu C. Aggarwal
Edition: 2
Publisher: Springer International Publishing
Year: 2023
Language: English
Pages: 541
1 An Introduction to Neural Networks
1.1 Introduction
1.2 Single Computational Layer: The Perceptron
1.2.1 Use of Bias
1.2.2 What Objective Function Is the Perceptron Optimizing?
1.3 The Base Components of Neural Architectures
1.3.1 Choice of Activation Function
1.3.2 Softmax Activation Function
1.3.3 Common Loss Functions
1.4 Multilayer Neural Networks
1.4.1 The Multilayer Network as a Computational Graph
1.5 The Importance of Nonlinearity
1.5.1 Nonlinear Activations in Action
1.6 Advanced Architectures and Structured Data
1.7 Two Notable Benchmarks
1.7.1 The MNIST Database of Handwritten Digits
1.7.2 The ImageNet Database
1.8 Summary
1.9 Bibliographic Notes and Software Resources
1.10 Exercises
2 The Backpropagation Algorithm
2.1 Introduction
2.2 The Computational Graph Abstraction
2.2.1 Computational Graphs Create Complex Functions
2.3 Backpropagation in Computational Graphs
2.3.1 Computing Node-to-Node Derivatives with the Chain Rule
2.3.2 Dynamic Programming for Computing Node-to-Node Derivatives
2.3.3 Converting Node-to-Node Derivatives into Loss-to-Weight Derivatives
2.4 Backpropagation in Neural Networks
2.4.1 Some Useful Derivatives of Activation Functions
2.4.2 Examples of Updates for Various Activations
2.5 The Vector-Centric View of Backpropagation
2.5.1 Derivatives with Respect to Vectors
2.5.2 Vector-Centric Chain Rule
2.5.3 A Decoupled View of Vector-Centric Backpropagation
2.5.4 Vector-Centric Backpropagation with Non-Layered Architectures
2.6 The Not-So-Unimportant Details
2.6.1 Mini-Batch Stochastic Gradient Descent
2.6.2 Learning Rate Decay
2.6.3 Checking the Correctness of Gradient Computation
2.6.4 Regularization
2.6.5 Loss Functions on Hidden Nodes
2.6.6 Backpropagation Tricks for Handling Shared Weights
2.7 Tuning and Preprocessing
2.7.1 Tuning Hyperparameters
2.7.2 Feature Preprocessing
2.7.3 Initialization
2.8 Backpropagation Is Interpretable
2.9 Summary
2.10 Bibliographic Notes and Software Resources
2.11 Exercises
3 Machine Learning with Shallow Neural Networks
3.1 Introduction
3.2 Neural Architectures for Binary Classification Models
3.2.1 Revisiting the Perceptron
3.2.2 Least-Squares Regression
3.2.2.1 Widrow-Hoff Learning
3.2.2.2 Closed Form Solutions
3.2.3 Support Vector Machines
3.2.4 Logistic Regression
3.2.5 Comparison of Different Models
3.3 Neural Architectures for Multiclass Models
3.3.1 Multiclass Perceptron
3.3.2 Weston-Watkins SVM
3.3.3 Multinomial Logistic Regression (Softmax Classifier)
3.4 Unsupervised Learning with Autoencoders
3.4.1 Linear Autoencoder with a Single Hidden Layer
3.4.1.1 Connections with Singular Value Decomposition
3.4.1.2 Sharing Weights in the Encoder and Decoder
3.4.2 Nonlinear Activation Functions and Depth
3.4.3 Application to Visualization
3.4.4 Application to Outlier Detection
3.4.5 Application to Multimodal Embeddings
3.4.6 Benefits of Autoencoders
3.5 Recommender Systems
3.6 Text Embedding with Word2vec
3.6.1 Neural Embedding with Continuous Bag of Words
3.6.2 Neural Embedding with Skip-Gram Model
3.6.3 Word2vec (SGNS) is Logistic Matrix Factorization
3.7 Simple Neural Architectures for Graph Embeddings
3.7.1 Handling Arbitrary Edge Counts
3.7.2 Beyond One-Hop Structural Models
3.7.3 Multinomial Model
3.8 Summary
3.9 Bibliographic Notes and Software Resources
3.10 Exercises
4 Deep Learning: Principles and Training Algorithms
4.1 Introduction
4.2 Why Is Depth Beneficial?
4.2.1 Hierarchical Feature Engineering: How Depth Reveals Rich Structure
4.3 Why Is Training Deep Networks Hard?
4.3.1 Geometric Understanding of the Effect of Gradient Ratios
4.3.2 The Vanishing and Exploding Gradient Problems
4.3.3 Cliffs and Valleys
4.3.4 Convergence Problems with Depth
4.3.5 Local Minima
4.4 Depth-Friendly Neural Architectures
4.4.1 Activation Function Choice
4.4.2 Dying Neurons and “Brain Damage”
4.4.2.1 Leaky ReLU
4.4.2.2 Maxout Networks
4.4.3 Using Skip Connections
4.5 Depth-Friendly Gradient-Descent Strategies
4.5.1 Importance of Preprocessing and Initialization
4.5.2 Momentum-Based Learning
4.5.3 Nesterov Momentum
4.5.4 Parameter-Specific Learning Rates
4.5.4.1 AdaGrad
4.5.4.2 RMSProp
4.5.4.3 AdaDelta
4.5.5 Combining Parameter-Specific Learning and Momentum
4.5.5.1 RMSProp with Nesterov Momentum
4.5.5.2 Adam
4.5.6 Gradient Clipping
4.5.7 Polyak Averaging
4.6 Second-Order Derivatives: The Newton Method
4.6.1 Example: Newton Method in the Quadratic Bowl
4.6.2 Example: Newton Method in a Non-Quadratic Function
4.6.3 The Saddle-Point Problem with Second-Order Methods
4.7 Fast Approximations of Newton Method
4.7.1 Conjugate Gradient Method
4.7.2 Quasi-Newton Methods and BFGS
4.8 Batch Normalization
4.9 Practical Tricks for Acceleration and Compression
4.9.1 GPU Acceleration
4.9.2 Parallel and Distributed Implementations
4.9.3 Algorithmic Tricks for Model Compression
4.10 Summary
4.11 Bibliographic Notes and Software Resources
4.12 Exercises
5 Teaching Deep Learners to Generalize
5.1 Introduction
5.1.1 Example: Linear Regression
5.1.2 Example: Polynomial Regression
5.2 The Bias-Variance Trade-Off
5.3 Generalization Issues in Model Tuning and Evaluation
5.3.1 Evaluating with Hold-Out and Cross-Validation
5.3.2 Issues with Training at Scale
5.3.3 How to Detect Need to Collect More Data
5.4 Penalty-Based Regularization
5.4.1 Connections with Noise Injection
5.4.2 L1-Regularization
5.4.3 L1- or L2-Regularization?
5.4.4 Penalizing Hidden Units: Learning Sparse Representations
5.5 Ensemble Methods
5.5.1 Bagging and Subsampling
5.5.2 Parametric Model Selection and Averaging
5.5.3 Randomized Connection Dropping
5.5.4 Dropout
5.5.5 Data Perturbation Ensembles
5.6 Early Stopping
5.6.1 Understanding Early Stopping from the Variance Perspective
5.7 Unsupervised Pretraining
5.7.1 Variations of Unsupervised Pretraining
5.7.2 What About Supervised Pretraining?
5.8 Continuation and Curriculum Learning
5.9 Parameter Sharing
5.10 Regularization in Unsupervised Applications
5.10.1 When the Hidden Layer is Broader than the Input Layer
5.10.1.1 Sparse Feature Learning
5.10.2 Noise Injection: De-noising Autoencoders
5.10.3 Gradient-Based Penalization: Contractive Autoencoders
5.10.4 Hidden Probabilistic Structure: Variational Autoencoders
5.10.4.1 Reconstruction and Generative Sampling
5.10.4.2 Conditional Variational Autoencoders
5.10.4.3 Relationship with Generative Adversarial Networks
5.11 Summary
5.12 Bibliographic Notes and Software Resources
5.13 Exercises
6 Radial Basis Function Networks
6.1 Introduction
6.2 Training an RBF Network
6.2.1 Training the Hidden Layer
6.2.2 Training the Output Layer
6.2.3 Iterative Construction of Hidden Layer
6.2.4 Fully Supervised Learning of Hidden Layer
6.3 Variations and Special Cases of RBF Networks
6.3.1 Classification with Perceptron Criterion
6.3.2 Classification with Hinge Loss
6.3.3 Example of Linear Separability Promoted by RBF
6.3.4 Application to Interpolation
6.4 Relationship with Kernel Methods
6.4.1 Kernel Regression Is a Special Case of RBF Networks
6.4.2 Kernel SVM Is a Special Case of RBF Networks
6.5 Summary
6.6 Bibliographic Notes and Software Resources
6.7 Exercises
7 Restricted Boltzmann Machines
7.1 Introduction
7.2 Hopfield Networks
7.2.1 Training a Hopfield Network
7.2.2 Building a Toy Recommender and Its Limitations
7.2.3 Increasing the Expressive Power of the Hopfield Network
7.3 The Boltzmann Machine
7.3.1 How a Boltzmann Machine Generates Data
7.3.2 Learning the Weights of a Boltzmann Machine
7.4 Restricted Boltzmann Machines
7.4.1 Training the RBM
7.4.2 Contrastive Divergence Algorithm
7.5 Applications of Restricted Boltzmann Machines
7.5.1 Dimensionality Reduction and Data Reconstruction
7.5.2 RBMs for Collaborative Filtering
7.5.3 Using RBMs for Classification
7.5.4 Topic Models with RBMs
7.5.5 RBMs for Machine Learning with Multimodal Data
7.6 Using RBMs beyond Binary Data Types
7.7 Stacking Restricted Boltzmann Machines
7.7.1 Unsupervised Learning
7.7.2 Supervised Learning
7.7.3 Deep Boltzmann Machines and Deep Belief Networks
7.8 Summary
7.9 Bibliographic Notes and Software Resources
7.10 Exercises
8 Recurrent Neural Networks
8.1 Introduction
8.2 The Architecture of Recurrent Neural Networks
8.2.1 Language Modeling Example of RNN
8.2.2 Backpropagation Through Time
8.2.3 Bidirectional Recurrent Networks
8.2.4 Multilayer Recurrent Networks
8.3 The Challenges of Training Recurrent Networks
8.3.1 Layer Normalization
8.4 Echo-State Networks
8.5 Long Short-Term Memory (LSTM)
8.6 Gated Recurrent Units (GRUs)
8.7 Applications of Recurrent Neural Networks
8.7.1 Contextualized Word Embeddings with ELMo
8.7.2 Application to Automatic Image Captioning
8.7.3 Sequence-to-Sequence Learning and Machine Translation
8.7.4 Application to Sentence-Level Classification
8.7.5 Token-Level Classification with Linguistic Features
8.7.6 Time-Series Forecasting and Prediction
8.7.7 Temporal Recommender Systems
8.7.8 Secondary Protein Structure Prediction
8.7.9 End-to-End Speech Recognition
8.7.10 Handwriting Recognition
8.8 Summary
8.9 Bibliographic Notes and Software Resources
8.10 Exercises
9 Convolutional Neural Networks
9.1 Introduction
9.1.1 Historical Perspective and Biological Inspiration
9.1.2 Broader Observations about Convolutional Neural Networks
9.2 The Basic Structure of a Convolutional Network
9.2.1 Padding
9.2.2 Strides
9.2.3 The ReLU Layer
9.2.4 Pooling
9.2.5 Fully Connected Layers
9.2.6 The Interleaving between Layers
9.2.7 Hierarchical Feature Engineering
9.3 Training a Convolutional Network
9.3.1 Backpropagating Through Convolutions
9.3.2 Backpropagation as Convolution with Inverted/Transposed Filter
9.3.3 Convolution/Backpropagation as Matrix Multiplications
9.3.4 Data Augmentation
9.4 Case Studies of Convolutional Architectures
9.4.1 AlexNet
9.4.2 ZFNet
9.4.3 VGG
9.4.4 GoogLeNet
9.4.5 ResNet
9.4.6 Squeeze-and-Excitation Networks (SENets)
9.4.7 The Effects of Depth
9.4.8 Pretrained Models
9.5 Visualization and Unsupervised Learning
9.5.1 Visualizing the Features of a Trained Network
9.5.2 Convolutional Autoencoders
9.6 Applications of Convolutional Networks
9.6.1 Content-Based Image Retrieval
9.6.2 Object Localization
9.6.3 Object Detection
9.6.4 Natural Language and Sequence Learning with TextCNN
9.6.5 Video Classification
9.7 Summary
9.8 Bibliographic Notes and Software Resources
9.9 Exercises
10 Graph Neural Networks
10.1 Introduction
10.2 Node Embeddings with Conventional Architectures
10.2.1 Adjacency Matrix Representation and Feature Engineering
10.3 Graph Neural Networks: The General Framework
10.3.1 The Neighborhood Function
10.3.2 Graph Convolution Function
10.3.3 GraphSAGE
10.3.4 Handling Edge Weights
10.3.5 Handling New Vertices
10.3.6 Handling Relational Networks
10.3.7 Directed Graphs
10.3.8 Gated Graph Neural Networks
10.3.9 Comparison with Image Convolutional Networks
10.4 Backpropagation in Graph Neural Networks
10.5 Beyond Nodes: Generating Graph-Level Models
10.6 Applications of Graph Neural Networks
10.7 Summary
10.8 Bibliographic Notes and Software Resources
10.9 Exercises
11 Deep Reinforcement Learning
11.1 Introduction
11.2 Stateless Algorithms: Multi-Armed Bandits
11.3 The Basic Framework of Reinforcement Learning
11.4 Monte Carlo Sampling
11.4.1 Monte Carlo Sampling Algorithm
11.4.2 Monte Carlo Rollouts with Function Approximators
11.5 Bootstrapping for Value Function Learning
11.5.1 Q-Learning
11.5.2 Deep Learning Models as Function Approximators
11.5.3 Example: Neural Network Specifics for Video Game Setting
11.5.4 On-Policy versus Off-Policy Methods: SARSA
11.5.5 Modeling States versus State-Action Pairs
11.6 Policy Gradient Methods
11.6.1 Finite Difference Methods
11.6.2 Likelihood Ratio Methods
11.6.3 Actor-Critic Methods
11.6.4 Continuous Action Spaces
11.7 Monte Carlo Tree Search
11.8 Case Studies
11.8.1 AlphaGo and AlphaZero for Go and Chess
11.8.2 Self-Learning Robots
11.8.2.1 Deep Learning of Locomotion Skills
11.8.2.2 Deep Learning of Visuomotor Skills
11.8.3 Building Conversational Systems: Deep Learning for Chatbots
11.8.4 Self-Driving Cars
11.8.5 Neural Architecture Search with Reinforcement Learning
11.9 Practical Challenges Associated with Safety
11.10 Summary
11.11 Bibliographic Notes and Software Resources
11.12 Exercises
12 Advanced Topics in Deep Learning
12.1 Introduction
12.2 Attention Mechanisms
12.2.1 Recurrent Models of Visual Attention
12.2.2 Attention Mechanisms for Image Captioning
12.2.3 Soft Image Attention with Spatial Transformer
12.2.4 Attention Mechanisms for Machine Translation
12.2.5 Transformer Networks
12.2.5.1 How Self Attention Helps
12.2.5.2 The Self-Attention Module
12.2.5.3 Incorporating Positional Information
12.2.5.4 The Sequence-to-Sequence Transformer
12.2.5.5 Multihead Attention
12.2.6 Transformer-Based Pre-trained Language Models
12.2.6.1 GPT-n
12.2.6.2 BERT
12.2.6.3 T5
12.2.7 Vision Transformer (ViT)
12.2.8 Attention Mechanisms in Graphs
12.3 Neural Turing Machines
12.4 Adversarial Deep Learning
12.5 Generative Adversarial Networks (GANs)
12.5.1 Training a Generative Adversarial Network
12.5.2 Comparison with Variational Autoencoder
12.5.3 Using GANs for Generating Image Data
12.5.4 Conditional Generative Adversarial Networks
12.6 Competitive Learning
12.6.1 Vector Quantization
12.6.2 Kohonen Self-Organizing Map
12.7 Limitations of Neural Networks
12.7.1 An Aspirational Goal: Few Shot Learning
12.7.2 An Aspirational Goal: Energy-Efficient Learning
12.8 Summary
12.9 Bibliographic Notes and Software Resources
12.10 Exercises
Bibliography
Index