Deep Learning: Foundations and Concepts

This book offers a comprehensive introduction to the central ideas that underpin deep learning. It is intended both for newcomers to machine learning and for those already experienced in the field. Covering key concepts relating to contemporary architectures and techniques, this essential book equips readers with a robust foundation for potential future specialization. The field of deep learning is undergoing rapid evolution, and therefore the book focusses on ideas that are likely to endure the test of time. The book is organized into numerous bite-sized chapters, each exploring a distinct topic, and the narrative follows a linear progression, with each chapter building upon content from its predecessors. This structure is well suited to teaching a two-semester undergraduate or postgraduate machine learning course, while remaining equally relevant to those engaged in active research or in self-study.

A full understanding of machine learning requires some mathematical background, and so the book includes a self-contained introduction to probability theory. The focus throughout, however, is on conveying a clear understanding of ideas, with emphasis on the real-world practical value of techniques rather than on abstract theory. Complex concepts are therefore presented from multiple complementary perspectives, including textual descriptions, diagrams, mathematical formulae, and pseudo-code.

“Chris Bishop wrote a terrific textbook on neural networks in 1995 and has a deep knowledge of the field and its core ideas. His many years of experience in explaining neural networks have made him extremely skillful at presenting complicated ideas in the simplest possible way, and it is a delight to see these skills applied to the revolutionary new developments in the field.” -- Geoffrey Hinton

“With the recent explosion of deep learning and AI as a research topic, and the quickly growing importance of AI applications, a modern textbook on the topic was badly needed. The "New Bishop" masterfully fills the gap, covering algorithms for supervised and unsupervised learning, modern deep learning architecture families, as well as how to apply all of this to various application areas.” -- Yann LeCun

“This excellent and very educational book will bring the reader up to date with the main concepts and advances in deep learning, with a solid anchoring in probability. These concepts are powering current industrial AI systems and are likely to form the basis of further advances towards artificial general intelligence.” -- Yoshua Bengio

Author(s): Christopher M. Bishop, Hugh Bishop
Publisher: Springer
Year: 2024

Language: English
Pages: 656

Preface
Goals of the book
Responsible use of technology
Structure of the book
References
Exercises
Mathematical notation
Acknowledgements
Contents
1 The Deep Learning Revolution
1.1. The Impact of Deep Learning
1.1.1 Medical diagnosis
1.1.2 Protein structure
1.1.3 Image synthesis
1.1.4 Large language models
1.2. A Tutorial Example
1.2.1 Synthetic data
1.2.2 Linear models
1.2.3 Error function
1.2.4 Model complexity
1.2.5 Regularization
1.2.6 Model selection
1.3. A Brief History of Machine Learning
1.3.1 Single-layer networks
1.3.2 Backpropagation
1.3.3 Deep networks
2 Probabilities
2.1. The Rules of Probability
2.1.1 A medical screening example
2.1.2 The sum and product rules
2.1.3 Bayes’ theorem
2.1.4 Medical screening revisited
2.1.5 Prior and posterior probabilities
2.1.6 Independent variables
2.2. Probability Densities
2.2.1 Example distributions
2.2.2 Expectations and covariances
2.3. The Gaussian Distribution
2.3.1 Mean and variance
2.3.2 Likelihood function
2.3.3 Bias of maximum likelihood
2.3.4 Linear regression
2.4. Transformation of Densities
2.4.1 Multivariate distributions
2.5. Information Theory
2.5.1 Entropy
2.5.2 Physics perspective
2.5.3 Differential entropy
2.5.4 Maximum entropy
2.5.5 Kullback–Leibler divergence
2.5.6 Conditional entropy
2.5.7 Mutual information
2.6. Bayesian Probabilities
2.6.1 Model parameters
2.6.2 Regularization
2.6.3 Bayesian machine learning
Exercises
3 Standard Distributions
3.1. Discrete Variables
3.1.1 Bernoulli distribution
3.1.2 Binomial distribution
3.1.3 Multinomial distribution
3.2. The Multivariate Gaussian
3.2.1 Geometry of the Gaussian
3.2.2 Moments
3.2.3 Limitations
3.2.4 Conditional distribution
3.2.5 Marginal distribution
3.2.6 Bayes’ theorem
3.2.7 Maximum likelihood
3.2.8 Sequential estimation
3.2.9 Mixtures of Gaussians
3.3. Periodic Variables
3.3.1 Von Mises distribution
3.4. The Exponential Family
3.4.1 Sufficient statistics
3.5. Nonparametric Methods
3.5.1 Histograms
3.5.2 Kernel densities
3.5.3 Nearest-neighbours
Exercises
4 Single-layer Networks: Regression
4.1. Linear Regression
4.1.1 Basis functions
4.1.2 Likelihood function
4.1.3 Maximum likelihood
4.1.4 Geometry of least squares
4.1.5 Sequential learning
4.1.6 Regularized least squares
4.1.7 Multiple outputs
4.2. Decision Theory
4.3. The Bias–Variance Trade-off
Exercises
5 Single-layer Networks: Classification
5.1. Discriminant Functions
5.1.1 Two classes
5.1.2 Multiple classes
5.1.3 1-of-K coding
5.1.4 Least squares for classification
5.2. Decision Theory
5.2.1 Misclassification rate
5.2.2 Expected loss
5.2.3 The reject option
5.2.4 Inference and decision
5.2.5 Classifier accuracy
5.2.6 ROC curve
5.3. Generative Classifiers
5.3.1 Continuous inputs
5.3.2 Maximum likelihood solution
5.3.3 Discrete features
5.3.4 Exponential family
5.4. Discriminative Classifiers
5.4.1 Activation functions
5.4.2 Fixed basis functions
5.4.3 Logistic regression
5.4.4 Multi-class logistic regression
5.4.5 Probit regression
5.4.6 Canonical link functions
Exercises
6 Deep Neural Networks
6.1. Limitations of Fixed Basis Functions
6.1.1 The curse of dimensionality
6.1.2 High-dimensional spaces
6.1.3 Data manifolds
6.1.4 Data-dependent basis functions
6.2. Multilayer Networks
6.2.1 Parameter matrices
6.2.2 Universal approximation
6.2.3 Hidden unit activation functions
6.2.4 Weight-space symmetries
6.3. Deep Networks
6.3.1 Hierarchical representations
6.3.2 Distributed representations
6.3.3 Representation learning
6.3.4 Transfer learning
6.3.5 Contrastive learning
6.3.6 General network architectures
6.3.7 Tensors
6.4. Error Functions
6.4.1 Regression
6.4.2 Binary classification
6.4.3 Multiclass classification
6.5. Mixture Density Networks
6.5.1 Robot kinematics example
6.5.2 Conditional mixture distributions
6.5.3 Gradient optimization
6.5.4 Predictive distribution
Exercises
7 Gradient Descent
7.1. Error Surfaces
7.1.1 Local quadratic approximation
7.2. Gradient Descent Optimization
7.2.1 Use of gradient information
7.2.2 Batch gradient descent
7.2.3 Stochastic gradient descent
7.2.4 Mini-batches
7.2.5 Parameter initialization
7.3. Convergence
7.3.1 Momentum
7.3.2 Learning rate schedule
7.3.3 RMSProp and Adam
7.4. Normalization
7.4.1 Data normalization
7.4.2 Batch normalization
7.4.3 Layer normalization
Exercises
8 Backpropagation
8.1. Evaluation of Gradients
8.1.1 Single-layer networks
8.1.2 General feed-forward networks
8.1.3 A simple example
8.1.4 Numerical differentiation
8.1.5 The Jacobian matrix
8.1.6 The Hessian matrix
8.2. Automatic Differentiation
8.2.1 Forward-mode automatic differentiation
8.2.2 Reverse-mode automatic differentiation
Exercises
9 Regularization
9.1. Inductive Bias
9.1.1 Inverse problems
9.1.2 No free lunch theorem
9.1.3 Symmetry and invariance
9.1.4 Equivariance
9.2. Weight Decay
9.2.1 Consistent regularizers
9.2.2 Generalized weight decay
9.3. Learning Curves
9.3.1 Early stopping
9.3.2 Double descent
9.4. Parameter Sharing
9.4.1 Soft weight sharing
9.5. Residual Connections
9.6. Model Averaging
9.6.1 Dropout
Exercises
10 Convolutional Networks
10.1. Computer Vision
10.1.1 Image data
10.2. Convolutional Filters
10.2.1 Feature detectors
10.2.2 Translation equivariance
10.2.3 Padding
10.2.4 Strided convolutions
10.2.5 Multi-dimensional convolutions
10.2.6 Pooling
10.2.7 Multilayer convolutions
10.2.8 Example network architectures
10.3. Visualizing Trained CNNs
10.3.1 Visual cortex
10.3.2 Visualizing trained filters
10.3.3 Saliency maps
10.3.4 Adversarial attacks
10.3.5 Synthetic images
10.4. Object Detection
10.4.1 Bounding boxes
10.4.2 Intersection-over-union
10.4.3 Sliding windows
10.4.4 Detection across scales
10.4.5 Non-max suppression
10.4.6 Fast region CNNs
10.5. Image Segmentation
10.5.1 Convolutional segmentation
10.5.2 Up-sampling
10.5.3 Fully convolutional networks
10.5.4 The U-net architecture
10.6. Style Transfer
Exercises
11 Structured Distributions
11.1. Graphical Models
11.1.1 Directed graphs
11.1.2 Factorization
11.1.3 Discrete variables
11.1.4 Gaussian variables
11.1.5 Binary classifier
11.1.6 Parameters and observations
11.1.7 Bayes’ theorem
11.2. Conditional Independence
11.2.1 Three example graphs
11.2.2 Explaining away
11.2.3 D-separation
11.2.4 Naive Bayes
11.2.5 Generative models
11.2.6 Markov blanket
11.2.7 Graphs as filters
11.3. Sequence Models
11.3.1 Hidden variables
Exercises
12 Transformers
12.1. Attention
12.1.1 Transformer processing
12.1.2 Attention coefficients
12.1.3 Self-attention
12.1.4 Network parameters
12.1.5 Scaled self-attention
12.1.6 Multi-head attention
12.1.7 Transformer layers
12.1.8 Computational complexity
12.1.9 Positional encoding
12.2. Natural Language
12.2.1 Word embedding
12.2.2 Tokenization
12.2.3 Bag of words
12.2.4 Autoregressive models
12.2.5 Recurrent neural networks
12.2.6 Backpropagation through time
12.3. Transformer Language Models
12.3.1 Decoder transformers
12.3.2 Sampling strategies
12.3.3 Encoder transformers
12.3.4 Sequence-to-sequence transformers
12.3.5 Large language models
12.4. Multimodal Transformers
12.4.1 Vision transformers
12.4.2 Generative image transformers
12.4.3 Audio data
12.4.4 Text-to-speech
12.4.5 Vision and language transformers
Exercises
13 Graph Neural Networks
13.1. Machine Learning on Graphs
13.1.1 Graph properties
13.1.2 Adjacency matrix
13.1.3 Permutation equivariance
13.2. Neural Message-Passing
13.2.1 Convolutional filters
13.2.2 Graph convolutional networks
13.2.3 Aggregation operators
13.2.4 Update operators
13.2.5 Node classification
13.2.6 Edge classification
13.2.7 Graph classification
13.3. General Graph Networks
13.3.1 Graph attention networks
13.3.2 Edge embeddings
13.3.3 Graph embeddings
13.3.4 Over-smoothing
13.3.5 Regularization
13.3.6 Geometric deep learning
Exercises
14 Sampling
14.1. Basic Sampling Algorithms
14.1.1 Expectations
14.1.2 Standard distributions
14.1.3 Rejection sampling
14.1.4 Adaptive rejection sampling
14.1.5 Importance sampling
14.1.6 Sampling-importance-resampling
14.2. Markov Chain Monte Carlo
14.2.1 The Metropolis algorithm
14.2.2 Markov chains
14.2.3 The Metropolis–Hastings algorithm
14.2.4 Gibbs sampling
14.2.5 Ancestral sampling
14.3. Langevin Sampling
14.3.1 Energy-based models
14.3.2 Maximizing the likelihood
14.3.3 Langevin dynamics
Exercises
15 Discrete Latent Variables
15.1. K-means Clustering
15.1.1 Image segmentation
15.2. Mixtures of Gaussians
15.2.1 Likelihood function
15.2.2 Maximum likelihood
15.3. Expectation–Maximization Algorithm
15.3.1 Gaussian mixtures
15.3.2 Relation to K-means
15.3.3 Mixtures of Bernoulli distributions
15.4. Evidence Lower Bound
15.4.1 EM revisited
15.4.2 Independent and identically distributed data
15.4.3 Parameter priors
15.4.4 Generalized EM
15.4.5 Sequential EM
Exercises
16 Continuous Latent Variables
16.1. Principal Component Analysis
16.1.1 Maximum variance formulation
16.1.2 Minimum-error formulation
16.1.3 Data compression
16.1.4 Data whitening
16.1.5 High-dimensional data
16.2. Probabilistic Latent Variables
16.2.1 Generative model
16.2.2 Likelihood function
16.2.3 Maximum likelihood
16.2.4 Factor analysis
16.2.5 Independent component analysis
16.2.6 Kalman filters
16.3. Evidence Lower Bound
16.3.1 Expectation maximization
16.3.2 EM for PCA
16.3.3 EM for factor analysis
16.4. Nonlinear Latent Variable Models
16.4.1 Nonlinear manifolds
16.4.2 Likelihood function
16.4.3 Discrete data
16.4.4 Four approaches to generative modelling
Exercises
17 Generative Adversarial Networks
17.1. Adversarial Training
17.1.1 Loss function
17.1.2 GAN training in practice
17.2. Image GANs
17.2.1 CycleGAN
Exercises
18 Normalizing Flows
18.1. Coupling Flows
18.2. Autoregressive Flows
18.3. Continuous Flows
18.3.1 Neural differential equations
18.3.2 Neural ODE backpropagation
18.3.3 Neural ODE flows
Exercises
19 Autoencoders
19.1. Deterministic Autoencoders
19.1.1 Linear autoencoders
19.1.2 Deep autoencoders
19.1.3 Sparse autoencoders
19.1.4 Denoising autoencoders
19.1.5 Masked autoencoders
19.2. Variational Autoencoders
19.2.1 Amortized inference
19.2.2 The reparameterization trick
Exercises
20 Diffusion Models
20.1. Forward Encoder
20.1.1 Diffusion kernel
20.1.2 Conditional distribution
20.2. Reverse Decoder
20.2.1 Training the decoder
20.2.2 Evidence lower bound
20.2.3 Rewriting the ELBO
20.2.4 Predicting the noise
20.2.5 Generating new samples
20.3. Score Matching
20.3.1 Score loss function
20.3.2 Modified score loss
20.3.3 Noise variance
20.3.4 Stochastic differential equations
20.4. Guided Diffusion
20.4.1 Classifier guidance
20.4.2 Classifier-free guidance
Exercises
Appendix A. Linear Algebra
Appendix B. Calculus of Variations
Appendix C. Lagrange Multipliers
Bibliography
Index