This book offers a detailed and up-to-date introduction to machine learning (including deep learning) through the unifying lens of probabilistic modeling and Bayesian decision theory. The book covers mathematical background (including linear algebra and optimization), basic supervised learning (including linear and logistic regression and deep neural networks), as well as more advanced topics (including transfer learning and unsupervised learning). End-of-chapter exercises allow students to apply what they have learned, and an appendix covers notation.
Probabilistic Machine Learning grew out of the author's 2012 book, Machine Learning: A Probabilistic Perspective. More than just a simple update, this is a completely new book that reflects the dramatic developments in the field since 2012, most notably deep learning. In addition, the new book is accompanied by online Python code, using libraries such as scikit-learn, JAX, PyTorch, and TensorFlow, which can be used to reproduce nearly all the figures; this code can be run inside a web browser using cloud-based notebooks, and provides a practical complement to the theoretical topics discussed in the book. This introductory text will be followed by a sequel that covers more advanced topics, taking the same probabilistic approach.
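To give a flavor of those companion notebooks, here is a minimal, hedged sketch using scikit-learn; the dataset, model, and printed metric are illustrative assumptions and are not taken from the book's actual code.

    # Illustrative sketch only: fit and evaluate a simple classifier in the
    # style of the book's companion notebooks (dataset and model choice are
    # assumptions, not the book's own example code).
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"test accuracy: {clf.score(X_test, y_test):.3f}")

Scripts like this can be opened and run directly in a cloud-based notebook environment, as the description above notes.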
Author(s): Kevin P. Murphy
Series: Adaptive Computation and Machine Learning Series
Publisher: MIT Press
Year: 2022
Language: English
Commentary: Draft of “Probabilistic Machine Learning: An Introduction”. August 8, 2022
Pages: 856
City: Cambridge
Cover
Brief Contents
Contents
Preface
Introduction
What is machine learning?
Supervised learning
Classification
Regression
Overfitting and generalization
No free lunch theorem
Unsupervised learning
Clustering
Discovering latent "factors of variation"
Self-supervised learning
Evaluating unsupervised learning
Reinforcement learning
Data
Some common image datasets
Some common text datasets
Preprocessing discrete input data
Preprocessing text data
Handling missing data
Discussion
The relationship between ML and other fields
Structure of the book
Caveats
I Foundations
Probability: Univariate Models
Introduction
What is probability?
Types of uncertainty
Probability as an extension of logic
Random variables
Discrete random variables
Continuous random variables
Sets of related random variables
Independence and conditional independence
Moments of a distribution
Limitations of summary statistics *
Bayes' rule
Example: Testing for COVID-19
Example: The Monty Hall problem
Inverse problems *
Bernoulli and binomial distributions
Definition
Sigmoid (logistic) function
Binary logistic regression
Categorical and multinomial distributions
Definition
Softmax function
Multiclass logistic regression
Log-sum-exp trick
Univariate Gaussian (normal) distribution
Cumulative distribution function
Probability density function
Regression
Why is the Gaussian distribution so widely used?
Dirac delta function as a limiting case
Some other common univariate distributions *
Student t distribution
Cauchy distribution
Laplace distribution
Beta distribution
Gamma distribution
Empirical distribution
Transformations of random variables *
Discrete case
Continuous case
Invertible transformations (bijections)
Moments of a linear transformation
The convolution theorem
Central limit theorem
Monte Carlo approximation
Exercises
Probability: Multivariate Models
Joint distributions for multiple random variables
Covariance
Correlation
Uncorrelated does not imply independent
Correlation does not imply causation
Simpson's paradox
The multivariate Gaussian (normal) distribution
Definition
Mahalanobis distance
Marginals and conditionals of an MVN *
Example: conditioning a 2d Gaussian
Example: Imputing missing values *
Linear Gaussian systems *
Bayes' rule for Gaussians
Derivation *
Example: Inferring an unknown scalar
Example: inferring an unknown vector
Example: sensor fusion
The exponential family *
Definition
Example
Log partition function is cumulant generating function
Maximum entropy derivation of the exponential family
Mixture models
Gaussian mixture models
Bernoulli mixture models
Probabilistic graphical models *
Representation
Inference
Learning
Exercises
Statistics
Introduction
Maximum likelihood estimation (MLE)
Definition
Justification for MLE
Example: MLE for the Bernoulli distribution
Example: MLE for the categorical distribution
Example: MLE for the univariate Gaussian
Example: MLE for the multivariate Gaussian
Example: MLE for linear regression
Empirical risk minimization (ERM)
Example: minimizing the misclassification rate
Surrogate loss
Other estimation methods *
The method of moments
Online (recursive) estimation
Regularization
Example: MAP estimation for the Bernoulli distribution
Example: MAP estimation for the multivariate Gaussian *
Example: weight decay
Picking the regularizer using a validation set
Cross-validation
Early stopping
Using more data
Bayesian statistics *
Conjugate priors
The beta-binomial model
The Dirichlet-multinomial model
The Gaussian-Gaussian model
Beyond conjugate priors
Credible intervals
Bayesian machine learning
Computational issues
Frequentist statistics *
Sampling distributions
Gaussian approximation of the sampling distribution of the MLE
Bootstrap approximation of the sampling distribution of any estimator
Confidence intervals
Caution: Confidence intervals are not credible
The bias-variance tradeoff
Exercises
Decision Theory
Bayesian decision theory
Basics
Classification problems
ROC curves
Precision-recall curves
Regression problems
Probabilistic prediction problems
Choosing the "right" model
Bayesian hypothesis testing
Bayesian model selection
Occam's razor
Connection between cross validation and marginal likelihood
Information criteria
Posterior inference over effect sizes and Bayesian significance testing
Frequentist decision theory
Computing the risk of an estimator
Consistent estimators
Admissible estimators
Empirical risk minimization
Empirical risk
Structural risk
Cross-validation
Statistical learning theory *
Frequentist hypothesis testing *
Likelihood ratio test
Null hypothesis significance testing (NHST)
p-values
p-values considered harmful
Why isn't everyone a Bayesian?
Exercises
Information Theory
Entropy
Entropy for discrete random variables
Cross entropy
Joint entropy
Conditional entropy
Perplexity
Differential entropy for continuous random variables *
Relative entropy (KL divergence) *
Definition
Interpretation
Example: KL divergence between two Gaussians
Non-negativity of KL
KL divergence and MLE
Forward vs reverse KL
Mutual information *
Definition
Interpretation
Example
Conditional mutual information
MI as a "generalized correlation coefficient"
Normalized mutual information
Maximal information coefficient
Data processing inequality
Sufficient statistics
Fano's inequality *
Exercises
Linear Algebra
Introduction
Notation
Vector spaces
Norms of a vector and matrix
Properties of a matrix
Special types of matrices
Matrix multiplication
Vector–vector products
Matrix–vector products
Matrix–matrix products
Application: manipulating data matrices
Kronecker products *
Einstein summation *
Matrix inversion
The inverse of a square matrix
Schur complements *
The matrix inversion lemma *
Matrix determinant lemma *
Application: deriving the conditionals of an MVN *
Eigenvalue decomposition (EVD)
Basics
Diagonalization
Eigenvalues and eigenvectors of symmetric matrices
Geometry of quadratic forms
Standardizing and whitening data
Power method
Deflation
Eigenvectors optimize quadratic forms
Singular value decomposition (SVD)
Basics
Connection between SVD and EVD
Pseudo inverse
SVD and the range and null space of a matrix *
Truncated SVD
Other matrix decompositions *
LU factorization
QR decomposition
Cholesky decomposition
Solving systems of linear equations *
Solving square systems
Solving underconstrained systems (least norm estimation)
Solving overconstrained systems (least squares estimation)
Matrix calculus
Derivatives
Gradients
Directional derivative
Total derivative *
Jacobian
Hessian
Gradients of commonly used functions
Exercises
Optimization
Introduction
Local vs global optimization
Constrained vs unconstrained optimization
Convex vs nonconvex optimization
Smooth vs nonsmooth optimization
First-order methods
Descent direction
Step size (learning rate)
Convergence rates
Momentum methods
Second-order methods
Newton's method
BFGS and other quasi-Newton methods
Trust region methods
Stochastic gradient descent
Application to finite sum problems
Example: SGD for fitting linear regression
Choosing the step size (learning rate)
Iterate averaging
Variance reduction *
Preconditioned SGD
Constrained optimization
Lagrange multipliers
The KKT conditions
Linear programming
Quadratic programming
Mixed integer linear programming *
Proximal gradient method *
Projected gradient descent
Proximal operator for ℓ1-norm regularizer
Proximal operator for quantization
Incremental (online) proximal methods
Bound optimization *
The general algorithm
The EM algorithm
Example: EM for a GMM
Blackbox and derivative free optimization
Exercises
II Linear Models
Linear Discriminant Analysis
Introduction
Gaussian discriminant analysis
Quadratic decision boundaries
Linear decision boundaries
The connection between LDA and logistic regression
Model fitting
Nearest centroid classifier
Fisher's linear discriminant analysis *
Naive Bayes classifiers
Example models
Model fitting
Bayesian naive Bayes
The connection between naive Bayes and logistic regression
Generative vs discriminative classifiers
Advantages of discriminative classifiers
Advantages of generative classifiers
Handling missing features
Exercises
Logistic Regression
Introduction
Binary logistic regression
Linear classifiers
Nonlinear classifiers
Maximum likelihood estimation
Stochastic gradient descent
Perceptron algorithm
Iteratively reweighted least squares
MAP estimation
Standardization
Multinomial logistic regression
Linear and nonlinear classifiers
Maximum likelihood estimation
Gradient-based optimization
Bound optimization
MAP estimation
Maximum entropy classifiers
Hierarchical classification
Handling large numbers of classes
Robust logistic regression *
Mixture model for the likelihood
Bi-tempered loss
Bayesian logistic regression *
Laplace approximation
Approximating the posterior predictive
Exercises
Linear Regression
Introduction
Least squares linear regression
Terminology
Least squares estimation
Other approaches to computing the MLE
Measuring goodness of fit
Ridge regression
Computing the MAP estimate
Connection between ridge regression and PCA
Choosing the strength of the regularizer
Lasso regression
MAP estimation with a Laplace prior (ℓ1 regularization)
Why does ℓ1 regularization yield sparse solutions?
Hard vs soft thresholding
Regularization path
Comparison of least squares, lasso, ridge and subset selection
Variable selection consistency
Group lasso
Elastic net (ridge and lasso combined)
Optimization algorithms
Regression splines *
B-spline basis functions
Fitting a linear model using a spline basis
Smoothing splines
Generalized additive models
Robust linear regression *
Laplace likelihood
Student-t likelihood
Huber loss
RANSAC
Bayesian linear regression *
Priors
Posteriors
Example
Computing the posterior predictive
The advantage of centering
Dealing with multicollinearity
Automatic relevancy determination (ARD) *
Exercises
Generalized Linear Models *
Introduction
Examples
Linear regression
Binomial regression
Poisson regression
GLMs with non-canonical link functions
Maximum likelihood estimation
Worked example: predicting insurance claims
III Deep Neural Networks
Neural Networks for Tabular Data
Introduction
Multilayer perceptrons (MLPs)
The XOR problem
Differentiable MLPs
Activation functions
Example models
The importance of depth
The "deep learning revolution"
Connections with biology
Backpropagation
Forward vs reverse mode differentiation
Reverse mode differentiation for multilayer perceptrons
Vector-Jacobian product for common layers
Computation graphs
Training neural networks
Tuning the learning rate
Vanishing and exploding gradients
Non-saturating activation functions
Residual connections
Parameter initialization
Parallel training
Regularization
Early stopping
Weight decay
Sparse DNNs
Dropout
Bayesian neural networks
Regularization effects of (stochastic) gradient descent *
Other kinds of feedforward networks *
Radial basis function networks
Mixtures of experts
Exercises
Neural Networks for Images
Introduction
Common layers
Convolutional layers
Pooling layers
Putting it all together
Normalization layers
Common architectures for image classification
LeNet
AlexNet
GoogLeNet (Inception)
ResNet
DenseNet
Neural architecture search
Other forms of convolution *
Dilated convolution
Transposed convolution
Depthwise separable convolution
Solving other discriminative vision tasks with CNNs *
Image tagging
Object detection
Instance segmentation
Semantic segmentation
Human pose estimation
Generating images by inverting CNNs *
Converting a trained classifier into a generative model
Image priors
Visualizing the features learned by a CNN
Deep Dream
Neural style transfer
Neural Networks for Sequences
Introduction
Recurrent neural networks (RNNs)
Vec2Seq (sequence generation)
Seq2Vec (sequence classification)
Seq2Seq (sequence translation)
Teacher forcing
Backpropagation through time
Vanishing and exploding gradients
Gating and long term memory
Beam search
1d CNNs
1d CNNs for sequence classification
Causal 1d CNNs for sequence generation
Attention
Attention as soft dictionary lookup
Kernel regression as non-parametric attention
Parametric attention
Seq2Seq with attention
Seq2Vec with attention (text classification)
Seq+Seq2Vec with attention (text pair classification)
Soft vs hard attention
Transformers
Self-attention
Multi-headed attention
Positional encoding
Putting it all together
Comparing transformers, CNNs and RNNs
Transformers for images *
Other transformer variants *
Efficient transformers *
Fixed non-learnable localized attention patterns
Learnable sparse attention patterns
Memory and recurrence methods
Low-rank and kernel methods
Language models and unsupervised representation learning
ELMo
BERT
GPT
T5
Discussion
IV Nonparametric Models
Exemplar-based Methods
K nearest neighbor (KNN) classification
Example
The curse of dimensionality
Reducing the speed and memory requirements
Open set recognition
Learning distance metrics
Linear and convex methods
Deep metric learning
Classification losses
Ranking losses
Speeding up ranking loss optimization
Other training tricks for DML
Kernel density estimation (KDE)
Density kernels
Parzen window density estimator
How to choose the bandwidth parameter
From KDE to KNN classification
Kernel regression
Kernel Methods *
Mercer kernels
Mercer's theorem
Some popular Mercer kernels
Gaussian processes
Noise-free observations
Noisy observations
Comparison to kernel regression
Weight space vs function space
Numerical issues
Estimating the kernel
GPs for classification
Connections with deep learning
Scaling GPs to large datasets
Support vector machines (SVMs)
Large margin classifiers
The dual problem
Soft margin classifiers
The kernel trick
Converting SVM outputs into probabilities
Connection with logistic regression
Multi-class classification with SVMs
How to choose the regularizer C
Kernel ridge regression
SVMs for regression
Sparse vector machines
Relevance vector machines (RVMs)
Comparison of sparse and dense kernel methods
Exercises
Trees, Forests, Bagging, and Boosting
Classification and regression trees (CART)
Model definition
Model fitting
Regularization
Handling missing input features
Pros and cons
Ensemble learning
Stacking
Ensembling is not Bayes model averaging
Bagging
Random forests
Boosting
Forward stagewise additive modeling
Quadratic loss and least squares boosting
Exponential loss and AdaBoost
LogitBoost
Gradient boosting
Interpreting tree ensembles
Feature importance
Partial dependency plots
V Beyond Supervised Learning
Learning with Fewer Labeled Examples
Data augmentation
Examples
Theoretical justification
Transfer learning
Fine-tuning
Adapters
Supervised pre-training
Unsupervised pre-training (self-supervised learning)
Domain adaptation
Semi-supervised learning
Self-training and pseudo-labeling
Entropy minimization
Co-training
Label propagation on graphs
Consistency regularization
Deep generative models *
Combining self-supervised and semi-supervised learning
Active learning
Decision-theoretic approach
Information-theoretic approach
Batch active learning
Meta-learning
Model-agnostic meta-learning (MAML)
Few-shot learning
Matching networks
Weakly supervised learning
Exercises
Dimensionality Reduction
Principal components analysis (PCA)
Examples
Derivation of the algorithm
Computational issues
Choosing the number of latent dimensions
Factor analysis *
Generative model
Probabilistic PCA
EM algorithm for FA/PPCA
Unidentifiability of the parameters
Nonlinear factor analysis
Mixtures of factor analysers
Exponential family factor analysis
Factor analysis models for paired data
Autoencoders
Bottleneck autoencoders
Denoising autoencoders
Contractive autoencoders
Sparse autoencoders
Variational autoencoders
Manifold learning *
What are manifolds?
The manifold hypothesis
Approaches to manifold learning
Multi-dimensional scaling (MDS)
Isomap
Kernel PCA
Maximum variance unfolding (MVU)
Local linear embedding (LLE)
Laplacian eigenmaps
t-SNE
Word embeddings
Latent semantic analysis / indexing
Word2vec
GloVe
Word analogies
RAND-WALK model of word embeddings
Contextual word embeddings
Exercises
Clustering
Introduction
Evaluating the output of clustering methods
Hierarchical agglomerative clustering
The algorithm
Example
Extensions
K-means clustering
The algorithm
Examples
Vector quantization
The K-means++ algorithm
The K-medoids algorithm
Speedup tricks
Choosing the number of clusters K
Clustering using mixture models
Mixtures of Gaussians
Mixtures of Bernoullis
Spectral clustering *
Normalized cuts
Eigenvectors of the graph Laplacian encode the clustering
Example
Connection with other methods
Biclustering *
Basic biclustering
Nested partition models (Crosscat)
Recommender Systems
Explicit feedback
Datasets
Collaborative filtering
Matrix factorization
Autoencoders
Implicit feedback
Bayesian personalized ranking
Factorization machines
Neural matrix factorization
Leveraging side information
Exploration-exploitation tradeoff
Graph Embeddings *
Introduction
Graph Embedding as an Encoder/Decoder Problem
Shallow graph embeddings
Unsupervised embeddings
Distance-based: Euclidean methods
Distance-based: non-Euclidean methods
Outer product-based: Matrix factorization methods
Outer product-based: Skip-gram methods
Supervised embeddings
Graph Neural Networks
Message passing GNNs
Spectral Graph Convolutions
Spatial Graph Convolutions
Non-Euclidean Graph Convolutions
Deep graph embeddings
Unsupervised embeddings
Semi-supervised embeddings
Applications
Unsupervised applications
Supervised applications
Notation
Introduction
Common mathematical symbols
Functions
Common functions of one argument
Common functions of two arguments
Common functions of >2 arguments
Linear algebra
General notation
Vectors
Matrices
Matrix calculus
Optimization
Probability
Information theory
Statistics and machine learning
Supervised learning
Unsupervised learning and generative models
Bayesian inference
Abbreviations
Index
Bibliography