An advanced book for researchers and graduate students working in machine learning and statistics who want to learn about deep learning, Bayesian inference, generative models, and decision making under uncertainty.
An advanced counterpart to *Probabilistic Machine Learning: An Introduction*, this high-level textbook provides researchers and graduate students with detailed coverage of cutting-edge topics in machine learning, including deep generative modeling, graphical models, Bayesian inference, reinforcement learning, and causality. This volume puts deep learning into a larger statistical context and unifies approaches based on deep learning with those based on probabilistic modeling and inference. With contributions from top scientists and domain experts at institutions including Google, DeepMind, Amazon, Purdue University, NYU, and the University of Washington, this rigorous book is essential to understanding the vital issues in machine learning.
- Covers generation of high-dimensional outputs, such as images, text, and graphs
- Discusses methods for discovering insights about data, based on latent variable models
- Considers training and testing under different distributions
- Explores how to use probabilistic models and inference for causal inference and decision making
- Features an online Python code accompaniment
Author(s): Kevin P. Murphy
Series: Adaptive Computation and Machine Learning 02
Edition: Draft
Publisher: The MIT Press
Year: 2023
Language: English
Pages: 1360
Preface
1 Introduction
I Fundamentals
2 Probability
2.1 Introduction
2.1.1 Probability space
2.1.2 Discrete random variables
2.1.3 Continuous random variables
2.1.4 Probability axioms
2.1.5 Conditional probability
2.1.6 Bayes' rule
2.2 Some common probability distributions
2.2.1 Discrete distributions
2.2.2 Continuous distributions on ℝ
2.2.3 Continuous distributions on ℝ⁺
2.2.4 Continuous distributions on [0,1]
2.2.5 Multivariate continuous distributions
2.3 Gaussian joint distributions
2.3.1 The multivariate normal
2.3.2 Linear Gaussian systems
2.3.3 A general calculus for linear Gaussian systems
2.4 The exponential family
2.4.1 Definition
2.4.2 Examples
2.4.3 Log partition function is cumulant generating function
2.4.4 Canonical (natural) vs mean (moment) parameters
2.4.5 MLE for the exponential family
2.4.6 Exponential dispersion family
2.4.7 Maximum entropy derivation of the exponential family
2.5 Transformations of random variables
2.5.1 Invertible transformations (bijections)
2.5.2 Monte Carlo approximation
2.5.3 Probability integral transform
2.6 Markov chains
2.6.1 Parameterization
2.6.2 Application: language modeling
2.6.3 Parameter estimation
2.6.4 Stationary distribution of a Markov chain
2.7 Divergence measures between probability distributions
2.7.1 f-divergence
2.7.2 Integral probability metrics
2.7.3 Maximum mean discrepancy (MMD)
2.7.4 Total variation distance
2.7.5 Density ratio estimation using binary classifiers
3 Statistics
3.1 Introduction
3.2 Bayesian statistics
3.2.1 Tossing coins
3.2.2 Modeling more complex data
3.2.3 Selecting the prior
3.2.4 Computational issues
3.2.5 Exchangeability and de Finetti's theorem
3.3 Frequentist statistics
3.3.1 Sampling distributions
3.3.2 Bootstrap approximation of the sampling distribution
3.3.3 Asymptotic normality of the sampling distribution of the MLE
3.3.4 Fisher information matrix
3.3.5 Counterintuitive properties of frequentist statistics
3.3.6 Why isn't everyone a Bayesian?
3.4 Conjugate priors
3.4.1 The binomial model
3.4.2 The multinomial model
3.4.3 The univariate Gaussian model
3.4.4 The multivariate Gaussian model
3.4.5 The exponential family model
3.4.6 Beyond conjugate priors
3.5 Noninformative priors
3.5.1 Maximum entropy priors
3.5.2 Jeffreys priors
3.5.3 Invariant priors
3.5.4 Reference priors
3.6 Hierarchical priors
3.6.1 A hierarchical binomial model
3.6.2 A hierarchical Gaussian model
3.6.3 Hierarchical conditional models
3.7 Empirical Bayes
3.7.1 EB for the hierarchical binomial model
3.7.2 EB for the hierarchical Gaussian model
3.7.3 EB for Markov models (n-gram smoothing)
3.7.4 EB for non-conjugate models
3.8 Model selection
3.8.1 Bayesian model selection
3.8.2 Bayes model averaging
3.8.3 Estimating the marginal likelihood
3.8.4 Connection between cross validation and marginal likelihood
3.8.5 Conditional marginal likelihood
3.8.6 Bayesian leave-one-out (LOO) estimate
3.8.7 Information criteria
3.9 Model checking
3.9.1 Posterior predictive checks
3.9.2 Bayesian p-values
3.10 Hypothesis testing
3.10.1 Frequentist approach
3.10.2 Bayesian approach
3.10.3 Common statistical tests correspond to inference in linear models
3.11 Missing data
4 Graphical models
4.1 Introduction
4.2 Directed graphical models (Bayes nets)
4.2.1 Representing the joint distribution
4.2.2 Examples
4.2.3 Gaussian Bayes nets
4.2.4 Conditional independence properties
4.2.5 Generation (sampling)
4.2.6 Inference
4.2.7 Learning
4.2.8 Plate notation
4.3 Undirected graphical models (Markov random fields)
4.3.1 Representing the joint distribution
4.3.2 Fully visible MRFs (Ising, Potts, Hopfield, etc.)
4.3.3 MRFs with latent variables (Boltzmann machines, etc.)
4.3.4 Maximum entropy models
4.3.5 Gaussian MRFs
4.3.6 Conditional independence properties
4.3.7 Generation (sampling)
4.3.8 Inference
4.3.9 Learning
4.4 Conditional random fields (CRFs)
4.4.1 1d CRFs
4.4.2 2d CRFs
4.4.3 Parameter estimation
4.4.4 Other approaches to structured prediction
4.5 Comparing directed and undirected PGMs
4.5.1 CI properties
4.5.2 Converting between a directed and undirected model
4.5.3 Conditional directed vs undirected PGMs and the label bias problem
4.5.4 Combining directed and undirected graphs
4.5.5 Comparing directed and undirected Gaussian PGMs
4.6 PGM extensions
4.6.1 Factor graphs
4.6.2 Probabilistic circuits
4.6.3 Directed relational PGMs
4.6.4 Undirected relational PGMs
4.6.5 Open-universe probability models
4.6.6 Programs as probability models
4.7 Structural causal models
4.7.1 Example: causal impact of education on wealth
4.7.2 Structural equation models
4.7.3 Do operator and augmented DAGs
4.7.4 Counterfactuals
5 Information theory
5.1 KL divergence
5.1.1 Desiderata
5.1.2 The KL divergence uniquely satisfies the desiderata
5.1.3 Thinking about KL
5.1.4 Minimizing KL
5.1.5 Properties of KL
5.1.6 KL divergence and MLE
5.1.7 KL divergence and Bayesian inference
5.1.8 KL divergence and exponential families
5.1.9 Approximating KL divergence using the Fisher information matrix
5.1.10 Bregman divergence
5.2 Entropy
5.2.1 Definition
5.2.2 Differential entropy for continuous random variables
5.2.3 Typical sets
5.2.4 Cross entropy and perplexity
5.3 Mutual information
5.3.1 Definition
5.3.2 Interpretation
5.3.3 Data processing inequality
5.3.4 Sufficient statistics
5.3.5 Multivariate mutual information
5.3.6 Variational bounds on mutual information
5.3.7 Relevance networks
5.4 Data compression (source coding)
5.4.1 Lossless compression
5.4.2 Lossy compression and the rate-distortion tradeoff
5.4.3 Bits back coding
5.5 Error-correcting codes (channel coding)
5.6 The information bottleneck
5.6.1 Vanilla IB
5.6.2 Variational IB
5.6.3 Conditional entropy bottleneck
6 Optimization
6.1 Introduction
6.2 Automatic differentiation
6.2.1 Differentiation in functional form
6.2.2 Differentiating chains, circuits, and programs
6.3 Stochastic optimization
6.3.1 Stochastic gradient descent
6.3.2 SGD for optimizing a finite-sum objective
6.3.3 SGD for optimizing the parameters of a distribution
6.3.4 Score function estimator (REINFORCE)
6.3.5 Reparameterization trick
6.3.6 Gumbel softmax trick
6.3.7 Stochastic computation graphs
6.3.8 Straight-through estimator
6.4 Natural gradient descent
6.4.1 Defining the natural gradient
6.4.2 Interpretations of NGD
6.4.3 Benefits of NGD
6.4.4 Approximating the natural gradient
6.4.5 Natural gradients for the exponential family
6.5 Bound optimization (MM) algorithms
6.5.1 The general algorithm
6.5.2 Example: logistic regression
6.5.3 The EM algorithm
6.5.4 Example: EM for an MVN with missing data
6.5.5 Example: robust linear regression using Student likelihood
6.5.6 Extensions to EM
6.6 Bayesian optimization
6.6.1 Sequential model-based optimization
6.6.2 Surrogate functions
6.6.3 Acquisition functions
6.6.4 Other issues
6.7 Derivative-free optimization
6.7.1 Local search
6.7.2 Simulated annealing
6.7.3 Evolutionary algorithms
6.7.4 Estimation of distribution (EDA) algorithms
6.7.5 Cross-entropy method
6.7.6 Evolutionary strategies
6.8 Optimal transport
6.8.1 Warm-up: matching optimally two families of points
6.8.2 From optimal matchings to Kantorovich and Monge formulations
6.8.3 Solving optimal transport
6.9 Submodular optimization
6.9.1 Intuition, examples, and background
6.9.2 Submodular basic definitions
6.9.3 Example submodular functions
6.9.4 Submodular optimization
6.9.5 Applications of submodularity in machine learning and AI
6.9.6 Sketching, coresets, distillation, and data subset and feature selection
6.9.7 Combinatorial information functions
6.9.8 Clustering, data partitioning, and parallel machine learning
6.9.9 Active and semi-supervised learning
6.9.10 Probabilistic modeling
6.9.11 Structured norms and loss functions
6.9.12 Conclusions
II Inference
7 Inference algorithms: an overview
7.1 Introduction
7.2 Common inference patterns
7.2.1 Global latents
7.2.2 Local latents
7.2.3 Global and local latents
7.3 Exact inference algorithms
7.4 Approximate inference algorithms
7.4.1 The MAP approximation and its problems
7.4.2 Grid approximation
7.4.3 Laplace (quadratic) approximation
7.4.4 Variational inference
7.4.5 Markov chain Monte Carlo (MCMC)
7.4.6 Sequential Monte Carlo
7.4.7 Challenging posteriors
7.5 Evaluating approximate inference algorithms
8 Gaussian filtering and smoothing
8.1 Introduction
8.1.1 Inferential goals
8.1.2 Bayesian filtering equations
8.1.3 Bayesian smoothing equations
8.1.4 The Gaussian ansatz
8.2 Inference for linear-Gaussian SSMs
8.2.1 Examples
8.2.2 The Kalman filter
8.2.3 The Kalman (RTS) smoother
8.2.4 Information form filtering and smoothing
8.3 Inference based on local linearization
8.3.1 Taylor series expansion
8.3.2 The extended Kalman filter (EKF)
8.3.3 The extended Kalman smoother (EKS)
8.4 Inference based on the unscented transform
8.4.1 The unscented transform
8.4.2 The unscented Kalman filter (UKF)
8.4.3 The unscented Kalman smoother (UKS)
8.5 Other variants of the Kalman filter
8.5.1 General Gaussian filtering
8.5.2 Conditional moment Gaussian filtering
8.5.3 Iterated filters and smoothers
8.5.4 Ensemble Kalman filter
8.5.5 Robust Kalman filters
8.5.6 Dual EKF
8.5.7 Normalizing flow KFs
8.6 Assumed density filtering
8.6.1 Connection with Gaussian filtering
8.6.2 ADF for SLDS (Gaussian sum filter)
8.6.3 ADF for online logistic regression
8.6.4 ADF for online DNNs
8.7 Other inference methods for SSMs
8.7.1 Grid-based approximations
8.7.2 Expectation propagation
8.7.3 Variational inference
8.7.4 MCMC
8.7.5 Particle filtering
9 Message passing algorithms
9.1 Introduction
9.2 Belief propagation on chains
9.2.1 Hidden Markov Models
9.2.2 The forwards algorithm
9.2.3 The forwards-backwards algorithm
9.2.4 Forwards filtering backwards smoothing
9.2.5 Time and space complexity
9.2.6 The Viterbi algorithm
9.2.7 Forwards filtering backwards sampling
9.3 Belief propagation on trees
9.3.1 Directed vs undirected trees
9.3.2 Sum-product algorithm
9.3.3 Max-product algorithm
9.4 Loopy belief propagation
9.4.1 Loopy BP for pairwise undirected graphs
9.4.2 Loopy BP for factor graphs
9.4.3 Gaussian belief propagation
9.4.4 Convergence
9.4.5 Accuracy
9.4.6 Generalized belief propagation
9.4.7 Convex BP
9.4.8 Application: error correcting codes
9.4.9 Application: affinity propagation
9.4.10 Emulating BP with graph neural nets
9.5 The variable elimination (VE) algorithm
9.5.1 Derivation of the algorithm
9.5.2 Computational complexity of VE
9.5.3 Picking a good elimination order
9.5.4 Computational complexity of exact inference
9.5.5 Drawbacks of VE
9.6 The junction tree algorithm (JTA)
9.7 Inference as optimization
9.7.1 Inference as backpropagation
9.7.2 Perturb and MAP
10 Variational inference
10.1 Introduction
10.1.1 The variational objective
10.1.2 Form of the variational posterior
10.1.3 Parameter estimation using variational EM
10.1.4 Stochastic VI
10.1.5 Amortized VI
10.1.6 Semi-amortized inference
10.2 Gradient-based VI
10.2.1 Reparameterized VI
10.2.2 Automatic differentiation VI
10.2.3 Blackbox variational inference
10.3 Coordinate ascent VI
10.3.1 Derivation of CAVI algorithm
10.3.2 Example: CAVI for the Ising model
10.3.3 Variational Bayes
10.3.4 Example: VB for a univariate Gaussian
10.3.5 Variational Bayes EM
10.3.6 Example: VBEM for a GMM
10.3.7 Variational message passing (VMP)
10.3.8 Autoconj
10.4 More accurate variational posteriors
10.4.1 Structured mean field
10.4.2 Hierarchical (auxiliary variable) posteriors
10.4.3 Normalizing flow posteriors
10.4.4 Implicit posteriors
10.4.5 Combining VI with MCMC inference
10.5 Tighter bounds
10.5.1 Multi-sample ELBO (IWAE bound)
10.5.2 The thermodynamic variational objective (TVO)
10.5.3 Minimizing the evidence upper bound
10.6 Wake-sleep algorithm
10.6.1 Wake phase
10.6.2 Sleep phase
10.6.3 Daydream phase
10.6.4 Summary of algorithm
10.7 Expectation propagation (EP)
10.7.1 Algorithm
10.7.2 Example
10.7.3 EP as generalized ADF
10.7.4 Optimization issues
10.7.5 Power EP and α-divergence
10.7.6 Stochastic EP
11 Monte Carlo methods
11.1 Introduction
11.2 Monte Carlo integration
11.2.1 Example: estimating π by Monte Carlo integration
11.2.2 Accuracy of Monte Carlo integration
11.3 Generating random samples from simple distributions
11.3.1 Sampling using the inverse cdf
11.3.2 Sampling from a Gaussian (Box-Muller method)
11.4 Rejection sampling
11.4.1 Basic idea
11.4.2 Example
11.4.3 Adaptive rejection sampling
11.4.4 Rejection sampling in high dimensions
11.5 Importance sampling
11.5.1 Direct importance sampling
11.5.2 Self-normalized importance sampling
11.5.3 Choosing the proposal
11.5.4 Annealed importance sampling (AIS)
11.6 Controlling Monte Carlo variance
11.6.1 Common random numbers
11.6.2 Rao-Blackwellization
11.6.3 Control variates
11.6.4 Antithetic sampling
11.6.5 Quasi-Monte Carlo (QMC)
12 Markov chain Monte Carlo
12.1 Introduction
12.2 Metropolis-Hastings algorithm
12.2.1 Basic idea
12.2.2 Why MH works
12.2.3 Proposal distributions
12.2.4 Initialization
12.3 Gibbs sampling
12.3.1 Basic idea
12.3.2 Gibbs sampling is a special case of MH
12.3.3 Example: Gibbs sampling for Ising models
12.3.4 Example: Gibbs sampling for Potts models
12.3.5 Example: Gibbs sampling for GMMs
12.3.6 Metropolis within Gibbs
12.3.7 Blocked Gibbs sampling
12.3.8 Collapsed Gibbs sampling
12.4 Auxiliary variable MCMC
12.4.1 Slice sampling
12.4.2 Swendsen-Wang
12.5 Hamiltonian Monte Carlo (HMC)
12.5.1 Hamiltonian mechanics
12.5.2 Integrating Hamilton's equations
12.5.3 The HMC algorithm
12.5.4 Tuning HMC
12.5.5 Riemann manifold HMC
12.5.6 Langevin Monte Carlo (MALA)
12.5.7 Connection between SGD and Langevin sampling
12.5.8 Applying HMC to constrained parameters
12.5.9 Speeding up HMC
12.6 MCMC convergence
12.6.1 Mixing rates of Markov chains
12.6.2 Practical convergence diagnostics
12.6.3 Effective sample size
12.6.4 Improving speed of convergence
12.6.5 Non-centered parameterizations and Neal's funnel
12.7 Stochastic gradient MCMC
12.7.1 Stochastic gradient Langevin dynamics (SGLD)
12.7.2 Preconditioning
12.7.3 Reducing the variance of the gradient estimate
12.7.4 SG-HMC
12.7.5 Underdamped Langevin dynamics
12.8 Reversible jump (transdimensional) MCMC
12.8.1 Basic idea
12.8.2 Example
12.8.3 Discussion
12.9 Annealing methods
12.9.1 Simulated annealing
12.9.2 Parallel tempering
13 Sequential Monte Carlo
13.1 Introduction
13.1.1 Problem statement
13.1.2 Particle filtering for state-space models
13.1.3 SMC samplers for static parameter estimation
13.2 Particle filtering
13.2.1 Importance sampling
13.2.2 Sequential importance sampling
13.2.3 Sequential importance sampling with resampling
13.2.4 Resampling methods
13.2.5 Adaptive resampling
13.3 Proposal distributions
13.3.1 Locally optimal proposal
13.3.2 Proposals based on the extended and unscented Kalman filter
13.3.3 Proposals based on the Laplace approximation
13.3.4 Proposals based on SMC (nested SMC)
13.4 Rao-Blackwellized particle filtering (RBPF)
13.4.1 Mixture of Kalman filters
13.4.2 Example: tracking a maneuvering object
13.4.3 Example: FastSLAM
13.5 Extensions of the particle filter
13.6 SMC samplers
13.6.1 Ingredients of an SMC sampler
13.6.2 Likelihood tempering (geometric path)
13.6.3 Data tempering
13.6.4 Sampling rare events and extrema
13.6.5 SMC-ABC and likelihood-free inference
13.6.6 SMC²
13.6.7 Variational filtering SMC
13.6.8 Variational smoothing SMC
III Prediction
14 Predictive models: an overview
14.1 Introduction
14.1.1 Types of model
14.1.2 Model fitting using ERM, MLE, and MAP
14.1.3 Model fitting using Bayes, VI, and generalized Bayes
14.2 Evaluating predictive models
14.2.1 Proper scoring rules
14.2.2 Calibration
14.2.3 Beyond evaluating marginal probabilities
14.3 Conformal prediction
14.3.1 Conformalizing classification
14.3.2 Conformalizing regression
15 Generalized linear models
15.1 Introduction
15.1.1 Some popular GLMs
15.1.2 GLMs with noncanonical link functions
15.1.3 Maximum likelihood estimation
15.1.4 Bayesian inference
15.2 Linear regression
15.2.1 Ordinary least squares
15.2.2 Conjugate priors
15.2.3 Uninformative priors
15.2.4 Informative priors
15.2.5 Spike and slab prior
15.2.6 Laplace prior (Bayesian lasso)
15.2.7 Horseshoe prior
15.2.8 Automatic relevancy determination
15.2.9 Multivariate linear regression
15.3 Logistic regression
15.3.1 Binary logistic regression
15.3.2 Multinomial logistic regression
15.3.3 Dealing with class imbalance and the long tail
15.3.4 Parameter priors
15.3.5 Laplace approximation to the posterior
15.3.6 Approximating the posterior predictive distribution
15.3.7 MCMC inference
15.3.8 Other approximate inference methods
15.3.9 Case study: is Berkeley admissions biased against women?
15.4 Probit regression
15.4.1 Latent variable interpretation
15.4.2 Maximum likelihood estimation
15.4.3 Bayesian inference
15.4.4 Ordinal probit regression
15.4.5 Multinomial probit models
15.5 Multilevel (hierarchical) GLMs
15.5.1 Generalized linear mixed models (GLMMs)
15.5.2 Example: radon regression
16 Deep neural networks
16.1 Introduction
16.2 Building blocks of differentiable circuits
16.2.1 Linear layers
16.2.2 Nonlinearities
16.2.3 Convolutional layers
16.2.4 Residual (skip) connections
16.2.5 Normalization layers
16.2.6 Dropout layers
16.2.7 Attention layers
16.2.8 Recurrent layers
16.2.9 Multiplicative layers
16.2.10 Implicit layers
16.3 Canonical examples of neural networks
16.3.1 Multilayer perceptrons (MLPs)
16.3.2 Convolutional neural networks (CNNs)
16.3.3 Autoencoders
16.3.4 Recurrent neural networks (RNNs)
16.3.5 Transformers
16.3.6 Graph neural networks (GNNs)
17 Bayesian neural networks
17.1 Introduction
17.2 Priors for BNNs
17.2.1 Gaussian priors
17.2.2 Sparsity-promoting priors
17.2.3 Learning the prior
17.2.4 Priors in function space
17.2.5 Architectural priors
17.3 Posteriors for BNNs
17.3.1 Monte Carlo dropout
17.3.2 Laplace approximation
17.3.3 Variational inference
17.3.4 Expectation propagation
17.3.5 Last layer methods
17.3.6 SNGP
17.3.7 MCMC methods
17.3.8 Methods based on the SGD trajectory
17.3.9 Deep ensembles
17.3.10 Approximating the posterior predictive distribution
17.3.11 Tempered and cold posteriors
17.4 Generalization in Bayesian deep learning
17.4.1 Sharp vs flat minima
17.4.2 Mode connectivity and the loss landscape
17.4.3 Effective dimensionality of a model
17.4.4 The hypothesis space of DNNs
17.4.5 PAC-Bayes
17.4.6 Out-of-distribution generalization for BNNs
17.4.7 Model selection for BNNs
17.5 Online inference
17.5.1 Sequential Laplace for DNNs
17.5.2 Extended Kalman filtering for DNNs
17.5.3 Assumed density filtering for DNNs
17.5.4 Online variational inference for DNNs
17.6 Hierarchical Bayesian neural networks
17.6.1 Example: multimoons classification
18 Gaussian processes
18.1 Introduction
18.1.1 GPs: what and why?
18.2 Mercer kernels
18.2.1 Stationary kernels
18.2.2 Nonstationary kernels
18.2.3 Kernels for nonvectorial (structured) inputs
18.2.4 Making new kernels from old
18.2.5 Mercer's theorem
18.2.6 Approximating kernels with random features
18.3 GPs with Gaussian likelihoods
18.3.1 Predictions using noise-free observations
18.3.2 Predictions using noisy observations
18.3.3 Weight space vs function space
18.3.4 Semiparametric GPs
18.3.5 Marginal likelihood
18.3.6 Computational and numerical issues
18.3.7 Kernel ridge regression
18.4 GPs with non-Gaussian likelihoods
18.4.1 Binary classification
18.4.2 Multiclass classification
18.4.3 GPs for Poisson regression (Cox process)
18.4.4 Other likelihoods
18.5 Scaling GP inference to large datasets
18.5.1 Subset of data
18.5.2 Nyström approximation
18.5.3 Inducing point methods
18.5.4 Sparse variational methods
18.5.5 Exploiting parallelization and structure via kernel matrix multiplies
18.5.6 Converting a GP to an SSM
18.6 Learning the kernel
18.6.1 Empirical Bayes for the kernel parameters
18.6.2 Bayesian inference for the kernel parameters
18.6.3 Multiple kernel learning for additive kernels
18.6.4 Automatic search for compositional kernels
18.6.5 Spectral mixture kernel learning
18.6.6 Deep kernel learning
18.7 GPs and DNNs
18.7.1 Kernels derived from infinitely wide DNNs (NN-GP)
18.7.2 Neural tangent kernel (NTK)
18.7.3 Deep GPs
18.8 Gaussian processes for time series forecasting
18.8.1 Example: Mauna Loa
19 Beyond the iid assumption
19.1 Introduction
19.2 Distribution shift
19.2.1 Motivating examples
19.2.2 A causal view of distribution shift
19.2.3 The four main types of distribution shift
19.2.4 Selection bias
19.3 Detecting distribution shifts
19.3.1 Detecting shifts using two-sample testing
19.3.2 Detecting single out-of-distribution (OOD) inputs
19.3.3 Selective prediction
19.3.4 Open set and open world recognition
19.4 Robustness to distribution shifts
19.4.1 Data augmentation
19.4.2 Distributionally robust optimization
19.5 Adapting to distribution shifts
19.5.1 Supervised adaptation using transfer learning
19.5.2 Weighted ERM for covariate shift
19.5.3 Unsupervised domain adaptation for covariate shift
19.5.4 Unsupervised techniques for label shift
19.5.5 Test-time adaptation
19.6 Learning from multiple distributions
19.6.1 Multitask learning
19.6.2 Domain generalization
19.6.3 Invariant risk minimization
19.6.4 Meta learning
19.7 Continual learning
19.7.1 Domain drift
19.7.2 Concept drift
19.7.3 Class incremental learning
19.7.4 Catastrophic forgetting
19.7.5 Online learning
19.8 Adversarial examples
19.8.1 Whitebox (gradient-based) attacks
19.8.2 Blackbox (gradient-free) attacks
19.8.3 Real world adversarial attacks
19.8.4 Defenses based on robust optimization
19.8.5 Why models have adversarial examples
IV Generation
20 Generative models: an overview
20.1 Introduction
20.2 Types of generative model
20.3 Goals of generative modeling
20.3.1 Generating data
20.3.2 Density estimation
20.3.3 Imputation
20.3.4 Structure discovery
20.3.5 Latent space interpolation
20.3.6 Latent space arithmetic
20.3.7 Generative design
20.3.8 Model-based reinforcement learning
20.3.9 Representation learning
20.3.10 Data compression
20.4 Evaluating generative models
20.4.1 Likelihood-based evaluation
20.4.2 Distances and divergences in feature space
20.4.3 Precision and recall metrics
20.4.4 Statistical tests
20.4.5 Challenges with using pretrained classifiers
20.4.6 Using model samples to train classifiers
20.4.7 Assessing overfitting
20.4.8 Human evaluation
21 Variational autoencoders
21.1 Introduction
21.2 VAE basics
21.2.1 Modeling assumptions
21.2.2 Model fitting
21.2.3 Comparison of VAEs and autoencoders
21.2.4 VAEs optimize in an augmented space
21.3 VAE generalizations
21.3.1 β-VAE
21.3.2 InfoVAE
21.3.3 Multimodal VAEs
21.3.4 Semisupervised VAEs
21.3.5 VAEs with sequential encoders/decoders
21.4 Avoiding posterior collapse
21.4.1 KL annealing
21.4.2 Lower bounding the rate
21.4.3 Free bits
21.4.4 Adding skip connections
21.4.5 Improved variational inference
21.4.6 Alternative objectives
21.5 VAEs with hierarchical structure
21.5.1 Bottom-up vs top-down inference
21.5.2 Example: very deep VAE
21.5.3 Connection with autoregressive models
21.5.4 Variational pruning
21.5.5 Other optimization difficulties
21.6 Vector quantization VAE
21.6.1 Autoencoder with binary code
21.6.2 VQ-VAE model
21.6.3 Learning the prior
21.6.4 Hierarchical extension (VQ-VAE-2)
21.6.5 Discrete VAE
21.6.6 VQ-GAN
22 Autoregressive models
22.1 Introduction
22.2 Neural autoregressive density estimators (NADE)
22.3 Causal CNNs
22.3.1 1d causal CNN (convolutional Markov models)
22.3.2 2d causal CNN (PixelCNN)
22.4 Transformers
22.4.1 Text generation (GPT, etc.)
22.4.2 Image generation (DALL-E, etc.)
22.4.3 Other applications
23 Normalizing flows
23.1 Introduction
23.1.1 Preliminaries
23.1.2 How to train a flow model
23.2 Constructing flows
23.2.1 Affine flows
23.2.2 Elementwise flows
23.2.3 Coupling flows
23.2.4 Autoregressive flows
23.2.5 Residual flows
23.2.6 Continuous-time flows
23.3 Applications
23.3.1 Density estimation
23.3.2 Generative modeling
23.3.3 Inference
24 Energy-based models
24.1 Introduction
24.1.1 Example: products of experts (PoE)
24.1.2 Computational difficulties
24.2 Maximum likelihood training
24.2.1 Gradient-based MCMC methods
24.2.2 Contrastive divergence
24.3 Score matching (SM)
24.3.1 Basic score matching
24.3.2 Denoising score matching (DSM)
24.3.3 Sliced score matching (SSM)
24.3.4 Connection to contrastive divergence
24.3.5 Score-based generative models
24.4 Noise contrastive estimation
24.4.1 Connection to score matching
24.5 Other methods
24.5.1 Minimizing differences/derivatives of KL divergences
24.5.2 Minimizing the Stein discrepancy
24.5.3 Adversarial training
25 Diffusion models
25.1 Introduction
25.2 Denoising diffusion probabilistic models (DDPMs)
25.2.1 Encoder (forwards diffusion)
25.2.2 Decoder (reverse diffusion)
25.2.3 Model fitting
25.2.4 Learning the noise schedule
25.2.5 Example: image generation
25.3 Score-based generative models (SGMs)
25.3.1 Example
25.3.2 Adding noise at multiple scales
25.3.3 Equivalence to DDPM
25.4 Continuous time models using differential equations
25.4.1 Forwards diffusion SDE
25.4.2 Forwards diffusion ODE
25.4.3 Reverse diffusion SDE
25.4.4 Reverse diffusion ODE
25.4.5 Comparison of the SDE and ODE approach
25.4.6 Example
25.5 Speeding up diffusion models
25.5.1 DDIM sampler
25.5.2 Non-Gaussian decoder networks
25.5.3 Distillation
25.5.4 Latent space diffusion
25.6 Conditional generation
25.6.1 Conditional diffusion model
25.6.2 Classifier guidance
25.6.3 Classifier-free guidance
25.6.4 Generating high resolution images
25.7 Diffusion for discrete state spaces
25.7.1 Discrete denoising diffusion probabilistic models
25.7.2 Choice of Markov transition matrices for the forward processes
25.7.3 Parameterization of the reverse process
25.7.4 Noise schedules
25.7.5 Connections to other probabilistic models for discrete sequences
26 Generative adversarial networks
26.1 Introduction
26.2 Learning by comparison
26.2.1 Guiding principles
26.2.2 Density ratio estimation using binary classifiers
26.2.3 Bounds on f-divergences
26.2.4 Integral probability metrics
26.2.5 Moment matching
26.2.6 On density ratios and differences
26.3 Generative adversarial networks
26.3.1 From learning principles to loss functions
26.3.2 Gradient descent
26.3.3 Challenges with GAN training
26.3.4 Improving GAN optimization
26.3.5 Convergence of GAN training
26.4 Conditional GANs
26.5 Inference with GANs
26.6 Neural architectures in GANs
26.6.1 The importance of discriminator architectures
26.6.2 Architectural inductive biases
26.6.3 Attention in GANs
26.6.4 Progressive generation
26.6.5 Regularization
26.6.6 Scaling up GAN models
26.7 Applications
26.7.1 GANs for image generation
26.7.2 Video generation
26.7.3 Audio generation
26.7.4 Text generation
26.7.5 Imitation learning
26.7.6 Domain adaptation
26.7.7 Design, art and creativity
V Discovery
27 Discovery methods: an overview
27.1 Introduction
27.2 Overview of Part V: Discovery
28 Latent factor models
28.1 Introduction
28.2 Mixture models
28.2.1 Gaussian mixture models (GMMs)
28.2.2 Bernoulli mixture models
28.2.3 Gaussian scale mixtures (GSMs)
28.2.4 Using GMMs as a prior for inverse imaging problems
28.2.5 Using mixture models for classification problems
28.3 Factor analysis
28.3.1 Factor analysis: the basics
28.3.2 Probabilistic PCA
28.3.3 Mixture of factor analyzers
28.3.4 Factor analysis models for paired data
28.3.5 Factor analysis with exponential family likelihoods
28.3.6 Factor analysis with DNN likelihoods (VAEs)
28.3.7 Factor analysis with GP likelihoods (GP-LVM)
28.4 LFMs with non-Gaussian priors
28.4.1 Non-negative matrix factorization (NMF)
28.4.2 Multinomial PCA
28.5 Topic models
28.5.1 Latent Dirichlet allocation (LDA)
28.5.2 Correlated topic model
28.5.3 Dynamic topic model
28.5.4 LDA-HMM
28.6 Independent components analysis (ICA)
28.6.1 Noiseless ICA model
28.6.2 The need for non-Gaussian priors
28.6.3 Maximum likelihood estimation
28.6.4 Alternatives to MLE
28.6.5 Sparse coding
28.6.6 Nonlinear ICA
29 State-space models
29.1 Introduction
29.2 Hidden Markov models (HMMs)
29.2.1 Conditional independence properties
29.2.2 State transition model
29.2.3 Discrete likelihoods
29.2.4 Gaussian likelihoods
29.2.5 Autoregressive likelihoods
29.2.6 Neural network likelihoods
29.3 HMMs: applications
29.3.1 Time series segmentation
29.3.2 Protein sequence alignment
29.3.3 Spelling correction
29.4 HMMs: parameter learning
29.4.1 The Baum-Welch (EM) algorithm
29.4.2 Parameter estimation using SGD
29.4.3 Parameter estimation using spectral methods
29.4.4 Bayesian HMMs
29.5 HMMs: generalizations
29.5.1 Hidden semi-Markov model (HSMM)
29.5.2 Hierarchical HMMs
29.5.3 Factorial HMMs
29.5.4 Coupled HMMs
29.5.5 Dynamic Bayes nets (DBN)
29.5.6 Changepoint detection
29.6 Linear dynamical systems (LDSs)
29.6.1 Conditional independence properties
29.6.2 Parameterization
29.7 LDS: applications
29.7.1 Object tracking and state estimation
29.7.2 Online Bayesian linear regression (recursive least squares)
29.7.3 Adaptive filtering
29.7.4 Time series forecasting
29.8 LDS: parameter learning
29.8.1 EM for LDS
29.8.2 Subspace identification methods
29.8.3 Ensuring stability of the dynamical system
29.8.4 Bayesian LDS
29.8.5 Online parameter learning for SSMs
29.9 Switching linear dynamical systems (SLDSs)
29.9.1 Parameterization
29.9.2 Posterior inference
29.9.3 Application: multitarget tracking
29.10 Nonlinear SSMs
29.10.1 Example: object tracking and state estimation
29.10.2 Posterior inference
29.11 Non-Gaussian SSMs
29.11.1 Example: spike train modeling
29.11.2 Example: stochastic volatility models
29.11.3 Posterior inference
29.12 Structural time series models
29.12.1 Introduction
29.12.2 Structural building blocks
29.12.3 Model fitting
29.12.4 Forecasting
29.12.5 Examples
29.12.6 Causal impact of a time series intervention
29.12.7 Prophet
29.12.8 Neural forecasting methods
29.13 Deep SSMs
29.13.1 Deep Markov models
29.13.2 Recurrent SSM
29.13.3 Improving multistep predictions
29.13.4 Variational RNNs
30 Graph learning
30.1 Introduction
30.2 Latent variable models for graphs
30.3 Graphical model structure learning
31 Nonparametric Bayesian models
31.1 Introduction
32 Representation learning
32.1 Introduction
32.2 Evaluating and comparing learned representations
32.2.1 Downstream performance
32.2.2 Representational similarity
32.3 Approaches for learning representations
32.3.1 Supervised representation learning and transfer
32.3.2 Generative representation learning
32.3.3 Self-supervised representation learning
32.3.4 Multiview representation learning
32.4 Theory of representation learning
32.4.1 Identifiability
32.4.2 Information maximization
33 Interpretability
33.1 Introduction
33.1.1 The role of interpretability: unknowns and under-specifications
33.1.2 Terminology and framework
33.2 Methods for interpretable machine learning
33.2.1 Inherently interpretable models: the model is its explanation
33.2.2 Semi-inherently interpretable models: example-based methods
33.2.3 Post-hoc or joint training: the explanation gives a partial view of the model
33.2.4 Transparency and visualization
33.3 Properties: the abstraction between context and method
33.3.1 Properties of explanations from interpretable machine learning
33.3.2 Properties of explanations from cognitive science
33.4 Evaluation of interpretable machine learning models
33.4.1 Computational evaluation: does the method have desired properties?
33.4.2 User study-based evaluation: does the method help a user perform a target task?
33.5 Discussion: how to think about interpretable machine learning
VI Action
34 Decision making under uncertainty
34.1 Statistical decision theory
34.1.1 Basics
34.1.2 Frequentist decision theory
34.1.3 Bayesian decision theory
34.1.4 Frequentist optimality of the Bayesian approach
34.1.5 Examples of one-shot decision making problems
34.2 Decision (influence) diagrams
34.2.1 Example: oil wildcatter
34.2.2 Information arcs
34.2.3 Value of information
34.2.4 Computing the optimal policy
34.3 A/B testing
34.3.1 A Bayesian approach
34.3.2 Example
34.4 Contextual bandits
34.4.1 Types of bandit
34.4.2 Applications
34.4.3 Exploration-exploitation tradeoff
34.4.4 The optimal solution
34.4.5 Upper confidence bounds (UCBs)
34.4.6 Thompson sampling
34.4.7 Regret
34.5 Markov decision problems
34.5.1 Basics
34.5.2 Partially observed MDPs
34.5.3 Episodes and returns
34.5.4 Value functions
34.5.5 Optimal value functions and policies
34.6 Planning in an MDP
34.6.1 Value iteration
34.6.2 Policy iteration
34.6.3 Linear programming
34.7 Active learning
34.7.1 Active learning scenarios
34.7.2 Relationship to other forms of sequential decision making
34.7.3 Acquisition strategies
34.7.4 Batch active learning
35 Reinforcement learning
35.1 Introduction
35.1.1 Overview of methods
35.1.2 Value-based methods
35.1.3 Policy search methods
35.1.4 Model-based RL
35.1.5 Exploration-exploitation tradeoff
35.2 Value-based RL
35.2.1 Monte Carlo RL
35.2.2 Temporal difference (TD) learning
35.2.3 TD learning with eligibility traces
35.2.4 SARSA: on-policy TD control
35.2.5 Q-learning: off-policy TD control
35.2.6 Deep Q-network (DQN)
35.3 Policy-based RL
35.3.1 The policy gradient theorem
35.3.2 REINFORCE
35.3.3 Actor-critic methods
35.3.4 Bound optimization methods
35.3.5 Deterministic policy gradient methods
35.3.6 Gradient-free methods
35.4 Model-based RL
35.4.1 Model predictive control (MPC)
35.4.2 Combining model-based and model-free
35.4.3 MBRL using Gaussian processes
35.4.4 MBRL using DNNs
35.4.5 MBRL using latent-variable models
35.4.6 Robustness to model errors
35.5 Off-policy learning
35.5.1 Basic techniques
35.5.2 The curse of horizon
35.5.3 The deadly triad
35.6 Control as inference
35.6.1 Maximum entropy reinforcement learning
35.6.2 Other approaches
35.6.3 Imitation learning
36 Causality
36.1 Introduction
36.2 Causal formalism
36.2.1 Structural causal models
36.2.2 Causal DAGs
36.2.3 Identification
36.2.4 Counterfactuals and the causal hierarchy
36.3 Randomized control trials
36.4 Confounder adjustment
36.4.1 Causal estimand, statistical estimand, and identification
36.4.2 ATE estimation with observed confounders
36.4.3 Uncertainty quantification
36.4.4 Matching
36.4.5 Practical considerations and procedures
36.4.6 Summary and practical advice
36.5 Instrumental variable strategies
36.5.1 Additive unobserved confounding
36.5.2 Instrument monotonicity and local average treatment effect
36.5.3 Two stage least squares
36.6 Difference in differences
36.6.1 Estimation
36.7 Credibility checks
36.7.1 Placebo checks
36.7.2 Sensitivity analysis to unobserved confounding
36.8 The do-calculus
36.8.1 The three rules
36.8.2 Revisiting backdoor adjustment
36.8.3 Frontdoor adjustment
36.9 Further reading
Index
Bibliography