Machine Learning: A Probabilistic Perspective

Author(s): Kevin P. Murphy
Series: Adaptive computation and machine learning
Publisher: MIT Press
Year: 2012

Language: English
Pages: 1098
City: Cambridge, Mass
Tags: Computer science and computer engineering; Artificial intelligence

Cover Page......Page 1
Half Title Page......Page 2
Title Page......Page 4
Copyright Page......Page 5
Dedication......Page 6
Contents......Page 8
Preface......Page 28
1.1 Machine learning: what and why?......Page 32
1.1.1 Types of machine learning......Page 33
1.2.1 Classification......Page 34
1.2.2 Regression......Page 39
1.3 Unsupervised learning......Page 40
1.3.1 Discovering clusters......Page 41
1.3.2 Discovering latent factors......Page 42
1.3.3 Discovering graph structure......Page 44
1.3.4 Matrix completion......Page 45
1.4.2 A simple non-parametric classifier: K-nearest neighbors......Page 47
1.4.3 The curse of dimensionality......Page 49
1.4.5 Linear regression......Page 50
1.4.6 Logistic regression......Page 52
1.4.8 Model selection......Page 53
1.4.9 No free lunch theorem......Page 55
2.1 Introduction......Page 58
2.2.2 Fundamental rules......Page 59
2.2.3 Bayes rule......Page 60
2.2.4 Independence and conditional independence......Page 61
2.2.5 Continuous random variables......Page 63
2.2.7 Mean and variance......Page 64
2.3.1 The binomial and Bernoulli distributions......Page 65
2.3.2 The multinomial and multinoulli distributions......Page 66
2.3.4 The empirical distribution......Page 68
2.4.1 Gaussian (normal) distribution......Page 69
2.4.2 Degenerate pdf......Page 70
2.4.4 The gamma distribution......Page 72
2.4.5 The beta distribution......Page 73
2.4.6 Pareto distribution......Page 74
2.5.1 Covariance and correlation......Page 75
2.5.3 Multivariate Student t distribution......Page 77
2.5.4 Dirichlet distribution......Page 78
2.6.1 Linear transformations......Page 80
2.6.2 General transformations......Page 81
2.6.3 Central limit theorem......Page 82
2.7 Monte Carlo approximation......Page 83
2.7.1 Example: change of variables, the MC way......Page 84
2.7.3 Accuracy of Monte Carlo approximation......Page 85
2.8.1 Entropy......Page 87
2.8.2 KL divergence......Page 88
2.8.3 Mutual information......Page 90
3.2 Bayesian concept learning......Page 96
3.2.2 Prior......Page 98
3.2.3 Posterior......Page 99
3.2.4 Posterior predictive distribution......Page 102
3.3 The beta-binomial model......Page 103
3.3.1 Likelihood......Page 104
3.3.2 Prior......Page 105
3.3.3 Posterior......Page 106
3.3.4 Posterior predictive distribution......Page 108
3.4 The Dirichlet-multinomial model......Page 109
3.4.3 Posterior......Page 110
3.4.4 Posterior predictive......Page 112
3.5 Naive Bayes classifiers......Page 113
3.5.1 Model fitting......Page 114
3.5.2 Using the model for prediction......Page 116
3.5.4 Feature selection using mutual information......Page 117
3.5.5 Classifying documents using bag of words......Page 118
4.1.2 Basics......Page 128
4.1.3 MLE for an MVN......Page 130
4.2 Gaussian discriminant analysis......Page 132
4.2.1 Quadratic discriminant analysis (QDA)......Page 133
4.2.2 Linear discriminant analysis (LDA)......Page 134
4.2.3 Two-class LDA......Page 135
4.2.5 Strategies for preventing overfitting......Page 137
4.2.6 Regularized LDA *......Page 138
4.2.7 Diagonal LDA......Page 139
4.2.8 Nearest shrunken centroids classifier *......Page 140
4.3 Inference in jointly Gaussian distributions......Page 141
4.3.2 Examples......Page 142
4.3.3 Information form......Page 146
4.3.4 Proof of the result *......Page 147
4.4.1 Statement of the result......Page 150
4.4.2 Examples......Page 151
4.4.3 Proof of the result *......Page 155
4.5 Digression: The Wishart distribution *......Page 156
4.5.1 Inverse Wishart distribution......Page 157
4.6 Inferring the parameters of an MVN......Page 158
4.6.1 Posterior distribution of μ......Page 159
4.6.2 Posterior distribution of Σ *......Page 163
4.6.4 Sensor fusion with unknown precisions *......Page 169
5.2.1 MAP estimation......Page 180
5.2.2 Credible intervals......Page 183
5.2.3 Inference for a difference in proportions......Page 185
5.3 Bayesian model selection......Page 186
5.3.1 Bayesian Occams razor......Page 187
5.3.2 Computing the marginal likelihood (evidence)......Page 189
5.3.3 Bayes factors......Page 194
5.3.4 Jeffreys-Lindley paradox *......Page 195
5.4.1 Uninformative priors......Page 196
5.4.2 Jeffreys priors *......Page 197
5.4.4 Mixtures of conjugate priors......Page 199
5.5.1 Example: modeling related cancer rates......Page 202
5.6 Empirical Bayes......Page 203
5.6.2 Example: Gaussian-Gaussian model......Page 204
5.7 Bayesian decision theory......Page 207
5.7.1 Bayes estimators for common loss functions......Page 208
5.7.2 The false positive vs false negative tradeoff......Page 211
5.7.3 Other topics *......Page 215
6.2 Sampling distribution of an estimator......Page 222
6.2.1 Bootstrap......Page 223
6.2.2 Large sample theory for the MLE *......Page 224
6.3 Frequentist decision theory......Page 225
6.3.1 Bayes risk......Page 226
6.3.2 Minimax risk......Page 227
6.3.3 Admissible estimators......Page 228
6.4.2 Unbiased estimators......Page 231
6.4.3 Minimum variance estimators......Page 232
6.4.4 The bias-variance tradeoff......Page 233
6.5 Empirical risk minimization......Page 235
6.5.1 Regularized risk minimization......Page 236
6.5.3 Estimating the risk using cross validation......Page 237
6.5.4 Upper bounding the risk using statistical learning theory *......Page 240
6.5.5 Surrogate loss functions......Page 241
6.6 Pathologies of frequentist statistics *......Page 242
6.6.1 Counter-intuitive behavior of confidence intervals......Page 243
6.6.2 p-values considered harmful......Page 244
6.6.3 The likelihood principle......Page 245
6.6.4 Why isnt everyone a Bayesian?......Page 246
7.3 Maximum likelihood estimation (least squares)......Page 248
7.3.1 Derivation of the MLE......Page 250
7.3.2 Geometric interpretation......Page 251
7.3.3 Convexity......Page 252
7.4 Robust linear regression *......Page 254
7.5.1 Basic idea......Page 256
7.5.2 Numerically stable computation *......Page 258
7.5.3 Connection with PCA *......Page 259
7.5.4 Regularization effects of big data......Page 261
7.6 Bayesian linear regression......Page 262
7.6.1 Computing the posterior......Page 263
7.6.2 Computing the posterior predictive......Page 264
7.6.4 EB for linear regression (evidence procedure)......Page 265
8.3 Model fitting......Page 276
8.3.1 MLE......Page 277
8.3.2 Steepest descent......Page 278
8.3.3 Newtons method......Page 280
8.3.4 Iteratively reweighted least squares (IRLS)......Page 281
8.3.5 Quasi-Newton (variable metric) methods......Page 282
8.3.7 Multi-class logistic regression......Page 283
8.4 Bayesian logistic regression......Page 285
8.4.2 Derivation of the BIC......Page 286
8.4.4 Approximating the posterior predictive......Page 287
8.4.5 Residual analysis (outlier detection) *......Page 291
8.5 Online learning and stochastic optimization......Page 292
8.5.2 Stochastic optimization and risk minimization......Page 293
8.5.3 The LMS algorithm......Page 295
8.5.4 The perceptron algorithm......Page 296
8.5.5 A Bayesian view......Page 297
8.6 Generative vs discriminative classifiers......Page 298
8.6.1 Pros and cons of each approach......Page 299
8.6.2 Dealing with missing data......Page 300
8.6.3 Fishers linear discriminant analysis (FLDA) *......Page 302
9.2 The exponential family......Page 312
9.2.2 Examples......Page 313
9.2.3 Log partition function......Page 315
9.2.4 MLE for the exponential family......Page 317
9.2.5 Bayes for the exponential family *......Page 318
9.2.6 Maximum entropy derivation of the exponential family *......Page 320
9.3.1 Basics......Page 321
9.3.2 ML and MAP estimation......Page 323
9.4 Probit regression......Page 324
9.4.2 Latent variable interpretation......Page 325
9.4.4 Multinomial probit models *......Page 326
9.5.2 Application to personalized email spam filtering......Page 327
9.5.4 Other kinds of prior......Page 328
9.6.1 Example: semi-parametric GLMMs for medical data......Page 329
9.7 Learning to rank *......Page 331
9.7.2 The pairwise approach......Page 332
9.7.3 The listwise approach......Page 333
9.7.4 Loss functions for ranking......Page 334
10.1.1 Chain rule......Page 338
10.1.3 Graphical models......Page 339
10.1.4 Graph terminology......Page 340
10.1.5 Directed graphical models......Page 341
10.2.1 Naive Bayes classifiers......Page 342
10.2.2 Markov and hidden Markov models......Page 343
10.2.3 Medical diagnosis......Page 344
10.2.4 Genetic linkage analysis *......Page 346
10.2.5 Directed Gaussian graphical models *......Page 349
10.3 Inference......Page 350
10.4.1 Plate notation......Page 351
10.4.2 Learning from complete data......Page 353
10.4.3 Learning with missing and/or latent variables......Page 354
10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties)......Page 355
10.5.3 Markov blanket and full conditionals......Page 358
10.6 Influence (decision) diagrams *......Page 359
11.2 Mixture models......Page 368
11.2.1 Mixtures of Gaussians......Page 370
11.2.3 Using mixture models for clustering......Page 371
11.2.4 Mixtures of experts......Page 373
11.3 Parameter estimation for mixture models......Page 376
11.3.1 Unidentifiability......Page 377
11.3.2 Computing a MAP estimate is non-convex......Page 378
11.4 The EM algorithm......Page 379
11.4.1 Basic idea......Page 380
11.4.2 EM for GMMs......Page 381
11.4.3 EM for mixture of experts......Page 388
11.4.4 EM for DGMs with hidden variables......Page 389
11.4.5 EM for the Student distribution *......Page 390
11.4.6 EM for probit regression *......Page 393
11.4.7 Theoretical basis for EM *......Page 394
11.4.8 Online EM......Page 396
11.4.9 Other EM variants *......Page 398
11.5.2 Model selection for non-probabilistic methods......Page 401
11.6 Fitting models with missing data......Page 403
11.6.1 EM for the MLE of an MVN with missing data......Page 404
12.1.1 FA is a low rank parameterization of an MVN......Page 412
12.1.2 Inference of the latent factors......Page 413
12.1.3 Unidentifiability......Page 414
12.1.4 Mixtures of factor analysers......Page 416
12.1.5 EM for factor analysis models......Page 417
12.2.1 Classical PCA: statement of the theorem......Page 418
12.2.2 Proof *......Page 420
12.2.3 Singular value decomposition (SVD)......Page 423
12.2.4 Probabilistic PCA......Page 426
12.2.5 EM algorithm for PCA......Page 427
12.3.1 Model selection for FA/PPCA......Page 429
12.3.2 Model selection for PCA......Page 430
12.4 PCA for categorical data......Page 433
12.5 PCA for paired and multi-view data......Page 435
12.5.1 Supervised PCA (latent factor regression)......Page 436
12.5.2 Partial least squares......Page 437
12.6 Independent Component Analysis (ICA)......Page 438
12.6.1 Maximum likelihood estimation......Page 441
12.6.2 The FastICA algorithm......Page 442
12.6.3 Using EM......Page 445
12.6.4 Other estimation principles *......Page 446
13.1 Introduction......Page 452
13.2 Bayesian variable selection......Page 453
13.2.1 The spike and slab model......Page 455
13.2.2 From the Bernoulli-Gaussian model to ℓ0 regularization......Page 456
13.2.3 Algorithms......Page 457
13.3 ℓ1 regularization: basics......Page 460
13.3.1 Why does ℓ1 regularization yield sparse solutions?......Page 461
13.3.2 Optimality conditions for lasso......Page 462
13.3.3 Comparison of least squares, lasso, ridge and subset selection......Page 466
13.3.4 Regularization path......Page 467
13.3.5 Model selection......Page 470
13.3.6 Bayesian inference for linear models with Laplace priors......Page 471
13.4.2 LARS and other homotopy methods......Page 472
13.4.3 Proximal and gradient projection methods......Page 473
13.4.4 EM for lasso......Page 478
13.5.1 Group Lasso......Page 480
13.5.2 Fused lasso......Page 485
13.5.3 Elastic net (ridge and lasso combined)......Page 486
13.6 Non-convex regularizers......Page 488
13.6.2 Hierarchical adaptive lasso......Page 489
13.6.3 Other hierarchical priors......Page 493
13.7.1 ARD for linear regression......Page 494
13.7.3 Connection to MAP estimation......Page 496
13.7.4 Algorithms for ARD *......Page 497
13.8 Sparse coding *......Page 499
13.8.1 Learning a sparse coding dictionary......Page 500
13.8.2 Results of dictionary learning from image patches......Page 501
13.8.4 Image inpainting and denoising......Page 503
14.2 Kernel functions......Page 510
14.2.2 Kernels for comparing documents......Page 511
14.2.3 Mercer (positive definite) kernels......Page 512
14.2.5 Matern kernels......Page 513
14.2.6 String kernels......Page 514
14.2.7 Pyramid match kernels......Page 515
14.2.8 Kernels derived from probabilistic generative models......Page 516
14.3.1 Kernel machines......Page 517
14.3.2 L1VMs, RVMs, and other sparse vector machines......Page 518
14.4 The kernel trick......Page 519
14.4.2 Kernelized K-medoids clustering......Page 520
14.4.3 Kernelized ridge regression......Page 523
14.4.4 Kernel PCA......Page 524
14.5 Support vector machines (SVMs)......Page 527
14.5.1 SVMs for regression......Page 528
14.5.2 SVMs for classification......Page 529
14.5.4 Summary of key points......Page 535
14.6 Comparison of discriminative kernel methods......Page 536
14.7.1 Smoothing kernels......Page 538
14.7.2 Kernel density estimation (KDE)......Page 539
14.7.3 From KDE to KNN......Page 540
14.7.4 Kernel regression......Page 541
14.7.5 Locally weighted regression......Page 543
15.1 Introduction......Page 546
15.2 GPs for regression......Page 547
15.2.1 Predictions using noise-free observations......Page 548
15.2.2 Predictions using noisy observations......Page 549
15.2.3 Effect of the kernel parameters......Page 550
15.2.4 Estimating the kernel parameters......Page 552
15.2.6 Semi-parametric GPs *......Page 555
15.3.1 Binary classification......Page 556
15.3.2 Multi-class classification......Page 559
15.3.3 GPs for Poisson regression......Page 562
15.4.1 Linear models compared to GPs......Page 563
15.4.2 Linear smoothers compared to GPs......Page 564
15.4.4 L1VM and RVMs compared to GPs......Page 565
15.4.5 Neural networks compared to GPs......Page 566
15.4.6 Smoothing splines compared to GPs *......Page 567
15.4.7 RKHS methods compared to GPs *......Page 569
15.5 GP latent variable model......Page 571
15.6 Approximation methods for large datasets......Page 573
16.1 Introduction......Page 574
16.2.1 Basics......Page 575
16.2.2 Growing a tree......Page 576
16.2.3 Pruning a tree......Page 580
16.2.5 Random forests......Page 581
16.2.6 CART compared to hierarchical mixture of experts *......Page 582
16.3.1 Backfitting......Page 583
16.3.3 Multivariate adaptive regression splines (MARS)......Page 584
16.4 Boosting......Page 585
16.4.1 Forward stagewise additive modeling......Page 586
16.4.2 L2boosting......Page 588
16.4.3 AdaBoost......Page 589
16.4.4 LogitBoost......Page 590
16.4.5 Boosting as functional gradient descent......Page 591
16.4.6 Sparse boosting......Page 592
16.4.8 Why does boosting work so well?......Page 593
16.5 Feedforward neural networks (multilayer perceptrons)......Page 594
16.5.1 Convolutional neural networks......Page 595
16.5.3 A brief history of the field......Page 599
16.5.4 The backpropagation algorithm......Page 600
16.5.6 Regularization......Page 603
16.5.7 Bayesian inference......Page 607
16.6.1 Stacking......Page 611
16.6.3 Ensemble learning is not equivalent to Bayes model averaging......Page 612
16.7.1 Low-dimensional features......Page 613
16.7.2 High-dimensional features......Page 614
16.8 Interpreting black-box models......Page 616
17.2.1 Transition matrix......Page 620
17.2.2 Application: Language modeling......Page 622
17.2.3 Stationary distribution of a Markov chain *......Page 627
17.2.4 Application: Googles PageRank algorithm for web page ranking......Page 631
17.3 Hidden Markov models......Page 634
17.3.1 Applications of HMMs......Page 635
17.4.1 Types of inference problems for temporal models......Page 637
17.4.2 The forwards algorithm......Page 640
17.4.3 The forwards-backwards algorithm......Page 641
17.4.4 The Viterbi algorithm......Page 643
17.4.5 Forwards filtering, backwards sampling......Page 647
17.5.1 Training with fully observed data......Page 648
17.5.2 EM for HMMs (the Baum-Welch algorithm)......Page 649
17.5.4 Discriminative training......Page 651
17.6 Generalizations of HMMs......Page 652
17.6.1 Variable duration (semi-Markov) HMMs......Page 653
17.6.2 Hierarchical HMMs......Page 655
17.6.3 Input-output HMMs......Page 656
17.6.4 Auto-regressive and buried HMMs......Page 657
17.6.5 Factorial HMM......Page 658
17.6.7 Dynamic Bayesian networks (DBNs)......Page 659
18.1 Introduction......Page 662
18.2.1 SSMs for object tracking......Page 663
18.2.2 Robotic SLAM......Page 664
18.2.3 Online parameter learning using recursive least squares......Page 667
18.2.4 SSM for time series forecasting *......Page 668
18.3.1 The Kalman filtering algorithm......Page 671
18.3.2 The Kalman smoothing algorithm......Page 674
18.4.1 Identifiability and numerical stability......Page 677
18.5 Approximate online inference for non-linear, non-Gaussian SSMs......Page 678
18.5.1 Extended Kalman filter (EKF)......Page 679
18.5.2 Unscented Kalman filter (UKF)......Page 681
18.5.3 Assumed density filtering (ADF)......Page 683
18.6 Hybrid discrete/continuous SSMs......Page 686
18.6.1 Inference......Page 687
18.6.2 Application: data association and multi-target tracking......Page 689
18.6.3 Application: fault diagnosis......Page 690
18.6.4 Application: econometric forecasting......Page 691
19.2.1 Key properties......Page 692
19.2.2 An undirected alternative to d-separation......Page 694
19.2.3 Comparing directed and undirected graphical models......Page 695
19.3.1 The Hammersley-Clifford theorem......Page 696
19.3.2 Representing potential functions......Page 698
19.4.1 Ising model......Page 699
19.4.2 Hopfield networks......Page 700
19.4.3 Potts model......Page 702
19.4.4 Gaussian MRFs......Page 703
19.4.5 Markov logic networks *......Page 705
19.5.1 Training maxent models using gradient methods......Page 707
19.5.2 Training partially observed maxent models......Page 708
19.5.4 Pseudo likelihood......Page 709
19.5.5 Stochastic maximum likelihood......Page 710
19.5.6 Feature induction for maxent models *......Page 711
19.5.7 Iterative proportional fitting (IPF) *......Page 712
19.6.1 Chain-structured CRFs, MEMMs and the label-bias problem......Page 715
19.6.2 Applications of CRFs......Page 717
19.6.3 CRF training......Page 723
19.7.1 SSVMs: a probabilistic view......Page 724
19.7.2 SSVMs: a non-probabilistic view......Page 726
19.7.3 Cutting plane methods for fitting SSVMs......Page 729
19.7.4 Online algorithms for fitting SSVMs......Page 731
19.7.5 Latent structural SVMs......Page 732
20.2.1 Serial protocol......Page 738
20.2.2 Parallel protocol......Page 740
20.2.3 Gaussian BP *......Page 741
20.2.4 Other BP variants *......Page 743
20.3 The variable elimination algorithm......Page 745
20.3.2 Computational complexity of VE......Page 748
20.4.1 Creating a junction tree......Page 751
20.4.2 Message passing on a junction tree......Page 753
20.4.3 Computational complexity of JTA......Page 756
20.5 Computational intractability of exact inference in the worst case......Page 757
20.5.1 Approximate inference......Page 758
21.1 Introduction......Page 762
21.2 Variational inference......Page 763
21.2.2 Forward or reverse KL? *......Page 764
21.3 The mean field method......Page 766
21.3.1 Derivation of the mean field update equations......Page 767
21.3.2 Example: mean field for the Ising model......Page 768
21.4 Structured mean field *......Page 770
21.4.1 Example: factorial HMM......Page 771
21.5.1 Example: VB for a univariate Gaussian......Page 773
21.5.2 Example: VB for linear regression......Page 777
21.6 Variational Bayes EM......Page 780
21.6.1 Example: VBEM for mixtures of Gaussians *......Page 781
21.8.1 Motivating applications......Page 787
21.8.2 Bohnings quadratic bound to the log-sum-exp function......Page 789
21.8.3 Bounds for the sigmoid function......Page 791
21.8.4 Other bounds and approximations to the log-sum-exp function......Page 793
21.8.5 Variational inference based on upper bounds......Page 794
22.2.1 A brief history......Page 798
22.2.2 LBP on pairwise models......Page 799
22.2.3 LBP on a factor graph......Page 800
22.2.4 Convergence......Page 802
22.2.5 Accuracy of LBP......Page 805
22.2.6 Other speedup tricks for LBP *......Page 806
22.3.1 UGMs represented in exponential family form......Page 807
22.3.2 The marginal polytope......Page 808
22.3.3 Exact inference as a variational optimization problem......Page 809
22.3.5 LBP as a variational optimization problem......Page 810
22.4.1 Generalized belief propagation......Page 814
22.4.2 Convex belief propagation......Page 816
22.5 Expectation propagation......Page 818
22.5.1 EP as a variational inference problem......Page 819
22.5.2 Optimizing the EP objective using moment matching......Page 820
22.5.3 EP for the clutter problem......Page 822
22.5.4 LBP is a special case of EP......Page 823
22.5.5 Ranking players using TrueSkill......Page 824
22.6.1 Linear programming relaxation......Page 830
22.6.2 Max-product belief propagation......Page 831
22.6.3 Graphcuts......Page 832
22.6.4 Experimental comparison of graphcuts and BP......Page 835
22.6.5 Dual decomposition......Page 837
23.2.1 Using the cdf......Page 846
23.3.1 Basic idea......Page 848
23.3.2 Example......Page 849
23.3.4 Adaptive rejection sampling......Page 850
23.4.1 Basic idea......Page 851
23.4.2 Handling unnormalized distributions......Page 852
23.4.4 Sampling importance resampling (SIR)......Page 853
23.5 Particle filtering......Page 854
23.5.1 Sequential importance sampling......Page 855
23.5.3 The resampling step......Page 856
23.5.4 The proposal distribution......Page 858
23.5.6 Application: visual object tracking......Page 859
23.6.1 RBPF for switching LG-SSMs......Page 862
23.6.2 Application: tracking a maneuvering target......Page 863
23.6.3 Application: Fast SLAM......Page 865
24.1 Introduction......Page 868
24.2.2 Example: Gibbs sampling for the Ising model......Page 869
24.2.3 Example: Gibbs sampling for inferring the parameters of a GMM......Page 871
24.2.4 Collapsed Gibbs sampling *......Page 872
24.2.5 Gibbs sampling for hierarchical GLMs......Page 875
24.2.6 BUGS and JAGS......Page 877
24.2.8 Blocking Gibbs sampling......Page 878
24.3.1 Basic idea......Page 879
24.3.2 Gibbs sampling is a special case of MH......Page 880
24.3.3 Proposal distributions......Page 881
24.3.4 Adaptive MCMC......Page 884
24.3.6 Why MH works *......Page 885
24.3.7 Reversible jump (trans-dimensional) MCMC *......Page 886
24.4.1 The burn-in phase......Page 887
24.4.2 Mixing rates of Markov chains *......Page 888
24.4.3 Practical convergence diagnostic......Page 889
24.4.4 Accuracy of MCMC......Page 891
24.4.5 How many chains?......Page 893
24.5.1 Auxiliary variable sampling for logistic regression......Page 894
24.5.2 Slice sampling......Page 895
24.5.3 Swendsen-Wang......Page 897
24.6 Annealing methods......Page 899
24.6.1 Simulated annealing......Page 900
24.6.3 Parallel tempering......Page 902
24.7.2 Harmonic mean estimate......Page 903
24.7.3 Annealed importance sampling......Page 904
25.1.1 Measuring (dis)similarity......Page 906
25.1.2 Evaluating the output of clustering methods *......Page 907
25.2.1 From finite to infinite mixture models......Page 910
25.2.2 The Dirichlet process......Page 913
25.2.3 Applying Dirichlet processes to mixture modeling......Page 916
25.2.4 Fitting a DP mixture model......Page 917
25.3 Affinity propagation......Page 918
25.4 Spectral clustering......Page 921
25.4.1 Graph Laplacian......Page 922
25.4.2 Normalized graph Laplacian......Page 923
25.5 Hierarchical clustering......Page 924
25.5.1 Agglomerative clustering......Page 926
25.5.2 Divisive clustering......Page 929
25.5.4 Bayesian hierarchical clustering......Page 930
25.6 Clustering datapoints and features......Page 932
25.6.2 Multi-view clustering......Page 934
26.1 Introduction......Page 938
26.2.1 Relevance networks......Page 939
26.2.2 Dependency networks......Page 940
26.3 Learning tree structures......Page 941
26.3.1 Directed or undirected tree?......Page 942
26.3.3 Finding the MAP forest......Page 943
26.4.1 Markov equivalence......Page 945
26.4.2 Exact structural inference......Page 947
26.4.3 Scaling up to larger graphs......Page 951
26.5.1 Approximating the marginal likelihood when we have missing data......Page 953
26.5.2 Structural EM......Page 956
26.5.3 Discovering hidden variables......Page 957
26.5.4 Case study: Googles Rephil......Page 959
26.5.5 Structural equation models *......Page 960
26.6.1 Causal interpretation of DAGs......Page 962
26.6.2 Using causal DAGs to resolve Simpson's paradox......Page 964
26.6.3 Learning causal DAG structures......Page 966
26.7.1 MLE for a GGM......Page 969
26.7.2 Graphical lasso......Page 970
26.7.3 Bayesian inference for GGM structure *......Page 972
26.8.1 Graphical lasso for MRFs/CRFs......Page 973
26.8.2 Thin junction trees......Page 975
27.1 Introduction......Page 976
27.2.1 Mixture models......Page 977
27.2.2 Exponential family PCA......Page 978
27.2.3 LDA and mPCA......Page 979
27.2.4 GaP model and non-negative matrix factorization......Page 980
27.3.1 Basics......Page 981
27.3.3 Quantitatively evaluating LDA as a language model......Page 984
27.3.4 Fitting using (collapsed) Gibbs sampling......Page 986
27.3.5 Example......Page 987
27.3.6 Fitting using batch variational inference......Page 988
27.3.7 Fitting using online variational inference......Page 990
27.3.8 Determining the number of topics......Page 991
27.4.1 Correlated topic model......Page 992
27.4.2 Dynamic topic model......Page 993
27.4.3 LDA-HMM......Page 994
27.4.4 Supervised LDA......Page 998
27.5 LVMs for graph-structured data......Page 1001
27.5.1 Stochastic block model......Page 1002
27.5.2 Mixed membership stochastic block model......Page 1004
27.5.3 Relational topic model......Page 1005
27.6 LVMs for relational data......Page 1006
27.6.1 Infinite relational model......Page 1007
27.6.2 Probabilistic matrix factorization for collaborative filtering......Page 1010
27.7 Restricted Boltzmann machines (RBMs)......Page 1014
27.7.1 Varieties of RBMs......Page 1016
27.7.2 Learning RBMs......Page 1018
27.7.3 Applications of RBMs......Page 1022
28.2 Deep generative models......Page 1026
28.2.2 Deep Boltzmann machines......Page 1027
28.2.3 Deep belief networks......Page 1028
28.2.4 Greedy layer-wise learning of DBNs......Page 1029
28.3.1 Deep multi-layer perceptrons......Page 1030
28.3.2 Deep auto-encoders......Page 1031
28.4.1 Handwritten digit classification using DBNs......Page 1032
28.4.2 Data visualization and feature discovery using deep auto-encoders......Page 1033
28.4.3 Information retrieval using deep auto-encoders (semantic hashing)......Page 1034
28.4.4 Learning audio features using 1d convolutional DBNs......Page 1035
28.5 Discussion......Page 1036
Notation......Page 1040
Bibliography......Page 1046
Index to Code......Page 1078
Index to Keywords......Page 1081