Probabilistic Graphical Models: Principles and Techniques

A general framework for constructing and using probabilistic models of complex systems that would enable a computer to use available information for making decisions.

Most tasks require a person or an automated system to reason -- to reach conclusions based on available information. The framework of probabilistic graphical models, presented in this book, provides a general approach for this task. The approach is model-based, allowing interpretable models to be constructed and then manipulated by reasoning algorithms. These models can also be learned automatically from data, allowing the approach to be used in cases where manually constructing a model is difficult or even impossible. Because uncertainty is an inescapable aspect of most real-world applications, the book focuses on probabilistic models, which make the uncertainty explicit and provide models that are more faithful to reality.

Probabilistic Graphical Models discusses a variety of models, spanning Bayesian networks, undirected Markov networks, discrete and continuous models, and extensions to deal with dynamical systems and relational data. For each class of models, the text describes the three fundamental cornerstones: representation, inference, and learning, presenting both basic concepts and advanced techniques. Finally, the book considers the use of the proposed framework for causal reasoning and decision making under uncertainty. The main text in each chapter provides the detailed technical development of the key ideas. Most chapters also include boxes with additional material: skill boxes, which describe techniques; case study boxes, which discuss empirical cases related to the approach described in the text, including applications in computer vision, robotics, natural language understanding, and computational biology; and concept boxes, which present significant concepts drawn from the material in the chapter. Instructors (and readers) can group chapters in various combinations, from core topics to more technically advanced material, to suit their particular needs.
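
To make the representation/inference/learning cornerstones concrete, the sketch below encodes a tiny version of the book's running "student" example (chapter 3) as a discrete Bayesian network and answers a conditional probability query by variable elimination (chapter 9). This is a minimal sketch, not material from the book: it assumes the open-source pgmpy library, and the CPD numbers are illustrative only.

    # A minimal sketch, assuming pgmpy is installed (pip install pgmpy);
    # recent releases may name the model class DiscreteBayesianNetwork.
    from pgmpy.models import BayesianNetwork
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.inference import VariableElimination

    # Representation: a directed graph over Difficulty (D), Intelligence (I), Grade (G).
    model = BayesianNetwork([("D", "G"), ("I", "G")])

    # Local conditional probability distributions (illustrative numbers).
    cpd_d = TabularCPD("D", 2, [[0.6], [0.4]])              # P(D)
    cpd_i = TabularCPD("I", 2, [[0.7], [0.3]])              # P(I)
    cpd_g = TabularCPD("G", 3,                              # P(G | I, D), one column
                       [[0.30, 0.05, 0.90, 0.50],           # per (I, D) combination
                        [0.40, 0.25, 0.08, 0.30],
                        [0.30, 0.70, 0.02, 0.20]],
                       evidence=["I", "D"], evidence_card=[2, 2])
    model.add_cpds(cpd_d, cpd_i, cpd_g)
    assert model.check_model()  # columns sum to 1 and CPDs match the graph

    # Inference: exact variable elimination for P(G | I = 1).
    print(VariableElimination(model).query(["G"], evidence={"I": 1}))

Learning, the third cornerstone, would replace the hand-specified CPDs above with parameters (and possibly structure) estimated from data, the subject of part III.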

Author(s): Daphne Koller, Nir Friedman
Series: Adaptive Computation and Machine Learning series
Publisher: The MIT Press
Year: 2009

Language: English
Pages: 1270
Tags: Intelligence & Semantics;AI & Machine Learning;Computer Science;Computers & Technology;Machine Theory;AI & Machine Learning;Computer Science;Computers & Technology;Natural Language Processing;AI & Machine Learning;Computer Science;Computers & Technology;Computer Vision & Pattern Recognition;AI & Machine Learning;Computer Science;Computers & Technology;Robotics;Computer Science;Computers & Technology;Robotics & Automation;Industrial, Manufacturing & Operational Systems;Engineering;Engineering & T

Contents ... 10
Acknowledgments ... 24
List of Figures ... 26
List of Algorithms ... 32
List of Boxes ... 34
1 Introduction ... 38
1.1 Motivation ... 38
1.2 Structured Probabilistic Models ... 39
1.2.1 Probabilistic Graphical Models ... 40
1.2.2 Representation, Inference, Learning ... 42
1.3 Overview and Roadmap ... 43
1.3.1 Overview of Chapters ... 43
1.3.2 Reader’s Guide ... 46
1.3.3 Connection to Other Disciplines ... 48
1.3.3.1 What Have We Gained? ... 49
1.4 Historical Notes ... 49
2 Foundations ... 52
2.1 Probability Theory ... 52
2.1.1 Probability Distributions ... 52
2.1.1.1 Event Spaces ... 52
2.1.1.2 Probability Distributions ... 53
2.1.1.3 Interpretations of Probability ... 53
2.1.2 Basic Concepts in Probability ... 55
2.1.2.1 Conditional Probability ... 55
2.1.2.2 Chain Rule and Bayes' Rule ... 55
2.1.3 Random Variables and Joint Distributions ... 56
2.1.3.1 Motivation ... 56
2.1.3.2 What Is a Random Variable? ... 57
2.1.3.3 Marginal and Joint Distributions ... 58
2.1.3.4 Conditional Probability ... 59
2.1.4 Independence and Conditional Independence ... 60
2.1.4.1 Independence ... 60
2.1.4.2 Conditional Independence ... 60
2.1.4.3 Independence of Random Variables ... 61
2.1.5 Querying a Distribution ... 62
2.1.5.1 Probability Queries ... 63
2.1.5.2 MAP Queries ... 63
2.1.5.3 Marginal MAP Queries ... 64
2.1.6 Continuous Spaces ... 64
2.1.6.1 Probability Density Functions ... 64
2.1.6.2 Joint Density Functions ... 66
2.1.6.3 Conditional Density Functions ... 67
2.1.7 Expectation and Variance ... 68
2.1.7.1 Expectation ... 68
2.1.7.2 Variance ... 70
2.2 Graphs ... 71
2.2.1 Nodes and Edges ... 71
2.2.2 Subgraphs ... 72
2.2.3 Paths and Trails ... 73
2.2.4 Cycles and Loops ... 73
2.3 Relevant Literature ... 76
2.4 Exercises ... 76
Part I - Representation ... 80
3 The Bayesian Network Representation ... 82
3.1 Exploiting Independence Properties ... 82
3.1.1 Independent Random Variables ... 82
3.1.2 The Conditional Parameterization ... 83
3.1.3 The Naive Bayes Model ... 85
3.1.3.1 The Student Example ... 85
3.1.3.2 The General Model ... 86
3.2 Bayesian Networks ... 88
3.2.1 The Student Example Revisited ... 89
3.2.1.1 The Model ... 89
3.2.1.2 Reasoning Patterns ... 91
3.2.2 Basic Independencies in Bayesian Networks ... 93
3.2.2.1 Independencies in the Student Example ... 93
3.2.2.2 Bayesian Network Semantics ... 94
3.2.3 Graphs and Distributions ... 97
3.2.3.1 I-Maps ... 97
3.2.3.2 I-Map to Factorization ... 98
3.2.3.3 Factorization to I-Map ... 100
3.3 Independencies in Graphs ... 105
3.3.1 D-separation ... 106
3.3.2 Soundness and Completeness ... 109
3.3.3 An Algorithm for d-Separation ... 111
3.3.4 I-Equivalence ... 113
3.4 From Distributions to Graphs ... 115
3.4.1 Minimal I-Maps ... 116
3.4.2 Perfect Maps ... 118
3.4.3 Finding Perfect Maps ★ ... 120
3.4.3.1 Identifying the Undirected Skeleton ... 121
3.4.3.2 Identifying Immoralities ... 123
3.4.3.3 Representing Equivalence Classes ... 124
3.5 Summary ... 129
3.6 Relevant Literature ... 130
3.7 Exercises ... 133
4 Undirected Graphical Models ... 140
4.1 The Misconception Example ... 140
4.2 Parameterization ... 143
4.2.1 Factors ... 143
4.2.2 Gibbs Distributions and Markov Networks ... 145
4.2.3 Reduced Markov Networks ... 147
4.3 Markov Network Independencies ... 151
4.3.1 Basic Independencies ... 151
4.3.1.1 Soundness ... 152
4.3.1.2 Completeness ... 153
4.3.2 Independencies Revisited ... 154
4.3.2.1 Local Markov Assumptions ... 155
4.3.2.2 Relationships between Markov Properties ... 155
4.3.3 From Distributions to Graphs ... 157
4.4 Parameterization Revisited ... 159
4.4.1 Finer-Grained Parameterization ... 160
4.4.1.1 Factor Graphs ... 160
4.4.1.2 Log-Linear Models ... 161
4.4.1.3 Discussion ... 162
4.4.2 Overparameterization ... 165
4.4.2.1 Canonical Parameterization ... 166
4.4.2.2 Eliminating Redundancy ... 169
4.5 Bayesian Networks and Markov Networks ... 171
4.5.1 From Bayesian Networks to Markov Networks ... 171
4.5.1.1 Soundness of d-Separation ... 173
4.5.2 From Markov Networks to Bayesian Networks ... 174
4.5.3 Chordal Graphs ... 176
4.6 Partially Directed Models ... 179
4.6.1 Conditional Random Fields ... 179
4.6.1.1 CRF Representation and Semantics ... 180
4.6.1.2 Directed and Undirected Dependencies ... 181
4.6.2 Chain Graph Models ★ ... 185
4.6.2.1 Factorization ... 185
4.6.2.2 Independencies in Chain Graphs ... 186
4.7 Summary and Discussion ... 188
4.8 Relevant Literature ... 189
4.9 Exercises ... 190
5 Local Probabilistic Models ... 194
5.1 Tabular CPDs ... 194
5.2 Deterministic CPDs ... 195
5.2.1 Representation ... 195
5.2.2 Independencies ... 196
5.3 Context-Specific CPDs ... 199
5.3.1 Representation ... 199
5.3.1.1 Tree-CPDs ... 200
5.3.1.2 Rule CPDs ... 203
5.3.1.3 Other Representations ... 207
5.3.2 Independencies ... 208
5.4 Independence of Causal Influence ... 212
5.4.1 The Noisy-Or Model ... 212
5.4.2 Generalized Linear Models ... 215
5.4.2.1 Binary-Valued Variables ... 215
5.4.2.2 Multivalued Variables ... 218
5.4.3 The General Formulation ... 219
5.4.4 Independencies ... 221
5.5 Continuous Variables ... 222
5.5.1 Hybrid Models ... 226
5.6 Conditional Bayesian Networks ... 228
5.7 Summary ... 230
5.8 Relevant Literature ... 231
5.9 Exercises ... 232
6 Template-Based Representations ... 236
6.1 Introduction ... 236
6.2 Temporal Models ... 237
6.2.1 Basic Assumptions ... 238
6.2.2 Dynamic Bayesian Networks ... 239
6.2.3 State-Observation Models ... 244
6.2.3.1 Hidden Markov Models ... 245
6.2.3.2 Linear Dynamical Systems ... 248
6.3 Template Variables and Template Factors ... 249
6.4 Directed Probabilistic Models for Object-Relational Domains ... 253
6.4.1 Plate Models ... 253
6.4.1.1 Examples ... 253
6.4.1.2 Plate Models: Formal Framework ... 257
6.4.2 Probabilistic Relational Models ... 259
6.4.2.1 Contingent Dependencies ... 259
6.4.2.2 CPDs in the Ground Network ... 261
6.4.2.3 Checking Acyclicity ... 263
6.5 Undirected Representation ... 265
6.6 Structural Uncertainty ... 269
6.6.1 Relational Uncertainty ... 270
6.6.2 Object Uncertainty ... 272
6.7 Summary ... 277
6.8 Relevant Literature ... 279
6.9 Exercises ... 280
7 Gaussian Network Models ... 284
7.1 Multivariate Gaussians ... 284
7.1.1 Basic Parameterization ... 284
7.1.2 Operations on Gaussians ... 286
7.1.3 Independencies in Gaussians ... 287
7.2 Gaussian Bayesian Networks ... 288
7.3 Gaussian Markov Random Fields ... 291
7.4 Summary ... 294
7.5 Relevant Literature ... 295
7.6 Exercises ... 295
8 The Exponential Family ... 298
8.1 Introduction ... 298
8.2 Exponential Families ... 298
8.2.1 Linear Exponential Families ... 300
8.3 Factored Exponential Families ... 303
8.3.1 Product Distributions ... 303
8.3.2 Bayesian Networks ... 304
8.4 Entropy and Relative Entropy ... 306
8.4.1 Entropy ... 306
8.4.1.1 Entropy of an Exponential Model ... 307
8.4.1.2 Entropy of Bayesian Networks ... 308
8.4.2 Relative Entropy ... 309
8.5 Projections ... 310
8.5.1 Comparison ... 311
8.5.2 M-Projections ... 314
8.5.3 I-Projections ... 319
8.6 Summary ... 319
8.7 Relevant Literature ... 320
8.8 Exercises ... 320
Part II - Inference ... 322
9 Exact Inference: Variable Elimination ... 324
9.1 Analysis of Complexity ... 325
9.1.1 Analysis of Exact Inference ... 325
9.1.2 Analysis of Approximate Inference ... 327
9.2 Variable Elimination: The Basic Ideas ... 329
9.3 Variable Elimination ... 333
9.3.1 Basic Elimination ... 334
9.3.1.1 Factor Marginalization ... 334
9.3.1.2 The Variable Elimination Algorithm ... 335
9.3.1.3 Semantics of Factors ... 339
9.3.2 Dealing with Evidence ... 340
9.4 Complexity and Graph Structure: Variable Elimination ... 342
9.4.1 Simple Analysis ... 343
9.4.2 Graph-Theoretic Analysis ... 343
9.4.2.1 Factors and Undirected Graphs ... 343
9.4.2.2 Elimination as Graph Transformation ... 344
9.4.2.3 The Induced Graph ... 345
9.4.3 Finding Elimination Orderings ... 347
9.4.3.1 Chordal Graphs ... 348
9.4.3.2 Minimum Fill/Size/Weight Search ... 350
9.5 Conditioning ... 352
9.5.1 The Conditioning Algorithm ... 352
9.5.2 Conditioning and Variable Elimination ... 355
9.5.3 Graph-Theoretic Analysis ... 359
9.5.4 Improved Conditioning ... 360
9.5.4.1 Alternating Conditioning and Elimination ... 361
9.5.4.2 Network Decomposition ... 361
9.6 Inference with Structured CPDs ... 362
9.6.1 Independence of Causal Influence ... 362
9.6.1.1 Noisy-Or Decompositions ... 362
9.6.1.2 The General Decomposition ... 364
9.6.1.3 Global Structure ... 365
9.6.1.4 Heterogeneous Factorization ... 365
9.6.2 Context-Specific Independence ... 366
9.6.2.1 Rule-Based Variable Elimination ... 366
9.6.2.2 Conditioning ... 371
9.6.3 Discussion ... 372
9.7 Summary and Discussion ... 373
9.8 Relevant Literature ... 374
9.9 Exercises ... 375
10 Exact Inference: Clique Trees ... 382
10.1 Variable Elimination and Clique Trees ... 382
10.1.1 Cluster Graphs ... 383
10.1.2 Clique Trees ... 383
10.2 Message Passing: Sum Product ... 385
10.2.1 Variable Elimination in a Clique Tree ... 386
10.2.1.1 An Example ... 386
10.2.1.2 Clique-Tree Message Passing ... 388
10.2.1.3 Correctness ... 390
10.2.2 Clique Tree Calibration ... 392
10.2.3 A Calibrated Clique Tree as a Distribution ... 398
10.3 Message Passing: Belief Update ... 401
10.3.1 Message Passing with Division ... 401
10.3.2 Equivalence of Sum-Product and Belief Update Messages ... 405
10.3.3 Answering Queries ... 406
10.3.3.1 Incremental Updates ... 406
10.3.3.2 Queries Outside a Clique ... 407
10.3.3.3 Multiple Queries ... 408
10.4 Constructing a Clique Tree ... 409
10.4.1 Clique Trees from Variable Elimination ... 409
10.4.2 Clique Trees from Chordal Graphs ... 411
10.5 Summary ... 413
10.6 Relevant Literature ... 414
10.7 Exercises ... 415
11 Inference as Optimization ... 418
11.1 Introduction ... 418
11.1.1 Exact Inference Revisited ... 419
11.1.2 The Energy Functional ... 421
11.1.3 Optimizing the Energy Functional ... 423
11.2 Exact Inference as Optimization ... 423
11.2.1 Fixed-Point Characterization ... 425
11.2.2 Inference as Optimization ... 427
11.3 Propagation-Based Approximation ... 428
11.3.1 A Simple Example ... 428
11.3.2 Cluster-Graph Belief Propagation ... 433
11.3.3 Properties of Cluster-Graph Belief Propagation ... 436
11.3.3.1 Reparameterization ... 436
11.3.3.2 Tree Consistency ... 437
11.3.4 Analyzing Convergence ... 438
11.3.5 Constructing Cluster Graphs ... 441
11.3.5.1 Pairwise Markov Networks ... 441
11.3.5.2 Bethe Cluster Graph ... 442
11.3.5.3 Beyond Marginal Probabilities ... 443
11.3.6 Variational Analysis ... 448
11.3.7 Other Entropy Approximations ... 451
11.3.7.1 Motivation ... 451
11.3.7.2 Convex Approximations ... 453
11.3.7.3 Region Graph Approximations ... 456
11.3.8 Discussion ... 465
11.4 Propagation with Approximate Messages ... 467
11.4.1 Factorized Messages ... 468
11.4.2 Approximate Message Computation ... 470
11.4.3 Inference with Approximate Messages ... 473
11.4.3.1 Sum-Product Propagation ... 473
11.4.3.2 Belief Update Propagation ... 476
11.4.4 Expectation Propagation ... 479
11.4.5 Variational Analysis ... 482
11.4.6 Discussion ... 485
11.5 Structured Variational Approximations ... 485
11.5.1 The Mean Field Approximation ... 486
11.5.1.1 The Mean Field Energy ... 486
11.5.1.2 Maximizing the Energy Functional: Fixed-Point Characterization ... 488
11.5.1.3 Maximizing the Energy Functional: The Mean Field Algorithm ... 491
11.5.2 Structured Approximations ... 493
11.5.2.1 Fixed-Point Characterization ... 494
11.5.2.2 Optimization ... 496
11.5.2.3 Simplifying the Update Equations ... 497
11.5.2.4 Simplifying the Family Q ... 499
11.5.2.5 Selecting the Approximation ... 505
11.5.3 Local Variational Methods ... 506
11.5.3.1 Variational Bounds ... 507
11.5.3.2 Variational Variable Elimination ... 509
11.6 Summary and Discussion ... 510
11.7 Relevant Literature ... 512
11.8 Exercises ... 514
12 Particle-Based Approximate Inference ... 524
12.1 Forward Sampling ... 525
12.1.1 Sampling from a Bayesian Network ... 525
12.1.2 Analysis of Error ... 527
12.1.3 Conditional Probability Queries ... 528
12.2 Likelihood Weighting and Importance Sampling ... 529
12.2.1 Likelihood Weighting: Intuition ... 529
12.2.2 Importance Sampling ... 531
12.2.2.1 Unnormalized Importance Sampling ... 531
12.2.2.2 Normalized Importance Sampling ... 533
12.2.3 Importance Sampling for Bayesian Networks ... 535
12.2.3.1 The Mutilated Network Proposal Distribution ... 535
12.2.3.2 Unconditional Probability of an Event ... 537
12.2.3.3 Ratio Likelihood Weighting ... 539
12.2.3.4 Normalized Likelihood Weighting ... 540
12.2.3.5 Conditional Probabilities: Comparison ... 541
12.2.4 Importance Sampling Revisited ... 541
12.3 Markov Chain Monte Carlo Methods ... 542
12.3.1 Gibbs Sampling Algorithm ... 542
12.3.2 Markov Chains ... 544
12.3.2.1 Basic Definition ... 544
12.3.2.2 Asymptotic Behavior ... 545
12.3.2.3 Stationary Distributions ... 546
12.3.2.4 Multiple Transition Models ... 548
12.3.3 Gibbs Sampling Revisited ... 549
12.3.4 A Broader Class of Markov Chains ... 552
12.3.4.1 Detailed Balance ... 552
12.3.4.2 Metropolis-Hastings Algorithm ... 553
12.3.5 Using a Markov Chain ... 555
12.3.5.1 Mixing Time ... 556
12.3.5.2 Collecting Samples ... 557
12.3.5.3 Discussion ... 560
12.4 Collapsed Particles ... 563
12.4.1 Collapsed Likelihood Weighting ... 564
12.4.1.1 Collapsed Importance Sampling ... 564
12.4.1.2 Formal Description ... 565
12.4.1.3 Proposal Distribution ... 567
12.4.2 Collapsed MCMC ... 568
12.5 Deterministic Search Methods ... 573
12.6 Summary ... 577
12.7 Relevant Literature ... 578
12.8 Exercises ... 581
13 MAP Inference ... 588
13.1 Overview ... 588
13.1.1 Computational Complexity ... 588
13.1.2 Overview of Solution Methods ... 589
13.2 Variable Elimination for (Marginal) MAP ... 591
13.2.1 Max-Product Variable Elimination ... 591
13.2.2 Finding the Most Probable Assignment ... 593
13.2.3 Variable Elimination for Marginal MAP ... 596
13.3 Max-Product in Clique Trees ... 599
13.3.1 Computing Max-Marginals ... 599
13.3.2 Message Passing as Reparameterization ... 601
13.3.3 Decoding Max-Marginals ... 602
13.4 Max-Product Belief Propagation in Loopy Cluster Graphs ... 604
13.4.1 Standard Max-Product Message Passing ... 604
13.4.1.1 Message Passing Algorithm ... 604
13.4.1.2 Decoding the Pseudo-Max-Marginals ... 605
13.4.1.3 Strong Local Maximum ... 607
13.4.2 Max-Product BP with Counting Numbers ... 609
13.4.2.1 Max-Product with Counting Numbers ... 610
13.4.2.2 Optimality Guarantee ... 611
13.4.3 Discussion ... 612
13.5 MAP as a Linear Optimization Problem ... 614
13.5.1 The Integer Program Formulation ... 614
13.5.2 Linear Programming Relaxation ... 616
13.5.3 Low-Temperature Limits ... 618
13.5.3.1 LP as Sum-Product Limit ... 618
13.5.3.2 Max-Product as Sum-Product Limit ... 619
13.5.3.3 Discussion ... 623
13.6 Using Graph Cuts for MAP ... 625
13.6.1 Inference Using Graph Cuts ... 625
13.6.2 Nonbinary Variables ... 629
13.7 Local Search Algorithms ... 632
13.8 Summary ... 634
13.9 Relevant Literature ... 635
13.10 Exercises ... 638
14 Inference in Hybrid Networks ... 642
14.1 Introduction ... 642
14.1.1 Challenges ... 642
14.1.2 Discretization ... 643
14.1.3 Overview ... 644
14.2 Variable Elimination in Gaussian Networks ... 645
14.2.1 Canonical Forms ... 646
14.2.1.1 The Canonical Form Representation ... 646
14.2.1.2 Operations on Canonical Forms ... 647
14.2.2 Sum-Product Algorithms ... 648
14.2.3 Gaussian Belief Propagation ... 649
14.3 Hybrid Networks ... 652
14.3.1 The Difficulties ... 652
14.3.2 Factor Operations for Hybrid Gaussian Networks ... 655
14.3.2.1 Canonical Tables ... 655
14.3.2.2 Weak Marginalization ... 656
14.3.3 EP for CLG Networks ... 658
14.3.3.1 Ordering Constraints ... 659
14.3.3.2 Ill-Defined Messages ... 661
14.3.4 An “Exact” CLG Algorithm ... 663
14.3.4.1 Strongly Rooted Trees ... 663
14.3.4.2 Strong Roots and Correctness ... 664
14.3.4.3 Strong Triangulation and Complexity ... 666
14.4 Nonlinear Dependencies ... 667
14.4.1 Linearization ... 668
14.4.1.1 Taylor Series Linearization ... 668
14.4.1.2 M-Projection Using Numerical Integration ... 669
14.4.1.3 Discussion ... 672
14.4.2 Expectation Propagation with Gaussian Approximation ... 674
14.4.2.1 Dealing with Evidence ... 675
14.4.2.2 Valid Linearization ... 676
14.4.2.3 Linearization of Multiple Functions ... 677
14.4.2.4 Iterated Message Passing ... 678
14.4.2.5 Augmented CLG Networks ... 679
14.5 Particle-Based Approximation Methods ... 679
14.5.1 Sampling in Continuous Spaces ... 679
14.5.2 Forward Sampling in Bayesian Networks ... 680
14.5.3 MCMC Methods ... 681
14.5.4 Collapsed Particles ... 682
14.5.5 Nonparametric Message Passing ... 683
14.6 Summary and Discussion ... 683
14.7 Relevant Literature ... 684
14.8 Exercises ... 686
15 Inference in Temporal Models ... 688
15.1 Inference Tasks ... 689
15.2 Exact Inference ... 690
15.2.1 Filtering in State-Observation Models ... 690
15.2.2 Filtering as Clique Tree Propagation ... 691
15.2.3 Clique Tree Inference in DBNs ... 692
15.2.4 Entanglement ... 693
15.3 Approximate Inference ... 698
15.3.1 Key Ideas ... 698
15.3.1.1 Bounded History Updates ... 698
15.3.1.2 Analysis of Convergence ... 699
15.3.2 Factored Belief State Methods ... 700
15.3.3 Particle Filtering ... 702
15.3.3.1 Naive Likelihood Weighting ... 702
15.3.3.2 The Bootstrap Filter ... 705
15.3.3.3 Sequential Importance Sampling ... 706
15.3.3.4 Sample Selection Scheme ... 709
15.3.3.5 Other Extensions ... 710
15.3.4 Deterministic Search Techniques ... 712
15.4 Hybrid DBNs ... 712
15.4.1 Continuous Models ... 713
15.4.1.1 The Kalman Filter ... 713
15.4.1.2 Nonlinear Systems ... 715
15.4.2 Hybrid Models ... 720
15.5 Summary ... 725
15.6 Relevant Literature ... 727
15.7 Exercises ... 729
Part III - Learning ... 732
16 Learning Graphical Models: Overview ... 734
16.1 Motivation ... 734
16.2 Goals of Learning ... 735
16.2.1 Density Estimation ... 735
16.2.2 Specific Prediction Tasks ... 737
16.2.3 Knowledge Discovery ... 738
16.3 Learning as Optimization ... 739
16.3.1 Empirical Risk and Overfitting ... 740
16.3.2 Discriminative versus Generative Training ... 746
16.4 Learning Tasks ... 748
16.4.1 Model Constraints ... 749
16.4.2 Data Observability ... 749
16.4.3 Taxonomy of Learning Tasks ... 751
16.5 Relevant Literature ... 752
17 Parameter Estimation ... 754
17.1 Maximum Likelihood Estimation ... 754
17.1.1 The Thumbtack Example ... 754
17.1.2 The Maximum Likelihood Principle ... 757
17.2 MLE for Bayesian Networks ... 759
17.2.1 A Simple Example ... 760
17.2.2 Global Likelihood Decomposition ... 761
17.2.3 Table-CPDs ... 762
17.2.4 Gaussian Bayesian Networks ... 765
17.2.5 Maximum Likelihood Estimation as M-Projection ... 768
17.3 Bayesian Parameter Estimation ... 770
17.3.1 The Thumbtack Example Revisited ... 770
17.3.1.1 Joint Probabilistic Model ... 770
17.3.1.2 Prediction ... 771
17.3.1.3 Priors ... 772
17.3.2 Priors and Posteriors ... 774
17.4 Bayesian Parameter Estimation in Bayesian Networks ... 778
17.4.1 Parameter Independence and Global Decomposition ... 779
17.4.1.1 A Simple Example ... 779
17.4.1.2 General Networks ... 781
17.4.1.3 Prediction ... 781
17.4.2 Local Decomposition ... 783
17.4.3 Priors for Bayesian Network Learning ... 785
17.4.4 MAP Estimation ... 788
17.5 Learning Models with Shared Parameters ... 791
17.5.1 Global Parameter Sharing ... 792
17.5.1.1 Likelihood Function with Global Shared Parameters ... 792
17.5.1.2 Parameter Estimation for Template-Based Models ... 794
17.5.2 Local Parameter Sharing ... 797
17.5.3 Bayesian Inference with Shared Parameters ... 799
17.5.4 Hierarchical Priors ... 800
17.6 Generalization Analysis ... 806
17.6.1 Asymptotic Analysis ... 806
17.6.2 PAC-Bounds ... 807
17.6.2.1 Estimating a Multinomial ... 808
17.6.2.2 Estimating a Bayesian Network ... 810
17.7 Summary ... 813
17.8 Relevant Literature ... 814
17.9 Exercises ... 815
18 Structure Learning in Bayesian Networks ... 820
18.1 Introduction ... 820
18.1.1 Problem Definition ... 820
18.1.2 Overview of Methods ... 822
18.2 Constraint-Based Approaches ... 823
18.2.1 General Framework ... 823
18.2.2 Independence Tests ... 824
18.2.2.1 Single-Sided Hypothesis Tests ... 824
18.2.2.2 Deviance Measures ... 825
18.2.2.3 Testing for Independence ... 826
18.2.2.4 Building Networks ... 827
18.3 Structure Scores ... 827
18.3.1 Likelihood Scores ... 828
18.3.1.1 Maximum Likelihood Parameters ... 828
18.3.1.2 Information-Theoretic Interpretation ... 828
18.3.1.3 Limitations of the Maximum Likelihood Score ... 830
18.3.2 Bayesian Score ... 831
18.3.3 Marginal Likelihood for a Single Variable ... 834
18.3.4 Bayesian Score for Bayesian Networks ... 836
18.3.5 Understanding the Bayesian Score ... 838
18.3.6 Priors ... 841
18.3.6.1 Structure Priors ... 841
18.3.6.2 Parameter Priors and Score Decomposability ... 842
18.3.6.3 Representing Parameter Priors ... 843
18.3.7 Score Equivalence ... 844
18.4 Structure Search ... 844
18.4.1 Learning Tree-Structured Networks ... 845
18.4.2 Known Order ... 846
18.4.3 General Graphs ... 848
18.4.3.1 The Search Space ... 849
18.4.3.2 The Search Procedure ... 851
18.4.3.3 Score Decomposition and Search ... 855
18.4.3.4 Empirical Evaluation ... 857
18.4.4 Learning with Equivalence Classes ... 858
18.5 Bayesian Model Averaging ... 861
18.5.1 Basic Theory ... 861
18.5.2 Model Averaging Given an Order ... 863
18.5.2.1 Computing the Marginal Likelihood ... 863
18.5.2.2 Probabilities of Features ... 864
18.5.3 The General Case ... 865
18.5.3.1 MCMC Over Structures ... 866
18.5.3.2 MCMC Over Orders ... 868
18.6 Learning Models with Additional Structure ... 869
18.6.1 Learning with Local Structure ... 870
18.6.1.1 Scoring Networks ... 871
18.6.1.2 Search with Local Structure ... 872
18.6.2 Learning Template Models ... 874
18.7 Summary and Discussion ... 875
18.8 Relevant Literature ... 877
18.9 Exercises ... 880
19 Partially Observed Data ... 886
19.1 Foundations ... 886
19.1.1 Likelihood of Data and Observation Models ... 886
19.1.2 Decoupling of Observation Mechanism ... 890
19.1.3 The Likelihood Function ... 893
19.1.4 Identifiability ... 897
19.2 Parameter Estimation ... 899
19.2.1 Gradient Ascent ... 900
19.2.1.1 Computing the Gradient ... 900
19.2.1.2 An Example ... 902
19.2.1.3 Gradient Ascent Algorithm ... 904
19.2.2 Expectation Maximization (EM) ... 905
19.2.2.1 Intuition ... 905
19.2.2.2 An Example ... 907
19.2.2.3 The EM Algorithm for Bayesian Networks ... 909
19.2.2.4 Bayesian Clustering Using EM ... 912
19.2.2.5 Theoretical Foundations ... 914
19.2.2.6 Hard-Assignment EM ... 921
19.2.3 Comparison: Gradient Ascent versus EM ... 924
19.2.4 Approximate Inference ... 930
19.3 Bayesian Learning with Incomplete Data ... 934
19.3.1 Overview ... 934
19.3.2 MCMC Sampling ... 936
19.3.2.1 Gibbs Sampling ... 936
19.3.2.2 Collapsed MCMC ... 938
19.3.3 Variational Bayesian Learning ... 941
19.4 Structure Learning ... 945
19.4.1 Scoring Structures ... 946
19.4.1.1 Laplace Approximation ... 946
19.4.1.2 Asymptotic Approximations ... 948
19.4.1.3 Cheeseman-Stutz Approximation ... 949
19.4.1.4 Candidate Method ... 950
19.4.1.5 Variational Marginal Likelihood ... 951
19.4.2 Structure Search ... 954
19.4.2.1 A Naive Approach ... 954
19.4.2.2 Heuristic Solutions ... 956
19.4.3 Structural EM ... 957
19.4.3.1 Approximating the Delta-score ... 957
19.4.3.2 The Structural EM Algorithm ... 958
19.4.3.3 Structural EM for Selective Clustering ... 960
19.4.3.4 An Effective Implementation of Structural EM ... 961
19.5 Learning Models with Hidden Variables ... 962
19.5.1 Information Content of Hidden Variables ... 963
19.5.2 Determining the Cardinality ... 965
19.5.2.1 Model Selection for Cardinality ... 965
19.5.2.2 Dirichlet Processes ... 965
19.5.3 Introducing Hidden Variables ... 967
19.6 Summary ... 970
19.7 Relevant Literature ... 971
19.8 Exercises ... 972
20 Learning Undirected Models ... 980
20.1 Overview ... 980
20.2 The Likelihood Function ... 981
20.2.1 An Example ... 981
20.2.2 Form of the Likelihood Function ... 983
20.2.3 Properties of the Likelihood Function ... 984
20.3 Maximum (Conditional) Likelihood Parameter Estimation ... 986
20.3.1 Maximum Likelihood Estimation ... 986
20.3.2 Conditionally Trained Models ... 987
20.3.3 Learning with Missing Data ... 991
20.3.3.1 Gradient Ascent ... 991
20.3.3.2 Expectation Maximization ... 992
20.3.4 Maximum Entropy and Maximum Likelihood ... 993
20.4 Parameter Priors and Regularization ... 995
20.4.1 Local Priors ... 995
20.4.2 Global Priors ... 998
20.5 Learning with Approximate Inference ... 998
20.5.1 Belief Propagation ... 999
20.5.1.1 Pseudo-moment Matching ... 1000
20.5.1.2 Belief Propagation and Entropy Approximations ... 1001
20.5.1.3 Sampling-Based Learning ... 1003
20.5.2 MAP-Based Learning ... 1004
20.6 Alternative Objectives ... 1006
20.6.1 Pseudolikelihood and Its Generalizations ... 1007
20.6.2 Contrastive Optimization Criteria ... 1011
20.6.2.1 Contrastive Divergence ... 1011
20.6.2.2 Margin-Based Training ... 1012
20.7 Structure Learning ... 1015
20.7.1 Structure Learning Using Independence Tests ... 1016
20.7.2 Score-Based Learning: Hypothesis Spaces ... 1018
20.7.3 Objective Functions ... 1019
20.7.3.1 Likelihood Function ... 1019
20.7.3.2 Bayesian Scores ... 1020
20.7.3.3 Parameter Penalty Scores ... 1021
20.7.4 Optimization Task ... 1022
20.7.4.1 Greedy Structure Search ... 1022
20.7.4.2 Successor Evaluation ... 1022
20.7.4.3 Choice of Scoring Function ... 1024
20.7.4.4 L1-Regularization for Structure Learning ... 1025
20.7.5 Evaluating Changes to the Model ... 1029
20.8 Summary ... 1033
20.9 Relevant Literature ... 1035
20.10 Exercises ... 1038
Part IV - Actions and Decisions ... 1044
21 Causality ... 1046
21.1 Motivation and Overview ... 1046
21.1.1 Conditioning and Intervention ... 1046
21.1.2 Correlation and Causation ... 1049
21.2 Causal Models ... 1051
21.3 Structural Causal Identifiability ... 1054
21.3.1 Query Simplification Rules ... 1054
21.3.2 Iterated Query Simplification ... 1057
21.4 Mechanisms and Response Variables ... 1063
21.5 Partial Identifiability in Functional Causal Models ... 1068
21.6 Counterfactual Queries ... 1071
21.6.1 Twinned Networks ... 1071
21.6.2 Bounds on Counterfactual Queries ... 1074
21.7 Learning Causal Models ... 1077
21.7.1 Learning Causal Models without Confounding Factors ... 1078
21.7.1.1 Key Assumptions ... 1078
21.7.1.2 Identifying the Model ... 1079
21.7.2 Learning from Interventional Data ... 1081
21.7.3 Dealing with Latent Variables ... 1085
21.7.3.1 Score-Based Approaches ... 1085
21.7.3.2 Constraint-Based Approaches ... 1085
21.7.4 Learning Functional Causal Models ... 1088
21.8 Summary ... 1090
21.9 Relevant Literature ... 1091
21.10 Exercises ... 1092
22 Utilities and Decisions ... 1096
22.1 Foundations: Maximizing Expected Utility ... 1096
22.1.1 Decision Making Under Uncertainty ... 1096
22.1.2 Theoretical Justification ... 1099
22.2 Utility Curves ... 1101
22.2.1 Utility of Money ... 1102
22.2.2 Attitudes Toward Risk ... 1103
22.2.3 Rationality ... 1104
22.3 Utility Elicitation ... 1105
22.3.1 Utility Elicitation Procedures ... 1105
22.3.2 Utility of Human Life ... 1106
22.4 Utilities of Complex Outcomes ... 1108
22.4.1 Preference and Utility Independence ... 1108
22.4.2 Additive Independence Properties ... 1111
22.4.2.1 Additive Independence ... 1111
22.4.2.2 Conditional Additive Independence ... 1112
22.4.2.3 Generalized Additive Independence ... 1115
22.5 Summary ... 1118
23 Structured Decision Problems ... 1122
23.1 Decision Trees ... 1122
23.1.1 Representation ... 1122
23.1.2 Backward Induction Algorithm ... 1124
23.2 Influence Diagrams ... 1125
23.2.1 Basic Representation ... 1126
23.2.2 Decision Rules ... 1127
23.2.3 Time and Recall ... 1129
23.2.4 Semantics and Optimality Criterion ... 1130
23.3 Backward Induction in Influence Diagrams ... 1132
23.3.1 Decision Trees for Influence Diagrams ... 1133
23.3.2 Sum-Max-Sum Rule ... 1135
23.4 Computing Expected Utilities ... 1137
23.4.1 Simple Variable Elimination ... 1137
23.4.2 Multiple Utility Variables: Simple Approaches ... 1139
23.4.3 Generalized Variable Elimination ... 1140
23.5 Optimization in Influence Diagrams ... 1144
23.5.1 Optimizing a Single Decision Rule ... 1144
23.5.2 Iterated Optimization Algorithm ... 1145
23.5.3 Strategic Relevance and Global Optimality ... 1147
23.5.3.1 Strategic Relevance ... 1147
23.5.3.2 S-Reachability ... 1149
23.5.3.3 The Relevance Graph ... 1151
23.5.3.4 Global Optimality ... 1152
23.6 Ignoring Irrelevant Information ... 1156
23.7 Value of Information ... 1158
23.7.1 Single Observations ... 1159
23.7.2 Multiple Observations ... 1161
23.8 Summary ... 1163
23.9 Relevant Literature ... 1164
23.10 Exercises ... 1167
24 Epilogue ... 1170
Why Probabilistic Graphical Models? ... 1170
The Modeling Pipeline ... 1171
Some Current and Future Directions ... 1173
Appendix A Background Material ... 1174
A.1 Information Theory ... 1174
A.1.1 Compression and Entropy ... 1174
A.1.2 Conditional Entropy and Information ... 1176
A.1.3 Relative Entropy and Distances Between Distributions ... 1177
A.1.3.1 Relative Entropy ... 1178
A.1.3.2 Conditional Relative Entropy ... 1179
A.1.3.3 Other Distance Measures ... 1180
A.2 Convergence Bounds ... 1180
A.2.1 Central Limit Theorem ... 1181
A.2.2 Convergence Bounds ... 1182
A.3 Algorithms and Algorithmic Complexity ... 1183
A.3.1 Basic Graph Algorithms ... 1183
A.3.2 Analysis of Algorithmic Complexity ... 1184
A.3.3 Dynamic Programming ... 1186
A.3.4 Complexity Theory ... 1187
A.3.4.1 Decision Problems ... 1188
A.3.4.2 P and NP ... 1188
A.3.4.3 Other Complexity Classes ... 1190
A.4 Combinatorial Optimization and Search ... 1191
A.4.1 Optimization Problems ... 1191
A.4.2 Local Search ... 1191
A.4.2.1 Local Hill Climbing ... 1192
A.4.2.2 Forcing Exploration in New Directions ... 1193
A.4.2.3 Randomization in Search ... 1195
A.4.3 Branch and Bound Search ... 1197
A.5 Continuous Optimization ... 1198
A.5.1 Characterizing Optima of a Continuous Function ... 1198
A.5.2 Gradient Ascent Methods ... 1200
A.5.2.1 Gradient Ascent ... 1200
A.5.2.2 Line Search ... 1201
A.5.2.3 Conjugate Gradient Ascent ... 1202
A.5.3 Constrained Optimization ... 1204
A.5.4 Convex Duality ... 1208
Bibliography ... 1210
Notation Index ... 1248
Subject Index ... 1252