The Minimum Description Length Principle (Adaptive Computation and Machine Learning)

The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well suited for model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex and overfitting the data is a serious concern.

This extensive, step-by-step introduction to the MDL Principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining; to philosophers interested in the foundations of statistics; and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology. Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand it. Part II treats universal coding, the information-theoretic notion on which MDL is built, and Part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families, with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.
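
To make the two-part idea concrete, here is a minimal sketch (not taken from the book) of the crude two-part code MDL criterion that Chapter 5 applies to Markov chain hypothesis selection: the total description length is L(H) + L(D|H), where the hypothesis cost is approximated by the standard (k/2) log2 n bits for k free parameters and the data cost is the ideal codelength under the maximum-likelihood Markov chain. The function name and the toy data are illustrative assumptions; the book's actual codes are constructed more carefully.

```python
import math
from collections import Counter

def two_part_codelength(data, order):
    """Crude two-part MDL score (in bits) for a binary Markov chain of
    the given order: L(H) + L(D|H).

    L(H): each of the 2^order free parameters is encoded to precision
    1/sqrt(n), costing roughly (1/2) * log2(n) bits per parameter.
    L(D|H): ideal codelength of the data under the maximum-likelihood
    order-k Markov chain, i.e. minus log2 of the maximized likelihood.
    """
    n = len(data)
    num_params = 2 ** order
    l_hypothesis = 0.5 * num_params * math.log2(n)

    # Count how often each symbol follows each context of length `order`.
    transition_counts = Counter()
    context_totals = Counter()
    for i in range(order, n):
        context = tuple(data[i - order:i])
        transition_counts[(context, data[i])] += 1
        context_totals[context] += 1

    # -log2 of the maximized likelihood: sum over observed transitions.
    l_data = 0.0
    for (context, _symbol), count in transition_counts.items():
        p_hat = count / context_totals[context]
        l_data -= count * math.log2(p_hat)

    return l_hypothesis + l_data

# Toy usage: pick the Markov order with the smallest total codelength.
sequence = [0, 1] * 500          # strictly alternating binary data
best_order = min(range(4), key=lambda k: two_part_codelength(sequence, k))
print("selected order:", best_order)  # expected: 1
```

On the alternating sequence the criterion selects order 1: an order-0 model pays roughly one bit per symbol in data cost, while orders above 1 pay extra parameter cost without compressing the data any further, which is exactly the overfitting trade-off the blurb describes.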

Author(s): Peter D. Grünwald
Series: Adaptive Computation and Machine Learning
Publisher: The MIT Press
Year: 2007

Language: English
Pages: 738
Tags: Computer Science and Computer Engineering; Artificial Intelligence; Data Mining;

Cover......Page 1
Title Page......Page 5
Brief Contents......Page 9
Contents......Page 11
List of Figures......Page 21
Series Foreword......Page 23
Foreword......Page 25
Preface......Page 27
This Book......Page 28
A Guide for the Reader......Page 30
Acknowledgments......Page 33
I Introductory Material......Page 35
1 Learning, Regularity, and Compression......Page 37
1.2 Regularity and Compression......Page 38
1.3 Solomonoff’s Breakthrough – Kolmogorov Complexity......Page 42
1.4 Making the Idea Applicable......Page 44
1.5 Crude MDL, Refined MDL and Universal Coding......Page 46
1.5.1 From Crude to Refined MDL......Page 48
1.5.2 Universal Coding and Refined MDL......Page 51
1.5.3 Refined MDL for Model Selection......Page 52
1.5.4 General Refined MDL: Prediction and Hypothesis Selection......Page 54
1.6.1 Model Selection among Non-Nested Models......Page 57
1.6.2 Goals of Model vs. Point Hypothesis Selection......Page 59
1.7 The MDL Philosophy......Page 60
1.8 Does It Make Any Sense? MDL, Occam’s Razor, and the “True Model”......Page 63
1.8.1 Answer to Criticism No. 1: Refined MDL’s Notion of “Complexity” Is Not Arbitrary......Page 64
1.8.2 Answer to Criticism No. 2......Page 66
1.9 History and Forms of MDL......Page 70
1.9.1 What Is MDL?......Page 71
1.9.2 MDL Literature......Page 72
1.10 Summary and Outlook......Page 74
2.1 General Mathematical Preliminaries......Page 75
2.2.1 Definitions; Notational Conventions......Page 80
2.2.2 Probabilistic Sources......Page 87
2.2.3 Limit Theorems and Statements......Page 89
2.2.4 Probabilistic Models......Page 91
2.2.5 Probabilistic Model Classes......Page 94
2.3 Kinds of Probabilistic Models*......Page 96
2.4 Terminological Preliminaries: Model Selection, Hypothesis Selection, Parameter Estimation, and Prediction......Page 103
2.5.1 Consistency......Page 105
2.5.2 Basic Concepts of Bayesian Statistics......Page 108
2.6 Summary and Outlook......Page 112
3.1 Coding Preliminaries......Page 113
3.1.1 Restriction to Prefix Coding Systems; Descriptions as Messages......Page 117
3.1.2 Different Kinds of Codes......Page 120
3.2 The Most Important Section of This Book: Probabilities and Code Lengths, Part I......Page 124
3.2.1 The Kraft Inequality......Page 125
3.2.2 Code Lengths “Are” Probabilities......Page 129
3.2.3 Immediate Insights and Consequences......Page 133
3.3 Probabilities and Code Lengths, Part II......Page 135
3.3.1 (Relative) Entropy and the Information Inequality......Page 137
3.4 Summary, Outlook, Further Reading......Page 140
4.1 Introduction......Page 143
4.2 Likelihood and Observed Fisher Information......Page 145
4.3 KL Divergence and Expected Fisher Information......Page 151
4.4 Maximum Likelihood: Data vs. Parameters......Page 158
4.5 Summary and Outlook......Page 164
5 Crude Two-Part Code MDL......Page 165
5.1 Introduction: Making Two-Part MDL Precise......Page 166
5.2 Applying Two-Part Code MDL to Markov Chain Hypothesis Selection......Page 167
5.2.1 The Code C₂......Page 169
5.2.2 The Code C₁......Page 171
5.2.3 Exact Yet Simplistic Two-Part Code MDL for Markov Chains......Page 172
5.3 Simplistic Two-Part Code MDL Principle for Hypothesis Selection......Page 173
5.4 Two-Part MDL for Tasks Other Than Hypothesis Selection......Page 175
5.5 Behavior of Two-Part Code MDL......Page 176
5.6.1 The Maximum Likelihood Principle......Page 178
5.6.2 MDL vs. ML......Page 181
5.6.3 MDL as a Maximum Probability Principle......Page 182
5.7 Computing and Approximating the Two-Part MDL Criterion in Practice......Page 184
5.8 Justifying Crude MDL: Consistency and Code Design......Page 186
5.8.1 A General Consistency Result......Page 187
5.8.2 Code Design for Two-Part Code MDL......Page 191
5.A Appendix: Proof of Theorem 5.1......Page 197
II Universal Coding......Page 199
Introduction to Part II......Page 201
Structure of Part II......Page 202
6 Universal Coding with Countable Models......Page 205
6.1 Universal Coding: The Basic Idea......Page 206
6.1.1 Two-Part Codes as Simple Universal Codes......Page 208
6.1.2 From Universal Codes to Universal Models......Page 209
6.1.3 Formal Definition of Universality......Page 211
6.2 The Finite Case......Page 212
6.2.1 Minimax Regret and Normalized ML......Page 213
6.2.2 NML vs. Two-Part vs. Bayes......Page 216
6.3.1 The Two-Part and Bayesian Codes......Page 218
6.3.2 The NML Code......Page 221
6.4.1 Distributions as Prediction Strategies......Page 224
6.4.2 Bayes Is Prequential; NML and Two-part Are Not......Page 227
6.4.3 The Prequential Plug-In Model......Page 231
6.5.1 Stochastic Redundancy......Page 233
6.5.2 Uniformly Universal Models......Page 235
6.6 Summary, Outlook and Further Reading......Page 238
7.1 Introduction......Page 241
7.1.1 Preliminaries......Page 242
7.2 Asymptotic Expansion of Parametric Complexity......Page 245
7.3 The Meaning of ∫_Θ √(det I(θ)) dθ......Page 250
7.3.1 Complexity and Functional Form......Page 251
7.3.2 KL Divergence and Distinguishability......Page 253
7.3.3 Complexity and Volume......Page 256
7.3.4 Complexity and the Number of Distinguishable Distributions*......Page 258
7.4 Explicit and Simplified Computations......Page 260
8.1 The Bayesian Regret......Page 265
8.1.1 Basic Interpretation of Theorem 8.1......Page 267
8.2 Bayes Meets Minimax – Jeffreys’ Prior......Page 268
8.2.1 Jeffreys’ Prior and the Boundary......Page 271
8.3.1 Proof Sketch of Theorem 8.1......Page 273
8.3.2 Beyond Exponential Families......Page 275
8.3.3 Proof Sketch of Theorem 7.1......Page 277
8.4 Stochastic Universality*......Page 278
8.A Appendix: Proofs of Theorem 8.1 and Theorem 8.2......Page 282
9.1 Prequential Plug-in for Exponential Families......Page 291
9.2 The Plug-in vs. the Bayes Universal Model......Page 296
9.3 More Precise Asymptotics......Page 299
9.4 Summary......Page 303
10.1 The Ordinary Two-Part Universal Model......Page 305
10.1.1 Derivation of the Two-Part Code Regret......Page 308
10.1.2 Proof Sketch of Theorem 10.1......Page 311
10.1.3 Discussion......Page 316
10.2 The Conditional Two-Part Universal Code*......Page 318
10.2.1 Conditional Two-Part Codes for Discrete Exponential Families......Page 320
10.2.2 Distinguishability and the Phase Transition*......Page 324
10.3 Summary and Outlook......Page 327
11.1 Introduction......Page 329
11.1.1 Examples of Undefined NML Distribution......Page 332
11.1.2 Examples of Undefined Jeffreys’ Prior......Page 333
11.2 Metauniversal Codes......Page 335
11.2.1 Constrained Parametric Complexity......Page 336
11.2.2 Meta-Two-Part Coding......Page 337
11.2.3 Renormalized Maximum Likelihood*......Page 340
11.3 NML with Luckiness......Page 342
11.3.1 Asymptotic Expansion of LNML......Page 346
11.4 Conditional Universal Models......Page 350
11.4.1 Bayesian Approach with Jeffreys’ Prior......Page 351
11.4.2 Conditional NML......Page 354
11.4.3 Liang and Barron’s Approach......Page 359
11.A Appendix: Proof of Theorem 11.4......Page 363
12 Linear Regression......Page 369
12.1 Introduction......Page 370
12.1.1 Prelude: The Normal Location Family......Page 372
12.2 Least-Squares Estimation......Page 374
12.2.1 The Normal Equations......Page 376
12.2.2 Composition of Experiments......Page 379
12.2.3 Penalized Least-Squares......Page 380
12.3 The Linear Model......Page 382
12.3.1 Bayesian Linear Model M^X with Gaussian Prior......Page 388
12.3.2 Bayesian Linear Models M^X and S^X with Noninformative Priors......Page 393
12.4.1 NML......Page 397
12.4.2 Bayes and LNML......Page 398
12.4.3 Bayes-Jeffreys and CNML......Page 399
13 Beyond Parametrics......Page 403
13.1 Introduction......Page 404
13.2 CUP: Unions of Parametric Models......Page 406
13.2.1 CUP vs. Parametric Models......Page 409
13.3 Universal Codes Based on Histograms......Page 410
13.3.1 Redundancy of Universal CUP Histogram Codes......Page 414
13.4 Nonparametric Redundancy......Page 417
13.4.1 Standard CUP Universal Codes......Page 418
13.4.2 Minimax Nonparametric Redundancy......Page 421
13.5.1 Kernelization of Bayesian Linear Regression......Page 424
13.5.2 Gaussian Processes......Page 428
13.5.3 Gaussian Processes as Universal Models......Page 430
13.6 Conclusion and Further Reading......Page 436
III Refined MDL......Page 437
Introduction to Part III......Page 439
14.1 Introduction......Page 443
14.2 Simple Refined MDL Model Selection......Page 445
14.2.1 Compression Interpretation......Page 449
14.2.2 Counting Interpretation......Page 450
14.2.3 Bayesian Interpretation......Page 452
14.2.4 Prequential Interpretation......Page 453
14.3.1 Models with Infinite Complexities......Page 454
14.3.2 Comparing Many or Infinitely Many Models......Page 456
14.3.3 The General Picture......Page 459
14.4.1 Calculating Universal Codelengths......Page 462
14.4.2 Computational Efficiency and Practical Quality of Non-NML Universal Codes......Page 463
14.4.3 Model Selection with Conditional NML and Plug-in Codes......Page 465
14.4.4 General Warnings about Model Selection......Page 469
14.5 MDL Model Selection for Linear Regression......Page 472
14.5.1 Rissanen’s RNML Approach......Page 473
14.5.2 Hansen and Yu’s gMDL Approach......Page 477
14.5.3 Liang and Barron’s Approach......Page 480
14.5.4 Discussion......Page 482
14.6 Worst Case vs. Average Case*......Page 485
15.1 Introduction......Page 493
15.2 MDL for Prediction and Predictive Estimation......Page 494
15.2.1 Prequential MDL Estimators......Page 495
15.2.2 Prequential MDL Estimators Are Always Consistent......Page 499
15.2.3 Parametric and Nonparametric Examples......Page 503
15.2.4 Césaro KL consistency vs. KL consistency*......Page 506
15.3 Two-Part Code MDL for Point Hypothesis Selection......Page 510
15.3.1 Discussion of Two-Part Consistency Theorem......Page 512
15.4 MDL Parameter Estimation......Page 517
15.4.1 Parametric MDL Estimators vs. (Luckiness) ML Estimators......Page 521
15.4.2 What Estimator To Use?......Page 525
15.4.3 Comparison to Bayesian Estimators*......Page 527
15.5 Summary and Outlook......Page 532
15.A Appendix: Proof of Theorem 15.3......Page 533
16.1.1 The Scenarios Considered......Page 535
16.2 Consistency: Prequential and Two-Part MDL Estimators......Page 536
16.3.1 Case 1(c): Selection between a Union of Parametric Models......Page 539
16.3.2 Case 2(a): Nonparametric Model Selection Based on CUP Model Class......Page 542
16.4 MDL Consistency Peculiarities......Page 545
16.5 Risks and Rates......Page 549
16.5.1 Relations between Divergences and Risk Measures......Page 551
16.5.2 Minimax Rates......Page 553
16.6.1 Prequential and Two-Part MDL Estimators......Page 554
16.6.2 MDL Model Selection......Page 556
17 MDL in Context......Page 557
17.1 MDL and Frequentist Paradigms......Page 558
17.1.1 Frequentist Analysis: Sanity Check or Design Principle?......Page 559
17.1.2 The Weak Prequential Principle......Page 562
17.1.3 MDL vs. Frequentist Principles: Remaining Issues......Page 563
17.2 MDL and Bayesian Inference......Page 565
17.2.1 Luckiness Functions vs. Prior Distributions......Page 568
17.2.2 MDL, Bayes, and Occam......Page 573
17.2.3 MDL and Brands of Bayesian Statistics......Page 578
17.2.4 Conclusion: a Common Future after All?......Page 582
17.3.1 BIC......Page 583
17.3.2 AIC......Page 584
17.3.3 A Version of MDL that Combines the Best of AIC and BIC......Page 586
17.4 MDL and MML......Page 589
17.4.1 Strict Minimum Message Length......Page 590
17.4.2 Comparison to MDL......Page 592
17.4.3 Approximating SMML by the Wallace-Freeman (MML) Estimator......Page 594
17.5 MDL and Prequential Analysis......Page 596
17.6 MDL and Cross-Validation......Page 599
17.7 MDL and Maximum Entropy......Page 601
17.8 Kolmogorov Complexity and Structure Function; Ideal MDL......Page 604
17.9 MDL and Individual Sequence Prediction......Page 607
17.10 MDL and Statistical Learning Theory......Page 613
17.10.1 Structural Risk Minimization......Page 615
17.10.2 PAC-Bayesian Approaches......Page 619
17.10.3 PAC-Bayes and MDL......Page 622
17.11 The Road Ahead......Page 626
IV Additional Background......Page 631
18 The Exponential or “Maximum Entropy” Families......Page 633
18.1 Introduction......Page 634
18.2 Definition and Overview......Page 635
18.3 Basic Properties......Page 639
18.4.1 The Mean Value Parameterization......Page 643
18.4.2 Other Parameterizations......Page 645
18.4.3 Relating Mean-Value and Canonical Parameters**......Page 647
18.5 Exponential Families of General Probabilistic Sources*......Page 651
18.6 Fisher Information Definitions and Characterizations*......Page 653
19 Information-Theoretic Properties of Exponential Families......Page 657
19.2 Robustness of Exponential Family Codes......Page 658
19.2.1 If Θ_mean Does Not Contain the Mean*......Page 661
19.3 Behavior at the ML Estimate β̂......Page 663
19.4 Behavior of the ML Estimate β̂......Page 666
19.4.1 Central Limit Theorem......Page 667
19.4.2 Large Deviations......Page 668
19.5 Maximum Entropy and Minimax Codelength......Page 671
19.5.1 Exponential Families and Maximum Entropy......Page 672
19.5.2 Exponential Families and Minimax Codelength......Page 675
19.5.3 The Compression Game......Page 677
19.6 Likelihood Ratio Exponential Families and Rényi Divergences*......Page 679
19.6.1 The Likelihood Ratio Family......Page 681
19.7 Summary......Page 684
References......Page 685
List of Symbols......Page 709
Subject Index......Page 713
Jacket......Page 738