Learning with the Minimum Description Length Principle

This book introduces readers to the minimum description length (MDL) principle and its applications in learning. The MDL is a fundamental principle for inductive inference, used in many applications including statistical modeling, pattern recognition, and machine learning. At its core, the MDL rests on the premise that "the shortest code length leads to the best strategy for learning anything from data." The MDL provides a broad and unifying view of statistical inference tasks such as estimation, prediction, and testing, as well as of machine learning.
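
As a concrete (and deliberately crude) illustration of that premise, the sketch below selects a polynomial degree by minimizing a two-part description length: the code length of the data given the model plus an asymptotic (k/2) log n code length for the parameters. The data, the Gaussian noise model, and the two-part coding scheme are all choices made for this toy example; the book develops far more refined code lengths (e.g., normalized maximum likelihood).

    import numpy as np

    def two_part_mdl_degree(x, y, max_degree=8):
        """Toy two-part MDL model selection for polynomial regression.

        Total description length = L(data | model) + L(model), where
        L(data | model) is the Gaussian negative log-likelihood at the
        maximum likelihood fit and L(model) is approximated by the
        classical (k/2) * log(n) code length for k real parameters.
        """
        n = len(x)
        best_degree, best_length = None, np.inf
        for degree in range(1, max_degree + 1):
            coeffs = np.polyfit(x, y, deg=degree)      # maximum likelihood fit
            residuals = y - np.polyval(coeffs, x)
            sigma2 = max(residuals.var(), 1e-12)       # ML noise variance
            # -log p(y | model) at the ML parameters, in nats
            data_length = 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * n
            # k = degree + 1 real-valued coefficients to encode
            model_length = 0.5 * (degree + 1) * np.log(n)
            total = data_length + model_length
            if total < best_length:
                best_degree, best_length = degree, total
        return best_degree, best_length

    # Noisy cubic data: MDL should prefer a low-degree model over a
    # high-degree one that merely encodes the noise.
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 200)
    y = 1.5 * x**3 - x + rng.normal(scale=0.1, size=x.size)
    print(two_part_mdl_degree(x, y))   # typically selects degree 3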

The content covers the theoretical foundations of the MDL principle and broad practical areas such as change and anomaly detection, problems involving latent variable models, and high-dimensional statistical inference, among others. The book offers an easy-to-follow guide to the MDL principle alongside other information criteria, explaining the differences between their standpoints.
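
For orientation, the penalized-likelihood forms taken by the classical criteria are well known (stated here for context; the notation is mine, not quoted from the book). For a model with k free parameters, maximized likelihood \hat{L}, and sample size n,

    \mathrm{AIC} = -2\log\hat{L} + 2k, \qquad
    \mathrm{BIC} = -2\log\hat{L} + k\log n,

while the NML-based MDL criterion penalizes the fit by the parametric complexity of the whole model class,

    -\log\hat{L} + \log \sum_{y^n} \max_{\theta} p(y^n \mid \theta),

so the penalty reflects how flexibly the class as a whole can fit data, not just the raw parameter count.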

Written in a systematic, concise, and comprehensive style, this book is suitable for researchers and graduate students in machine learning, statistics, information theory, and computer science.

Author(s): Kenji Yamanishi
Edition: 1
Publisher: Springer
Year: 2023

Language: English
Commentary: Publisher PDF
Pages: 359
Tags: Minimum Description Length Principle; Machine Learning; MDL; Data Science; Statistical Inference; Information Theory; Model Selection; Prediction; Anomaly Detection; Change Detection

Foreword
Preface
Contents
Acronyms
1 Information and Coding
1.1 Information, Probability, and Coding
1.1.1 What is Information?
1.1.2 Prefix Coding
1.1.3 Kraft Inequality
1.2 Shannon Information
1.2.1 Probability Distribution
1.2.2 Shannon's Coding Theorem
1.2.3 Shannon Information
1.3 Universal Coding
1.3.1 Two-part Coding
1.3.2 Bayes Coding
1.3.3 Counting Coding
1.3.4 Normalized Maximum Likelihood Coding
1.3.5 Kolmogorov Complexity
1.4 Stochastic Complexity
1.4.1 Stochastic Complexity for Parametric Classes
1.4.2 Shtarkov's Min-Max Regret
1.4.3 Generalized Coding Theorem
1.5 Parametric Complexity
1.5.1 Asymptotic Approximation Method
1.5.2 g-Function-Based Method*
1.5.3 Fourier Method*
1.5.4 Combinatorial Method
1.5.5 Monte Carlo Method
1.6 MDL Principle
1.6.1 Machine Learning with MDL Principle
1.6.2 Estimation
1.6.3 Prediction
1.6.4 Testing
1.7 Summary of This Chapter
References
2 Parameter Estimation
2.1 Maximum Likelihood Estimation
2.1.1 Maximum Likelihood Estimator
2.1.2 MLE for Multivariate Gaussian and Outlier Detection
2.1.3 MLE for Linear Regression
2.1.4 Properties of Maximum Likelihood Estimator
2.2 EM Algorithm
2.2.1 EM Algorithm for Latent Variable Models
2.2.2 Incremental EM Algorithm for Online Outlier Detection
2.3 Maximum a Posteriori Estimation
2.3.1 MAP Estimation and Regularization
2.3.2 Sparse Regularized Linear Regression
2.3.3 Sparse Regularized Graphical Model*
2.4 Gradient Descent Methods
2.4.1 Gradient Descent Algorithms
2.5 High-Dimensional Penalty Selection
2.5.1 Luckiness Normalized Maximum Likelihood Code-length
2.5.2 Penalty Selection with LNML*
2.5.3 Analytical Bounds for LNML Code-length*
2.6 Bayesian Estimation
2.6.1 Bayesian Estimator
2.6.2 Gibbs Sampler
2.7 Summary of This Chapter
References
3 Model Selection
3.1 Model Selection
3.1.1 Problem Setting
3.1.2 Akaike's Information Criterion
3.1.3 Bayesian Information Criterion
3.1.4 Minimum Message Length Criterion
3.1.5 Cross-Validation
3.2 Minimum Description Length Criterion
3.2.1 MDL Criterion
3.2.2 Consistency
3.2.3 Estimation Optimality*
3.2.4 Rate of Convergence
3.2.5 Sequential Normalized Maximum Likelihood Criterion
3.3 Applications of MDL Criterion
3.3.1 Histogram Density Estimation
3.3.2 Non-negative Matrix Factorization*
3.3.3 Decision Tree Learning
3.3.4 Dimensionality Selection for Word Embedding
3.3.5 Time Series Model Selection
3.3.6 Multivariate Linear Regression*
3.4 Summary of This Chapter
References
4 Latent Variable Model Selection
4.1 MDL Approach to Latent Variable Model Selection
4.1.1 Non-identifiability for Latent Variable Models
4.2 Latent Stochastic Complexity
4.2.1 LSC Criterion
4.2.2 Computational Complexity of LSC
4.3 Decomposed Normalized Maximum Likelihood Criterion
4.3.1 DNML Criterion
4.3.2 Computational Complexity of DNML
4.3.3 Estimation Optimality*
4.4 Applications of the DNML Criterion
4.4.1 Naïve Bayes Models
4.4.2 Latent Dirichlet Allocation Model
4.4.3 Stochastic Block Model
4.4.4 Mixed Membership Stochastic Block Models*
4.4.5 Multivariate Gaussian Mixture Model
4.4.6 Approximating Inter-event Time Distributions
4.4.7 Graph Summarization
4.4.8 Probabilistic Principal Component Analysis*
4.4.9 Non-negative Matrix Factorization Revisited*
4.4.10 Dimensionality Selection of Hyperbolic Graph Embedding*
4.4.11 Neural Network Model Selection*
4.5 Summary of This Chapter
References
5 Sequential Prediction
5.1 Sequential Prediction
5.1.1 Sequential Stochastic Prediction Algorithms
5.1.2 Online Maximum Likelihood Prediction Algorithm
5.1.3 Bayesian Prediction Algorithm
5.1.4 Sequentially Normalized Maximum Likelihood Prediction Algorithm
5.1.5 Discounting SNML Prediction with Application to Change Point Detection*
5.1.6 Sequentially Luckiness Normalized Maximum Likelihood Prediction Algorithm
5.2 High-Dimensional Asymptotics
5.2.1 Prediction with Spike and Tails Prior*
5.3 Summary of This Chapter
References
6 MDL Change Detection
6.1 A System of Change Detection
6.2 Parameter Change Detection
6.2.1 MDL Change Statistics
6.2.2 Performance on Hypothesis Testing
6.2.3 Sequential MDL Change Detection
6.2.4 Network Change Detection
6.2.5 Adaptive Windowing Method for Sequential MDL Change Detection
6.3 Parameter Change Sign Detection
6.3.1 Differential MDL Change Statistics*
6.3.2 COVID-19 Pandemic Analysis*
6.4 Latent Structure Change Detection
6.4.1 Burst Detection
6.4.2 Switching Distribution*
6.4.3 Tracking Best Experts*
6.4.4 Dynamic Model Selection
6.4.5 Sequential Clustering Change Detection
6.4.6 MDL Model Change Statistics
6.4.7 Hierarchical Change Detection for Latent Variable Models*
6.5 Latent Structure Change Sign Detection
6.5.1 Structural Entropy
6.6 Summary of This Chapter
References
7 Continuous Model Selection
7.1 Descriptive Dimensionality
7.1.1 Motivation of Continuous Model Selection
7.1.2 Definition of Descriptive Dimensionality*
7.2 Continuous Model Selection
7.2.1 Continuous Model Selection with Ddim
7.2.2 Model Change Sign Detection Algorithms
7.3 Applications of Continuous Model Selection
7.3.1 Continuous Model Selection for GMM
7.3.2 Continuous Model Selection for AR Model*
7.3.3 Real Data Experiments for Model Change Sign Detection
7.4 Summary of This Chapter
References
8 Extension of Stochastic Complexity
8.1 Extended Stochastic Complexity
8.1.1 Decision-Theoretic Learning
8.1.2 Extended Stochastic Complexity
8.2 Function Estimation with ESC*
8.3 Sequential Learning with ESC*
8.4 Generalized Normalized Maximum Likelihood*
8.5 Summary of This Chapter
References
9 Mathematical Preliminaries
9.1 Analytics
9.1.1 Basic Notation
9.1.2 Measure
9.1.3 Max, Min, Sup, Inf
9.1.4 Norm and Distance
9.1.5 Taylor's Expansion
9.1.6 Gamma Function and Stirling Formula
9.1.7 Fubini's Theorem
9.1.8 Fourier Transform
9.1.9 Lebesgue's Dominated Convergence Theorem
9.2 Probability
9.2.1 Probability Space
9.2.2 Random Variables
9.2.3 Expectation
9.2.4 Variable Transformation
9.2.5 Inequalities
9.2.6 Convergence
9.2.7 Law of Large Numbers and Central Limit Theorem
9.2.8 Entropy and Kullback–Leibler Divergence
9.2.9 Probability Distance
9.2.10 Monte Carlo Sampling Method
9.3 Linear Algebra
9.3.1 Eigenvalues and Quadratic Forms
9.3.2 Matrix Algebra
9.4 Time Complexity Notations
References
Index