Natural language processing (NLP) underwent a profound transformation in the mid-1980s, when it shifted to the heavy use of corpora and data-driven techniques for analyzing language.
Since then, the use of statistical techniques in NLP has evolved in several ways. One such evolution took place in the late 1990s and early 2000s, when full-fledged Bayesian machinery was introduced to NLP. The Bayesian approach addresses various shortcomings of the frequentist approach and enriches it, especially in the unsupervised setting, where statistical learning is done without target prediction examples.
In this book, we cover the methods and algorithms needed to read Bayesian learning papers in NLP fluently and to conduct research in the area. These methods and algorithms are partly borrowed from machine learning and statistics and partly developed "in-house" within NLP. We cover inference techniques such as Markov chain Monte Carlo sampling and variational inference, Bayesian estimation, and nonparametric modeling. In response to rapid changes in the field, this second edition includes a new chapter on representation learning and neural networks in the Bayesian context. We also cover fundamental concepts in Bayesian statistics, such as prior distributions, conjugacy, and generative modeling. Finally, we review some fundamental modeling techniques in NLP, such as grammar modeling, neural networks and representation learning, and their use with Bayesian analysis.
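To give a flavor of the conjugacy idea mentioned above, here is a minimal sketch (not taken from the book; the word types and counts are invented for illustration) of the Dirichlet-multinomial update that underlies many Bayesian NLP models: with a Dirichlet prior over word probabilities, observed counts yield a posterior that is again Dirichlet.

```python
# Hedged sketch of Dirichlet-multinomial conjugacy: a Dirichlet prior over the
# probabilities of three word types, updated with toy observed counts.
alpha = [1.0, 1.0, 1.0]   # symmetric Dirichlet prior (pseudo-counts)
counts = [5, 2, 0]        # observed word counts in a toy corpus

# Conjugacy: the posterior is Dirichlet with parameters alpha + counts.
posterior_alpha = [a + c for a, c in zip(alpha, counts)]

# Posterior mean estimate of each word's probability.
total = sum(posterior_alpha)
posterior_mean = [a / total for a in posterior_alpha]

print(posterior_alpha)  # [6.0, 3.0, 1.0]
print(posterior_mean)   # [0.6, 0.3, 0.1]
```

Because the posterior has the same form as the prior, the update is a simple addition of counts; this closed form is one reason conjugate priors recur throughout the inference algorithms the book covers.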
Author: Shay Cohen
Series: Synthesis Lectures on Human Language Technologies
Series Editor: Graeme Hirst
Edition: 2
Publisher: Morgan & Claypool
Year: 2019
Language: English
Pages: 343
List of Figures......Page 21
Preface (First Edition)......Page 27
Acknowledgments (First Edition)......Page 31
Preface (Second Edition)......Page 33
Probability Measures......Page 35
Random Variables......Page 36
Continuous and Discrete Random Variables......Page 37
Joint Distribution over Multiple Random Variables......Page 38
Conditional Distributions......Page 39
Bayes' Rule......Page 40
Independent and Conditionally Independent Random Variables......Page 41
Exchangeable Random Variables......Page 42
Expectations of Random Variables......Page 43
Parametric vs. Nonparametric Models......Page 45
Inference with Models......Page 46
Generative Models......Page 48
Independence Assumptions in Models......Page 50
Directed Graphical Models......Page 51
Learning from Data Scenarios......Page 53
Bayesian and Frequentist Philosophy (Tip of the Iceberg)......Page 56
Summary......Page 57
Exercises......Page 58
Introduction......Page 59
Overview: Where Bayesian Statistics and NLP Meet......Page 60
First Example: The Latent Dirichlet Allocation Model......Page 63
The Dirichlet Distribution......Page 68
Inference......Page 70
Summary......Page 73
Second Example: Bayesian Text Regression......Page 74
Conclusion and Summary......Page 75
Exercises......Page 76
Priors......Page 77
Conjugate Priors......Page 78
Conjugate Priors and Normalization Constants......Page 81
The Use of Conjugate Priors with Latent Variable Models......Page 82
Mixture of Conjugate Priors......Page 84
Renormalized Conjugate Distributions......Page 85
Discussion: To Be or Not to Be Conjugate?......Page 86
Priors Over Multinomial and Categorical Distributions......Page 87
The Dirichlet Distribution Revisited......Page 89
The Logistic Normal Distribution......Page 93
Summary......Page 99
Non-Informative Priors......Page 100
Jeffreys Prior......Page 101
Conjugacy and Exponential Models......Page 103
Multiple Parameter Draws in Models......Page 104
Conclusion and Summary......Page 107
Exercises......Page 109
Bayesian Estimation......Page 111
Learning with Latent Variables: Two Views......Page 112
Maximum a Posteriori Estimation......Page 113
Posterior Approximations Based on the MAP Solution......Page 121
Decision-Theoretic Point Estimation......Page 123
Empirical Bayes......Page 124
Summary......Page 127
Exercises......Page 129
Sampling Methods......Page 131
MCMC Algorithms: Overview......Page 132
NLP Model Structure for MCMC Inference......Page 133
Partitioning the Latent Variables......Page 134
Gibbs Sampling......Page 135
Collapsed Gibbs Sampling......Page 139
Operator View......Page 143
Parallelizing the Gibbs Sampler......Page 145
The Metropolis–Hastings Algorithm......Page 146
Variants of Metropolis–Hastings......Page 148
Slice Sampling......Page 149
The Use of Slice Sampling and Auxiliary Variable Sampling in NLP......Page 151
Simulated Annealing......Page 152
Convergence of MCMC Algorithms......Page 153
Markov Chain: Basic Theory......Page 155
Sampling Algorithms Not in the MCMC Realm......Page 157
Monte Carlo Integration......Page 160
Computability of Distribution vs. Sampling......Page 161
Particle Filtering......Page 162
Conclusion and Summary......Page 164
Exercises......Page 166
Variational Inference
Variational Bound on Marginal Log-Likelihood......Page 169
Mean-Field Approximation......Page 172
Mean-Field Variational Inference Algorithm......Page 173
Dirichlet-Multinomial Variational Inference......Page 175
Connection to the Expectation-Maximization Algorithm......Page 179
Empirical Bayes with Variational Inference......Page 181
Initialization of the Inference Algorithms......Page 182
Convergence Diagnosis......Page 183
The Use of Variational Inference for Decoding......Page 184
Online Variational Inference......Page 185
Summary......Page 186
Exercises......Page 187
Nonparametric Priors......Page 189
The Dirichlet Process: Three Views......Page 190
The Stick-Breaking Process......Page 191
The Chinese Restaurant Process......Page 193
Inference with Dirichlet Process Mixtures......Page 195
The Hierarchical Dirichlet Process......Page 199
The Pitman–Yor Process......Page 201
Pitman–Yor Process for Language Modeling......Page 203
Power-Law Behavior of the Pitman–Yor Process......Page 204
Discussion......Page 205
The Indian Buffet Process......Page 206
Distance-Dependent Chinese Restaurant Process......Page 207
Sequence Memoizers......Page 208
Summary......Page 209
Exercises......Page 210
Bayesian Grammar Models......Page 211
Bayesian Hidden Markov Models......Page 212
Hidden Markov Models with an Infinite State Space......Page 213
Probabilistic Context-Free Grammars......Page 215
PCFGs as a Collection of Multinomials......Page 218
Basic Inference Algorithms for PCFGs......Page 219
Priors on PCFGs......Page 223
Monte Carlo Inference with Bayesian PCFGs......Page 224
Variational Inference with Bayesian PCFGs......Page 226
Adaptor Grammars......Page 227
Pitman–Yor Adaptor Grammars......Page 228
Stick-Breaking View of PYAG......Page 230
Inference with PYAG......Page 231
Hierarchical Dirichlet Process PCFGs (HDP-PCFGs)......Page 234
Extensions to the HDP-PCFG Model......Page 235
Dependency Grammars......Page 236
State-Split Nonparametric Dependency Models......Page 237
Synchronous Grammars......Page 239
Part-of-Speech Tagging......Page 240
Grammar Induction......Page 242
Further Reading......Page 243
Summary......Page 245
Exercises......Page 246
Representation Learning and Neural Networks......Page 247
Neural Networks and Representation Learning: Why Now?......Page 248
Word Embeddings......Page 251
Skip-Gram Models for Word Embeddings......Page 252
Bayesian Skip-Gram Word Embeddings......Page 254
Discussion......Page 255
Neural Networks......Page 256
Frequentist Estimation and the Backpropagation Algorithm......Page 258
Priors on Neural Network Weights......Page 262
Recurrent and Recursive Neural Networks......Page 264
Vanishing and Exploding Gradient Problem......Page 266
Neural Encoder-Decoder Models......Page 269
Convolutional Neural Networks......Page 273
Regularization......Page 276
Hyperparameter Tuning......Page 277
Generative Modeling with Neural Networks......Page 279
Variational Autoencoders......Page 280
Generative Adversarial Networks......Page 286
Conclusion......Page 287
Exercises......Page 290
Closing Remarks......Page 291
Entropy and Cross Entropy......Page 293
Jensen's Inequality......Page 294
Transformation of Continuous Random Variables......Page 295
The Expectation-Maximization Algorithm......Page 296
Basic Concepts in Optimization......Page 297
Stochastic Gradient Descent......Page 298
Constrained Optimization......Page 299
The Dirichlet Distribution......Page 301
The Gamma Distribution......Page 303
The Laplace Distribution......Page 304
The Logistic Normal Distribution......Page 305
The Gumbel Distribution......Page 306
Bibliography......Page 309
Author's Biography......Page 339
Index