A First Course in Machine Learning

"A First Course in Machine Learning by Simon Rogers and Mark Girolami is the best introductory book for ML currently available. It combines rigor and precision with accessibility, starts from a detailed explanation of the basic foundations of Bayesian analysis in the simplest of settings, and goes all the way to the frontiers of the subject such as infinite mixture models, GPs, and MCMC."
―Devdatt Dubhashi, Professor, Department of Computer Science and Engineering, Chalmers University, Sweden

"This textbook manages to be easier to read than other comparable books in the subject while retaining all the rigorous treatment needed. The new chapters put it at the forefront of the field by covering topics that have become mainstream in machine learning over the last decade."
―Daniel Barbara, George Mason University, Fairfax, Virginia, USA

"The new edition of A First Course in Machine Learning by Rogers and Girolami is an excellent introduction to the use of statistical methods in machine learning. The book introduces concepts such as mathematical modeling, inference, and prediction, providing ‘just in time’ the essential background on linear algebra, calculus, and probability theory that the reader needs to understand these concepts."
―Daniel Ortiz-Arroyo, Associate Professor, Aalborg University Esbjerg, Denmark

"I was impressed by how closely the material aligns with the needs of an introductory course on machine learning, which is its greatest strength…Overall, this is a pragmatic and helpful book, which is well-aligned to the needs of an introductory course and one that I will be looking at for my own students in coming months."
―David Clifton, University of Oxford, UK

"The first edition of this book was already an excellent introductory text on machine learning for an advanced undergraduate or taught masters level course, or indeed for anybody who wants to learn about an interesting and important field of computer science. The additional chapters of advanced material on Gaussian process, MCMC and mixture modeling provide an ideal basis for practical projects, without disturbing the very clear and readable exposition of the basics contained in the first part of the book."
―Gavin Cawley, Senior Lecturer, School of Computing Sciences, University of East Anglia, UK

"This book could be used for junior/senior undergraduate students or first-year graduate students, as well as individuals who want to explore the field of machine learning…The book introduces not only the concepts but the underlying ideas on algorithm implementation from a critical thinking perspective."
―Guangzhi Qu, Oakland University, Rochester, Michigan, USA

Author(s): Simon Rogers, Mark Girolami
Series: Machine Learning & Pattern Recognition
Edition: 2
Publisher: Chapman and Hall/CRC
Year: 2016

Language: English
Pages: 427
Tags: Statistics;Education & Reference;Business & Money;Machine Theory;AI & Machine Learning;Computer Science;Computers & Technology;Data Mining;Databases & Big Data;Computers & Technology;Statistics;Applied;Mathematics;Science & Math;Business & Finance;Accounting;Banking;Business Communication;Business Development;Business Ethics;Business Law;Economics;Entrepreneurship;Finance;Human Resources;International Business;Investments & Securities;Management;Marketing;Real Estate;Sales;New, Used & Rental Tex

Contents ... 6
List of Tables ... 15
List of Figures ... 16
Preface to the First Edition ... 25
Preface to the Second Edition ... 27
I Basic Topics ... 28
Chapter 1 Linear Modelling: A Least Squares Approach ... 29
1.1 LINEAR MODELLING ... 29
1.1.1 Defining the model ... 30
1.1.2 Modelling assumptions ... 31
1.1.3 Defining a good model ... 32
1.1.4 The least squares solution – a worked example ... 34
1.1.5 Worked example ... 38
1.1.6 Least squares fit to the Olympic data ... 39
1.1.7 Summary ... 40
1.2 MAKING PREDICTIONS ... 41
1.2.1 A second Olympic dataset ... 41
1.2.2 Summary ... 43
1.3 VECTOR/MATRIX NOTATION ... 43
1.3.1 Example ... 51
1.3.2 Numerical example ... 52
1.3.3 Making predictions ... 53
1.3.4 Summary ... 53
1.4 NON-LINEAR RESPONSE FROM A LINEAR MODEL ... 54
1.5 GENERALISATION AND OVER-FITTING ... 57
1.5.1 Validation data ... 57
1.5.2 Cross-validation ... 58
1.5.3 Computational scaling of K-fold cross-validation ... 60
1.6 REGULARISED LEAST SQUARES ... 60
1.7 EXERCISES ... 63
Chapter 2 Linear Modelling: A Maximum Likelihood Approach ... 66
2.1 ERRORS AS NOISE ... 66
2.1.1 Thinking generatively ... 67
2.2 RANDOM VARIABLES AND PROBABILITY ... 68
2.2.1 Random variables ... 68
2.2.2 Probability and distributions ... 69
2.2.3 Adding probabilities ... 71
2.2.4 Conditional probabilities ... 71
2.2.5 Joint probabilities ... 72
2.2.6 Marginalisation ... 74
2.2.7 Aside – Bayes’ rule ... 76
2.2.8 Expectations ... 77
2.3 POPULAR DISCRETE DISTRIBUTIONS ... 80
2.3.1 Bernoulli distribution ... 80
2.3.2 Binomial distribution ... 80
2.3.3 Multinomial distribution ... 81
2.4 CONTINUOUS RANDOM VARIABLES – DENSITY FUNCTIONS ... 82
2.5 POPULAR CONTINUOUS DENSITY FUNCTIONS ... 85
2.5.1 The uniform density function ... 85
2.5.2 The beta density function ... 87
2.5.3 The Gaussian density function ... 88
2.5.4 Multivariate Gaussian ... 89
2.6 SUMMARY ... 91
2.7 THINKING GENERATIVELY...CONTINUED ... 92
2.8 LIKELIHOOD ... 93
2.8.1 Dataset likelihood ... 94
2.8.2 Maximum likelihood ... 95
2.8.3 Characteristics of the maximum likelihood solution ... 98
2.8.4 Maximum likelihood favours complex models ... 100
2.9 THE BIAS-VARIANCE TRADE-OFF ... 100
2.9.1 Summary ... 101
2.10 EFFECT OF NOISE ON PARAMETER ESTIMATES ... 102
2.10.1 Uncertainty in estimates ... 103
2.10.2 Comparison with empirical values ... 108
2.10.3 Variability in model parameters – Olympic data ... 109
2.11 VARIABILITY IN PREDICTIONS ... 109
2.11.1 Predictive variability – an example ... 111
2.11.2 Expected values of the estimators ... 111
2.12 CHAPTER SUMMARY ... 116
2.13 EXERCISES ... 117
Chapter 3 The Bayesian Approach to Machine Learning ... 119
3.1 A COIN GAME ... 119
3.1.1 Counting heads ... 121
3.1.2 The Bayesian way ... 122
3.2 THE EXACT POSTERIOR ... 127
3.3 THE THREE SCENARIOS ... 128
3.3.1 No prior knowledge ... 128
3.3.2 The fair coin scenario ... 136
3.3.3 A biased coin ... 138
3.3.4 The three scenarios – a summary ... 140
3.3.5 Adding more data ... 141
3.4 MARGINAL LIKELIHOODS ... 141
3.4.1 Model comparison with the marginal likelihood ... 143
3.5 HYPERPARAMETERS ... 143
3.6 GRAPHICAL MODELS ... 144
3.7 SUMMARY ... 146
3.8 A BAYESIAN TREATMENT OF THE OLYMPIC 100 m DATA ... 146
3.8.1 The model ... 146
3.8.2 The likelihood ... 148
3.8.3 The prior ... 148
3.8.4 The posterior ... 148
3.8.5 A first-order polynomial ... 150
3.8.6 Making predictions ... 153
3.9 MARGINAL LIKELIHOOD FOR POLYNOMIAL MODEL ORDER SELECTION ... 154
3.10 CHAPTER SUMMARY ... 157
3.11 EXERCISES ... 157
3.12 FURTHER READING ... 159
Chapter 4 Bayesian Inference ... 160
4.1 NON-CONJUGATE MODELS ... 160
4.2 BINARY RESPONSES ... 161
4.2.1 A model for binary responses ... 161
4.3 A POINT ESTIMATE – THE MAP SOLUTION ... 164
4.4 THE LAPLACE APPROXIMATION ... 170
4.4.1 Laplace approximation example: Approximating a gamma density ... 171
4.4.2 Laplace approximation for the binary response model ... 173
4.5 SAMPLING TECHNIQUES ... 175
4.5.1 Playing darts ... 175
4.5.2 The Metropolis–Hastings algorithm ... 177
4.5.3 The art of sampling ... 185
4.6 CHAPTER SUMMARY ... 186
4.7 EXERCISES ... 186
4.8 FURTHER READING ... 187
Chapter 5 Classification ... 189
5.1 THE GENERAL PROBLEM ... 189
5.2 PROBABILISTIC CLASSIFIERS ... 190
5.2.1 The Bayes classifier ... 190
5.2.1.1 Likelihood – class-conditional distributions ... 191
5.2.1.2 Prior class distribution ... 191
5.2.1.3 Example – Gaussian class-conditionals ... 192
5.2.1.4 Making predictions ... 193
5.2.1.5 The naive-Bayes assumption ... 194
5.2.1.6 Example – classifying text ... 196
5.2.1.7 Smoothing ... 198
5.2.2 Logistic regression ... 200
5.2.2.1 Motivation ... 200
5.2.2.2 Non-linear decision functions ... 201
5.2.2.3 Non-parametric models – the Gaussian process ... 202
5.3 NON-PROBABILISTIC CLASSIFIERS ... 203
5.3.1 K-nearest neighbours ... 203
5.3.1.1 Choosing K ... 204
5.3.2 Support vector machines and other kernel methods ... 207
5.3.2.1 The margin ... 207
5.3.2.2 Maximising the margin ... 208
5.3.2.3 Making predictions ... 211
5.3.2.4 Support vectors ... 211
5.3.2.5 Soft margins ... 213
5.3.2.6 Kernels ... 215
5.3.3 Summary ... 218
5.4 ASSESSING CLASSIFICATION PERFORMANCE ... 218
5.4.1 Accuracy – 0/1 loss ... 218
5.4.2 Sensitivity and specificity ... 219
5.4.3 The area under the ROC curve ... 220
5.4.4 Confusion matrices ... 222
5.5 DISCRIMINATIVE AND GENERATIVE CLASSIFIERS ... 224
5.6 CHAPTER SUMMARY ... 224
5.7 EXERCISES ... 224
5.8 FURTHER READING ... 225
Chapter 6 Clustering ... 226
6.1 THE GENERAL PROBLEM ... 226
6.2 K-MEANS CLUSTERING ... 227
6.2.1 Choosing the number of clusters ... 229
6.2.2 Where K-means fails ... 231
6.2.3 Kernelised K-means ... 231
6.2.4 Summary ... 233
6.3 MIXTURE MODELS ... 234
6.3.1 A generative process ... 235
6.3.2 Mixture model likelihood ... 236
6.3.3 The EM algorithm ... 238
6.3.3.1 Updating π_k ... 239
6.3.3.2 Updating μ_k ... 240
6.3.3.3 Updating Σ_k ... 241
6.3.3.4 Updating q_nk ... 242
6.3.3.5 Some intuition ... 243
6.3.4 Example ... 244
6.3.5 EM finds local optima ... 245
6.3.6 Choosing the number of components ... 247
6.3.7 Other forms of mixture component ... 249
6.3.8 MAP estimates with EM ... 251
6.3.9 Bayesian mixture models ... 252
6.4 CHAPTER SUMMARY ... 253
6.5 EXERCISES ... 253
6.6 FURTHER READING ... 254
Chapter 7 Principal Components Analysis and Latent Variable Models ... 255
7.1 THE GENERAL PROBLEM ... 255
7.1.1 Variance as a proxy for interest ... 256
7.2 PRINCIPAL COMPONENTS ANALYSIS ... 258
7.2.1 Choosing D ... 262
7.2.2 Limitations of PCA ... 263
7.3 LATENT VARIABLE MODELS ... 264
7.3.1 Mixture models as latent variable models ... 264
7.3.2 Summary ... 265
7.4 VARIATIONAL BAYES ... 265
7.4.1 Choosing Q(θ) ... 267
7.4.2 Optimising the bound ... 268
7.5 A PROBABILISTIC MODEL FOR PCA ... 268
7.5.1 Q_τ(τ) ... 270
7.5.2 Q_{x_n}(x_n) ... 272
7.5.3 Q_{w_m}(w_m) ... 273
7.5.4 The required expectations ... 274
7.5.5 The algorithm ... 274
7.5.6 An example ... 276
7.6 MISSING VALUES ... 276
7.6.1 Missing values as latent variables ... 279
7.6.2 Predicting missing values ... 280
7.7 NON-REAL-VALUED DATA ... 280
7.7.1 Probit PPCA ... 280
7.7.2 Visualising parliamentary data ... 284
7.7.2.1 Aside – relationship to classification ... 287
7.8 CHAPTER SUMMARY ... 289
7.9 EXERCISES ... 290
7.10 FURTHER READING ... 290
II Advanced Topics ... 292
Chapter 8 Gaussian Processes ... 293
8.1 PROLOGUE – NON-PARAMETRIC MODELS ... 293
8.2 GAUSSIAN PROCESS REGRESSION ... 296
8.2.1 The Gaussian process prior ... 296
8.2.2 Noise-free regression ... 301
8.2.3 Noisy regression ... 305
8.2.4 Summary ... 306
8.2.5 Noisy regression – an alternative route ... 307
8.2.6 Alternative covariance functions ... 310
8.2.7 ARD ... 314
8.2.9 Summary ... 315
8.3 GAUSSIAN PROCESS CLASSIFICATION ... 315
8.3.1 A classification likelihood ... 315
8.3.2 A classification roadmap ... 317
8.3.3 The point estimate approximation ... 318
8.3.4 Propagating uncertainty through the sigmoid ... 321
8.3.5 The Laplace approximation ... 323
8.3.6 Summary ... 326
8.4 HYPERPARAMETER OPTIMISATION ... 327
8.5 EXTENSIONS ... 329
8.5.1 Non-zero mean ... 329
8.5.2 Multiclass classification ... 329
8.5.3 Other likelihood functions and models ... 329
8.5.4 Other inference schemes ... 329
8.6 CHAPTER SUMMARY ... 330
8.7 EXERCISES ... 330
8.8 FURTHER READING ... 332
Chapter 9 Markov Chain Monte Carlo Sampling ... 333
9.1 GIBBS SAMPLING ... 334
9.2 EXAMPLE: GIBBS SAMPLING FOR GP CLASSIFICATION ... 338
9.2.1 Conditional densities for GP classification via Gibbs sampling ... 340
9.2.2 Summary ... 342
9.3 WHY DOES MCMC WORK? ... 345
9.4 SOME SAMPLING PROBLEMS AND SOLUTIONS ... 349
9.4.1 Burn-in and convergence ... 349
9.4.2 Autocorrelation ... 351
9.4.3 Summary ... 355
9.5 ADVANCED SAMPLING TECHNIQUES ... 356
9.5.1 Adaptive proposals and Hamiltonian Monte Carlo ... 356
9.5.2 Approximate Bayesian computation ... 359
9.5.3 Population MCMC and temperature schedules ... 363
9.5.4 Sequential Monte Carlo ... 364
9.6 CHAPTER SUMMARY ... 366
9.7 EXERCISES ... 367
9.8 FURTHER READING ... 368
Chapter 10 Advanced Mixture Modelling ... 369
10.1 A GIBBS SAMPLER FOR MIXTURE MODELS ... 369
10.2 COLLAPSED GIBBS SAMPLING ... 377
10.3 AN INFINITE MIXTURE MODEL ... 382
10.3.1 The Chinese restaurant process ... 384
10.3.2 Inference in the infinite mixture model ... 385
10.3.3 Summary ... 389
10.4 DIRICHLET PROCESSES ... 389
10.4.1 Hierarchical Dirichlet processes ... 395
10.4.2 Summary ... 398
10.5 BEYOND STANDARD MIXTURES – TOPIC MODELS ... 398
10.6 CHAPTER SUMMARY ... 400
10.7 EXERCISES ... 401
10.8 FURTHER READING ... 403
Glossary ... 404
Index ... 412