A comprehensive introduction to Support Vector Machines and related kernel methods.
In the 1990s, a new type of learning algorithm was developed, based on results from statistical learning theory: the Support Vector Machine (SVM). This gave rise to a new class of theoretically elegant learning machines that use a central concept of SVMs, kernels, for a number of learning tasks. Kernel machines provide a modular framework that can be adapted to different tasks and domains by the choice of the kernel function and the base algorithm. They are replacing neural networks in a variety of fields, including engineering, information retrieval, and bioinformatics.
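To make this modularity concrete, here is a minimal sketch (not from the book) that trains one and the same base algorithm, a support vector classifier, with three different kernel functions; the use of scikit-learn and the toy dataset are assumptions for illustration only.

```python
# A minimal sketch of the modularity described above: keep the base
# algorithm fixed and swap in different kernel functions.
# Dataset, library (scikit-learn), and parameters are illustrative assumptions.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel)      # same base algorithm (SV classifier)
    clf.fit(X_train, y_train)     # only the kernel function changes
    print(f"{kernel:6s} test accuracy: {clf.score(X_test, y_test):.2f}")
```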
Learning with Kernels provides an introduction to SVMs and related kernel methods. Although the book begins with the basics, it also includes the latest research. It provides all of the concepts necessary for a reader with basic mathematical knowledge to enter the world of machine learning using theoretically well-founded yet easy-to-use kernel algorithms, and to understand and apply the powerful algorithms that have been developed over the last few years.
Author(s): Bernhard Schölkopf, Alexander J. Smola
Series: Adaptive Computation and Machine Learning
Edition: 1st
Publisher: The MIT Press
Year: 2001
Language: English
Pages: 644
Contents ... 8
Preface ... 16
1 A Tutorial Introduction ... 20
1.1 Data Representation and Similarity ... 20
1.2 A Simple Pattern Recognition Algorithm ... 23
1.3 Some Insights From Statistical Learning Theory ... 25
1.4 Hyperplane Classifiers ... 30
1.5 Support Vector Classification ... 34
1.6 Support Vector Regression ... 36
1.8 Empirical Results and Implementations ... 40
I CONCEPTS AND TOOLS ... 42
2 Kernels ... 44
2.1 Product Features ... 45
2.2 The Representation of Similarities in Linear Spaces ... 48
2.2.1 Positive Definite Kernels ... 49
2.2.2 The Reproducing Kernel Map ... 51
2.2.3 Reproducing Kernel Hilbert Spaces ... 54
2.2.4 The Mercer Kernel Map ... 55
2.2.5 The Shape of the Mapped Data in Feature Space ... 59
2.2.6 The Empirical Kernel Map ... 61
2.2.7 A Kernel Map Defined from Pairwise Similarities ... 63
2.3 Examples and Properties of Kernels ... 64
2.4 The Representation of Dissimilarities in Linear Spaces ... 67
2.4.1 Conditionally Positive Definite Kernels ... 67
2.4.2 Hilbert Space Representation of CPD Kernels ... 69
2.4.3 Higher Order CPD Kernels ... 72
2.5 Summary ... 74
2.6 Problems ... 74
3 Risk and Loss Functions ... 80
3.1 Loss Functions ... 81
3.1.1 Binary Classification ... 81
3.1.2 Regression ... 83
3.2 Test Error and Expected Risk ... 84
3.2.1 Exact Quantities ... 84
3.2.2 Approximations ... 85
3.3 A Statistical Perspective ... 87
3.3.1 Maximum Likelihood Estimation ... 87
3.3.2 Efficiency ... 90
3.4 Robust Estimators ... 94
3.4.1 Robustness via Loss Functions ... 95
3.4.2 Efficiency and the ε-Insensitive Loss Function ... 97
3.4.3 Adaptive Loss Functions ... 99
3.4.4 Optimal Choice of ε ... 100
3.5 Summary ... 102
3.6 Problems ... 103
4 Regularization ... 106
4.1 The Regularized Risk Functional ... 107
4.2 The Representer Theorem ... 108
4.3 Regularization Operators ... 111
4.4 Translation Invariant Kernels ... 115
4.4.1 Bn-Splines ... 117
4.4.2 Gaussian Kernels ... 118
4.4.3 Dirichlet Kernels ... 120
4.4.4 Periodic Kernels ... 122
4.4.5 Practical Implications ... 124
4.5 Translation Invariant Kernels in Higher Dimensions ... 124
4.5.1 Basic Tools ... 126
4.5.2 Regularization Properties of Kernels in R^N ... 126
4.5.3 A Note on Other Invariances ... 128
4.6 Dot Product Kernels ... 129
4.6.1 Conditions for Positivity and Eigenvector Decompositions ... 130
4.6.2 Examples and Applications ... 131
4.7 Multi-Output Regularization ... 132
4.8 Semiparametric Regularization ... 134
4.9 Coefficient Based Regularization ... 137
4.9.1 Ridge Regression ... 138
4.9.2 Linear Programming Regularization ... 139
4.9.3 Mixed Semiparametric Regularizers ... 139
4.10 Summary ... 140
4.11 Problems ... 141
5 Elements of Statistical Learning Theory ... 144
5.1 Introduction ... 144
5.2 The Law of Large Numbers ... 147
5.3 When Does Learning Work: the Question of Consistency ... 150
5.4 Uniform Convergence and Consistency ... 150
5.5 How to Derive a VC Bound ... 153
5.5.1 The Union Bound ... 153
5.5.2 Symmetrization ... 154
5.5.3 The Shattering Coefficient ... 155
5.5.4 Uniform Convergence Bounds ... 155
5.5.5 Confidence Intervals ... 157
5.5.6 The VC Dimension and Other Capacity Concepts ... 158
5.6 A Model Selection Example ... 163
5.7 Summary ... 165
5.8 Problems ... 165
6 Optimization ... 168
6.1 Convex Optimization ... 169
6.2 Unconstrained Problems ... 173
6.2.1 Functions of One Variable ... 173
6.2.2 Functions of Several Variables: Gradient Descent ... 176
6.2.3 Convergence Properties of Gradient Descent ... 177
6.2.4 Functions of Several Variables: Conjugate Gradient Descent ... 179
6.2.5 Predictor Corrector Methods ... 182
6.3 Constrained Problems ... 184
6.3.1 Optimality Conditions ... 185
6.3.2 Duality and KKT-Gap ... 188
6.3.3 Linear and Quadratic Programs ... 191
6.4 Interior Point Methods ... 194
6.4.1 Sufficient Conditions for a Solution ... 194
6.4.2 Solving the Equations ... 195
6.4.3 Updating ... 196
6.4.4 Initial Conditions and Stopping Criterion ... 196
6.5 Maximum Search Problems ... 198
6.5.1 Random Subset Selection ... 198
6.5.2 Random Evaluation ... 200
6.5.3 Greedy Optimization Strategies ... 201
6.6 Summary ... 202
6.7 Problems ... 203
II SUPPORT VECTOR MACHINES ... 206
7 Pattern Recognition ... 208
7.1 Separating Hyperplanes ... 208
7.2 The Role of the Margin ... 211
7.3 Optimal Margin Hyperplanes ... 215
7.4 Nonlinear Support Vector Classifiers ... 219
7.5 Soft Margin Hyperplanes ... 223
7.6 Multi-Class Classification ... 230
7.6.1 One Versus the Rest ... 230
7.6.2 Pairwise Classification ... 231
7.6.3 Error-Correcting Output Coding ... 232
7.6.4 Multi-Class Objective Functions ... 232
7.7 Variations on a Theme ... 233
7.8 Experiments ... 234
7.8.1 Digit Recognition Using Different Kernels ... 234
7.8.2 Universality of the Support Vector Set ... 238
7.8.3 Other Applications ... 240
7.9 Summary ... 241
7.10 Problems ... 241
8 Single-Class Problems: Quantile Estimation and Novelty Detection ... 246
8.1 Introduction ... 247
8.2 A Distribution’s Support and Quantiles ... 248
8.3 Algorithms ... 249
8.4 Optimization ... 253
8.5 Theory ... 255
8.6 Discussion ... 260
8.7 Experiments ... 262
8.8 Summary ... 266
8.9 Problems ... 267
9 Regression Estimation ... 270
9.1 Linear Regression with ε-Insensitive Loss Function ... 270
9.2 Dual Problems ... 273
9.2.2 More General Loss Functions ... 275
9.2.3 The Bigger Picture ... 278
9.3 ν-SV Regression ... 279
9.4 Convex Combinations and ℓ1-Norms ... 285
9.5 Parametric Insensitivity Models ... 288
9.6 Applications ... 291
9.7 Summary ... 292
9.8 Problems ... 293
10 Implementation ... 298
10.1 Tricks of the Trade ... 300
10.1.1 Stopping Criterion ... 300
10.1.2 Restarting with Different Parameters ... 304
10.1.3 Caching ... 305
10.1.4 Shrinking the Training Set ... 306
10.2 Sparse Greedy Matrix Approximation ... 307
10.2.1 Sparse Approximations ... 307
10.2.2 Iterative Methods and Random Sets ... 309
10.2.3 Optimal and Greedy Selections ... 310
10.2.4 Experiments ... 312
10.3 Interior Point Algorithms ... 314
10.3.1 Solving the Equations ... 315
10.3.2 Special Considerations for Classification ... 316
10.3.3 Special Considerations for SV Regression ... 317
10.3.4 Large Scale Problems ... 318
10.4 Subset Selection Methods ... 319
10.4.1 Chunking ... 319
10.4.2 Working Set Algorithms ... 320
10.4.3 Selection Strategies ... 321
10.5 Sequential Minimal Optimization ... 324
10.5.1 Analytic Solutions ... 324
10.5.2 Classification ... 326
10.5.3 Regression ... 327
10.5.4 Computing the Offset b and Optimality Criteria ... 329
10.5.5 Selection Rules ... 330
10.6 Iterative Methods ... 331
10.6.1 Gradient Descent ... 334
10.6.2 Lagrangian Support Vector Machines ... 337
10.6.3 Online Extensions ... 339
10.7 Summary ... 346
10.7.1 Topics We Did Not Cover ... 346
10.7.2 Topics We Covered ... 347
10.7.3 Future Developments and Code ... 348
10.8 Problems ... 348
11 Incorporating Invariances ... 352
11.1 Prior Knowledge ... 352
11.2 Transformation Invariance ... 354
11.3 The Virtual SV Method ... 356
11.4 Constructing Invariance Kernels ... 362
11.4.1 Invariance in Input Space ... 363
11.4.2 Invariance in Feature Space ... 367
11.4.3 Experiments ... 371
11.5 The Jittered SV Method ... 373
11.6 Summary ... 375
11.7 Problems ... 376
12 Learning Theory Revisited ... 378
12.1 Concentration of Measure Inequalities ... 379
12.1.1 McDiarmid’s Bound ... 379
12.1.2 Uniform Stability and Convergence ... 380
12.1.3 Uniform Stability of Regularization Networks ... 382
12.2 Leave-One-Out Estimates ... 385
12.2.1 Theoretical Background ... 385
12.2.2 Lagrange Multiplier Estimates ... 388
12.2.3 The Span Bound for Classification ... 389
12.2.4 The Span Bound for Quantile Estimation ... 391
12.2.5 Methods from Statistical Physics ... 395
12.3 PAC-Bayesian Bounds ... 400
12.3.1 Gibbs and Bayes Classifiers ... 400
12.3.2 PAC-Bayesian Bounds for Single Classifiers ... 402
12.3.3 PAC-Bayesian Bounds for Combinations of Classifiers ... 405
12.3.4 Applications to Large Margin Classi?ers ... 408
12.4 Operator-Theoretic Methods in Learning Theory ... 410
12.4.1 Scale-Sensitivity and the Fat Shattering Dimension ... 410
12.4.2 Entropy and Covering Numbers ... 411
12.4.3 Generalization Bounds via Uniform Convergence ... 413
12.4.4 Entropy Numbers for Kernel Machines ... 415
12.4.5 Discrete Spectra of Convolution Operators ... 419
12.4.6 Covering Numbers for Given Decay Rates ... 421
12.5 Summary ... 422
12.6 Problems ... 423
III KERNEL METHODS ... 424
13 Designing Kernels ... 426
13.1 Tricks for Constructing Kernels ... 427
13.2 String Kernels ... 431
13.3 Locality-Improved Kernels ... 433
13.3.1 Image Processing ... 433
13.3.2 DNA Start Codon Recognition ... 435
13.4 Natural Kernels ... 437
13.4.1 Natural Kernels ... 437
13.4.2 The Natural Regularization Operator ... 439
13.4.3 The Feature Map of Natural Kernels ... 440
13.5 Summary ... 442
13.6 Problems ... 442
14 Kernel Feature Extraction ... 446
14.1 Introduction ... 446
14.2 Kernel PCA ... 448
14.2.1 Nonlinear PCA as an Eigenvalue Problem ... 448
14.2.2 Properties of Kernel PCA ... 450
14.2.3 Comparison to Other Methods ... 453
14.3 Kernel PCA Experiments ... 456
14.4 A Framework for Feature Extraction ... 461
14.4.1 Principal Component Analysis ... 461
14.4.2 Kernel PCA ... 462
14.4.3 Sparse Kernel Feature Analysis ... 462
14.4.4 Projection Pursuit ... 464
14.4.5 Kernel Projection Pursuit ... 464
14.4.6 Connections to Supervised Learning ... 465
14.5 Algorithms for Sparse KFA ... 466
14.5.1 Solution by Maximum Search ... 466
14.5.2 Sequential Decompositions ... 466
14.5.3 A Probabilistic Speedup ... 468
14.5.4 A Quantile Trick ... 469
14.5.5 Theoretical Analysis ... 469
14.6 KFA Experiments ... 469
14.7 Summary ... 470
14.8 Problems ... 471
15 Kernel Fisher Discriminant ... 476
15.1 Introduction ... 476
15.2 Fisher’s Discriminant in Feature Space ... 477
15.3 Efficient Training of Kernel Fisher Discriminants ... 479
15.4 Probabilistic Outputs ... 483
15.5 Experiments ... 485
15.6 Summary ... 486
15.7 Problems ... 487
16 Bayesian Kernel Methods ... 488
16.1 Bayesics ... 489
16.1.1 Likelihood ... 489
16.1.2 Prior Distributions ... 491
16.1.3 Bayes’ Rule and Inference ... 493
16.1.4 Hyperparameters ... 494
16.2 Inference Methods ... 494
16.2.1 Maximum a Posteriori Approximation ... 495
16.2.2 Parametric Approximation of the Posterior Distribution ... 497
16.2.3 Connection to Regularized Risk Functionals ... 498
16.3 Gaussian Processes ... 499
16.3.1 Correlated Observations ... 499
16.3.2 Definitions and Basic Notions ... 500
16.3.3 Simple Hypotheses ... 502
16.3.4 Regression ... 503
16.3.5 Classification ... 505
16.3.6 Adjusting Hyperparameters for Gaussian Processes ... 506
16.4 Implementation of Gaussian Processes ... 507
16.4.1 Laplace Approximation ... 507
16.4.2 Variational Methods ... 509
16.4.3 Approximate Solutions for Gaussian Process Regression ... 510
16.4.4 Solutions on Subspaces ... 511
16.4.5 Implementation Issues ... 513
16.4.6 Hardness and Approximation Results ... 514
16.4.7 Experimental Evidence ... 516
16.5 Laplacian Processes ... 518
16.5.1 Data Dependent Priors ... 518
16.5.2 Samples from the Prior ... 520
16.5.3 Prediction ... 520
16.5.4 Confidence Intervals for Gaussian Noise ... 522
16.5.5 Data Independent Formulation ... 523
16.5.6 An Equivalent Gaussian Process ... 524
16.6 Relevance Vector Machines ... 525
16.6.1 Regression with Hyperparameters ... 526
16.6.2 Finding Optimal Hyperparameters ... 527
16.6.3 Explicit Priors by Integration ... 528
16.6.4 Classification ... 529
16.6.5 Toy Example and Discussion ... 529
16.7 Summary ... 530
16.7.2 Key Issues ... 531
16.8 Problems ... 532
17 Regularized Principal Manifolds ... 536
17.1 A Coding Framework ... 537
17.1.1 Quantization Error ... 537
17.1.2 Examples with Finite Codes ... 538
17.1.3 Examples with Infinite Codes ... 539
17.2 A Regularized Quantization Functional ... 541
17.2.1 Quadratic Regularizers ... 543
17.2.2 Examples of Regularization Operators ... 543
17.2.3 Linear Programming Regularizers ... 544
17.3 An Algorithm for Minimizing R_reg[f] ... 545
17.3.1 Projection ... 546
17.3.2 Adaptation ... 546
17.3.3 Initialization ... 547
17.4 Connections to Other Algorithms ... 548
17.4.1 Generative Models ... 548
17.4.2 The Generative Topographic Mapping ... 549
17.4.3 Robust Coding and Regularized Quantization ... 550
17.5 Uniform Convergence Bounds ... 552
17.5.1 Metrics and Covering Numbers ... 552
17.5.2 Upper and Lower Bounds ... 553
17.5.3 Bounding Covering Numbers ... 554
17.5.4 Rates of Convergence ... 555
17.6 Experiments ... 556
17.7 Summary ... 558
17.8 Problems ... 559
18 Pre-Images and Reduced Set Methods ... 562
18.1 The Pre-Image Problem ... 563
18.1.1 Exact Pre-Images ... 563
18.1.2 Approximate Pre-Images ... 565
18.2 Finding Approximate Pre-Images ... 566
18.2.1 Minimizing the Projection Distance ... 566
18.2.2 Fixed Point Iteration Approach for RBF Kernels ... 567
18.2.3 Toy Examples ... 568
18.2.4 Handwritten Digit Denoising ... 570
18.3 Reduced Set Methods ... 571
18.3.1 The Problem ... 571
18.4 Reduced Set Selection Methods ... 573
18.4.1 RS Selection via Kernel PCA ... 574
18.4.2 RS Selection via ℓ1 Penalization ... 576
18.4.3 RS Selection by Sparse Greedy Methods ... 577
18.4.4 The Primal Reformulation ... 578
18.4.5 RS Selection via SV Regression ... 579
18.5 Reduced Set Construction Methods ... 580
18.5.1 Iterated Pre-Images ... 580
18.5.2 Phase II: Simultaneous Optimization of RS Vectors ... 580
18.5.3 Experiments ... 581
18.6 Sequential Evaluation of Reduced Set Expansions ... 583
18.7 Summary ... 585
18.8 Problems ... 586
A Addenda ... 588
A.1 Data Sets ... 588
A.2 Proofs ... 591
B Mathematical Prerequisites ... 594
B.1 Probability ... 594
B.1.1 Probability Spaces ... 594
B.1.2 IID Samples ... 596
B.1.3 Densities and Integrals ... 597
B.1.4 Stochastic Processes ... 599
B.2 Linear Algebra ... 599
B.2.1 Vector Spaces ... 599
B.2.2 Norms and Dot Products ... 602
B.3 Functional Analysis ... 605
B.3.1 Advanced Topics ... 608
References ... 610
Index ... 636