Title: The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edition)
Author(s): Trevor Hastie, Robert Tibshirani, Jerome Friedman
Series: Springer Series in Statistics
Publisher: Springer
Year: 2013
Language: English
Pages: 764
1 Introduction......Page 20
2 Overview of Supervised Learning......Page 28
2.2 Variable Types and Terminology......Page 28
2.3.1 Linear Models and Least Squares......Page 30
2.3.2 Nearest-Neighbor Methods......Page 33
2.3.3 From Least Squares to Nearest Neighbors......Page 35
2.4 Statistical Decision Theory......Page 37
2.5 Local Methods in High Dimensions......Page 41
2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)......Page 47
2.6.3 Function Approximation......Page 48
2.7.1 Difficulty of the Problem......Page 51
2.8 Classes of Restricted Estimators......Page 52
2.8.2 Kernel Methods and Local Regression......Page 53
2.8.3 Basis Functions and Dictionary Methods......Page 54
2.9 Model Selection and the Bias–Variance Tradeoff......Page 56
Exercises......Page 58
3 Linear Methods for Regression......Page 62
3.1 Introduction......Page 62
3.2 Linear Regression Models and Least Squares......Page 63
3.2.1 Example: Prostate Cancer......Page 68
3.2.2 The Gauss–Markov Theorem......Page 70
3.2.3 Multiple Regression from Simple Univariate Regression......Page 71
3.2.4 Multiple Outputs......Page 75
3.3.1 Best-Subset Selection......Page 76
3.3.2 Forward- and Backward-Stepwise Selection......Page 77
3.3.3 Forward-Stagewise Regression......Page 79
3.4.1 Ridge Regression......Page 80
3.4.2 The Lasso......Page 87
3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso......Page 88
3.4.4 Least Angle Regression......Page 92
3.5.1 Principal Components Regression......Page 98
3.5.2 Partial Least Squares......Page 99
3.6 Discussion: A Comparison of the Selection and Shrinkage Methods......Page 101
3.7 Multiple Outcome Shrinkage and Selection......Page 103
3.8.1 Incremental Forward Stagewise Regression......Page 105
3.8.3 The Dantzig Selector......Page 108
3.8.4 The Grouped Lasso......Page 109
3.8.5 Further Properties of the Lasso......Page 110
3.8.6 Pathwise Coordinate Optimization......Page 111
3.9 Computational Considerations......Page 112
Exercises......Page 113
4 Linear Methods for Classification......Page 120
4.1 Introduction......Page 120
4.2 Linear Regression of an Indicator Matrix......Page 122
4.3 Linear Discriminant Analysis......Page 125
4.3.1 Regularized Discriminant Analysis......Page 131
4.3.3 Reduced-Rank Linear Discriminant Analysis......Page 132
4.4 Logistic Regression......Page 138
4.4.1 Fitting Logistic Regression Models......Page 139
4.4.2 Example: South African Heart Disease......Page 141
4.4.3 Quadratic Approximations and Inference......Page 143
4.4.4 L1 Regularized Logistic Regression......Page 144
4.4.5 Logistic Regression or LDA?......Page 146
4.5 Separating Hyperplanes......Page 148
4.5.1 Rosenblatt’s Perceptron Learning Algorithm......Page 149
4.5.2 Optimal Separating Hyperplanes......Page 151
Exercises......Page 154
5 Basis Expansions and Regularization......Page 158
5.1 Introduction......Page 158
5.2 Piecewise Polynomials and Splines......Page 160
5.2.1 Natural Cubic Splines......Page 163
5.2.2 Example: South African Heart Disease (Continued)......Page 165
5.2.3 Example: Phoneme Recognition......Page 167
5.3 Filtering and Feature Extraction......Page 169
5.4 Smoothing Splines......Page 170
5.4.1 Degrees of Freedom and Smoother Matrices......Page 172
5.5 Automatic Selection of the Smoothing Parameters......Page 175
5.5.2 The Bias–Variance Tradeoff......Page 177
5.6 Nonparametric Logistic Regression......Page 180
5.7 Multidimensional Splines......Page 181
5.8 Regularization and Reproducing Kernel Hilbert Spaces......Page 186
5.8.1 Spaces of Functions Generated by Kernels......Page 187
5.8.2 Examples of RKHS......Page 189
5.9 Wavelet Smoothing......Page 193
5.9.1 Wavelet Bases and the Wavelet Transform......Page 195
5.9.2 Adaptive Wavelet Filtering......Page 198
Exercises......Page 200
Appendix: B-splines......Page 205
Appendix: Computations for Smoothing Splines......Page 208
6 Kernel Smoothing Methods......Page 210
6.1 One-Dimensional Kernel Smoothers......Page 211
6.1.1 Local Linear Regression......Page 213
6.1.2 Local Polynomial Regression......Page 216
6.2 Selecting the Width of the Kernel......Page 217
6.3 Local Regression in IR^p......Page 219
6.4 Structured Local Regression Models in IR^p......Page 220
6.4.2 Structured Regression Functions......Page 222
6.5 Local Likelihood and Other Models......Page 224
6.6.1 Kernel Density Estimation......Page 227
6.6.3 The Naive Bayes Classifier......Page 229
6.7 Radial Basis Functions and Kernels......Page 231
6.8 Mixture Models for Density Estimation and Classification......Page 233
Exercises......Page 235
7 Model Assessment and Selection......Page 238
7.2 Bias, Variance and Model Complexity......Page 238
7.3 The Bias–Variance Decomposition......Page 242
7.3.1 Example: Bias–Variance Tradeoff......Page 245
7.4 Optimism of the Training Error Rate......Page 247
7.5 Estimates of In-Sample Prediction Error......Page 249
7.6 The Effective Number of Parameters......Page 251
7.7 The Bayesian Approach and BIC......Page 252
7.8 Minimum Description Length......Page 254
7.9 Vapnik–Chervonenkis Dimension......Page 256
7.9.1 Example (Continued)......Page 258
7.10.1 K-Fold Cross-Validation......Page 260
7.10.2 The Wrong and Right Way to Do Cross-validation......Page 264
7.10.3 Does Cross-Validation Really Work?......Page 266
7.11 Bootstrap Methods......Page 268
7.11.1 Example (Continued)......Page 271
7.12 Conditional or Expected Test Error?......Page 273
Exercises......Page 276
8 Model Inference and Averaging......Page 280
8.2.1 A Smoothing Example......Page 280
8.2.2 Maximum Likelihood Inference......Page 284
8.3 Bayesian Methods......Page 286
8.4 Relationship Between the Bootstrap and Bayesian Inference......Page 290
8.5.1 Two-Component Mixture Model......Page 291
8.5.2 The EM Algorithm in General......Page 295
8.5.3 EM as a Maximization–Maximization Procedure......Page 296
8.6 MCMC for Sampling from the Posterior......Page 298
8.7 Bagging......Page 301
8.7.1 Example: Trees with Simulated Data......Page 302
8.8 Model Averaging and Stacking......Page 307
8.9 Stochastic Search: Bumping......Page 309
Bibliographic Notes......Page 311
Exercises......Page 312
9 Additive Models, Trees, and Related Methods......Page 314
9.1 Generalized Additive Models......Page 314
9.1.1 Fitting Additive Models......Page 316
9.1.2 Example: Additive Logistic Regression......Page 318
9.1.3 Summary......Page 323
9.2.1 Background......Page 324
9.2.2 Regression Trees......Page 326
9.2.3 Classification Trees......Page 327
9.2.4 Other Issues......Page 329
9.2.5 Spam Example (Continued)......Page 332
9.3 PRIM: Bump Hunting......Page 336
9.3.1 Spam Example (Continued)......Page 339
9.4 MARS: Multivariate Adaptive Regression Splines......Page 340
9.4.1 Spam Example (Continued)......Page 345
9.4.2 Example (Simulated Data)......Page 346
9.4.3 Other Issues......Page 347
9.5 Hierarchical Mixtures of Experts......Page 348
9.6 Missing Data......Page 351
Bibliographic Notes......Page 353
Exercises......Page 354
10 Boosting and Additive Trees......Page 356
10.1 Boosting Methods......Page 356
10.1.1 Outline of This Chapter......Page 359
10.2 Boosting Fits an Additive Model......Page 360
10.3 Forward Stagewise Additive Modeling......Page 361
10.4 Exponential Loss and AdaBoost......Page 362
10.5 Why Exponential Loss?......Page 364
10.6 Loss Functions and Robustness......Page 365
10.7 “Off-the-Shelf” Procedures for Data Mining......Page 369
10.8 Example: Spam Data......Page 371
10.9 Boosting Trees......Page 372
10.10.1 Steepest Descent......Page 377
10.10.2 Gradient Boosting......Page 378
10.10.3 Implementations of Gradient Boosting......Page 379
10.11 Right-Sized Trees for Boosting......Page 380
10.12.1 Shrinkage......Page 383
10.12.2 Subsampling......Page 384
10.13.1 Relative Importance of Predictor Variables......Page 386
10.13.2 Partial Dependence Plots......Page 388
10.14.1 California Housing......Page 390
10.14.2 New Zealand Fish......Page 394
10.14.3 Demographics Data......Page 398
Bibliographic Notes......Page 399
Exercises......Page 403
11 Neural Networks......Page 408
11.2 Projection Pursuit Regression......Page 408
11.3 Neural Networks......Page 411
11.4 Fitting Neural Networks......Page 414
11.5.1 Starting Values......Page 416
11.5.3 Scaling of the Inputs......Page 417
11.5.5 Multiple Minima......Page 419
11.6 Example: Simulated Data......Page 420
11.7 Example: ZIP Code Data......Page 423
11.8 Discussion......Page 427
11.9 Bayesian Neural Nets and the NIPS 2003 Challenge......Page 428
11.9.1 Bayes, Boosting and Bagging......Page 429
11.9.2 Performance Comparisons......Page 431
11.10 Computational Considerations......Page 433
Exercises......Page 434
12 Support Vector Machines and Flexible Discriminants......Page 436
12.2 The Support Vector Classifier......Page 436
12.2.1 Computing the Support Vector Classifier......Page 439
12.2.2 Mixture Example (Continued)......Page 440
12.3.1 Computing the SVM for Classification......Page 442
12.3.2 The SVM as a Penalization Method......Page 445
12.3.3 Function Estimation and Reproducing Kernels......Page 447
12.3.4 SVMs and the Curse of Dimensionality......Page 450
12.3.5 A Path Algorithm for the SVM Classifier......Page 451
12.3.6 Support Vector Machines for Regression......Page 453
12.3.7 Regression and Kernels......Page 455
12.4 Generalizing Linear Discriminant Analysis......Page 457
12.5 Flexible Discriminant Analysis......Page 459
12.5.1 Computing the FDA Estimates......Page 463
12.6 Penalized Discriminant Analysis......Page 465
12.7 Mixture Discriminant Analysis......Page 468
12.7.1 Example: Waveform Data......Page 470
Exercises......Page 474
13 Prototype Methods and Nearest-Neighbors......Page 478
13.2 Prototype Methods......Page 478
13.2.1 K-means Clustering......Page 479
13.2.2 Learning Vector Quantization......Page 481
13.3 k-Nearest-Neighbor Classifiers......Page 482
13.3.1 Example: A Comparative Study......Page 487
13.3.2 Example: k-Nearest-Neighbors and Image Scene Classification......Page 489
13.3.3 Invariant Metrics and Tangent Distance......Page 490
13.4 Adaptive Nearest-Neighbor Methods......Page 494
13.4.1 Example......Page 497
13.4.2 Global Dimension Reduction for Nearest-Neighbors......Page 498
13.5 Computational Considerations......Page 499
Exercises......Page 500
14 Unsupervised Learning......Page 504
14.1 Introduction......Page 504
14.2 Association Rules......Page 506
14.2.1 Market Basket Analysis......Page 507
14.2.2 The Apriori Algorithm......Page 508
14.2.3 Example: Market Basket Analysis......Page 511
14.2.4 Unsupervised as Supervised Learning......Page 514
14.2.5 Generalized Association Rules......Page 516
14.2.7 Example: Market Basket Analysis (Continued)......Page 518
14.3 Cluster Analysis......Page 520
14.3.2 Dissimilarities Based on Attributes......Page 522
14.3.3 Object Dissimilarity......Page 524
14.3.5 Combinatorial Algorithms......Page 526
14.3.6 K-means......Page 528
14.3.7 Gaussian Mixtures as Soft K-means Clustering......Page 529
14.3.8 Example: Human Tumor Microarray Data......Page 531
14.3.9 Vector Quantization......Page 533
14.3.10 K-medoids......Page 534
14.3.11 Practical Issues......Page 537
14.3.12 Hierarchical Clustering......Page 539
14.4 Self-Organizing Maps......Page 547
14.5.1 Principal Components......Page 553
14.5.2 Principal Curves and Surfaces......Page 560
14.5.3 Spectral Clustering......Page 563
14.5.4 Kernel Principal Components......Page 566
14.5.5 Sparse Principal Components......Page 569
14.6 Non-negative Matrix Factorization......Page 572
14.6.1 Archetypal Analysis......Page 573
14.7 Independent Component Analysis and Exploratory Projection Pursuit......Page 576
14.7.1 Latent Variables and Factor Analysis......Page 577
14.7.2 Independent Component Analysis......Page 579
14.7.4 A Direct Approach to ICA......Page 584
14.8 Multidimensional Scaling......Page 589
14.9 Nonlinear Dimension Reduction and Local Multidimensional Scaling......Page 591
14.10 The Google PageRank Algorithm......Page 595
Bibliographic Notes......Page 597
Exercises......Page 598
15 Random Forests......Page 606
15.2 Definition of Random Forests......Page 606
15.3.1 Out of Bag Samples......Page 611
15.3.2 Variable Importance......Page 612
15.3.3 Proximity Plots......Page 614
15.3.4 Random Forests and Overfitting......Page 615
15.4.1 Variance and the De-Correlation Effect......Page 616
15.4.2 Bias......Page 619
15.4.3 Adaptive Nearest Neighbors......Page 620
Bibliographic Notes......Page 621
Exercises......Page 622
16 Ensemble Learning......Page 624
16.1 Introduction......Page 624
16.2.1 Penalized Regression......Page 626
16.2.2 The “Bet on Sparsity” Principle......Page 629
16.2.3 Regularization Paths, Over-fitting and Margins......Page 632
16.3 Learning Ensembles......Page 635
16.3.1 Learning a Good Ensemble......Page 636
16.3.2 Rule Ensembles......Page 641
Bibliographic Notes......Page 642
Exercises......Page 643
17 Undirected Graphical Models......Page 644
17.1 Introduction......Page 644
17.2 Markov Graphs and Their Properties......Page 646
17.3 Undirected Graphical Models for Continuous Variables......Page 649
17.3.1 Estimation of the Parameters when the Graph Structure is Known......Page 650
17.3.2 Estimation of the Graph Structure......Page 654
17.4 Undirected Graphical Models for Discrete Variables......Page 657
17.4.1 Estimation of the Parameters when the Graph Structure is Known......Page 658
17.4.2 Hidden Nodes......Page 660
17.4.3 Estimation of the Graph Structure......Page 661
17.4.4 Restricted Boltzmann Machines......Page 662
Exercises......Page 664
18 High-Dimensional Problems: p ≫ N......Page 668
18.1 When p is Much Bigger than N......Page 668
18.2 Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids......Page 670
18.3 Linear Classifiers with Quadratic Regularization......Page 673
18.3.1 Regularized Discriminant Analysis......Page 675
18.3.3 The Support Vector Classifier......Page 676
18.3.4 Feature Selection......Page 677
18.3.5 Computational Shortcuts When p ≫ N......Page 678
18.4 Linear Classifiers with L1 Regularization......Page 680
18.4.1 Application of Lasso to Protein Mass Spectroscopy......Page 683
18.4.2 The Fused Lasso for Functional Data......Page 685
18.5.1 Example: String Kernels and Protein Classification......Page 687
18.5.2 Classification and Other Models Using Inner-Product Kernels and Pairwise Distances......Page 689
18.5.3 Example: Abstracts Classification......Page 691
18.6 High-Dimensional Regression: Supervised Principal Components......Page 693
18.6.1 Connection to Latent-Variable Modeling......Page 697
18.6.2 Relationship with Partial Least Squares......Page 699
18.6.3 Pre-Conditioning for Feature Selection......Page 700
18.7 Feature Assessment and the Multiple-Testing Problem......Page 702
18.7.1 The False Discovery Rate......Page 706
18.7.2 Asymmetric Cutpoints and the SAM Procedure......Page 709
18.7.3 A Bayesian Interpretation of the FDR......Page 711
Bibliographic Notes......Page 712
References......Page 718