Tree-Based Methods for Statistical Learning in R: A Practical Introduction


Tree-Based Methods for Statistical Learning in R provides a thorough introduction to both individual decision tree algorithms (Part I) and ensembles thereof (Part II). Part I brings several different tree algorithms into focus, both conventional and contemporary. Building a strong foundation in how individual decision trees work will help readers gain a deeper understanding of tree-based ensembles, which lie at the cutting edge of modern statistical and machine learning methodology.

The book follows up most ideas and mathematical concepts with code-based examples in the R statistical language, with an emphasis on using as few external packages as possible. For example, readers are shown how to write their own random forest and gradient tree boosting functions using simple for loops and basic tree-fitting software (like rpart and party/partykit). The core chapters also end with a detailed section on relevant software, covering both R and other open-source alternatives (e.g., Python, Spark, and Julia), along with example usage on real data sets. While the book mostly uses R, it is meant to be equally accessible and useful to non-R programmers.
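To make the "for loops plus a basic tree fitter" idea concrete, here is a minimal sketch of a bagged regression tree ensemble built with rpart on R's built-in airquality data. This is an illustration only, not the book's own code; a full random forest would additionally sample a subset of predictors at each split, which this sketch omits.

    # Minimal sketch: bagging regression trees with a simple for loop and rpart.
    # (Variable names and settings here are illustrative assumptions.)
    library(rpart)

    set.seed(101)
    d <- na.omit(airquality)            # predict Ozone from the remaining columns
    ntree <- 250
    fits <- vector("list", length = ntree)

    for (b in seq_len(ntree)) {
      boot <- d[sample(nrow(d), replace = TRUE), ]   # bootstrap sample of the rows
      fits[[b]] <- rpart(Ozone ~ ., data = boot,
                         control = rpart.control(cp = 0, minsplit = 2))  # deep tree
    }

    # The bagged prediction is just the average of the individual tree predictions
    pred_bag <- function(newdata) {
      rowMeans(sapply(fits, predict, newdata = newdata))
    }
    head(pred_bag(d))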

Readers of this book will gain a solid foundation in (and appreciation for) tree-based methods and how they can be used to solve the practical problems and challenges data scientists often face in applied work.

Features:

  • Thorough coverage, from the ground up, of tree-based methods (e.g., CART, conditional inference trees, bagging, boosting, and random forests).

  • A companion website containing additional supplementary material and the code to reproduce every example and figure in the book.
  • A companion R package, called treemisc, which contains several data sets and functions used throughout the book (e.g., there’s an implementation of gradient tree boosting with LAD loss that shows how to perform the line search step by updating the terminal node estimates of a fitted rpart tree; see the sketch following this list).
  • Interesting examples that are of practical use; for example, how to construct partial dependence plots from a fitted model in Spark MLlib (using only Spark operations), or how to post-process tree ensembles via the LASSO to reduce the number of trees while maintaining, or even improving, performance (a sketch of the LASSO idea also follows this list).
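To illustrate the LAD line-search idea mentioned in the treemisc bullet above, here is a minimal sketch under my own variable names (not the treemisc implementation): after fitting an rpart tree to the sign of the current residuals, each terminal node's fitted value is replaced by the median of the raw residuals that land in that node, which is the optimal update for absolute-error loss.

    library(rpart)

    # Replace each terminal node estimate of a fitted rpart regression tree with
    # the median of the raw residuals (y - f) falling in that node; this is the
    # line-search step for LAD (absolute error) loss.
    lad_line_search <- function(tree, y, f) {
      res <- y - f
      leaves <- which(tree$frame$var == "<leaf>")
      for (j in leaves) {
        rows <- which(tree$where == j)            # training rows in this terminal node
        tree$frame$yval[j] <- median(res[rows])   # LAD-optimal node estimate
      }
      tree  # predict() on the returned tree now uses the updated node estimates
    }

    # One (hypothetical) boosting iteration on the airquality data:
    d <- na.omit(airquality)
    y <- d$Ozone
    f <- rep(median(y), length(y))                # initial fit for LAD loss
    g <- sign(y - f)                              # negative gradient of |y - f|
    tree <- rpart(g ~ . - Ozone, data = d, maxdepth = 3)
    tree <- lad_line_search(tree, y, f)
    f <- f + 0.1 * predict(tree, newdata = d)     # shrinkage/learning rate of 0.1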
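And for the LASSO post-processing example in the last bullet, a minimal sketch (my own illustration, assuming the glmnet package is available): regress the response on the matrix of individual tree predictions and keep only the trees that receive nonzero coefficients.

    library(rpart)
    library(glmnet)   # assumed available for the LASSO fit

    set.seed(102)
    d <- na.omit(airquality)
    ntree <- 100
    fits <- vector("list", length = ntree)
    for (b in seq_len(ntree)) {
      boot <- d[sample(nrow(d), replace = TRUE), ]
      fits[[b]] <- rpart(Ozone ~ ., data = boot, control = rpart.control(cp = 0))
    }

    # N x B matrix whose b-th column holds tree b's predictions
    P <- sapply(fits, predict, newdata = d)

    # LASSO (alpha = 1) with lambda chosen by cross-validation; trees whose
    # coefficients are shrunk to zero are dropped from the ensemble
    cvfit <- cv.glmnet(P, d$Ozone, alpha = 1)
    coefs <- as.matrix(coef(cvfit, s = "lambda.1se"))[-1, 1]
    keep <- which(coefs != 0)
    length(keep)   # number of trees retained after post-processing

In practice the prediction matrix would be computed on out-of-bag or holdout data rather than the training rows, but the selection mechanism is the same.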

Author(s): Brandon M. Greenwell
Series: Chapman & Hall/CRC Data Science Series
Publisher: CRC Press
Year: 2022

Language: English
Pages: 388
City: Boca Raton

Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Preface
1. Introduction
1.1. Select topics in statistical and machine learning
1.1.1. Statistical jargon and conventions
1.1.2. Supervised learning
1.1.2.1. Description
1.1.2.2. Prediction
1.1.2.3. Classification vs. regression
1.1.2.4. Discrimination vs. prediction
1.1.2.5. The bias-variance tradeoff
1.1.3. Unsupervised learning
1.2. Why trees?
1.2.1. A brief history of decision trees
1.2.2. The anatomy of a simple decision tree
1.2.2.1. Example: survival on the Titanic
1.3. Why R?
1.3.1. No really, why R?
1.3.2. Software information and conventions
1.4. Some example data sets
1.4.1. Swiss banknotes
1.4.2. New York air quality measurements
1.4.3. The Friedman 1 benchmark problem
1.4.4. Mushroom edibility
1.4.5. Spam or ham?
1.4.6. Employee attrition
1.4.7. Predicting home prices in Ames, Iowa
1.4.8. Wine quality ratings
1.4.9. Mayo Clinic primary biliary cholangitis study
1.5. There ain’t no such thing as a free lunch
1.6. Outline of this book
I. Decision trees
2. Binary recursive partitioning with CART
2.1. Introduction
2.2. Classification trees
2.2.1. Splits on ordered variables
2.2.1.1. So which is it in practice, Gini or entropy?
2.2.2. Example: Swiss banknotes
2.2.3. Fitted values and predictions
2.2.4. Class priors and misclassification costs
2.2.4.1. Altered priors
2.2.4.2. Example: employee attrition
2.3. Regression trees
2.3.1. Example: New York air quality measurements
2.4. Categorical splits
2.4.1. Example: mushroom edibility
2.4.2. Be wary of categoricals with high cardinality
2.4.3. To encode, or not to encode?
2.5. Building a decision tree
2.5.1. Cost-complexity pruning
2.5.1.1. Example: mushroom edibility
2.5.2. Cross-validation
2.5.2.1. The 1-SE rule
2.6. Hyperparameters and tuning
2.7. Missing data and surrogate splits
2.7.1. Other missing value strategies
2.8. Variable importance
2.9. Software and examples
2.9.1. Example: Swiss banknotes
2.9.2. Example: mushroom edibility
2.9.3. Example: predicting home prices
2.9.4. Example: employee attrition
2.9.5. Example: letter image recognition
2.10. Discussion
2.10.1. Advantages of CART
2.10.2. Disadvantages of CART
2.11. Recommended reading
3. Conditional inference trees
3.1. Introduction
3.2. Early attempts at unbiased recursive partitioning
3.3. A quick digression into conditional inference
3.3.1. Example: X and Y are both univariate continuous
3.3.2. Example: X and Y are both nominal categorical
3.3.3. Which test statistic should you use?
3.4. Conditional inference trees
3.4.1. Selecting the splitting variable
3.4.1.1. Example: New York air quality measurements
3.4.1.2. Example: Swiss banknotes
3.4.2. Finding the optimal split point
3.4.2.1. Example: New York air quality measurements
3.4.3. Pruning
3.4.4. Missing values
3.4.5. Choice of α, g(), and h()
3.4.6. Fitted values and predictions
3.4.7. Imbalanced classes
3.4.8. Variable importance
3.5. Software and examples
3.5.1. Example: New York air quality measurements
3.5.2. Example: wine quality ratings
3.5.3. Example: Mayo Clinic liver transplant data
3.6. Final thoughts
4. The hitchhiker’s GUIDE to modern decision trees
4.1. Introduction
4.2. A GUIDE for regression
4.2.1. Piecewise constant models
4.2.1.1. Example: New York air quality measurements
4.2.2. Interaction tests
4.2.3. Non-constant fits
4.2.3.1. Example: predicting home prices
4.2.3.2. Bootstrap bias correction
4.3. A GUIDE for classification
4.3.1. Linear/oblique splits
4.3.1.1. Example: classifying the Palmer penguins
4.3.2. Priors and misclassification costs
4.3.3. Non-constant fits
4.3.3.1. Kernel-based and k-nearest neighbor fits
4.4. Pruning
4.5. Missing values
4.6. Fitted values and predictions
4.7. Variable importance
4.8. Ensembles
4.9. Software and examples
4.9.1. Example: credit card default
4.10. Final thoughts
II. Tree-based ensembles
5. Ensemble algorithms
5.1. Bootstrap aggregating (bagging)
5.1.1. When does bagging work?
5.1.2. Bagging from scratch: classifying email spam
5.1.3. Sampling without replacement
5.1.4. Hyperparameters and tuning
5.1.5. Software
5.2. Boosting
5.2.1. AdaBoost.M1 for binary outcomes
5.2.2. Boosting from scratch: classifying email spam
5.2.3. Tuning
5.2.4. Forward stagewise additive modeling and exponential loss
5.2.5. Software
5.3. Bagging or boosting: which should you use?
5.4. Variable importance
5.5. Importance sampled learning ensembles
5.5.1. Example: post-processing a bagged tree ensemble
5.6. Final thoughts
6. Peeking inside the “black box”: post-hoc interpretability
6.1. Feature importance
6.1.1. Permutation importance
6.1.2. Software
6.1.3. Example: predicting home prices
6.2. Feature effects
6.2.1. Partial dependence
6.2.1.1. Classification problems
6.2.2. Interaction effects
6.2.3. Individual conditional expectations
6.2.4. Software
6.2.5. Example: predicting home prices
6.2.6. Example: Edgar Anderson’s iris data
6.3. Feature contributions
6.3.1. Shapley values
6.3.2. Explaining predictions with Shapley values
6.3.2.1. Tree SHAP
6.3.2.2. Monte Carlo-based Shapley explanations
6.3.3. Software
6.3.4. Example: predicting home prices
6.4. Drawbacks of existing methods
6.5. Final thoughts
7. Random forests
7.1. Introduction
7.2. The random forest algorithm
7.2.1. Voting and probability estimation
7.2.1.1. Example: Mease model simulation
7.2.2. Subsampling (without replacement)
7.2.3. Random forest from scratch: predicting home prices
7.3. Out-of-bag (OOB) data
7.4. Hyperparameters and tuning
7.5. Variable importance
7.5.1. Impurity-based importance
7.5.2. OOB-based permutation importance
7.5.2.1. Holdout permutation importance
7.5.2.2. Conditional permutation importance
7.6. Casewise proximities
7.6.1. Detecting anomalies and outliers
7.6.1.1. Example: Swiss banknotes
7.6.2. Missing value imputation
7.6.3. Unsupervised random forests
7.6.3.1. Example: Swiss banknotes
7.6.4. Case-specific random forests
7.7. Prediction standard errors
7.7.1. Example: predicting email spam
7.8. Random forest extensions
7.8.1. Oblique random forests
7.8.2. Quantile regression forests
7.8.2.1. Example: predicting home prices (with prediction intervals)
7.8.3. Rotation forests and random rotation forests
7.8.3.1. Random rotation forests
7.8.3.2. Example: Gaussian mixture data
7.8.4. Extremely randomized trees
7.8.5. Anomaly detection with isolation forests
7.8.5.1. Extended isolation forests
7.8.5.2. Example: detecting credit card fraud
7.9. Software and examples
7.9.1. Example: mushroom edibility
7.9.2. Example: “deforesting” a random forest
7.9.3. Example: survival on the Titanic
7.9.3.1. Missing value imputation
7.9.3.2. Analyzing the imputed data sets
7.9.4. Example: class imbalance (the good, the bad, and the ugly)
7.9.5. Example: partial dependence with Spark MLlib
7.10. Final thoughts
8. Gradient boosting machines
8.1. Steepest descent (a brief overview)
8.2. Gradient tree boosting
8.2.0.1. Loss functions
8.2.0.2. Always a regression tree?
8.2.0.3. Priors and misclassification cost
8.3. Hyperparameters and tuning
8.3.1. Boosting-specific hyperparameters
8.3.1.1. The number of trees in the ensemble: B
8.3.1.2. Regularization and shrinkage
8.3.1.3. Example: predicting ALS progression
8.3.2. Tree-specific hyperparameters
8.3.3. A simple tuning strategy
8.4. Stochastic gradient boosting
8.4.1. Column subsampling
8.5. Gradient tree boosting from scratch
8.5.1. Example: predicting home prices
8.6. Interpretability
8.6.1. Faster partial dependence with the recursion method
8.6.1.1. Example: predicting email spam
8.6.2. Monotonic constraints
8.6.2.1. Example: bank marketing data
8.7. Specialized topics
8.7.1. Level-wise vs. leaf-wise tree induction
8.7.2. Histogram binning
8.7.3. Explainable boosting machines
8.7.4. Probabilistic regression via natural gradient boosting
8.8. Specialized implementations
8.8.1. eXtreme Gradient Boosting: XGBoost
8.8.2. Light Gradient Boosting Machine: LightGBM
8.8.3. CatBoost
8.9. Software and examples
8.9.1. Example: Mayo Clinic liver transplant data
8.9.2. Example: probabilistic predictions with NGBoost (in Python)
8.9.3. Example: post-processing GBMs with the LASSO
8.9.4. Example: direct marketing campaigns with XGBoost
8.10. Final thoughts
Bibliography
Index