Machine learning without advanced math! This book presents a serious, practical look at machine learning, preparing you for valuable insights on your own data. The Art of Machine Learning is packed with real dataset examples and sophisticated advice on how to make full use of powerful machine learning methods. Readers will need only an intuitive grasp of charts, graphs, and the slope of a line, as well as familiarity with the R programming language. You’ll become skilled in a range of machine learning methods, starting with the simple k-Nearest Neighbors method (k-NN), then on to random forests, gradient boosting, linear/logistic models, support vector machines, the LASSO, and neural networks. Final chapters introduce text and image classification, as well as time series. You’ll learn not only how to use machine learning methods, but also why these methods work, providing the strong foundational background you’ll need in practice.
Additional features:
How to avoid common problems, such as dealing with “dirty” data and factor variables with large numbers of levels
A look at typical misconceptions, such as those about how to deal with unbalanced data
Exploration of the famous Bias-Variance Trade-off, central to machine learning, and how it plays out in practice for each machine learning method
Dozens of illustrative examples involving real datasets of varying size and field of application
Standard R packages used throughout, with a simple wrapper interface to provide convenient access
After finishing this book, you will be well equipped to start applying machine learning techniques to your own datasets.
Author(s): Norman Matloff
Publisher: No Starch Press, Inc.
Year: 2023
Language: English
Pages: 272
Cover Page
Title Page
Copyright Page
About the Author
About the Technical Reviewer
BRIEF CONTENTS
CONTENTS IN DETAIL
ACKNOWLEDGMENTS
INTRODUCTION
0.1 What Is ML?
0.2 The Role of Math in ML Theory and Practice
0.3 Why Another ML Book?
0.4 Recurring Special Sections
0.5 Background Needed
0.6 The qe*-Series Software
0.7 The Book’s Grand Plan
0.8 One More Point
PART I PROLOGUE, AND NEIGHBORHOOD-BASED METHODS
1 REGRESSION MODELS
1.1 Example: The Bike Sharing Dataset
1.1.1 Loading the Data
1.1.2 A Look Ahead
1.2 Machine Learning and Prediction
1.2.1 Predicting Past, Present, and Future
1.2.2 Statistics vs. Machine Learning in Prediction
1.3 Introducing the k-Nearest Neighbors Method
1.3.1 Predicting Bike Ridership with k-NN
1.4 Dummy Variables and Categorical Variables
1.5 Analysis with qeKNN()
1.5.1 Predicting Bike Ridership with qeKNN()
1.6 The Regression Function: The Basis of ML
1.7 The Bias-Variance Trade-off
1.7.1 Analogy to Election Polls
1.7.2 Back to ML
1.8 Example: The mlb Dataset
1.9 k-NN and Categorical Features
1.10 Scaling
1.11 Choosing Hyperparameters
1.11.1 Predicting the Training Data
1.12 Holdout Sets
1.12.1 Loss Functions
1.12.2 Holdout Sets in the qe*-Series
1.12.3 Motivating Cross-Validation
1.12.4 Hyperparameters, Dataset Size, and Number of Features
1.13 Pitfall: p-Hacking and Hyperparameter Selection
1.14 Pitfall: Long-Term Time Trends
1.15 Pitfall: Dirty Data
1.16 Pitfall: Missing Data
1.17 Direct Access to the regtools k-NN Code
1.18 Conclusions
2 CLASSIFICATION MODELS
2.1 Classification Is a Special Case of Regression
2.2 Example: The Telco Churn Dataset
2.2.1 Pitfall: Factor Data Read as Non-factor
2.2.2 Pitfall: Retaining Useless Features
2.2.3 Dealing with NA Values
2.2.4 Applying the k-Nearest Neighbors Method
2.2.5 Pitfall: Overfitting Due to Features with Many Categories
2.3 Example: Vertebrae Data
2.3.1 Analysis
2.4 Pitfall: Error Rate Improves Only Slightly Using the Features
2.5 The Confusion Matrix
2.6 Clearing the Confusion: Unbalanced Data
2.6.1 Example: The Kaggle Appointments Dataset
2.6.2 A Better Approach to Unbalanced Data
2.7 Receiver Operating Characteristic and Area Under Curve
2.7.1 Details of ROC and AUC
2.7.2 The qeROC() Function
2.7.3 Example: Telco Churn Data
2.7.4 Example: Vertebrae Data
2.7.5 Pitfall: Overreliance on AUC
2.8 Conclusions
3 BIAS, VARIANCE, OVERFITTING, AND CROSS-VALIDATION
3.1 Overfitting and Underfitting
3.1.1 Intuition Regarding the Number of Features and Overfitting
3.1.2 Relation to Overall Dataset Size
3.1.3 Well Then, What Are the Best Values of k and p?
3.2 Cross-Validation
3.2.1 K-Fold Cross-Validation
3.2.2 Using the replicMeans() Function
3.2.3 Example: Programmer and Engineer Data
3.2.4 Triple Cross-Validation
3.3 Conclusions
4 DEALING WITH LARGE NUMBERS OF FEATURES
4.1 Pitfall: Computational Issues in Large Datasets
4.2 Introduction to Dimension Reduction
4.2.1 Example: The Million Song Dataset
4.2.2 The Need for Dimension Reduction
4.3 Methods for Dimension Reduction
4.3.1 Consolidation and Embedding
4.3.2 The All Possible Subsets Method
4.3.3 Principal Components Analysis
4.3.4 But Now We Have Two Hyperparameters
4.3.5 Using the qePCA() Wrapper
4.3.6 PCs and the Bias-Variance Trade-off
4.4 The Curse of Dimensionality
4.5 Other Methods of Dimension Reduction
4.5.1 Feature Ordering by Conditional Independence
4.5.2 Uniform Manifold Approximation and Projection
4.6 Going Further Computationally
4.7 Conclusions
PART II TREE-BASED METHODS
5 A STEP BEYOND K-NN: DECISION TREES
5.1 Basics of Decision Trees
5.2 The qeDT() Function
5.2.1 Looking at the Plot
5.3 Example: New York City Taxi Data
5.3.1 Pitfall: Too Many Combinations of Factor Levels
5.3.2 Tree-Based Analysis
5.4 Example: Forest Cover Data
5.5 Decision Tree Hyperparameters: How to Split?
5.6 Hyperparameters in the qeDT() Function
5.7 Conclusions
6 TWEAKING THE TREES
6.1 Bias vs. Variance, Bagging, and Boosting
6.2 Bagging: Generating New Trees by Resampling
6.2.1 Random Forests
6.2.2 The qeRF() Function
6.2.3 Example: Vertebrae Data
6.2.4 Example: Remote-Sensing Soil Analysis
6.3 Boosting: Repeatedly Tweaking a Tree
6.3.1 Implementation: AdaBoost
6.3.2 Gradient Boosting
6.3.3 Example: Call Network Monitoring
6.3.4 Example: Vertebrae Data
6.3.5 Bias vs. Variance in Boosting
6.3.6 Computational Speed
6.3.7 Further Hyperparameters
6.3.8 The Learning Rate
6.4 Pitfall: No Free Lunch
7 FINDING A GOOD SET OF HYPERPARAMETERS
7.1 Combinations of Hyperparameters
7.2 Grid Searching with qeFT()
7.2.1 How to Call qeFT()
7.3 Example: Programmer and Engineer Data
7.3.1 Confidence Intervals
7.3.2 The Takeaway on Grid Searching
7.4 Example: Programmer and Engineer Data
7.5 Example: Phoneme Data
7.6 Conclusions
PART III METHODS BASED ON LINEAR RELATIONSHIPS
8 PARAMETRIC METHODS
8.1 Motivating Example: The Baseball Player Data
8.1.1 A Graph to Guide Our Intuition
8.1.2 View as Dimension Reduction
8.2 The lm() Function
8.3 Wrapper for lm() in the qe*-Series: qeLin()
8.4 Use of Multiple Features
8.4.1 Example: Baseball Player, Continued
8.4.2 Beta Notation
8.4.3 Example: Airbnb Data
8.4.4 Applying the Linear Model
8.5 Dimension Reduction
8.5.1 Which Features Are Important?
8.5.2 Statistical Significance and Dimension Reduction
8.6 Least Squares and Residuals
8.7 Diagnostics: Is the Linear Model Valid?
8.7.1 Exactness?
8.7.2 Diagnostic Methods
8.8 The R-Squared Value(s)
8.9 Classification Applications: The Logistic Model
8.9.1 The glm() and qeLogit() Functions
8.9.2 Example: Telco Churn Data
8.9.3 Multiclass Case
8.9.4 Example: Fall Detection Data
8.10 Bias and Variance in Linear/Generalized Linear Models
8.10.1 Example: Bike Sharing Data
8.11 Polynomial Models
8.11.1 Motivation
8.11.2 Modeling Nonlinearity with a Linear Model
8.11.3 Polynomial Logistic Regression
8.11.4 Example: Programmer and Engineer Wages
8.12 Blending the Linear Model with Other Methods
8.13 The qeCompare() Function
8.13.1 Need for Caution Regarding Polynomial Models
8.14 What’s Next
9 CUTTING THINGS DOWN TO SIZE: REGULARIZATION
9.1 Motivation
9.2 Size of a Vector
9.3 Ridge Regression and the LASSO
9.3.1 How They Work
9.3.2 The Bias-Variance Trade-off, Avoiding Overfitting
9.3.3 Relation Between λ, n, and p
9.3.4 Comparison, Ridge vs. LASSO
9.4 Software
9.5 Example: NYC Taxi Data
9.6 Example: Airbnb Data
9.7 Example: African Soil Data
9.7.1 LASSO Analysis
9.8 Optional Section: The Famous LASSO Picture
9.9 Coming Up
PART IV METHODS BASED ON SEPARATING LINES AND PLANES
10 A BOUNDARY APPROACH: SUPPORT VECTOR MACHINES
10.1 Motivation
10.1.1 Example: The Forest Cover Dataset
10.2 Lines, Planes, and Hyperplanes
10.3 Math Notation
10.3.1 Vector Expressions
10.3.2 Dot Products
10.3.3 SVM as a Parametric Model
10.4 SVM: The Basic Ideas—Separable Case
10.4.1 Example: The Anderson Iris Dataset
10.4.2 Optimizing Criterion
10.5 Major Problem: Lack of Linear Separability
10.5.1 Applying a “Kernel”
10.5.2 Soft Margin
10.6 Example: Forest Cover Data
10.7 And What About That Kernel Trick?
10.8 “Warning: Maximum Number of Iterations Reached”
10.9 Summary
11 LINEAR MODELS ON STEROIDS: NEURAL NETWORKS
11.1 Overview
11.2 Working on Top of a Complex Infrastructure
11.3 Example: Vertebrae Data
11.4 Neural Network Hyperparameters
11.5 Activation Functions
11.6 Regularization
11.6.1 L1 and L2 Regularization
11.6.2 Regularization by Dropout
11.7 Example: Fall Detection Data
11.8 Pitfall: Convergence Problems
11.9 Close Relation to Polynomial Regression
11.10 Bias vs. Variance in Neural Networks
11.11 Discussion
PART V APPLICATIONS
12 IMAGE CLASSIFICATION
12.1 Example: The Fashion MNIST Data
12.1.1 A First Try Using a Logit Model
12.1.2 Refinement via PCA
12.2 Convolutional Models
12.2.1 Need for Recognition of Locality
12.2.2 Overview of Convolutional Methods
12.2.3 Image Tiling
12.2.4 The Convolution Operation
12.2.5 The Pooling Operation
12.2.6 Shape Evolution Across Layers
12.2.7 Dropout
12.2.8 Summary of Shape Evolution
12.2.9 Translation Invariance
12.3 Tricks of the Trade
12.3.1 Data Augmentation
12.3.2 Pretrained Networks
12.4 So, What About the Overfitting Issue?
12.5 Conclusions
13 HANDLING TIME SERIES AND TEXT DATA
13.1 Converting Time Series Data to Rectangular Form
13.1.1 Toy Example
13.1.2 The regtools Function TStoX()
13.2 The qeTS() Function
13.3 Example: Weather Data
13.4 Bias vs. Variance
13.5 Text Applications
13.5.1 The Bag-of-Words Model
13.5.2 The qeText() Function
13.5.3 Example: Quiz Data
13.5.4 Example: AG News Dataset
13.6 Summary
A LIST OF ACRONYMS AND SYMBOLS
B STATISTICS AND ML TERMINOLOGY CORRESPONDENCE
C MATRICES, DATA FRAMES, AND FACTOR CONVERSIONS
C.1 Matrices
C.2 Conversions: Between R Factors and Dummy Variables, Between Data Frames and Matrices
D PITFALL: BEWARE OF “P-HACKING”!
INDEX