This practical guide provides more than 200 self-contained recipes to help you solve machine learning challenges you may encounter in your work. If you're comfortable with Python and its libraries, including pandas and scikit-learn, you'll be able to address specific problems, from loading data to training models and leveraging neural networks.
Each recipe in this updated edition includes code that you can copy, paste, and run with a toy dataset to ensure that it works. From there, you can adapt these recipes according to your use case or application. Recipes include a discussion that explains the solution and provides meaningful context.
Go beyond theory and concepts by learning the nuts and bolts you need to construct working machine learning applications. You'll find recipes for:
• Vectors, matrices, and arrays
• Working with data from CSV, JSON, SQL, databases, cloud storage, and other sources
• Handling numerical and categorical data, text, images, and dates and times
• Dimensionality reduction using feature extraction or feature selection
• Model evaluation and selection
• Linear and logistic regression, trees and forests, and k-nearest neighbors
• Support vector machines (SVM), naïve Bayes, clustering, and tree-based models
• Saving, loading, and serving trained models from multiple frameworks
Author(s): Kyle Gallatin, Chris Albon
Edition: 2
Publisher: O'Reilly Media
Year: 2023
Language: English
Commentary: Publisher's PDF
Pages: 413
City: Sebastopol, CA
Tags: Machine Learning; Neural Networks; Deep Learning; Image Processing; Python; Support Vector Machines; Linear Regression; Logistic Regression; NumPy; PyTorch; Data Wrangling; Model Evaluation; Model Selection; Feature Extraction; Random Forest; Dimensionality Reduction; Trees; Text Processing; Tensor Calculus; Data Preprocessing; Naïve Bayes; Cluster Analysis; K-Nearest Neighbors
Cover
Copyright
Table of Contents
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Working with Vectors, Matrices, and Arrays in NumPy
1.0 Introduction
1.1 Creating a Vector
Problem
Solution
Discussion
See Also
1.2 Creating a Matrix
Problem
Solution
Discussion
See Also
1.3 Creating a Sparse Matrix
Problem
Solution
Discussion
See Also
1.4 Preallocating NumPy Arrays
Problem
Solution
Discussion
1.5 Selecting Elements
Problem
Solution
Discussion
1.6 Describing a Matrix
Problem
Solution
Discussion
1.7 Applying Functions over Each Element
Problem
Solution
Discussion
1.8 Finding the Maximum and Minimum Values
Problem
Solution
Discussion
1.9 Calculating the Average, Variance, and Standard Deviation
Problem
Solution
Discussion
1.10 Reshaping Arrays
Problem
Solution
Discussion
1.11 Transposing a Vector or Matrix
Problem
Solution
Discussion
1.12 Flattening a Matrix
Problem
Solution
Discussion
1.13 Finding the Rank of a Matrix
Problem
Solution
Discussion
See Also
1.14 Getting the Diagonal of a Matrix
Problem
Solution
Discussion
1.15 Calculating the Trace of a Matrix
Problem
Solution
Discussion
See Also
1.16 Calculating Dot Products
Problem
Solution
Discussion
See Also
1.17 Adding and Subtracting Matrices
Problem
Solution
Discussion
1.18 Multiplying Matrices
Problem
Solution
Discussion
See Also
1.19 Inverting a Matrix
Problem
Solution
Discussion
See Also
1.20 Generating Random Values
Problem
Solution
Discussion
Chapter 2. Loading Data
2.0 Introduction
2.1 Loading a Sample Dataset
Problem
Solution
Discussion
See Also
2.2 Creating a Simulated Dataset
Problem
Solution
Discussion
See Also
2.3 Loading a CSV File
Problem
Solution
Discussion
2.4 Loading an Excel File
Problem
Solution
Discussion
2.5 Loading a JSON File
Problem
Solution
Discussion
See Also
2.6 Loading a Parquet File
Problem
Solution
Discussion
See Also
2.7 Loading an Avro File
Problem
Solution
Discussion
See Also
2.8 Querying a SQLite Database
Problem
Solution
Discussion
See Also
2.9 Querying a Remote SQL Database
Problem
Solution
Discussion
See Also
2.10 Loading Data from a Google Sheet
Problem
Solution
Discussion
See Also
2.11 Loading Data from an S3 Bucket
Problem
Solution
Discussion
See Also
2.12 Loading Unstructured Data
Problem
Solution
Discussion
See Also
Chapter 3. Data Wrangling
3.0 Introduction
3.1 Creating a DataFrame
Problem
Solution
Discussion
3.2 Getting Information about the Data
Problem
Solution
Discussion
3.3 Slicing DataFrames
Problem
Solution
Discussion
3.4 Selecting Rows Based on Conditionals
Problem
Solution
Discussion
3.5 Sorting Values
Problem
Solution
Discussion
3.6 Replacing Values
Problem
Solution
Discussion
3.7 Renaming Columns
Problem
Solution
Discussion
3.8 Finding the Minimum, Maximum, Sum, Average, and Count
Problem
Solution
Discussion
3.9 Finding Unique Values
Problem
Solution
Discussion
3.10 Handling Missing Values
Problem
Solution
Discussion
3.11 Deleting a Column
Problem
Solution
Discussion
3.12 Deleting a Row
Problem
Solution
Discussion
3.13 Dropping Duplicate Rows
Problem
Solution
Discussion
3.14 Grouping Rows by Values
Problem
Solution
Discussion
3.15 Grouping Rows by Time
Problem
Solution
Discussion
See Also
3.16 Aggregating Operations and Statistics
Problem
Solution
Discussion
See Also
3.17 Looping over a Column
Problem
Solution
Discussion
3.18 Applying a Function over All Elements in a Column
Problem
Solution
Discussion
3.19 Applying a Function to Groups
Problem
Solution
Discussion
3.20 Concatenating DataFrames
Problem
Solution
Discussion
3.21 Merging DataFrames
Problem
Solution
Discussion
See Also
Chapter 4. Handling Numerical Data
4.0 Introduction
4.1 Rescaling a Feature
Problem
Solution
Discussion
See Also
4.2 Standardizing a Feature
Problem
Solution
Discussion
4.3 Normalizing Observations
Problem
Solution
Discussion
4.4 Generating Polynomial and Interaction Features
Problem
Solution
Discussion
4.5 Transforming Features
Problem
Solution
Discussion
4.6 Detecting Outliers
Problem
Solution
Discussion
See Also
4.7 Handling Outliers
Problem
Solution
Discussion
See Also
4.8 Discretizing Features
Problem
Solution
Discussion
See Also
4.9 Grouping Observations Using Clustering
Problem
Solution
Discussion
4.10 Deleting Observations with Missing Values
Problem
Solution
Discussion
See Also
4.11 Imputing Missing Values
Problem
Solution
Discussion
See Also
Chapter 5. Handling Categorical Data
5.0 Introduction
5.1 Encoding Nominal Categorical Features
Problem
Solution
Discussion
See Also
5.2 Encoding Ordinal Categorical Features
Problem
Solution
Discussion
5.3 Encoding Dictionaries of Features
Problem
Solution
Discussion
See Also
5.4 Imputing Missing Class Values
Problem
Solution
Discussion
See Also
5.5 Handling Imbalanced Classes
Problem
Solution
Discussion
Chapter 6. Handling Text
6.0 Introduction
6.1 Cleaning Text
Problem
Solution
Discussion
See Also
6.2 Parsing and Cleaning HTML
Problem
Solution
Discussion
See Also
6.3 Removing Punctuation
Problem
Solution
Discussion
6.4 Tokenizing Text
Problem
Solution
Discussion
6.5 Removing Stop Words
Problem
Solution
Discussion
6.6 Stemming Words
Problem
Solution
Discussion
See Also
6.7 Tagging Parts of Speech
Problem
Solution
Discussion
See Also
6.8 Performing Named-Entity Recognition
Problem
Solution
Discussion
See Also
6.9 Encoding Text as a Bag of Words
Problem
Solution
Discussion
See Also
6.10 Weighting Word Importance
Problem
Solution
Discussion
See Also
6.11 Using Text Vectors to Calculate Text Similarity in a Search Query
Problem
Solution
Discussion
See Also
6.12 Using a Sentiment Analysis Classifier
Problem
Solution
Discussion
See Also
Chapter 7. Handling Dates and Times
7.0 Introduction
7.1 Converting Strings to Dates
Problem
Solution
Discussion
See Also
7.2 Handling Time Zones
Problem
Solution
Discussion
7.3 Selecting Dates and Times
Problem
Solution
Discussion
7.4 Breaking Up Date Data into Multiple Features
Problem
Solution
Discussion
7.5 Calculating the Difference Between Dates
Problem
Solution
Discussion
See Also
7.6 Encoding Days of the Week
Problem
Solution
Discussion
See Also
7.7 Creating a Lagged Feature
Problem
Solution
Discussion
7.8 Using Rolling Time Windows
Problem
Solution
Discussion
See Also
7.9 Handling Missing Data in Time Series
Problem
Solution
Discussion
Chapter 8. Handling Images
8.0 Introduction
8.1 Loading Images
Problem
Solution
Discussion
See Also
8.2 Saving Images
Problem
Solution
Discussion
8.3 Resizing Images
Problem
Solution
Discussion
8.4 Cropping Images
Problem
Solution
Discussion
See Also
8.5 Blurring Images
Problem
Solution
Discussion
See Also
8.6 Sharpening Images
Problem
Solution
Discussion
8.7 Enhancing Contrast
Problem
Solution
Discussion
8.8 Isolating Colors
Problem
Solution
Discussion
8.9 Binarizing Images
Problem
Solution
Discussion
8.10 Removing Backgrounds
Problem
Solution
Discussion
8.11 Detecting Edges
Problem
Solution
Discussion
See Also
8.12 Detecting Corners
Problem
Solution
Discussion
See Also
8.13 Creating Features for Machine Learning
Problem
Solution
Discussion
8.14 Encoding Color Histograms as Features
Problem
Solution
Discussion
See Also
8.15 Using Pretrained Embeddings as Features
Problem
Solution
Discussion
See Also
8.16 Detecting Objects with OpenCV
Problem
Solution
Discussion
See Also
8.17 Classifying Images with PyTorch
Problem
Solution
Discussion
See Also
Chapter 9. Dimensionality Reduction Using Feature Extraction
9.0 Introduction
9.1 Reducing Features Using Principal Components
Problem
Solution
Discussion
See Also
9.2 Reducing Features When Data Is Linearly Inseparable
Problem
Solution
Discussion
See Also
9.3 Reducing Features by Maximizing Class Separability
Problem
Solution
Discussion
See Also
9.4 Reducing Features Using Matrix Factorization
Problem
Solution
Discussion
See Also
9.5 Reducing Features on Sparse Data
Problem
Solution
Discussion
See Also
Chapter 10. Dimensionality Reduction Using Feature Selection
10.0 Introduction
10.1 Thresholding Numerical Feature Variance
Problem
Solution
Discussion
10.2 Thresholding Binary Feature Variance
Problem
Solution
Discussion
10.3 Handling Highly Correlated Features
Problem
Solution
Discussion
10.4 Removing Irrelevant Features for Classification
Problem
Solution
Discussion
10.5 Recursively Eliminating Features
Problem
Solution
Discussion
See Also
Chapter 11. Model Evaluation
11.0 Introduction
11.1 Cross-Validating Models
Problem
Solution
Discussion
See Also
11.2 Creating a Baseline Regression Model
Problem
Solution
Discussion
11.3 Creating a Baseline Classification Model
Problem
Solution
Discussion
See Also
11.4 Evaluating Binary Classifier Predictions
Problem
Solution
Discussion
See Also
11.5 Evaluating Binary Classifier Thresholds
Problem
Solution
Discussion
See Also
11.6 Evaluating Multiclass Classifier Predictions
Problem
Solution
Discussion
11.7 Visualizing a Classifier’s Performance
Problem
Solution
Discussion
See Also
11.8 Evaluating Regression Models
Problem
Solution
Discussion
See Also
11.9 Evaluating Clustering Models
Problem
Solution
Discussion
See Also
11.10 Creating a Custom Evaluation Metric
Problem
Solution
Discussion
See Also
11.11 Visualizing the Effect of Training Set Size
Problem
Solution
Discussion
See Also
11.12 Creating a Text Report of Evaluation Metrics
Problem
Solution
Discussion
See Also
11.13 Visualizing the Effect of Hyperparameter Values
Problem
Solution
Discussion
See Also
Chapter 12. Model Selection
12.0 Introduction
12.1 Selecting the Best Models Using Exhaustive Search
Problem
Solution
Discussion
See Also
12.2 Selecting the Best Models Using Randomized Search
Problem
Solution
Discussion
See Also
12.3 Selecting the Best Models from Multiple Learning Algorithms
Problem
Solution
Discussion
12.4 Selecting the Best Models When Preprocessing
Problem
Solution
Discussion
12.5 Speeding Up Model Selection with Parallelization
Problem
Solution
Discussion
12.6 Speeding Up Model Selection Using Algorithm-Specific Methods
Problem
Solution
Discussion
See Also
12.7 Evaluating Performance After Model Selection
Problem
Solution
Discussion
Chapter 13. Linear Regression
13.0 Introduction
13.1 Fitting a Line
Problem
Solution
Discussion
13.2 Handling Interactive Effects
Problem
Solution
Discussion
13.3 Fitting a Nonlinear Relationship
Problem
Solution
Discussion
13.4 Reducing Variance with Regularization
Problem
Solution
Discussion
13.5 Reducing Features with Lasso Regression
Problem
Solution
Discussion
Chapter 14. Trees and Forests
14.0 Introduction
14.1 Training a Decision Tree Classifier
Problem
Solution
Discussion
See Also
14.2 Training a Decision Tree Regressor
Problem
Solution
Discussion
See Also
14.3 Visualizing a Decision Tree Model
Problem
Solution
Discussion
See Also
14.4 Training a Random Forest Classifier
Problem
Solution
Discussion
See Also
14.5 Training a Random Forest Regressor
Problem
Solution
Discussion
See Also
14.6 Evaluating Random Forests with Out-of-Bag Errors
Problem
Solution
Discussion
14.7 Identifying Important Features in Random Forests
Problem
Solution
Discussion
14.8 Selecting Important Features in Random Forests
Problem
Solution
Discussion
See Also
14.9 Handling Imbalanced Classes
Problem
Solution
Discussion
14.10 Controlling Tree Size
Problem
Solution
Discussion
14.11 Improving Performance Through Boosting
Problem
Solution
Discussion
See Also
14.12 Training an XGBoost Model
Problem
Solution
Discussion
See Also
14.13 Improving Real-Time Performance with LightGBM
Problem
Solution
Discussion
See Also
Chapter 15. K-Nearest Neighbors
15.0 Introduction
15.1 Finding an Observation’s Nearest Neighbors
Problem
Solution
Discussion
15.2 Creating a K-Nearest Neighbors Classifier
Problem
Solution
Discussion
15.3 Identifying the Best Neighborhood Size
Problem
Solution
Discussion
15.4 Creating a Radius-Based Nearest Neighbors Classifier
Problem
Solution
Discussion
15.5 Finding Approximate Nearest Neighbors
Problem
Solution
Discussion
See Also
15.6 Evaluating Approximate Nearest Neighbors
Problem
Solution
Discussion
See Also
Chapter 16. Logistic Regression
16.0 Introduction
16.1 Training a Binary Classifier
Problem
Solution
Discussion
16.2 Training a Multiclass Classifier
Problem
Solution
Discussion
16.3 Reducing Variance Through Regularization
Problem
Solution
Discussion
16.4 Training a Classifier on Very Large Data
Problem
Solution
Discussion
See Also
16.5 Handling Imbalanced Classes
Problem
Solution
Discussion
Chapter 17. Support Vector Machines
17.0 Introduction
17.1 Training a Linear Classifier
Problem
Solution
Discussion
17.2 Handling Linearly Inseparable Classes Using Kernels
Problem
Solution
Discussion
17.3 Creating Predicted Probabilities
Problem
Solution
Discussion
17.4 Identifying Support Vectors
Problem
Solution
Discussion
17.5 Handling Imbalanced Classes
Problem
Solution
Discussion
Chapter 18. Naive Bayes
18.0 Introduction
18.1 Training a Classifier for Continuous Features
Problem
Solution
Discussion
See Also
18.2 Training a Classifier for Discrete and Count Features
Problem
Solution
Discussion
18.3 Training a Naive Bayes Classifier for Binary Features
Problem
Solution
Discussion
18.4 Calibrating Predicted Probabilities
Problem
Solution
Discussion
Chapter 19. Clustering
19.0 Introduction
19.1 Clustering Using K-Means
Problem
Solution
Discussion
See Also
19.2 Speeding Up K-Means Clustering
Problem
Solution
Discussion
19.3 Clustering Using Mean Shift
Problem
Solution
Discussion
See Also
19.4 Clustering Using DBSCAN
Problem
Solution
Discussion
See Also
19.5 Clustering Using Hierarchical Merging
Problem
Solution
Discussion
Chapter 20. Tensors with PyTorch
20.0 Introduction
20.1 Creating a Tensor
Problem
Solution
Discussion
See Also
20.2 Creating a Tensor from NumPy
Problem
Solution
Discussion
See Also
20.3 Creating a Sparse Tensor
Problem
Solution
Discussion
See Also
20.4 Selecting Elements in a Tensor
Problem
Solution
Discussion
See Also
20.5 Describing a Tensor
Problem
Solution
Discussion
20.6 Applying Operations to Elements
Problem
Solution
Discussion
See Also
20.7 Finding the Maximum and Minimum Values
Problem
Solution
Discussion
20.8 Reshaping Tensors
Problem
Solution
Discussion
20.9 Transposing a Tensor
Problem
Solution
Discussion
20.10 Flattening a Tensor
Problem
Solution
Discussion
20.11 Calculating Dot Products
Problem
Solution
Discussion
See Also
20.12 Multiplying Tensors
Problem
Solution
Discussion
Chapter 21. Neural Networks
21.0 Introduction
21.1 Using Autograd with PyTorch
Problem
Solution
Discussion
See Also
21.2 Preprocessing Data for Neural Networks
Problem
Solution
Discussion
21.3 Designing a Neural Network
Problem
Solution
Discussion
See Also
21.4 Training a Binary Classifier
Problem
Solution
Discussion
21.5 Training a Multiclass Classifier
Problem
Solution
Discussion
21.6 Training a Regressor
Problem
Solution
Discussion
21.7 Making Predictions
Problem
Solution
Discussion
21.8 Visualizing Training History
Problem
Solution
Discussion
21.9 Reducing Overfitting with Weight Regularization
Problem
Solution
Discussion
21.10 Reducing Overfitting with Early Stopping
Problem
Solution
Discussion
21.11 Reducing Overfitting with Dropout
Problem
Solution
Discussion
21.12 Saving Model Training Progress
Problem
Solution
Discussion
21.13 Tuning Neural Networks
Problem
Solution
Discussion
21.14 Visualizing Neural Networks
Problem
Solution
Discussion
Chapter 22. Neural Networks for Unstructured Data
22.0 Introduction
22.1 Training a Neural Network for Image Classification
Problem
Solution
Discussion
See Also
22.2 Training a Neural Network for Text Classification
Problem
Solution
Discussion
22.3 Fine-Tuning a Pretrained Model for Image Classification
Problem
Solution
Discussion
See Also
22.4 Fine-Tuning a Pretrained Model for Text Classification
Problem
Solution
Discussion
See Also
Chapter 23. Saving, Loading, and Serving Trained Models
23.0 Introduction
23.1 Saving and Loading a scikit-learn Model
Problem
Solution
Discussion
23.2 Saving and Loading a TensorFlow Model
Problem
Solution
Discussion
See Also
23.3 Saving and Loading a PyTorch Model
Problem
Solution
Discussion
See Also
23.4 Serving scikit-learn Models
Problem
Solution
Discussion
23.5 Serving TensorFlow Models
Problem
Solution
Discussion
See Also
23.6 Serving PyTorch Models in Seldon
Problem
Solution
Discussion
See Also
Index
About the Authors