Imbalanced Classification with Python: Better Metrics, Balance Skewed Classes, Cost-Sensitive Learning

Imbalanced classification refers to classification tasks where the distribution of examples across the classes is not equal. Cut through the equations, Greek letters, and confusion, and discover the specialized data preparation techniques, learning algorithms, and performance metrics that you need to know. Using clear explanations, standard Python libraries, and step-by-step tutorial lessons, you will discover how to confidently develop robust models for your own imbalanced classification projects.

Author(s): Jason Brownlee
Series: Machine Learning Mastery
Edition: 1.2
Publisher: Independently Published
Year: 2020

Language: English
Pages: 446

Copyright
Contents
Preface
I Introduction
II Foundation
What is Imbalanced Classification
Tutorial Overview
Classification Predictive Modeling
Imbalanced Classification Problems
Causes of Class Imbalance
Challenge of Imbalanced Classification
Examples of Imbalanced Classification
Further Reading
Summary
Intuition for Imbalanced Classification
Tutorial Overview
Create and Plot a Binary Classification Problem
Create Synthetic Dataset with a Class Distribution
Effect of Skewed Class Distributions
Further Reading
Summary
Challenge of Imbalanced Classification
Tutorial Overview
Why Imbalanced Classification Is Hard
Compounding Effect of Dataset Size
Compounding Effect of Label Noise
Compounding Effect of Data Distribution
Further Reading
Summary
III Model Evaluation
Tour of Model Evaluation Metrics
Tutorial Overview
Challenge of Evaluation Metrics
Taxonomy of Classifier Evaluation Metrics
How to Choose an Evaluation Metric
Further Reading
Summary
The Failure of Accuracy
Tutorial Overview
What Is Classification Accuracy?
Accuracy Fails for Imbalanced Classification
Example of Accuracy for Imbalanced Classification
Further Reading
Summary
Precision, Recall, and F-measure
Tutorial Overview
Precision Measure
Recall Measure
Precision vs. Recall
F-measure
Further Reading
Summary
ROC Curves and Precision-Recall Curves
Tutorial Overview
ROC Curves and ROC AUC
Precision-Recall Curves and AUC
ROC and PR Curves With a Severe Imbalance
Further Reading
Summary
Probability Scoring Methods
Tutorial Overview
Probability Metrics
Log Loss Score
Brier Score
Further Reading
Summary
Cross-Validation for Imbalanced Datasets
Tutorial Overview
Challenge of Evaluating Classifiers
Failure of k-Fold Cross-Validation
Fix Cross-Validation for Imbalanced Classification
Further Reading
Summary
IV Data Sampling
Tour of Data Sampling Methods
Tutorial Overview
Problem of an Imbalanced Class Distribution
Balance the Class Distribution With Sampling
Tour of Popular Data Sampling Methods
Further Reading
Summary
Random Data Sampling
Tutorial Overview
Random Sampling
Random Oversampling
Random Undersampling
Further Reading
Summary
Oversampling Methods
Tutorial Overview
Synthetic Minority Oversampling Technique
SMOTE for Balancing Data
SMOTE for Classification
SMOTE With Selective Sample Generation
Further Reading
Summary
Undersampling Methods
Tutorial Overview
Undersampling for Imbalanced Classification
Methods that Select Examples to Keep
Methods that Select Examples to Delete
Combinations of Keep and Delete Methods
Further Reading
Summary
Oversampling and Undersampling
Tutorial Overview
Binary Test Problem and Decision Tree Model
Manually Combine Data Sampling Methods
Standard Combined Data Sampling Methods
Further Reading
Summary
V Cost-Sensitive
Cost-Sensitive Learning
Tutorial Overview
Not All Classification Errors Are Equal
Cost-Sensitive Learning
Cost-Sensitive Imbalanced Classification
Cost-Sensitive Methods
Further Reading
Summary
Cost-Sensitive Logistic Regression
Tutorial Overview
Imbalanced Classification Dataset
Logistic Regression for Imbalanced Classification
Weighted Logistic Regression with Scikit-Learn
Grid Search Weighted Logistic Regression
Further Reading
Summary
Cost-Sensitive Decision Trees
Tutorial Overview
Imbalanced Classification Dataset
Decision Trees for Imbalanced Classification
Weighted Decision Tree With Scikit-Learn
Grid Search Weighted Decision Tree
Further Reading
Summary
Cost-Sensitive Support Vector Machines
Tutorial Overview
Imbalanced Classification Dataset
SVM for Imbalanced Classification
Weighted SVM With Scikit-Learn
Grid Search Weighted SVM
Further Reading
Summary
Cost-Sensitive Deep Learning in Keras
Tutorial Overview
Imbalanced Classification Dataset
Neural Network Model in Keras
Deep Learning for Imbalanced Classification
Weighted Neural Network With Keras
Further Reading
Summary
Cost-Sensitive Gradient Boosting with XGBoost
Tutorial Overview
Imbalanced Classification Dataset
XGBoost Model for Classification
Weighted XGBoost for Class Imbalance
Tune the Class Weighting Hyperparameter
Further Reading
Summary
VI Advanced Algorithms
Probability Threshold Moving
Tutorial Overview
Converting Probabilities to Class Labels
Threshold-Moving for Imbalanced Classification
Optimal Threshold for ROC Curve
Optimal Threshold for Precision-Recall Curve
Optimal Threshold Tuning
Further Reading
Summary
Probability Calibration
Tutorial Overview
Problem of Uncalibrated Probabilities
How to Calibrate Probabilities
SVM With Calibrated Probabilities
Decision Tree With Calibrated Probabilities
Grid Search Probability Calibration With KNN
Further Reading
Summary
Ensemble Algorithms
Tutorial Overview
Bagging for Imbalanced Classification
Random Forest for Imbalanced Classification
Easy Ensemble for Imbalanced Classification
Further Reading
Summary
One-Class Classification
Tutorial Overview
One-Class Classification for Imbalanced Data
One-Class Support Vector Machines
Isolation Forest
Minimum Covariance Determinant
Local Outlier Factor
Further Reading
Summary
VII Projects
Framework for Imbalanced Classification Projects
Tutorial Overview
What Algorithm To Use?
Use a Systematic Framework
Detailed Framework for Imbalanced Classification
Further Reading
Summary
Project: Haberman Breast Cancer Classification
Tutorial Overview
Haberman Breast Cancer Survival Dataset
Explore the Dataset
Model Test and Baseline Result
Evaluate Probabilistic Models
Make Prediction on New Data
Further Reading
Summary
Project: Oil Spill Classification
Tutorial Overview
Oil Spill Dataset
Explore the Dataset
Model Test and Baseline Result
Evaluate Models
Make Prediction on New Data
Further Reading
Summary
Project: German Credit Classification
Tutorial Overview
German Credit Dataset
Explore the Dataset
Model Test and Baseline Result
Evaluate Models
Make Prediction on New Data
Further Reading
Summary
Project: Microcalcification Classification
Tutorial Overview
Mammography Dataset
Explore the Dataset
Model Test and Baseline Result
Evaluate Models
Make Predictions on New Data
Further Reading
Summary
Project: Phoneme Classification
Tutorial Overview
Phoneme Dataset
Explore the Dataset
Model Test and Baseline Result
Evaluate Models
Make Prediction on New Data
Further Reading
Summary
VIII Appendix
Getting Help
Imbalanced Classification Books
Machine Learning Books
Python APIs
Ask Questions About Imbalanced Classification
How to Ask Questions
Contact the Author
How to Setup Python on Your Workstation
Tutorial Overview
Download Anaconda
Install Anaconda
Start and Update Anaconda
Install the Imbalanced-Learn Library
Install the Deep Learning Libraries
Install the XGBoost Library
Further Reading
Summary
IX Conclusions
How Far You Have Come