Gain hands-on experience in Python programming and industry-standard machine learning, using pandas, scikit-learn, and XGBoost
Key Features
• Think critically about data by exploring and cleaning it
• Choose an appropriate machine learning model and train it on your data
• Communicate data-driven insights with confidence and clarity
Book Description
If data is the new oil, then machine learning is the drill. As companies gain access to ever-increasing quantities of raw data, the ability to deliver state-of-the-art predictive models that support business decision-making becomes more and more valuable.
In this book, you'll work on an end-to-end project based on a realistic dataset, split into bite-sized practical exercises. This case-study approach simulates the working conditions you'll experience in real-world data science projects.
You'll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of exploring and processing data, before moving on to fitting, evaluating, and tuning algorithms such as regularized logistic regression and random forest.
Now in its second edition and updated to the latest version of Python, this book takes you through the full process of exploring data and delivering machine learning models, with brand-new content for 2021 on XGBoost, SHAP values, and how to evaluate and monitor models.
By the end of this data science book, you'll have the skills, understanding, and confidence to build your own machine learning models and gain insights from real data.
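To give a concrete sense of that workflow, here is a minimal sketch of the kind of pipeline the book builds, from loading data with pandas to evaluating a regularized logistic regression with scikit-learn. The file name case_study.csv and the response column default are illustrative placeholders, not the book's actual data.

```python
# Minimal sketch of the workflow: load data with pandas, then fit and
# evaluate a regularized logistic regression with scikit-learn. The file
# 'case_study.csv' and response column 'default' are assumed placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv('case_study.csv')           # load the data for exploration
X = df.drop(columns=['default'])             # feature matrix
y = df['default']                            # binary response variable

# Hold out a test set, preserving the class balance of the response
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# L2-regularized logistic regression; smaller C means stronger regularization
model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Evaluate with the area under the ROC curve
y_prob = model.predict_proba(X_test)[:, 1]
print(f'Test ROC AUC: {roc_auc_score(y_test, y_prob):.3f}')
```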
What You Will Learn
• Load, explore, and process data using the pandas Python package
• Use Matplotlib to create compelling data visualizations
• Implement predictive machine learning models with scikit-learn
• Use lasso and ridge regression to reduce model overfitting
• Evaluate random forest and logistic regression model performance
• Create state-of-the-art models with XGBoost
• Use SHAP values to explain model predictions (see the sketch after this list)
• Deliver business insights by presenting clear, convincing conclusions
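The XGBoost and SHAP material is new to this edition; a brief sketch of what it can look like in practice follows. It uses synthetic data in place of the book's case study and assumes a recent xgboost release in which early_stopping_rounds is a constructor argument.

```python
# Sketch of the second edition's new material: an XGBoost classifier with
# early stopping, explained with SHAP values. Synthetic data stands in for
# the book's case study dataset (an illustrative assumption).
import xgboost as xgb
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Early stopping halts training when the validation metric stops improving
model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.1,
                          early_stopping_rounds=10, eval_metric='auc')
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# SHAP values attribute each prediction to individual feature contributions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)
```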
Who This Book Is For
Data Science Projects with Python - Second Edition is for anyone who wants to get started with data science and machine learning. If you're keen to advance your career by using data analysis and predictive modeling to generate business insights, then this book is the perfect place to begin. To quickly grasp the concepts covered, it is recommended that you have basic experience of programming in Python or a similar language, and a general interest in statistics.
Author(s): Stephen Klosterman
Edition: 2
Publisher: Packt Publishing
Year: 2021
Language: English
Commentary: Vector PDF
Pages: 432
City: Birmingham, UK
Tags: Machine Learning; Decision Trees; Data Science; Python; Classification; Feature Engineering; Finance; Linear Regression; Logistic Regression; scikit-learn; Ensemble Learning; Data Cleaning; matplotlib; pandas; Jupyter; Model Evaluation; Overfitting; Random Forest; Data Pipelines; Gradient Boosting; Synthetic Data; Data Quality; XGBoost
Table of Contents
Preface
Chapter 1: Data Exploration and Cleaning
Introduction
Python and the Anaconda Package Management System
Indexing and the Slice Operator
Exercise 1.01: Examining Anaconda and Getting Familiar with Python
Different Types of Data Science Problems
Loading the Case Study Data with Jupyter and pandas
Exercise 1.02: Loading the Case Study Data in a Jupyter Notebook
Getting Familiar with Data and Performing Data Cleaning
The Business Problem
Data Exploration Steps
Exercise 1.03: Verifying Basic Data Integrity
Boolean Masks
Exercise 1.04: Continuing Verification of Data Integrity
Exercise 1.05: Exploring and Cleaning the Data
Data Quality Assurance and Exploration
Exercise 1.06: Exploring the Credit Limit and Demographic Features
Deep Dive: Categorical Features
Exercise 1.07: Implementing OHE for a Categorical Feature
Exploring the Financial History Features in the Dataset
Activity 1.01: Exploring the Remaining Financial Features in the Dataset
Summary
Chapter 2: Introduction to Scikit-Learn and Model Evaluation
Introduction
Exploring the Response Variable and Concluding the Initial Exploration
Introduction to Scikit-Learn
Generating Synthetic Data
Data for Linear Regression
Exercise 2.01: Linear Regression in Scikit-Learn
Model Performance Metrics for Binary Classification
Splitting the Data: Training and Test Sets
Classification Accuracy
True Positive Rate, False Positive Rate, and Confusion Matrix
Exercise 2.02: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python
Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?
Exercise 2.03: Obtaining Predicted Probabilities from a Trained Logistic Regression Model
The Receiver Operating Characteristic (ROC) Curve
Precision
Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve
Summary
Chapter 3: Details of Logistic Regression and Feature Exploration
Introduction
Examining the Relationships Between Features and the Response Variable
Pearson Correlation
Mathematics of Linear Correlation
F-test
Exercise 3.01: F-test and Univariate Feature Selection
Finer Points of the F-test: Equivalence to the t-test for Two Classes and Cautions
Hypotheses and Next Steps
Exercise 3.02: Visualizing the Relationship Between the Features and Response Variable
Univariate Feature Selection: What It Does and Doesn't Do
Understanding Logistic Regression and the Sigmoid Function Using Function Syntax in Python
Exercise 3.03: Plotting the Sigmoid Function
Scope of Functions
Why Is Logistic Regression Considered a Linear Model?
Exercise 3.04: Examining the Appropriateness of Features for Logistic Regression
From Logistic Regression Coefficients to Predictions Using Sigmoid
Exercise 3.05: Linear Decision Boundary of Logistic Regression
Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients
Summary
Chapter 4: The Bias-Variance Trade-Off
Introduction
Estimating the Coefficients and Intercepts of Logistic Regression
Gradient Descent to Find Optimal Parameter Values
Exercise 4.01: Using Gradient Descent to Minimize a Cost Function
Assumptions of Logistic Regression
The Motivation for Regularization: The Bias-Variance Trade-Off
Exercise 4.02: Generating and Modeling Synthetic Classification Data
Lasso (L1) and Ridge (L2) Regularization
Cross-Validation: Choosing the Regularization Parameter
Exercise 4.03: Reducing Overfitting on the Synthetic Data Classification Problem
Options for Logistic Regression in Scikit-Learn
Scaling Data, Pipelines, and Interaction Features in Scikit-Learn
Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data
Summary
Chapter 5: Decision Trees and Random Forests
Introduction
Decision Trees
The Terminology of Decision Trees and Connections to Machine Learning
Exercise 5.01: A Decision Tree in Scikit-Learn
Training Decision Trees: Node Impurity
Features Used for the First Splits: Connections to Univariate Feature Selection and Interactions
Training Decision Trees: A Greedy Algorithm
Training Decision Trees: Different Stopping Criteria and Other Options
Using Decision Trees: Advantages and Predicted Probabilities
A More Convenient Approach to Cross-Validation
Exercise 5.02: Finding Optimal Hyperparameters for a Decision Tree
Random Forests: Ensembles of Decision Trees
Random Forest: Predictions and Interpretability
Exercise 5.03: Fitting a Random Forest
Checkerboard Graph
Activity 5.01: Cross-Validation Grid Search with Random Forest
Summary
Chapter 6: Gradient Boosting, XGBoost, and SHAP Values
Introduction
Gradient Boosting and XGBoost
What Is Boosting?
Gradient Boosting and XGBoost
XGBoost Hyperparameters
Early Stopping
Tuning the Learning Rate
Other Important Hyperparameters in XGBoost
Exercise 6.01: Randomized Grid Search for Tuning XGBoost Hyperparameters
Another Way of Growing Trees: XGBoost's grow_policy
Explaining Model Predictions with SHAP Values
Exercise 6.02: Plotting SHAP Interactions, Feature Importance, and Reconstructing Predicted Probabilities from SHAP Values
Missing Data
Saving Python Variables to a File
Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP
Summary
Chapter 7: Test Set Analysis, Financial Insights, and Delivery to the Client
Introduction
Review of Modeling Results
Feature Engineering
Ensembling Multiple Models
Different Modeling Techniques
Balancing Classes
Model Performance on the Test Set
Distribution of Predicted Probability and Decile Chart
Exercise 7.01: Equal-Interval Chart
Calibration of Predicted Probabilities
Financial Analysis
Financial Conversation with the Client
Exercise 7.02: Characterizing Costs and Savings
Activity 7.01: Deriving Financial Insights
Final Thoughts on Delivering a Predictive Model to the Client
Model Monitoring
Ethics in Predictive Modeling
Summary
Appendix
Index