The Data Science Workshop: A New, Interactive Approach to Learning Data Science

Author(s): Anthony So; Thomas V. Joseph; Robert Thas John; Andrew Worsley; Dr. Samuel Asare
Publisher: Packt Publishing
Year: 2020

Language: English

Cover
FM
Copyright
Table of Contents
Preface
Chapter 1: Introduction to Data Science in Python
Introduction
Application of Data Science
What Is Machine Learning?
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Overview of Python
Types of Variable
Numeric Variables
Text Variables
Python List
Python Dictionary
Exercise 1.01: Creating a Dictionary That Will Contain Machine Learning Algorithms
Python for Data Science
The pandas Package
DataFrame and Series
CSV Files
Excel Spreadsheets
JSON
Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
Scikit-Learn
What Is a Model?
Model Hyperparameters
The sklearn API
Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
Activity 1.01: Train a Spam Detector Algorithm
Summary
Chapter 2: Regression
Introduction
Simple Linear Regression
The Method of Least Squares
Multiple Linear Regression
Estimating the Regression Coefficients (β0, β1, β2 and β3)
Logarithmic Transformations of Variables
Correlation Matrices
Conducting Regression Analysis Using Python
Exercise 2.01: Loading and Preparing the Data for Analysis
The Correlation Coefficient
Exercise 2.02: Graphical Investigation of Linear Relationships Using Python
Exercise 2.03: Examining a Possible Log-Linear Relationship Using Python
The Statsmodels formula API
Exercise 2.04: Fitting a Simple Linear Regression Model Using the Statsmodels formula API
Analyzing the Model Summary
The Model Formula Language
Intercept Handling
Activity 2.01: Fitting a Log-Linear Model Using the Statsmodels formula API
Multiple Regression Analysis
Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels formula API
Assumptions of Regression Analysis
Activity 2.02: Fitting a Multiple Log-Linear Regression Model
Explaining the Results of Regression Analysis
Regression Analysis Checks and Balances
The F-test
The t-test
Summary
Chapter 3: Binary Classification
Introduction
Understanding the Business Context
Business Discovery
Exercise 3.01: Loading and Exploring the Data from the Dataset
Testing Business Hypotheses Using Exploratory Data Analysis
Visualization for Exploratory Data Analysis
Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan
Intuitions from the Exploratory Analysis
Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits
Feature Engineering
Business-Driven Feature Engineering
Exercise 3.03: Feature Engineering – Exploration of Individual Features
Exercise 3.04: Feature Engineering – Creating New Features from Existing Ones
Data-Driven Feature Engineering
A Quick Peek at Data Types and a Descriptive Summary
Correlation Matrix and Visualization
Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data
Skewness of Data
Histograms
Density Plots
Other Feature Engineering Methods
Summarizing Feature Engineering
Building a Binary Classification Model Using the Logistic Regression Function
Logistic Regression Demystified
Metrics for Evaluating Model Performance
Confusion Matrix
Accuracy
Classification Report
Data Preprocessing
Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank
Activity 3.02: Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables
Next Steps
Summary
Chapter 4: Multiclass Classification with RandomForest
Introduction
Training a Random Forest Classifier
Evaluating the Model's Performance
Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance
Number of Trees Estimator
Exercise 4.02: Tuning n_estimators to Reduce Overfitting
Maximum Depth
Exercise 4.03: Tuning max_depth to Reduce Overfitting
Minimum Sample in Leaf
Exercise 4.04: Tuning min_samples_leaf
Maximum Features
Exercise 4.05: Tuning max_features
Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset
Summary
Chapter 5: Performing Your First Cluster Analysis
Introduction
Clustering with k-means
Exercise 5.01: Performing Your First Clustering Analysis on the ATO Dataset
Interpreting k-means Results
Exercise 5.02: Clustering Australian Postcodes by Business Income and Expenses
Choosing the Number of Clusters
Exercise 5.03: Finding the Optimal Number of Clusters
Initializing Clusters
Exercise 5.04: Using Different Initialization Parameters to Achieve a Suitable Outcome
Calculating the Distance to the Centroid
Exercise 5.05: Finding the Closest Centroids in Our Dataset
Standardizing Data
Exercise 5.06: Standardizing the Data from Our Dataset
Activity 5.01: Perform Customer Segmentation Analysis in a Bank Using k-means
Summary
Chapter 6: How to Assess Performance
Introduction
Splitting Data
Exercise 6.01: Importing and Splitting Data
Assessing Model Performance for Regression Models
Data Structures – Vectors and Matrices
Scalars
Vectors
Matrices
R² Score
Exercise 6.02: Computing the R² Score of a Linear Regression Model
Mean Absolute Error
Exercise 6.03: Computing the MAE of a Model
Exercise 6.04: Computing the Mean Absolute Error of a Second Model
Other Evaluation Metrics
Assessing Model Performance for Classification Models
Exercise 6.05: Creating a Classification Model for Computing Evaluation Metrics
The Confusion Matrix
Exercise 6.06: Generating a Confusion Matrix for the Classification Model
More on the Confusion Matrix
Precision
Exercise 6.07: Computing Precision for the Classification Model
Recall
Exercise 6.08: Computing Recall for the Classification Model
F1 Score
Exercise 6.09: Computing the F1 Score for the Classification Model
Accuracy
Exercise 6.10: Computing Model Accuracy for the Classification Model
Logarithmic Loss
Exercise 6.11: Computing the Log Loss for the Classification Model
Receiver Operating Characteristic Curve
Exercise 6.12: Computing and Plotting ROC Curve for a Binary Classification Problem
Area Under the ROC Curve
Exercise 6.13: Computing the ROC AUC for the Caesarian Dataset
Saving and Loading Models
Exercise 6.14: Saving and Loading a Model
Activity 6.01: Train Three Different Models and Use Evaluation Metrics to Pick the Best Performing Model
Summary
Chapter 7: The Generalization of Machine Learning Models
Introduction
Overfitting
Training on Too Many Features
Training for Too Long
Underfitting
Data
The Ratio for Dataset Splits
Creating Dataset Splits
Exercise 7.01: Importing and Splitting Data
Random State
Exercise 7.02: Setting a Random State When Splitting Data
Cross-Validation
KFold
Exercise 7.03: Creating a Five-Fold Cross-Validation Dataset
Exercise 7.04: Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls
cross_val_score
Exercise 7.05: Getting the Scores from Five-Fold Cross-Validation
Understanding Estimators That Implement CV
LogisticRegressionCV
Exercise 7.06: Training a Logistic Regression Model Using Cross-Validation
Hyperparameter Tuning with GridSearchCV
Decision Trees
Exercise 7.07: Using Grid Search with Cross-Validation to Find the Best Parameters for a Model
Hyperparameter Tuning with RandomizedSearchCV
Exercise 7.08: Using Randomized Search for Hyperparameter Tuning
Model Regularization with Lasso Regression
Exercise 7.09: Fixing Model Overfitting Using Lasso Regression
Ridge Regression
Exercise 7.10: Fixing Model Overfitting Using Ridge Regression
Activity 7.01: Find an Optimal Model for Predicting the Critical Temperatures of Superconductors
Summary
Chapter 8: Hyperparameter Tuning
Introduction
What Are Hyperparameters?
Difference between Hyperparameters and Statistical Model Parameters
Setting Hyperparameters
A Note on Defaults
Finding the Best Hyperparameterization
Exercise 8.01: Manual Hyperparameter Tuning for a k-NN Classifier
Advantages and Disadvantages of a Manual Search
Tuning Using Grid Search
Simple Demonstration of the Grid Search Strategy
GridSearchCV
Tuning Using GridSearchCV
Support Vector Machine (SVM) Classifiers
Exercise 8.02: Grid Search Hyperparameter Tuning for an SVM
Advantages and Disadvantages of Grid Search
Random Search
Random Variables and Their Distributions
Simple Demonstration of the Random Search Process
Tuning Using RandomizedSearchCV
Exercise 8.03: Random Search Hyperparameter Tuning for a Random Forest Classifier
Advantages and Disadvantages of a Random Search
Activity 8.01: Is the Mushroom Poisonous?
Summary
Chapter 9: Interpreting a Machine Learning Model
Introduction
Linear Model Coefficients
Exercise 9.01: Extracting the Linear Regression Coefficient
RandomForest Variable Importance
Exercise 9.02: Extracting RandomForest Feature Importance
Variable Importance via Permutation
Exercise 9.03: Extracting Feature Importance via Permutation
Partial Dependence Plots
Exercise 9.04: Plotting Partial Dependence
Local Interpretation with LIME
Exercise 9.05: Local Interpretation with LIME
Activity 9.01: Train and Analyze a Network Intrusion Detection Model
Summary
Chapter 10: Analyzing a Dataset
Introduction
Exploring Your Data
Analyzing Your Dataset
Exercise 10.01: Exploring the Ames Housing Dataset with Descriptive Statistics
Analyzing the Content of a Categorical Variable
Exercise 10.02: Analyzing the Categorical Variables from the Ames Housing Dataset
Summarizing Numerical Variables
Exercise 10.03: Analyzing Numerical Variables from the Ames Housing Dataset
Visualizing Your Data
How to Use the Altair API
Histogram for Numerical Variables
Bar Chart for Categorical Variables
Boxplots
Exercise 10.04: Visualizing the Ames Housing Dataset with Altair
Activity 10.01: Analyzing Churn Data Using Visual Data Analysis Techniques
Summary
Chapter 11: Data Preparation
Introduction
Handling Row Duplication
Exercise 11.01: Handling Duplicates in a Breast Cancer Dataset
Converting Data Types
Exercise 11.02: Converting Data Types for the Ames Housing Dataset
Handling Incorrect Values
Exercise 11.03: Fixing Incorrect Values in the State Column
Handling Missing Values
Exercise 11.04: Fixing Missing Values for the Horse Colic Dataset
Activity 11.01: Preparing the Speed Dating Dataset
Summary
Chapter 12: Feature Engineering
Introduction
Merging Datasets
The left join
The right join
Exercise 12.01: Merging the ATO Dataset with the Postcode Data
Binning Variables
Exercise 12.02: Binning the YearBuilt variable from the AMES Housing dataset
Manipulating Dates
Exercise 12.03: Date Manipulation on Financial Services Consumer Complaints
Performing Data Aggregation
Exercise 12.04: Feature Engineering Using Data Aggregation on the AMES Housing Dataset
Activity 12.01: Feature Engineering on a Financial Dataset
Summary
Chapter 13: Imbalanced Datasets
Introduction
Understanding the Business Context
Exercise 13.01: Benchmarking the Logistic Regression Model on the Dataset
Analysis of the Result
Challenges of Imbalanced Datasets
Strategies for Dealing with Imbalanced Datasets
Collecting More Data
Resampling Data
Exercise 13.02: Implementing Random Undersampling and Classification on Our Banking Dataset to Find the Optimal Result
Analysis
Generating Synthetic Samples
Implementation of SMOTE and MSMOTE
Exercise 13.03: Implementing SMOTE on Our Banking Dataset to Find the Optimal Result
Exercise 13.04: Implementing MSMOTE on Our Banking Dataset to Find the Optimal Result
Applying Balancing Techniques on a Telecom Dataset
Activity 13.01: Finding the Best Balancing Technique by Fitting a Classifier on the Telecom Churn Dataset
Summary
Chapter 14: Dimensionality Reduction
Introduction
Business Context
Exercise 14.01: Loading and Cleaning the Dataset
Creating a High-Dimensional Dataset
Activity 14.01: Fitting a Logistic Regression Model on a High-Dimensional Dataset
Strategies for Addressing High-Dimensional Datasets
Backward Feature Elimination (Recursive Feature Elimination)
Exercise 14.02: Dimensionality Reduction Using Backward Feature Elimination
Forward Feature Selection
Exercise 14.03: Dimensionality Reduction Using Forward Feature Selection
Principal Component Analysis (PCA)
Exercise 14.04: Dimensionality Reduction Using PCA
Independent Component Analysis (ICA)
Exercise 14.05: Dimensionality Reduction Using Independent Component Analysis
Factor Analysis
Exercise 14.06: Dimensionality Reduction Using Factor Analysis
Comparing Different Dimensionality Reduction Techniques
Activity 14.02: Comparison of Dimensionality Reduction Techniques on the Enhanced Ads Dataset
Summary
Chapter 15: Ensemble Learning
Introduction
Ensemble Learning
Variance
Bias
Business Context
Exercise 15.01: Loading, Exploring, and Cleaning the Data
Activity 15.01: Fitting a Logistic Regression Model on Credit Card Data
Simple Methods for Ensemble Learning
Averaging
Exercise 15.02: Ensemble Model Using the Averaging Technique
Weighted Averaging
Exercise 15.03: Ensemble Model Using the Weighted Averaging Technique
Iteration 2 with Different Weights
Max Voting
Exercise 15.04: Ensemble Model Using Max Voting
Advanced Techniques for Ensemble Learning
Bagging
Exercise 15.05: Ensemble Learning Using Bagging
Boosting
Exercise 15.06: Ensemble Learning Using Boosting
Stacking
Exercise 15.07: Ensemble Learning Using Stacking
Activity 15.02: Comparison of Advanced Ensemble Techniques
Summary
Chapter 16: Machine Learning Pipelines
Introduction
Pipelines
Business Context
Exercise 16.01: Preparing the Dataset to Implement Pipelines
Automating ML Workflows Using Pipeline
Automating Data Preprocessing Using Pipelines
Exercise 16.02: Applying Pipelines for Feature Extraction to the Dataset
ML Pipeline with Processing and Dimensionality Reduction
Exercise 16.03: Adding Dimensionality Reduction to the Feature Extraction Pipeline
ML Pipeline for Modeling and Prediction
Exercise 16.04: Modeling and Predictions Using ML Pipelines
ML Pipeline for Spot-Checking Multiple Models
Exercise 16.05: Spot-Checking Models Using ML Pipelines
ML Pipelines for Identifying the Best Parameters for a Model
Cross-Validation
Grid Search
Exercise 16.06: Grid Search and Cross-Validation with ML Pipelines
Applying Pipelines to a Dataset
Activity 16.01: Complete ML Workflow in a Pipeline
Summary
Chapter 17: Automated Feature Engineering
Introduction
Feature Engineering
Automating Feature Engineering Using Feature Tools
Business Context
Domain Story for the Problem Statement
Featuretools – Creating Entities and Relationships
Exercise 17.01: Defining Entities and Establishing Relationships
Feature Engineering – Basic Operations
Featuretools – Automated Feature Engineering
Exercise 17.02: Creating New Features Using Deep Feature Synthesis
Exercise 17.03: Classification Model after Automated Feature Generation
Featuretools on a New Dataset
Activity 17.01: Building a Classification Model with Features that have been Generated Using Featuretools
Summary
Index