Thinking Data Science: A Data Science Practitioner’s Guide

This definitive guide to Machine Learning projects answers the questions an aspiring or experienced data scientist frequently asks: Which technology should I use for ML development? Should I use GOFAI, ANN/DNN, or Transfer Learning? Can I rely on AutoML for model development? What if the client provides gigabytes or terabytes of data for developing analytic models? How do I handle high-frequency dynamic datasets? This book provides the practitioner with a consolidation of the entire data science process in a single “Cheat Sheet”.

The challenge for a data scientist is to extract meaningful information from huge datasets that will help businesses create better strategies. Many Machine Learning algorithms and Neural Networks are designed to perform analytics on such datasets. For a data scientist, deciding which algorithm to use for a given dataset is daunting. Although there is no single answer to this question, a systematic approach to problem solving is necessary. This book describes the various ML algorithms conceptually and discusses a process for selecting ML/DL models. The consolidation of available algorithms and techniques for designing efficient ML models is the key aspect of this book. Thinking Data Science will help practising data scientists, academicians, researchers, and students who want to build ML models using the appropriate algorithms and architectures, whether the data is small or big.

 

Author(s): Poornachandra Sarang
Series: The Springer Series in Applied Machine Learning
Publisher: Springer
Year: 2023

Language: English
Pages: 365
City: Cham

Preface
Contents
1: Data Science Process
Traditional Model Building
Modern Approach for Model Building
AI on Image Datasets
Model Development on Text Datasets
Model Building on High-Frequency Datasets
Data Science Process
Data Preparation
Numeric Data Processing
Text Processing
Preprocessing Text Data
Exploratory Data Analysis
Features Engineering
Deciding on Model Type
Model Training
Algorithm Selection
AutoML
Hyper-Parameter Tuning
Model Building Using ANN
Models Based on Transfer Learning
Summary
2: Dimensionality Reduction
In a Nutshell
Why Reduce Dimensionality?
Dimensionality Reduction Techniques
Project Dataset
Columns with Missing Values
Filtering Columns Based on Variance
Filtering Highly Correlated Columns
Random Forest
Backward Elimination
Forward Features Selection
Factor Analysis
Principal Component Analysis
PCA on Huge Multi-columnar Dataset
About the Dataset
Loading Dataset
Model Building
PCA for Visualization
PCA for Model Building
Independent Component Analysis
Isometric Mapping
t-Distributed Stochastic Neighbor Embedding (t-SNE)
UMAP
Singular Value Decomposition
Linear Discriminant Analysis (LDA)
Summary
Part I: Classical Algorithms: Overview
3: Regression Analysis
In a Nutshell
When to Use?
Regression Types
Linear Regression
Assumptions
Polynomial Regression
Ridge Regression
Lasso Regression
ElasticNet Regression
Linear Regression Implementations
Linear Regression
Ridge Regression
Lasso Regression
Bayesian Linear Regression
BLR Implementation
BLR Project
Logistic Regression
Logistic Regression Implementation
Guidelines for Model Selection
What's Next?
Summary
4: Decision Tree
In a Nutshell
Wide Range of Applications
Decision Tree Workings
Tree Traversal
Tree Construction
Entropy
Information Gain
Gini Index
Constructing Tree
Tree Construction Algorithm
Tree Traversal Algorithm
Implementation
Project (Regression)
Loading Dataset
Preparing Datasets
Model Building
Evaluating Performance
Tree Visualization
Feature Importance
Project (Classifier)
Summary
5: Ensemble: Bagging and Boosting
What is Bagging and Boosting?
Bagging
Boosting
Random Forest
In a Nutshell
What Is Random Forest?
Random Forest Algorithm
Advantages
Applications
Implementation
Random Forest Project
ExtraTrees
Bagging Ensemble Project
ExtraTreesRegressor
ExtraTreesClassifier
Bagging
BaggingRegressor
BaggingClassifier
AdaBoost
How Does It Work?
Implementation
AdaBoostRegressor
AdaBoost Classifier
Advantages/Disadvantages
Gradient Boosting
Loss Function
Requirements for Gradient Boosting
Implementation
GradientBoostingRegressor
AdaBoostClassifier
Pros and Cons
XGBoost
Implementation
XGBRegressor
XGBClassifier
CatBoost
Implementation
CatBoostRegressor
CatBoostClassifier
LightGBM
Implementation
The LGBMRegressor
The LGBMClassifier
Performance Summary
Summary
6: K-Nearest Neighbors
In a Nutshell
K-Nearest Neighbors
KNN Algorithm
KNN Working
Effect of K
Advantages
Disadvantages of KNN
Implementation
Project
Loading Dataset
Determining K Optimal
Model Training
Model Testing
When to Use?
Summary
7: Naive Bayes
In a Nutshell
When to Use?
Naive Bayes Theorem
Applying the Theorem
Advantages
Disadvantages
Improving Performance
Naive Bayes Types
Multinomial Naive Bayes
Bernoulli Naive Bayes
Gaussian Naive Bayes
Complement Naive Bayes
Categorical Naive Bayes
Model Fitting for Huge Datasets
Project
Preparing Dataset
Data Visualization
Model Building
Inferring on Unseen Data
Summary
8: Support Vector Machines
In a Nutshell
SVM Working
Hyperplane Types
Kernel Effects
Linear Kernel
Polynomial Kernel
Radial Basis Function
Sigmoid
Guidelines on Kernel Selection
Parameter Tuning
The C Parameter
The Degree Parameter
The Gamma Parameter
The decision_function_shape Parameter
Project
Advantages and Disadvantages
Summary
Part II: Clustering: Overview
9: Centroid-Based Clustering
The K-Means Algorithm
In a Nutshell
How Does It Work?
K-Means Algorithm
Objective Function
The Process Workflow
Selecting Optimal Clusters
Elbow Method
Average Silhouette Method
The Gap Statistic Method
Limitations of K-Means Clustering
Applications
Implementation
Project
The K-Medoids Algorithm
In a Nutshell
Why K-Medoids?
Algorithm
Merits and Demerits
Implementation
Summary
10: Connectivity-Based Clustering
Agglomerative Clustering
In a Nutshell
The Working
Single Linkage
Complete Linkage
Average Linkage
Advantages and Disadvantages
Applications
Implementation
Project
Divisive Clustering
In a Nutshell
The Working
Implementation Challenges
Summary
11: Gaussian Mixture Model
In a Nutshell
Gaussian Distribution
Probability Distribution
Selecting Number of Clusters
Implementation
Project
Determining Optimal Number of Clusters
Summary
12: Density-Based Clustering
DBSCAN
In a Nutshell
Why DBSCAN?
Preliminaries
Algorithm Working
Advantages and Disadvantages
Implementation
Project
OPTICS
In a Nutshell
Core Distance
Reachability Distance
Implementation
Project
Mean Shift Clustering
In a Nutshell
Algorithm Working
Bandwidth Selection
Strengths
Weaknesses
Applications
Implementation
Project
Summary
13: BIRCH
In a Nutshell
Why BIRCH?
Clustering Feature
CF Tree
BIRCH Algorithm
Implementation
Project
Summary
14: CLARANS
In a Nutshell
CLARA Algorithm
CLARANS Algorithm
Advantages
Project
Summary
15: Affinity Propagation Clustering
In a Nutshell
Algorithm Working
Responsibility Matrix Updates
Availability Matrix Updates
Updating Scores
Few Remarks
Implementation
Project
Summary
16: STING & CLIQUE
STING: A Grid-Based Clustering Algorithm
In a Nutshell
How Does It Work?
Advantages and Disadvantages
Applications
CLIQUE: Density- and Grid-Based Subspace Clustering Algorithm
In a Nutshell
How Does It Work?
Pros/Cons
Implementation
Project
Summary
Part III: ANN: Overview
17: Artificial Neural Networks
AI Evolution
Artificial Neural Networks
Perceptron
What Is ANN?
Network Training
ANN Architectures
What Is DNN?
Network Architectures
What Are Pre-trained Models?
Important Terms to Know
Activation Functions
Back Propagation
Vanishing and Exploding Gradients
Optimization Functions
Types of Optimizers
Loss Functions
Regression Loss Functions
Classification Loss Functions
Types of Network Architectures
Convolutional Neural Network
Convolutional Layer
Pooling Layer
Fully Connected Layer
CNN Applications
Generative Adversarial Network
Model Architecture
The Generator
The Discriminator
How Does GAN Work?
How Data Scientists Use GAN?
Recurrent Neural Networks (RNN)
Long Short-Term Memory (LSTM)
Forget Gate
Input Gate
Update Gate
Output Gate
LSTM Applications
Transfer Learning
Pre-trained Models for Text
Word2Vec
Glove
Transformer
BERT
GPT
Pre-trained Models for Image Data
Advantages/Disadvantages
Summary
18: ANN-Based Applications
Developing NLP Applications
Dataset
Text Preprocessing
Using BERT
Creating Training/Testing Datasets
Setting Up BERT
Model Building
Model Training
Model Evaluation
Using Embeddings
N-gram Analysis
Tokenizing
Remove Stop Words
Model Building
Using Own Embeddings: Model 0
Embedding Weight Matrix
Glove: Model 1
Glove: Model 2
Glove: Model 3
Final Thoughts
Developing Image-Based Applications
Data Preparation
Modeling
CNN-Based Network
VGG16
ResNet50
MobileNet
DenseNet121
Summarizing Observations
Modeling on High-Resolution Images
Inferring Web Images
Summary
19: Automated Tools
In a Nutshell
Classical AI
Auto-sklearn
Auto-sklearn for Classification on Synthetic Dataset
Auto-sklearn for Classification on Real Dataset
Auto-sklearn for Regression
Auto-sklearn Architecture
Auto-sklearn Features
What's Next?
ANN/DNN
AutoKeras for Classification
AutoKeras for Regression
AutoKeras Image Classifier
More AutoML Frameworks
PyCaret
MLBox
TPOT
H2O.ai
DataRobot
DataBricks
BlobCity AutoAI
Summary
20: Data Scientist's Ultimate Workflow
Consolidated Overview
Workflow-0: Quick Solution
Workflow-1: Technology Selection
Workflow-2: Data Preprocessing
Workflow-3: EDA
Workflow-4: Features Engineering
Workflow-5: Type of Task
Workflow-6: Preparing Datasets
Workflow-7: Algorithm Selections
Workflow-8: AutoML
Workflow-9: Hyper-parameter Tuning
Workflow-10: ANN Model Building
Workflow-11: Clustering
Summary