Practitioner's Guide to Data Science

This book aims to increase the visibility of Data Science in the real world, which differs from what you learn from a typical textbook. Many aspects of day-to-day Data Science work are almost absent from conventional statistics, Machine Learning, and Data Science curricula. Yet these activities account for a considerable share of the time and effort of data professionals in industry. Based on industry experience, this book outlines real-world scenarios and discusses pitfalls that Data Science practitioners should avoid. It also covers the Big Data cloud platform and the art of Data Science, such as soft skills. The authors use R as the primary tool and provide code for both R and Python.

This book is for readers who want to explore possible career paths and eventually become data scientists. It comprehensively introduces various Data Science fields, the soft and programming skills used in Data Science projects, and potential career paths. Traditional data-related practitioners such as statisticians, business analysts, and data analysts will find this book helpful in expanding their skills for future Data Science careers. Undergraduate and graduate students from analytics-related areas will find this book useful for learning real-world Data Science applications. Non-mathematical readers will appreciate the reproducibility of the companion R and Python code.

Key Features:
• It covers both technical and soft skills.
• It has a chapter dedicated to the Big Data cloud environment. In industry applications, data science is often practiced in such an environment.
• It is hands-on. We provide the data and repeatable R and Python code in notebooks. Readers can repeat the analysis in the book using the data and code provided. We also suggest that readers modify the notebooks to perform analyses with their own data and problems, if possible. The best way to learn Data Science is to do it!

Author(s): Hui Lin, Ming Li
Series: Data Science
Publisher: CRC Press
Year: 2023

Language: English
Pages: 403

Cover
Half Title
Series Page
Title Page
Copyright Page
Contents
List of Figures
Preface
About the Authors
Acknowledgment
1. Introduction
1.1. A Brief History of Data Science
1.2. Data Science Role and Skill Tracks
1.2.1. Engineering
1.2.2. Analysis
1.2.3. Modeling/Inference
1.3. What Kind of Questions Can Data Science Solve?
1.3.1. Prerequisites
1.3.2. Problem Type
1.4. Structure of Data Science Team
1.5. Data Science Roles
2. Soft Skills for Data Scientists
2.1. Comparison between Statistician and Data Scientist
2.2. Beyond Data and Analytics
2.3. Three Pillars of Knowledge
2.4. Data Science Project Cycle
2.4.1. Types of Data Science Projects
2.4.2. Problem Formulation and Project Planning Stage
2.4.3. Project Modeling Stage
2.4.4. Model Implementation and Post Production Stage
2.4.5. Project Cycle Summary
2.5. Common Mistakes in Data Science
2.5.1. Problem Formulation Stage
2.5.2. Project Planning Stage
2.5.3. Project Modeling Stage
2.5.4. Model Implementation and Post Production Stage
2.5.5. Summary of Common Mistakes
3. Introduction to the Data
3.1. Customer Data for a Clothing Company
3.2. Swine Disease Breakout Data
3.3. MNIST Dataset
3.4. IMDB Dataset
4. Big Data Cloud Platform
4.1. Power of Cluster of Computers
4.2. Evolution of Cluster Computing
4.2.1. Hadoop
4.2.2. Spark
4.3. Introduction of Cloud Environment
4.3.1. Open Account and Create a Cluster
4.3.2. R Notebook
4.3.3. Markdown Cells
4.4. Leverage Spark Using R Notebook
4.5. Databases and SQL
4.5.1. History
4.5.2. Database, Table, and View
4.5.3. Basic SQL Statement
4.5.4. Advanced Topics in Database
5. Data Pre-processing
5.1. Data Cleaning
5.2. Missing Values
5.2.1. Impute Missing Values with Median/Mode
5.2.2. K-nearest Neighbors
5.2.3. Bagging Tree
5.3. Centering and Scaling
5.4. Resolve Skewness
5.5. Resolve Outliers
5.6. Collinearity
5.7. Sparse Variables
5.8. Re-encode Dummy Variables
6. Data Wrangling
6.1. Summarize Data
6.1.1. dplyr Package
6.1.2. apply(), lapply() and sapply() in base R
6.2. Tidy and Reshape Data
7. Model Tuning Strategy
7.1. Variance-Bias Trade-Off
7.2. Data Splitting and Resampling
7.2.1. Data Splitting
7.2.2. Resampling
8. Measuring Performance
8.1. Regression Model Performance
8.2. Classification Model Performance
8.2.1. Confusion Matrix
8.2.2. Kappa Statistic
8.2.3. ROC
8.2.4. Gain and Lift Charts
9. Regression Models
9.1. Ordinary Least Square
9.1.1. The Magic P-value
9.1.2. Diagnostics for Linear Regression
9.2. Principal Component Regression and Partial Least Square
10. Regularization Methods
10.1. Ridge Regression
10.2. LASSO
10.3. Elastic Net
10.4. Penalized Generalized Linear Model
10.4.1. Introduction to glmnet Package
10.4.2. Penalized Logistic Regression
11. Tree-Based Methods
11.1. Tree Basics
11.2. Splitting Criteria
11.2.1. Gini Impurity
11.2.2. Information Gain (IG)
11.2.3. Information Gain Ratio (IGR)
11.2.4. Sum of Squared Error (SSE)
11.3. Tree Pruning
11.4. Regression and Decision Tree Basic
11.4.1. Regression Tree
11.4.2. Decision Tree
11.5. Bagging Tree
11.6. Random Forest
11.7. Gradient Boosted Machine
11.7.1. Adaptive Boosting
11.7.2. Stochastic Gradient Boosting
12. Deep Learning
12.1. Feedforward Neural Network
12.1.1. Logistic Regression as Neural Network
12.1.2. Stochastic Gradient Descent
12.1.3. Deep Neural Network
12.1.4. Activation Function
12.1.5. Optimization
12.1.6. Deal with Overfitting
12.1.7. Image Recognition Using FFNN
12.2. Convolutional Neural Network
12.2.1. Convolution Layer
12.2.2. Padding Layer
12.2.3. Pooling Layer
12.2.4. Convolution Over Volume
12.2.5. Image Recognition Using CNN
12.3. Recurrent Neural Network
12.3.1. RNN Model
12.3.2. Long Short Term Memory
12.3.3. Word Embedding
12.3.4. Sentiment Analysis Using RNN
A: Handling Large Local Data
A.1. readr
A.2. data.table: Enhanced data.frame
B: R Code for Data Simulation
B.1. Customer Data for Clothing Company
B.2. Swine Disease Breakout Data
Bibliography
Index