Data Science in Practice is the ideal introduction to data science. With or without math skills, you get the all-round view you need for your projects. The book describes how to question data properly in order to unearth the treasure that data can be. You will get to know the relevant analysis methods and be introduced to the programming language R, which is ideally suited to data analysis. The introduction also covers associated tools, such as notebooks, that make data science programming easily accessible. Because technology alone is not enough, the book also deals with problems in project implementation, illuminates various fields of application, and addresses ethical aspects. Data Science in Practice includes many examples, notes on errors, decision-making aids, and other practical tips. It is ideal as a complementary text for university students and a useful learning tool for those moving into more data-related roles.
Author(s): Tom Alby
Series: Data Science Series
Edition: 1
Publisher: CRC Press
Year: 2023
Language: English
Pages: 318
Cover
Half Title
Series Page
Title Page
Copyright Page
Table of Contents
Foreword
Figures
Chapter 1 Introduction
1.1 The Age of Data: Is It Just a Hype?
1.2 Why Is Data Science Relevant Now?
1.3 Why Data Science with R?
1.4 Who Is This Book For?
1.5 Is It Possible to Learn Data Science without Math?
1.6 How to Use This Book
Chapter 2 Machine Learning, Data Science, and Artificial Intelligence
2.1 Learning from History – All Just Hype?
2.1.1 Data and Machines before the Dawn of AI
2.1.2 The First Spring of Artificial Intelligence
2.1.3 The First AI Winter
2.1.4 The Second AI Spring: Expert Systems
2.1.5 The Second AI Winter
2.1.6 Is This a New AI Spring?
2.1.7 Setbacks and New Hopes
2.1.8 Technological Singularity: Do Machines Have a Mind?
2.1.9 Alan Turing and the Turing Test
2.2 Definitions
2.2.1 Machine Learning
2.2.2 Artificial Intelligence
2.2.3 Data Science
2.2.4 Data Analysis and Statistics
2.2.5 Big Data
Chapter 3 The Anatomy of a Data Science Project
3.1 The General Flow of a Data Science Project
3.1.1 The CRISP-DM Stages
3.1.2 ASUM-DM
3.1.3 The Data Science Workflow According to Hadley Wickham
3.1.4 Which Approach Is Right for Me?
3.2 Business Understanding: What Is the Problem to Be Solved?
3.2.1 Senior Management Support and Involvement of the Specialist Department
3.2.2 Understanding Requirements
3.2.3 Overcoming Resistance: Who Is Afraid of the Evil AI?
3.3 Basic Approaches in Machine Learning
3.3.1 Supervised, Unsupervised, and Reinforcement Learning
3.3.2 Feature Engineering
3.4 Performance Measurement
3.4.1 Test and Training Data
3.4.2 Not All Errors Are Created Equal: False Positives and False Negatives
3.4.3 Confusion Matrix
3.4.4 ROC AUC
3.4.5 Precision Recall Curve
3.4.6 Impact Outside the Lab
3.4.7 Data Science ROI
3.5 Communication with Stakeholders
3.5.1 Reporting
3.5.2 Storytelling
3.6 From the Lab to the World: Data Science Applications in Production
3.6.1 Data Pipelines and Data Lakes
3.6.2 Integration with Other Systems
3.7 Typical Roles in a Data Science Project
3.7.1 Data Scientist
3.7.2 Data Engineer
3.7.3 Data Science Architect
3.7.4 Business Intelligence Analyst
3.7.5 The Subject Matter Expert
3.7.6 Project Management
3.7.7 Citizen Data Scientist
3.7.8 Other Roles
Chapter 4 Introduction to R
4.1 R: Free, Portable, and Interactive
4.1.1 History
4.1.2 Extension with Packages
4.1.3 The IDE RStudio
4.1.4 R versus Python
4.1.5 Other Languages
4.2 Installation and Configuration of R and RStudio
4.2.1 Installation of R and Short Functional Test
4.2.2 RStudio Installation
4.2.3 Configuration of R and RStudio
4.2.4 A Tour of RStudio
4.2.5 Projects in RStudio
4.2.6 The Cloud Alternative: Posit Cloud
4.3 First Steps with R
4.3.1 Everything in R Is an Object
4.3.2 Basic Commands
4.3.3 Data Types
4.3.4 Reading Data
4.3.5 Writing Data
4.3.6 Shortcuts
Chapter 5 Exploratory Data Analysis
5.1 Data: Collection, Cleaning and Transformation
5.1.1 Data Acquisition
5.1.2 How Much Data Is Enough?
5.1.3 Data Cleaning: The Different Dimensions of Data Quality
5.1.4 Data Transformation: The Underestimated Effort
5.2 Notebooks
5.2.1 EDAs with Notebooks and Markdown
5.2.2 Knitting
5.3 The Tidyverse
5.3.1 Why Use the Tidyverse?
5.3.2 The Basic Tidyverse Verbs
5.3.3 From Data Frames to Tibbles
5.3.4 Data Transformation
5.3.5 Regular Expressions and Mutate()
5.4 Data Visualization
5.4.1 Data Visualization as Part of the Analysis Process
5.4.2 Data Visualization as Part of the Reporting
5.4.3 Plots in Base R
5.4.4 ggplot2: A Grammar of Graphics
5.5 Data Analysis
Chapter 6 Forecasting
6.1 Linear Regression
6.1.1 How the Algorithm Works
6.1.2 Linear Regression in R
6.1.3 Interpretation of the Results
6.1.4 Advantages and Disadvantages
6.1.5 Non-Linear Regression
6.1.6 Small Hack: Linear Regression with Non-Linear Data
6.1.7 Logistic Regression
6.2 Anomaly Detection
6.2.1 Time Series Analyses
6.2.2 Fitting with the Forecast Package
Chapter 7 Clustering
7.1 Hierarchical Clustering
7.1.1 Introduction to the Algorithm
7.1.2 The Euclidean Distance and Its Competitors
7.1.3 The Distance Matrix, but Scaled
7.1.4 The Dendrogram
7.1.5 Dummy Variables: What If We Have No Numerical Data?
7.1.6 What Do You Do with the Results?
7.2 k-Means
7.2.1 How the Algorithm Works
7.2.2 How Do We Know k?
7.2.3 Interpretation of the Results
7.2.4 Is k-Means Always the Answer?
Chapter 8 Classification
8.1 Use Cases for Classification
8.2 Create Training and Test Data
8.2.1 The Titanic Data Set: A Brief EDA
8.2.2 The Caret Package: Dummy Variables and Splitting the Data
8.2.3 The pROC Package
8.3 Decision Trees
8.3.1 How the Algorithm Works
8.3.2 Training and Test
8.3.3 Interpretation of the Results
8.4 Support Vector Machines
8.4.1 How the Algorithm Works
8.4.2 Data Preparation
8.4.3 Training and Test
8.4.4 Interpretations of the Results
8.5 Naive Bayes
8.5.1 How the Algorithm Works
8.5.2 Data Preparation
8.5.3 Training and Test
8.5.4 Interpretation of the Results
8.6 XGBoost: The Newcomer
8.6.1 How the Algorithm Works
8.6.2 Data Preparation
8.6.3 Training and Test
8.6.4 Interpretation of the Results
8.7 Text Classification
8.7.1 Prepare the Data
8.7.2 Training and Test
8.7.3 Interpretation of the Results
Chapter 9 Other Use Cases
9.1 Shopping Cart Analysis – Association Rules
9.1.1 How the Algorithm Works
9.1.2 Data Preparation
9.1.3 Application of the Algorithm
9.1.4 Interpretations of the Results
9.1.5 Visualization of Association Rules
9.1.6 Association Rules with the Groceries Data Set
9.2 k-Nearest Neighbors
9.2.1 How the Algorithm Identifies Outliers
9.2.2 Who Is the Furthest out of Everyone Now?
9.2.3 kNN as Classifier
9.2.4 LOF for Misclassification Analysis
Chapter 10 Workflows and Tools
10.1 Versioning with Git
10.1.1 Why Versioning?
10.1.2 Git, GitHub, and GitLab
10.1.3 Basic Commands
10.1.4 Integration with RStudio
10.1.5 Commit and Push Code
10.2 Dealing with Large Amounts of Data
10.2.1 Need a Bigger Computer? Cloud Computing with R
10.2.2 Working with Clusters: Spark and Sparklyr
10.2.3 data.table
10.3 Deploy Applications via API
10.3.1 What Is a REST API?
10.3.2 Provide an API with the “plumber” Package
10.3.3 The Next Step: Docker
10.4 Create Applications with Shiny
10.4.1 What Is Shiny?
10.4.2 UI and Server
10.4.3 Publish a Shiny App from RStudio
10.4.4 Example Applications
10.4.5 shinyapps.io
Chapter 11 Ethical Handling of Data and Algorithms
11.1 Privacy
11.1.1 Legislations around the World
11.1.2 Do Users Really Care?
11.2 Ethics: Against Profiling and Discrimination
11.2.1 What Is Discrimination?
11.2.2 How to Prevent Discrimination
11.2.3 What Is Profiling?
11.2.4 How Can Profiling Be Prevented?
Chapter 12 Next Steps after This Book
12.1 Projects, Projects, Projects
12.1.1 Putting Together a Project Portfolio
12.1.2 Kaggle
12.2 Where to Find Help
12.2.1 RTFM
12.2.2 Stack Overflow
12.2.3 The R-Help Mailing List
12.3 RSeek
Chapter 13 Appendix: Troubleshooting
13.1 Typical Error Messages and Solutions
13.2 Typical Mistakes and How to Avoid Them
13.3 R or RStudio Does Not Respond
13.4 Typical Error Messages
Chapter 14 Glossary
Bibliography
Index