The Data Science Design Manual

This engaging and clearly written textbook/reference provides a must-have introduction to the rapidly emerging interdisciplinary field of data science. It focuses on the principles fundamental to becoming a good data scientist and the key skills needed to build systems for collecting, analyzing, and interpreting data. The Data Science Design Manual is a source of practical insights that highlights what really matters in analyzing data, and provides an intuitive understanding of how these core concepts can be used. The book does not emphasize any particular programming language or suite of data-analysis tools, focusing instead on high-level discussion of important design principles.

This easy-to-read text ideally serves the needs of undergraduate and early graduate students embarking on an “Introduction to Data Science” course. It reveals how this discipline sits at the intersection of statistics, computer science, and machine learning, with a distinct heft and character of its own. Practitioners in these and related fields will find this book perfect for self-study as well.

Additional learning tools:
Contains “War Stories,” offering perspectives on how data science applies in the real world
Includes “Homework Problems,” providing a wide range of exercises and projects for self-study
Provides a complete set of lecture slides and online video lectures at www.data-manual.com
Provides “Take-Home Lessons,” emphasizing the big-picture concepts to learn from each chapter
Recommends exciting “Kaggle Challenges” from the online platform Kaggle
Highlights “False Starts,” revealing the subtle reasons why certain approaches fail
Offers examples taken from the data science television show “The Quant Shop” (www.quant-shop.com)

Author(s): Steven S. Skiena
Publisher: Springer
Year: 2017

Language: English
Pages: 445

Preface
Contents
1 What is Data Science?
1.1 Computer Science, Data Science, and Real Science
1.2 Asking Interesting Questions from Data
1.2.1 The Baseball Encyclopedia
1.2.2 The Internet Movie Database (IMDb)
1.2.3 Google Ngrams
1.2.4 New York Taxi Records
1.3 Properties of Data
1.3.1 Structured vs. Unstructured Data
1.3.2 Quantitative vs. Categorical Data
1.3.3 Big Data vs. Little Data
1.4 Classification and Regression
1.5 Data Science Television: The Quant Shop
1.5.1 Kaggle Challenges
1.6 About the War Stories
1.7 War Story: Answering the Right Question
1.8 Chapter Notes
1.9 Exercises
2 Mathematical Preliminaries
2.1 Probability
2.1.1 Probability vs. Statistics
2.1.2 Compound Events and Independence
2.1.3 Conditional Probability
2.1.4 Probability Distributions
2.2 Descriptive Statistics
2.2.1 Centrality Measures
2.2.2 Variability Measures
2.2.3 Interpreting Variance
2.2.4 Characterizing Distributions
2.3 Correlation Analysis
2.3.1 Correlation Coefficients: Pearson and Spearman Rank
2.3.2 The Power and Significance of Correlation
2.3.3 Correlation Does Not Imply Causation!
2.3.4 Detecting Periodicities by Autocorrelation
2.4 Logarithms
2.4.1 Logarithms and Multiplying Probabilities
2.4.2 Logarithms and Ratios
2.4.3 Logarithms and Normalizing Skewed Distributions
2.5 War Story: Fitting Designer Genes
2.6 Chapter Notes
2.7 Exercises
3 Data Munging
3.1 Languages for Data Science
3.1.1 The Importance of Notebook Environments
3.1.2 Standard Data Formats
3.2 Collecting Data
3.2.1 Hunting
3.2.2 Scraping
3.2.3 Logging
3.3 Cleaning Data
3.3.1 Errors vs. Artifacts
3.3.2 Data Compatibility
3.3.3 Dealing with Missing Values
3.3.4 Outlier Detection
3.4 War Story: Beating the Market
3.5 Crowdsourcing
3.5.1 The Penny Demo
3.5.2 When is the Crowd Wise?
3.5.3 Mechanisms for Aggregation
3.5.4 Crowdsourcing Services
3.5.5 Gamification
3.6 Chapter Notes
3.7 Exercises
4 Scores and Rankings
4.1 The Body Mass Index (BMI)
4.2 Developing Scoring Systems
4.2.1 Gold Standards and Proxies
4.2.2 Scores vs. Rankings
4.2.3 Recognizing Good Scoring Functions
4.3 Z-scores and Normalization
4.4 Advanced Ranking Techniques
4.4.1 Elo Rankings
4.4.2 Merging Rankings
4.4.3 Digraph-based Rankings
4.4.4 PageRank
4.5 War Story: Clyde's Revenge
4.6 Arrow's Impossibility Theorem
4.7 War Story: Who's Bigger?
4.8 Chapter Notes
4.9 Exercises
5 Statistical Analysis
5.1 Statistical Distributions
5.1.1 The Binomial Distribution
5.1.2 The Normal Distribution
5.1.3 Implications of the Normal Distribution
5.1.4 Poisson Distribution
5.1.5 Power Law Distributions
5.2 Sampling from Distributions
5.2.1 Random Sampling beyond One Dimension
5.3 Statistical Significance
5.3.1 The Significance of Significance
5.3.2 The T-test: Comparing Population Means
5.3.3 The Kolmogorov-Smirnov Test
5.3.4 The Bonferroni Correction
5.3.5 False Discovery Rate
5.4 War Story: Discovering the Fountain of Youth?
5.5 Permutation Tests and P-values
5.5.1 Generating Random Permutations
5.5.2 DiMaggio's Hitting Streak
5.6 Bayesian Reasoning
5.7 Chapter Notes
5.8 Exercises
6 Visualizing Data
6.1 Exploratory Data Analysis
6.1.1 Confronting a New Data Set
6.1.2 Summary Statistics and Anscombe's Quartet
6.1.3 Visualization Tools
6.2 Developing a Visualization Aesthetic
6.2.1 Maximizing Data-Ink Ratio
6.2.2 Minimizing the Lie Factor
6.2.3 Minimizing Chartjunk
6.2.4 Proper Scaling and Labeling
6.2.5 Effective Use of Color and Shading
6.2.6 The Power of Repetition
6.3 Chart Types
6.3.1 Tabular Data
6.3.2 Dot and Line Plots
6.3.3 Scatter Plots
6.3.4 Bar Plots and Pie Charts
6.3.5 Histograms
6.3.6 Data Maps
6.4 Great Visualizations
6.4.1 Marey's Train Schedule
6.4.2 Snow's Cholera Map
6.4.3 New York's Weather Year
6.5 Reading Graphs
6.5.1 The Obscured Distribution
6.5.2 Overinterpreting Variance
6.6 Interactive Visualization
6.7 War Story: TextMapping the World
6.8 Chapter Notes
6.9 Exercises
7 Mathematical Models
7.1 Philosophies of Modeling
7.1.1 Occam's Razor
7.1.2 Bias-Variance Trade-Offs
7.1.3 What Would Nate Silver Do?
7.2 A Taxonomy of Models
7.2.1 Linear vs. Non-Linear Models
7.2.2 Blackbox vs. Descriptive Models
7.2.3 First-Principle vs. Data-Driven Models
7.2.4 Stochastic vs. Deterministic Models
7.2.5 Flat vs. Hierarchical Models
7.3 Baseline Models
7.3.1 Baseline Models for Classification
7.3.2 Baseline Models for Value Prediction
7.4 Evaluating Models
7.4.1 Evaluating Classifiers
7.4.2 Receiver-Operator Characteristic (ROC) Curves
7.4.3 Evaluating Multiclass Systems
7.4.4 Evaluating Value Prediction Models
7.5 Evaluation Environments
7.5.1 Data Hygiene for Evaluation
7.5.2 Amplifying Small Evaluation Sets
7.6 War Story: 100% Accuracy
7.7 Simulation Models
7.8 War Story: Calculated Bets
7.9 Chapter Notes
7.10 Exercises
8 Linear Algebra
8.1 The Power of Linear Algebra
8.1.1 Interpreting Linear Algebraic Formulae
8.1.2 Geometry and Vectors
8.2 Visualizing Matrix Operations
8.2.1 Matrix Addition
8.2.2 Matrix Multiplication
8.2.3 Applications of Matrix Multiplication
8.2.4 Identity Matrices and Inversion
8.2.5 Matrix Inversion and Linear Systems
8.2.6 Matrix Rank
8.3 Factoring Matrices
8.3.1 Why Factor Feature Matrices?
8.3.2 LU Decomposition and Determinants
8.4 Eigenvalues and Eigenvectors
8.4.1 Properties of Eigenvalues
8.4.2 Computing Eigenvalues
8.5 Eigenvalue Decomposition
8.5.1 Singular Value Decomposition
8.5.2 Principal Components Analysis
8.6 War Story: The Human Factors
8.7 Chapter Notes
8.8 Exercises
9 Linear and Logistic Regression
9.1 Linear Regression
9.1.1 Linear Regression and Duality
9.1.2 Error in Linear Regression
9.1.3 Finding the Optimal Fit
9.2 Better Regression Models
9.2.1 Removing Outliers
9.2.2 Fitting Non-Linear Functions
9.2.3 Feature and Target Scaling
9.2.4 Dealing with Highly-Correlated Features
9.3 War Story: Taxi Deriver
9.4 Regression as Parameter Fitting
9.4.1 Convex Parameter Spaces
9.4.2 Gradient Descent Search
9.4.3 What is the Right Learning Rate?
9.4.4 Stochastic Gradient Descent
9.5 Simplifying Models through Regularization
9.5.1 Ridge Regression
9.5.2 LASSO Regression
9.5.3 Trade-Offs between Fit and Complexity
9.6 Classification and Logistic Regression
9.6.1 Regression for Classification
9.6.2 Decision Boundaries
9.6.3 Logistic Regression
9.7 Issues in Logistic Classification
9.7.1 Balanced Training Classes
9.7.2 Multi-Class Classification
9.7.3 Hierarchical Classification
9.7.4 Partition Functions and Multinomial Regression
9.8 Chapter Notes
9.9 Exercises
10 Distance and Network Methods
10.1 Measuring Distances
10.1.1 Distance Metrics
10.1.2 The L_k Distance Metric
10.1.3 Working in Higher Dimensions
10.1.4 Dimensional Egalitarianism
10.1.5 Points vs. Vectors
10.1.6 Distances between Probability Distributions
10.2 Nearest Neighbor Classification
10.2.1 Seeking Good Analogies
10.2.2 k-Nearest Neighbors
10.2.3 Finding Nearest Neighbors
10.2.4 Locality Sensitive Hashing
10.3 Graphs, Networks, and Distances
10.3.1 Weighted Graphs and Induced Networks
10.3.2 Talking About Graphs
10.3.3 Graph Theory
10.4 PageRank
10.5 Clustering
10.5.1 k-means Clustering
10.5.2 Agglomerative Clustering
10.5.3 Comparing Clusterings
10.5.4 Similarity Graphs and Cut-Based Clustering
10.6 War Story: Cluster Bombing
10.7 Chapter Notes
10.8 Exercises
11 Machine Learning
11.1 Naive Bayes
11.1.1 Formulation
11.1.2 Dealing with Zero Counts (Discounting)
11.2 Decision Tree Classifiers
11.2.1 Constructing Decision Trees
11.2.2 Realizing Exclusive Or
11.2.3 Ensembles of Decision Trees
11.3 Boosting and Ensemble Learning
11.3.1 Voting with Classifiers
11.3.2 Boosting Algorithms
11.4 Support Vector Machines
11.4.1 Linear SVMs
11.4.2 Non-linear SVMs
11.4.3 Kernels
11.5 Degrees of Supervision
11.5.1 Supervised Learning
11.5.2 Unsupervised Learning
11.5.3 Semi-supervised Learning
11.5.4 Feature Engineering
11.6 Deep Learning
11.6.1 Networks and Depth
11.6.2 Backpropagation
11.6.3 Word and Graph Embeddings
11.7 War Story: The Name Game
11.8 Chapter Notes
11.9 Exercises
12 Big Data: Achieving Scale
12.1 What is Big Data?
12.1.1 Big Data as Bad Data
12.1.2 The Three Vs
12.2 War Story: Infrastructure Matters
12.3 Algorithmics for Big Data
12.3.1 Big Oh Analysis
12.3.2 Hashing
12.3.3 Exploiting the Storage Hierarchy
12.3.4 Streaming and Single-Pass Algorithms
12.4 Filtering and Sampling
12.4.1 Deterministic Sampling Algorithms
12.4.2 Randomized and Stream Sampling
12.5 Parallelism
12.5.1 One, Two, Many
12.5.2 Data Parallelism
12.5.3 Grid Search
12.5.4 Cloud Computing Services
12.6 MapReduce
12.6.1 Map-Reduce Programming
12.6.2 MapReduce under the Hood
12.7 Societal and Ethical Implications
12.8 Chapter Notes
12.9 Exercises
13 Coda
13.1 Get a Job!
13.2 Go to Graduate School!
13.3 Professional Consulting Services
Bibliography
Index