Data Science Bookcamp: Five real-world Python projects

Learn data science with Python by building five real-world projects! Experiment with card game predictions, disease outbreak tracking, and more as you build a flexible and intuitive understanding of data science.

In Data Science Bookcamp you will learn:
• Techniques for computing and plotting probabilities
• Statistical analysis using SciPy
• How to organize datasets with clustering algorithms
• How to visualize complex multi-variable datasets
• How to train a decision tree machine learning algorithm

In Data Science Bookcamp you’ll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable datasets and thoroughly explained solutions help you lock in what you’ve learned, building your confidence and making you ready for an exciting new data science career.

About the technology
A data science project has a lot of moving parts, and it takes practice and skill to get all the code, algorithms, datasets, formats, and visualizations working together harmoniously. This unique book guides you through five realistic projects, including tracking disease outbreaks from news headlines, analyzing social networks, and finding relevant patterns in ad-click data.

About the book
Data Science Bookcamp doesn’t stop with surface-level theory and toy examples. As you work through each project, you’ll learn how to troubleshoot common problems like missing data, messy data, and algorithms that don’t quite fit the model you’re building. You’ll appreciate the detailed setup instructions and the fully explained solutions that highlight common failure points. In the end, you’ll be confident in your skills because you can see the results.

What’s inside
• Web scraping
• Organizing datasets with clustering algorithms
• Visualizing complex multi-variable datasets
• Training a decision tree machine learning algorithm

About the reader
For readers who know the basics of Python. No prior data science or machine learning skills required.

About the author
Leonard Apeltsin is the Head of Data Science at Anomaly, where his team applies advanced analytics to uncover healthcare fraud, waste, and abuse.
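
To give a flavor of the projects, here is a minimal sketch (not taken from the book) of the sample-space technique that case study 1 applies to coin flips and card games: enumerate every equally likely outcome, then count the outcomes matching the event of interest. It uses the family-with-four-children scenario named in section 1.2.1.

    from itertools import product

    # Sample space for four children: every possible boy/girl birth sequence.
    sample_space = set(product(['Boy', 'Girl'], repeat=4))

    # Event: exactly two of the four children are boys.
    event = {outcome for outcome in sample_space
             if outcome.count('Boy') == 2}

    # With equally likely outcomes, P(event) = |event| / |sample space|.
    probability = len(event) / len(sample_space)
    print(f'P(exactly two boys) = {probability}')  # 6/16 = 0.375

The later case studies lean on scikit-learn. A comparably small sketch (again an illustration rather than the book’s own code, with the classic Iris dataset standing in for the book’s data) shows the train/score workflow behind the decision tree material in case study 5:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Split a labeled dataset into training and held-out test portions.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit a decision tree, then measure accuracy on the unseen test data.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    print(f'Test accuracy: {model.score(X_test, y_test):.2f}')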

Author(s): Leonard Apeltsin
Edition: 1
Publisher: Manning Publications
Year: 2021

Language: English
Commentary: Vector PDF
Pages: 704
City: Shelter Island, NY
Tags: Machine Learning; Probabilistic Models; Natural Language Processing; Decision Trees; Data Science; Supervised Learning; Python; Clustering; Data Visualization; Statistics; Logistic Regression; scikit-learn; Web Scraping; NumPy; matplotlib; pandas; Graph Theory; NetworkX; Graph Algorithms; Geospatial Data; Probability Theory; Hypothesis Testing; Network Analysis; Statistical Inference; Text Processing; Markov Models; Cartopy; Elementary; Monte Carlo Simulations

Data Science Bookcamp
brief contents
contents
preface
acknowledgments
about this book
Who should read this book
How this book is organized
About the code
about the author
about the cover illustration
Case study 1—Finding the winning strategy in a card game
Section 1—Computing probabilities using Python
1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes
1.1.1 Analyzing a biased coin
1.2 Computing nontrivial probabilities
1.2.1 Problem 1: Analyzing a family with four children
1.2.2 Problem 2: Analyzing multiple die rolls
1.2.3 Problem 3: Computing die-roll probabilities using weighted sample spaces
1.3 Computing probabilities over interval ranges
1.3.1 Evaluating extremes using interval analysis
Summary
Section 2—Plotting probabilities using Matplotlib
2.1 Basic Matplotlib plots
2.2 Plotting coin-flip probabilities
2.2.1 Comparing multiple coin-flip probability distributions
Summary
Section 3—Running random simulations in NumPy
3.1 Simulating random coin flips and die rolls using NumPy
3.1.1 Analyzing biased coin flips
3.2 Computing confidence intervals using histograms and NumPy arrays
3.2.1 Binning similar points in histogram plots
3.2.2 Deriving probabilities from histograms
3.2.3 Shrinking the range of a high confidence interval
3.2.4 Computing histograms in NumPy
3.3 Using confidence intervals to analyze a biased deck of cards
3.4 Using permutations to shuffle cards
Summary
Section 4—Case study 1 solution
4.1 Predicting red cards in a shuffled deck
4.1.1 Estimating the probability of strategy success
4.2 Optimizing strategies using the sample space for a 10-card deck
Summary
Case study 2—Assessing online ad clicks for significance
Section 5—Basic probability and statistical analysis using SciPy
5.1 Exploring the relationships between data and probability using SciPy
5.2 Mean as a measure of centrality
5.2.1 Finding the mean of a probability distribution
5.3 Variance as a measure of dispersion
5.3.1 Finding the variance of a probability distribution
Summary
Section 6—Making predictions using the central limit theorem and SciPy
6.1 Manipulating the normal distribution using SciPy
6.1.1 Comparing two sampled normal curves
6.2 Determining the mean and variance of a population through random sampling
6.3 Making predictions using the mean and variance
6.3.1 Computing the area beneath a normal curve
6.3.2 Interpreting the computed probability
Summary
Section 7—Statistical hypothesis testing
7.1 Assessing the divergence between sample mean and population mean
7.2 Data dredging: Coming to false conclusions through oversampling
7.3 Bootstrapping with replacement: Testing a hypothesis when the population variance is unknown
7.4 Permutation testing: Comparing means of samples when the population parameters are unknown
Summary
Section 8—Analyzing tables using Pandas
8.1 Storing tables using basic Python
8.2 Exploring tables using Pandas
8.3 Retrieving table columns
8.4 Retrieving table rows
8.5 Modifying table rows and columns
8.6 Saving and loading table data
8.7 Visualizing tables using Seaborn
Summary
Section 9—Case study 2 solution
9.1 Processing the ad-click table in Pandas
9.2 Computing p-values from differences in means
9.3 Determining statistical significance
9.4 41 shades of blue: A real-life cautionary tale
Summary
Case study 3—Tracking disease outbreaks using news headlines
Section 10—Clustering data into groups
10.1 Using centrality to discover clusters
10.2 K-means: A clustering algorithm for grouping data into K central groups
10.2.1 K-means clustering using scikit-learn
10.2.2 Selecting the optimal K using the elbow method
10.3 Using density to discover clusters
10.4 DBSCAN: A clustering algorithm for grouping data based on spatial density
10.4.1 Comparing DBSCAN and K-means
10.4.2 Clustering based on non-Euclidean distance
10.5 Analyzing clusters using Pandas
Summary
Section 11—Geographic location visualization and analysis
11.1 The great-circle distance: A metric for computing the distance between two global points
11.2 Plotting maps using Cartopy
11.2.1 Manually installing GEOS and Cartopy
11.2.2 Utilizing the Conda package manager
11.2.3 Visualizing maps
11.3 Location tracking using GeoNamesCache
11.3.1 Accessing country information
11.3.2 Accessing city information
11.3.3 Limitations of the GeoNamesCache library
11.4 Matching location names in text
Summary
Section 12—Case study 3 solution
12.1 Extracting locations from headline data
12.2 Visualizing and clustering the extracted location data
12.3 Extracting insights from location clusters
Summary
Case study 4—Using online job postings to improve your data science resume
Section 13—Measuring text similarities
13.1 Simple text comparison
13.1.1 Exploring the Jaccard similarity
13.1.2 Replacing words with numeric values
13.2 Vectorizing texts using word counts
13.2.1 Using normalization to improve TF vector similarity
13.2.2 Using unit vector dot products to convert between relevance metrics
13.3 Matrix multiplication for efficient similarity calculation
13.3.1 Basic matrix operations
13.3.2 Computing all-by-all matrix similarities
13.4 Computational limits of matrix multiplication
Summary
Section 14—Dimension reduction of matrix data
14.1 Clustering 2D data in one dimension
14.1.1 Reducing dimensions using rotation
14.2 Dimension reduction using PCA and scikit-learn
14.3 Clustering 4D data in two dimensions
14.3.1 Limitations of PCA
14.4 Computing principal components without rotation
14.4.1 Extracting eigenvectors using power iteration
14.5 Efficient dimension reduction using SVD and scikit-learn
Summary
Section 15—NLP analysis of large text datasets
15.1 Loading online forum discussions using scikit-learn
15.2 Vectorizing documents using scikit-learn
15.3 Ranking words by both post frequency and count
15.3.1 Computing TFIDF vectors with scikit-learn
15.4 Computing similarities across large document datasets
15.5 Clustering texts by topic
15.5.1 Exploring a single text cluster
15.6 Visualizing text clusters
15.6.1 Using subplots to display multiple word clouds
Summary
Section 16—Extracting text from web pages
16.1 The structure of HTML documents
16.2 Parsing HTML using Beautiful Soup
16.3 Downloading and parsing online data
Summary
Section 17—Case study 4 solution
17.1 Extracting skill requirements from job posting data
17.1.1 Exploring the HTML for skill descriptions
17.2 Filtering jobs by relevance
17.3 Clustering skills in relevant job postings
17.3.1 Grouping the job skills into 15 clusters
17.3.2 Investigating the technical skill clusters
17.3.3 Investigating the soft-skill clusters
17.3.4 Exploring clusters at alternative values of K
17.3.5 Analyzing the 700 most relevant postings
17.4 Conclusion
Summary
Case study 5—Predicting future friendships from social network data
Section 18—An introduction to graph theory and network analysis
18.1 Using basic graph theory to rank websites by popularity
18.1.1 Analyzing web networks using NetworkX
18.2 Utilizing undirected graphs to optimize the travel time between towns
18.2.1 Modeling a complex network of towns and counties
18.2.2 Computing the fastest travel time between nodes
Summary
Section 19—Dynamic graph theory techniques for node ranking and social network analysis
19.1 Uncovering central nodes based on expected traffic in a network
19.1.1 Measuring centrality using traffic simulations
19.2 Computing travel probabilities using matrix multiplication
19.2.1 Deriving PageRank centrality from probability theory
19.2.2 Computing PageRank centrality using NetworkX
19.3 Community detection using Markov clustering
19.4 Uncovering friend groups in social networks
Summary
Section 20—Network-driven supervised machine learning
20.1 The basics of supervised machine learning
20.2 Measuring predicted label accuracy
20.2.1 Scikit-learn’s prediction measurement functions
20.3 Optimizing KNN performance
20.4 Running a grid search using scikit-learn
20.5 Limitations of the KNN algorithm
Summary
Section 21—Training linear classifiers with logistic regression
21.1 Linearly separating customers by size
21.2 Training a linear classifier
21.2.1 Improving perceptron performance through standardization
21.3 Improving linear classification with logistic regression
21.3.1 Running logistic regression on more than two features
21.4 Training linear classifiers using scikit-learn
21.4.1 Training multiclass linear models
21.5 Measuring feature importance with coefficients
21.6 Linear classifier limitations
Summary
Section 22—Training nonlinear classifiers with decision tree techniques
22.1 Automated learning of logical rules
22.1.1 Training a nested if/else model using two features
22.1.2 Deciding which feature to split on
22.1.3 Training if/else models with more than two features
22.2 Training decision tree classifiers using scikit-learn
22.2.1 Studying cancerous cells using feature importance
22.3 Decision tree classifier limitations
22.4 Improving performance using random forest classification
22.5 Training random forest classifiers using scikit-learn
Summary
Section 23—Case study 5 solution
23.1 Exploring the data
23.1.1 Examining the profiles
23.1.2 Exploring the experimental observations
23.1.3 Exploring the Friendships linkage table
23.2 Training a predictive model using network features
23.3 Adding profile features to the model
23.4 Optimizing performance across a steady set of features
23.5 Interpreting the trained model
23.5.1 Why are generalizable models so important?
Summary
index
Symbols
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
R
S
T
U
V
W
X
Y