Python Data Science Handbook: Essential Tools for Working with Data

Python is a first-class tool for many researchers, primarily because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the new edition of Python Data Science Handbook do you get them all: IPython, NumPy, pandas, Matplotlib, Scikit-Learn, and other related tools. Working scientists and data crunchers familiar with reading and writing Python code will find the second edition of this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.

With this handbook, you'll learn how:
• IPython and Jupyter provide computational environments for scientists using Python
• NumPy includes the ndarray for efficient storage and manipulation of dense data arrays
• Pandas contains the DataFrame for efficient storage and manipulation of labeled/columnar data
• Matplotlib includes capabilities for a flexible range of data visualizations
• Scikit-learn helps you build efficient and clean Python implementations of the most important and established machine learning algorithms

Author(s): Jake VanderPlas
Edition: 2
Publisher: O'Reilly Media
Year: 2023

Language: English
Commentary: Publisher's PDF
Pages: 588
City: Sebastopol, CA
Tags: Machine Learning; Decision Trees; Data Science; Classification; Principal Component Analysis; Support Vector Machines; Data Visualization; Feature Engineering; Hyperparameter Tuning; Linear Regression; scikit-learn; NumPy; matplotlib; pandas; Jupyter; Random Forest; Naïve Bayes; Seaborn; Manifold Learning

Cover
Copyright
Table of Contents
Preface
What Is Data Science?
Who Is This Book For?
Why Python?
Outline of the Book
Installation Considerations
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Part I. Jupyter: Beyond Normal Python
Chapter 1. Getting Started in IPython and Jupyter
Launching the IPython Shell
Launching the Jupyter Notebook
Help and Documentation in IPython
Accessing Documentation with ?
Accessing Source Code with ??
Exploring Modules with Tab Completion
Keyboard Shortcuts in the IPython Shell
Navigation Shortcuts
Text Entry Shortcuts
Command History Shortcuts
Miscellaneous Shortcuts
Chapter 2. Enhanced Interactive Features
IPython Magic Commands
Running External Code: %run
Timing Code Execution: %timeit
Help on Magic Functions: ?, %magic, and %lsmagic
Input and Output History
IPython’s In and Out Objects
Underscore Shortcuts and Previous Outputs
Suppressing Output
Related Magic Commands
IPython and Shell Commands
Quick Introduction to the Shell
Shell Commands in IPython
Passing Values to and from the Shell
Shell-Related Magic Commands
Chapter 3. Debugging and Profiling
Errors and Debugging
Controlling Exceptions: %xmode
Debugging: When Reading Tracebacks Is Not Enough
Profiling and Timing Code
Timing Code Snippets: %timeit and %time
Profiling Full Scripts: %prun
Line-by-Line Profiling with %lprun
Profiling Memory Use: %memit and %mprun
More IPython Resources
Web Resources
Books
Part II. Introduction to NumPy
Chapter 4. Understanding Data Types in Python
A Python Integer Is More Than Just an Integer
A Python List Is More Than Just a List
Fixed-Type Arrays in Python
Creating Arrays from Python Lists
Creating Arrays from Scratch
NumPy Standard Data Types
Chapter 5. The Basics of NumPy Arrays
NumPy Array Attributes
Array Indexing: Accessing Single Elements
Array Slicing: Accessing Subarrays
One-Dimensional Subarrays
Multidimensional Subarrays
Subarrays as No-Copy Views
Creating Copies of Arrays
Reshaping of Arrays
Array Concatenation and Splitting
Concatenation of Arrays
Splitting of Arrays
Chapter 6. Computation on NumPy Arrays: Universal Functions
The Slowness of Loops
Introducing Ufuncs
Exploring NumPy’s Ufuncs
Array Arithmetic
Absolute Value
Trigonometric Functions
Exponents and Logarithms
Specialized Ufuncs
Advanced Ufunc Features
Specifying Output
Aggregations
Outer Products
Ufuncs: Learning More
Chapter 7. Aggregations: min, max, and Everything in Between
Summing the Values in an Array
Minimum and Maximum
Multidimensional Aggregates
Other Aggregation Functions
Example: What Is the Average Height of US Presidents?
Chapter 8. Computation on Arrays: Broadcasting
Introducing Broadcasting
Rules of Broadcasting
Broadcasting Example 1
Broadcasting Example 2
Broadcasting Example 3
Broadcasting in Practice
Centering an Array
Plotting a Two-Dimensional Function
Chapter 9. Comparisons, Masks, and Boolean Logic
Example: Counting Rainy Days
Comparison Operators as Ufuncs
Working with Boolean Arrays
Counting Entries
Boolean Operators
Boolean Arrays as Masks
Using the Keywords and/or Versus the Operators &/|
Chapter 10. Fancy Indexing
Exploring Fancy Indexing
Combined Indexing
Example: Selecting Random Points
Modifying Values with Fancy Indexing
Example: Binning Data
Chapter 11. Sorting Arrays
Fast Sorting in NumPy: np.sort and np.argsort
Sorting Along Rows or Columns
Partial Sorts: Partitioning
Example: k-Nearest Neighbors
Chapter 12. Structured Data: NumPy’s Structured Arrays
Exploring Structured Array Creation
More Advanced Compound Types
Record Arrays: Structured Arrays with a Twist
On to Pandas
Part III. Data Manipulation with Pandas
Chapter 13. Introducing Pandas Objects
The Pandas Series Object
Series as Generalized NumPy Array
Series as Specialized Dictionary
Constructing Series Objects
The Pandas DataFrame Object
DataFrame as Generalized NumPy Array
DataFrame as Specialized Dictionary
Constructing DataFrame Objects
The Pandas Index Object
Index as Immutable Array
Index as Ordered Set
Chapter 14. Data Indexing and Selection
Data Selection in Series
Series as Dictionary
Series as One-Dimensional Array
Indexers: loc and iloc
Data Selection in DataFrames
DataFrame as Dictionary
DataFrame as Two-Dimensional Array
Additional Indexing Conventions
Chapter 15. Operating on Data in Pandas
Ufuncs: Index Preservation
Ufuncs: Index Alignment
Index Alignment in Series
Index Alignment in DataFrames
Ufuncs: Operations Between DataFrames and Series
Chapter 16. Handling Missing Data
Trade-offs in Missing Data Conventions
Missing Data in Pandas
None as a Sentinel Value
NaN: Missing Numerical Data
NaN and None in Pandas
Pandas Nullable Dtypes
Operating on Null Values
Detecting Null Values
Dropping Null Values
Filling Null Values
Chapter 17. Hierarchical Indexing
A Multiply Indexed Series
The Bad Way
The Better Way: The Pandas MultiIndex
MultiIndex as Extra Dimension
Methods of MultiIndex Creation
Explicit MultiIndex Constructors
MultiIndex Level Names
MultiIndex for Columns
Indexing and Slicing a MultiIndex
Multiply Indexed Series
Multiply Indexed DataFrames
Rearranging Multi-Indexes
Sorted and Unsorted Indices
Stacking and Unstacking Indices
Index Setting and Resetting
Chapter 18. Combining Datasets: concat and append
Recall: Concatenation of NumPy Arrays
Simple Concatenation with pd.concat
Duplicate Indices
Concatenation with Joins
The append Method
Chapter 19. Combining Datasets: merge and join
Relational Algebra
Categories of Joins
One-to-One Joins
Many-to-One Joins
Many-to-Many Joins
Specification of the Merge Key
The on Keyword
The left_on and right_on Keywords
The left_index and right_index Keywords
Specifying Set Arithmetic for Joins
Overlapping Column Names: The suffixes Keyword
Example: US States Data
Chapter 20. Aggregation and Grouping
Planets Data
Simple Aggregation in Pandas
groupby: Split, Apply, Combine
Split, Apply, Combine
The GroupBy Object
Aggregate, Filter, Transform, Apply
Specifying the Split Key
Grouping Example
Chapter 21. Pivot Tables
Motivating Pivot Tables
Pivot Tables by Hand
Pivot Table Syntax
Multilevel Pivot Tables
Additional Pivot Table Options
Example: Birthrate Data
Chapter 22. Vectorized String Operations
Introducing Pandas String Operations
Tables of Pandas String Methods
Methods Similar to Python String Methods
Methods Using Regular Expressions
Miscellaneous Methods
Example: Recipe Database
A Simple Recipe Recommender
Going Further with Recipes
Chapter 23. Working with Time Series
Dates and Times in Python
Native Python Dates and Times: datetime and dateutil
Typed Arrays of Times: NumPy’s datetime64
Dates and Times in Pandas: The Best of Both Worlds
Pandas Time Series: Indexing by Time
Pandas Time Series Data Structures
Regular Sequences: pd.date_range
Frequencies and Offsets
Resampling, Shifting, and Windowing
Resampling and Converting Frequencies
Time Shifts
Rolling Windows
Example: Visualizing Seattle Bicycle Counts
Visualizing the Data
Digging into the Data
Chapter 24. High-Performance Pandas: eval and query
Motivating query and eval: Compound Expressions
pandas.eval for Efficient Operations
DataFrame.eval for Column-Wise Operations
Assignment in DataFrame.eval
Local Variables in DataFrame.eval
The DataFrame.query Method
Performance: When to Use These Functions
Further Resources
Part IV. Visualization with Matplotlib
Chapter 25. General Matplotlib Tips
Importing Matplotlib
Setting Styles
show or No show? How to Display Your Plots
Plotting from a Script
Plotting from an IPython Shell
Plotting from a Jupyter Notebook
Saving Figures to File
Two Interfaces for the Price of One
Chapter 26. Simple Line Plots
Adjusting the Plot: Line Colors and Styles
Adjusting the Plot: Axes Limits
Labeling Plots
Matplotlib Gotchas
Chapter 27. Simple Scatter Plots
Scatter Plots with plt.plot
Scatter Plots with plt.scatter
plot Versus scatter: A Note on Efficiency
Visualizing Uncertainties
Basic Errorbars
Continuous Errors
Chapter 28. Density and Contour Plots
Visualizing a Three-Dimensional Function
Histograms, Binnings, and Density
Two-Dimensional Histograms and Binnings
plt.hist2d: Two-Dimensional Histogram
plt.hexbin: Hexagonal Binnings
Kernel Density Estimation
Chapter 29. Customizing Plot Legends
Choosing Elements for the Legend
Legend for Size of Points
Multiple Legends
Chapter 30. Customizing Colorbars
Customizing Colorbars
Choosing the Colormap
Color Limits and Extensions
Discrete Colorbars
Example: Handwritten Digits
Chapter 31. Multiple Subplots
plt.axes: Subplots by Hand
plt.subplot: Simple Grids of Subplots
plt.subplots: The Whole Grid in One Go
plt.GridSpec: More Complicated Arrangements
Chapter 32. Text and Annotation
Example: Effect of Holidays on US Births
Transforms and Text Position
Arrows and Annotation
Chapter 33. Customizing Ticks
Major and Minor Ticks
Hiding Ticks or Labels
Reducing or Increasing the Number of Ticks
Fancy Tick Formats
Summary of Formatters and Locators
Chapter 34. Customizing Matplotlib: Configurations and Stylesheets
Plot Customization by Hand
Changing the Defaults: rcParams
Stylesheets
Default Style
FiveThirtyEight Style
ggplot Style
Bayesian Methods for Hackers Style
Dark Background Style
Grayscale Style
Seaborn Style
Chapter 35. Three-Dimensional Plotting in Matplotlib
Three-Dimensional Points and Lines
Three-Dimensional Contour Plots
Wireframes and Surface Plots
Surface Triangulations
Example: Visualizing a Möbius Strip
Chapter 36. Visualization with Seaborn
Exploring Seaborn Plots
Histograms, KDE, and Densities
Pair Plots
Faceted Histograms
Categorical Plots
Joint Distributions
Bar Plots
Example: Exploring Marathon Finishing Times
Further Resources
Other Python Visualization Libraries
Part V. Machine Learning
Chapter 37. What Is Machine Learning?
Categories of Machine Learning
Qualitative Examples of Machine Learning Applications
Classification: Predicting Discrete Labels
Regression: Predicting Continuous Labels
Clustering: Inferring Labels on Unlabeled Data
Dimensionality Reduction: Inferring Structure of Unlabeled Data
Summary
Chapter 38. Introducing Scikit-Learn
Data Representation in Scikit-Learn
The Features Matrix
The Target Array
The Estimator API
Basics of the API
Supervised Learning Example: Simple Linear Regression
Supervised Learning Example: Iris Classification
Unsupervised Learning Example: Iris Dimensionality
Unsupervised Learning Example: Iris Clustering
Application: Exploring Handwritten Digits
Loading and Visualizing the Digits Data
Unsupervised Learning Example: Dimensionality Reduction
Classification on Digits
Summary
Chapter 39. Hyperparameters and Model Validation
Thinking About Model Validation
Model Validation the Wrong Way
Model Validation the Right Way: Holdout Sets
Model Validation via Cross-Validation
Selecting the Best Model
The Bias-Variance Trade-off
Validation Curves in Scikit-Learn
Learning Curves
Validation in Practice: Grid Search
Summary
Chapter 40. Feature Engineering
Categorical Features
Text Features
Image Features
Derived Features
Imputation of Missing Data
Feature Pipelines
Chapter 41. In Depth: Naive Bayes Classification
Bayesian Classification
Gaussian Naive Bayes
Multinomial Naive Bayes
Example: Classifying Text
When to Use Naive Bayes
Chapter 42. In Depth: Linear Regression
Simple Linear Regression
Basis Function Regression
Polynomial Basis Functions
Gaussian Basis Functions
Regularization
Ridge Regression (L2 Regularization)
Lasso Regression (L1 Regularization)
Example: Predicting Bicycle Traffic
Chapter 43. In Depth: Support Vector Machines
Motivating Support Vector Machines
Support Vector Machines: Maximizing the Margin
Fitting a Support Vector Machine
Beyond Linear Boundaries: Kernel SVM
Tuning the SVM: Softening Margins
Example: Face Recognition
Summary
Chapter 44. In Depth: Decision Trees and Random Forests
Motivating Random Forests: Decision Trees
Creating a Decision Tree
Decision Trees and Overfitting
Ensembles of Estimators: Random Forests
Random Forest Regression
Example: Random Forest for Classifying Digits
Summary
Chapter 45. In Depth: Principal Component Analysis
Introducing Principal Component Analysis
PCA as Dimensionality Reduction
PCA for Visualization: Handwritten Digits
What Do the Components Mean?
Choosing the Number of Components
PCA as Noise Filtering
Example: Eigenfaces
Summary
Chapter 46. In Depth: Manifold Learning
Manifold Learning: “HELLO”
Multidimensional Scaling
MDS as Manifold Learning
Nonlinear Embeddings: Where MDS Fails
Nonlinear Manifolds: Locally Linear Embedding
Some Thoughts on Manifold Methods
Example: Isomap on Faces
Example: Visualizing Structure in Digits
Chapter 47. In Depth: k-Means Clustering
Introducing k-Means
Expectation–Maximization
Examples
Example 1: k-Means on Digits
Example 2: k-Means for Color Compression
Chapter 48. In Depth: Gaussian Mixture Models
Motivating Gaussian Mixtures: Weaknesses of k-Means
Generalizing E–M: Gaussian Mixture Models
Choosing the Covariance Type
Gaussian Mixture Models as Density Estimation
Example: GMMs for Generating New Data
Chapter 49. In Depth: Kernel Density Estimation
Motivating Kernel Density Estimation: Histograms
Kernel Density Estimation in Practice
Selecting the Bandwidth via Cross-Validation
Example: Not-so-Naive Bayes
Anatomy of a Custom Estimator
Using Our Custom Estimator
Chapter 50. Application: A Face Detection Pipeline
HOG Features
HOG in Action: A Simple Face Detector
1. Obtain a Set of Positive Training Samples
2. Obtain a Set of Negative Training Samples
3. Combine Sets and Extract HOG Features
4. Train a Support Vector Machine
5. Find Faces in a New Image
Caveats and Improvements
Further Machine Learning Resources
Index
About the Author
Colophon