For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all—IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other related tools.
Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.
With this handbook, you’ll learn how to use:
* IPython and Jupyter: provide computational environments for data scientists using Python
* NumPy: includes the ndarray for efficient storage and manipulation of dense data arrays in Python
* Pandas: features the DataFrame for efficient storage and manipulation of labeled/columnar data in Python
* Matplotlib: includes capabilities for a flexible range of data visualizations in Python
* Scikit-Learn: provides efficient and clean Python implementations of the most important and established machine learning algorithms
Author(s): Jake VanderPlas
Publisher: O'Reilly Media
Year: 2016
Language: English
Pages: 500
Tags: data analytics, analysis, math, statistics, analytics, big data
Copyright
Table of Contents
Preface
What Is Data Science?
Who Is This Book For?
Why Python?
Python 2 Versus Python 3
Outline of This Book
Using Code Examples
Installation Considerations
Conventions Used in This Book
O’Reilly Safari
How to Contact Us
Chapter 1. IPython: Beyond Normal Python
Shell or Notebook?
Launching the IPython Shell
Launching the Jupyter Notebook
Help and Documentation in IPython
Accessing Documentation with ?
Accessing Source Code with ??
Exploring Modules with Tab Completion
Keyboard Shortcuts in the IPython Shell
Navigation Shortcuts
Text Entry Shortcuts
Command History Shortcuts
Miscellaneous Shortcuts
IPython Magic Commands
Pasting Code Blocks: %paste and %cpaste
Running External Code: %run
Timing Code Execution: %timeit
Help on Magic Functions: ?, %magic, and %lsmagic
Input and Output History
IPython’s In and Out Objects
Underscore Shortcuts and Previous Outputs
Suppressing Output
Related Magic Commands
IPython and Shell Commands
Quick Introduction to the Shell
Shell Commands in IPython
Passing Values to and from the Shell
Shell-Related Magic Commands
Errors and Debugging
Controlling Exceptions: %xmode
Debugging: When Reading Tracebacks Is Not Enough
Profiling and Timing Code
Timing Code Snippets: %timeit and %time
Profiling Full Scripts: %prun
Line-by-Line Profiling with %lprun
Profiling Memory Use: %memit and %mprun
More IPython Resources
Web Resources
Books
Chapter 2. Introduction to NumPy
Understanding Data Types in Python
A Python Integer Is More Than Just an Integer
A Python List Is More Than Just a List
Fixed-Type Arrays in Python
Creating Arrays from Python Lists
Creating Arrays from Scratch
NumPy Standard Data Types
The Basics of NumPy Arrays
NumPy Array Attributes
Array Indexing: Accessing Single Elements
Array Slicing: Accessing Subarrays
Reshaping of Arrays
Array Concatenation and Splitting
Computation on NumPy Arrays: Universal Functions
The Slowness of Loops
Introducing UFuncs
Exploring NumPy’s UFuncs
Advanced Ufunc Features
Ufuncs: Learning More
Aggregations: Min, Max, and Everything in Between
Summing the Values in an Array
Minimum and Maximum
Example: What Is the Average Height of US Presidents?
Computation on Arrays: Broadcasting
Introducing Broadcasting
Rules of Broadcasting
Broadcasting in Practice
Comparisons, Masks, and Boolean Logic
Example: Counting Rainy Days
Comparison Operators as ufuncs
Working with Boolean Arrays
Boolean Arrays as Masks
Fancy Indexing
Exploring Fancy Indexing
Combined Indexing
Example: Selecting Random Points
Modifying Values with Fancy Indexing
Example: Binning Data
Sorting Arrays
Fast Sorting in NumPy: np.sort and np.argsort
Partial Sorts: Partitioning
Example: k-Nearest Neighbors
Structured Data: NumPy’s Structured Arrays
Creating Structured Arrays
More Advanced Compound Types
RecordArrays: Structured Arrays with a Twist
On to Pandas
Chapter 3. Data Manipulation with Pandas
Installing and Using Pandas
Introducing Pandas Objects
The Pandas Series Object
The Pandas DataFrame Object
The Pandas Index Object
Data Indexing and Selection
Data Selection in Series
Data Selection in DataFrame
Operating on Data in Pandas
Ufuncs: Index Preservation
UFuncs: Index Alignment
Ufuncs: Operations Between DataFrame and Series
Handling Missing Data
Trade-Offs in Missing Data Conventions
Missing Data in Pandas
Operating on Null Values
Hierarchical Indexing
A Multiply Indexed Series
Methods of MultiIndex Creation
Indexing and Slicing a MultiIndex
Rearranging Multi-Indices
Data Aggregations on Multi-Indices
Combining Datasets: Concat and Append
Recall: Concatenation of NumPy Arrays
Simple Concatenation with pd.concat
Combining Datasets: Merge and Join
Relational Algebra
Categories of Joins
Specification of the Merge Key
Specifying Set Arithmetic for Joins
Overlapping Column Names: The suffixes Keyword
Example: US States Data
Aggregation and Grouping
Planets Data
Simple Aggregation in Pandas
GroupBy: Split, Apply, Combine
Pivot Tables
Motivating Pivot Tables
Pivot Tables by Hand
Pivot Table Syntax
Example: Birthrate Data
Vectorized String Operations
Introducing Pandas String Operations
Tables of Pandas String Methods
Example: Recipe Database
Working with Time Series
Dates and Times in Python
Pandas Time Series: Indexing by Time
Pandas Time Series Data Structures
Frequencies and Offsets
Resampling, Shifting, and Windowing
Where to Learn More
Example: Visualizing Seattle Bicycle Counts
High-Performance Pandas: eval() and query()
Motivating query() and eval(): Compound Expressions
pandas.eval() for Efficient Operations
DataFrame.eval() for Column-Wise Operations
DataFrame.query() Method
Performance: When to Use These Functions
Further Resources
Chapter 4. Visualization with Matplotlib
General Matplotlib Tips
Importing matplotlib
Setting Styles
show() or No show()? How to Display Your Plots
Saving Figures to File
Two Interfaces for the Price of One
Simple Line Plots
Adjusting the Plot: Line Colors and Styles
Adjusting the Plot: Axes Limits
Labeling Plots
Simple Scatter Plots
Scatter Plots with plt.plot
Scatter Plots with plt.scatter
plot Versus scatter: A Note on Efficiency
Visualizing Errors
Basic Errorbars
Continuous Errors
Density and Contour Plots
Visualizing a Three-Dimensional Function
Histograms, Binnings, and Density
Two-Dimensional Histograms and Binnings
Customizing Plot Legends
Choosing Elements for the Legend
Legend for Size of Points
Multiple Legends
Customizing Colorbars
Customizing Colorbars
Example: Handwritten Digits
Multiple Subplots
plt.axes: Subplots by Hand
plt.subplot: Simple Grids of Subplots
plt.subplots: The Whole Grid in One Go
plt.GridSpec: More Complicated Arrangements
Text and Annotation
Example: Effect of Holidays on US Births
Transforms and Text Position
Arrows and Annotation
Customizing Ticks
Major and Minor Ticks
Hiding Ticks or Labels
Reducing or Increasing the Number of Ticks
Fancy Tick Formats
Summary of Formatters and Locators
Customizing Matplotlib: Configurations and Stylesheets
Plot Customization by Hand
Changing the Defaults: rcParams
Stylesheets
Three-Dimensional Plotting in Matplotlib
Three-Dimensional Points and Lines
Three-Dimensional Contour Plots
Wireframes and Surface Plots
Surface Triangulations
Geographic Data with Basemap
Map Projections
Drawing a Map Background
Plotting Data on Maps
Example: California Cities
Example: Surface Temperature Data
Visualization with Seaborn
Seaborn Versus Matplotlib
Exploring Seaborn Plots
Example: Exploring Marathon Finishing Times
Further Resources
Matplotlib Resources
Other Python Graphics Libraries
Chapter 5. Machine Learning
What Is Machine Learning?
Categories of Machine Learning
Qualitative Examples of Machine Learning Applications
Summary
Introducing Scikit-Learn
Data Representation in Scikit-Learn
Scikit-Learn’s Estimator API
Application: Exploring Handwritten Digits
Summary
Hyperparameters and Model Validation
Thinking About Model Validation
Selecting the Best Model
Learning Curves
Validation in Practice: Grid Search
Summary
Feature Engineering
Categorical Features
Text Features
Image Features
Derived Features
Imputation of Missing Data
Feature Pipelines
In Depth: Naive Bayes Classification
Bayesian Classification
Gaussian Naive Bayes
Multinomial Naive Bayes
When to Use Naive Bayes
In Depth: Linear Regression
Simple Linear Regression
Basis Function Regression
Regularization
Example: Predicting Bicycle Traffic
In-Depth: Support Vector Machines
Motivating Support Vector Machines
Support Vector Machines: Maximizing the Margin
Example: Face Recognition
Support Vector Machine Summary
In-Depth: Decision Trees and Random Forests
Motivating Random Forests: Decision Trees
Ensembles of Estimators: Random Forests
Random Forest Regression
Example: Random Forest for Classifying Digits
Summary of Random Forests
In Depth: Principal Component Analysis
Introducing Principal Component Analysis
PCA as Noise Filtering
Example: Eigenfaces
Principal Component Analysis Summary
In-Depth: Manifold Learning
Manifold Learning: “HELLO”
Multidimensional Scaling (MDS)
MDS as Manifold Learning
Nonlinear Embeddings: Where MDS Fails
Nonlinear Manifolds: Locally Linear Embedding
Some Thoughts on Manifold Methods
Example: Isomap on Faces
Example: Visualizing Structure in Digits
In Depth: k-Means Clustering
Introducing k-Means
k-Means Algorithm: Expectation–Maximization
Examples
In Depth: Gaussian Mixture Models
Motivating GMM: Weaknesses of k-Means
Generalizing E–M: Gaussian Mixture Models
GMM as Density Estimation
Example: GMM for Generating New Data
In-Depth: Kernel Density Estimation
Motivating KDE: Histograms
Kernel Density Estimation in Practice
Example: KDE on a Sphere
Example: Not-So-Naive Bayes
Application: A Face Detection Pipeline
HOG Features
HOG in Action: A Simple Face Detector
Caveats and Improvements
Further Machine Learning Resources
Machine Learning in Python
General Machine Learning
Index
About the Author
Colophon