Fully expanded and upgraded, the second edition of Python Data Science Essentials takes you through all you need to know to succeed in data science using Python. Get modern insight into the core of Python data, including the latest versions of Jupyter notebooks, NumPy, pandas and scikit-learn. Look beyond the fundamentals with beautiful data visualizations using Seaborn and ggplot, web development with Bottle, and even the new frontiers of deep learning with Theano and TensorFlow.
Dive into building your essential Python 3.5 data science toolbox, using a single-source approach that will allow you to work with Python 2.7 as well. Get to grips fast with data munging and preprocessing, and all the techniques you need to load, analyse, and process your data. Finally, get a complete overview of principal machine learning algorithms, graph analysis techniques, and all the visualization and deployment instruments that make it easier to present your results to an audience of both data science experts and business users.
Author(s): Luca Massaron; Alberto Boschetti
Edition: 2
Publisher: Packt Publishing
Year: 2016
Cover
Copyright
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Table of Contents
Preface
Chapter 1: First Steps
Introducing data science and Python
Installing Python
Python 2 or Python 3?
Step-by-step installation
The installation of packages
Package upgrades
Scientific distributions
Anaconda
Leveraging conda to install packages
Enthought Canopy
PythonXY
WinPython
Explaining virtual environments
conda for managing environments
A glance at the essential packages
NumPy
SciPy
pandas
Scikit-learn
Jupyter
Matplotlib
Statsmodels
Beautiful Soup
NetworkX
NLTK
Gensim
PyPy
XGBoost
Theano
Keras
Introducing Jupyter
Fast installation and first test usage
Jupyter magic commands
How Jupyter Notebooks can help data scientists
Alternatives to Jupyter
Datasets and code used in the book
Scikit-learn toy datasets
The MLdata.org public repository
LIBSVM data examples
Loading data directly from CSV or text files
Scikit-learn sample generators
Summary
Chapter 2: Data Munging
The data science process
Data loading and preprocessing with pandas
Fast and easy data loading
Dealing with problematic data
Dealing with big datasets
Accessing other data formats
Data preprocessing
Data selection
Working with categorical and text data
A special type of data – text
Scraping the Web with Beautiful Soup
Data processing with NumPy
NumPy's n-dimensional array
The basics of NumPy ndarray objects
Creating NumPy arrays
From lists to unidimensional arrays
Controlling the memory size
Heterogeneous lists
From lists to multidimensional arrays
Resizing arrays
Arrays derived from NumPy functions
Getting an array directly from a file
Extracting data from pandas
NumPy's fast operations and computations
Matrix operations
Slicing and indexing with NumPy arrays
Stacking NumPy arrays
Summary
Chapter 3: The Data Pipeline
Introducing EDA
Building new features
Dimensionality reduction
The covariance matrix
Principal Component Analysis (PCA)
PCA for big data – RandomizedPCA
Latent Factor Analysis (LFA)
Linear Discriminant Analysis (LDA)
Latent Semantical Analysis (LSA)
Independent Component Analysis (ICA)
Kernel PCA
T-SNE
Restricted Boltzmann Machine (RBM)
The detection and treatment of outliers
Univariate outlier detection
EllipticEnvelope
OneClassSVM
Validation metrics
Multilabel classification
Binary classification
Regression
Testing and validating
Cross-validation
Using cross-validation iterators
Sampling and bootstrapping
Hyperparameter optimization
Building custom scoring functions
Reducing the grid search runtime
Feature selection
Selection based on feature variance
Univariate selection
Recursive elimination
Stability and L1-based selection
Wrapping everything in a pipeline
Combining features together and chaining transformations
Building custom transformation functions
Summary
Chapter 4: Machine Learning
Preparing tools and datasets
Linear and logistic regression
Naive Bayes
K-Nearest Neighbors
Nonlinear algorithms
SVM for classification
SVM for regression
Tuning SVM
Ensemble strategies
Pasting by random samples
Bagging with weak classifiers
Random subspaces and random patches
Random Forests and Extra-Trees
Estimating probabilities from an ensemble
Sequences of models – AdaBoost
Gradient tree boosting (GTB)
XGBoost
Dealing with big data
Creating some big datasets as examples
Scalability with volume
Keeping up with velocity
Dealing with variety
An overview of Stochastic Gradient Descent (SGD)
Approaching deep learning
A peek at Natural Language Processing (NLP)
Word tokenization
Stemming
Word tagging
Named Entity Recognition (NER)
Stopwords
A complete data science example – text classification
An overview of unsupervised learning
Summary
Chapter 5: Social Network Analysis
Introduction to graph theory
Graph algorithms
Graph loading, dumping, and sampling
Summary
Chapter 6: Visualization, Insights, and Results
Introducing the basics of matplotlib
Curve plotting
Using panels
Scatterplots for relationships in data
Histograms
Bar graphs
Image visualization
Selected graphical examples with pandas
Boxplots and histograms
Scatterplots
Parallel coordinates
Wrapping up matplotlib's commands
Introducing Seaborn
Enhancing your EDA capabilities
Interactive visualizations with Bokeh
Advanced data-learning representations
Learning curves
Validation curves
Feature importance for RandomForests
GBT partial dependence plots
Creating a prediction server for ML-AAS
Summary
Appendix: Strengthen Your Python Foundations
Your learning list
Lists
Dictionaries
Defining functions
Classes, objects, and OOP
Exceptions
Iterators and generators
Conditionals
Comprehensions for lists and dictionaries
Learn by watching, reading, and doing
MOOCs
PyCon and PyData
Interactive Jupyter
Don't be shy, take a real challenge
Index