Learn how to apply powerful data analysis techniques with popular open source Python modules

About This Book
* Find, manipulate, and analyze your data using the Python 3.5 libraries
* Perform advanced, high-performance linear algebra and mathematical calculations with clean and efficient Python code
* An easy-to-follow guide with realistic examples that are frequently used in real-world data analysis projects

Who This Book Is For
This book is for programmers, scientists, and engineers who have a working knowledge of Python and know the basics of data science. It is for those who wish to learn different data analysis methods using Python 3.5 and its libraries. This book contains all the basic ingredients you need to become an expert data analyst.

What You Will Learn
* Install open source Python modules like NumPy, SciPy, Pandas, statsmodels, scikit-learn, theano, keras, and tensorflow on various platforms
* Prepare and clean your data, and use it for exploratory analysis
* Manipulate your data with Pandas
* Retrieve and store your data from RDBMS, NoSQL, and distributed filesystems such as HDFS and HDF5
* Visualize your data with open source libraries such as matplotlib, bokeh, and plotly
* Learn about various machine learning methods such as supervised, unsupervised, probabilistic, and Bayesian
* Understand signal processing and time-series data analysis
* Get to grips with graph processing, deep learning, and ensembles

In Detail
Data analysis lets you make sense of heaps of data. Python, with its strong set of libraries, is a popular language used today to conduct various data analysis, machine learning, and visualization tasks.

With this book, you will learn about data analysis with Python in the broadest sense possible, covering everything from data retrieval, cleaning, manipulation, visualization, and storage to complex analysis and modeling. It focuses on a plethora of open source Python modules such as NumPy, SciPy, matplotlib, pandas, IPython, Cython, scikit-learn, and NLTK. In later chapters, the book covers topics such as data visualization, signal processing and time-series analysis, databases, predictive analytics, and machine learning. This book will turn you into an ace data analyst in no time.

Author(s): Armando Fandango
Edition: 2
Year: 2017
Language: English
Pages: 409
Python Data Analysis - Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Getting Started with Python Libraries
Installing Python 3
Installing data analysis libraries
On Linux or Mac OS X
On Windows
Using IPython as a shell
Reading manual pages
Jupyter Notebook
NumPy arrays
A simple application
Where to find help and references
Listing modules inside the Python libraries
Visualizing data using Matplotlib
Summary
2. NumPy Arrays
The NumPy array object
Advantages of NumPy arrays
Creating a multidimensional array
Selecting NumPy array elements
NumPy numerical types
Data type objects
Character codes
The dtype constructors
The dtype attributes
One-dimensional slicing and indexing
Manipulating array shapes
Stacking arrays
Splitting NumPy arrays
NumPy array attributes
Converting arrays
Creating array views and copies
Fancy indexing
Indexing with a list of locations
Indexing NumPy arrays with Booleans
Broadcasting NumPy arrays
Summary
References
3. The Pandas Primer
Installing and exploring Pandas
The Pandas DataFrames
The Pandas Series
Querying data in Pandas
Statistics with Pandas DataFrames
Data aggregation with Pandas DataFrames
Concatenating and appending DataFrames
Joining DataFrames
Handling missing values
Dealing with dates
Pivot tables
Summary
References
4. Statistics and Linear Algebra
Basic descriptive statistics with NumPy
Linear algebra with NumPy
Inverting matrices with NumPy
Solving linear systems with NumPy
Finding eigenvalues and eigenvectors with NumPy
NumPy random numbers
Gambling with the binomial distribution
Sampling the normal distribution
Performing a normality test with SciPy
Creating a NumPy masked array
Disregarding negative and extreme values
Summary
5. Retrieving, Processing, and Storing Data
Writing CSV files with NumPy and Pandas
The binary .npy and pickle formats
Storing data with PyTables
Reading and writing Pandas DataFrames to HDF5 stores
Reading and writing to Excel with Pandas
Using REST web services and JSON
Reading and writing JSON with Pandas
Parsing RSS and Atom feeds
Parsing HTML with Beautiful Soup
Summary
Reference
6. Data Visualization
The matplotlib subpackages
Basic matplotlib plots
Logarithmic plots
Scatter plots
Legends and annotations
Three-dimensional plots
Plotting in Pandas
Lag plots
Autocorrelation plots
Plot.ly
Summary
7. Signal Processing and Time Series
The statsmodels modules
Moving averages
Window functions
Defining cointegration
Autocorrelation
Autoregressive models
ARMA models
Generating periodic signals
Fourier analysis
Spectral analysis
Filtering
Summary
8. Working with Databases
Lightweight access with sqlite3
Accessing databases from Pandas
SQLAlchemy
Installing and setting up SQLAlchemy
Populating a database with SQLAlchemy
Querying the database with SQLAlchemy
Pony ORM
Dataset - databases for lazy people
PyMongo and MongoDB
Storing data in Redis
Storing data in memcache
Apache Cassandra
Summary
9. Analyzing Textual Data and Social Media
Installing NLTK
About NLTK
Filtering out stopwords, names, and numbers
The bag-of-words model
Analyzing word frequencies
Naive Bayes classification
Sentiment analysis
Creating word clouds
Social network analysis
Summary
10. Predictive Analytics and Machine Learning
Preprocessing
Classification with logistic regression
Classification with support vector machines
Regression with ElasticNetCV
Support vector regression
Clustering with affinity propagation
Mean shift
Genetic algorithms
Neural networks
Decision trees
Summary
11. Environments Outside the Python Ecosystem and Cloud Computing
Exchanging information with Matlab/Octave
Installing rpy2 package
Interfacing with R
Sending NumPy arrays to Java
Integrating SWIG and NumPy
Integrating Boost and Python
Using Fortran code through f2py
PythonAnywhere Cloud
Summary
12. Performance Tuning, Profiling, and Concurrency
Profiling the code
Installing Cython
Calling C code
Creating a process pool with multiprocessing
Speeding up embarrassingly parallel for loops with Joblib
Comparing Bottleneck to NumPy functions
Performing MapReduce with Jug
Installing MPI for Python
IPython Parallel
Summary
A. Key Concepts
B. Useful Functions
Matplotlib
NumPy
Pandas
Scikit-learn
SciPy
scipy.fftpack
scipy.signal
scipy.stats
C. Online Resources