Python Data Analysis: Perform data collection, data processing, wrangling, visualization, and model building using Python

Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you'll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines. Starting with the essential statistical and data analysis fundamentals using Python, you'll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You'll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you'll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, Principal Component Analysis (PCA), and clustering. In the concluding chapters, you'll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask. By the end of this data analysis book, you'll be equipped with the skills you need to prepare data for analysis and create meaningful data visualizations for forecasting values from data.

Author(s): Avinash Navlani, Armando Fandango, Ivan Idris
Edition: 3
Publisher: Packt Publishing
Year: 2021

Language: English
Pages: 463

Cover
Title Page
Copyright and Credits
About Packt
Contributors
Table of Contents
Preface
Section 1: Foundation for Data Analysis
Chapter 1: Getting Started with Python Libraries
Understanding data analysis
The standard process of data analysis
The KDD process
SEMMA 
CRISP-DM
Comparing data analysis and data science
The roles of data analysts and data scientists
The skillsets of data analysts and data scientists
Installing Python 3
Python installation and setup on Windows
Python installation and setup on Linux
Python installation and setup on Mac OS X with a GUI installer
Python installation and setup on Mac OS X with brew
Software used in this book
Using IPython as a shell
Reading manual pages
Where to find help and references to Python data analysis libraries
Using JupyterLab
Using Jupyter Notebooks
Advanced features of Jupyter Notebooks
Keyboard shortcuts
Installing other kernels
Running shell commands
Extensions for Notebook
Summary
Chapter 2: NumPy and pandas
Technical requirements
Understanding NumPy arrays
Array features
Selecting array elements
NumPy array numerical data types
dtype objects
Data type character codes
dtype constructors
dtype attributes
Manipulating array shapes
The stacking of NumPy arrays
Partitioning NumPy arrays
Changing the data type of NumPy arrays
Creating NumPy views and copies
Slicing NumPy arrays
Boolean and fancy indexing
Broadcasting arrays
Creating pandas DataFrames
Understanding pandas Series
Reading and querying the Quandl data
Describing pandas DataFrames
Grouping and joining pandas DataFrames
Working with missing values
Creating pivot tables
Dealing with dates
Summary
References
Chapter 3: Statistics
Technical requirements
Understanding attributes and their types
Types of attributes
Discrete and continuous attributes
Measuring central tendency
Mean
Mode
Median
Measuring dispersion
Skewness and kurtosis
Understanding relationships using covariance and correlation coefficients
Pearson's correlation coefficient
Spearman's rank correlation coefficient
Kendall's rank correlation coefficient
Central limit theorem
Collecting samples
Performing parametric tests
Performing non-parametric tests 
Summary
Chapter 4: Linear Algebra
Technical requirements
Fitting to polynomials with NumPy
Determinant
Finding the rank of a matrix
Matrix inverse using NumPy
Solving linear equations using NumPy
Decomposing a matrix using SVD
Eigenvectors and Eigenvalues using NumPy
Generating random numbers
Binomial distribution
Normal distribution
Testing normality of data using SciPy
Creating a masked array using the numpy.ma subpackage
Summary
Section 2: Exploratory Data Analysis and Data Cleaning
Chapter 5: Data Visualization
Technical requirements
Visualization using Matplotlib
Accessories for charts
Scatter plot
Line plot
Pie plot
Bar plot
Histogram plot
Bubble plot
pandas plotting
Advanced visualization using the Seaborn package
lm plots
Bar plots
Distribution plots
Box plots
KDE plots
Violin plots
Count plots
Joint plots
Heatmaps
Pair plots
Interactive visualization with Bokeh
Plotting a simple graph
Glyphs
Layouts
Nested layout using row and column layouts
Multiple plots
Interactions
Hide click policy
Mute click policy
Annotations
Hover tool
Widgets
Tab panel
Slider
Summary
Chapter 6: Retrieving, Processing, and Storing Data
Technical requirements
Reading and writing CSV files with NumPy
Reading and writing CSV files with pandas
Reading and writing data from Excel
Reading and writing data from JSON
Reading and writing data from HDF5
Reading and writing data from HTML tables
Reading and writing data from Parquet
Reading and writing data from a pickle pandas object
Lightweight access with sqlite3
Reading and writing data from MySQL
Inserting a whole DataFrame into the database
Reading and writing data from MongoDB
Reading and writing data from Cassandra
Reading and writing data from Redis
PonyORM
Summary
Chapter 7: Cleaning Messy Data
Technical requirements
Exploring data
Filtering data to weed out the noise
Column-wise filtration  
Row-wise filtration  
Handling missing values
Dropping missing values
Filling in a missing value
Handling outliers
Feature encoding techniques
One-hot encoding
Label encoding
Ordinal encoder
Feature scaling
Methods for feature scaling
Feature transformation
Feature splitting
Summary
Chapter 8: Signal Processing and Time Series
Technical requirements
The statsmodels modules
Moving averages
Window functions
Defining cointegration
STL decomposition
Autocorrelation
Autoregressive models
ARMA models
Generating periodic signals
Fourier analysis
Spectral analysis filtering
Summary
Section 3: Deep Dive into Machine Learning
Chapter 9: Supervised Learning - Regression Analysis
Technical requirements
Linear regression
Multiple linear regression
Understanding multicollinearity
Removing multicollinearity
Dummy variables
Developing a linear regression model
Evaluating regression model performance
R-squared
MSE
MAE
RMSE
Fitting polynomial regression
Regression models for classification
Logistic regression
Characteristics of the logistic regression model
Types of logistic regression algorithms
Advantages and disadvantages of logistic regression
Implementing logistic regression using scikit-learn
Summary
Chapter 10: Supervised Learning - Classification Techniques
Technical requirements
Classification
Naive Bayes classification
Decision tree classification
KNN classification
SVM classification
Terminology
Splitting training and testing sets
Holdout
K-fold cross-validation
Bootstrap method
Evaluating the classification model performance
Confusion matrix
Accuracy
Precision
Recall
F-measure
ROC curve and AUC
Summary
Chapter 11: Unsupervised Learning - PCA and Clustering
Technical requirements
Unsupervised learning
Reducing the dimensionality of data
PCA
Performing PCA
Clustering
Finding the number of clusters
The elbow method
The silhouette method
Partitioning data using k-means clustering
Hierarchical clustering
DBSCAN clustering
Spectral clustering
Evaluating clustering performance
Internal performance evaluation
The Davies-Bouldin index
The silhouette coefficient
External performance evaluation
The Rand score
The Jaccard score
F-Measure or F1-score
The Fowlkes-Mallows score
Summary
Section 4: NLP, Image Analytics, and Parallel Computing
Chapter 12: Analyzing Textual Data
Technical requirements
Installing NLTK and SpaCy
Text normalization
Tokenization
Removing stopwords
Stemming and lemmatization
POS tagging
Recognizing entities
Dependency parsing
Creating a word cloud
Bag of Words
TF-IDF
Sentiment analysis using text classification
Classification using BoW
Classification using TF-IDF
Text similarity
Jaccard similarity
Cosine similarity
Summary
Chapter 13: Analyzing Image Data
Technical requirements
Installing OpenCV
Understanding image data
Binary images
Grayscale images
Color images
Color models
Drawing on images
Writing on images
Resizing images
Flipping images
Changing the brightness
Blurring an image
Face detection
Summary
Chapter 14: Parallel Computing Using Dask
Parallel computing using Dask
Dask data types
Dask Arrays
Dask DataFrames
DataFrame Indexing
Filter data
Groupby
Converting a pandas DataFrame into a Dask DataFrame
Converting a Dask DataFrame into a pandas DataFrame
Dask Bags
Creating a Dask Bag using Python iterable items
Creating a Dask Bag using a text file
Storing a Dask Bag in a text file
Storing a Dask Bag in a DataFrame
Dask Delayed
Preprocessing data at scale
Feature scaling in Dask
Feature encoding in Dask
Machine learning at scale
Parallel computing using scikit-learn
Reimplementing ML algorithms for Dask
Logistic regression
Clustering
Summary
Other Books You May Enjoy
Index