Get to grips with pandas, a versatile and high-performance library for manipulating, processing, cleaning, and crunching datasets in Python
Key Features
• Perform efficient data analysis and manipulation tasks using pandas 1.x
• Implement pandas in different real-world domains with the help of step-by-step demonstrations
• Become well versed in using pandas as an effective data exploration tool
Book Description
pandas is a powerful and popular library synonymous with Python data science that makes data wrangling and visualization easy by enabling you to work efficiently with tabular data. This second edition will help you become well versed in the new features of pandas 1.x and enhance your data analysis skills for extracting significant insights and value from data.
Hands-On Data Analysis with Pandas will show you how to analyze your data, get started with machine learning, and work effectively with the Python libraries often used for data science, such as pandas, NumPy, matplotlib, seaborn, and scikit-learn. Using real-world datasets, the book shows you how to use the powerful pandas library to perform data wrangling to reshape, clean, and aggregate your data. As you advance, you'll learn how to conduct exploratory data analysis by calculating summary statistics and visualizing the data to find patterns. You'll also explore some applications of anomaly detection, regression, clustering, and classification using scikit-learn to make predictions based on past data.
By the end of this data analysis book, you'll be equipped with the skills you need to use pandas to ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple domains.
What you will learn
• Understand how data analysts and scientists gather and analyze data
• Perform data analysis and data wrangling using Python
• Combine, group, and aggregate data from multiple sources
• Create data visualizations with pandas, matplotlib, and seaborn
• Apply machine learning algorithms to identify patterns and make predictions
• Use Python data science libraries to analyze real-world datasets
• Solve common data representation and analysis problems using pandas
• Build Python scripts, modules, and packages for reusable analysis code
Who This Book Is For
This book is for data science beginners, data analysts, and Python developers who want to explore each stage of data analysis and scientific computing using a wide range of datasets. You'll also find this book useful if you are a data scientist looking to implement pandas in your machine learning workflow. A working knowledge of the Python programming language will help you understand the key concepts covered in this book.
Author(s): Stefanie Molin
Edition: 2
Publisher: Packt Publishing
Year: 2021
Language: English
Commentary: Vector PDF
Pages: 788
City: Birmingham, UK
Tags: Machine Learning; Data Analysis; Regression; Anomaly Detection; Python; Classification; Clustering; Data Visualization; Feature Engineering; Statistics; Hyperparameter Tuning; Finance; scikit-learn; Ensemble Learning; matplotlib; pandas; Jupyter; Data Wrangling; Seaborn; Bitcoin; Statistical Inference; Stock Valuation; Data Collection; ARIMA; Data Preprocessing; Data Exploration
Cover
Title Page
Copyright and Credits
Dedicated
Foreword to the Second Edition
Foreword to the First Edition
Contributors
Table of Contents
Preface
Section 1: Getting Started with Pandas
Chapter 1: Introduction to Data Analysis
Chapter materials
The fundamentals of data analysis
Data collection
Data wrangling
Exploratory data analysis
Drawing conclusions
Statistical foundations
Sampling
Descriptive statistics
Prediction and forecasting
Inferential statistics
Setting up a virtual environment
Virtual environments
Installing the required Python packages
Why pandas?
Jupyter Notebooks
Summary
Exercises
Further reading
Chapter 2: Working with Pandas DataFrames
Chapter materials
Pandas data structures
Series
Index
DataFrame
Creating a pandas DataFrame
From a Python object
From a file
From a database
From an API
Inspecting a DataFrame object
Examining the data
Describing and summarizing the data
Grabbing subsets of the data
Selecting columns
Slicing
Indexing
Filtering
Adding and removing data
Creating new data
Deleting unwanted data
Summary
Exercises
Further reading
Section 2: Using Pandas for Data Analysis
Chapter 3: Data Wrangling with Pandas
Chapter materials
Understanding data wrangling
Data cleaning
Data transformation
Data enrichment
Exploring an API to find and collect temperature data
Cleaning data
Renaming columns
Type conversion
Reordering, reindexing, and sorting data
Reshaping data
Transposing DataFrames
Pivoting DataFrames
Melting DataFrames
Handling duplicate, missing, or invalid data
Finding the problematic data
Mitigating the issues
Summary
Exercises
Further reading
Chapter 4: Aggregating Pandas DataFrames
Chapter materials
Performing database-style operations on DataFrames
Querying DataFrames
Merging DataFrames
Using DataFrame operations to enrich data
Arithmetic and statistics
Binning
Applying functions
Window calculations
Pipes
Aggregating data
Summarizing DataFrames
Aggregating by group
Pivot tables and crosstabs
Working with time series data
Time-based selection and filtering
Shifting for lagged data
Differenced data
Resampling
Merging time series
Summary
Exercises
Further reading
Chapter 5: Visualizing Data with Pandas and Matplotlib
Chapter materials
An introduction to matplotlib
The basics
Plot components
Additional options
Plotting with pandas
Evolution over time
Relationships between variables
Distributions
Counts and frequencies
The pandas.plotting module
Scatter matrices
Lag plots
Autocorrelation plots
Bootstrap plots
Summary
Exercises
Further reading
Chapter 6: Plotting with Seaborn and Customization Techniques
Chapter materials
Utilizing seaborn for advanced plotting
Categorical data
Correlations and heatmaps
Regression plots
Faceting
Formatting plots with matplotlib
Titles and labels
Legends
Formatting axes
Customizing visualizations
Adding reference lines
Shading regions
Annotations
Colors
Textures
Summary
Exercises
Further reading
Section 3: Applications – Real-World Analyses Using Pandas
Chapter 7: Financial Analysis – Bitcoin and the Stock Market
Chapter materials
Building a Python package
Package structure
Overview of the stock_analysis package
UML diagrams
Collecting financial data
The StockReader class
Collecting historical data from Yahoo! Finance
Exploratory data analysis
The Visualizer class family
Visualizing a stock
Visualizing multiple assets
Technical analysis of financial instruments
The StockAnalyzer class
The AssetGroupAnalyzer class
Comparing assets
Modeling performance using historical data
The StockModeler class
Time series decomposition
ARIMA
Linear regression with statsmodels
Comparing models
Summary
Exercises
Further reading
Chapter 8: Rule-Based Anomaly Detection
Chapter materials
Simulating login attempts
Assumptions
The login_attempt_simulator package
Simulating from the command line
Exploratory data analysis
Implementing rule-based anomaly detection
Percent difference
Tukey fence
Z-score
Evaluating performance
Summary
Exercises
Further reading
Section 4: Introduction to Machine Learning with Scikit-Learn
Chapter 9: Getting Started with Machine Learning in Python
Chapter materials
Overview of the machine learning landscape
Types of machine learning
Common tasks
Machine learning in Python
Exploratory data analysis
Red wine quality data
White and red wine chemical properties data
Planets and exoplanets data
Preprocessing data
Training and testing sets
Scaling and centering data
Encoding data
Imputing
Additional transformers
Building data pipelines
Clustering
k-means
Evaluating clustering results
Regression
Linear regression
Evaluating regression results
Classification
Logistic regression
Evaluating classification results
Summary
Exercises
Further reading
Chapter 10: Making Better Predictions – Optimizing Models
Chapter materials
Hyperparameter tuning with grid search
Feature engineering
Interaction terms and polynomial features
Dimensionality reduction
Feature unions
Feature importances
Ensemble methods
Random forest
Gradient boosting
Voting
Inspecting classification prediction confidence
Addressing class imbalance
Under-sampling
Over-sampling
Regularization
Summary
Exercises
Further reading
Chapter 11: Machine Learning Anomaly Detection
Chapter materials
Exploring the simulated login attempts data
Utilizing unsupervised methods of anomaly detection
Isolation forest
Local outlier factor
Comparing models
Implementing supervised anomaly detection
Baselining
Logistic regression
Incorporating a feedback loop with online learning
Creating the PartialFitPipeline subclass
Stochastic gradient descent classifier
Summary
Exercises
Further reading
Section 5: Additional Resources
Chapter 12: The Road Ahead
Data resources
Python packages
Searching for data
APIs
Websites
Practicing working with data
Python practice
Summary
Exercises
Further reading
Solutions
Appendix
About Packt
Other Books You May Enjoy
Index