The Pandas Workshop: A comprehensive guide to using Python for data analysis with real-world case studies

Learn the fundamentals of data science with Python by analyzing real datasets and solving problems using pandas

Key Features

  • Learn how to apply data retrieval, transformation, visualization, and modeling techniques using pandas
  • Unlock deeper insights from your data efficiently, whether it comes from databases, web sources, or elsewhere
  • Build your experience and confidence with hands-on exercises and activities

Book Description

The Pandas Workshop will teach you how to be more productive with data and generate real business insights to inform your decision-making. You will be guided through real-world data science problems and shown how to apply key techniques in the context of realistic examples and exercises. Engaging activities will then challenge you to apply your new skills in a way that prepares you for real data science projects.

You'll see how experienced data scientists tackle a wide range of problems using data analysis with pandas. Unlike other Python books, which focus on theory and spend too long on dry, technical explanations, this workshop is designed to get you writing clean code quickly and to build your understanding through hands-on practice. As you work through this Python pandas book, you'll tackle various real-world scenarios, such as using an air quality dataset to understand the pattern of nitrogen dioxide emissions in a city, and analyzing transportation data to improve bus services.

By the end of this data analytics book, you'll have the knowledge, skills, and confidence you need to solve your own challenging data science problems with pandas.

What you will learn

  • Access and load data from different sources using pandas
  • Work with a range of data types and structures to understand your data
  • Transform data to prepare it for analysis
  • Use Matplotlib for data visualization to create a variety of plots
  • Create data models to find relationships and test hypotheses
  • Manipulate time-series data to perform date-time calculations
  • Optimize your code to ensure more efficient business data analysis

Who this book is for

This data analysis book is for anyone with prior experience working with the Python programming language who wants to learn the fundamentals of data analysis with pandas. Previous knowledge of pandas is not necessary.

Table of Contents

  1. An Introduction to pandas
  2. Working with Data Structures
  3. Data I/O
  4. pandas Data Types
  5. Data Selection – DataFrames
  6. Data Selection – Series
  7. Data Exploration and Transformation
  8. Data Visualization
  9. Data Modeling – Preprocessing
  10. Data Modeling – Modeling Basics
  11. Data Modeling – Regression Modeling
  12. Using Time in pandas
  13. Exploring Time Series
  14. Applying pandas Data Processing for Case Studies

Author(s): Blaine Bateman, Saikat Basak, Thomas V. Joseph, William So
Publisher: Packt Publishing
Year: 2022

Language: English
Pages: 744

Cover
Title Page
Copyright and Credits
Contributors
Table of Contents
Preface
Part 1 – Introduction to pandas
Chapter 1: Introduction to pandas
Introduction to the world of pandas
Exploring the history and evolution of pandas
Components and applications of pandas
Understanding the basic concepts of pandas
The Series object
The DataFrame object
Working with local files
Reading a CSV file
Displaying a snapshot of the data
Writing data to a file
Data types in pandas
Data selection
Data transformation
Data visualization
Time series data
Code optimization
Utility functions
Exercise 1.02 – basic numerical operations with pandas
Data modeling
Exercise 1.03 – comparing data from two DataFrames
Activity 1.01 – comparing sales data for two stores
Summary
Chapter 2: Working with Data Structures
Introduction to data structures
The need for data structures
Data structures
Creating DataFrames in pandas
Exercise 2.01 – Creating a DataFrame
Indexes and columns
Exercise 2.02 – Reading DataFrames and manipulating the index
Working with columns
Series
The Series index
Exercise 2.03 – Series to DataFrames
Using time as the index
Exercise 2.04 – DataFrame indices
Activity 2.01 – Working with pandas data structures
Summary
Chapter 3: Data I/O
The world of data
Exploring data sources
Text files and binary files
Online data sources
Exercise 3.01 – reading data from web pages
Fundamental formats
Text data
Exercise 3.02 – text character encoding and data separators
Binary data
Databases – SQL data
sqlite3
Additional text formats
Working with JSON
Working with HTML/XML
Working with XML data
Working with Excel
SAS data
SPSS data
Stata data
HDF5 data
Manipulating SQL data
Exercise 3.03 – working with SQL
Choosing a format for a project
Activity 3.01 – using SQL data for pandas analytics
Summary
Chapter 4: pandas Data Types
Introducing pandas dtypes
Obtaining the underlying data types
Converting from one type into another
Exercise 4.01 – underlying data types and conversion
Missing data types
The missing alphabet soup
Nullable types
Exercise 4.02 – missing data and converting into non-nullable dtypes
Activity 4.01 – optimizing memory usage by converting into the appropriate dtypes
Subsetting by data types
Working with the dtype category
Working with dtype = datetime64[ns]
Working with dtype = timedelta64[ns]
Exercise 4.03 – working with text data using string methods
Selecting data in a DataFrame by its dtype
Summary
Part 2 – Working with Data
Chapter 5: Data Selection – DataFrames
Introduction to DataFrames
The need for data selection methods
Data selection in pandas DataFrames
The index and its forms
Exercise 5.01 – identifying the row and column indices in a dataset
Slicing and indexing methods
Exercise 5.02 – subsetting rows and columns
Using labels as the index and the pandas multi-index
Creating a multi-index from columns
Activity 5.01 – Creating a multi-index from columns
Bracket and dot notation
Bracket notation
Dot notation
Exercise 5.03 – integer row numbers versus labels
Using extended indexing
Type exceptions
Changing DataFrame values using bracket or dot notation
Exercise 5.04 – selecting data using bracket and dot notation
Summary
Chapter 6: Data Selection – Series
Introduction to pandas Series
The Series index
Data selection in a pandas Series
Brackets, dots, Series.loc, and Series.iloc
Exercise 6.01 – basic Series data selection
Preparing Series from DataFrames and vice versa
Exercise 6.02 – using a Series index to select values
Activity 6.01 – Series data selection
Understanding the differences between base Python and pandas data selection
Lists versus Series access
DataFrames versus dictionary access
Activity 6.02 – DataFrame data selection
Summary
Chapter 7: Data Exploration and Transformation
Introduction to data transformation
Dealing with messy data
Working on data without column headers
Multiple values in one column
Duplicate observations in both rows and columns
Exercise 7.01 – working with messy addresses
Multiple variables stored in one column
Multiple DataFrames with identical structures
Exercise 7.02 – storing sales by demographics
Dealing with missing data
What is missing data?
Strategies for missing data
Summarizing data
Grouping and aggregation
Exploring pivot tables
Activity 7.01 – data analysis using pivot tables
Summary
Chapter 8: Understanding Data Visualization
Introduction to data visualization
Understanding the basics of pandas visualization
Exercise 8.01 – Building histograms for the Titanic dataset
Exploring Matplotlib
Visualizing data of different types
Visualizing numerical data
Visualizing categorical data
Visualizing statistical data
Exercise 8.02 – Boxplots for the Titanic dataset
Visualizing multiple data plots
Activity 8.01 – Using data visualization for exploratory data analysis
Summary
Part 3 – Data Modeling
Chapter 9: Data Modeling – Preprocessing
An introduction to data modeling
Exploring dependent and independent variables
Training, validation, and test splits of data
Exercise 9.1 – Creating training, validation, and test data
Avoiding information leakage
Complete model validation
Understanding data scaling and normalization
Different ways to Scale Data
Scaling data yourself
Min/max scaling
Standardization – addressing variance
Transforming back to real units
Exercise 9.02 – Scaling and normalizing data
Activity 9.1 – Data splitting, scaling, and modeling
Summary
Chapter 10: Data Modeling – Modeling Basics
Introduction to data modeling
Learning the modeling basics
Modeling tools
Pandas modeling tools
Predicting future values of time series
Exercise 10.1 – Smoothing data to discover patterns
Activity 10.1 – Normalizing and smoothing data
Summary
Chapter 11: Data Modeling – Regression Modeling
An introduction to regression modeling
Exploring regression modeling
Using linear models
Exercise 11.1 – Linear regression
Non-linear models
Model diagnostics
Comparing predicted and actual values
Using the Q-Q plot
Exercise 11.2 – Multiple regression and non-linear models
Activity 11.1 – Multiple regression with non-linear models
Summary
Part 4 – Additional Use Cases for pandas
Chapter 12: Using Time in pandas
Introduction to time series
What are datetimes?
Attributes of datetime objects
Exercise 12.1 – working with datetime
Creating and manipulating datetime objects/time series
Time periods in pandas
Information in pandas time-aware objects
Exercise 12.2 – math with datetimes
Timestamp formats
Activity 12.1 – understanding power usage
Datetime math operations
Date ranges
Timedeltas, offsets, and differences
Date offsets
Exercise 12.3 – timedeltas and date offsets
Summary
Chapter 13: Exploring Time Series
The time series as an index
Time series periods/frequencies
Shifting, lagging, and converting frequency
Resampling, grouping, and aggregation by time
Using the resample method
Exercise 13.01 – Aggregating and resampling
Windowing operations with the rolling method
Activity 13.01 – Creating a time series model
Summary
Chapter 14: Applying pandas Data Processing for Case Studies
Introduction to the case studies and datasets
Recap of the preprocessing steps
Preprocessing the German climate data
Exercise 14.01 – preprocessing the German climate data
Exercise 14.02 – merging DataFrames and renaming variables
Exercise 14.03 – data interpolation and answering questions after data preprocessing
Exercise 14.04 – using data visualizations to answer questions
Exercise 14.05 – using data visualizations to answer questions
Exercise 14.06 – analyzing data on bus trajectories
Activity 14.01 – analyzing air quality data
Summary
Appendix
Index
Other Books You May Enjoy