The world around us is full of data that holds unique insights and valuable stories, and this book will help you uncover them. Whether you already work with data or want to learn more about its possibilities, the examples and techniques in this practical book will help you more easily clean, evaluate, and analyze data so that you can generate meaningful insights and compelling visualizations.
Complementing foundational concepts with expert advice, author Susan E. McGregor provides the resources you need to extract, evaluate, and analyze a wide variety of data sources and formats, along with the tools to communicate your findings effectively. This book delivers a methodical, jargon-free way for data practitioners at any level, from true novices to seasoned professionals, to harness the power of data.
• Use Python 3.8+ to read, write, and transform data from a variety of sources
• Understand and use programming basics in Python to wrangle data at scale
• Organize, document, and structure your code using best practices
• Collect data from structured data files, web pages, and APIs
• Perform basic statistical analyses to make meaning from datasets
• Visualize and present data in clear and compelling ways
Author(s): Susan E. McGregor
Edition: 1
Publisher: O'Reilly Media
Year: 2021
Language: English
Commentary: Vector PDF
Pages: 412
City: Sebastopol, CA
Tags: Data Analysis; Python; JSON; Data Cleaning; Refactoring; Data Wrangling; XML; Presentations; Data Quality; Data Augmentation
Cover
Copyright
Table of Contents
Preface
Who Should Read This Book?
Who Shouldn’t Read This Book?
What to Expect from This Volume
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Introduction to Data Wrangling and Data Quality
What Is “Data Wrangling”?
What Is “Data Quality”?
Data Integrity
Data “Fit”
Why Python?
Versatility
Accessibility
Readability
Community
Python Alternatives
Writing and “Running” Python
Working with Python on Your Own Device
Getting Started with the Command Line
Installing Python, Jupyter Notebook, and a Code Editor
Working with Python Online
Hello World!
Using Atom to Create a Standalone Python File
Using Jupyter to Create a New Python Notebook
Using Google Colab to Create a New Python Notebook
Adding the Code
In a Standalone File
In a Notebook
Running the Code
In a Standalone File
In a Notebook
Documenting, Saving, and Versioning Your Work
Documenting
Saving
Versioning
Conclusion
Chapter 2. Introduction to Python
The Programming “Parts of Speech”
Nouns ≈ Variables
Verbs ≈ Functions
Cooking with Custom Functions
Libraries: Borrowing Custom Functions from Other Coders
Taking Control: Loops and Conditionals
In the Loop
One Condition…
Understanding Errors
Syntax Snafus
Runtime Runaround
Logic Loss
Hitting the Road with Citi Bike Data
Starting with Pseudocode
Seeking Scale
Conclusion
Chapter 3. Understanding Data Quality
Assessing Data Fit
Validity
Reliability
Representativeness
Assessing Data Integrity
Necessary, but Not Sufficient
Important
Achievable
Improving Data Quality
Data Cleaning
Data Augmentation
Conclusion
Chapter 4. Working with File-Based and Feed-Based Data in Python
Structured Versus Unstructured Data
Working with Structured Data
File-Based, Table-Type Data—Take It to Delimit
Wrangling Table-Type Data with Python
Real-World Data Wrangling: Understanding Unemployment
XLSX, ODS, and All the Rest
Finally, Fixed-Width
Feed-Based Data—Web-Driven Live Updates
Wrangling Feed-Type Data with Python
Working with Unstructured Data
Image-Based Text: Accessing Data in PDFs
Wrangling PDFs with Python
Accessing PDF Tables with Tabula
Conclusion
Chapter 5. Accessing Web-Based Data
Accessing Online XML and JSON
Introducing APIs
Basic APIs: A Search Engine Example
Specialized APIs: Adding Basic Authentication
Getting a FRED API Key
Using Your API Key to Request Data
Reading API Documentation
Protecting Your API Key When Using Python
Creating Your “Credentials” File
Using Your Credentials in a Separate Script
Getting Started with .gitignore
Specialized APIs: Working with OAuth
Applying for a Twitter Developer Account
Creating Your Twitter “App” and Credentials
Encoding Your API Key and Secret
Requesting an Access Token and Data from the Twitter API
API Ethics
Web Scraping: The Data Source of Last Resort
Carefully Scraping the MTA
Using Browser Inspection Tools
The Python Web Scraping Solution: Beautiful Soup
Conclusion
Chapter 6. Assessing Data Quality
The Pandemic and the PPP
Assessing Data Integrity
Is It of Known Pedigree?
Is It Timely?
Is It Complete?
Is It Well-Annotated?
Is It High Volume?
Is It Consistent?
Is It Multivariate?
Is It Atomic?
Is It Clear?
Is It Dimensionally Structured?
Assessing Data Fit
Validity
Reliability
Representativeness
Conclusion
Chapter 7. Cleaning, Transforming, and Augmenting Data
Selecting a Subset of Citi Bike Data
A Simple Split
Regular Expressions: Supercharged String Matching
Making a Date
De-crufting Data Files
Decrypting Excel Dates
Generating True CSVs from Fixed-Width Data
Correcting for Spelling Inconsistencies
The Circuitous Path to “Simple” Solutions
Gotchas That Will Get Ya!
Augmenting Your Data
Conclusion
Chapter 8. Structuring and Refactoring Your Code
Revisiting Custom Functions
Will You Use It More Than Once?
Is It Ugly and Confusing?
Do You Just Really Hate the Default Functionality?
Understanding Scope
Defining the Parameters for Function “Ingredients”
What Are Your Options?
Getting Into Arguments?
Return Values
Climbing the “Stack”
Refactoring for Fun and Profit
A Function for Identifying Weekdays
Metadata Without the Mess
Documenting Your Custom Scripts and Functions with pydoc
The Case for Command-Line Arguments
Where Scripts and Notebooks Diverge
Conclusion
Chapter 9. Introduction to Data Analysis
Context Is Everything
Same but Different
What’s Typical? Evaluating Central Tendency
What’s That Mean?
Embrace the Median
Think Different: Identifying Outliers
Visualization for Data Analysis
What’s Our Data’s Shape? Understanding Histograms
The Significance of Symmetry
Counting “Clusters”
The $2 Million Question
Proportional Response
Conclusion
Chapter 10. Presenting Your Data
Foundations for Visual Eloquence
Making Your Data Statement
Charts, Graphs, and Maps: Oh My!
Pie Charts
Bar and Column Charts
Line Charts
Scatter Charts
Maps
Elements of Eloquent Visuals
The “Finicky” Details Really Do Make a Difference
Trust Your Eyes (and the Experts)
Selecting Scales
Choosing Colors
Above All, Annotate!
From Basic to Beautiful: Customizing a Visualization with seaborn and matplotlib
Beyond the Basics
Conclusion
Chapter 11. Beyond Python
Additional Tools for Data Review
Spreadsheet Programs
OpenRefine
Additional Tools for Sharing and Presenting Data
Image Editing for JPGs, PNGs, and GIFs
Software for Editing SVGs and Other Vector Formats
Reflecting on Ethics
Conclusion
Appendix A. More Python Programming Resources
Official Python Documentation
Installing Python Resources
Where to Look for Libraries
Keeping Your Tools Sharp
Where to Learn More
Appendix B. A Bit More About Git
You Run git push/pull and End Up in a Weird Text Editor
Your git push/pull Command Gets Rejected
Run git pull
Git Quick Reference
Appendix C. Finding Data
Data Repositories and APIs
Subject Matter Experts
FOIA/L Requests
Custom Data Collection
Appendix D. Resources for Visualization and Information Design
Foundational Books on Information Visualization
The Quick Reference You’ll Reach For
Sources of Inspiration
Index
About the Author
Colophon