Author(s): Ryan Mitchell
Publisher: O'Reilly Media
Year: 2015
Language: English
Pages: 256
Preface
What Is Web Scraping?
Why Web Scraping?
About This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
I. Building Scrapers
1. Your First Web Scraper
Connecting
An Introduction to BeautifulSoup
Installing BeautifulSoup
Running BeautifulSoup
Connecting Reliably
2. Advanced HTML Parsing
You Don’t Always Need a Hammer
Another Serving of BeautifulSoup
find() and findAll() with BeautifulSoup
Other BeautifulSoup Objects
Navigating Trees
Dealing with children and other descendants
Dealing with siblings
Dealing with your parents
Regular Expressions
Regular Expressions and BeautifulSoup
Accessing Attributes
Lambda Expressions
Beyond BeautifulSoup
3. Starting to Crawl
Traversing a Single Domain
Crawling an Entire Site
Collecting Data Across an Entire Site
Crawling Across the Internet
Crawling with Scrapy
4. Using APIs
How APIs Work
Common Conventions
Methods
Authentication
Responses
API Calls
Echo Nest
A Few Examples
Twitter
Getting Started
A Few Examples
Google APIs
Getting Started
A Few Examples
Parsing JSON
Bringing It All Back Home
More About APIs
5. Storing Data
Media Files
Storing Data to CSV
MySQL
Installing MySQL
Some Basic Commands
Integrating with Python
Database Techniques and Good Practice
“Six Degrees” in MySQL
Email
6. Reading Documents
Document Encoding
Text
Text Encoding and the Global Internet
A brief overview of encoding types
Encodings in action
CSV
Reading CSV Files
PDF
Microsoft Word and .docx
II. Advanced Scraping
7. Cleaning Your Dirty Data
Cleaning in Code
Data Normalization
Cleaning After the Fact
OpenRefine
Installation
Using OpenRefine
Filtering
Cleaning
8. Reading and Writing Natural Languages
Summarizing Data
Markov Models
Six Degrees of Wikipedia: Conclusion
Natural Language Toolkit
Installation and Setup
Statistical Analysis with NLTK
Lexicographical Analysis with NLTK
Additional Resources
9. Crawling Through Forms and Logins
Python Requests Library
Submitting a Basic Form
Radio Buttons, Checkboxes, and Other Inputs
Submitting Files and Images
Handling Logins and Cookies
HTTP Basic Access Authentication
Other Form Problems
10. Scraping JavaScript
A Brief Introduction to JavaScript
Common JavaScript Libraries
jQuery
Google Analytics
Google Maps
Ajax and Dynamic HTML
Executing JavaScript in Python with Selenium
Handling Redirects
11. Image Processing and Text Recognition
Overview of Libraries
Pillow
Tesseract
Installing Tesseract
NumPy
Processing Well-Formatted Text
Scraping Text from Images on Websites
Reading CAPTCHAs and Training Tesseract
Training Tesseract
Retrieving CAPTCHAs and Submitting Solutions
12. Avoiding Scraping Traps
A Note on Ethics
Looking Like a Human
Adjust Your Headers
Handling Cookies
Timing Is Everything
Common Form Security Features
Hidden Input Field Values
Avoiding Honeypots
The Human Checklist
13. Testing Your Website with Scrapers
An Introduction to Testing
What Are Unit Tests?
Python unittest
Testing Wikipedia
Testing with Selenium
Interacting with the Site
Drag and drop
Taking screenshots
Unittest or Selenium?
14. Scraping Remotely
Why Use Remote Servers?
Avoiding IP Address Blocking
Portability and Extensibility
Tor
PySocks
Remote Hosting
Running from a Website Hosting Account
Running from the Cloud
Additional Resources
Moving Forward
A. Python at a Glance
Installation and “Hello, World!”
B. The Internet at a Glance
C. The Legalities and Ethics of Web Scraping
Trademarks, Copyrights, Patents, Oh My!
Copyright Law
Trespass to Chattels
The Computer Fraud and Abuse Act
robots.txt and Terms of Service
Three Web Scrapers
eBay versus Bidder’s Edge and Trespass to Chattels
United States v. Auernheimer and The Computer Fraud and Abuse Act
Field v. Google: Copyright and robots.txt
Index