Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, microservices, Docker, and AWS

Author(s): Michael Heydt
Year: 2018

Language: English

Cover
Copyright and Credits
Contributors
Packt Upsell
Table of Contents
Preface
Chapter 1: Getting Started with Scraping
Introduction
Setting up a Python development environment 
Getting ready
How to do it...
Scraping Python.org with Requests and Beautiful Soup
Getting ready...
How to do it...
How it works...
Scraping Python.org in urllib3 and Beautiful Soup
Getting ready...
How to do it...
How it works
There's more...
Scraping Python.org with Scrapy
Getting ready...
How to do it...
How it works
Scraping Python.org with Selenium and PhantomJS
Getting ready
How to do it...
How it works
There's more...
Chapter 2: Data Acquisition and Extraction
Introduction
How to parse websites and navigate the DOM using BeautifulSoup
Getting ready
How to do it...
How it works
There's more...
Searching the DOM with Beautiful Soup's find methods
Getting ready
How to do it...
Querying the DOM with XPath and lxml
Getting ready
How to do it...
How it works
There's more...
Querying data with XPath and CSS selectors
Getting ready
How to do it...
How it works
There's more...
Using Scrapy selectors
Getting ready
How to do it...
How it works
There's more...
Loading data in unicode / UTF-8
Getting ready
How to do it...
How it works
There's more...
Chapter 3: Processing Data
Introduction
Working with CSV and JSON data
Getting ready
How to do it
How it works
There's more...
Storing data using AWS S3
Getting ready
How to do it
How it works
There's more...
Storing data using MySQL
Getting ready
How to do it
How it works
There's more...
Storing data using PostgreSQL
Getting ready
How to do it
How it works
There's more...
Storing data in Elasticsearch
Getting ready
How to do it
How it works
There's more...
How to build robust ETL pipelines with AWS SQS
Getting ready
How to do it - posting messages to an AWS queue
How it works
How to do it - reading and processing messages
How it works
There's more...
Chapter 4: Working with Images, Audio, and other Assets
Introduction
Downloading media content from the web
Getting ready
How to do it
How it works
There's more...
Parsing a URL with urllib to get the filename
Getting ready
How to do it
How it works
There's more...
Determining the type of content for a URL 
Getting ready
How to do it
How it works
There's more...
Determining the file extension from a content type
Getting ready
How to do it
How it works
There's more...
Downloading and saving images to the local file system
How to do it
How it works
There's more...
Downloading and saving images to S3
Getting ready
How to do it
How it works
There's more...
Generating thumbnails for images
Getting ready
How to do it
How it works
Taking a screenshot of a website
Getting ready
How to do it
How it works
Taking a screenshot of a website with an external service
Getting ready
How to do it
How it works
There's more...
Performing OCR on an image with pytesseract
Getting ready
How to do it
How it works
There's more...
Creating a Video Thumbnail
Getting ready
How to do it
How it works
There's more...
Ripping an MP4 video to an MP3
Getting ready
How to do it
There's more...
Chapter 5: Scraping - Code of Conduct
Introduction
Scraping legality and scraping politely
Getting ready
How to do it
Respecting robots.txt
Getting ready
How to do it
How it works
There's more...
Crawling using the sitemap
Getting ready
How to do it
How it works
There's more...
Crawling with delays
Getting ready
How to do it
How it works
There's more...
Using identifiable user agents 
How to do it
How it works
There's more...
Setting the number of concurrent requests per domain
How it works
Using auto throttling
How to do it
How it works
There's more...
Using an HTTP cache for development
How to do it
How it works
There's more...
Chapter 6: Scraping Challenges and Solutions
Introduction
Retrying failed page downloads
How to do it
How it works
Supporting page redirects
How to do it
How it works
Waiting for content to be available in Selenium
How to do it
How it works
Limiting crawling to a single domain
How to do it
How it works
Processing infinitely scrolling pages
Getting ready
How to do it
How it works
There's more...
Controlling the depth of a crawl
How to do it
How it works
Controlling the length of a crawl
How to do it
How it works
Handling paginated websites
Getting ready
How to do it
How it works
There's more...
Handling forms and forms-based authorization
Getting ready
How to do it
How it works
There's more...
Handling basic authorization
How to do it
How it works
There's more...
Preventing bans by scraping via proxies
Getting ready
How to do it
How it works
Randomizing user agents
How to do it
Caching responses
How to do it
There's more...
Chapter 7: Text Wrangling and Analysis
Introduction
Installing NLTK
How to do it
Performing sentence splitting
How to do it
There's more...
Performing tokenization
How to do it
Performing stemming
How to do it
Performing lemmatization
How to do it
Determining and removing stop words
How to do it
There's more...
Calculating the frequency distributions of words
How to do it
There's more...
Identifying and removing rare words
How to do it
Removing punctuation marks
How to do it
There's more...
Piecing together n-grams
How to do it
There's more...
Scraping a job listing from StackOverflow 
Getting ready
How to do it
There's more...
Reading and cleaning the description in the job listing
Getting ready
How to do it...
Chapter 8: Searching, Mining and Visualizing Data
Introduction
Geocoding an IP address
Getting ready
How to do it
How to collect IP addresses of Wikipedia edits
Getting ready
How to do it
How it works
There's more...
Visualizing contributor location frequency on Wikipedia
How to do it
Creating a word cloud from a StackOverflow job listing
Getting ready
How to do it
Crawling links on Wikipedia
Getting ready
How to do it
How it works
There's more...
Visualizing page relationships on Wikipedia
Getting ready
How to do it
How it works
There's more...
Calculating degrees of separation
How to do it
How it works
There's more...
Chapter 9: Creating a Simple Data API
Introduction
Creating a REST API with Flask-RESTful
Getting ready
How to do it
How it works
There's more...
Integrating the REST API with scraping code
Getting ready
How to do it
Adding an API to find the skills for a job listing
Getting ready
How to do it
Storing data in Elasticsearch as the result of a scraping request
Getting ready
How to do it
How it works
There's more...
Checking Elasticsearch for a listing before scraping
How to do it
There's more...
Chapter 10: Creating Scraper Microservices with Docker
Introduction
Installing Docker
Getting ready
How to do it
Installing a RabbitMQ container from Docker Hub
Getting ready
How to do it
Running a Docker container (RabbitMQ)
Getting ready
How to do it
There's more...
Creating and running an Elasticsearch container
How to do it
Stopping/restarting a container and removing the image
How to do it
There's more...
Creating a generic microservice with Nameko
Getting ready
How to do it
How it works
There's more...
Creating a scraping microservice
How to do it
There's more...
Creating a scraper container
Getting ready
How to do it
How it works
Creating an API container
Getting ready
How to do it
There's more...
Composing and running the scraper locally with docker-compose
Getting ready
How to do it
There's more...
Chapter 11: Making the Scraper as a Service Real
Introduction
Creating and configuring an Elastic Cloud trial account
How to do it
Accessing the Elastic Cloud cluster with curl
How to do it
Connecting to the Elastic Cloud cluster with Python
Getting ready
How to do it
There's more...
Performing an Elasticsearch query with the Python API 
Getting ready
How to do it
There's more...
Using Elasticsearch to query for jobs with specific skills
Getting ready
How to do it
Modifying the API to search for jobs by skill
How to do it
How it works
There's more...
Storing configuration in the environment 
How to do it
Creating an AWS IAM user and a key pair for ECS
Getting ready
How to do it
Configuring Docker to authenticate with ECR
Getting ready
How to do it
Pushing containers into ECR
Getting ready
How to do it
Creating an ECS cluster
How to do it
Creating a task to run our containers
Getting ready
How to do it
How it works
Starting and accessing the containers in AWS
Getting ready
How to do it
There's more...
Other Books You May Enjoy
Index