Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale


Use web scraping at scale to quickly turn the vast amount of freely available data on the web into a structured format. This book teaches you to write Python scripts that crawl websites at scale, scrape data from HTML and JavaScript-enabled pages, and convert it into structured formats such as CSV, Excel, or JSON, or load it into a SQL database of your choice.
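To make the "HTML in, structured data out" idea concrete, here is a minimal sketch using only Python's standard library `html.parser`. The HTML snippet and its class names are invented for illustration; the book itself works with the much richer Beautiful Soup and lxml APIs.

```python
# Minimal sketch: parse a small HTML snippet and emit structured JSON.
# Stdlib only (html.parser); the book uses Beautiful Soup and lxml,
# which offer far more convenient selection APIs.
import json
from html.parser import HTMLParser

# Hypothetical product-listing markup, purely for illustration.
HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">19.99</span></li>
</ul>
"""

class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []
        self._field = None  # which span class we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self.items.append({})          # start a new record
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]   # remember which field comes next

    def handle_data(self, data):
        if self._field and self.items:
            self.items[-1][self._field] = data.strip()
            self._field = None

parser = ProductParser()
parser.feed(HTML)
records = [{"name": i["name"], "price": float(i["price"])} for i in parser.items]
print(json.dumps(records))
```

The same records could just as easily be written to CSV with `csv.DictWriter` or inserted into a SQL table, which is exactly the pipeline shape the book builds out.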

This book goes beyond the basics of web scraping and covers advanced topics such as natural language processing (NLP) and text analytics to extract names of people, places, email addresses, contact details, and more from a page at production scale, using distributed big data techniques on Amazon Web Services (AWS)-based cloud infrastructure. It covers developing a robust data processing and ingestion pipeline on the Common Crawl corpus, a web crawl dataset containing petabytes of publicly available data and listed on AWS's Registry of Open Data.
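One of the simpler extraction tasks mentioned above, pulling email addresses and contact details out of raw page text, can be sketched with a plain regular expression before any NLP library enters the picture. The pattern and sample text below are illustrative assumptions, not the book's exact code.

```python
# Minimal sketch: extract email addresses from page text with a regex.
# Deliberately simple pattern; production crawlers must also handle
# obfuscation ("name [at] example.com"), unicode domains, and so on.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

page_text = "Contact sales@example.com or support@example.org for details."
emails = EMAIL_RE.findall(page_text)
print(emails)  # ['sales@example.com', 'support@example.org']
```

For named entities such as people and places, regexes stop being enough, which is where the book brings in spaCy-based named entity recognition.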

Getting Structured Data from the Internet also includes a step-by-step tutorial on deploying your own crawlers using a production web scraping framework (such as Scrapy) and dealing with real-world issues (such as solving CAPTCHAs, proxy IP rotation, and more). Code used in the book is provided to help you understand the concepts in practice and write your own web crawler to power your business ideas.
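One of those real-world issues, rotating the user agent between requests so a crawler is not trivially fingerprinted, can be sketched in a few lines. The pool contents and helper name below are illustrative assumptions; in Scrapy this is typically wired in through downloader middleware rather than hand-rolled.

```python
# Minimal sketch of user-agent rotation: draw a different user agent
# from a pool for each outgoing request. Pool entries are illustrative.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def request_headers() -> dict:
    """Return HTTP headers with a randomly chosen user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = request_headers()
```

Proxy IP rotation follows the same pattern, with the request routed through a randomly chosen proxy from a pool instead of (or in addition to) varying the headers.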


What You Will Learn

  • Understand web scraping, its applications and uses, and when to avoid scraping altogether by hitting publicly available REST API endpoints to get data directly
  • Develop a web scraper and crawler from scratch using the lxml and Beautiful Soup libraries, and learn about scraping from JavaScript-enabled pages using Selenium
  • Use AWS-based cloud computing with EC2, S3, Athena, SQS, and SNS to analyze, extract, and store useful insights from crawled pages
  • Use SQL on PostgreSQL running on Amazon Relational Database Service (RDS) and on SQLite, via SQLAlchemy
  • Review scikit-learn, Gensim, and spaCy to perform NLP tasks on scraped web pages, such as named entity recognition, topic clustering (K-means, agglomerative clustering), topic modeling (LDA, NMF, LSI), topic classification (naive Bayes, gradient boosting classifier), and text similarity (cosine distance-based nearest neighbors)
  • Handle web archival file formats and explore Common Crawl open data on AWS
  • Illustrate practical applications for web crawl data by building a similar-websites tool and a technology profiler similar to builtwith.com
  • Write scripts to create a backlinks database on a web scale similar to Ahrefs.com, Moz.com, Majestic.com, etc., for search engine optimization (SEO), competitor research, and determining website domain authority and ranking
  • Use web crawl data to build a news sentiment analysis system or alternative financial analysis covering stock market trading signals
  • Write a production-ready crawler in Python using Scrapy framework and deal with practical workarounds for Captchas, IP rotation, and more
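The text-similarity bullet above can be illustrated without any ML library: represent documents as term-count vectors and compare them by cosine similarity. This is a bare-bones sketch of the idea; the book uses scikit-learn's tf-idf vectorization and nearest-neighbor search for the same comparison at scale.

```python
# Minimal sketch of cosine similarity between two documents using
# plain term-count vectors (no tf-idf weighting, no stemming).
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the term-count vectors of a and b."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

s = cosine_similarity("web scraping with python", "scraping the web with python")
```

Identical documents score 1.0, documents with no shared terms score 0.0, and the example above lands in between because the two phrases share most of their vocabulary.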


Who This Book Is For

The primary audience is data analysts and scientists with little to no exposure to real-world data processing challenges. The secondary audience is experienced software developers doing web-heavy data processing who need a primer. The tertiary audience is business owners and startup founders who want to understand implementation well enough to direct their technical team effectively.

Author(s): Jay M. Patel
Publisher: Apress
Year: 2020

Language: English
Pages: 397
City: New York

Table of Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Introduction to Web Scraping
Who uses web scraping?
Marketing and lead generation
Search engines
On-site search and recommendation
Google Ads and other pay-per-click (PPC) keyword research tools
Search engine results page (SERP) scrapers
Search engine optimization (SEO)
Relevance
Trust and authority
Estimating traffic to a site
Vertical search engines for recruitment, real estate, and travel
Brand, competitor, and price monitoring
Social listening, public relations (PR) tools, and media contacts database
Historical news databases
Web technology database
Alternative financial datasets
Miscellaneous uses
Programmatically searching user comments in Reddit
Why is web scraping essential?
How to turn web scraping into full-fledged product
Summary
Chapter 2: Web Scraping in Python Using Beautiful Soup Library
What are web pages all about?
Styling with Cascading Style Sheets (CSS)
Scraping a web page with Beautiful Soup
find() and find_all()
Getting links from a Wikipedia page
Scrape an ecommerce store site
Profiling Beautiful Soup parsers
XPath
Profiling XPath-based lxml
Crawling an entire site
URL normalization
Robots.txt and crawl delay
Status codes and retries
Crawl depth and crawl order
Link importance
Advanced link crawler
Getting things “dynamic” with JavaScript
Variables and data types
Functions
Conditionals and loops
HTML DOM manipulation
AJAX
Scraping JavaScript with Selenium
Scraping the US FDA warning letters database
Scraping from XHR directly
Summary
Chapter 3: Introduction to Cloud Computing and Amazon Web Services (AWS)
What is cloud computing?
List of AWS products
How to interact with AWS
AWS Identity and Access Management (IAM)
Setting up an IAM user
Setting up custom IAM policy
Setting up a new IAM role
Amazon Simple Storage Service (S3)
Creating a bucket
Accessing S3 through SDKs
Cloud storage browser
Amazon EC2
EC2 server types
Spinning your first EC2 server
Communicating with your EC2 server using SSH
Transferring files using SFTP
Amazon Simple Notification Service (SNS) and Simple Queue Service (SQS)
Scraping the US FDA warning letters database on cloud
Summary
Chapter 4: Natural Language Processing (NLP) and Text Analytics
Regular expressions
Extract email addresses using regex
Re2 regex engine
Named entity recognition (NER)
Training SpaCy NER
Exploratory data analytics for NLP
Tokenization
Advanced tokenization, stemming, and lemmatization
Punctuation removal
Ngrams
Stop word removal
Method 1: Create an exclusion list
Method 2: Using statistical language modeling
Method 3: Corpus-specific stop words
Method 4: Using term frequency–inverse document frequency (tf-idf) vectorization
Topic modeling
Latent Dirichlet allocation (LDA)
Non-negative matrix factorization (NMF)
Latent semantic indexing (LSI)
Text clustering
Text classification
Packaging text classification models
Performance decay of text classifiers
Summary
Chapter 5: Relational Databases and SQL Language
Why do we need a relational database?
What is a relational database?
Data definition language (DDL)
Sample database schema for web scraping
SQLite
DBeaver
PostgreSQL
Setting up AWS RDS PostgreSQL
SQLAlchemy
Data manipulation language (DML) and Data Query Language (DQL)
Data insertion in SQLite
Inserting other tables
Full text searching in SQLite
Data insertion in PostgreSQL
Full text searching in PostgreSQL
Why do NoSQL databases exist?
Summary
Chapter 6: Introduction to Common Crawl Datasets
WARC file format
Common crawl index
WET file format
Website similarity
WAT file format
Web technology profiler
Backlinks database
Summary
Chapter 7: Web Crawl Processing on Big Data Scale
Domain ranking and authority using Amazon Athena
Batch querying for domain ranking and authority
Processing parquet files for a common crawl index
Parsing web pages at scale
Microdata, microformat, JSON-LD, and RDFa
Parsing news articles using newspaper3k
Revisiting sentiment analysis
Scraping media outlets and journalist data
Introduction to distributed computing
Rolling your own search engine
Summary
Chapter 8: Advanced Web Crawlers
Scrapy
Advanced crawling strategies
Ethics and legality of web scraping
Proxy IP and user-agent rotation
Cloudflare
CAPTCHA solving services
Summary
Index