Learn the art of efficient web scraping and crawling with Python

About This Book
• Extract data from any source to perform real-time analytics.
• Full of techniques and examples to help you crawl websites and extract data within hours.
• A hands-on guide to web scraping and crawling with real-life problems and solutions.

Who This Book Is For
If you are a software developer, data scientist, NLP or machine-learning enthusiast, or just need to migrate your company's wiki from a legacy platform, then this book is for you. It is perfect for anyone who needs instant, effortless access to large amounts of semi-structured data.

What You Will Learn
• Understand HTML pages and write XPath expressions to extract the data you need
• Write Scrapy spiders in simple Python and run web crawls
• Push your data into any database, search engine, or analytics system
• Configure your spider to download files and images and to use proxies
• Create efficient pipelines that shape data into precisely the form you want
• Use Twisted's asynchronous API to process hundreds of items concurrently
• Make your crawler super-fast by learning how to tune Scrapy's performance
• Perform large-scale distributed crawls with scrapyd and Scrapinghub

In Detail
This book covers the long-awaited Scrapy v1.0, which empowers you to extract useful data from virtually any source with very little effort. It starts off by explaining the fundamentals of the Scrapy framework, followed by a thorough description of how to extract data from any source, clean it up, and shape it to your requirements using Python and third-party APIs. Next, you will be familiarised with the process of storing the scraped data in databases and search engines and performing real-time analytics on it with Spark Streaming. By the end of this book, you will have perfected the art of scraping data for your applications with ease.

Style and approach
This is a hands-on guide, with the first few chapters written as a tutorial, aiming to motivate you and get you started quickly. As the book progresses, more advanced features are explained with real-world examples that you can refer to while developing your own web applications.
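To give a flavour of the spiders described in the "What You Will Learn" list, here is a minimal sketch that combines two of those skills: selecting data with XPath and following links to crawl further pages. The target URL and the class names in the XPath expressions ("listing", "price") are hypothetical placeholders, not taken from the book, and the sketch uses the Scrapy 1.0-era selector API (extract_first).

import scrapy

class PropertySpider(scrapy.Spider):
    # A minimal spider: extract fields with XPath, then follow pagination.
    name = "property"
    # Hypothetical start page; substitute a site you are permitted to crawl.
    start_urls = ["http://example.com/properties/index.html"]

    def parse(self, response):
        # Yield one item per listing, with each field selected by XPath.
        for listing in response.xpath('//div[@class="listing"]'):
            yield {
                "title": listing.xpath('.//h2/text()').extract_first(),
                "price": listing.xpath('.//span[@class="price"]/text()').extract_first(),
            }
        # Follow the "next page" link, if present, to continue the crawl.
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Saved as property_spider.py, this can be run without a full project using scrapy runspider property_spider.py -o items.json, which writes the extracted items to a JSON feed.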
Author(s): Dimitris Kouzis-Loukas
Year: 2016
Language: English
Pages: 270
Learning Scrapy
Credits
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introducing Scrapy
Hello Scrapy
More reasons to love Scrapy
About this book: aim and usage
The importance of mastering automated data scraping
Developing robust, quality applications, and providing realistic schedules
Developing quality minimum viable products quickly
Scraping gives you scale; Google couldn't use forms
Discovering and integrating into your ecosystem
Being a good citizen in a world full of spiders
What Scrapy is not
Summary
2. Understanding HTML and XPath
HTML, the DOM tree representation, and XPath
The URL
The HTML document
The tree representation
What you see on the screen
Selecting HTML elements with XPath
Useful XPath expressions
Using Chrome to get XPath expressions
Examples of common tasks
Anticipating changes
Summary
3. Basic Crawling
Installing Scrapy
MacOS
Windows
Linux
Ubuntu or Debian Linux
Red Hat or CentOS Linux
From the latest source
Upgrading Scrapy
Vagrant: this book's official way to run examples
UR2IM – the fundamental scraping process
The URL
The request and the response
The Items
A Scrapy project
Defining items
Writing spiders
Populating an item
Saving to files
Cleaning up – item loaders and housekeeping fields
Creating contracts
Extracting more URLs
Two-direction crawling with a spider
Two-direction crawling with a CrawlSpider
Summary
4. From Scrapy to a Mobile App
Choosing a mobile application framework
Creating a database and a collection
Populating the database with Scrapy
Creating a mobile application
Creating a database access service
Setting up the user interface
Mapping data to the user interface
Mappings between database fields and user interface controls
Testing, sharing, and exporting your mobile app
Summary
5. Quick Spider Recipes
A spider that logs in
A spider that uses JSON APIs and AJAX pages
Passing arguments between responses
A 30-times faster property spider
A spider that crawls based on an Excel file
Summary
6. Deploying to Scrapinghub
Signing up, signing in, and starting a project
Deploying our spiders and scheduling runs
Accessing our items
Scheduling recurring crawls
Summary
7. Configuration and Management
Using Scrapy settings
Essential settings
Analysis
Logging
Stats
Telnet
Example 1 – using telnet
Performance
Stopping crawls early
HTTP caching and working offline
Example 2 – working offline by using the cache
Crawling style
Feeds
Downloading media
Other media
Example 3 – downloading images
Amazon Web Services
Using proxies and crawlers
Example 4 – using proxies and Crawlera's clever proxy
Further settings
Project-related settings
Extending Scrapy settings
Fine-tuning downloading
Autothrottle extension settings
MemoryUsage extension settings
Logging and debugging
Summary
8. Programming Scrapy
Scrapy is a Twisted application
Deferreds and deferred chains
Understanding Twisted and nonblocking I/O – a Python tale
Overview of Scrapy architecture
Example 1 – a very simple pipeline
Signals
Example 2 – an extension that measures throughput and latencies
Extending beyond middlewares
Summary
9. Pipeline Recipes
Using REST APIs
Using treq
A pipeline that writes to Elasticsearch
A pipeline that geocodes using the Google Geocoding API
Enabling geoindexing on Elasticsearch
Interfacing databases with standard Python clients
A pipeline that writes to MySQL
Interfacing services using Twisted-specific clients
A pipeline that reads/writes to Redis
Interfacing CPU-intensive, blocking, or legacy functionality
A pipeline that performs CPU-intensive or blocking operations
A pipeline that uses binaries or scripts
Summary
10. Understanding Scrapy's Performance
Scrapy's engine – an intuitive approach
Cascading queuing systems
Identifying the bottleneck
Scrapy's performance model
Getting component utilization using telnet
Our benchmark system
The standard performance model
Solving performance problems
Case #1 – saturated CPU
Case #2 – blocking code
Case #3 – "garbage" on the downloader
Case #4 – overflow due to many or large responses
Case #5 – overflow due to limited/excessive item concurrency
Case #6 – the downloader doesn't have enough to do
Troubleshooting flow
Summary
11. Distributed Crawling with Scrapyd and Real-Time Analytics
How does the title of a property affect the price?
Scrapyd
Overview of our distributed system
Changes to our spider and middleware
Sharded-index crawling
Batching crawl URLs
Getting start URLs from settings
Deploy your project to scrapyd servers
Creating our custom monitoring command
Calculating the shift with Apache Spark Streaming
Running a distributed crawl
System performance
The key take-away
Summary
A. Installing and troubleshooting prerequisite software
Installing prerequisites
The system
Installation in a nutshell
Installing on Linux
Installing on Windows or Mac
Install Vagrant
How to access the terminal
Install VirtualBox and Git
Ensure that VirtualBox supports 64-bit images
Enable ssh client for Windows
Download this book's code and set up the system
System setup and operations FAQ
What do I download and how much time does it take?
What should I do if Vagrant freezes?
How do I shut down/resume the VM quickly?
How do I fully reset the VM?
How do I resize the virtual machine?
How do I resolve any port conflicts?
On Linux using Docker natively
On Windows or Mac using a VM
How do I make it work behind a corporate proxy?
How do I connect with the Docker provider VM?
How much CPU/memory does each server use?
How can I see the size of Docker container images?
How can I reset the system if Vagrant doesn't respond?
There's a problem I can't work around, what can I do?
Index