Despite all the advancements in web APIs and interoperability, it’s inevitable that, at some point in your career, you will have to “scrape” content from a website that was not built with web services in mind. And, despite its sometimes less-than-stellar reputation, web scraping is usually an entirely legitimate activity: for example, capturing data from an old version of a website for insertion into a modern CMS.
This book, written by scraping expert Matthew Turland, covers web scraping techniques and topics that range from the simple to the exotic, using a variety of technologies and frameworks:
* Understanding HTTP requests
* The PHP HTTP streams wrapper
* cURL
* pecl_http
* PEAR::HTTP_Client
* Zend_Http_Client
* Building your own scraping library
* Using Tidy
* Analyzing code with the DOM, SimpleXML and XMLReader extensions
* CSS selector libraries
* PCRE pattern matching
* Tips and Tricks
* Multiprocessing / parallel processing
Author(s): Matthew Turland
Publisher: musketeers.me, LLC
Year: 2010
Language: English
Pages: 192
Credits......Page 14
Foreword......Page 18
Intended Audience......Page 21
Web Scraping Defined......Page 22
Applications of Web Scraping......Page 23
Topics Covered......Page 24
HTTP......Page 27
Requests......Page 28
GET Requests......Page 29
Anatomy of a URL......Page 30
Query Strings......Page 31
POST Requests......Page 32
Responses......Page 33
Cookies......Page 35
Referring URLs......Page 36
Persistent Connections......Page 37
User Agents......Page 38
Ranges......Page 39
Basic HTTP Authentication......Page 40
Digest HTTP Authentication......Page 41
Wrap-Up......Page 44
HTTP Streams Wrapper......Page 47
Simple Request and Response Handling......Page 48
Stream Contexts and POST Requests......Page 49
Error Handling......Page 51
HTTP Authentication......Page 52
Wrap-Up......Page 53
cURL Extension......Page 55
Contrasting GET and POST......Page 56
Handling Headers......Page 58
Debugging......Page 59
Cookies......Page 60
HTTP Authentication......Page 61
User Agents......Page 62
DNS Caching......Page 63
Request Pooling......Page 64
Wrap-Up......Page 66
pecl_http PECL Extension......Page 69
POST Requests......Page 70
Handling Headers......Page 72
Content Encoding......Page 74
Cookies......Page 75
HTTP Authentication......Page 76
Byte Ranges......Page 77
Request Pooling......Page 78
Wrap-Up......Page 79
PEAR::HTTP_Client......Page 81
Requests and Responses......Page 82
Juggling Data......Page 84
Wrangling Headers......Page 85
Using the Client......Page 86
Observing Requests......Page 87
Wrap-Up......Page 88
Basic Requests......Page 91
Responses......Page 92
Custom Headers......Page 93
Configuration......Page 94
Debugging......Page 95
Cookies......Page 96
User Agents......Page 97
Wrap-Up......Page 98
Sending Requests......Page 101
Parsing Responses......Page 103
Transfer Encoding......Page 104
Content Encoding......Page 105
Timing......Page 106
Validation......Page 109
Input......Page 110
Configuration......Page 111
Options......Page 112
Debugging......Page 113
Wrap-Up......Page 116
DOM Extension......Page 119
Loading Documents......Page 120
Tree Terminology......Page 121
Locating Nodes......Page 123
XPath and DOMXPath......Page 124
Absolute Addressing......Page 125
Addressing Attributes......Page 127
Conditions......Page 128
Resources......Page 129
Loading a Document......Page 133
Accessing Elements......Page 134
Accessing Attributes......Page 135
XPath......Page 137
Wrap-Up......Page 138
XMLReader Extension......Page 141
Loading a Document......Page 142
Iteration......Page 143
Elements and Attributes......Page 144
Wrap-Up......Page 147
Reason to Use Them......Page 149
Basics......Page 150
Basic Filters......Page 152
Attribute Filters......Page 154
Form Filters......Page 156
Zend_Dom_Query......Page 158
DOMQuery......Page 159
Wrap-Up......Page 160
PCRE Extension......Page 163
Pattern Basics......Page 164
Anchors......Page 165
Repetition and Quantifiers......Page 166
Subpatterns......Page 167
Matching......Page 168
Escape Sequences......Page 170
Modifiers......Page 173
Wrap-Up......Page 174
Batch Jobs......Page 177
Parallel Processing......Page 178
Forms......Page 179
Testing......Page 181
That's All Folks......Page 182
Legality of Web Scraping......Page 185
Multiprocessing......Page 189