Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

There's a wealth of data online, but sorting and gathering it by hand can be tedious and time-consuming. Rather than click through page after endless page, why not let bots do the work for you?

Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions. Michael Schrenk, a highly regarded webbot developer, teaches you how to develop fault-tolerant designs, how best to launch and schedule the work of your bots, and how to create Internet agents that:

  • Send email or SMS notifications to alert you to new information quickly
  • Search different data sources and combine the results on one page, making the data easier to interpret and analyze
  • Automate purchases, auction bids, and other online activities to save time

Sample projects for automating tasks like price monitoring and news aggregation will show you how to put the concepts you learn into practice.

This second edition of Webbots, Spiders, and Screen Scrapers includes tricks for dealing with sites that are resistant to crawling and scraping, writing stealthy webbots that mimic human search behavior, and using regular expressions to harvest specific data. As you discover the possibilities of web scraping, you'll see how webbots can save you precious time and give you much greater control over the data available on the Web.

Author(s): Michael Schrenk
Edition: Second Edition
Publisher: No Starch Press
Year: 2012

Language: English
Pages: 392

Introduction
What to Expect from This Book
About the Website
About the Code
Hardware
A Disclaimer (This Is Important)
PART I: Fundamental Concepts and Techniques
Uncovering the Internet’s True Potential
Webbot Developers Are in Demand
What’s in It for Business Leaders?
Final Thoughts
Inspiration from Browser Limitations
Webbots That Aggregate and Filter Information for Relevance
Webbots That Act on Your Behalf
Help Out a Busy Executive
Protect Intellectual Property
Create an Online Clipping Service
Allow Incompatible Systems to Communicate
Final Thoughts
3: Downloading Web Pages
Think About Files, Not Web Pages
Downloading Files with fopen() and fgets()
Downloading Files with file()
Basic Authentication
Agent Name Spoofing
LIB_http
Using LIB_http
Learning More About HTTP Headers
Final Thoughts
Content Is Mixed with Markup
Standard Parse Routines
Splitting a String at a Delimiter: split_string()
Parsing Text Between Delimiters: return_between()
Parsing a Data Set into an Array: parse_array()
Parsing Attribute Values: get_attribute()
Removing Unwanted Text: remove()
Detecting Whether a String Is Within Another String
Parsing Unformatted Text
Use Regular Expressions Sparingly
5: Advanced Parsing with Regular Expressions
PHP Regular Expressions Functions
Learning Patterns Through Examples
Matching Alpha Characters
Specifying Alternate Matches
Parsing Phone Numbers
Where to Go from Here
Disadvantages of Pattern Matching While Parsing Web Pages
Final Thoughts
6: Automating Form Submission
Reverse Engineering Form Interfaces
Form Handlers
Data Fields
Methods
Multipart Encoding
Cookies Aren’t Included in the Form, but Can Affect Operation
Analyzing a Form
Don’t Blow Your Cover
Avoid Form Errors
Organizing Data
Naming Conventions
Storing Data in Structured Files
Storing Text in a Database
Storing Images in a Database
Storing References to Image Files
Compressing Data
Removing Formatting
Thumbnailing Images
Final Thoughts
PART II: Projects
8: Price-Monitoring Webbots
The Target
Initialization and Downloading the Target
Further Exploration
9: Image-Capturing Webbots
Creating the Image-Capturing Webbot
Binary-Safe Download Routine
Directory Structure
The Main Script
Final Thoughts
Initializing the Webbot and Downloading the Target
Setting the Page Base
Running a Verification Loop
Generating Fully Resolved URLs
Displaying the Page Status
LIB_http_codes
Further Exploration
11: Search-Ranking Webbots
Description of a Search Result Page
How the Search-Ranking Webbot Works
Initializing Variables
Starting the Loop
Parsing the Search Results
Spidering Search Engines Is a Bad Idea
Further Exploration
12: Aggregation Webbots
Choosing Data Sources for Webbots
Familiarizing Yourself with RSS Feeds
Writing the Aggregation Webbot
Adding Filtering to Your Aggregation Webbot
Further Exploration
13: FTP Webbots
Example FTP Webbot
PHP and FTP
Further Exploration
14: Webbots That Read Email
Reading Mail from a POP3 Mail Server
Executing POP3 Commands with a Webbot
Email-Controlled Webbots
Email Interfaces
Email, Webbots, and Spam
Configuring PHP to Send Mail
Sending an Email with mail()
Writing a Webbot That Sends Email Notifications
Keeping Legitimate Mail out of Spam Filters
Sending HTML-Formatted Email
Using Returned Emails to Prune Access Lists
Writing Webbots That Send Text Messages
16: Converting a Website into a Function
Writing a Function Interface
Analyzing the Target Web Page
Using describe_zipcode()
Distributing Resources
Designing a Custom Lightweight “Web Service”
PART III: Advanced Technical Considerations
17: Spiders
How Spiders Work
Example Spider
LIB_simple_spider
harvest_links()
get_domain()
exclude_link()
Experimenting with the Spider
Save Links in a Database
Distribute Tasks Across Multiple Computers
Regulate Page Requests
18: Procurement Webbots and Snipers
Get Purchase Criteria
Make Purchase
Get Purchase Criteria
Synchronize Clocks
Further Exploration
Final Thoughts
19: Webbots and Cryptography
Encryption and PHP/CURL
A Quick Overview of Web Encryption
Final Thoughts
What Is Authentication?
Strengthening Authentication by Combining Techniques
Basic Authentication
Authentication with Cookie Sessions
Authentication with Query Sessions
Final Thoughts
How Cookies Work
PHP/CURL and Cookies
Purging Temporary Cookies
Managing Multiple Users’ Cookies
Further Exploration
22: Scheduling Webbots and Spiders
The Windows XP Task Scheduler
Scheduling a Webbot to Run Daily
Complex Schedules
The Windows 7 Task Scheduler
Non-calendar-based Triggers
Add Variety to Your Schedule
23: Scraping Difficult Websites with Browser Macros
Flash
Installing and Using iMacros
Creating Your First Macro
Other Uses
24: Hacking iMacros
Reasons for Not Using the iMacros Scripting Engine
Creating a Dynamic Macro
Launching iMacros Automatically
Further Exploration
25: Deployment and Scaling
One-to-Many Environment
Many-to-Many Environment
Inefficiencies at the Target
Forking Processes
Distributing the Task over Multiple Computers
Botnet Communication Methods
Further Exploration
PART IV: Larger Considerations
Why Design a Stealthy Webbot?
Log Files
Be Kind to Your Resources
Final Thoughts
What Is a Proxy?
Using Proxies to Become Anonymous
Using a Proxy Server
Types of Proxy Servers
Open Proxies
Tor
Commercial Proxies
Creating Your Own Proxy Service
28: Writing Fault-Tolerant Webbots
Adapting to Changes in URLs
Adapting to Changes in Page Content
Adapting to Changes in Forms
Adapting to Network Outages and Network Congestion
Error Handlers
Further Exploration
Optimizing Web Pages for Search Engine Spiders
Title Tags
Header Tags
JavaScript
XML
Lightweight Data Exchange
SOAP
REST
Final Thoughts
30: Killing Spiders
Create a Terms of Service Agreement
Use the robots.txt File
Selectively Allow Access to Specific Web Agents
Use Cookies, Encryption, JavaScript, and Redirection
Embed Text in Other Media
Create a Spider Trap
Final Thoughts
31: Keeping Webbots out of Trouble
It’s All About Respect
Don’t Be an Armchair Lawyer
Trespass to Chattels
Internet Law
Final Thoughts
Creating a Minimal PHP/CURL Session
Setting PHP/CURL Options
CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
CURLOPT_NOBODY and CURLOPT_HEADER
CURLOPT_HTTPHEADER
CURLOPT_POST and CURLOPT_POSTFIELDS
Executing the PHP/CURL Command
Viewing PHP/CURL Errors
Closing PHP/CURL Sessions
HTTP Codes
NNTP Codes
C: SMS Gateways
A Sampling of Text Message Email Addresses
Index