Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

The power of this book is not so much in its code examples but rather in its ability to change your perspective. We are all aware that the Internet is a client-server topology, but what does that really mean? Reading the first few chapters gave me a whole new viewpoint of the Internet and what I could do with it, or to it. In the year since I first read it, I have stopped developing websites and now code web agents exclusively. It's amazing how many uses they fulfill.

The code in the book is basic and not fit for production (the author tells you this), but it is invaluable for teaching the theory and fundamentals of CURL. If you practice with the code and the provided website, you will soon be able to develop your own code library. Scale is also left to you to figure out. The obvious first step is a database and a NAS. Start small and use this book for the invaluable reference it is.

I really have to rate this book as one of my most influential reads of the last few years.

Author(s): Michael Schrenk
Edition: Annotated
Publisher: No Starch Press
Year: 2007

Language: English
Pages: 315

Webbots, Spiders, and Screen Scrapers
Table of Contents
Dedication
ACKNOWLEDGMENTS
Introduction
FUNDAMENTAL CONCEPTS AND TECHNIQUES
WHAT'S IN IT FOR YOU?
Uncovering the Internet's True Potential
What's in It for Developers?
What's in It for Business Leaders?
Final Thoughts
IDEAS FOR WEBBOT PROJECTS
Inspiration from Browser Limitations
A Few Crazy Ideas to Get You Started
Final Thoughts
DOWNLOADING WEB PAGES
Think About Files, Not Web Pages
Downloading Files with PHP's Built-in Functions
Introducing PHP/CURL
Installing PHP/CURL
LIB_http
Final Thoughts
PARSING TECHNIQUES
Parsing Poorly Written HTML
Standard Parse Routines
Using LIB_parse
Useful PHP Functions
Final Thoughts
AUTOMATING FORM SUBMISSION
Reverse Engineering Form Interfaces
Form Handlers, Data Fields, Methods, and Event Triggers
Unpredictable Forms
Analyzing a Form
Final Thoughts
MANAGING LARGE AMOUNTS OF DATA
Organizing Data
Making Data Smaller
Thumbnailing Images
Final Thoughts
PROJECTS
PRICE-MONITORING WEBBOTS
The Target
Designing the Parsing Script
Initialization and Downloading the Target
Further Exploration
IMAGE-CAPTURING WEBBOTS
Example Image-Capturing Webbot
Creating the Image-Capturing Webbot
Further Exploration
Final Thoughts
LINK-VERIFICATION WEBBOTS
Creating the Link-Verification Webbot
Running the Webbot
Further Exploration
ANONYMOUS BROWSING WEBBOTS
Anonymity with Proxies
The Anonymizer Project
Final Thoughts
SEARCH-RANKING WEBBOTS
Description of a Search Result Page
What the Search-Ranking Webbot Does
Running the Search-Ranking Webbot
How the Search-Ranking Webbot Works
The Search-Ranking Webbot Script
Final Thoughts
Further Exploration
AGGREGATION WEBBOTS
Choosing Data Sources for Webbots
Example Aggregation Webbot
Adding Filtering to Your Aggregation Webbot
Further Exploration
FTP WEBBOTS
Example FTP Webbot
PHP and FTP
Further Exploration
NNTP NEWS WEBBOTS
NNTP Use and History
Webbots and Newsgroups
Further Exploration
WEBBOTS THAT READ EMAIL
The POP3 Protocol
Executing POP3 Commands with a Webbot
Further Exploration
WEBBOTS THAT SEND EMAIL
Email, Webbots, and Spam
Sending Mail with SMTP and PHP
Writing a Webbot That Sends Email Notifications
Further Exploration
CONVERTING A WEBSITE INTO A FUNCTION
Writing a Function Interface
Final Thoughts
ADVANCED TECHNICAL CONSIDERATIONS
SPIDERS
How Spiders Work
Example Spider
LIB_simple_spider
Experimenting with the Spider
Adding the Payload
Further Exploration
PROCUREMENT WEBBOTS AND SNIPERS
Procurement Webbot Theory
Sniper Theory
Testing Your Own Webbots and Snipers
Further Exploration
Final Thoughts
WEBBOTS AND CRYPTOGRAPHY
Designing Webbots That Use Encryption
A Quick Overview of Web Encryption
Local Certificates
Final Thoughts
AUTHENTICATION
What Is Authentication?
Example Scripts and Practice Pages
Basic Authentication
Session Authentication
Final Thoughts
ADVANCED COOKIE MANAGEMENT
How Cookies Work
PHP/CURL and Cookies
How Cookies Challenge Webbot Design
Further Exploration
SCHEDULING WEBBOTS AND SPIDERS
The Windows Task Scheduler
Complex Schedules
Non-Calendar-Based Triggers
Final Thoughts
LARGER CONSIDERATIONS
DESIGNING STEALTHY WEBBOTS AND SPIDERS
Why Design a Stealthy Webbot?
Stealth Means Simulating Human Patterns
Final Thoughts
WRITING FAULT-TOLERANT WEBBOTS
Types of Webbot Fault Tolerance
Error Handlers
DESIGNING WEBBOT-FRIENDLY WEBSITES
Optimizing Web Pages for Search Engine Spiders
Web Design Techniques That Hinder Search Engine Spiders
Designing Data-Only Interfaces
KILLING SPIDERS
Asking Nicely
Building Speed Bumps
Setting Traps
Final Thoughts
KEEPING WEBBOTS OUT OF TROUBLE
It's All About Respect
Copyright
Trespass to Chattels
Internet Law
Final Thoughts
PHP/CURL REFERENCE
Creating a Minimal PHP/CURL Session
Initiating PHP/CURL Sessions
Setting PHP/CURL Options
Executing the PHP/CURL Command
Closing PHP/CURL Sessions
STATUS CODES
HTTP Codes
NNTP Codes
SMS EMAIL ADDRESSES
Colophon
Index