Programming Collective Intelligence: Building Smart Web 2.0 Applications

Want to tap the power behind search rankings, product recommendations, social bookmarking, and online matchmaking? This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it. Programming Collective Intelligence takes you into the world of machine learning and statistics, and explains how to draw conclusions about user experience, marketing, personal tastes, and human behavior in general--all from information that you and others collect every day. Each algorithm is described clearly and concisely with code that can immediately be used on your web site, blog, Wiki, or specialized application.

This book explains:
  • Collaborative filtering techniques that enable online retailers to recommend products or media
  • Methods of clustering to detect groups of similar items in a large dataset
  • Search engine features--crawlers, indexers, query engines, and the PageRank algorithm
  • Optimization algorithms that search millions of possible solutions to a problem and choose the best one
  • Bayesian filtering, used in spam filters for classifying documents based on word types and other features
  • Using decision trees not only to make predictions, but to model the way decisions are made
  • Predicting numerical values rather than classifications to build price models
  • Support vector machines to match people in online dating sites
  • Non-negative matrix factorization to find the independent features in a dataset
  • Evolving intelligence for problem solving--how a computer develops its skill by improving its own code the more it plays a game 
Each chapter includes exercises for extending the algorithms to make them more powerful. Go beyond simple database-backed applications and put the wealth of Internet data to work for you.

"Bravo! I cannot think of a better way for a developer to first learn these algorithms and methods, nor can I think of a better way for me (an old AI dog) to reinvigorate my knowledge of the details." -- Dan Russell, Google

"Toby's book does a great job of breaking down the complex subject matter of machine-learning algorithms into practical, easy-to-understand examples that can be directly applied to analysis of social interaction across the Web today. If I had this book two years ago, it would have saved precious time going down some fruitless paths." -- Tim Wolters, CTO, Collective Intellect

Author(s): Toby Segaran
Edition: 1st
Publisher: O'Reilly Media
Year: 2007

Language: English
Pages: 360

Programming Collective Intelligence
Table of Contents
Prerequisites
Why Python?
Significant Whitespace
Open APIs
Overview of the Chapters
Conventions
How to Contact Us
Acknowledgments
Introduction to Collective Intelligence
What Is Collective Intelligence?
What Is Machine Learning?
Limits of Machine Learning
Other Uses for Learning Algorithms
Collaborative Filtering
Collecting Preferences
Finding Similar Users
Euclidean Distance Score
Pearson Correlation Score
Ranking the Critics
Recommending Items
Matching Products
Building a del.icio.us Link Recommender
Building the Dataset
Item-Based Filtering
Building the Item Comparison Dataset
Getting Recommendations
Using the MovieLens Dataset
User-Based or Item-Based Filtering?
Exercises
Supervised versus Unsupervised Learning
Pigeonholing the Bloggers
Counting the Words in a Feed
Hierarchical Clustering
Drawing the Dendrogram
Column Clustering
K-Means Clustering
Clusters of Preferences
Scraping the Zebo Results
Clustering Results
Viewing Data in Two Dimensions
Exercises
What’s in a Search Engine?
Using urllib2
Crawler Code
Building the Index
Setting Up the Schema
Finding the Words on a Page
Adding to the Index
Querying
Content-Based Ranking
Word Frequency
Document Location
Word Distance
Simple Count
The PageRank Algorithm
Using the Link Text
Design of a Click-Tracking Network
Setting Up the Database
Feeding Forward
Training with Backpropagation
Connecting to the Search Engine
Exercises
Optimization
Group Travel
Representing Solutions
The Cost Function
Random Searching
Hill Climbing
Simulated Annealing
Genetic Algorithms
The Kayak API
Flight Searches
Student Dorm Optimization
Running the Optimization
The Layout Problem
Counting Crossed Lines
Drawing the Network
Other Possibilities
Exercises
Filtering Spam
Documents and Words
Training the Classifier
Calculating Probabilities
Starting with a Reasonable Guess
A Naïve Classifier
Probability of a Whole Document
A Quick Introduction to Bayes’ Theorem
Choosing a Category
The Fisher Method
Category Probabilities for Features
Combining the Probabilities
Classifying Items
Using SQLite
Filtering Blog Feeds
Improving Feature Detection
Using Akismet
Alternative Methods
Exercises
Predicting Signups
Introducing Decision Trees
Training the Tree
Gini Impurity
Entropy
Recursive Tree Building
Displaying the Tree
Graphical Display
Classifying New Observations
Pruning the Tree
Dealing with Missing Data
Modeling Home Prices
The Zillow API
Modeling “Hotness”
When to Use Decision Trees
Exercises
Building a Sample Dataset
Number of Neighbors
Code for k-Nearest Neighbors
Inverse Function
Subtraction Function
Gaussian Function
Weighted kNN
Cross-Validation
Adding to the Dataset
Scaling Dimensions
Optimizing the Scale
Estimating the Probability Density
Graphing the Probabilities
Getting a Developer Key
Setting Up a Connection
Performing a Search
Getting Details for an Item
Building a Price Predictor
When to Use k-Nearest Neighbors
Exercises
Matchmaker Dataset
Decision Tree Classifier
Basic Linear Classification
Categorical Features
Lists of Interests
Using the Geocoding API
Calculating the Distance
Scaling the Data
Understanding Kernel Methods
The Kernel Trick
Support-Vector Machines
A Sample Session
Applying SVM to the Matchmaker Dataset
Getting a Developer Key
Creating a Session
Download Friend Data
Building a Match Dataset
Creating an SVM Model
Exercises
Finding Independent Features
Selecting Sources
Downloading Sources
Converting to a Matrix
Bayesian Classification
A Quick Introduction to Matrix Math
What Does This Have to Do with the Articles Matrix?
Using NumPy
The Algorithm
Displaying the Results
Displaying by Article
What Is Trading Volume?
Downloading Data from Yahoo! Finance
Preparing a Matrix
Displaying the Results
Exercises
What Is Genetic Programming?
Genetic Programming Versus Genetic Algorithms
Programs As Trees
Representing Trees in Python
Building and Evaluating Trees
Displaying the Program
Creating the Initial Population
A Simple Mathematical Test
Mutating Programs
Crossover
Building the Environment
A Simple Game
A Round-Robin Tournament
Playing Against Real People
More Numerical Functions
Different Datatypes
Exercises
Bayesian Classifier
Training
Using Your Code
Strengths and Weaknesses
Training
Using Your Decision Tree Classifier
Strengths and Weaknesses
Neural Networks
Using Your Neural Network Code
Strengths and Weaknesses
Support-Vector Machines
The Kernel Trick
Using LIBSVM
Strengths and Weaknesses
k-Nearest Neighbors
Scaling and Superfluous Variables
Using Your kNN Code
Clustering
K-Means Clustering
Using Your Clustering Code
Multidimensional Scaling
Using Your Multidimensional Scaling Code
Non-Negative Matrix Factorization
Optimization
The Cost Function
Using Your Optimization Code
Python Imaging Library
Beautiful Soup
Installation on Other Platforms
Installation on Windows
Installation
Simple Usage Example
Euclidean Distance
Pearson Correlation Coefficient
Tanimoto Coefficient
Gini Impurity
Entropy
Gaussian Function
Dot-Products
Index