This book is a comprehensive text on Web data mining, covering the three key areas of Web structure mining, Web content mining, and Web usage mining. It brings together the essential concepts and algorithms from related fields such as data mining, machine learning, and text processing into an authoritative and coherent whole, offering a rich blend of theory and practice. The book is suitable for students, researchers, and practitioners interested in Web mining, and lecturers can readily use it for classes on data mining, Web mining, and Web search. Lecture slides and project problems are available online.
Author(s): Bing Liu
Series: Data-Centric Systems and Applications
Edition: 1st ed. 2007. Corr. 2nd printing
Publisher: Springer
Year: 2007
Language: English
Pages: 544
Cover......Page 1
Copyright......Page 4
Table of Contents......Page 9
1. Introduction......Page 17
1.1. What is the World Wide Web?......Page 18
1.2. A Brief History of the Web and the Internet......Page 19
1.3. Web Data Mining......Page 21
1.3.2. What is Web Mining?......Page 23
1.4. Summary of Chapters......Page 25
1.5. How to Read this Book......Page 28
Bibliographic Notes......Page 29
Part I: Data Mining Foundations
2. Association Rules and Sequential Patterns......Page 29
2.1. Basic Concepts of Association Rules......Page 30
2.2.1. Frequent Itemset Generation......Page 33
2.2.2. Association Rule Generation......Page 37
2.4. Mining with Multiple Minimum Supports......Page 39
2.4.1. Extended Model......Page 41
2.4.2. Mining Algorithm......Page 43
2.4.3. Rule Generation......Page 48
2.5.1. Problem Definition......Page 49
2.5.2. Mining Algorithm......Page 51
2.6. Basic Concepts of Sequential Patterns......Page 54
2.7.1. GSP Algorithm......Page 56
2.7.2. Mining with Multiple Minimum Supports......Page 58
2.8. Mining Sequential Patterns Based on PrefixSpan......Page 62
2.8.1. PrefixSpan Algorithm......Page 63
2.8.2. Mining with Multiple Minimum Supports......Page 65
2.9. Generating Rules from Sequential Patterns......Page 66
2.9.2. Label Sequential Rules......Page 67
2.9.3. Class Sequential Rules......Page 68
Bibliographic Notes......Page 69
3. Supervised Learning......Page 71
3.1. Basic Concepts......Page 72
3.2. Decision Tree Induction......Page 76
3.2.1. Learning Algorithm......Page 79
3.2.2. Impurity Function......Page 80
3.2.3. Handling of Continuous Attributes......Page 84
3.2.4. Some Other Issues......Page 85
3.3.1. Evaluation Methods......Page 88
3.3.2. Precision, Recall, F-score and Breakeven Point......Page 90
3.4.1. Sequential Covering......Page 92
3.4.2. Rule Learning: Learn-One-Rule Function......Page 95
3.5. Classification Based on Associations......Page 98
3.5.1. Classification Using Class Association Rules......Page 99
3.5.3. Classification Using Normal Association Rules......Page 103
3.6. Naive Bayesian Classification......Page 104
3.7. Naive Bayesian Text Classification......Page 108
3.7.1. Probabilistic Framework......Page 109
3.7.2. Naive Bayesian Model......Page 110
3.7.3. Discussion......Page 113
3.8. Support Vector Machines......Page 114
3.8.1. Linear SVM: Separable Case......Page 116
3.8.2. Linear SVM: Non-Separable Case......Page 122
3.8.3. Nonlinear SVM: Kernel Functions......Page 125
3.9. K-Nearest Neighbor Learning......Page 129
3.10. Ensemble of Classifiers......Page 130
3.10.2. Boosting......Page 131
Bibliographic Notes......Page 132
4. Unsupervised Learning......Page 133
4.1. Basic Concepts......Page 134
4.2.1. K-means Algorithm......Page 137
4.2.2. Disk Version of the K-means Algorithm......Page 140
4.2.3. Strengths and Weaknesses......Page 141
4.3. Representation of Clusters......Page 145
4.3.1. Common Ways of Representing Clusters......Page 146
4.3.2. Clusters of Arbitrary Shapes......Page 147
4.4. Hierarchical Clustering......Page 148
4.4.2. Complete-Link Method......Page 150
4.4.4. Strengths and Weaknesses......Page 151
4.5.1. Numeric Attributes......Page 152
4.5.2. Binary and Nominal Attributes......Page 153
4.5.3. Text Documents......Page 155
4.6. Data Standardization......Page 156
4.7. Handling of Mixed Attributes......Page 158
4.9. Cluster Evaluation......Page 160
4.10. Discovering Holes and Data Regions......Page 163
Bibliographic Notes......Page 166
5. Partially Supervised Learning......Page 167
5.1. Learning from Labeled and Unlabeled Examples......Page 168
5.1.1. EM Algorithm with Naive Bayesian Classification......Page 170
5.1.2. Co-Training......Page 173
5.1.3. Self-Training......Page 175
5.1.4. Transductive Support Vector Machines......Page 176
5.1.5. Graph-Based Methods......Page 177
5.1.6. Discussion......Page 181
5.2.1. Applications of PU Learning......Page 182
5.2.2. Theoretical Foundation......Page 185
5.2.3. Building Classifiers: Two-Step Approach......Page 186
5.2.4. Building Classifiers: Direct Approach......Page 192
5.2.5. Discussion......Page 195
Appendix: Derivation of EM for Naive Bayesian Classification......Page 196
Bibliographic Notes......Page 198
Part II: Web Mining
6. Information Retrieval and Web Search......Page 200
6.1. Basic Concepts of Information Retrieval......Page 201
6.2. Information Retrieval Models......Page 204
6.2.2. Vector Space Model......Page 205
6.2.3. Statistical Language Model......Page 208
6.3. Relevance Feedback......Page 209
6.4. Evaluation Measures......Page 212
6.5.1. Stopword Removal......Page 216
6.5.3. Other Pre-Processing Tasks for Text......Page 217
6.5.4. Web Page Pre-Processing......Page 218
6.5.5. Duplicate Detection......Page 220
6.6.1. Inverted Index......Page 221
6.6.2. Search Using an Inverted Index......Page 223
6.6.3. Index Construction......Page 224
6.6.4. Index Compression......Page 226
6.7.1. Singular Value Decomposition......Page 232
6.7.2. Query and Retrieval......Page 235
6.7.3. An Example......Page 236
6.7.4. Discussion......Page 238
6.8. Web Search......Page 239
6.9. Meta-Search: Combining Multiple Rankings......Page 242
6.9.1. Combination Using Similarity Scores......Page 243
6.9.2. Combination Using Rank Positions......Page 244
6.10. Web Spamming......Page 246
6.10.1. Content Spamming......Page 247
6.10.2. Link Spamming......Page 248
6.10.3. Hiding Techniques......Page 250
6.10.4. Combating Spam......Page 251
Bibliographic Notes......Page 252
7. Link Analysis......Page 254
7.1.1. Centrality......Page 255
7.1.2. Prestige......Page 258
7.2. Co-Citation and Bibliographic Coupling......Page 260
7.2.1. Co-Citation......Page 261
7.3. PageRank......Page 262
7.3.1. PageRank Algorithm......Page 263
7.3.2. Strengths and Weaknesses of PageRank......Page 270
7.3.3. Timed PageRank......Page 271
7.4. HITS......Page 272
7.4.1. HITS Algorithm......Page 273
7.4.3. Relationships with Co-Citation and Bibliographic Coupling......Page 276
7.4.4. Strengths and Weaknesses of HITS......Page 277
7.5. Community Discovery......Page 278
7.5.1. Problem Definition......Page 279
7.5.2. Bipartite Core Communities......Page 281
7.5.3. Maximum Flow Communities......Page 282
7.5.4. Email Communities Based on Betweenness......Page 285
7.5.5. Overlapping Communities of Named Entities......Page 287
Bibliographic Notes......Page 288
8. Web Crawling......Page 289
8.1. A Basic Crawler Algorithm......Page 290
8.1.1. Breadth-First Crawlers......Page 291
8.1.2. Preferential Crawlers......Page 292
8.2.1. Fetching......Page 293
8.2.2. Parsing......Page 294
8.2.4. Link Extraction and Canonicalization......Page 296
8.2.5. Spider Traps......Page 298
8.2.6. Page Repository......Page 299
8.2.7. Concurrency......Page 300
8.3. Universal Crawlers......Page 301
8.3.1. Scalability......Page 302
8.3.2. Coverage vs Freshness vs Importance......Page 304
8.4. Focused Crawlers......Page 305
8.5. Topical Crawlers......Page 308
8.5.1. Topical Locality and Cues......Page 310
8.5.2. Best-First Variations......Page 316
8.5.3. Adaptation......Page 319
8.6. Evaluation......Page 326
8.7. Crawler Ethics and Conflicts......Page 331
8.8. Some New Developments......Page 334
Bibliographic Notes......Page 336
9. Structured Data Extraction: Wrapper Generation......Page 338
9.1.1. Two Types of Data Rich Pages......Page 339
9.1.2. Data Model......Page 341
9.1.3. HTML Mark-Up Encoding of Data Instances......Page 343
9.2.1. Extraction from a Page......Page 345
9.2.2. Learning Extraction Rules......Page 348
9.2.3. Identifying Informative Examples......Page 352
9.3. Instance-Based Wrapper Learning......Page 353
9.4. Automatic Wrapper Generation: Problems......Page 356
9.4.1. Two Extraction Problems......Page 357
9.4.2. Patterns as Regular Expressions......Page 358
9.5.1. String Edit Distance......Page 359
9.5.2. Tree Matching......Page 361
9.6.1. Center Star Method......Page 365
9.6.2. Partial Tree Alignment......Page 366
9.7. Building DOM Trees......Page 371
9.8. Extraction Based on a Single List Page: Flat Data Records......Page 372
9.8.1. Two Observations about Data Records......Page 373
9.8.2. Mining Data Regions......Page 374
9.8.3. Identifying Data Records in Data Regions......Page 379
9.8.4. Data Item Alignment and Extraction......Page 380
9.8.6. Some Other Techniques......Page 381
9.9. Extraction Based on a Single List Page: Nested Data Records......Page 382
9.10.1. Using Techniques in Previous Sections......Page 388
9.10.2. RoadRunner Algorithm......Page 389
9.11.1. Extraction from Other Pages......Page 390
9.11.2. Disjunction or Optional......Page 391
9.11.3. A Set Type or a Tuple Type......Page 392
9.11.5. Domain Specific Extraction......Page 393
Bibliographic Notes......Page 394
10. Information Integration......Page 396
10.1. Introduction to Schema Matching......Page 397
10.2. Pre-Processing for Schema Matching......Page 399
10.3.1. Linguistic Approaches......Page 400
10.3.2. Constraint Based Approaches......Page 401
10.4. Domain and Instance-Level Matching......Page 402
10.5. Combining Similarities......Page 405
10.6. 1:m Match......Page 406
10.7.1. Reuse of Previous Match Results......Page 407
10.7.3. Schema Match Results......Page 408
10.8. Integration of Web Query Interfaces......Page 409
10.8.1. A Clustering Based Approach......Page 412
10.8.2. A Correlation Based Approach......Page 415
10.8.3. An Instance Based Approach......Page 418
10.9.1. Structural Appropriateness and the Merge Algorithm......Page 421
10.9.2. Lexical Appropriateness......Page 423
10.9.3. Instance Appropriateness......Page 424
Bibliographic Notes......Page 425
11. Opinion Mining......Page 426
11.1. Sentiment Classification......Page 427
11.1.1. Classification Based on Sentiment Phrases......Page 428
11.1.2. Classification Using Text Classification Methods......Page 430
11.1.3. Classification Using a Score Function......Page 431
11.2. Feature-Based Opinion Mining and Summarization......Page 432
11.2.1. Problem Definition......Page 433
11.2.2. Object Feature Extraction......Page 439
11.2.3. Feature Extraction from Pros and Cons of Format 1......Page 440
11.2.4. Feature Extraction from Reviews of Formats 2 and 3......Page 444
11.2.5. Opinion Orientation Classification......Page 445
11.3. Comparative Sentence and Relation Mining......Page 447
11.3.1. Problem Definition......Page 448
11.3.2. Identification of Gradable Comparative Sentences......Page 450
11.3.3. Extraction of Comparative Relations......Page 452
11.4. Opinion Search......Page 454
11.5.1. Objectives and Actions of Opinion Spamming......Page 456
11.5.2. Types of Spam and Spammers......Page 457
11.5.3. Hiding Techniques......Page 458
11.5.4. Spam Detection......Page 459
Bibliographic Notes......Page 461
12. Web Usage Mining......Page 463
12.1. Data Collection and Pre-Processing......Page 464
12.1.1. Sources and Types of Data......Page 466
12.1.2. Key Elements of Web Usage Data Pre-Processing......Page 469
12.2. Data Modeling for Web Usage Mining......Page 476
12.3.1. Session and Visitor Analysis......Page 480
12.3.2. Cluster Analysis and Visitor Segmentation......Page 481
12.3.3. Association and Correlation Analysis......Page 485
12.3.4. Analysis of Sequential and Navigational Patterns......Page 489
12.3.5. Classification and Prediction Based on Web User Transactions......Page 493
Bibliographic Notes......Page 496
References......Page 498
Index......Page 529