Mining the Web: Discovering Knowledge from Hypertext Data is the first book devoted entirely to techniques for producing knowledge from the vast body of unstructured Web data. Building on an initial survey of infrastructural issues — including Web crawling and indexing — Chakrabarti examines low-level machine learning techniques as they relate specifically to the challenges of Web mining. He then devotes the final part of the book to applications that unite infrastructure and analysis to bring machine learning to bear on systematically acquired and stored data. Here the focus is on results: the strengths and weaknesses of these applications, along with their potential as foundations for further progress. From Chakrabarti's work — painstaking, critical, and forward-looking — readers will gain the theoretical and practical understanding they need to contribute to the Web mining effort.
Author(s): Soumen Chakrabarti
Edition: 1
Publisher: Morgan Kaufmann
Year: 2002
Language: English
Pages: 344
Cover......Page 1
FOREWORD......Page 7
Contents......Page 8
PREFACE......Page 16
1 - Introduction......Page 20
1.1 Crawling and Indexing......Page 25
1.2 Topic Directories......Page 26
1.3 Clustering and Classification......Page 27
1.4 Hyperlink Analysis......Page 28
1.6 Structured vs. Unstructured Data Mining......Page 30
1.7 Bibliographic Notes......Page 32
Part I - Infrastructure......Page 34
2 - Crawling the Web......Page 36
2.1 HTML and HTTP Basics......Page 37
2.2 Crawling Basics......Page 38
2.3 Engineering Large-Scale Crawlers......Page 40
2.4 Putting Together a Crawler......Page 54
2.5 Bibliographic Notes......Page 59
3.1 Boolean Queries and the Inverted Index......Page 64
3.2 Relevance Ranking......Page 72
3.3 Similarity Search......Page 86
3.4 Bibliographic Notes......Page 94
Part II - Learning......Page 96
4 - Similarity and Clustering......Page 98
4.1 Formulations and Approaches......Page 100
4.2 Bottom-Up and Top-Down Partitioning Paradigms......Page 103
4.3 Clustering and Visualization via Embeddings......Page 108
4.4 Probabilistic Approaches to Clustering......Page 118
4.5 Collaborative Filtering......Page 134
4.6 Bibliographic Notes......Page 140
5 - Supervised Learning......Page 144
5.1 The Supervised Learning Scenario......Page 145
5.2 Overview of Classification Strategies......Page 147
5.3 Evaluating Text Classifiers......Page 148
5.4 Nearest Neighbor Learners......Page 152
5.5 Feature Selection......Page 155
5.6 Bayesian Learners......Page 166
5.7 Exploiting Hierarchy among Topics......Page 174
5.8 Maximum Entropy Learners......Page 179
5.9 Discriminative Classification......Page 182
5.10 Hypertext Classification......Page 188
5.11 Bibliographic Notes......Page 192
6 - Semisupervised Learning......Page 196
6.1 Expectation Maximization......Page 197
6.2 Labeling Hypertext Graphs......Page 203
6.3 Co-training......Page 214
6.4 Bibliographic Notes......Page 217
Part III - Applications......Page 220
7 - Social Network Analysis......Page 222
7.1 Social Sciences and Bibliometry......Page 224
7.2 PageRank and HITS......Page 228
7.3 Shortcomings of the Coarse-Grained Graph Model......Page 238
7.4 Enhanced Models and Techniques......Page 244
7.5 Evaluation of Topic Distillation......Page 254
7.6 Measuring and Modeling the Web......Page 262
7.7 Bibliographic Notes......Page 273
8 - Resource Discovery......Page 274
8.1 Collecting Important Pages Preferentially......Page 276
8.2 Similarity Search Using Link Topology......Page 283
8.3 Topical Locality and Focused Crawling......Page 287
8.4 Discovering Communities......Page 303
8.5 Bibliographic Notes......Page 307
9 - The Future of Web Mining......Page 308
9.1 Information Extraction......Page 309
9.2 Natural Language Processing......Page 314
9.3 Question Answering......Page 321
9.4 Profiles, Personalization, and Collaboration......Page 324
REFERENCES......Page 326
INDEX......Page 346