The huge volume of data on the Web has created a pressing need to locate the right information at the right time and to integrate information effectively into a comprehensive source of relevant information. Efficient tools are needed for analyzing and managing Web data, and for managing Web information from a database perspective. The book proposes a data model called WHOM (Warehouse Object Model) to represent HTML and XML documents in the warehouse. It defines a set of web algebraic operators for building new web tables by extracting relevant data from the Web and for generating new tables from existing ones. The same operators are used for change detection.
Author(s): Sourav S. Bhowmick, Sanjay K. Madria, Wee K. Ng
Edition: 1
Year: 2003
Language: English
Pages: 470
Contents
Preface
1 Introduction
1.1.1 Problems with Web Data
1.1.2 Limitations of Search Engines
1.1.3 Limitations of Traditional Data Warehouse
1.1.4 Warehousing the Web
1.2 Architecture and Functionalities
1.2.1 Scope of This Book
1.3 Research Issues
1.4 Contributions of the Book
2 A Survey of Web Data Management Systems
2.1.1 Search Engines
2.1.2 Metasearch Engines
2.1.3 W3QS
2.1.4 WebSQL
2.1.5 WebLog
2.1.6 NetQL
2.1.7 FLORID
2.2 Web Information Integration Systems
2.2.1 Information Manifold
2.2.2 TSIMMIS
2.2.3 Ariadne
2.3 Web Data Restructuring
2.3.1 STRUDEL
2.3.2 WebOQL
2.3.3 ARANEUS
2.4.1 Lore
2.5 XML Query Languages
2.5.1 Lorel
2.5.2 XML-QL
2.6 Summary
3.1.1 Motivation
3.2.1 Metadata Associated with HTML and XML Documents
3.3 Representing Structure and Content of Web Documents
3.3.1 Issues for Modeling Structure and Content
3.3.2 Node Structural Attributes
3.3.3 Location Attributes
3.4 Representing Structure and Content of Hyperlinks
3.4.1 Issues for Modeling Hyperlinks
3.4.3 Reference Identifier
3.6 Node and Link Structure Trees
3.7 Recent Approaches in Modeling Web Data
3.7.1 Semistructured Data Modeling
3.7.3 XML Data Modeling
3.7.4 Open Hypermedia System
3.8 Summary
4 Predicates on Node and Link Objects
4.1 Introduction
4.1.1 Features of Predicate
4.1.2 Overview of Predicates
4.2 Components of Comparison-Free Predicates
4.2.1 Attribute Path Expressions
4.2.2 Predicate Qualifier
4.2.3 Value of a Comparison-Free Predicate
4.2.4 Predicate Operators
4.3 Comparison Predicates
4.3.1 Components of a Comparison Predicate
4.3.2 Types of Comparison Predicates
4.4 Summary
5.1 Introduction
5.1.2 Difficulties in Modeling Connectivities
5.1.3 Features of Connectivities
5.2 Components of Connectivities
5.2.2 Link Path Expressions
5.3.2 Complex Connectivities
5.4.1 Transformation of Case 1
5.4.2 Transformation of Case 2
5.4.3 Transformation of Case 3
5.4.5 Steps for Transformation
5.5.1 Simple Connectivities
5.6 Summary
6.1.1 Motivation
6.1.2 Our Approach
6.2.1 The Information Space
6.2.2 Components
6.2.3 Definition of Coupling Query
6.2.4 Types of Coupling Query
6.2.5 Valid Canonical Coupling Query
6.3 Examples of Coupling Queries
6.3.1 Noncanonical Coupling Query
6.3.2 Canonical Coupling Query
6.4.1 Outline
6.4.2 Phase 1: Coupling Query Reduction
6.4.3 Phase 2: Validity Checking
6.5.1 Definition of Coupling Graph
6.5.2 Types of Coupling Graph
6.5.3 Limitations of Coupling Graphs
6.5.4 Hybrid Graph
6.6 Coupling Query Results
6.7 Computability of Valid Coupling Queries
6.7.1 Browser and Browse/Search Coupling Queries
6.8 Recent Approaches for Querying the Web
6.9 Summary
7 Schemas for Warehouse Data
7.1.1 Recent Approaches for Modeling Schema for Web Data
7.1.2 Features of Our Web Schema
7.1.3 Summary of Our Methodology
7.1.4 Importance of Web Schema in a Web Warehouse
7.2.1 Definition
7.2.2 Types of Web Schema
7.2.3 Schema Conformity
7.2.4 Web Table
7.4 Phase 1: Valid Canonical Coupling Query to Schema Transformation
7.4.1 Schema from Query Containing Schema-Independent Predicates
7.4.2 Schema from Query Containing Schema-Influencing Predicates
7.5.1 Motivation
7.5.2 Discussion
7.5.3 Limitations
7.6.2 Classifications of Simple Schemas
7.6.3 Schema Pruning Process
7.6.4 Phase 1: Preprocessing Phase
7.6.6 Phase 3: Nonoverlapping Partitioning Phase
7.7 Algorithm Schema Generator
7.7.1 Pruning Ratio
7.7.2 Algorithm of GenerateSchemaFromQuery
7.7.3 Algorithm for the Construct Partition
7.8.1 Schema Generation Phase
7.8.2 Schema Pruning Phase
7.9 Summary
8.1 Types of Manipulation
8.2.1 Definition
8.2.2 Global Web Coupling Operation
8.2.3 Web Tuples Generation Phase
8.2.4 Limitations
8.3.1 Selection Criteria
8.3.3 Simple Web Schema Set
8.3.4 Selection Schema
8.3.6 Select Table Generation
8.4.2 Projection Attributes
8.4.3 Algorithm for Web Project
8.5 Web Distinct
8.6 Web Cartesian Product
8.7.1 Motivation and Overview
8.7.2 Concept of Web Join
8.7.3 Join Existence Phase
8.7.4 Join Construction Phase When X_pj ≠ ∅
8.7.5 Joined Partition Pruning
8.7.6 Join Construction Phase When X_j = ∅
8.8.1 σ-Web Join
8.8.2 Outer Web Join
8.9 Web Union
8.10 Summary
9 Web Data Visualization
9.1.1 Web Nest
9.1.2 Web Unnest
9.1.3 Web Coalesce
9.1.4 Web Expand
9.1.5 Web Pack
9.1.6 Web Unpack
9.1.7 Web Sort
9.2 Summary
10.1 Introduction
10.1.1 Overview
10.2 Related Work
10.3.1 Problem Definition
10.3.3 Representing Changes
10.4.1 Storage of Web Objects
10.4.2 Outline of the Algorithm
10.4.3 Algorithm Delta
10.5 Conclusions and Future Work
11.1 Introduction
11.1.1 Motivation
11.1.2 Overview
11.2 Related Work
11.2.2 Mutual Reinforcement Approach
11.2.3 Rafiei and Mendelzon’s Approach
11.2.4 SALSA
11.2.5 Approach of Borodin et al.
11.3 Concept of Web Bag
11.4.1 Terminology
11.4.2 Visibility of Web Documents and Intersite Connectivity
11.4.3 Luminosity of Web Documents
11.4.4 Luminous Paths
11.4.5 Query Language Design Considerations
11.4.6 Query Language for Knowledge Discovery
11.5 Conclusions and Future Work
12.1 Summary of the Book
12.3 Extending Coupling Queries and Global Web Coupling Operation
12.5 Extension of the Web Algebra
12.5.1 Schema Operators
12.5.4 Operators for Manipulation at Subpage Level
12.7 Retrieving and Manipulating Data from the Hidden Web
12.8 Data Mining in the Web Warehouse
12.9 Conclusions
A: Table of Symbols
B: Regular Expressions in Comparison-Free Predicate Values
C: Examples of Comparison-Free Predicates
D: Examples of Comparison Operators
E: Nodes and Links
References