The latest techniques and principles of parallel and grid database processingThe growth in grid databases, coupled with the utility of parallel query processing, presents an important opportunity to understand and utilize high-performance parallel database processing within a major database management system (DBMS). This important new book provides readers with a fundamental understanding of parallelism in data-intensive applications, and demonstrates how to develop faster capabilities to support them. It presents a balanced treatment of the theoretical and practical aspects of high-performance databases to demonstrate how parallel query is executed in a DBMS, including concepts, algorithms, analytical models, and grid transactions.High-Performance Parallel Database Processing and Grid Databases serves as a valuable resource for researchers working in parallel databases and for practitioners interested in building a high-performance database. It is also a much-needed, self-contained textbook for database courses at the advanced undergraduate and graduate levels.
Author(s): David Taniar, Clement H. C. Leung, Wenny Rahayu, Sushant Goel
Edition: 1
Year: 2008
Language: English
Pages: 575
High-Performance Parallel Database Processing and Grid Databases......Page 4
Contents......Page 8
Preface......Page 18
Part I Introduction......Page 22
1. Introduction......Page 24
1.1. A Brief Overview: Parallel Databases and Grid Databases......Page 25
1.2. Parallel Query Processing: Motivations......Page 26
1.3.1. Speed Up......Page 28
1.3.2. Scale Up......Page 29
1.3.3. Parallel Obstacles......Page 31
1.4. Forms of Parallelism......Page 33
1.4.1. Interquery Parallelism......Page 34
1.4.2. Intraquery Parallelism......Page 35
1.4.4. Interoperation Parallelism......Page 36
1.4.5. Mixed Parallelism—A More Practical Solution......Page 39
1.5. Parallel Database Architectures......Page 40
1.5.1. Shared-Memory and Shared-Disk Architectures......Page 41
1.5.2. Shared-Nothing Architecture......Page 43
1.5.3. Shared-Something Architecture......Page 44
1.5.4. Interconnection Networks......Page 45
1.6. Grid Database Architecture......Page 47
1.7. Structure of this Book......Page 50
1.9. Bibliographical Notes......Page 51
1.10. Exercises......Page 52
2.1. Cost Models......Page 54
2.2.1. Data Parameters......Page 55
2.2.2. Systems Parameters......Page 57
2.2.4. Time Unit Costs......Page 58
2.2.5. Communication Costs......Page 59
2.3. Skew Model......Page 60
2.4. Basic Operations in Parallel Databases......Page 64
2.4.1. Disk Operations......Page 65
2.4.3. Data Computation and Data Distribution......Page 66
2.7. Exercises......Page 68
Part II Basic Query Parallelism......Page 70
3.1. Search Queries......Page 72
3.1.1. Exact-Match Search......Page 73
3.1.2. Range Search Query......Page 74
3.2. Data Partitioning......Page 75
3.2.1. Basic Data Partitioning......Page 76
3.2.2. Complex Data Partitioning......Page 81
3.3.1. Serial Search Algorithms......Page 90
3.3.2. Parallel Search Algorithms......Page 94
3.4. Summary......Page 95
3.6. Exercises......Page 96
4. Parallel Sort and GroupBy......Page 98
4.1.1. Sorting and Duplicate Removal......Page 99
4.1.2. Scalar Aggregate......Page 100
4.2. Serial External Sorting Method......Page 101
4.3.1. Parallel Merge-All Sort......Page 104
4.3.2. Parallel Binary-Merge Sort......Page 106
4.3.3. Parallel Redistribution Binary-Merge Sort......Page 107
4.3.4. Parallel Redistribution Merge-All Sort......Page 109
4.3.5. Parallel Partitioned Sort......Page 111
4.4.1. Traditional Methods (Merge-All and Hierarchical Merging)......Page 113
4.4.2. Two-Phase Method......Page 114
4.4.3. Redistribution Method......Page 115
4.5.1. Cost Models for Serial External Merge-Sort......Page 117
4.5.2. Cost Models for Parallel Merge-All Sort......Page 119
4.5.3. Cost Models for Parallel Binary-Merge Sort......Page 121
4.5.4. Cost Models for Parallel Redistribution Binary-Merge Sort......Page 122
4.5.5. Cost Models for Parallel Redistribution Merge-All Sort......Page 123
4.5.6. Cost Models for Parallel Partitioned Sort......Page 124
4.6.1. Cost Models for Parallel Two-Phase Method......Page 125
4.6.2. Cost Models for Parallel Redistribution Method......Page 128
4.7. Summary......Page 130
4.9. Exercises......Page 131
5.1. Join Operations......Page 133
5.2.1. Nested-Loop Join Algorithm......Page 135
5.2.2. Sort-Merge Join Algorithm......Page 137
5.2.3. Hash-Based Join Algorithm......Page 138
5.3. Parallel Join Algorithms......Page 141
5.3.1. Divide and Broadcast-Based Parallel Join Algorithms......Page 142
5.3.2. Disjoint Partitioning-Based Parallel Join Algorithms......Page 145
5.4.1. Cost Models for Divide and Broadcast......Page 149
5.4.2. Cost Models for Disjoint Partitioning......Page 150
5.4.3. Cost Models for Local Join......Page 151
5.5.1. Optimizing Main Memory......Page 153
5.5.2. Load Balancing......Page 154
5.6. Summary......Page 155
5.7. Bibliographical Notes......Page 156
5.8. Exercises......Page 157
Part III Advanced Parallel Query Processing......Page 160
6.1. Groupby-Join Queries......Page 162
6.1.2. Groupby After Join......Page 163
6.2.1. Early Distribution Scheme......Page 164
6.2.2. Early GroupBy with Partitioning Scheme......Page 166
6.2.3. Early GroupBy with Replication Scheme......Page 167
6.3.1. Join Partitioning Scheme......Page 169
6.3.2. GroupBy Partitioning Scheme......Page 171
6.4. Cost Model Notations......Page 172
6.5.1. Cost Models for the Early Distribution Scheme......Page 174
6.5.2. Cost Models for the Early GroupBy with Partitioning Scheme......Page 177
6.5.3. Cost Models for the Early GroupBy with Replication Scheme......Page 179
6.6.1. Cost Models for the Join Partitioning Scheme......Page 180
6.6.2. Cost Models for the GroupBy Partitioning Scheme......Page 182
6.7. Summary......Page 184
6.9. Exercises......Page 185
7. Parallel Indexing......Page 188
7.1. Parallel Indexing – an Internal Perspective on Parallel Indexing Structures......Page 189
7.2.1. Nonreplicated Indexing (NRI) Structures......Page 190
7.2.2. Partially Replicated Indexing (PRI) Structures......Page 192
7.2.3. Fully Replicated Indexing (FRI) Structures......Page 199
7.3. Index Maintenance......Page 201
7.3.2. Maintaining a Parallel Partially Replicated Index......Page 203
7.4. Index Storage Analysis......Page 209
7.4.1. Storage Cost Models for Uniprocessors......Page 210
7.4.2. Storage Cost Models for Parallel Processors......Page 212
7.5.1. Parallel One-Index Search Query Processing......Page 213
7.5.2. Parallel Multi-Index Search Query Processing......Page 216
7.6.1. Parallel One-Index Join......Page 221
7.6.2. Parallel Two-Index Join......Page 224
7.7.1. Comparative Analysis of Parallel Search Index......Page 228
7.7.2. Comparative Analysis of Parallel Index Join......Page 234
7.8. Summary......Page 237
7.10. Exercises......Page 238
8. Parallel Universal Qualification—Collection Join Queries......Page 240
8.1. Universal Quantification and Collection Join......Page 241
8.2.1. Collection-Equi Join Queries......Page 243
8.2.2. Collection–Intersect Join Queries......Page 244
8.2.3. Subcollection Join Queries......Page 245
8.4. Parallel Collection-Equi Join Algorithms......Page 246
8.4.1. Disjoint Data Partitioning......Page 247
8.4.2. Parallel Double Sort-Merge Collection-Equi Join Algorithm......Page 248
8.4.3. Parallel Sort-Hash Collection-Equi Join Algorithm......Page 249
8.4.4. Parallel Hash Collection-Equi Join Algorithm......Page 253
8.5. Parallel Collection-Intersect Join Algorithms......Page 254
8.5.1. Non-Disjoint Data Partitioning......Page 255
8.5.2. Parallel Sort-Merge Nested-Loop Collection-Intersect Join Algorithm......Page 265
8.5.3. Parallel Sort-Hash Collection-Intersect Join Algorithm......Page 266
8.6. Parallel Subcollection Join Algorithms......Page 267
8.6.1. Data Partitioning......Page 268
8.6.2. Parallel Sort-Merge Nested-Loop Subcollection Join Algorithm......Page 269
8.6.3. Parallel Sort-Hash Subcollection Join Algorithm......Page 270
8.6.4. Parallel Hash Subcollection Join Algorithm......Page 272
8.8. Bibliographical Notes......Page 273
8.9. Exercises......Page 275
9. Parallel Query Scheduling and Optimization......Page 277
9.1. Query Execution Plan......Page 278
9.2.1. Serial Execution Among Subqueries......Page 280
9.2.2. Parallel Execution Among Subqueries......Page 282
9.3.1. Nonskewed Subqueries......Page 285
9.3.2. Skewed Subqueries......Page 286
9.3.3. Skewed and Nonskewed Subqueries......Page 288
9.4. Scheduling Rules......Page 290
9.5. Cluster Query Processing Model......Page 291
9.5.1. Overview of Dynamic Query Processing......Page 292
9.5.2. A Cluster Query Processing Architecture......Page 293
9.5.3. Load Information Exchange......Page 294
9.6. Dynamic Cluster Query Optimization......Page 296
9.6.1. Correction......Page 297
9.6.2. Migration......Page 301
9.6.3. Partition......Page 302
9.7. Other Approaches to Dynamic Query Optimization......Page 305
9.8. Summary......Page 306
9.10. Exercises......Page 307
Part IV Grid Databases......Page 310
10. Transactions in Distributed and Grid Databases......Page 312
10.1. Grid Database Challenges......Page 313
10.2.1. Distributed Database Systems......Page 314
10.2.2. Multidatabase Systems......Page 318
10.3. Basic Definitions on Transaction Management......Page 320
10.4. Acid Properties of Transactions......Page 322
10.5.1. Transaction Management in Centralized and Homogeneous Distributed Database Systems......Page 324
10.5.2. Transaction Management in Heterogeneous Distributed Database Systems......Page 326
10.6. Requirements in Grid Database Systems......Page 328
10.7. Concurrency Control Protocols......Page 330
10.8.1. Homogeneous Distributed Database Systems......Page 331
10.8.2. Heterogeneous Distributed Database Systems......Page 334
10.9. Replica Synchronization Protocols......Page 335
10.9.1. Network Partitioning......Page 336
10.9.2. Replica Synchronization Protocols......Page 337
10.11. Bibliographical Notes......Page 339
10.12. Exercises......Page 340
11.1. A Grid Database Environment......Page 342
11.2. An Example......Page 343
11.3.1. Basic Functions Required by GCC......Page 345
11.3.2. Grid Serializability Theorem......Page 346
11.3.3. Grid Concurrency Control Protocol......Page 350
11.3.4. Revisiting the Earlier Example......Page 354
11.3.5. Comparison with Traditional Concurrency Control Protocols......Page 355
11.4. Correctness of GCC Protocol......Page 357
11.5. Features of GCC Protocol......Page 359
11.8. Exercises......Page 360
12. Grid Transaction Atomicity and Durability......Page 362
12.1. Motivation......Page 363
12.2.1. State Diagram of Grid-ACP......Page 364
12.2.2. Grid-ACP Algorithm......Page 365
12.2.3. Early-Abort Grid-ACP......Page 367
12.2.4. Discussion......Page 369
12.2.5. Message and Time Complexity Comparison Analysis......Page 370
12.2.6. Correctness of Grid-ACP......Page 371
12.3.1. Model for Storing Log Files at the Originator and Participating Sites......Page 372
12.3.2. Logs Required at the Originator Site......Page 373
12.3.4. Failure Recovery Algorithm for Grid-ACP......Page 374
12.3.5. Comparison of Recovery Protocols......Page 380
12.3.6. Correctness of Recovery Algorithm......Page 382
12.4. Summary......Page 386
12.6. Exercises......Page 387
13.1. Motivation......Page 388
13.2.1. High-Level Replica Management Architecture......Page 389
13.2.2. Some Problems......Page 390
13.3.1. Read Transaction Operation for GRAP......Page 392
13.3.2. Write Transaction Operation for GRAP......Page 393
13.3.3. Revisiting the Example Problem......Page 396
13.3.4. Correctness of GRAP......Page 398
13.4.1. Contingency GRAP......Page 399
13.4.2. Comparison of Replica Management Protocols......Page 402
13.4.3. Correctness of Contingency GRAP......Page 404
13.5. Summary......Page 405
13.7. Exercises......Page 406
14. Grid Atomic Commitment in Replicated Data......Page 408
14.1.2. Motivating Example......Page 409
14.2.1. Modified Grid-ACP......Page 411
14.2.2. Correctness of Modified Grid-ACP......Page 414
14.3. Transaction Properties in Replicated Environment......Page 416
14.5. Bibliographical Notes......Page 418
14.6. Exercises......Page 419
Part V Other Data-Intensive Applications......Page 420
15. Parallel Online Analytic Processing (OLAP) and Business Intelligence......Page 422
15.1. Parallel Multidimensional Analysis......Page 423
15.2.1. Analysis of Basic Single ROLLUP Queries......Page 426
15.2.2. Analysis of Multiple ROLLUP Queries......Page 430
15.2.3. Analysis of Partial ROLLUP Queries......Page 432
15.3. Parallelization of CUBE Queries......Page 433
15.3.1. Analysis of Basic CUBE Queries......Page 434
15.3.2. Analysis of Partial CUBE Queries......Page 437
15.3.3. Parallelization Without Using CUBE......Page 438
15.4. Parallelization of Top-N and Ranking Queries......Page 439
15.5. Parallelization of Cume_Dist Queries......Page 440
15.6. Parallelization of NTILE and Histogram Queries......Page 441
15.7. Parallelization of Moving Average and Windowing Queries......Page 443
15.9. Bibliographical Notes......Page 445
15.10. Exercises......Page 446
16. Parallel Data Mining—Association Rules and Sequential Patterns......Page 448
16.1. From Databases To Data Warehousing To Data Mining: A Journey......Page 449
16.2.1. Data Mining Tasks......Page 452
16.2.2. Querying vs. Mining......Page 454
16.2.3. Parallelism in Data Mining......Page 457
16.3. Parallel Association Rules......Page 461
16.3.1. Association Rules: Concepts......Page 462
16.3.2. Association Rules: Processes......Page 465
16.3.3. Association Rules: Parallel Processing......Page 469
16.4. Parallel Sequential Patterns......Page 471
16.4.1. Sequential Patterns: Concepts......Page 473
16.4.2. Sequential Patterns: Processes......Page 477
16.4.3. Sequential Patterns: Parallel Processing......Page 480
16.6. Bibliographical Notes......Page 482
16.7. Exercises......Page 483
17.1.1. Clustering......Page 485
17.1.2. Classification......Page 486
17.2.1. Clustering: Concepts......Page 488
17.2.2. k-Means Algorithm......Page 489
17.2.3. Parallel k-Means Clustering......Page 492
17.3.1. Decision Tree Classification: Structures......Page 498
17.3.2. Decision Tree Classification: Processes......Page 501
17.3.3. Decision Tree Classification: Parallel Processing......Page 509
17.4. Summary......Page 516
17.6. Exercises......Page 519
Permissions......Page 522
List of Conferences and Journals......Page 528
Bibliography......Page 532
Index......Page 562