The state of the art of high-performance computing

Prominent researchers from around the world have gathered to present the state-of-the-art techniques and innovations in high-performance computing (HPC), including:

* Programming models for parallel computing: graph-oriented programming (GOP), OpenMP, the stages and transformations (SAT) approach, the bulk-synchronous parallel (BSP) model, the Message Passing Interface (MPI), and Cilk
* Architectural and system support, featuring the code tiling compiler technique, the MigThread application-level migration and checkpointing package, a new partial cache-line prefetching scheme, MPI atomicity semantics, a new "receiver makes right" data conversion method, and lessons learned from applying reconfigurable computing to HPC
* Scheduling and resource management: issues with heterogeneous systems, bus saturation effects on SMPs, genetic algorithms for distributed computing, and novel task-scheduling algorithms
* Clusters and grid computing: design requirements, grid middleware, distributed virtual machines, data grid services and performance-boosting techniques, security issues, and open issues
* Peer-to-peer computing (P2P), including the proposed hybrid periodical flooding (HPF) search mechanism and routing protocols for improved routing performance
* Wireless and mobile computing, featuring discussions of implementing the Gateway Location Register (GLR) concept in 3G cellular networks, maximizing network longevity, and comparisons of QoS-aware scatternet scheduling algorithms
* High-performance applications, including partitioners, running Bag-of-Tasks applications on grids, using low-cost clusters to meet high-demand applications, and advanced convergent architectures and protocols

High-Performance Computing: Paradigm and Infrastructure is an invaluable compendium for engineers, IT professionals, and researchers and students of computer science and applied mathematics.
Author(s): Laurence T. Yang, Minyi Guo
Publisher: Wiley-Interscience
Year: 2005
Language: English
Commentary: 47913
Pages: 818
HIGH-PERFORMANCE COMPUTING......Page 3
Contents......Page 7
Preface......Page 25
Contributors......Page 33
1.1 Introduction......Page 41
1.2 GOP Model and ClusterGOP Architecture......Page 43
1.2.1 The ClusterGOP Architecture......Page 44
1.3 VisualGOP......Page 46
1.4 The ClusterGOP Library......Page 49
1.5 MPMD Programming Support......Page 50
1.6.1 Support for Program Development......Page 52
1.6.2 Performance of ClusterGOP......Page 54
1.7 Summary......Page 56
2.1 Introduction......Page 61
2.2.1 Early Parallel Processing Platforms......Page 62
2.2.2 Current HPC Systems......Page 63
2.3 HPC Programming Models: The First Generation......Page 64
2.3.1 The Message Passing Interface (MPI)......Page 65
2.3.2 High Performance Fortran (HPF)......Page 66
2.4.1 OpenMP......Page 70
2.4.2 Other Shared-Memory APIs......Page 72
2.4.3 Is a Standard High-Level API for HPC in Sight?......Page 74
2.5 OpenMP for DMPs......Page 75
2.5.1 A Basic Translation to GA......Page 76
2.5.2 Implementing Sequential Regions......Page 77
2.5.3 Data and Work Distribution in GA......Page 78
2.5.4 Irregular Computation Example......Page 79
2.6 Experiments with OpenMP on DMPs......Page 83
2.7 Conclusions......Page 84
3.1 Introduction......Page 91
3.2.1 Motivation and Methodology......Page 92
3.2.2 Abstraction View: Basic Skeletons and Compositions......Page 93
3.2.4 SAT: Combining Abstraction with Performance......Page 94
3.3.1 The H Skeleton and Its Standard Implementation......Page 96
3.3.2 Transformations for Performance View......Page 98
3.4 Case Study: Maximum Segment Sum (MSS)......Page 99
3.5.1 Performance Predictability......Page 102
3.6 Conclusions and Related Work......Page 104
4.1 The BSP Model......Page 109
4.1.2 BSP Versus Traditional Parallelism......Page 111
4.1.4 Memory Management......Page 112
4.1.6 Subset Synchronization......Page 113
4.1.7 Other Variants of BSP......Page 114
4.2.2 Beyond BSPlib......Page 115
4.3 Conclusions......Page 116
5.1 Introduction......Page 121
5.1.3 Terminology......Page 122
5.2.1 Programs......Page 123
5.2.2 Test Bed......Page 126
5.3.1 Fibonacci......Page 128
5.3.2 Traveling Salesman Problem......Page 129
5.3.4 Matrix Multiplication......Page 130
5.3.5 Finite Differencing......Page 132
5.4 Conclusion......Page 134
6.1 Introduction......Page 139
6.2.1 Parallelism Definition......Page 141
6.2.2 Thread Groups......Page 143
6.2.3 Evaluation of the Proposal......Page 144
6.3.1 Precedence Relations......Page 148
6.3.2 Evaluation of the Proposal......Page 151
6.4 Summary......Page 153
7.1 Introduction......Page 157
7.2.1 Quads......Page 158
7.3.1 Data Distribution......Page 160
7.3.2 Computation Division......Page 161
7.4.1 Controlling the DSEs......Page 162
7.4.2 Extensions for DSEs......Page 163
7.5 Optimization for OpenMP......Page 165
7.5.2 Double Buffer and Data Prefetching......Page 166
7.5.3 Data Privatization and Other Functions......Page 167
7.6 Implementation......Page 168
7.7 Performance Evaluation......Page 170
7.8 Conclusions......Page 173
8.1 Introduction......Page 175
8.2.1 Data Distribution and Data Alignment......Page 177
8.2.2 Data Temporal Dependency and Spatial Locality......Page 180
8.2.3 Computation Decomposition and Scheduling......Page 183
8.3 Compiling Regular Programs on DMPCs......Page 185
8.4 Compiler and Run-Time Support for Irregular Programs......Page 191
8.4.1 The Inspectors and Executors......Page 192
8.4.2 The ARF Compiler......Page 194
8.4.3 Language Interfaces for Data Partitioners......Page 200
8.5.1 Data Partitioning and Reordering......Page 202
8.5.2 Solving Sparse Linear Systems......Page 204
8.6 Related Works......Page 209
8.7 Concluding Remarks......Page 213
9.1 Introduction......Page 223
9.2.1 Dynamic Value Representation......Page 224
9.2.2 Partial Cache-Line Prefetching......Page 226
9.3.1 Cache Organization......Page 229
9.3.2 Dynamic Value Conversion......Page 230
9.3.3 Cache Operation......Page 231
9.4.1 Experimental Setup......Page 233
9.4.2 Memory Traffic......Page 234
9.4.3 Execution Time......Page 235
9.4.4 Cache Miss Comparison......Page 236
9.5 Related Work......Page 239
9.6 Conclusion......Page 240
10.1 Introduction......Page 243
10.2 Concurrent Overlapping I/O......Page 244
10.2.2 MPI Atomicity Semantics......Page 245
10.3 Implementation Strategies......Page 246
10.3.1 Row- and Column-Wise 2D Array Partitioning......Page 248
10.3.2 Byte-Range File Locking......Page 249
10.3.3 Processor Handshaking......Page 251
10.3.4 Scalability Analysis......Page 253
10.5 Summary......Page 254
11.1 Introduction......Page 259
11.3 Code Tiling......Page 260
11.4 Data Tiling......Page 263
11.4.1 A Sufficient Condition......Page 264
11.4.2 Constructing a Data Tiling for Two-D SOR......Page 266
11.4.3 Constructing a Data Tiling for Equation (1.1)......Page 268
11.5 Finding Optimal Tile Sizes......Page 270
11.6 Experimental Results......Page 272
11.7 Related Work......Page 278
11.8 Conclusion......Page 279
12.1 Introduction......Page 281
12.2.1 MigThread......Page 282
12.2.2 Migration and Checkpointing Safety......Page 283
12.3.1 Data-Conversion Issues......Page 284
12.4 Coarse-Grain Tagged RMR in MigThread......Page 285
12.4.1 Tagging and Padding Detection......Page 286
12.4.2 Data Restoration......Page 287
12.4.4 Address Resizing......Page 290
12.5 Microbenchmarks and Experiments......Page 291
12.6 Related Work......Page 298
12.7 Conclusions and Future Work......Page 299
13.1 Background......Page 301
13.2.1 Prediction Method......Page 303
13.2.3 Static Algorithm Selection by Profiling......Page 305
13.2.4 Dynamic Algorithm Switching......Page 306
13.3 Implementation of the Method in the MPI Libraries......Page 307
13.4.1 Evaluation Environments......Page 309
13.4.2 Basic Characteristics of the Receiving-Message Prediction Method......Page 310
13.4.3 Effects of Profiling......Page 312
13.5 Concluding Remarks......Page 313
14.1 Introduction......Page 317
14.2 High Performance Computing with Cluster Computing......Page 319
14.3 Reconfigurable Computing with FPGAs......Page 320
14.4 DRMC: A Distributed Reconfigurable Metacomputer......Page 322
14.4.2 Metacomputer Overview......Page 323
14.4.3 Hardware Setup......Page 324
14.4.5 Programming the RC1000 Board......Page 325
14.5 Algorithms Suited to Implementation on FPGAs/DRMC......Page 326
14.6 Algorithms Not Suited to Implementation on FPGAs/DRMC......Page 328
14.7 Summary......Page 331
15.1 Introduction......Page 335
15.3 System Model and Problem Statement......Page 337
15.4 Resource Allocation to Maximize System Throughput......Page 339
15.4.1 A Linear Programming Formulation......Page 340
15.4.2 An Extended Network Flow Representation......Page 342
15.5 Experimental Results......Page 344
15.6 Conclusion......Page 351
16.1 Introduction......Page 353
16.2 Related Work......Page 354
16.3 The Implications of Bus Bandwidth for Application Performance......Page 356
16.4 Scheduling Policies for Preserving Bus Bandwidth......Page 362
16.5 Experimental Evaluation......Page 366
16.6 Conclusions......Page 369
17.1 Introduction......Page 373
17.2.1 A Performance Model......Page 374
17.2.3 A Schedule......Page 375
17.3 Related Works......Page 377
17.3.2 The Case of Invariable Processor Speed......Page 378
17.4 The Proposed Algorithm RR......Page 379
17.5 The Performance Guarantee of the Proposed Algorithm......Page 384
17.6 Conclusion......Page 387
18.1 Introduction......Page 389
18.2.1 The Fitness Function......Page 390
18.3 The Algorithm......Page 391
18.4 Illustrative Examples......Page 392
18.5 Discussions and Conclusion......Page 396
19.1 Introduction......Page 401
19.2 Problem Definition......Page 402
19.3 The Suggested Algorithm......Page 404
19.3.1 Listing Mechanism......Page 405
19.3.2 Duplication Mechanism......Page 406
19.3.3 Algorithm Complexity Analysis......Page 408
19.5 Experimental Results and Discussion......Page 410
19.5.2 Random Graph Generator......Page 411
19.5.3 Performance Results......Page 412
19.5.4 Applications......Page 413
19.5.5 Performance of Parents-Selection Methods......Page 416
19.5.6 Performance of the Machine Assignment Mechanism......Page 417
19.6 Conclusion......Page 419
20.1 Introduction......Page 421
20.2 Related Work......Page 422
20.3 Information Acquisition......Page 424
20.4 Linux Process Classification Model......Page 426
20.4.1 Training Algorithm......Page 428
20.4.2 Labeling Algorithm......Page 430
20.4.3 Classification Model Implementation......Page 431
20.5 Results......Page 432
20.6 Evaluation of the Model Intrusion on the System Performance......Page 434
20.7 Conclusions......Page 437
21.1 Introduction......Page 443
21.2 Background......Page 444
21.3 Desktop Grid Middleware Considerations......Page 445
21.4 Representative Desktop Grid Systems......Page 446
21.5 Alchemi Desktop Grid Framework......Page 448
21.5.1 Application Models......Page 449
21.5.2 Distributed Components......Page 451
21.5.3 Security......Page 453
21.6.1 Overview......Page 456
21.6.2 Grid Application Lifecycle......Page 459
21.7.1 Stand-Alone Alchemi Desktop Grid......Page 461
21.7.2 Alchemi as Node of a Cross-Platform Global Grid......Page 462
21.8 Summary and Future Work......Page 465
22.1 Introduction......Page 471
22.2 Overview of Grid Middleware Systems......Page 473
22.3 UNICORE......Page 476
22.3.1 Overview of Job Creation, Submission, and Execution in UNICORE Middleware......Page 478
22.4 Globus......Page 480
22.4.1 GSI Security Layer......Page 481
22.4.3 Information Services......Page 482
22.5 Legion......Page 483
22.6 Gridbus......Page 485
22.6.2 Libra......Page 486
22.6.6 Web Portals......Page 487
22.7 Implementation of UNICORE Adaptor for Gridbus Broker......Page 488
22.7.1 UnicoreComputServer......Page 489
22.7.2 UnicoreJobWrapper......Page 491
22.7.4 UnicoreJobOutput......Page 492
22.8 Comparison of Middleware Systems......Page 493
22.9 Summary......Page 495
23.1.1 Java......Page 499
23.1.2 Java Virtual Machine......Page 500
23.1.3 Programming Paradigms for Parallel Java Computing......Page 501
23.2.2 Solutions......Page 503
23.3.1 Overview......Page 504
23.3.2 Global Object Space......Page 506
23.3.3 Transparent Java Thread Migration......Page 509
23.4.1 Effects of Optimizations in GOS......Page 512
23.4.2 Thread Migration Overheads......Page 514
23.4.3 Application Benchmark......Page 515
23.5 Related Work......Page 516
23.5.3 Distributed JVM......Page 517
23.6 Summary......Page 518
24.1 Introduction......Page 521
24.2.1 Metadata Services for Data Grid Systems......Page 522
24.2.3 Performance Measurement in Data Grid......Page 523
24.3.1 Data Replication......Page 524
24.3.2 Scheduling in Data Grid Systems......Page 526
24.3.3 Data Movement......Page 527
24.5.1 Application Replication......Page 528
24.5.3 Asynchronized Data Movement......Page 529
24.6 Conclusions......Page 530
25.1 Introduction......Page 535
25.2 Application I/O......Page 536
25.3.2 Parallel Virtual File System (PVFS)......Page 537
25.4 Standard Unix & Parallel I/O......Page 538
25.5 Example: Seismic Imaging......Page 543
25.5.1 I/O......Page 546
25.5.2 Performance Analysis of Seismic Migration......Page 547
25.6 Discussion and Conclusion......Page 548
26.2 Hardware and Software Setup......Page 551
26.3.1 Performance Model of HPL......Page 552
26.3.2 BLAS Library......Page 554
26.3.4 Results Using Gigabit Ethernet......Page 556
26.4 Performance Costs and Benefits......Page 559
27.1 Introduction......Page 563
27.2 MPI Implementation of the Internode Domain Decomposition......Page 564
27.3 Integration of the Internode Domain Decomposition with Intranode Particle Decomposition Strategies......Page 568
27.4 The MPICH-G2 Implementation......Page 569
27.5 Conclusions......Page 572
28.1 Motivation For Evaluating Trust......Page 575
28.2.1 The Communication Model......Page 577
28.2.2 Trust Definition......Page 578
28.2.3 Service Provider Trust......Page 579
28.2.4 Service Consumer Trust......Page 581
28.2.5 Limitations of Current Approaches......Page 583
28.3 Evidence-Aware Trust Model......Page 584
28.4.2 The SLA Negotiation Phase......Page 585
28.4.3 The Trust Verification Phase (TVP)......Page 587
28.5 Conclusion......Page 588
29.1 Introduction......Page 591
29.2 Design Requirements......Page 592
29.3.1 Gnutella......Page 594
29.3.2 Freenet......Page 595
29.3.3 Optimizations......Page 596
29.4 Structured P2P Systems......Page 597
29.4.2 Routing and Joining Process......Page 598
29.4.3 Discussion......Page 601
29.4.4 Revisiting Design Requirements......Page 603
29.5.1 Keyword Search......Page 604
29.5.2 Search by Attribute-Value Pairs......Page 607
29.5.3 Range Search......Page 609
29.6 Summary......Page 610
30.1 Introduction......Page 613
30.2 Search Mechanisms......Page 614
30.2.1 Uniformed Selection of Relay Neighbors......Page 615
30.2.3 Other Approaches......Page 616
30.2.4 Partial Coverage Problem......Page 617
30.3.1 Periodical Flooding (PF)......Page 618
30.3.2 Hybrid Periodical Flooding......Page 619
30.4.2 Simulation Setup......Page 621
30.5.1 Partial Coverage Problem......Page 622
30.5.2 Performance of Random PF......Page 623
30.5.4 Alleviating the Partial Coverage Problem......Page 627
30.6 Conclusion......Page 629
31.2 Hierarchical P2P Architecture......Page 633
31.2.2 Distributed Binning Scheme......Page 634
31.2.4 Hierarchy Depth......Page 635
31.3.1 Data Structures......Page 636
31.3.2 Routing Algorithm......Page 637
31.3.3 Node Operations......Page 639
31.4.2 Routing Costs......Page 640
31.4.3 Routing Cost Distribution......Page 641
31.4.4 Landmark Nodes Effects......Page 643
31.4.5 Hierarchy Depth Effect......Page 646
31.5 Related Works......Page 647
31.6 Summary......Page 649
32.1 Introduction......Page 651
32.2.1 Autonomic Group Agent......Page 652
32.2.2 Views......Page 653
32.3 Functions of Group Protocol......Page 654
32.4.1 Local Protocol Instance......Page 657
32.4.2 Global Protocol Instance......Page 658
32.5 Retransmission......Page 660
32.5.2 Change of Retransmission Class......Page 661
32.5.3 Evaluation......Page 662
32.6 Concluding Remarks......Page 665
33.1 Introduction......Page 667
33.2.1 Movement-Based Location Management in 3G Cellular Networks......Page 668
33.3 The Cache-Enhanced Location Management Scheme......Page 670
33.3.2 The Location Update Scheme in Detail......Page 671
33.3.3 Paging Scheme......Page 673
33.4 Simulation Results and Analysis......Page 675
33.4.2 Experiment 1: Comparison of the Four Schemes of Location Management......Page 676
33.4.3 Experiment 2: The Effect of Cache Size on the Improved Movement-Based Location Management Scheme......Page 678
33.5 Conclusion......Page 680
34.1 Introduction......Page 683
34.2 Energy Consumption Model in WANETs......Page 685
34.3 Definitions of Maximum Multicast Lifetime......Page 686
34.4.1 MMLM......Page 691
34.4.2 MBLS......Page 694
34.4.3 A Distributed MMLS Algorithm: L-REMiT......Page 695
34.5.1 MMLM......Page 696
34.5.2 Some Heuristic Solutions......Page 698
34.6 Summary......Page 699
35.1 Introduction......Page 701
35.2 Perfect Scheduling Problem for Bipartite Scatternet......Page 703
35.3 Perfect Assignment Scheduling Algorithm for Bipartite Scatternets......Page 705
35.4 Distributed, Local, and Incremental Scheduling Algorithms......Page 709
35.5 Performance and QoS Analysis......Page 712
35.5.1 Slot Assignment Algorithm Efficiency......Page 713
35.5.2 Bandwidth......Page 716
35.5.3 Delay and Jitter......Page 717
35.6 Conclusion......Page 720
36.1 Introduction......Page 723
36.2.2 Configuration Graph......Page 725
36.2.5 Partitioning Metrics......Page 726
36.3.1 MinEX Data Structures......Page 727
36.3.2 Contraction......Page 728
36.3.3 Partitioning......Page 729
36.3.4 Reassignment Filter......Page 730
36.3.5 Refinement......Page 731
36.4 N-Body Application......Page 732
36.4.2 Partition Graph Construction......Page 733
36.5 Experimental Study......Page 734
36.5.1 Multiple Time Step Test......Page 735
36.5.3 Partitioner Speed Comparisons......Page 736
36.5.4 Partitioner Quality Comparisons......Page 737
36.6 Conclusions......Page 740
37.1 Introduction......Page 745
37.2 Design Goals......Page 746
37.3 Architecture......Page 747
37.4 Working Environment......Page 750
37.5 Scheduling......Page 752
37.6 Implementation......Page 754
37.7 Performance Evaluation......Page 756
37.7.2 Fighting AIDS......Page 757
37.7.3 Gauging the Home Machine Bottleneck......Page 758
37.8 Conclusions and Future Work......Page 760
38.1 Introduction......Page 765
38.1.1 An Efficient Sequential Algorithm......Page 766
38.2 Computing in Parallel......Page 767
38.3 Experimental Results......Page 769
38.4 Conclusion......Page 770
39.1 Introduction......Page 773
39.2 System Requirements and Design......Page 775
39.2.1 Hardware Configuration......Page 777
39.2.3 Database Files Refreshing Scheme......Page 778
39.2.4 Job Parsing, Scheduling, and Processing......Page 779
39.2.5 Integration with SmedDb......Page 781
39.4 Conclusions......Page 783
40.1 Introduction......Page 785
40.2 The B&B Parallel Application......Page 786
40.3 The OpenMP Extension......Page 789
40.4 Experimental Results......Page 793
40.5 Conclusions......Page 795
41.1 Introduction......Page 797
41.1.1 Network Architecture......Page 798
41.1.2 Service Features......Page 801
41.1.3 Market Targets......Page 803
Index......Page 805