Beowulf clusters, which exploit mass-market PC hardware and software in conjunction with cost-effective commercial network technology, are becoming the platform for many scientific, engineering, and commercial applications. With growing popularity has come growing complexity. Addressing that complexity, Beowulf Cluster Computing with Linux and Beowulf Cluster Computing with Windows provide system users and administrators with the tools they need to run the most advanced Beowulf clusters. The book appears in both Linux and Windows versions in order to reach the entire PC cluster community, which is divided into two distinct camps according to the node operating system. Each book consists of three stand-alone parts. The first provides an introduction to the underlying hardware technology, assembly, and configuration. The second part offers a detailed presentation of the major parallel programming libraries. The third, and largest, part describes software infrastructures and tools for managing cluster resources. This includes some of the most popular software packages for distributed task scheduling, as well as tools for monitoring and administering system resources and user accounts. Approximately 75% of the material in the two books is shared, with the other 25% pertaining to the specific operating system. Most of the chapters include text specific to the operating system. The Linux volume includes a discussion of parallel file systems.
Author(s): Thomas Sterling
Edition: 1
Year: 2001
Language: English
Pages: 533
Contents......Page 8
Series Foreword......Page 20
Foreword......Page 22
Preface......Page 30
1.1 Definitions and Taxonomy......Page 38
1.2 Opportunities and Advantages......Page 40
1.3 A Short History......Page 43
1.4 Elements of a Cluster......Page 45
1.5 Description of the Book......Page 47
I: Enabling Technologies......Page 50
2 An Overview of Cluster Computing......Page 52
2.1 A Taxonomy of Parallel Computing......Page 53
2.2 Hardware System Structure......Page 56
2.4 Resource Management......Page 62
2.5 Distributed Programming......Page 64
2.6 Conclusions......Page 66
3 Node Hardware......Page 68
3.1 Overview of a Beowulf Node......Page 69
3.2 Processors......Page 75
3.3 Motherboard......Page 78
3.4 Memory......Page 80
3.5 BIOS......Page 83
3.6 Secondary Storage......Page 84
3.7 PCI Bus......Page 86
3.9 Boxes, Shelves, Piles, and Racks......Page 87
3.10 Node Assembly......Page 89
4.1 What Is Linux?......Page 98
4.2 The Linux Kernel......Page 108
4.3 Pruning Your Beowulf Node......Page 119
4.4 Other Considerations......Page 123
4.5 Final Tuning with /proc......Page 125
4.6 Conclusions......Page 129
5.1 Interconnect Technologies......Page 132
5.2 A Detailed Look at Ethernet......Page 137
5.3 Network Practicalities: Interconnect Choice......Page 143
6.1 TCP/IP......Page 150
6.2 Sockets......Page 153
6.3 Higher-Level Protocols......Page 157
6.4 Distributed File Systems......Page 163
6.5 Remote Command Execution......Page 165
7.1 System Access Models......Page 168
7.2 Assigning Names......Page 170
7.3 Installing Node Software......Page 172
7.4 Basic System Administration......Page 177
7.5 Avoiding Security Compromises......Page 181
7.6 Job Scheduling......Page 186
7.7 Some Advice on Upgrading Your Software......Page 187
8.1 Metrics......Page 188
8.3 The LINPACK Benchmark......Page 191
8.4 The NAS Parallel Benchmark Suite......Page 193
II: Parallel Programming......Page 196
9 Parallel Programming with MPI......Page 198
9.1 Hello World in MPI......Page 199
9.2 Manager/Worker Example......Page 206
9.3 Two-Dimensional Jacobi Example with One-Dimensional Decomposition......Page 211
9.4 Collective Operations......Page 215
9.6 Installing MPICH under Linux......Page 220
9.7 Tools......Page 229
9.9 MPI Routine Summary......Page 231
10.1 Dynamic Process Management in MPI......Page 236
10.2 Fault Tolerance......Page 239
10.3 Revisiting Mesh Exchanges......Page 241
10.4 Motivation for Communicators......Page 248
10.5 More on Collective Operations......Page 250
10.6 Parallel I/O......Page 252
10.7 Remote Memory Access......Page 258
10.8 Using C++ and Fortran 90......Page 261
10.9 MPI, OpenMP, and Threads......Page 263
10.10 Measuring MPI Performance......Page 264
10.12 MPI Routine Summary......Page 267
11.1 Overview......Page 274
11.3 Fork/Join......Page 279
11.4 Dot Product......Page 283
11.5 Matrix Multiply......Page 288
11.6 One-Dimensional Heat Equation......Page 294
11.7 Using PVM......Page 302
11.8 PVM Console Details......Page 306
11.9 Host File Options......Page 309
11.10 XPVM......Page 311
12 Fault-Tolerant and Adaptive Programs with PVM......Page 318
12.1 Considerations for Fault Tolerance......Page 319
12.2 Building Fault-Tolerant Parallel Applications......Page 320
12.3 Adaptive Programs......Page 326
III: Managing Clusters......Page 336
13.1 Goal of Workload Management Software......Page 338
13.2 Workload Management Activities......Page 339
14.1 Introduction to Condor......Page 344
14.2 Using Condor......Page 350
14.3 Condor Architecture......Page 369
14.4 Installing Condor under Linux......Page 373
14.5 Configuring Condor......Page 375
14.6 Administration Tools......Page 380
14.7 Cluster Setup Scenarios......Page 383
14.8 Conclusion......Page 387
15.1 Overview......Page 388
15.2 Installation and Initial Configuration......Page 389
15.3 Advanced Configuration......Page 390
15.4 Steering Workload and Improving Quality of Information......Page 402
15.6 Conclusions......Page 404
16.1 History of PBS......Page 406
16.2 Using PBS......Page 410
16.3 Installing PBS......Page 415
16.4 Configuring PBS......Page 416
16.5 Managing PBS......Page 423
16.6 Troubleshooting......Page 425
17.1 Introduction......Page 428
17.2 Using PVFS......Page 439
17.3 Administering PVFS......Page 449
17.4 Final Words......Page 466
18.1 Chiba City Configuration......Page 468
18.2 Chiba City Timeline......Page 479
18.3 Chiba City Software Environment......Page 484
18.4 Chiba City Use......Page 496
18.5 Final Thoughts......Page 497
19.1 Future Directions for Hardware Components......Page 500
19.2 Future Directions for Software Components......Page 502
19.3 Final Thoughts......Page 505
B: Annotated Reading List......Page 516
C: Annotated URLs......Page 518
References......Page 522