This textbook covers recent developments in processor architecture and parallel hardware. It provides detailed descriptions of the parallel programming techniques needed to develop efficient programs for multicore processors as well as for parallel cluster systems and supercomputers.
The book consists of three main parts, covering the central areas of parallel computing: the architecture of parallel systems, parallel programming models and environments, and the implementation of efficient application algorithms. The emphasis lies on the parallel programming techniques needed for different architectures. The first part gives an overview of the architecture of parallel systems, including cache and memory organization, interconnection networks, and routing and switching techniques, as well as technologies relevant for modern and future multicore processors. Issues of power and energy consumption are also covered.
The second part presents parallel programming models, performance models, and parallel programming environments for message passing and shared memory, including the Message Passing Interface (MPI), Pthreads, Java threads, and OpenMP. For each of these environments, the book introduces basic concepts as well as more advanced programming methods, enabling the reader to write and run semantically correct and computationally efficient parallel programs. Parallel design patterns, such as pipelining, client-server, and task pools, are presented for the different environments to illustrate parallel programming techniques and to facilitate the implementation of efficient parallel programs for a wide variety of application areas. Performance models and techniques for runtime analysis are described in detail, as they are a prerequisite for achieving efficiency and high performance. A separate chapter gives a detailed description of the architecture of GPUs and introduces programming approaches for general purpose GPUs, concentrating on CUDA and OpenCL. Programming examples demonstrate the use of the specific programming techniques introduced.
The third part applies the parallel programming techniques from the second part to representative algorithms from scientific computing. The emphasis lies on basic methods for solving systems of linear equations, which play an important role in many scientific simulations. The focus of the presentation is the analysis of the algorithmic structure of the different algorithms, which is the basis for their parallelization, rather than the mathematical properties of the solution methods. For each algorithm, the book discusses different parallelization variants, using different methods and strategies.
The main goal of this book is to present parallel programming techniques that can be applied in many situations and application areas and to enable the reader to develop correct and efficient parallel programs. Many example programs and exercises are provided to support this goal and to show how the techniques can be applied to further applications. The book can be used both as a textbook for students and as a reference for professionals. Its material has been used in courses on parallel programming at different universities for many years.
Author(s): Thomas Rauber, Gudula Rünger
Edition: 3
Publisher: Springer
Year: 2023
Language: English
Pages: 563
1 Introduction
1.1 Classical Use of Parallelism
1.2 Parallelism in Today’s Hardware
1.3 Basic Concepts of parallel programming
1.4 Overview of the Book
2 Parallel Computer Architecture
2.1 Processor Architecture and Technology Trends
2.2 Power and Energy Consumption of Processors
2.3 Memory access times
2.3.1 DRAM access times
2.3.2 Multithreading for hiding memory access times
2.3.3 Caches for reducing the average memory access time
2.4 Flynn’s Taxonomy of Parallel Architectures
2.5 Memory Organization of Parallel Computers
2.5.1 Computers with Distributed Memory Organization
2.5.2 Computers with Shared Memory Organization
2.6 Thread-Level Parallelism
2.6.1 Simultaneous Multithreading
2.6.2 Multicore Processors
2.6.3 Architecture of Multicore Processors
2.7 Interconnection Networks
2.7.1 Properties of Interconnection Networks
2.7.2 Direct Interconnection Networks
2.7.3 Embeddings
2.7.4 Dynamic Interconnection Networks
2.8 Routing and Switching
2.8.1 Routing Algorithms
2.8.2 Routing in the Omega Network
2.8.3 Switching
2.8.4 Flow control mechanisms
2.9 Caches and Memory Hierarchy
2.9.1 Characteristics of Caches
2.9.2 Write Policy
2.9.3 Cache coherency
2.9.4 Memory consistency
2.10 Examples for hardware parallelism
2.10.1 Intel Cascade Lake and Ice Lake Architectures
2.10.2 Top500 list
2.11 Exercises for Chapter 2
3 Parallel Programming Models
3.1 Models for parallel systems
3.2 Parallelization of programs
3.3 Levels of parallelism
3.3.1 Parallelism at instruction level
3.3.2 Data parallelism
3.3.3 Loop parallelism
3.3.4 Functional parallelism
3.3.5 Explicit and implicit representation of parallelism
3.3.6 Parallel programming patterns
3.4 SIMD Computations
3.4.1 Execution of vector operations
3.4.2 SIMD instructions
3.5 Data distributions for arrays
3.5.1 Data distribution for one-dimensional arrays
3.5.2 Data distribution for two-dimensional arrays
3.5.3 Parameterized data distribution
3.6 Information exchange
3.6.1 Shared variables
3.6.2 Communication operations
3.7 Parallel matrix-vector product
3.7.1 Parallel computation of scalar products
3.7.2 Parallel computation of the linear combinations
3.8 Processes and Threads
3.8.1 Processes
3.8.2 Threads
3.8.3 Synchronization mechanisms
3.8.4 Developing efficient and correct thread programs
3.9 Further parallel programming approaches
3.10 Exercises for Chapter 3
4 Performance Analysis of Parallel Programs
4.1 Performance Evaluation of Computer Systems
4.1.1 Evaluation of CPU Performance
4.1.2 MIPS and MFLOPS
4.1.3 Performance of Processors with a Memory Hierarchy
4.1.4 Benchmark Programs
4.2 Performance Metrics for Parallel Programs
4.2.1 Parallel Runtime and Cost
4.2.2 Speedup and Efficiency
4.2.3 Weak and Strong Scalability
4.3 Energy Measurement and Energy Metrics
4.3.1 Performance and Energy Measurement Techniques
4.3.2 Modeling of Power and Energy Consumption for DVFS
4.3.3 Energy Metrics for Parallel Programs
4.4 Asymptotic Times for Global Communication
4.4.1 Implementing Global Communication Operations
4.4.2 Communication Operations on a Hypercube
4.4.3 Communication Operations on a Complete Binary Tree
4.5 Analysis of Parallel Execution Times
4.5.1 Parallel Scalar Product
4.5.2 Parallel Matrix-vector Product
4.6 Parallel Computational Models
4.6.1 PRAM Model
4.6.2 BSP Model
4.6.3 LogP Model
4.7 Loop Scheduling and Loop Tiling
4.7.1 Loop Scheduling
4.7.2 Loop Tiling
4.8 Exercises for Chapter 4
5 Message-Passing Programming
5.1 Introduction to MPI
5.1.1 MPI point-to-point communication
5.1.2 Deadlocks with Point-to-point Communications
5.1.3 Nonblocking Point-to-Point Operations
5.1.4 Communication modes
5.2 Collective Communication Operations
5.2.1 Collective Communication in MPI
5.2.2 Deadlocks with Collective Communication
5.2.3 Nonblocking Collective Communication Operations
5.3 Process Groups and Communicators
5.3.1 Process Groups in MPI
5.3.2 Process Topologies
5.3.3 Timings and aborting processes
5.4 Advanced topics
5.4.1 Dynamic Process Generation and Management
5.4.2 One-sided communication
5.5 Exercises for Chapter 5
6 Thread Programming
6.1 Programming with Pthreads
6.1.1 Creating and Merging Threads
6.1.2 Thread Coordination with Pthreads
6.1.3 Condition Variables
6.1.4 Extended Lock Mechanism
6.1.5 One-time initialization
6.2 Parallel programming patterns with Pthreads
6.2.1 Implementation of a Task Pool
6.2.2 Parallelism by Pipelining
6.2.3 Implementation of a Client-Server Model
6.3 Advanced Pthread features
6.3.1 Thread Attributes and Cancellation
6.3.2 Thread Scheduling with Pthreads
6.3.3 Priority Inversion
6.3.4 Thread-specific Data
6.4 Java Threads
6.4.1 Thread Generation in Java
6.4.2 Synchronization of Java Threads
6.4.3 Wait and Notify
6.4.4 Extended Synchronization Patterns
6.4.5 Thread Scheduling in Java
6.4.6 Package java.util.concurrent
6.5 OpenMP
6.5.1 Compiler directives
6.5.2 Execution environment routines
6.5.3 Coordination and synchronization of threads
6.5.4 OpenMP task model
6.6 Exercises for Chapter 6
7 General Purpose GPU Programming
7.1 The Architecture of GPUs
7.2 Introduction to CUDA programming
7.3 Synchronization and Shared Memory
7.4 CUDA Thread Scheduling
7.5 Efficient Memory Access and Tiling Technique
7.6 Introduction to OpenCL
7.7 Exercises for Chapter 7
8 Algorithms for Systems of Linear Equations
8.1 Gaussian Elimination
8.1.1 Gaussian Elimination and LU Decomposition
8.1.2 Parallel Row-Cyclic Implementation
8.1.3 Parallel Implementation with Checkerboard Distribution
8.1.4 Analysis of the Parallel Execution Time
8.2 Direct Methods for Linear Systems with Banded Structure
8.2.1 Discretization of the Poisson Equation
8.2.2 Tridiagonal Systems
8.2.3 Generalization to Banded Matrices
8.2.4 Solving the Discretized Poisson Equation
8.3 Iterative Methods for Linear Systems
8.3.1 Standard Iteration Methods
8.3.2 Parallel implementation of the Jacobi Iteration
8.3.3 Parallel Implementation of the Gauss-Seidel Iteration
8.3.4 Gauss-Seidel Iteration for Sparse Systems
8.3.5 Red-black Ordering
8.4 Conjugate Gradient Method
8.4.1 Sequential CG method
8.4.2 Parallel CG Method
8.5 Cholesky Factorization for Sparse Matrices
8.5.1 Sequential Algorithm
8.5.2 Storage Scheme for Sparse Matrices
8.5.3 Implementation for Shared Variables
8.6 Exercises for Chapter 8
References
Index