This textbook covers new developments in processor architecture and parallel hardware. It provides detailed descriptions of the parallel programming techniques that are necessary for developing efficient programs for multicore processors as well as for parallel cluster systems and supercomputers. The book is structured in three main parts, covering all areas of parallel computing: the architecture of parallel systems, parallel programming models and environments, and the implementation of efficient application algorithms. The emphasis lies on the parallel programming techniques needed for different architectures.

In particular, this third edition includes an extended update of the chapter on computer architecture and performance analysis, taking new developments such as energy consumption into consideration. The description of OpenMP has been extended and now also covers the OpenMP task concept. The chapter on message-passing programming has been extended and updated to include new MPI features such as extended reduction operations and non-blocking collective communication operations. The chapter on GPU programming has also been updated, and all other chapters have been carefully revised.

The main goal of this book is to present parallel programming techniques that can be used in many situations for many application areas and to enable the reader to develop correct and efficient parallel programs. Many example programs and exercises are provided to support this goal and to show how the techniques can be applied to further applications. The book can be used as a textbook for students as well as a reference book for professionals; its material has been used for courses in parallel programming at different universities for many years.
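To give a flavor of the non-blocking collective communication operations mentioned above, here is a minimal C sketch (not taken from the book) assuming an MPI-3 compliant library: a broadcast is started with MPI_Ibcast and completed later with MPI_Wait, so that independent computation could overlap with the communication. The broadcast value and the build command are purely illustrative.

  /* Minimal sketch of a non-blocking broadcast with MPI-3 (illustrative only).
     Build and run, e.g.:  mpicc ibcast_demo.c -o ibcast_demo && mpirun -np 4 ./ibcast_demo */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = (rank == 0) ? 42 : 0;   /* root process 0 provides the value */
    MPI_Request req;

    /* Start the broadcast; the call returns immediately. */
    MPI_Ibcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);

    /* ... independent computation could be placed here ... */

    /* Complete the collective operation before using the received value. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("Process %d received %d\n", rank, data);

    MPI_Finalize();
    return 0;
  }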
Author(s): Thomas Rauber; Gudula Rünger
Edition: 3
Publisher: Springer
Year: 2023
Language: English
Pages: 567
City: Cham
Tags: Distributed Programming; Grid Computing; Multithreading; Networking; Parallel Programming; Scientific Programming
Preface
Contents
Chapter 1 Introduction
1.1 Classical Use of Parallelism
1.2 Parallelism in Today’s Hardware
1.3 Basic Concepts of Parallel Programming
1.4 Overview of the Book
Chapter 2 Parallel Computer Architecture
2.1 Processor Architecture and Technology Trends
2.2 Power and Energy Consumption of Processors
2.3 Memory Access Times
2.3.1 DRAM Access Times
2.3.2 Multithreading for Hiding Memory Access Times
2.3.3 Caches for Reducing the Average Memory Access Time
2.4 Flynn’s Taxonomy of Parallel Architectures
2.5 Memory Organization of Parallel Computers
2.5.1 Computers with Distributed Memory Organization
2.5.2 Computers with Shared Memory Organization
2.6 Thread-Level Parallelism
2.6.1 Simultaneous Multithreading
2.6.2 Multicore Processors
2.6.3 Architecture of Multicore Processors
2.6.3.1 Homogeneous Multicore Processors
2.6.3.2 Heterogeneous Multicore Processors
2.6.3.3 Future Trends and Developments
2.7 Interconnection Networks
2.7.1 Properties of Interconnection Networks
2.7.2 Direct Interconnection Networks
2.7.3 Embeddings
2.7.3.1 Embedding a ring into a hypercube network
2.7.3.2 Embedding a 2-dimensional mesh into a hypercube network
2.7.3.3 Embedding of a d-dimensional mesh into a hypercube network
2.7.4 Dynamic Interconnection Networks
2.7.4.1 Bus networks
2.7.4.2 Crossbar networks
2.7.4.3 Multistage switching networks
2.7.4.4 Omega network
2.7.4.5 Butterfly network
2.7.4.6 Baseline network
2.7.4.7 Beneš network
2.7.4.8 Fat tree network
2.8 Routing and Switching
2.8.1 Routing Algorithms
2.8.1.1 Dimension-order routing
2.8.1.2 Deadlocks and routing algorithms
2.8.1.3 Source-based routing
2.8.1.4 Table-driven routing
2.8.1.5 Turn model routing
2.8.1.6 Virtual channels
2.8.2 Routing in the Omega Network
2.8.3 Switching
2.8.3.1 Message transmission between neighboring processors
2.8.3.2 Circuit switching
2.8.3.3 Packet switching
2.8.3.4 Cut-through routing
2.8.4 Flow Control Mechanisms
2.9 Caches and Memory Hierarchy
2.9.1 Characteristics of Caches
2.9.1.1 Cache size
2.9.1.2 Mapping of memory blocks to cache blocks
2.9.1.3 Direct mapped caches
2.9.1.4 Fully associative caches
2.9.1.5 Set associative caches
2.9.1.6 Block replacement methods
2.9.2 Write Policy
2.9.2.1 Write-through policy
2.9.2.2 Write-back policy
2.9.2.3 Number of caches
2.9.3 Cache Coherency
2.9.3.1 Snooping protocols
2.9.3.2 Write-back invalidation protocol
2.9.3.3 Directory-based cache coherence protocols
2.9.4 Memory Consistency
2.9.4.1 Sequential consistency
2.9.4.2 Relaxed consistency models
2.10 Examples for Hardware Parallelism
2.10.1 Intel Cascade Lake and Ice Lake Architectures
2.10.2 Top500 List
2.11 Exercises for Chapter 2
Chapter 3 Parallel Programming Models
3.1 Models for parallel systems
3.2 Parallelization of programs
3.3 Levels of parallelism
3.3.1 Parallelism at instruction level
3.3.2 Data parallelism
3.3.3 Loop parallelism
3.3.3.1 forall loop
3.3.3.2 dopar loop
3.3.4 Functional parallelism
3.3.5 Explicit and implicit representation of parallelism
3.3.5.1 Implicit parallelism
3.3.5.2 Explicit parallelism with implicit distribution
3.3.5.3 Explicit distribution
3.3.5.4 Explicit assignment to processors
3.3.5.5 Explicit communication and synchronization
3.3.6 Parallel programming patterns
3.3.6.1 Creation of processes or threads
3.3.6.2 Fork-join
3.3.6.3 Parbegin-parend
3.3.6.4 SPMD and SIMD
3.3.6.5 Master-Slave or Master-Worker
3.3.6.6 Client-Server
3.3.6.7 Pipelining
3.3.6.8 Task pools
3.3.6.9 Producer-Consumer
3.4 SIMD Computations
3.4.1 Execution of vector operations
3.4.2 SIMD instructions
3.5 Data distributions for arrays
3.5.1 Data distribution for one-dimensional arrays
3.5.2 Data distribution for two-dimensional arrays
3.5.3 Parameterized data distribution
3.6 Information exchange
3.6.1 Shared variables
3.6.2 Communication operations
3.6.2.1 A set of communication operations
3.6.2.2 Duality of communication operations
3.6.2.3 Hierarchy of communication operations
3.7 Parallel matrix-vector product
3.7.1 Parallel computation of scalar products
3.7.2 Parallel computation of the linear combinations
3.8 Processes and Threads
3.8.1 Processes
3.8.2 Threads
3.8.2.1 Basic concepts of threads
3.8.2.2 Execution models for threads
3.8.2.3 Thread states
3.8.2.4 Visibility of data
3.8.3 Synchronization mechanisms
3.8.3.1 Lock synchronization
3.8.3.2 Thread execution control
3.8.4 Developing efficient and correct thread programs
3.8.4.1 Number of threads and sequentialization
3.8.4.2 Deadlock
3.8.4.3 Memory access times and cache effects
3.9 Further parallel programming approaches
3.10 Exercises for Chapter 3
Chapter 4 Performance Analysis of Parallel Programs
4.1 Performance Evaluation of Computer Systems
4.1.1 Evaluation of CPU Performance
4.1.2 MIPS and MFLOPS
4.1.3 Performance of Processors with a Memory Hierarchy
4.1.4 Benchmark Programs
4.2 Performance Metrics for Parallel Programs
4.2.1 Parallel Runtime and Cost
4.2.2 Speedup and Efficiency
4.2.2.1 Speedup
4.2.2.2 Efficiency
4.2.2.3 Amdahl’s law
4.2.3 Weak and Strong Scalability
4.2.3.1 Software scalability
4.2.3.2 Performance behavior in practice
4.2.3.3 Gustafson’s law
4.2.3.4 Scaled speedup and convergence behavior
4.3 Energy Measurement and Energy Metrics
4.3.1 Performance and Energy Measurement Techniques
4.3.2 Modeling of Power and Energy Consumption for DVFS
4.3.3 Energy Metrics for Parallel Programs
4.4 Asymptotic Times for Global Communication
4.4.1 Implementing Global Communication Operations
4.4.1.1 Asymptotic notation
4.4.1.2 Complete graph
4.4.1.3 Linear array
4.4.1.4 Ring
4.4.1.5 Mesh
4.4.2 Communication Operations on a Hypercube
4.4.2.1 Single-broadcast operation
4.4.2.2 Multi-broadcast operation on a hypercube
4.4.2.3 Scatter operation
4.4.2.4 Total exchange
4.4.3 Communication Operations on a Complete Binary Tree
4.5 Analysis of Parallel Execution Times
4.5.1 Parallel Scalar Product
4.5.1.1 Linear array
4.5.1.2 Hypercube
4.5.2 Parallel Matrix-vector Product
4.5.2.1 Linear array
4.5.2.2 Hypercube
4.6 Parallel Computational Models
4.6.1 PRAM Model
4.6.2 BSP Model
4.6.3 LogP Model
4.7 Loop Scheduling and Loop Tiling
4.7.1 Loop Scheduling
4.7.1.1 Static loop scheduling
4.7.1.2 Dynamic loop scheduling
4.7.1.3 Loop Coalescing
4.7.1.4 Parallel execution of coalesced loops
4.7.2 Loop Tiling
4.8 Exercises for Chapter 4
Chapter 5 Message-Passing Programming
5.1 Introduction to MPI
5.1.1 MPI Point-to-Point Communication
5.1.2 Deadlocks with Point-to-Point Communications
5.1.3 Nonblocking Point-to-Point Operations
5.1.4 Communication Modes
5.1.4.1 Standard mode
5.1.4.2 Synchronous mode
5.1.4.3 Buffered mode
5.2 Collective Communication Operations
5.2.1 Collective Communication in MPI
5.2.1.1 Broadcast operation
5.2.1.2 Reduction operation
5.2.1.3 Scan operation as further reduction operation
5.2.1.4 Gather operation
5.2.1.5 Scatter operation
5.2.1.6 Multi-broadcast operation
5.2.1.7 Multi-accumulation operation
5.2.1.8 Total exchange
5.2.2 Deadlocks with Collective Communication
5.2.3 Nonblocking Collective Communication Operations
5.3 Process Groups and Communicators
5.3.1 Process Groups in MPI
5.3.1.1 Operations on Process Groups
5.3.1.2 Operations on Communicators
5.3.2 Process Topologies
5.3.3 Timings and Aborting Processes
5.4 Advanced Topics
5.4.1 Dynamic Process Generation and Management
5.4.1.1 MPI Info objects
5.4.1.2 Process creation and management
5.4.2 One-sided Communication
5.4.2.1 Window objects
5.4.2.2 RMA operations
5.4.2.3 Global synchronization
5.4.2.4 Loose synchronization
5.4.2.5 Lock synchronization
5.5 Exercises for Chapter 5
Chapter 6 Thread Programming
6.1 Programming with Pthreads
6.1.1 Creating and Merging Threads
6.1.2 Thread Coordination with Pthreads
6.1.2.1 Mutex variables
6.1.2.2 Mutex variables and deadlocks
6.1.3 Condition Variables
6.1.4 Extended Lock Mechanism
6.1.5 One-time Initialization
6.2 Parallel Programming Patterns with Pthreads
6.2.1 Implementation of a Task Pool
6.2.2 Parallelism by Pipelining
6.2.3 Implementation of a Client-Server Model
6.3 Advanced Pthread Features
6.3.1 Thread Attributes and Cancellation
6.3.1.1 Return value
6.3.1.2 Stack characteristics
6.3.1.3 Thread Cancellation
6.3.1.4 Cleanup Stack
6.3.1.5 Producer-Consumer threads
6.3.2 Thread Scheduling with Pthreads
6.3.2.1 Explicit setting of scheduling attributes
6.3.2.2 Implicit setting of scheduling attributes
6.3.2.3 Dynamic setting of scheduling attributes
6.3.3 Priority Inversion
6.3.3.1 Priority ceiling
6.3.3.2 Priority inheritance
6.3.4 Thread-specific Data
6.4 Java Threads
6.4.1 Thread Generation in Java
6.4.1.1 Inheriting from the Thread class
6.4.1.2 Using the interface Runnable
6.4.1.3 Further methods of the Thread class
6.4.2 Synchronization of Java Threads
6.4.2.1 Deadlocks
6.4.2.2 Synchronization with variable lock granularity
6.4.2.3 Synchronization of static methods
6.4.3 Wait and Notify
6.4.3.1 Producer-Consumer pattern
6.4.3.2 Modification of the MyMutex class
6.4.3.3 Barrier Synchronization
6.4.3.4 Condition Variables
6.4.4 Extended Synchronization Patterns
6.4.5 Thread Scheduling in Java
6.4.6 Package java.util.concurrent
6.4.6.1 Semaphore mechanism
6.4.6.2 Barrier synchronization
6.4.6.3 Lock Mechanisms
6.4.6.4 Signal mechanism
6.4.6.5 Atomic Operations
6.4.6.6 Task-based execution of programs
6.5 OpenMP
6.5.1 Compiler Directives
6.5.1.1 Parallel region
6.5.1.2 Parallel loops
6.5.1.3 Non-iterative Work-Sharing Constructs
6.5.1.4 Single execution
6.5.1.5 Syntactic abbreviations
6.5.2 Execution Environment Routines
6.5.3 Coordination and Synchronization of Threads
6.5.3.1 Locking mechanism
6.5.4 OpenMP Task Model
6.6 Exercises for Chapter 6
Chapter 7 General Purpose GPU Programming
7.1 The Architecture of GPUs
7.2 Introduction to CUDA programming
7.3 Synchronization and Shared Memory
7.4 CUDA Thread Scheduling
7.5 Efficient Memory Access and Tiling Technique
7.6 Introduction to OpenCL
7.7 Exercises for Chapter 7
Chapter 8 Algorithms for Systems of Linear Equations
8.1 Gaussian Elimination
8.1.1 Gaussian Elimination and LU Decomposition
8.1.1.1 LU decomposition and triangularization
8.1.1.2 Pivoting
8.1.2 Parallel Row-Cyclic Implementation
8.1.3 Parallel Implementation with Checkerboard Distribution
8.1.4 Analysis of the Parallel Execution Time
8.2 Direct Methods for Linear Systems with Banded Structure
8.2.1 Discretization of the Poisson Equation
8.2.2 Tridiagonal Systems
8.2.2.1 Gaussian elimination for tridiagonal systems
8.2.2.2 Recursive doubling for tridiagonal systems
8.2.2.3 Cyclic reduction for tridiagonal systems
8.2.2.4 Parallel implementation of cyclic reduction
8.2.2.5 Parallel execution time
8.2.3 Generalization to Banded Matrices
8.2.4 Solving the Discretized Poisson Equation
8.3 Iterative Methods for Linear Systems
8.3.1 Standard Iteration Methods
8.3.1.1 Jacobi iteration
8.3.1.2 Gauss-Seidel iteration
8.3.1.3 Convergence criteria
8.3.1.4 JOR method
8.3.1.5 SOR method
8.3.1.6 Implementation using matrix operations
8.3.2 Parallel Implementation of the Jacobi Iteration
8.3.3 Parallel Implementation of the Gauss-Seidel Iteration
8.3.4 Gauss-Seidel Iteration for Sparse Systems
8.3.5 Red-black Ordering
8.3.5.1 Gauss-Seidel iteration for red-black systems
8.3.5.2 SOR method for red-black systems
8.4 Conjugate Gradient Method
8.4.1 Sequential CG Method
8.4.2 Parallel CG Method
8.4.2.1 Basic operations of the CG algorithm
8.4.2.2 Parallel CG implementation with blockwise distribution
8.4.2.3 Parallel execution time
8.5 Cholesky Factorization for Sparse Matrices
8.5.1 Sequential Algorithm
8.5.1.1 Left-looking algorithms
8.5.1.2 Right-looking algorithm
8.5.1.3 Supernodes
8.5.2 Storage Scheme for Sparse Matrices
8.5.3 Implementation for Shared Variables
8.5.3.1 Parallel left-looking algorithms
8.5.3.2 Parallel right-looking algorithm
8.5.3.3 Parallel supernodal algorithm
8.6 Exercises for Chapter 8
References
Index