An Introduction to Parallel Programming, Second Edition presents a tried-and-true tutorial approach that shows students how to develop effective parallel programs with MPI, Pthreads and OpenMP.
As the first undergraduate text to directly address compiling and running parallel programs on multi-core and cluster architectures, this second edition carries forward its clear explanations for designing, debugging and evaluating the performance of distributed and shared-memory programs, while adding coverage of accelerators via new content on GPU programming and heterogeneous programming. New and improved user-friendly exercises teach students how to compile, run and modify example programs.
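To give a flavor of the material in Chapter 3 (Sections 3.1.1 through 3.1.5 cover compilation, execution, MPI_Init/MPI_Finalize, and communicators), here is a minimal sketch of an MPI "hello, world" program in C. It is an illustrative example, not a listing from the book, and it assumes a standard MPI installation that provides the mpicc wrapper compiler and the mpiexec launcher; the file name mpi_hello.c is hypothetical.

    /* mpi_hello.c -- each MPI process reports its rank (illustrative sketch) */
    #include <stdio.h>
    #include <mpi.h>

    int main(void) {
        int comm_sz;   /* number of processes in MPI_COMM_WORLD */
        int my_rank;   /* rank of this process */

        MPI_Init(NULL, NULL);                      /* start up MPI */
        MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        printf("Hello from process %d of %d\n", my_rank, comm_sz);

        MPI_Finalize();                            /* shut down MPI */
        return 0;
    }

A program like this would typically be built and launched with something along the lines of "mpicc -o mpi_hello mpi_hello.c" followed by "mpiexec -n 4 ./mpi_hello", which is the compile-and-run workflow that Section 3.1.1 addresses.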
Author(s): Peter Pacheco, Matthew Malensek
Edition: 2
Publisher: Morgan Kaufmann
Year: 2020
Language: English
Pages: 496
Tags: parallel programming; MPI; Pthreads; OpenMP; GPU programming
Front Cover
An Introduction to Parallel Programming
Copyright
Contents
Preface
1 Why parallel computing
1.1 Why we need ever-increasing performance
1.2 Why we're building parallel systems
1.3 Why we need to write parallel programs
1.4 How do we write parallel programs?
1.5 What we'll be doing
1.6 Concurrent, parallel, distributed
1.7 The rest of the book
1.8 A word of warning
1.9 Typographical conventions
1.10 Summary
1.11 Exercises
2 Parallel hardware and parallel software
2.1 Some background
2.1.1 The von Neumann architecture
2.1.2 Processes, multitasking, and threads
2.2 Modifications to the von Neumann model
2.2.1 The basics of caching
2.2.2 Cache mappings
2.2.3 Caches and programs: an example
2.2.4 Virtual memory
2.2.5 Instruction-level parallelism
Pipelining
Multiple issue
2.2.6 Hardware multithreading
2.3 Parallel hardware
2.3.1 Classifications of parallel computers
2.3.2 SIMD systems
Vector processors
Graphics processing units
2.3.3 MIMD systems
Shared-memory systems
Distributed-memory systems
2.3.4 Interconnection networks
Shared-memory interconnects
Distributed-memory interconnects
Latency and bandwidth
2.3.5 Cache coherence
Snooping cache coherence
Directory-based cache coherence
False sharing
2.3.6 Shared-memory vs. distributed-memory
2.4 Parallel software
2.4.1 Caveats
2.4.2 Coordinating the processes/threads
2.4.3 Shared-memory
Dynamic and static threads
Nondeterminism
Thread safety
2.4.4 Distributed-memory
Message-passing
One-sided communication
Partitioned global address space languages
2.4.5 GPU programming
2.4.6 Programming hybrid systems
2.5 Input and output
2.5.1 MIMD systems
2.5.2 GPUs
2.6 Performance
2.6.1 Speedup and efficiency in MIMD systems
2.6.2 Amdahl's law
2.6.3 Scalability in MIMD systems
2.6.4 Taking timings of MIMD programs
2.6.5 GPU performance
2.7 Parallel program design
2.7.1 An example
A serial program
Parallelizing the serial program
2.8 Writing and running parallel programs
2.9 Assumptions
2.10 Summary
2.10.1 Serial systems
2.10.2 Parallel hardware
2.10.3 Parallel software
2.10.4 Input and output
2.10.5 Performance
2.10.6 Parallel program design
2.10.7 Assumptions
2.11 Exercises
3 Distributed memory programming with MPI
3.1 Getting started
3.1.1 Compilation and execution
3.1.2 MPI programs
3.1.3 MPI_Init and MPI_Finalize
3.1.4 Communicators, MPI_Comm_size, and MPI_Comm_rank
3.1.5 SPMD programs
3.1.6 Communication
3.1.7 MPI_Send
3.1.8 MPI_Recv
3.1.9 Message matching
3.1.10 The status_p argument
3.1.11 Semantics of MPI_Send and MPI_Recv
3.1.12 Some potential pitfalls
3.2 The trapezoidal rule in MPI
3.2.1 The trapezoidal rule
3.2.2 Parallelizing the trapezoidal rule
3.3 Dealing with I/O
3.3.1 Output
3.3.2 Input
3.4 Collective communication
3.4.1 Tree-structured communication
3.4.2 MPI_Reduce
3.4.3 Collective vs. point-to-point communications
3.4.4 MPI_Allreduce
3.4.5 Broadcast
3.4.6 Data distributions
3.4.7 Scatter
3.4.8 Gather
3.4.9 Allgather
3.5 MPI derived datatypes
3.6 Performance evaluation of MPI programs
3.6.1 Taking timings
3.6.2 Results
3.6.3 Speedup and efficiency
3.6.4 Scalability
3.7 A parallel sorting algorithm
3.7.1 Some simple serial sorting algorithms
3.7.2 Parallel odd-even transposition sort
3.7.3 Safety in MPI programs
3.7.4 Final details of parallel odd-even sort
3.8 Summary
3.9 Exercises
3.10 Programming assignments
4 Shared-memory programming with Pthreads
4.1 Processes, threads, and Pthreads
4.2 Hello, world
4.2.1 Execution
4.2.2 Preliminaries
4.2.3 Starting the threads
4.2.4 Running the threads
4.2.5 Stopping the threads
4.2.6 Error checking
4.2.7 Other approaches to thread startup
4.3 Matrix-vector multiplication
4.4 Critical sections
4.5 Busy-waiting
4.6 Mutexes
4.7 Producer–consumer synchronization and semaphores
4.8 Barriers and condition variables
4.8.1 Busy-waiting and a mutex
4.8.2 Semaphores
4.8.3 Condition variables
4.8.4 Pthreads barriers
4.9 Read-write locks
4.9.1 Sorted linked list functions
4.9.2 A multithreaded linked list
4.9.3 Pthreads read-write locks
4.9.4 Performance of the various implementations
4.9.5 Implementing read-write locks
4.10 Caches, cache coherence, and false sharing
4.11 Thread-safety
4.11.1 Incorrect programs can produce correct output
4.12 Summary
4.13 Exercises
4.14 Programming assignments
5 Shared-memory programming with OpenMP
5.1 Getting started
5.1.1 Compiling and running OpenMP programs
5.1.2 The program
5.1.3 Error checking
5.2 The trapezoidal rule
5.2.1 A first OpenMP version
5.3 Scope of variables
5.4 The reduction clause
5.5 The parallel for directive
5.5.1 Caveats
5.5.2 Data dependences
5.5.3 Finding loop-carried dependences
5.5.4 Estimating π
5.5.5 More on scope
5.6 More about loops in OpenMP: sorting
5.6.1 Bubble sort
5.6.2 Odd-even transposition sort
5.7 Scheduling loops
5.7.1 The schedule clause
5.7.2 The static schedule type
5.7.3 The dynamic and guided schedule types
5.7.4 The runtime schedule type
5.7.5 Which schedule?
5.8 Producers and consumers
5.8.1 Queues
5.8.2 Message-passing
5.8.3 Sending messages
5.8.4 Receiving messages
5.8.5 Termination detection
5.8.6 Startup
5.8.7 The atomic directive
5.8.8 Critical sections and locks
5.8.9 Using locks in the message-passing program
5.8.10 Critical directives, atomic directives, or locks?
5.8.11 Some caveats
5.9 Caches, cache coherence, and false sharing
5.10 Tasking
5.11 Thread-safety
5.11.1 Incorrect programs can produce correct output
5.12 Summary
5.13 Exercises
5.14 Programming assignments
6 GPU programming with CUDA
6.1 GPUs and GPGPU
6.2 GPU architectures
6.3 Heterogeneous computing
6.4 CUDA hello
6.4.1 The source code
6.4.2 Compiling and running the program
6.5 A closer look
6.6 Threads, blocks, and grids
6.7 Nvidia compute capabilities and device architectures
6.8 Vector addition
6.8.1 The kernel
6.8.2 Get_args
6.8.3 Allocate_vectors and managed memory
6.8.4 Other functions called from main
6.8.5 Explicit memory transfers
6.9 Returning results from CUDA kernels
6.10 CUDA trapezoidal rule I
6.10.1 The trapezoidal rule
6.10.2 A CUDA implementation
6.10.3 Initialization, return value, and final update
6.10.4 Using the correct threads
6.10.5 Updating the return value and atomicAdd
6.10.6 Performance of the CUDA trapezoidal rule
6.11 CUDA trapezoidal rule II: improving performance
6.11.1 Tree-structured communication
6.11.2 Local variables, registers, shared and global memory
6.11.3 Warps and warp shuffles
6.11.4 Implementing tree-structured global sum with a warp shuffle
6.11.5 Shared memory and an alternative to the warp shuffle
6.12 Implementation of trapezoidal rule with warpSize thread blocks
6.12.1 Host code
6.12.2 Kernel with warp shuffle
6.12.3 Kernel with shared memory
6.12.4 Performance
6.13 CUDA trapezoidal rule III: blocks with more than one warp
6.13.1 __syncthreads
6.13.2 More shared memory
6.13.3 Shared memory warp sums
6.13.4 Shared memory banks
6.13.5 Finishing up
6.13.6 Performance
6.14 Bitonic sort
6.14.1 Serial bitonic sort
6.14.2 Butterflies and binary representations
6.14.3 Parallel bitonic sort I
6.14.4 Parallel bitonic sort II
6.14.5 Performance of CUDA bitonic sort
6.15 Summary
6.16 Exercises
6.17 Programming assignments
7 Parallel program development
7.1 Two n-body solvers
7.1.1 The problem
7.1.2 Two serial programs
7.1.3 Parallelizing the n-body solvers
7.1.4 A word about I/O
7.1.5 Parallelizing the basic solver using OpenMP
7.1.6 Parallelizing the reduced solver using OpenMP
7.1.7 Evaluating the OpenMP codes
7.1.8 Parallelizing the solvers using Pthreads
7.1.9 Parallelizing the basic solver using MPI
7.1.10 Parallelizing the reduced solver using MPI
7.1.11 Performance of the MPI solvers
7.1.12 Parallelizing the basic solver using CUDA
7.1.13 A note on cooperative groups in CUDA
7.1.14 Performance of the basic CUDA n-body solver
7.1.15 Improving the performance of the CUDA n-body solver
7.1.16 Using shared memory in the n-body solver
7.2 Sample sort
7.2.1 Sample sort and bucket sort
7.2.2 Choosing the sample
7.2.3 A simple implementation of the Map function
7.2.4 An alternative implementation of Map
7.2.5 Parallelizing sample sort
7.2.6 Implementing sample sort with OpenMP
First implementation
Second implementation
7.2.7 Implementing sample sort with Pthreads
7.2.8 Implementing sample sort with MPI
First implementation
Second implementation
7.2.9 Implementing sample sort with CUDA
First implementation
Prefix sums
Performance
A second CUDA implementation of sample sort
Performance of the second CUDA sample sort
7.3 A word of caution
7.4 Which API?
7.5 Summary
7.5.1 MPI
7.6 Exercises
7.7 Programming assignments
8 Where to go from here
Bibliography
Index
Back Cover