Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Concise, intuitive, and practical, it is based on years of road-testing in the authors' own parallel computing courses. Various techniques for constructing and optimizing parallel programs are explored in detail, while case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. The new edition includes updated coverage of CUDA, including newer libraries such as cuDNN; new chapters on frequently used parallel patterns; and case studies updated to reflect current industry practices.
• Parallel Patterns: Introduces new chapters on frequently used parallel patterns (stencil, reduction, sorting) and major improvements to previous chapters (convolution, histogram, sparse matrices, graph traversal, deep learning)
• Ampere: Includes a new chapter focused on GPU architecture, with examples drawn from recent architecture generations, including Ampere
• Systematic Approach: Incorporates major improvements to the discussion of problem decomposition strategies and performance considerations, with a new optimization checklist
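For flavor, here is a minimal sketch of the kind of CUDA C program Chapter 2 builds up to: a vector addition kernel and its host-side launch. It is an illustrative sketch only; the names (vecAddKernel, h_A, d_A, and so on) are ours, not reproduced from the book, and error checking is omitted for brevity.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Each thread computes one element of the output vector.
    __global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {  // guard: the last block may have extra threads
            C[i] = A[i] + B[i];
        }
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t size = n * sizeof(float);

        // Allocate and initialize host vectors.
        float* h_A = (float*)malloc(size);
        float* h_B = (float*)malloc(size);
        float* h_C = (float*)malloc(size);
        for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

        // Allocate device global memory and copy the inputs over.
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);
        cudaMalloc((void**)&d_C, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);

        // Copy the result back and spot-check it.
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        printf("C[0] = %.1f (expected 3.0)\n", h_C[0]);

        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B); free(h_C);
        return 0;
    }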
Author(s): Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
Edition: 4
Publisher: Morgan Kaufmann
Year: 2023
Language: English
Commentary: Publisher's PDF
Pages: 554
City: Cambridge, MA
Tags: Deep Learning; Parallel Programming; Clusters; GPU Programming; C; Graph Algorithms; SIMD; CUDA; Performance; Sorting Algorithms
Programming Massively Parallel Processors
Copyright
Dedication
Contents
Foreword
Preface
How to use the book
A two-phased approach
Tying it all together: the final project
The design document
The project report and symposium
Class competition
Course resources
Acknowledgments
1 Introduction
1.1 Heterogeneous parallel computing
1.2 Why more speed or parallelism?
1.3 Speeding up real applications
1.4 Challenges in parallel programming
1.5 Related parallel programming interfaces
1.6 Overarching goals
1.7 Organization of the book
References
2 Heterogeneous data parallel computing
2.1 Data parallelism
2.2 CUDA C program structure
2.3 A vector addition kernel
2.4 Device global memory and data transfer
2.5 Kernel functions and threading
2.6 Calling kernel functions
2.7 Compilation
2.8 Summary
2.8.1 Function declarations
2.8.2 Kernel call and grid launch
2.8.3 Built-in (predefined) variables
2.8.4 Runtime application programming interface
Exercises
References
3 Multidimensional grids and data
3.1 Multidimensional grid organization
3.2 Mapping threads to multidimensional data
3.3 Image blur: a more complex kernel
3.4 Matrix multiplication
3.5 Summary
Exercises
4 Compute architecture and scheduling
4.1 Architecture of a modern GPU
4.2 Block scheduling
4.3 Synchronization and transparent scalability
4.4 Warps and SIMD hardware
4.5 Control divergence
4.6 Warp scheduling and latency tolerance
4.7 Resource partitioning and occupancy
4.8 Querying device properties
4.9 Summary
Exercises
References
5 Memory architecture and data locality
5.1 Importance of memory access efficiency
5.2 CUDA memory types
5.3 Tiling for reduced memory traffic
5.4 A tiled matrix multiplication kernel
5.5 Boundary checks
5.6 Impact of memory usage on occupancy
5.7 Summary
Exercises
6 Performance considerations
6.1 Memory coalescing
6.2 Hiding memory latency
6.3 Thread coarsening
6.4 A checklist of optimizations
6.5 Knowing your computation’s bottleneck
6.6 Summary
Exercises
References
7 Convolution
7.1 Background
7.2 Parallel convolution: a basic algorithm
7.3 Constant memory and caching
7.4 Tiled convolution with halo cells
7.5 Tiled convolution using caches for halo cells
7.6 Summary
Exercises
8 Stencil
8.1 Background
8.2 Parallel stencil: a basic algorithm
8.3 Shared memory tiling for stencil sweep
8.4 Thread coarsening
8.5 Register tiling
8.6 Summary
Exercises
9 Parallel histogram
9.1 Background
9.2 Atomic operations and a basic histogram kernel
9.3 Latency and throughput of atomic operations
9.4 Privatization
9.5 Coarsening
9.6 Aggregation
9.7 Summary
Exercises
References
10 Reduction
10.1 Background
10.2 Reduction trees
10.3 A simple reduction kernel
10.4 Minimizing control divergence
10.5 Minimizing memory divergence
10.6 Minimizing global memory accesses
10.7 Hierarchical reduction for arbitrary input length
10.8 Thread coarsening for reduced overhead
10.9 Summary
Exercises
11 Prefix sum (scan)
11.1 Background
11.2 Parallel scan with the Kogge-Stone algorithm
11.3 Speed and work efficiency considerations
11.4 Parallel scan with the Brent-Kung algorithm
11.5 Coarsening for even more work efficiency
11.6 Segmented parallel scan for arbitrary-length inputs
11.7 Single-pass scan for memory access efficiency
11.8 Summary
Exercises
References
12 Merge
12.1 Background
12.2 A sequential merge algorithm
12.3 A parallelization approach
12.4 Co-rank function implementation
12.5 A basic parallel merge kernel
12.6 A tiled merge kernel to improve coalescing
12.7 A circular buffer merge kernel
12.8 Thread coarsening for merge
12.9 Summary
Exercises
References
13 Sorting
13.1 Background
13.2 Radix sort
13.3 Parallel radix sort
13.4 Optimizing for memory coalescing
13.5 Choice of radix value
13.6 Thread coarsening to improve coalescing
13.7 Parallel merge sort
13.8 Other parallel sort methods
13.9 Summary
Exercises
References
14 Sparse matrix computation
14.1 Background
14.2 A simple SpMV kernel with the COO format
14.3 Grouping row nonzeros with the CSR format
14.4 Improving memory coalescing with the ELL format
14.5 Regulating padding with the hybrid ELL-COO format
14.6 Reducing control divergence with the JDS format
14.7 Summary
Exercises
References
15 Graph traversal
15.1 Background
15.2 Breadth-first search
15.3 Vertex-centric parallelization of breadth-first search
15.4 Edge-centric parallelization of breadth-first search
15.5 Improving efficiency with frontiers
15.6 Reducing contention with privatization
15.7 Other optimizations
Reducing launch overhead
Improving load balance
Further challenges
15.8 Summary
Exercises
References
16 Deep learning
16.1 Background
Multilayer classifiers
Training models
Error function
Stochastic gradient descent
Epoch
Backpropagation
Chain rule
Learning rate
Minibatch
Training multilayer classifiers
Feedforward networks
16.2 Convolutional neural networks
Convolutional neural network inference
Convolutional neural network backpropagation
16.3 Convolutional layer: a CUDA inference kernel
16.4 Formulating a convolutional layer as GEMM
16.5 cuDNN library
16.6 Summary
Exercises
References
17 Iterative magnetic resonance imaging reconstruction
17.1 Background
17.2 Iterative reconstruction
17.3 Computing FᴴD
Step 1: Determine the kernel parallelism structure
Step 2: Getting around the memory bandwidth limitation
Step 3: Using hardware trigonometry functions
Step 4: Experimental performance tuning
17.4 Summary
Exercises
References
18 Electrostatic potential map
18.1 Background
18.2 Scatter versus gather in kernel design
18.3 Thread coarsening
18.4 Memory coalescing
18.5 Cutoff binning for data size scalability
18.6 Summary
Exercises
References
19 Parallel programming and computational thinking
19.1 Goals of parallel computing
19.2 Algorithm selection
19.3 Problem decomposition
19.4 Computational thinking
19.5 Summary
References
20 Programming a heterogeneous computing cluster
20.1 Background
20.2 A running example
20.3 Message passing interface basics
20.4 Message passing interface point-to-point communication
20.5 Overlapping computation and communication
20.6 Message passing interface collective communication
20.7 CUDA-aware message passing interface
20.8 Summary
Exercises
References
21 CUDA dynamic parallelism
21.1 Background
21.2 Dynamic parallelism overview
21.3 An example: Bezier curves
Linear Bezier curves
Quadratic Bezier curves
Bezier curve calculation (without dynamic parallelism)
Bezier curve calculation (with dynamic parallelism)
21.4 A recursive example: quadtrees
21.5 Important considerations
Memory and data visibility
Pending launch pool configuration
Streams
Nesting depth
21.6 Summary
Exercises
A21.1 Support code for quadtree example
References
22 Advanced practices and future evolution
22.1 Model of host/device interaction
Zero-copy memory and unified virtual address space
Large virtual and physical address spaces
Unified memory
Virtual address space control
22.2 Kernel execution control
Function calls within kernel functions
Exception handling in kernel functions
Simultaneous execution of multiple grids
Hardware queues and dynamic parallelism
Interruptible grids
Cooperative kernels
22.3 Memory bandwidth and compute throughput
Double-precision speed
Better control flow efficiency
Configurable caching and scratchpad
Enhanced atomic operations
Enhanced global memory access
22.4 Programming environment
Unified device memory space
Profiling with critical path analysis
22.5 Future outlook
References
23 Conclusion and outlook
23.1 Goals revisited
23.2 Future outlook
Appendix A Numerical considerations
A.1 Floating-point data representation
A.1.1 Normalized representation of M
A.1.2 Excess encoding of E
A.2 Representable numbers
A.3 Special bit patterns and precision in IEEE format
A.4 Arithmetic accuracy and rounding
A.5 Algorithm considerations
A.6 Linear solvers and numerical stability
A.7 Summary
Exercises
References
Index