Multicore and GPU Programming: An Integrated Approach, Second Edition offers broad coverage of the key parallel computing tools essential for multi-core CPU programming and many-core "massively parallel" computing. Using threads, OpenMP, MPI, CUDA, and other state-of-the-art tools, the book teaches the design and development of software that takes advantage of modern computing platforms incorporating CPUs, GPUs, and other accelerators.
Presenting material refined over more than two decades of teaching parallel computing, author Gerassimos Barlas eases the transition from sequential programming to parallel platforms with numerous examples, extensive case studies, and full source code. Readers will learn how to develop programs that run on distributed-memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload across the available computing resources, and profile and debug programs targeting parallel machines.
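As a flavor of the kind of material the book covers (Chapter 3, threads and concurrency in standard C++), here is a minimal, self-contained sketch that launches a few standard C++ threads and joins them. It is an illustrative example written for this description, not code drawn from the book; the thread count and output message are arbitrary choices.

    #include <iostream>
    #include <thread>
    #include <vector>

    // Minimal illustration of standard C++ thread creation and joining,
    // in the spirit of Chapter 3; not taken from the book's source code.
    int main() {
        const int numThreads = 4;           // arbitrary thread count for this example
        std::vector<std::thread> workers;

        for (int i = 0; i < numThreads; ++i)
            workers.emplace_back([i] {
                // Output from different threads may interleave; ordering is not guaranteed.
                std::cout << "Hello from thread " << i << "\n";
            });

        for (auto &t : workers)             // wait for all threads to finish
            t.join();

        return 0;
    }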
Author(s): Gerassimos Barlas
Edition: 2
Publisher: Morgan Kaufmann
Year: 2022
Language: English
Commentary: Publisher PDF
Pages: 1024
City: Cambridge, United States
Tags: Multicore Programming; GPU Programming; C++ Threads; C++ Concurrency; Parallel Data Structures; Distributed Memory Programming; CUDA GPU Programming; OpenCL GPU Programming; OpenMP Parallel Programming; Qt Multi-threaded Programming; Thrust Template Library; Load Balancing
Contents
List of tables
Preface
What is in this book
Using this book as a textbook
Software and hardware requirements
Sample code
Glossary
Bibliography
1 Introduction
1.1 The era of multicore machines
1.2 A taxonomy of parallel machines
1.3 A glimpse of influential computing machines
1.3.1 The Cell BE processor
1.3.2 NVidia's Ampere
1.3.3 Multicore to many-core: TILERA's TILE-Gx8072 and Intel's Xeon Phi
1.3.4 AMD's Epyc Rome: scaling up with smaller chips
1.3.5 Fujitsu A64FX: compute and memory integration
1.4 Performance metrics
1.5 Predicting and measuring parallel program performance
1.5.1 Amdahl's law
1.5.2 Gustafson–Barsis' rebuttal
Exercises
2 Multicore and parallel program design
2.1 Introduction
2.2 The PCAM methodology
2.3 Decomposition patterns
2.3.1 Task parallelism
2.3.2 Divide-and-conquer decomposition
2.3.3 Geometric decomposition
2.3.4 Recursive data decomposition
2.3.5 Pipeline decomposition
2.3.6 Event-based coordination decomposition
2.4 Program structure patterns
2.4.1 Single program, multiple data
2.4.2 Multiple program, multiple data
2.4.3 Master–worker
2.4.4 Map-reduce
2.4.5 Fork/join
2.4.6 Loop parallelism
2.5 Matching decomposition patterns with program structure patterns
Exercises
3 Threads and concurrency in standard C++
3.1 Introduction
3.2 Threads
3.2.1 What is a thread?
3.2.2 What are threads good for?
3.3 Thread creation and initialization
3.4 Sharing data between threads
3.5 Design concerns
3.6 Semaphores
3.7 Applying semaphores in classical problems
3.7.1 Producers–consumers
3.7.2 Dealing with termination
3.7.2.1 Termination using a shared data item
3.7.2.2 Termination using messages
3.7.3 The barbershop problem – introducing fairness
3.7.4 Readers–writers
3.7.4.1 A solution favoring the readers
3.7.4.2 Giving priority to the writers
3.7.4.3 A fair solution
3.7.4.4 C++17 provisions
3.8 Atomic data types
3.8.1 Memory ordering
3.9 Monitors
3.9.1 Design approach #1: critical section inside the monitor
3.9.2 Design approach #2: monitor controls entry to critical section
3.9.3 General semaphores revisited
3.10 Applying monitors in classical problems
3.10.1 Producers–consumers revisited
3.10.1.1 Producers–consumers: buffer manipulation within the monitor
3.10.1.2 Producers–consumers: buffer insertion/extraction exterior to the monitor
3.10.2 Readers–writers
3.10.2.1 A solution favoring the readers
3.10.2.2 Giving priority to the writers
3.10.2.3 A fair solution
3.11 Asynchronous threads
3.12 Dynamic vs. static thread management
3.13 Threads and fibers
3.14 Debugging multi-threaded applications
Exercises
4 Parallel data structures
4.1 Introduction
4.2 Lock-based structures
4.2.1 Queues
4.2.2 Lists
4.2.2.1 List using coarse-grain locks
4.2.2.2 List using fine-grain locks
4.2.2.3 List using optimistic synchronization
4.2.2.4 List using lazy synchronization
4.3 Lock-free structures
4.3.1 Lock-free stacks
4.3.1.1 A lock-free stack with explicit memory management
4.3.1.2 A lock-free stack using smart pointers
4.3.2 A bounded lock-free queue: first attempt
4.3.3 The ABA problem
4.3.4 A fixed bounded lock-free queue
4.3.5 An unbounded lock-free queue
4.4 Closing remarks
Exercises
5 Distributed memory programming
5.1 Introduction
5.2 MPI
5.3 Core concepts
5.4 Your first MPI program
5.5 Program architecture
5.5.1 SPMD
5.5.2 MPMD
5.6 Point-to-point communication
5.7 Alternative point-to-point communication modes
5.7.1 Buffered communications
5.8 Non-blocking communications
5.9 Point-to-point communications: summary
5.10 Error reporting & handling
5.11 Collective communications
5.11.1 Scattering
5.11.2 Gathering
5.11.3 Reduction
5.11.4 All-to-all gathering
5.11.5 All-to-all scattering
5.11.6 All-to-all reduction
5.11.7 Global synchronization
5.12 Persistent communications
5.13 Big-count communications in MPI 4.0
5.14 Partitioned communications
5.15 Communicating objects
5.15.1 Derived datatypes
5.15.2 Packing/unpacking
5.16 Node management: communicators and groups
5.16.1 Creating groups
5.16.2 Creating intracommunicators
5.17 One-sided communication
5.17.1 RMA communication functions
5.17.2 RMA synchronization functions
5.18 I/O considerations
5.19 Combining MPI processes with threads
5.20 Timing and performance measurements
5.21 Debugging, profiling, and tracing MPI programs
5.21.1 Brief introduction to Scalasca
5.21.2 Brief introduction to TAU
5.22 The Boost.MPI library
5.22.1 Blocking and non-blocking communications
5.22.2 Data serialization
5.22.3 Collective operations
5.23 A case study: diffusion-limited aggregation
5.24 A case study: brute-force encryption cracking
5.25 A case study: MPI implementation of the master–worker pattern
5.25.1 A simple master–worker setup
5.25.2 A multi-threaded master–worker setup
Exercises
6 GPU programming: CUDA
6.1 Introduction
6.2 CUDA's programming model: threads, blocks, and grids
6.3 CUDA's execution model: streaming multiprocessors and warps
6.4 CUDA compilation process
6.5 Putting together a CUDA project
6.6 Memory hierarchy
6.6.1 Local memory/registers
6.6.2 Shared memory
6.6.3 Constant memory
6.6.4 Texture and surface memory
6.7 Optimization techniques
6.7.1 Block and grid design
6.7.2 Kernel structure
6.7.2.1 Kernel structure and Independent Thread Scheduling
6.7.3 Shared memory access
6.7.4 Global memory access
6.7.4.1 Page-locked and zero-copy memory
6.7.4.2 Unified memory
6.7.5 Asynchronous execution and streams: overlapping GPU memory transfers and more
6.7.5.1 Stream synchronization: events and callbacks
6.8 Graphs
6.8.1 Creating a graph using the CUDA graph API
6.8.2 Creating a graph by capturing a stream
6.9 Warp functions
6.10 Cooperative groups
6.10.1 Intrablock cooperative groups
6.10.1.1 Coalesced group reduction
6.10.1.2 Asynchronous memory copy to shared memory
6.10.2 Interblock cooperative groups
6.10.3 Grid-level reduction
6.11 Dynamic parallelism
6.12 Debugging CUDA programs
6.13 Profiling CUDA programs
6.14 CUDA and MPI
6.15 Case studies
6.15.1 Fractal set calculation
6.15.1.1 Version #1: One thread per pixel
6.15.1.2 Version #2: Pinned host and pitched device memory
6.15.1.3 Version #3: Multiple pixels per thread
6.15.1.4 Performance comparison
6.15.2 Block cipher encryption
6.15.2.1 Version #1: The case of a stand-alone GPU machine
6.15.2.2 Version #2: Overlapping GPU communication and computation
6.15.2.3 Version #3: Using a cluster of GPU machines
6.15.2.4 Performance comparison
Exercises
7 GPU and accelerator programming: OpenCL
7.1 The OpenCL architecture
7.2 The platform model
7.3 The execution model
7.4 The programming model
7.4.1 Summarizing the structure of an OpenCL program
7.5 The memory model
7.5.1 Buffer objects
7.5.2 Local memory
7.5.3 Image objects
7.5.4 Pipe objects
7.6 Shared virtual memory
7.7 Atomics and synchronization
7.8 Work group functions
7.9 Events and profiling OpenCL programs
7.10 OpenCL and other parallel software platforms
7.11 Case study: Mandelbrot set
7.11.1 Calculating the Mandelbrot set using OpenCL
7.11.2 Hybrid calculation of the Mandelbrot set using OpenCL and C++11
7.11.3 Hybrid calculation of the Mandelbrot set using OpenCL on both host and device
7.11.4 Performance comparison
Exercises
8 Shared-memory programming: OpenMP
8.1 Introduction
8.2 Your first OpenMP program
8.3 Variable scope
8.3.1 OpenMP integration V.0: manual partitioning
8.3.2 OpenMP integration V.1: manual partitioning without a race condition
8.3.3 OpenMP integration V.2: implicit partitioning with locking
8.3.4 OpenMP integration V.3: implicit partitioning with reduction
8.3.5 Final words on variable scope
8.4 Loop-level parallelism
8.4.1 Data dependencies
8.4.1.1 Flow dependencies
8.4.1.2 Anti-dependencies
8.4.1.3 Output dependencies
8.4.2 Nested loops
8.4.3 Scheduling
8.5 Task parallelism
8.5.1 The sections directive
8.5.1.1 Producers–consumers in OpenMP
8.5.2 The task directive
8.5.3 Task dependencies
8.5.4 The taskloop directive
8.5.5 The taskgroup directive and task-level reduction
8.6 Synchronization constructs
8.7 Cancellation constructs
8.8 SIMD extensions
8.9 Offloading to devices
8.9.1 Device work-sharing directives
8.9.2 Device memory management directives
8.9.3 CUDA interoperability
8.10 The loop construct
8.11 Thread affinity
8.12 Correctness and optimization issues
8.12.1 Thread safety
8.12.2 False-sharing
8.13 A case study: sorting in OpenMP
8.13.1 Bottom-up mergesort in OpenMP
8.13.2 Top-down mergesort in OpenMP
8.13.3 Performance comparison
8.14 A case study: brute-force encryption cracking, combining MPI and OpenMP
Exercises
9 High-level multi-threaded programming with the Qt library
9.1 Introduction
9.2 Implicit thread creation
9.3 Qt's pool of threads
9.4 Higher-level constructs – multi-threaded programming without threads!
9.4.1 Concurrent map
9.4.2 Map-reduce
9.4.3 Concurrent filter
9.4.4 Filter-reduce
9.4.5 A case study: multi-threaded sorting
9.4.6 A case study: multi-threaded image matching
Exercises
10 The Thrust template library
10.1 Introduction
10.2 First steps in Thrust
10.3 Working with Thrust datatypes
10.4 Thrust algorithms
10.4.1 Transformations
10.4.2 Sorting & searching
10.4.3 Reductions
10.4.4 Scans/prefix-sums
10.4.5 Data management and reordering
10.5 Fancy iterators
10.6 Switching device back-ends
10.7 Thrust execution policies and asynchronous execution
10.8 Case studies
10.8.1 Monte Carlo integration
10.8.2 DNA sequence alignment
Exercises
11 Load balancing
11.1 Introduction
11.2 Dynamic load balancing: the Linda legacy
11.3 Static load balancing: the divisible load theory approach
11.3.1 Modeling costs
11.3.2 Communication configuration
11.3.3 Analysis
11.3.3.1 N-port, block-type, single-installment solution
11.3.3.2 One-port, block-type, single-installment solution
11.3.4 Summary – short literature review
11.4 DLTlib: a library for partitioning workloads
11.5 Case studies
11.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing
11.5.2 Distributed block cipher encryption: a case study in static load balancing
Exercises
A Creating Qt programs
A.1 Using an IDE
A.2 The qmake utility
B Running MPI programs: preparatory and configuration steps
B.1 Preparatory steps
B.2 Computing nodes discovery for MPI program deployment
B.2.1 Host discovery with the nmap utility
B.2.2 Automatic generation of a hostfile
C Time measurement
C.1 Introduction
C.2 POSIX high-resolution timing
C.3 Timing in C++11
C.4 Timing in Qt
C.5 Timing in OpenMP
C.6 Timing in MPI
C.7 Timing in CUDA
D Boost.MPI
D.1 Mapping from MPI C to Boost.MPI
E Setting up CUDA
E.1 Installation
E.2 Issues with GCC
E.3 Combining CUDA with third-party libraries
F OpenCL helper functions
F.1 Function readCLFromFile
F.2 Function isError
F.3 Function getCompilationError
F.4 Function handleError
F.5 Function setupDevice
F.6 Function setupProgramAndKernel
G DLTlib
G.1 DLTlib functions
G.1.1 Class Network: generic methods
G.1.2 Class Network: query processing
G.1.3 Class Network: image processing
G.1.4 Class Network: image registration
G.2 DLTlib files
Index