The essential guide for writing portable, parallel programs for GPUs using the OpenMP programming model.
Today’s computers are complex, multi-architecture systems: multiple cores in a shared address space, graphics processing units (GPUs), and specialized accelerators. To get the most from these systems, programs must use all these different processors. In Programming Your GPU with OpenMP, Tom Deakin and Timothy Mattson help everyone, from beginners to advanced programmers, learn how to use OpenMP to program a GPU using just a few directives and runtime functions. Then programmers can go further to maximize performance by using CPUs and GPUs in parallel—true heterogeneous programming. And since OpenMP is a portable API, the programs will run on almost any system.
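To give a flavor of what "just a few directives and runtime functions" means, here is a minimal sketch (our own illustration, not one of the book's examples) that offloads a simple loop to a GPU. The combined target teams loop construct and omp_get_num_devices() are standard OpenMP; everything else is ordinary C.

#include <omp.h>
#include <stdio.h>

#define N 100000

int main(void)
{
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // One runtime function reports whether a GPU (or other device) is available.
    printf("devices available: %d\n", omp_get_num_devices());

    // One directive offloads the loop to the default device and runs it in
    // parallel there; stack arrays such as x and y are mapped automatically.
    #pragma omp target teams loop
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}

If no device is present, a conforming OpenMP implementation simply runs the target region on the host, so the same code base serves both cases.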
This book will help you learn how to program a GPU with OpenMP. The first part of the book provides the background you need to understand GPU programming with OpenMP. We start by reviewing hardware developments that programmers need to understand. We explain the GPU and its differences from, and similarities to, the modern CPU. Next, we include a chapter that summarizes how to use OpenMP to program multithreaded systems (i.e., multicore systems with a shared address space). With this background in place, you will be ready for our core topic: how to use OpenMP to program heterogeneous systems composed of CPUs and GPUs.
GPU programming is the topic for Part II of the book. Parallel programming is hard. Just as the original version of OpenMP made it easier to write multithreaded code, modern OpenMP greatly simplifies GPU programming. With 10 items consisting of directives, runtime functions, and environment variables, you'll be able to write programs that run on a GPU. In many cases, these programs will run with performance on par with that of lower-level (and often nonportable) approaches. We call these 10 items the OpenMP GPU Common Core. Explaining the GPU Common Core is our main goal for the second part of the book. After covering the items that make up the GPU Common Core, we close Part II with a discussion of the key principles of performance optimization for GPU programming: the so-called Eightfold Path to Performance.
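As a taste of the kind of code the GPU Common Core enables (a sketch of our own, not reproduced from the book), the example below combines several of the items the table of contents introduces in Chapters 3 and 4: the target construct, teams and parallel for for parallel execution on the device, explicit map clauses for heap arrays, and a reduction whose result is copied back to the host.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double sum = 0.0;

    // target offloads the region, teams/distribute/parallel for create the
    // hierarchy of GPU threads, map moves the heap arrays to the device, and
    // reduction combines the partial sums and returns the result to the host.
    #pragma omp target teams distribute parallel for \
        map(to: a[0:N], b[0:N]) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f (expected %f)\n", sum, 2.0 * N);

    free(a);
    free(b);
    return 0;
}

The same handful of constructs, applied with the performance principles of the Eightfold Path, covers a large share of production GPU codes.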
Programming Your GPU with OpenMP shares best practices for writing performance portable programs. Key features include:
The most up-to-date APIs for programming GPUs with OpenMP, along with concepts that transfer to other approaches to GPU programming.
Written in a tutorial style that embraces active learning, so that readers can make immediate use of what they learn via provided source code.
Builds the OpenMP GPU Common Core to get programmers to serious production-level GPU programming as fast as possible.
Additional features:
A reference guide at the end of the book covering all relevant parts of OpenMP 5.2.
An online repository containing source code for the example programs from the book—provided in all languages currently supported by OpenMP: C, C++, and Fortran.
Tutorial videos and lecture slides.
Author(s): Tom Deakin, Timothy G. Mattson
Publisher: The MIT Press
Year: 2023
Language: English
Pages: 336
Scientific and Engineering Computation
Programming Your GPU with OpenMP
Contents
List of Figures
List of Tables
Series Foreword
Preface
Acknowledgments
I SETTING THE STAGE
1 Heterogeneity and the Future of Computing
1.1 The Basic Building Blocks of Modern Computing
1.1.1 The CPU
1.1.2 The SIMD Vector Unit
1.1.3 The GPU
1.2 OpenMP: A Single Code-Base for Heterogeneous Hardware
1.3 The Structure of This Book
1.4 Supplementary Materials
2 OpenMP Overview
2.1 Threads: Basic Concepts
2.2 OpenMP: Basic Syntax
2.3 The Fundamental Design Patterns of OpenMP
2.3.1 The SPMD Pattern
2.3.2 The Loop-Level Parallelism Pattern
2.3.3 The Divide-and-Conquer Pattern
2.3.3.1 Tasks in OpenMP
2.3.3.2 Parallelizing Divide-and-Conquer
2.4 Task Execution
2.5 Our Journey Ahead
II THE GPU COMMON CORE
3 Running Parallel Code on a GPU
3.1 Target Construct: Offloading Execution onto a Device
3.2 Moving Data between the Host and a Device
3.2.1 Scalar Variables
3.2.2 Arrays on the Stack
3.2.3 Derived Types
3.3 Parallel Execution on the Target Device
3.4 Concurrency and the Loop Construct
3.5 Example: Walking through Matrix Multiplication
4 Memory Movement
4.1 OpenMP Array Syntax
4.2 Sharing Data Explicitly with the Map Clause
4.2.1 The Map Clause
4.2.2 Example: Vector Add on the Heap
4.2.3 Example: Mapping Arrays in Matrix Multiplication
4.3 Reductions and Mapping the Result from the Device
4.4 Optimizing Data Movement
4.4.1 Target Data Construct
4.4.2 Target Update Directive
4.4.3 Target Enter/Exit Data
4.4.4 Pointer Swapping
4.5 Summary
5 Using the GPU Common Core
5.1 Recap of the GPU Common Core
5.2 The Eightfold Path to Performance
5.2.1 Portability
5.2.2 Libraries
5.2.3 The Right Algorithm
5.2.4 Occupancy
5.2.5 Converged Execution Flow
5.2.6 Data Movement
5.2.7 Memory Coalescence
5.2.8 Load Balance
5.3 Concluding the GPU Common Core
III BEYOND THE COMMON CORE
6 Managing a GPU’s Hierarchical Parallelism
6.1 Parallel Threads
6.2 League of Teams of Threads
6.2.1 Controlling the Number of Teams and Threads
6.2.2 Distributing Work between Teams
6.3 Hierarchical Parallelism in Practice
6.3.1 Example: Batched Matrix Multiplication
6.3.2 Example: Batched Gaussian Elimination
6.4 Hierarchical Parallelism and the Loop Directive
6.4.1 Combined Constructs that Include Loop
6.4.2 Reductions and Combined Constructs
6.4.3 The Bind Clause
6.5 Summary
7 Revisiting Data Movement
7.1 Manipulating the Device Data Environment
7.1.1 Allocating and Deleting Variables
7.1.2 Map Type Modifiers
7.1.3 Changing the Default Mapping
7.2 Compiling External Functions and Static Variables for the Device
7.3 User-Defined Mappers
7.4 Team-Only Memory
7.5 Becoming a Cartographer: Mapping Device Memory by Hand
7.6 Unified Shared Memory for Productivity
7.7 Summary
8 Asynchronous Offload to Multiple GPUs
8.1 Device Discovery
8.2 Selecting a Default Device
8.3 Offload to Multiple Devices
8.3.1 Reverse Offload
8.4 Conditional Offload
8.5 Asynchronous Offload
8.5.1 Task Dependencies
8.5.2 Asynchronous Data Transfers
8.5.3 Task Reductions
8.6 Summary
9 Working with External Runtime Environments
9.1 Calling External Library Routines from OpenMP
9.2 Sharing OpenMP Data with Foreign Functions
9.2.1 The Need for Synchronization
9.2.2 Example: Sharing OpenMP Data with cuBLAS
9.3 Using Data from a Foreign Runtime with OpenMP
9.3.1 Example: Sharing cuBLAS Data with OpenMP
9.3.2 Avoiding Unportable Code
9.4 Direct Control of Foreign Runtimes
9.4.1 Query Properties of the Foreign Runtime
9.4.2 Using the Interop Construct to Correctly Synchronize with Foreign Functions
9.4.3 Non-blocking Synchronization with a Foreign Runtime
9.4.4 Example: Calling CUDA Kernels without Blocking
9.5 Enhanced Portability Using Variant Directives
9.5.1 Declaring Function Variants
9.5.1.1 OpenMP Context and the Match Clause
9.5.1.2 Modifying Variant Function Arguments
9.5.2 Controlling Variant Substitution with the Dispatch Construct
9.5.3 Putting It All Together
10 OpenMP and the Future of Heterogeneous Computing
Appendix: Reference Guide
A.1 Programming a CPU with OpenMP
A.2 Directives and Constructs for the GPU
A.2.1 Parallelism with Loop, Teams, and Worksharing Constructs
A.2.2 Constructs for Interoperability
A.2.3 Constructs for Device Data Environment Manipulation
A.3 Combined Constructs
A.4 Internal Control Variables, Environment Variables, and OpenMP API Functions
Glossary
References
Subject Index