Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL

Learn how to accelerate C++ programs using data parallelism. This open access book enables C++ programmers to be at the forefront of this exciting and important new development that is helping to push computing to new levels. It is full of practical advice, detailed explanations, and code examples to illustrate key topics. Data parallelism in C++ enables access to the parallel resources of a modern heterogeneous system, freeing you from being locked into any particular computing device. A single C++ application can now use any combination of devices, including GPUs, CPUs, FPGAs, and AI ASICs, that are suitable to the problems at hand.

The book begins by introducing data parallelism and foundational topics for effective use of the SYCL standard from the Khronos Group and Data Parallel C++ (DPC++), the open source compiler used in this book. Later chapters cover advanced topics including error handling, hardware-specific programming, communication and synchronization, and memory model considerations. Data Parallel C++ provides you with everything needed to use SYCL for programming heterogeneous systems.

What You'll Learn
Accelerate C++ programs using data-parallel programming
Target multiple device types (e.g., CPU, GPU, FPGA)
Use SYCL and SYCL compilers
Connect with computing’s heterogeneous future via Intel’s oneAPI initiative

Who This Book Is For
Those new to data-parallel programming, and computer programmers interested in data-parallel programming using C++.
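As a taste of the style of program the book dissects in its opening chapter, here is a minimal sketch of a complete SYCL application: vector addition offloaded to whatever device the runtime selects. This is an illustrative sketch, not an excerpt from the book, and it assumes a SYCL 2020 compiler such as the DPC++ compiler (e.g., icpx -fsycl vadd.cpp).

```cpp
// Vector addition in SYCL: C = A + B, run on a default-selected device.
// Sketch assuming a SYCL 2020 compiler; older DPC++ releases used <CL/sycl.hpp>.
#include <sycl/sycl.hpp>

#include <iostream>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

  sycl::queue q;  // default device selection: GPU, CPU, FPGA emulator, ...
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";

  {
    // Buffers wrap the host data; accessors declare how the kernel uses it.
    sycl::buffer bufA{a}, bufB{b}, bufC{c};
    q.submit([&](sycl::handler& h) {
      sycl::accessor A{bufA, h, sycl::read_only};
      sycl::accessor B{bufB, h, sycl::read_only};
      sycl::accessor C{bufC, h, sycl::write_only, sycl::no_init};
      h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
        C[i] = A[i] + B[i];
      });
    });
  }  // buffer destruction waits for the kernel and copies results back to c

  std::cout << "c[0] = " << c[0] << " (expected 3)\n";
  return 0;
}
```

Destroying the buffers at the end of the scope waits for the kernel to finish and writes the results back to the host vectors; Chapters 3 (Data Management) and 7 (Buffers) cover this data-management model in detail.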

Author(s): James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xinmin Tian
Edition: 1
Publisher: Apress
Year: 2021

Language: English
Pages: 565
Tags: c++20 c++17 c++ sycl dpc++ parallel intel

Table of Contents
About the Authors
Preface
Acknowledgments
Chapter 1: Introduction
Read the Book, Not the Spec
SYCL 1.2.1 vs. SYCL 2020, and DPC++
Getting a DPC++ Compiler
Book GitHub
Hello, World! and a SYCL Program Dissection
Queues and Actions
It Is All About Parallelism
Throughput
Latency
Think Parallel
Amdahl and Gustafson
Scaling
Heterogeneous Systems
Data-Parallel Programming
Key Attributes of DPC++ and SYCL
Single-Source
Host
Devices
Sharing Devices
Kernel Code
Kernel: Vector Addition (DAXPY)
Asynchronous Task Graphs
Race Conditions When We Make a Mistake
C++ Lambda Functions
Portability and Direct Programming
Concurrency vs. Parallelism
Summary
Chapter 2: Where Code Executes
Single-Source
Host Code
Device Code
Choosing Devices
Method#1: Run on a Device of Any Type
Queues
Binding a Queue to a Device, When Any Device Will Do
Method#2: Using the Host Device for Development and Debugging
Method#3: Using a GPU (or Other Accelerators)
Device Types
Accelerator Devices
Device Selectors
When Device Selection Fails
Method#4: Using Multiple Devices
Method#5: Custom (Very Specific) Device Selection
device_selector Base Class
Mechanisms to Score a Device
Three Paths to Device Code Execution on CPU
Creating Work on a Device
Introducing the Task Graph
Where Is the Device Code?
Actions
Fallback
Summary
Chapter 3: Data Management
Introduction
The Data Management Problem
Device Local vs. Device Remote
Managing Multiple Memories
Explicit Data Movement
Implicit Data Movement
Selecting the Right Strategy
USM, Buffers, and Images
Unified Shared Memory
Accessing Memory Through Pointers
USM and Data Movement
Explicit Data Movement in USM
Implicit Data Movement in USM
Buffers
Creating Buffers
Accessing Buffers
Access Modes
Ordering the Uses of Data
In-order Queues
Out-of-Order (OoO) Queues
Explicit Dependences with Events
Implicit Dependences with Accessors
Choosing a Data Management Strategy
Handler Class: Key Members
Summary
Chapter 4: Expressing Parallelism
Parallelism Within Kernels
Multidimensional Kernels
Loops vs. Kernels
Overview of Language Features
Separating Kernels from Host Code
Different Forms of Parallel Kernels
Basic Data-Parallel Kernels
Understanding Basic Data-Parallel Kernels
Writing Basic Data-Parallel Kernels
Details of Basic Data-Parallel Kernels
The range Class
The id Class
The item Class
Explicit ND-Range Kernels
Understanding Explicit ND-Range Parallel Kernels
Work-Items
Work-Groups
Sub-Groups
Writing Explicit ND-Range Data-Parallel Kernels
Details of Explicit ND-Range Data-Parallel Kernels
The nd_range Class
The nd_item Class
The group Class
The sub_group Class
Hierarchical Parallel Kernels
Understanding Hierarchical Data-Parallel Kernels
Writing Hierarchical Data-Parallel Kernels
Details of Hierarchical Data-Parallel Kernels
The h_item Class
The private_memory Class
Mapping Computation to Work-Items
One-to-One Mapping
Many-to-One Mapping
Choosing a Kernel Form
Summary
Chapter 5: Error Handling
Safety First
Types of Errors
Let’s Create Some Errors!
Synchronous Error
Asynchronous Error
Application Error Handling Strategy
Ignoring Error Handling
Synchronous Error Handling
Asynchronous Error Handling
The Asynchronous Handler
Invocation of the Handler
Errors on a Device
Summary
Chapter 6: Unified Shared Memory
Why Should We Use USM?
Allocation Types
Device Allocations
Host Allocations
Shared Allocations
Allocating Memory
What Do We Need to Know?
Multiple Styles
Allocations à la C
Allocations à la C++
C++ Allocators
Deallocating Memory
Allocation Example
Data Management
Initialization
Data Movement
Explicit
Implicit
Migration
Fine-Grained Control
Queries
Summary
Chapter 7: Buffers
Buffers
Creation
Buffer Properties
use_host_ptr
use_mutex
context_bound
What Can We Do with a Buffer?
Accessors
Accessor Creation
What Can We Do with an Accessor?
Summary
Chapter 8: Scheduling Kernels and Data Movement
What Is Graph Scheduling?
How Graphs Work in DPC++
Command Group Actions
How Command Groups Declare Dependences
Examples
When Are the Parts of a CG Executed?
Data Movement
Explicit
Implicit
Synchronizing with the Host
Summary
Chapter 9: Communication and Synchronization
Work-Groups and Work-Items
Building Blocks for Efficient Communication
Synchronization via Barriers
Work-Group Local Memory
Using Work-Group Barriers and Local Memory
Work-Group Barriers and Local Memory in ND-Range Kernels
Local Accessors
Synchronization Functions
A Full ND-Range Kernel Example
Work-Group Barriers and Local Memory in Hierarchical Kernels
Scopes for Local Memory and Barriers
A Full Hierarchical Kernel Example
Sub-Groups
Synchronization via Sub-Group Barriers
Exchanging Data Within a Sub-Group
A Full Sub-Group ND-Range Kernel Example
Collective Functions
Broadcast
Votes
Shuffles
Loads and Stores
Summary
Chapter 10: Defining Kernels
Why Three Ways to Represent a Kernel?
Kernels As Lambda Expressions
Elements of a Kernel Lambda Expression
Naming Kernel Lambda Expressions
Kernels As Named Function Objects
Elements of a Kernel Named Function Object
Interoperability with Other APIs
Interoperability with API-Defined Source Languages
Interoperability with API-Defined Kernel Objects
Kernels in Program Objects
Summary
Chapter 11: Vectors
How to Think About Vectors
Vector Types
Vector Interface
Load and Store Member Functions
Swizzle Operations
Vector Execution Within a Parallel Kernel
Vector Parallelism
Summary
Chapter 12: Device Information
Refining Kernel Code to Be More Prescriptive
How to Enumerate Devices and Capabilities
Custom Device Selector
Being Curious: get_info<>
Being More Curious: Detailed Enumeration Code
Inquisitive: get_info<>
Device Information Descriptors
Device-Specific Kernel Information Descriptors
The Specifics: Those of “Correctness”
Device Queries
Kernel Queries
The Specifics: Those of “Tuning/Optimization”
Device Queries
Kernel Queries
Runtime vs. Compile-Time Properties
Summary
Chapter 13: Practical Tips
Getting a DPC++ Compiler and Code Samples
Online Forum and Documentation
Platform Model
Multiarchitecture Binaries
Compilation Model
Adding SYCL to Existing C++ Programs
Debugging
Debugging Kernel Code
Debugging Runtime Failures
Initializing Data and Accessing Kernel Outputs
Multiple Translation Units
Performance Implications of Multiple Translation Units
When Anonymous Lambdas Need Names
Migrating from CUDA to SYCL
Summary
Chapter 14: Common Parallel Patterns
Understanding the Patterns
Map
Stencil
Reduction
Scan
Pack and Unpack
Pack
Unpack
Using Built-In Functions and Libraries
The DPC++ Reduction Library
The reduction Class
The reducer Class
User-Defined Reductions
oneAPI DPC++ Library
Group Functions
Direct Programming
Map
Stencil
Reduction
Scan
Pack and Unpack
Pack
Unpack
Summary
For More Information
Chapter 15: Programming for GPUs
Performance Caveats
How GPUs Work
GPU Building Blocks
Simpler Processors (but More of Them)
Expressing Parallelism
Expressing More Parallelism
Simplified Control Logic (SIMD Instructions)
Predication and Masking
SIMD Efficiency
SIMD Efficiency and Groups of Items
Switching Work to Hide Latency
Offloading Kernels to GPUs
SYCL Runtime Library
GPU Software Drivers
GPU Hardware
Beware the Cost of Offloading!
Transfers to and from Device Memory
GPU Kernel Best Practices
Accessing Global Memory
Accessing Work-Group Local Memory
Avoiding Local Memory Entirely with Sub-Groups
Optimizing Computation Using Small Data Types
Optimizing Math Functions
Specialized Functions and Extensions
Summary
For More Information
Chapter 16: Programming for CPUs
Performance Caveats
The Basics of a General-Purpose CPU
The Basics of SIMD Hardware
Exploiting Thread-Level Parallelism
Thread Affinity Insight
Be Mindful of First Touch to Memory
SIMD Vectorization on CPU
Ensure SIMD Execution Legality
SIMD Masking and Cost
Avoid Array-of-Struct for SIMD Efficiency
Data Type Impact on SIMD Efficiency
SIMD Execution Using single_task
Summary
Chapter 17: Programming for FPGAs
Performance Caveats
How to Think About FPGAs
Pipeline Parallelism
Kernels Consume Chip “Area”
When to Use an FPGA
Lots and Lots of Work
Custom Operations or Operation Widths
Scalar Data Flow
Low Latency and Rich Connectivity
Customized Memory Systems
Running on an FPGA
Compile Times
The FPGA Emulator
FPGA Hardware Compilation Occurs “Ahead-of-Time”
Writing Kernels for FPGAs
Exposing Parallelism
Keeping the Pipeline Busy Using ND-Ranges
Pipelines Do Not Mind Data Dependences!
Spatial Pipeline Implementation of a Loop
Loop Initiation Interval
Pipes
Blocking and Non-blocking Pipe Accesses
For More Information on Pipes
Custom Memory Systems
Some Closing Topics
FPGA Building Blocks
Clock Frequency
Summary
Chapter 18: Libraries
Built-In Functions
Use the sycl:: Prefix with Built-In Functions
DPC++ Library
Standard C++ APIs in DPC++
DPC++ Parallel STL
DPC++ Execution Policy
FPGA Execution Policy
Using DPC++ Parallel STL
Using Parallel STL with USM
Error Handling with DPC++ Execution Policies
Summary
Chapter 19: Memory Model and Atomics
What Is in a Memory Model?
Data Races and Synchronization
Barriers and Fences
Atomic Operations
Memory Ordering
The Memory Model
The memory_order Enumeration Class
The memory_scope Enumeration Class
Querying Device Capabilities
Barriers and Fences
Atomic Operations in DPC++
The atomic Class
The atomic_ref Class
Using Atomics with Buffers
Using Atomics with Unified Shared Memory
Using Atomics in Real Life
Computing a Histogram
Implementing Device-Wide Synchronization
Summary
For More Information
Epilogue: Future Direction of DPC++
Alignment with C++20 and C++23
Address Spaces
Extension and Specialization Mechanism
Hierarchical Parallelism
Summary
For More Information
Index