Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL

"This book, now in is second edition, is the premier resource to learn SYCL 2020 and is the ONLY book you need to become part of this community." Erik Lindahl, GROMACS and Stockholm University


Learn how to accelerate C++ programs using data parallelism and SYCL.

This open access book enables C++ programmers to be at the forefront of this exciting and important development that is helping to push computing to new levels. This updated second edition is full of practical advice, detailed explanations, and code examples to illustrate key topics.

SYCL enables access to the parallel resources of modern accelerated, heterogeneous systems. A single C++ application can use any combination of devices, including GPUs, CPUs, FPGAs, and ASICs, that are suited to the problems at hand.
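
To give a flavor of what this looks like in practice, here is a minimal sketch (not one of the book's examples) of a SYCL 2020 program that submits a data-parallel kernel to whichever device the default selector picks. It assumes a compiler with SYCL support and a device that supports shared USM allocations:

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
      // A default-constructed queue binds to whatever device the
      // default selector chooses (a GPU if present, otherwise a CPU, ...).
      sycl::queue q;
      std::cout << "Running on: "
                << q.get_device().get_info<sycl::info::device::name>() << "\n";

      constexpr size_t N = 1024;
      // Shared USM allocation, visible to both host and device
      // (assumes the device supports shared allocations).
      int* data = sycl::malloc_shared<int>(N, q);

      // A basic data-parallel kernel: one work-item per element.
      q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
        data[i] = static_cast<int>(i[0]) * 2;
      }).wait();

      std::cout << "data[42] = " << data[42] << "\n";
      sycl::free(data, q);
      return 0;
    }

The concepts this sketch touches on (queues, device selection, unified shared memory, and parallel_for kernels) are covered in the early chapters listed in the table of contents below.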

This book teaches data-parallel programming using C++ with SYCL and walks through everything needed to program accelerated systems. The book begins by introducing data parallelism and foundational topics for effective use of SYCL. Later chapters cover advanced topics, including error handling, hardware-specific programming, communication and synchronization, and memory model considerations.

All source code for the examples used in this book is freely available on GitHub. The examples are written in modern SYCL and are regularly updated to ensure compatibility with multiple compilers.


What You Will Learn

  • Accelerate C++ programs using data-parallel programming
  • Use SYCL and C++ compilers that support SYCL
  • Write portable code for accelerators that is vendor and device agnostic
  • Optimize code to improve performance for specific accelerators
  • Be poised to benefit as new accelerators appear from many vendors

Who This Book Is For

Programmers new to data-parallel programming, and computer programmers interested in data-parallel programming using C++


This is an open access book.

Author(s): James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xinmin Tian
Edition: 2
Publisher: Apress
Year: 2023

Language: English
Commentary: Publisher PDF | Publication date: 18 October 2023
Pages: 660
City: Berkeley, CA
Tags: C++; Data Parallel C++; DPC++; FPGA Programming; GPU Programming; Parallel Programming; Data Parallelism; SYCL; Intel oneAPI

Table of Contents
About the Authors
Preface
Foreword
Acknowledgments
Chapter 1: Introduction
Read the Book, Not the Spec
SYCL 2020 and DPC++
Why Not CUDA?
Why Standard C++ with SYCL?
Getting a C++ Compiler with SYCL Support
Hello, World! and a SYCL Program Dissection
Queues and Actions
It Is All About Parallelism
Throughput
Latency
Think Parallel
Amdahl and Gustafson
Scaling
Heterogeneous Systems
Data-Parallel Programming
Key Attributes of C++ with SYCL
Single-Source
Host
Devices
Sharing Devices
Kernel Code
Kernel: Vector Addition (DAXPY)
Asynchronous Execution
Race Conditions When We Make a Mistake
Deadlock
C++ Lambda Expressions
Functional Portability and Performance Portability
Concurrency vs. Parallelism
Summary
Chapter 2: Where Code Executes
Single-Source
Host Code
Device Code
Choosing Devices
Method#1: Run on a Device of Any Type
Queues
Binding a Queue to a Device When Any Device Will Do
Method#2: Using a CPU Device for Development, Debugging, and Deployment
Method#3: Using a GPU (or Other Accelerators)
Accelerator Devices
Device Selectors
When Device Selection Fails
Method#4: Using Multiple Devices
Method#5: Custom (Very Specific) Device Selection
Selection Based on Device Aspects
Selection Through a Custom Selector
Mechanisms to Score a Device
Creating Work on a Device
Introducing the Task Graph
Where Is the Device Code?
Actions
Host Tasks
Summary
Chapter 3: Data Management
Introduction
The Data Management Problem
Device Local vs. Device Remote
Managing Multiple Memories
Explicit Data Movement
Implicit Data Movement
Selecting the Right Strategy
USM, Buffers, and Images
Unified Shared Memory
Accessing Memory Through Pointers
USM and Data Movement
Explicit Data Movement in USM
Implicit Data Movement in USM
Buffers
Creating Buffers
Accessing Buffers
Access Modes
Ordering the Uses of Data
In-order Queues
Out-of-Order Queues
Explicit Dependences with Events
Implicit Dependences with Accessors
Choosing a Data Management Strategy
Handler Class: Key Members
Summary
Chapter 4: Expressing Parallelism
Parallelism Within Kernels
Loops vs. Kernels
Multidimensional Kernels
Overview of Language Features
Separating Kernels from Host Code
Different Forms of Parallel Kernels
Basic Data-Parallel Kernels
Understanding Basic Data-Parallel Kernels
Writing Basic Data-Parallel Kernels
Details of Basic Data-Parallel Kernels
The range Class
The id Class
The item Class
Explicit ND-Range Kernels
Understanding Explicit ND-Range Parallel Kernels
Work-Items
Work-Groups
Sub-Groups
Writing Explicit ND-Range Data-Parallel Kernels
Details of Explicit ND-Range Data-Parallel Kernels
The nd_range Class
The nd_item Class
The group Class
The sub_group Class
Mapping Computation to Work-Items
One-to-One Mapping
Many-to-One Mapping
Choosing a Kernel Form
Summary
Chapter 5: Error Handling
Safety First
Types of Errors
Let’s Create Some Errors!
Synchronous Error
Asynchronous Error
Application Error Handling Strategy
Ignoring Error Handling
Synchronous Error Handling
Asynchronous Error Handling
The Asynchronous Handler
Invocation of the Handler
Errors on a Device
Summary
Chapter 6: Unified Shared Memory
Why Should We Use USM?
Allocation Types
Device Allocations
Host Allocations
Shared Allocations
Allocating Memory
What Do We Need to Know?
Multiple Styles
Allocations à la C
Allocations à la C++
C++ Allocators
Deallocating Memory
Allocation Example
Data Management
Initialization
Data Movement
Explicit
Implicit
Migration
Fine-Grained Control
Queries
One More Thing
Summary
Chapter 7: Buffers
Buffers
Buffer Creation
Buffer Properties
use_host_ptr
use_mutex
context_bound
What Can We Do with a Buffer?
Accessors
Accessor Creation
What Can We Do with an Accessor?
Summary
Chapter 8: Scheduling Kernels and Data Movement
What Is Graph Scheduling?
How Graphs Work in SYCL
Command Group Actions
How Command Groups Declare Dependences
Examples
When Are the Parts of a Command Group Executed?
Data Movement
Explicit Data Movement
Implicit Data Movement
Synchronizing with the Host
Summary
Chapter 9: Communication and Synchronization
Work-Groups and Work-Items
Building Blocks for Efficient Communication
Synchronization via Barriers
Work-Group Local Memory
Using Work-Group Barriers and Local Memory
Work-Group Barriers and Local Memory in ND-Range Kernels
Local Accessors
Synchronization Functions
A Full ND-Range Kernel Example
Sub-Groups
Synchronization via Sub-Group Barriers
Exchanging Data Within a Sub-Group
A Full Sub-Group ND-Range Kernel Example
Group Functions and Group Algorithms
Broadcast
Votes
Shuffles
Summary
Chapter 10: Defining Kernels
Why Three Ways to Represent a Kernel?
Kernels as Lambda Expressions
Elements of a Kernel Lambda Expression
Identifying Kernel Lambda Expressions
Kernels as Named Function Objects
Elements of a Kernel Named Function Object
Kernels in Kernel Bundles
Interoperability with Other APIs
Summary
Chapter 11: Vectors and Math Arrays
The Ambiguity of Vector Types
Our Mental Model for SYCL Vector Types
Math Array (marray)
Vector (vec)
Loads and Stores
Interoperability with Backend-Native Vector Types
Swizzle Operations
How Vector Types Execute
Vectors as Convenience Types
Vectors as SIMD Types
Summary
Chapter 12: Device Information and Kernel Specialization
Is There a GPU Present?
Refining Kernel Code to Be More Prescriptive
How to Enumerate Devices and Capabilities
Aspects
Custom Device Selector
Being Curious: get_info<>
Being More Curious: Detailed Enumeration Code
Very Curious: get_info plus has()
Device Information Descriptors
Device-Specific Kernel Information Descriptors
The Specifics: Those of “Correctness”
Device Queries
Kernel Queries
The Specifics: Those of “Tuning/Optimization”
Device Queries
Kernel Queries
Runtime vs. Compile-Time Properties
Kernel Specialization
Summary
Chapter 13: Practical Tips
Getting the Code Samples and a Compiler
Online Resources
Platform Model
Multiarchitecture Binaries
Compilation Model
Contexts: Important Things to Know
Adding SYCL to Existing C++ Programs
Considerations When Using Multiple Compilers
Debugging
Debugging Deadlock and Other Synchronization Issues
Debugging Kernel Code
Debugging Runtime Failures
Queue Profiling and Resulting Timing Capabilities
Tracing and Profiling Tools Interfaces
Initializing Data and Accessing Kernel Outputs
Multiple Translation Units
Performance Implication of Multiple Translation Units
When Anonymous Lambdas Need Names
Summary
Chapter 14: Common Parallel Patterns
Understanding the Patterns
Map
Stencil
Reduction
Scan
Pack and Unpack
Pack
Unpack
Using Built-In Functions and Libraries
The SYCL Reduction Library
The reduction Class
The reducer Class
User-Defined Reductions
Group Algorithms
Direct Programming
Map
Stencil
Reduction
Scan
Pack and Unpack
Pack
Unpack
Summary
For More Information
Chapter 15: Programming for GPUs
Performance Caveats
How GPUs Work
GPU Building Blocks
Simpler Processors (but More of Them)
Expressing Parallelism
Expressing More Parallelism
Simplified Control Logic (SIMD Instructions)
Predication and Masking
SIMD Efficiency
SIMD Efficiency and Groups of Items
Switching Work to Hide Latency
Offloading Kernels to GPUs
SYCL Runtime Library
GPU Software Drivers
GPU Hardware
Beware the Cost of Offloading!
Transfers to and from Device Memory
GPU Kernel Best Practices
Accessing Global Memory
Accessing Work-Group Local Memory
Avoiding Local Memory Entirely with Sub-Groups
Optimizing Computation Using Small Data Types
Optimizing Math Functions
Specialized Functions and Extensions
Summary
For More Information
Chapter 16: Programming for CPUs
Performance Caveats
The Basics of Multicore CPUs
The Basics of SIMD Hardware
Exploiting Thread-Level Parallelism
Thread Affinity Insight
Be Mindful of First Touch to Memory
SIMD Vectorization on CPU
Ensure SIMD Execution Legality
SIMD Masking and Cost
Avoid Array of Struct for SIMD Efficiency
Data Type Impact on SIMD Efficiency
SIMD Execution Using single_task
Summary
Chapter 17: Programming for FPGAs
Performance Caveats
How to Think About FPGAs
Pipeline Parallelism
Kernels Consume Chip “Area”
When to Use an FPGA
Lots and Lots of Work
Custom Operations or Operation Widths
Scalar Data Flow
Low Latency and Rich Connectivity
Customized Memory Systems
Running on an FPGA
Compile Times
The FPGA Emulator
FPGA Hardware Compilation Occurs “Ahead-of-Time”
Writing Kernels for FPGAs
Exposing Parallelism
Keeping the Pipeline Busy Using ND-Ranges
Pipelines Do Not Mind Data Dependences!
Spatial Pipeline Implementation of a Loop
Loop Initiation Interval
Pipes
Blocking and Non-blocking Pipe Accesses
For More Information on Pipes
Custom Memory Systems
Some Closing Topics
FPGA Building Blocks
Clock Frequency
Summary
Chapter 18: Libraries
Built-In Functions
Use the sycl:: Prefix with Built-In Functions
The C++ Standard Library
oneAPI DPC++ Library (oneDPL)
SYCL Execution Policy
Using oneDPL with Buffers
Using oneDPL with USM
Error Handling with SYCL Execution Policies
Summary
Chapter 19: Memory Model and Atomics
What’s in a Memory Model?
Data Races and Synchronization
Barriers and Fences
Atomic Operations
Memory Ordering
The Memory Model
The memory_order Enumeration Class
The memory_scope Enumeration Class
Querying Device Capabilities
Barriers and Fences
Atomic Operations in SYCL
The atomic Class
The atomic_ref Class
Using Atomics with Buffers
Using Atomics with Unified Shared Memory
Using Atomics in Real Life
Computing a Histogram
Implementing Device-Wide Synchronization
Summary
For More Information
Chapter 20: Backend Interoperability
What Is Backend Interoperability?
When Is Backend Interoperability Useful?
Adding SYCL to an Existing Codebase
Using Existing Libraries with SYCL
Getting Backend Objects with Free Functions
Getting Backend Objects via an Interop Handle
Using Backend Interoperability for Kernels
Interoperability with API-Defined Kernel Objects
Interoperability with Non-SYCL Source Languages
Backend Interoperability Hints and Tips
Choosing a Device for a Specific Backend
Be Careful About Contexts!
Access Low-Level API-Specific Features
Support for Other Backends
Summary
Chapter 21: Migrating CUDA Code
Design Differences Between CUDA and SYCL
Multiple Targets vs. Single Device Targets
Aligning to C++ vs. Extending C++
Terminology Differences Between CUDA and SYCL
Similarities and Differences
Execution Model
In-Order vs. Out-of-Order Queues
Contiguous Dimension
Sub-Group Sizes (Warp Sizes)
Forward Progress Guarantees
Barriers
Memory Model
Barriers
Atomics and Fences
Other Differences
Item Classes vs. Built-In Variables
Contexts
Error Checking
Features in CUDA That Aren’t In SYCL… Yet!
Global Variables
Cooperative Groups
Matrix Multiplication Hardware
Porting Tools and Techniques
Migrating Code with dpct and SYCLomatic
Running dpct
Examining the dpct Output
Summary
For More Information
Epilogue: Future Direction of SYCL
Closer Alignment with C++11, C++14, and C++17
Adopting Features from C++20, C++23, and Beyond
Mixing SPMD and SIMD Programming
Address Spaces
Specialization Mechanism
Compile-Time Properties
Summary
For More Information
Index