The CUDA Handbook: A Comprehensive Guide to GPU Programming

Author: Nicholas Wilt
Publisher: Addison-Wesley Professional
Year: 2013
Language: English
Pages: 521

Contents
Preface
Acknowledgments
About the Author

PART I
Chapter 1: Background
1.1 Our Approach
1.2 Code
1.3 Administrative Items
1.4 Road Map
Chapter 2: Hardware Architecture
2.1 CPU Configurations
2.2 Integrated GPUs
2.3 Multiple GPUs
2.4 Address Spaces in CUDA
2.5 CPU/GPU Interactions
2.6 GPU Architecture
2.7 Further Reading
Chapter 3: Software Architecture
3.1 Software Layers
3.2 Devices and Initialization
3.3 Contexts
3.4 Modules and Functions
3.5 Kernels (Functions)
3.6 Device Memory
3.7 Streams and Events
3.8 Host Memory
3.9 CUDA Arrays and Texturing
3.10 Graphics Interoperability
3.11 The CUDA Runtime and CUDA Driver API
Chapter 4: Software Environment
4.1 nvcc—CUDA Compiler Driver
4.2 ptxas—the PTX Assembler
4.3 cuobjdump
4.4 nvidia-smi
4.5 Amazon Web Services

PART II
Chapter 5: Memory
5.1 Host Memory
5.2 Global Memory
5.3 Constant Memory
5.4 Local Memory
5.5 Texture Memory
5.6 Shared Memory
5.7 Memory Copy
Chapter 6: Streams and Events
6.1 CPU/GPU Concurrency: Covering Driver Overhead
6.2 Asynchronous Memcpy
6.3 CUDA Events: CPU/GPU Synchronization
6.4 CUDA Events: Timing
6.5 Concurrent Copying and Kernel Processing
6.6 Mapped Pinned Memory
6.7 Concurrent Kernel Processing
6.8 GPU/GPU Synchronization: cudaStreamWaitEvent
6.9 Source Code Reference
Chapter 7: Kernel Execution
7.1 Overview
7.2 Syntax
7.3 Blocks, Threads, Warps, and Lanes
7.4 Occupancy
7.5 Dynamic Parallelism
Chapter 8: Streaming Multiprocessors
8.1 Memory
8.2 Integer Support
8.3 Floating-Point Support
8.4 Conditional Code
8.5 Textures and Surfaces
8.6 Miscellaneous Instructions
8.7 Instruction Sets
Chapter 9: Multiple GPUs
9.1 Overview
9.2 Peer-to-Peer
9.3 UVA: Inferring Device from Address
9.4 Inter-GPU Synchronization
9.5 Single-Threaded Multi-GPU
9.6 Multithreaded Multi-GPU
Chapter 10: Texturing
10.1 Overview
10.2 Texture Memory
10.3 1D Texturing
10.4 Texture as a Read Path
10.5 Texturing with Unnormalized Coordinates
10.6 Texturing with Normalized Coordinates
10.7 1D Surface Read/Write
10.8 2D Texturing
10.9 2D Texturing: Copy Avoidance
10.10 3D Texturing
10.11 Layered Textures
10.12 Optimal Block Sizing and Performance
10.13 Texturing Quick References

PART III
Chapter 11: Streaming Workloads
11.1 Device Memory
11.2 Asynchronous Memcpy
11.3 Streams
11.4 Mapped Pinned Memory
11.5 Performance and Summary
Chapter 12: Reduction
12.1 Overview
12.2 Two-Pass Reduction
12.3 Single-Pass Reduction
12.4 Reduction with Atomics
12.5 Arbitrary Block Sizes
12.6 Reduction Using Arbitrary Data Types
12.7 Predicate Reduction
12.8 Warp Reduction with Shuffle
Chapter 13: Scan
13.1 Definition and Variations
13.2 Overview
13.3 Scan and Circuit Design
13.4 CUDA Implementations
13.5 Warp Scans
13.6 Stream Compaction
13.7 References (Parallel Scan Algorithms)
13.8 Further Reading (Parallel Prefix Sum Circuits)
Chapter 14: N-Body
14.1 Introduction
14.2 Naïve Implementation
14.3 Shared Memory
14.4 Constant Memory
14.5 Warp Shuffle
14.6 Multiple GPUs and Scalability
14.7 CPU Optimizations
14.8 Conclusion
14.9 References and Further Reading
Chapter 15: Image Processing: Normalized Correlation
15.1 Overview
15.2 Naïve Texture-Texture Implementation
15.3 Template in Constant Memory
15.4 Image in Shared Memory
15.5 Further Optimizations
15.6 Source Code
15.7 Performance
15.8 Further Reading
Appendix A: The CUDA Handbook Library
A.1 Timing
A.2 Threading
A.3 Driver API Facilities
A.4 Shmoos
A.5 Command Line Parsing
A.6 Error Handling
Glossary
Index