«For those interested in the GPU path to parallel enlightenment, this new book from David Kirk and Wen-mei Hwu is a godsend, as it introduces CUDA™, a C-like data parallel language, and Tesla™, the architecture of the current generation of NVIDIA GPUs. In addition to explaining the language and the architecture, they define the nature of data parallel problems that run well on the heterogeneous CPU-GPU hardware ... This book is a valuable addition to the recently reinvigorated parallel computing literature.» - David Patterson, Director of the Parallel Computing Research Laboratory and the Pardee Professor of Computer Science, U.C. Berkeley; co-author of Computer Architecture: A Quantitative Approach
«Written by two teaching pioneers, this book is the definitive practical reference on programming massively parallel processors--a true technological gold mine. The hands-on learning included is cutting-edge, yet very readable. This is a most rewarding read for students, engineers, and scientists interested in supercharging computational resources to solve today's and tomorrow's hardest problems.» - Nicolas Pinto, MIT, NVIDIA Fellow, 2009
«I have always admired Wen-mei Hwu's and David Kirk's ability to turn complex problems into easy-to-comprehend concepts. They have done it again in this book. This joint venture of a passionate teacher and a GPU evangelizer tackles the trade-off between the simple explanation of the concepts and the in-depth analysis of the programming techniques. This is a great book to learn both massive parallel programming and CUDA.» - Mateo Valero, Director, Barcelona Supercomputing Center
«The use of GPUs is having a big impact in scientific computing. David Kirk and Wen-mei Hwu's new book is an important contribution towards educating our students on the ideas and techniques of programming for massively parallel processors.» - Mike Giles, Professor of Scientific Computing, University of Oxford
«This book is the most comprehensive and authoritative introduction to GPU computing yet. David Kirk and Wen-mei Hwu are the pioneers in this increasingly important field, and their insights are invaluable and fascinating. This book will be the standard reference for years to come.» - Hanspeter Pfister, Harvard University
«This is a vital and much-needed text. GPU programming is growing by leaps and bounds. This new book will be very welcome and highly useful across interdisciplinary fields.» - Shannon Steinfadt, Kent State University
«GPUs have hundreds of cores capable of delivering transformative performance increases across a wide range of computational challenges. The rise of these multi-core architectures has raised the need to teach advanced programmers a new and essential skill: how to program massively parallel processors.» - CNNMoney.com
Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs.
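The book's first hands-on example (Chapter 3) is a vector-addition kernel. As a taste of the CUDA C style it teaches, here is a minimal sketch written for this listing rather than excerpted from the book; the identifiers (vecAddKernel, h_A, d_A, n) are illustrative only.

    // Minimal CUDA C vector-addition sketch (illustrative, not from the book).
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output element
        if (i < n) C[i] = A[i] + B[i];                  // guard: the grid may overshoot n
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t size = n * sizeof(float);

        // Allocate and initialize host arrays.
        float *h_A = (float *)malloc(size), *h_B = (float *)malloc(size), *h_C = (float *)malloc(size);
        for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

        // Allocate device global memory and copy the inputs over.
        float *d_A, *d_B, *d_C;
        cudaMalloc((void **)&d_A, size);
        cudaMalloc((void **)&d_B, size);
        cudaMalloc((void **)&d_C, size);
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);

        // Copy the result back (cudaMemcpy synchronizes) and spot-check it.
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        printf("C[0] = %f (expected 3.0)\n", h_C[0]);

        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B); free(h_C);
        return 0;
    }

Compiled with nvcc, this is the kind of starting point that Chapters 3 through 6 of the book progressively refine into tiled, coalesced, and resource-aware kernels.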
Author(s): David B. Kirk, Wen-mei W. Hwu
Edition: 2
Publisher: Elsevier / Morgan Kaufmann
Year: 2012
Language: English
Pages: 514
Tags: Library; Computer literature; CUDA / OpenCL
Front Cover......Page 1
Programming Massively Parallel Processors......Page 4
Copyright Page......Page 5
Contents......Page 6
Preface......Page 14
How to Use the Book......Page 15
Tying It All Together: The Final Project......Page 16
Design Document......Page 17
Online Supplements......Page 18
Acknowledgements......Page 20
Dedication......Page 22
1 Introduction......Page 24
1.1 Heterogeneous Parallel Computing......Page 25
1.2 Architecture of a Modern GPU......Page 31
1.3 Why More Speed or Parallelism?......Page 33
1.4 Speeding Up Real Applications......Page 35
1.5 Parallel Programming Languages and Models......Page 37
1.6 Overarching Goals......Page 39
1.7 Organization of the Book......Page 40
References......Page 44
2.1 Evolution of Graphics Pipelines......Page 46
The Era of Fixed-Function Graphics Pipelines......Page 47
Evolution of Programmable Real-Time Graphics......Page 51
Unified Graphics and Computing Processors......Page 54
2.2 GPGPU: An Intermediate Step......Page 56
2.3 GPU Computing......Page 57
Scalable GPUs......Page 58
Recent Developments......Page 59
References and Further Reading......Page 60
3 Introduction to Data Parallelism and CUDA C......Page 64
3.1 Data Parallelism......Page 65
3.2 CUDA Program Structure......Page 66
3.3 A Vector Addition Kernel......Page 68
3.4 Device Global Memory and Data Transfer......Page 71
3.5 Kernel Functions and Threading......Page 76
Predefined Variables......Page 82
3.7 Exercises......Page 83
References......Page 85
4 Data-Parallel Execution Model......Page 86
4.1 CUDA Thread Organization......Page 87
4.2 Mapping Threads to Multidimensional Data......Page 91
4.3 Matrix–Matrix Multiplication—A More Complex Kernel......Page 97
4.4 Synchronization and Transparent Scalability......Page 104
4.5 Assigning Resources to Blocks......Page 106
4.6 Querying Device Properties......Page 108
4.7 Thread Scheduling and Latency Tolerance......Page 110
4.9 Exercises......Page 114
5 CUDA Memories......Page 118
5.1 Importance of Memory Access Efficiency......Page 119
5.2 CUDA Device Memory Types......Page 120
5.3 A Strategy for Reducing Global Memory Traffic......Page 128
5.4 A Tiled Matrix–Matrix Multiplication Kernel......Page 132
5.5 Memory as a Limiting Factor to Parallelism......Page 138
5.6 Summary......Page 141
5.7 Exercises......Page 142
6 Performance Considerations......Page 146
6.1 Warps and Thread Execution......Page 147
6.2 Global Memory Bandwidth......Page 155
6.3 Dynamic Partitioning of Execution Resources......Page 164
6.4 Instruction Mix and Thread Granularity......Page 166
6.6 Exercises......Page 168
References......Page 172
7 Floating-Point Considerations......Page 174
Normalized Representation of M......Page 175
Excess Encoding of E......Page 176
7.2 Representable Numbers......Page 178
7.3 Special Bit Patterns and Precision in IEEE Format......Page 183
7.4 Arithmetic Accuracy and Rounding......Page 184
7.5 Algorithm Considerations......Page 185
7.6 Numerical Stability......Page 187
7.7 Summary......Page 192
7.8 Exercises......Page 193
References......Page 194
8 Parallel Patterns: Convolution......Page 196
8.1 Background......Page 197
8.2 1D Parallel Convolution—A Basic Algorithm......Page 202
8.3 Constant Memory and Caching......Page 204
8.4 Tiled 1D Convolution with Halo Elements......Page 208
8.5 A Simpler Tiled 1D Convolution—General Caching......Page 215
8.6 Summary......Page 216
8.7 Exercises......Page 217
9 Parallel Patterns: Prefix Sum......Page 220
9.1 Background......Page 221
9.2 A Simple Parallel Scan......Page 223
9.3 Work Efficiency Considerations......Page 227
9.4 A Work-Efficient Parallel Scan......Page 228
9.5 Parallel Scan for Arbitrary-Length Inputs......Page 233
9.6 Summary......Page 237
9.7 Exercises......Page 238
Reference......Page 239
10 Parallel Patterns: Sparse Matrix–Vector Multiplication......Page 240
10.1 Background......Page 241
10.2 Parallel SpMV Using CSR......Page 245
10.3 Padding and Transposition......Page 247
10.4 Using Hybrid to Control Padding......Page 249
10.5 Sorting and Partitioning for Regularization......Page 253
10.6 Summary......Page 255
10.7 Exercises......Page 256
References......Page 257
11 Application Case Study: Advanced MRI Reconstruction......Page 258
11.1 Application Background......Page 259
11.2 Iterative Reconstruction......Page 262
11.3 Computing FHD......Page 264
Step 1: Determine the Kernel Parallelism Structure......Page 266
Step 2: Getting Around the Memory Bandwidth Limitation......Page 272
Step 3: Using Hardware Trigonometry Functions......Page 278
Step 4: Experimental Performance Tuning......Page 282
11.4 Final Evaluation......Page 283
11.5 Exercises......Page 285
References......Page 287
12 Application Case Study: Molecular Visualization and Analysis......Page 288
12.1 Application Background......Page 289
12.2 A Simple Kernel Implementation......Page 291
12.3 Thread Granularity Adjustment......Page 295
12.4 Memory Coalescing......Page 297
12.5 Summary......Page 300
References......Page 302
13 Parallel Programming and Computational Thinking......Page 304
13.1 Goals of Parallel Computing......Page 305
13.2 Problem Decomposition......Page 306
13.3 Algorithm Selection......Page 310
13.4 Computational Thinking......Page 316
13.6 Exercises......Page 317
References......Page 318
14.1 Background......Page 320
14.2 Data Parallelism Model......Page 322
14.3 Device Architecture......Page 324
14.4 Kernel Functions......Page 326
14.5 Device Management and Kernel Launch......Page 327
14.6 Electrostatic Potential Map in OpenCL......Page 330
14.7 Summary......Page 334
14.8 Exercises......Page 335
References......Page 336
15.1 OpenACC Versus CUDA C......Page 338
15.2 Execution Model......Page 341
15.3 Memory Model......Page 342
Parallel Region, Gangs, and Workers......Page 343
Gang Loop......Page 345
OpenACC Versus CUDA......Page 346
Vector Loop......Page 349
Prescriptive Versus Descriptive......Page 350
Ways to Help an OpenACC Compiler......Page 352
Data Clauses......Page 354
Data Construct......Page 355
Asynchronous Computation and Data Transfer......Page 358
15.5 Future Directions of OpenACC......Page 359
15.6 Exercises......Page 360
16.1 Background......Page 362
16.2 Motivation......Page 365
16.3 Basic Thrust Features......Page 366
Iterators and Memory Space......Page 367
Interoperability......Page 368
16.4 Generic Programming......Page 370
16.6 Programmer Productivity......Page 372
Real-World Performance......Page 373
16.7 Best Practices......Page 375
Fusion......Page 376
Structure of Arrays......Page 377
Implicit Ranges......Page 379
16.8 Exercises......Page 380
References......Page 381
17 CUDA FORTRAN......Page 382
17.1 CUDA FORTRAN and CUDA C Differences......Page 383
17.2 A First CUDA FORTRAN Program......Page 384
17.3 Multidimensional Array in CUDA FORTRAN......Page 386
17.4 Overloading Host/Device Routines With Generic Interfaces......Page 387
17.5 Calling CUDA C via iso_c_binding......Page 390
17.6 Kernel Loop Directives and Reduction Operations......Page 392
17.7 Dynamic Shared Memory......Page 393
17.8 Asynchronous Data Transfers......Page 394
17.9 Compilation and Profiling......Page 400
17.10 Calling Thrust from CUDA FORTRAN......Page 401
17.11 Exercises......Page 405
18 An Introduction to C++ AMP......Page 406
18.1 Core C++ AMP Features......Page 407
Explicit and Implicit Data Copies......Page 414
Asynchronous Operation......Page 416
18.3 Managing Accelerators......Page 418
18.4 Tiled Execution......Page 421
18.5 C++ AMP Graphics Features......Page 424
18.7 Exercises......Page 428
19 Programming a Heterogeneous Computing Cluster......Page 430
19.2 A Running Example......Page 431
19.3 MPI Basics......Page 433
19.4 MPI Point-to-Point Communication Types......Page 437
19.5 Overlapping Computation and Communication......Page 444
19.7 Summary......Page 454
19.8 Exercises......Page 455
Reference......Page 456
20 CUDA Dynamic Parallelism......Page 458
20.1 Background......Page 459
20.2 Dynamic Parallelism Overview......Page 461
Events......Page 462
Streams......Page 463
Synchronization Scope......Page 464
Local Memory......Page 465
Texture Memory......Page 466
20.5 A Simple Example......Page 467
Memory Footprint......Page 469
Memory Allocation and Lifetime......Page 471
20.7 A More Complex Example......Page 472
Bezier Curve Calculation (Pre-Dynamic Parallelism)......Page 473
Bezier Curve Calculation (with Dynamic Parallelism)......Page 476
20.8 Summary......Page 479
Reference......Page 480
21.1 Goals Revisited......Page 482
21.2 Memory Model Evolution......Page 484
21.3 Kernel Execution Control Evolution......Page 487
21.5 Programming Environment......Page 490
21.6 Future Outlook......Page 491
References......Page 492
A.1 matrixmul.cu......Page 494
A.2 matrixmul_gold.cpp......Page 497
A.3 matrixmul.h......Page 498
A.4 assist.h......Page 499
A.5 Expected Output......Page 503
B.1 GPU Compute Capability Tables......Page 504
B.2 Memory Coalescing Variations......Page 505
Index......Page 510