Advances in GPU Research and Practice


Author(s): Hamid Sarbazi-Azad
Publisher: Morgan Kaufmann
Year: 2017

Language: English
Pages: 735

Front Matter......Page 3
Copyright......Page 4
Dedication......Page 5
List of Contributors......Page 6
Preface......Page 10
Acknowledgments......Page 17
GPUs in Support of Parallel Computing......Page 18
Bugs in parallel and GPU code......Page 19
Organization of threads......Page 20
Memory spaces......Page 21
Warps and lock-step execution......Page 22
Data races......Page 23
Lack of forward progress guarantees......Page 25
Floating-point accuracy......Page 26
A Taxonomy of Current Tools......Page 27
Detecting races: "all for one and one for all"......Page 29
Symbolic Bug-Finding Case Study: GKLEE......Page 30
Verification Case Study: GPUVerify......Page 32
Equivalence checking......Page 33
References......Page 34
Introduction......Page 37
Platform Model......Page 40
Execution Model......Page 42
Memory Model......Page 43
Synchronization......Page 44
OpenCL ICD......Page 46
Limitations of OpenCL......Page 47
SnuCL CPU......Page 52
SnuCL Cluster......Page 54
Processing synchronization commands......Page 56
Minimizing Copying Overhead......Page 57
Processing Memory Commands......Page 58
Detecting Memory Objects Written by a Kernel......Page 59
SnuCL extensions to OpenCL......Page 60
Evaluation Methodology......Page 62
Scalability on the medium-scale GPU cluster......Page 64
Scalability on the large-scale CPU cluster......Page 65
References......Page 67
Introduction......Page 70
Coarse-Grained Communication and Synchronization......Page 71
Global Barrier at the Kernel Level......Page 72
Local Barrier at the Work-Group Level......Page 75
Implicit Barrier at the Wavefront Level......Page 77
Built-In Atomic Functions on Regular Variables......Page 81
Fine-Grained Communication and Synchronization......Page 85
Sequential consistency......Page 86
The OpenCL 2.0 Memory Model......Page 87
Special atomic operations and stand-alone memory fence......Page 88
Memory order parameters......Page 90
Memory scope parameters......Page 92
References......Page 93
Introduction, Problem Statement, and Context......Page 95
P3: Nondeterministic Across Runs......Page 97
SM-centric task selection......Page 98
Filling-retreating scheme......Page 99
Implementation......Page 100
Details......Page 102
Affinity-Based Scheduling......Page 103
Evaluation......Page 104
SM Partitioning for Multi-Kernel Co-Runs......Page 106
Evaluation......Page 107
Software Approaches......Page 108
Hardware Approaches......Page 111
References......Page 112
Introduction......Page 116
Overview......Page 117
Memory specification through MSL......Page 119
Deriving Data Access Patterns: A Hybrid Approach......Page 121
Reuse distance model......Page 122
Staging code to be agnostic to placement......Page 123
Lightweight Performance Model......Page 125
Dealing with phases......Page 126
Results......Page 128
Results With Regular Benchmarks......Page 129
Related Work......Page 131
References......Page 132
Introduction......Page 135
Pairwise Sequence Comparison......Page 137
Phase 2: Obtain the best local alignment......Page 139
Phase 1: Seeding......Page 140
Sequence-Profile Comparison......Page 141
Hidden Markov models......Page 143
The Viterbi algorithm......Page 144
The MSV algorithm......Page 145
Design Aspects of GPU Solutions for Biological Sequence Analysis......Page 146
Sorting Sequence Database to Achieve Load Balance......Page 147
Use of GPU Memory Hierarchy......Page 148
GPU Solutions for Pairwise Sequence Comparison......Page 149
Manavski and Valle [27]......Page 150
Li et al. [52]......Page 151
Ino et al. [28,30]......Page 152
Liu et al. [45-47]......Page 153
Comparative overview......Page 154
Liu et al. [48]......Page 156
Comparative overview......Page 157
Horn et al. [32]......Page 158
Du et al. [44]......Page 159
Yao et al. [34]......Page 160
Ferraz and Moreano [36]......Page 161
Li et al. [35]......Page 162
Cheng and Butler [37]......Page 163
Araújo Neto and Moreano [50]......Page 164
Comparative Overview......Page 165
Conclusion and Perspectives......Page 166
References......Page 167
Adjacency Matrices......Page 171
Adjacency Lists......Page 173
Edge Lists......Page 174
Graph traversal algorithms: the breadth first search (BFS)......Page 175
The Frontier-Based Parallel Implementation of BFS......Page 176
BFS-4K......Page 178
The single-source shortest path (SSSP) problem......Page 183
The SSSP Implementations for GPUs......Page 184
H-BF: An Efficient Implementation of the Bellman-Ford Algorithm......Page 185
The APSP problem......Page 188
The APSP Implementations for GPUs......Page 189
Work-items to threads......Page 191
Virtual warps......Page 192
Dynamic virtual warps + dynamic parallelism......Page 193
CTA + warp + scan......Page 194
Direct search......Page 195
Block search......Page 196
Two-phase search......Page 197
Coalesced Expansion......Page 198
Iterated Searches......Page 199
References......Page 203
Pairwise alignment......Page 207
Alignment of Three Sequences......Page 208
Pairwise alignment......Page 209
Smith-Waterman Algorithm......Page 210
Computing the Score of the Best Local Alignment......Page 214
Computing the Best Local Alignment......Page 216
StripedAlignment......Page 217
ChunkedAlignment1......Page 218
Memory requirements......Page 219
StripedScore......Page 220
StripedAlignment......Page 221
ChunkedAlignment1......Page 222
Three-Sequence Alignment Algorithm......Page 226
GPU computational strategy......Page 229
Analysis......Page 230
GPU computational strategy......Page 231
LAYERED-BT1......Page 232
LAYERED-BT2......Page 233
LAYERED-BT3......Page 234
Computing the alignment......Page 235
Conclusion......Page 237
References......Page 238
Introduction......Page 241
ABCD Solver for tridiagonal systems......Page 242
QR Method and Givens Rotation......Page 245
Coalesced Memory Access......Page 246
Padding for Givens rotation......Page 248
Comparison With CPU Implementation......Page 250
Speedup by Memory Coalescing......Page 251
Boundary Padding......Page 252
References......Page 253
Introduction......Page 255
Operations Research in Practice......Page 257
The Simplex Method......Page 259
Dynamic Programming......Page 261
Knapsack problems......Page 262
Knapsack problem......Page 263
Flow-shop scheduling problem......Page 264
Metaheuristics......Page 265
The traveling salesman problem......Page 266
Scheduling problems......Page 267
Ant Colony Optimization......Page 268
Tabu Search......Page 270
Deep greedy switching......Page 272
Conclusions......Page 273
Acknowledgments......Page 274
References......Page 275
Motivation......Page 280
Floyd-Warshall's APSP......Page 281
Dijkstra's SSSP......Page 282
Implementation Challenges in a Distributed GPU Environment......Page 284
Related Work......Page 285
Partitioned Approaches......Page 286
Graph Partitioning......Page 287
All-Pairs Shortest Path Algorithms......Page 288
A centralized approach for graphs with negative weights......Page 289
A decentralized approach for better scaling and improved memory distribution......Page 292
Single-Pair Shortest Path Query Algorithm......Page 295
Preprocessing mode......Page 296
Query mode......Page 297
Ex situ tropical product......Page 299
Analysis of the Centralized APSP......Page 300
Analysis of the Decentralized APSP......Page 302
Analysis of the SPSP Query Algorithm......Page 303
Partitioned APSP With Real Edge Weights......Page 304
Better-Scaling Partitioned APSP Using Dijkstra......Page 307
SPSP Query Using Dijkstra......Page 309
Discussion and Perspectives......Page 310
References......Page 311
About the Authors......Page 312
GPU Architecture......Page 314
CUDA Programming Model......Page 316
Computation Models for GPUs......Page 317
Generic Programming Strategies for GPU......Page 319
Bitonic Sort......Page 320
Radix Sort......Page 323
Merge Sort......Page 324
Other Sorting Algorithms......Page 326
Sorting Algorithms for Large Data Sets......Page 327
Comparison of Sorting Methods......Page 329
References......Page 330
Introduction......Page 334
Measuring Throughput and Energy......Page 338
Data Sets......Page 339
Algorithmic Components......Page 341
Synthesis Analysis and Derivation of MPC......Page 342
Observations about individual algorithms......Page 343
Derivation of the MPC algorithm......Page 344
Compression and Decompression Speed......Page 345
Component Count......Page 349
Acknowledgments......Page 353
References......Page 354
Introduction......Page 355
Sparse matrix-vector multiplication......Page 356
GPU architecture and programming model......Page 359
NVIDIA Tesla K20X......Page 360
Optimization principles for SpMV......Page 361
Block Size......Page 362
Memory......Page 363
Platform (Adaptive Runtime System)......Page 364
Results and analysis......Page 365
Register......Page 366
Adaptive Runtime System......Page 369
Summary......Page 373
References......Page 374
Introduction......Page 376
Use of CABA for compression......Page 377
Background......Page 378
Motivation......Page 379
Unutilized compute resources......Page 380
Our goal......Page 382
Goals and Challenges......Page 383
Hardware-based management of threads......Page 384
Programmer/developer interface......Page 385
Main Hardware Additions......Page 386
Assist warp buffer......Page 387
Dynamic feedback and throttling......Page 388
A Case for CABA: Data Compression......Page 389
Algorithm overview......Page 390
Decompression......Page 391
Compression......Page 392
Implementing the FPC algorithm......Page 393
Implementing the C-Pack algorithm......Page 394
Decompression......Page 395
The decompression mechanism......Page 396
L2/memory access......Page 397
L2/memory access......Page 398
Memory controller changes......Page 399
Evaluated metrics......Page 400
Effect on Performance and Bandwidth Utilization......Page 401
Effect on Energy......Page 402
Energy-delay product......Page 403
Effect of Enabling Different Compression Algorithms......Page 404
Selective Cache Compression With CABA......Page 406
Uncompressed L2......Page 407
Other Uses of the CABA Framework......Page 408
Prefetching......Page 409
Profiling and instrumentation......Page 410
Helper threading......Page 411
Compression......Page 412
Acknowledgments......Page 413
References......Page 414
Introduction......Page 419
Why hardware acceleration?......Page 421
Safe Programming Interface......Page 422
Compilation Workflow......Page 423
Instruction-set-architecture design......Page 426
Neural accelerator: design and integration......Page 427
Integrating the Neural Accelerator to GPUs......Page 428
Orchestrating Neurally Enhanced SIMD Lanes......Page 430
Controlling quality trade-offs......Page 432
Annotations......Page 433
Quality......Page 435
Cycle-accurate simulations......Page 436
Performance and energy benefits......Page 437
Opportunity for further improvements......Page 439
Sensitivity to accelerator speed......Page 441
Sensitivity to off-chip bandwidth......Page 442
Comparison with prior CPU neural acceleration......Page 443
Related work......Page 445
References......Page 446
Introduction......Page 451
Characterization of Traffic Demands of Heterogeneous Multicore Chips......Page 452
Photo Detector......Page 453
Photonic Interconnect Architecture Model......Page 454
Methodology......Page 456
Speedup Analysis With Varying Bandwidth......Page 457
Detailed Analysis......Page 459
Bandwidth Selection Policy......Page 461
Case Study With DBA in a Heterogeneous Multicore Chip With a Photonic NoC......Page 463
Performance evaluation of the d-HetPNoC......Page 464
Case studies with synthetic and real application-based traffic patterns......Page 467
Conclusion......Page 469
References......Page 470
Introduction......Page 472
Motivation and related work......Page 474
CPU-Based Performance Models for DVFS......Page 475
Critical path......Page 476
DVFS in GPGPUs......Page 477
Memory/computation overlap......Page 479
Store stalls......Page 481
Complex stall classification......Page 482
Models for GPGPUs......Page 483
Critical Stalled Path......Page 484
Compute/Store Path Portion......Page 485
Example......Page 487
Identifying stalls......Page 489
Classifying computation......Page 490
Methodology......Page 491
Execution Time Prediction......Page 494
Energy Savings......Page 497
Optimizing for ED2P......Page 499
Generality of CRISP......Page 502
References......Page 504
Introduction......Page 507
Benchmark Programs......Page 508
System, Compiler, and Power Measurement......Page 509
Power Profiles of Regular Codes......Page 510
Power Profiles of Irregular Codes......Page 511
Comparing Expected, Regular, and Irregular Profiles......Page 513
Algorithm Implementation......Page 514
Arithmetic Precision......Page 516
Default to 614......Page 517
614 to 324......Page 519
Source-Code Optimizations......Page 520
None vs all optimizations......Page 522
Effectiveness of optimizations......Page 525
Energy efficiency vs performance......Page 528
Most biased optimizations......Page 531
Summary......Page 533
Appendix......Page 534
References......Page 538
About the authors......Page 539
Introduction......Page 540
GPU and Its Memory Hierarchy......Page 543
Nonvolatile Memories......Page 545
Related Work......Page 547
STT-RAM......Page 548
GPU memory and STT-RAM......Page 549
Motivation......Page 550
Monitoring mechanism......Page 555
Search mechanism......Page 556
Eviction policy......Page 557
Refreshing mechanism......Page 558
Threshold Analysis......Page 560
Proposed Mechanism......Page 562
LR and HR cache retention times......Page 564
Evaluation Result......Page 565
Lifetime Evaluation......Page 569
Parallel vs Sequential Search Mechanism......Page 570
Conclusion......Page 573
References......Page 574
Introduction......Page 578
GPU Power Management for Mobile Games......Page 581
Control-theoretic gaming governor......Page 583
CPU-GPU Relationship......Page 584
Power-Performance Trade-Off......Page 585
Gaming Bottlenecks......Page 586
Performance Modeling......Page 587
Utilization Models......Page 589
Meeting FPS......Page 590
Results......Page 591
State-of-the-Art Techniques......Page 593
Hardware environment......Page 595
OpenCL background......Page 596
OpenCL runtime......Page 597
GPGPU Applications on Mobile GPUs......Page 598
DVFS for Improving Power/Energy-Efficiency......Page 599
Work-group size manipulation......Page 601
GPU estimation......Page 602
CPU estimation......Page 603
Concurrent execution and effect of memory contention......Page 605
CPU-GPU partitioning ratio......Page 606
Open Problems for Gaming Applications......Page 609
Conclusions......Page 610
References......Page 611
Introduction......Page 614
Radiation-Induced Soft Errors......Page 615
GPUs as HPC Coprocessors......Page 616
Evidence of Soft Errors in HPC Systems......Page 617
Architectural and Program Vulnerability Analysis......Page 619
Fault-Injection......Page 622
Accelerated Particle Beam Testing......Page 624
Protecting SRAM Structures......Page 627
Taking Advantage of Underutilization......Page 629
Software reliability enhancements......Page 632
Redundant Multithreading......Page 633
Symptom-Based Fault Detection and Dynamic-Reliability Management......Page 637
Summary......Page 640
References......Page 641
Introduction......Page 645
GPGPUs Architecture......Page 646
Soft-Error Vulnerability Modeling in GPGPU Microarchitecture......Page 648
Soft-error vulnerability of the GPGPU microarchitecture structures......Page 650
Analysis on structure's AVF in SM......Page 652
Vulnerability variations at streaming multiprocessor level......Page 653
Dynamic warp formation......Page 656
Number of threads per SM......Page 658
Warp scheduling......Page 662
RISE: Improving the Streaming Processors' Reliability Against Soft Errors in GPGPUs......Page 665
Full-RISE......Page 666
The observation on resource contentions among memory requests......Page 667
The concept of request pending aware Full-RISE......Page 668
Idle SMs aware Full-RISE......Page 671
The implementation of Full-RISE......Page 672
The concept of Partial-RISE......Page 674
Experimental setup......Page 677
Effectiveness of RISE......Page 678
Sensitivity analysis......Page 680
Mitigating the Susceptibility of GPGPUs to PVs......Page 682
Modeling PV impact on GPGPUs' register file......Page 684
Extending VL-RF to GPGPUs RF......Page 686
Variable-latency subbanks (VL-SB) in RF......Page 687
Implementation......Page 688
Mitigating the IPC Degradation Under VL-SB and RF-BRO......Page 690
Fast-bank aware register mapping......Page 691
Fast-warp aware scheduling (FWAS) policy......Page 692
Evaluations......Page 693
Evaluation of VL-SB with RF-BRO......Page 694
Evaluation of FWAS......Page 695
Overall performance improvement......Page 696
References......Page 697
B......Page 702
C......Page 704
D......Page 705
F......Page 706
H......Page 707
I......Page 708
K......Page 709
L......Page 710
M......Page 711
O......Page 713
Q......Page 714
S......Page 715
T......Page 717
W......Page 718
Y......Page 719
Z......Page 720
B......Page 721
C......Page 722
E......Page 724
G......Page 725
H......Page 727
M......Page 728
O......Page 729
P......Page 730
S......Page 731
T......Page 734
W......Page 735