This book constitutes the thoroughly refereed post-conference proceedings of the 21th International Workshop on Languages and Compilers for Parallel Computing, LCPC 2008, held in Edmonton, Canada, in July/August 2008.
The 18 revised full papers and 6 revised short papers presented were carefully reviewed and selected from 35 submissions. The papers address all aspects of languages, compiler techniques, run-time environments, and compiler-related performance evaluation for parallel and high-performance computing and comprise also presentations on program analysis that are precursors of high performance in parallel environments.
Author(s): José Nelson Amaral
Edition: 1
Publisher: Springer
Year: 2008
Language: English
Commentary: 71816
Pages: 365
front-matter.pdf......Page 1
Introduction......Page 9
CUDA Programming Model......Page 10
Global Memory......Page 11
Shared Memory......Page 12
Memory Coalescing......Page 13
CUDA-Lite......Page 14
Experimental Results......Page 18
Related Work......Page 21
Conclusion and Future Work......Page 22
Introduction......Page 24
Programming Model Background......Page 25
Kernel Translation......Page 26
Transforming a Thread Block into a Serial Function......Page 27
Enforcing Synchronization with Deep Fission......Page 28
Replicating Thread-Local Data......Page 31
Implementation and Performance Analysis......Page 33
Related Work......Page 35
Conclusions......Page 36
Introduction......Page 39
The High Locality Cache......Page 42
The Pre-Fetch Block......Page 45
The Transactional Cache......Page 46
Regular Access Transformations......Page 48
Irregular Access Transformation......Page 50
Evaluation......Page 51
Related Work......Page 52
Conclusions......Page 53
References......Page 54
Introduction......Page 55
Reachability and Sharing......Page 56
Notation......Page 57
Abstract Operations......Page 58
Semantics of Expressions......Page 59
Semantics of Commands......Page 60
Expressions and Commands; Native Operations......Page 62
Experiments......Page 65
Related Work......Page 67
Complete Semantics for the Expressions and Commands......Page 70
PowUnion: Correctness Proof......Page 71
Introduction......Page 72
Related Work......Page 73
Background......Page 74
Overview of TRIPS......Page 75
Compiling for TRIPS......Page 76
Base Linear Scan Register Allocator......Page 77
Bank Assignment Algorithm for Spatially Partitioned Processors......Page 78
Register Dependence Graph......Page 79
Bank Assignment......Page 80
Experimental Results......Page 82
Conclusions......Page 86
Introduction......Page 88
Related Work......Page 91
Ring......Page 94
Torus......Page 95
Icosahedron Representation of the Earth......Page 97
Experimental Results......Page 98
Conclusion......Page 99
Introduction......Page 102
Running Examples......Page 103
Abstract Heap Domain......Page 104
Heap Structure Examples......Page 105
Extended Domain......Page 107
Local Data Dependence......Page 108
Interprocedural Data Dependence......Page 110
Case Studies......Page 112
Performance......Page 115
Introduction......Page 117
Delaunay Mesh Refinement......Page 118
Opportunities for Exploiting Amorphous Data-Parallelism......Page 120
The Galois Runtime and Class Libraries......Page 121
Methodology......Page 122
Speedup......Page 123
Load Balance......Page 124
Aborted Speculations......Page 126
Overdecomposition......Page 127
Result Summary......Page 128
Related Work......Page 129
References......Page 130
Introduction......Page 132
Contributions......Page 134
Profile Representation......Page 135
Measuring Execution Time......Page 136
Detecting Patterns of Behavior......Page 137
Signature Generation for Patterns......Page 138
Grouping and Distinguishing between Similar Patterns......Page 140
Ranking Impact of Patterns......Page 141
Experimental Evaluation......Page 142
Case Study: H.263enc......Page 145
Related Work......Page 146
Conclusion......Page 147
Introduction......Page 149
Concurrency Graph......Page 151
Concurrency Graph Partition......Page 152
Serializing Non-interference Graph......Page 153
MLA Heuristic......Page 155
Experimental Results......Page 158
Precision Evaluation......Page 159
Performance Study on Sun-Fire......Page 160
Related Work......Page 161
Conclusions......Page 162
Introduction......Page 164
Software TLS......Page 166
Thread Partitioning -- High Level View......Page 167
Exploiting Access-Patterns Via Adaptive TLS Models......Page 168
Thread Partitioning......Page 169
Notations, Preliminaries and Profiling Instrumentation......Page 170
Example......Page 171
Set-Congruence Model (in $\mathbb{Z} \times \mathbb{Z}$)......Page 172
Set-Congruence Algebra (in $\mathbb{Z} \times \mathbb{Z}$)......Page 173
Pattern Identification......Page 174
Memory Partitioning......Page 175
Building Variable-Based Memory Partitions......Page 176
Speed-Up Results......Page 177
Conclusions......Page 178
Introduction......Page 180
Code Example......Page 181
Disjoint Partitions......Page 182
Language Extensions for Parallelism......Page 183
Effects......Page 184
Effect Agreements......Page 185
Example: Multithreaded Server......Page 186
Flow-Sensitive Effect Analysis......Page 188
\tt forkjoin} Statements......Page 189
Enforcing Thread-Safety......Page 190
Example: Map Reduce......Page 191
Related Work......Page 192
Conclusion......Page 193
Introduction......Page 195
Motivation......Page 196
Cache Coherence Protocol Block Size......Page 197
Cache Mapping......Page 198
Processor Mapping......Page 199
Effective Bandwidth......Page 200
Pointer chaining.......Page 201
Experimental Environment......Page 202
Coherence Block Size.......Page 203
Cache Mapping.......Page 204
Effective L2 Bandwidth.......Page 205
Effective Memory Bandwidth.......Page 206
Related Work......Page 207
Conclusion......Page 208
Introduction......Page 210
Review of the Connection between Time Distance and Reuse Distance......Page 211
Basic Algorithm for Approximating Reuse Distance Histograms......Page 212
Scalable Algorithm......Page 213
Measurement Acceleration for Efficiency......Page 215
Algorithm......Page 217
Proof of Correctness......Page 218
Time Distance Measurement......Page 219
Locality Approximation on SPEC {\it ref} Runs......Page 220
Trace Generator......Page 221
Conclusions......Page 222
Introduction......Page 225
Two New Optimal Cache Management Algorithms......Page 226
Three Types of Cache Access......Page 227
OPT* Algorithm......Page 228
The Bypass LRU Algorithm......Page 229
Bypass LRU is optimal.......Page 230
The Trespass LRU Algorithm......Page 232
Trespass LRU is Optimal.......Page 233
Trespass LRU is a stack algorithm.......Page 234
A Simulation Study......Page 235
A Simple But Real Test......Page 236
Related Work......Page 237
Summary......Page 238
Introduction......Page 240
Static Modeling of TLS Profitability......Page 242
Independence Windows......Page 243
Dependence Clustering......Page 244
Summary......Page 245
DProf Runtime Library......Page 246
Dependence Profiling......Page 247
Previous Work......Page 252
Conclusions......Page 253
Introduction......Page 257
Managing Parallelism for Distributed Objects......Page 259
The Programming Model......Page 260
Synchronization and Asynchronous Event Handling......Page 261
Failure Semantics and Bootstrapping Nodes......Page 262
Sample Code......Page 263
Overlay Network Construction......Page 264
RMI Fault Detection......Page 265
Micro-benchmark......Page 266
Real-Life Application......Page 268
Conclusion......Page 269
Introduction......Page 272
Overview of LISP2 Compactor......Page 273
Live Object Marking......Page 274
Reference Fixing......Page 275
Phases Composition......Page 276
Object Relocating Implementation......Page 277
Object Moving Implementation......Page 279
Applied in a Generational GC......Page 281
The Compactor with a LOS Collector......Page 282
Scalability......Page 283
Adjustable Boundaries......Page 284
Summary......Page 285
References......Page 286
Introduction......Page 287
Background......Page 288
A Multi-paradigm Runtime System......Page 289
ParFUM: An Example of Multi-paradigm Programming......Page 292
Mesh Partitioning......Page 293
Mesh Adaptivity......Page 294
Multi-paradigm Applications......Page 295
Conclusions......Page 296
Introduction......Page 300
ASYNC Loops......Page 301
Benchmark Study......Page 304
SOR-Preconditioned CG Using ASYNC Loops......Page 306
Experimental Results......Page 307
Conclusion......Page 310
Introduction......Page 312
Related Work......Page 314
The {\tt pMatrix} pContainer......Page 315
pAlgorithms......Page 316
Interoperability for Linear Algebra Computations......Page 317
Optimizing {\tt p_matrix_multiply} with PBLAS and BLAS......Page 318
Experimental Results......Page 320
Conclusion......Page 322
Automatic Parallelization - Current State of the Art......Page 324
A Brief Introduction to Sensitivity Analysis......Page 326
Engineering an Automatic Parallelizer......Page 328
Transformations to Remove Dependences......Page 329
Sensitivity Graph (SG) Evaluation......Page 331
Speculative Execution......Page 333
Experimental Results......Page 334
Conclusions......Page 337
Introduction......Page 339
Percolation Model and Just-In-Time Locality......Page 340
Percolation for Just-In-Time Locality......Page 341
Discussion......Page 342
Percolation Programming for SSCA2......Page 343
Percolation Programming for DP......Page 344
Empirical Results for SSCA2 with Percolation......Page 346
Related Work......Page 348
Conclusion......Page 349
Introduction......Page 351
Related Work......Page 352
Orchestration of Optimizations......Page 353
Register Reuse Optimizations......Page 354
SSE Vectorization......Page 355
Memory Prefetch......Page 356
A Parameterized Search Engine for Automatic Tuning......Page 357
Experimental Results......Page 358
Search Space Exploration......Page 359
Conclusions......Page 362
back-matter.pdf......Page 364