This book constitutes the thoroughly refereed post-proceedings of the 17th International Workshop on Languages and Compilers for High Performance Computing, LCPC 2004, held in West Lafayette, IN, USA in September 2004. The 33 revised full papers presented were carefully selected during two rounds of reviewing and improvement. The papers are organized in topical sections on compiler infrastructures; predicting and reducing memory access; locality, tiling, and partitioning; tools and techniques for parallelism and locality; Java for high-performance computing; high-level languages and optimizations; large-scale data sharing; performance studies; program analysis; and exploiting architectural features.
Author(s): Rudolf Eigenmann, Zhiyuan Li, Samuel P. Midkiff
Edition: 1
Publisher: Springer
Year: 2005
Language: English
Pages: 495
front-matter......Page 2
Introduction......Page 10
Cetus Intermediate Representation......Page 11
Navigating the IR......Page 13
Type System and Symbol Table......Page 14
Normalization......Page 15
Case Studies......Page 16
Translation of OpenMP Applications......Page 17
Pointer Alias Analysis......Page 18
Software Debugging......Page 19
A Java Byte-Code Translator......Page 20
Conclusion......Page 21
Key LLVM Design Features......Page 24
Status......Page 25
Introduction......Page 26
Framework and Major Components......Page 27
New Features of Code Generator......Page 29
Other Performance Enhancements......Page 34
Development Methodology......Page 35
Evaluation......Page 36
Proliferation of Open64 and ORC......Page 37
Conclusion and Future Work......Page 38
References......Page 39
Introduction......Page 41
User Community and Field Tests......Page 42
Architecture Space......Page 43
Predicated Execution......Page 44
Software Pipelining......Page 45
Optimizing Compiler......Page 46
Instruction Set Simulator......Page 47
Concluding Remarks......Page 48
Introduction......Page 51
Locality Measurement and Prediction......Page 52
Locality Phase Analysis......Page 53
Evaluation......Page 54
Experiment Results......Page 55
Phase Sequence Prediction......Page 56
Related Work......Page 57
Conclusions......Page 58
Introduction......Page 65
Bitwidth Aware Register Allocation: Static Versus Dynamic Narrow Width Data......Page 67
Instruction Set Support......Page 69
Speculative Subword Register Allocation......Page 71
The SSRA Algorithm......Page 72
Experimental Results......Page 77
Conclusions......Page 78
Introduction......Page 81
The Computational Context......Page 83
Constituent Operations......Page 84
Generalized Matrix Multiplication (GEMM)......Page 85
Index Permutation......Page 86
Empirical Measurement of Constituent Operations......Page 88
Composite Performance Model......Page 90
Experimental Results......Page 91
Related Work......Page 93
Conclusions......Page 94
Introduction......Page 96
HTA Arithmetic Operations......Page 98
Construction of HTAs......Page 99
Parallel Programming Using HTAs......Page 100
Execution Model and Implementation......Page 101
NAS Benchmark Suite......Page 102
NAS Kernel mg......Page 103
NAS Kernel ft......Page 105
Performance......Page 106
Ease of Programming......Page 107
Related Works......Page 108
Conclusions......Page 109
Introduction......Page 111
Iteration Space Partitioning......Page 112
Terminology......Page 114
Problem Statement......Page 115
The Approach......Page 116
Experiments......Page 120
Previous Work......Page 122
Conclusions......Page 123
Introduction......Page 126
An Example......Page 129
Related Work......Page 130
JuliusC: An Overview......Page 132
Static Analysis: The Art of Divide et Impera......Page 133
Static Profiler......Page 135
Experimental Results......Page 136
Conclusion and Future Work......Page 138
Introduction......Page 141
Related Work......Page 144
Solving the MLS Problem......Page 145
Computing Parallel Loads and Stores......Page 146
Memory Assignment......Page 150
Register Assignment......Page 153
Experiment......Page 154
Conclusion......Page 156
Introduction and Motivation......Page 158
Practical View: Locality Problem......Page 159
Abstract View and Key Players......Page 160
Our Goal......Page 161
Baseline Formulation......Page 162
Temporal Locality......Page 164
Spatial Locality......Page 165
Relaxing Assumptions......Page 166
Discussion......Page 167
Experimental Results......Page 168
Concluding Remarks......Page 171
Introduction......Page 173
A Code Isolator......Page 174
Compilable Isolated Program......Page 175
Executable Isolated Program......Page 176
Machine State......Page 178
Case Study in LS-DYNA......Page 179
Implementation and Experiments......Page 183
Related Work......Page 184
Conclusion......Page 185
Introduction......Page 188
Traces......Page 189
Inlining......Page 190
Trace Collection Example......Page 191
Jikes......Page 192
Trace Collection Within Jikes......Page 193
Results......Page 194
Inlining Using Traces with a Just-in-Time Compiler......Page 195
Inlining Using Traces with an Ahead-of-Time Compiler......Page 197
Details of the Provided Inline Information......Page 198
Related Work......Page 200
Conclusion......Page 201
Introduction and Motivation......Page 203
Parallel Execution Graph......Page 204
Soot Framework......Page 206
Practical MHP Analysis......Page 207
Efficient PEG Construction......Page 208
PEG Simplification......Page 209
Practical MHP Analysis......Page 210
Benchmarks......Page 211
Results......Page 212
Related Work......Page 214
Future Work and Conclusions......Page 215
Introduction......Page 218
Background......Page 219
Contributions......Page 220
Overview......Page 221
Intra-thread Variable Access Ordering......Page 222
Variable-Order Graph......Page 224
Determining Communicator Variables......Page 225
Fence Insertion......Page 227
Experience......Page 228
Conclusions......Page 231
Introduction and Motivation......Page 233
Compiling High-Level Languages......Page 234
Previous Work on Vectorizing Compilers......Page 235
Microbenchmarks......Page 237
Application Performance......Page 242
Conclusions......Page 244
Introduction......Page 247
Implementation Considerations......Page 248
Internal Representation......Page 249
Optimization......Page 250
Common Subexpression Elimination......Page 251
Basic Block Scheduling......Page 252
Software Pipelining......Page 253
Engineering FFT Optimization......Page 254
Experiments......Page 255
Straight-Line Code......Page 256
Loop Code......Page 258
Conclusion......Page 260
Introduction......Page 262
Extended Loop Transformation......Page 263
Dependence Analysis......Page 264
Annotating Semantics of Abstractions......Page 265
Array Annotation......Page 266
Inheritance of Semantic Annotations......Page 267
Automatic Extraction and Verification of Annotations......Page 268
Property Propagation Algorithm......Page 269
Translating Array Operations......Page 270
Generating Low-Level Implementations......Page 271
Experimental Results......Page 272
Related Work......Page 274
Conclusions......Page 275
Introduction......Page 277
MSA Description......Page 279
Matrix Multiplication......Page 281
Molecular Dynamics......Page 282
FEM Graph Partitioning......Page 283
DSM, TreadMarks, and Munin......Page 285
Global Arrays......Page 286
HPF and Others......Page 287
Summary and Future Work......Page 288
Introduction......Page 292
System Overview......Page 293
Target Applications and Example Queries......Page 295
The STORM Runtime System......Page 296
Metadata Descriptors......Page 297
Overview of the Compilation Problem......Page 299
Code Generation for Data Extraction......Page 300
Code Generation for Aggregation......Page 301
Experimental Results......Page 303
Conclusions......Page 306
Introduction......Page 308
XML and XML Schemas......Page 310
XML Query Language: XQuery......Page 311
Overview of Our Approach and System......Page 312
Example Application......Page 313
HDF5 Storage Format......Page 314
Low-level and Mapping Schema......Page 315
Compiler Analysis......Page 316
Overview of Compilation Challenges......Page 317
Data-Centric Transformation Approach......Page 318
Algorithm Details......Page 319
Experimental Results......Page 322
Related Work......Page 324
Conclusions......Page 325
Introduction......Page 328
OSCAR Multigrain Parallelizing Compiler......Page 329
Earliest Executable Condition......Page 330
Macro-task Scheduling......Page 331
Global Cache Optimization......Page 332
OpenMP Code Generation......Page 333
Performance on IBM pSeries 690 24 Way SMP Server......Page 335
Performance on NEC TX7/i6010 8 Itanium 2 SMP Server......Page 336
Performance on Sun Ultra 80 4 UltraSPARC II SMP Workstation......Page 337
Conclusions......Page 338
Introduction......Page 341
Background......Page 342
Implementing CAF on Shared Memory Architectures......Page 343
Representing Co-arrays for Efficient Local Computation......Page 344
Code Generation for Remote Accesses......Page 346
STREAM......Page 348
Random Access......Page 350
Spark98......Page 352
NAS MG and SP......Page 354
Conclusions......Page 355
Motivation......Page 357
Organization of the Paper......Page 358
Compilation and Runtime Infrastructure......Page 359
Basic Parallelizer Structure......Page 360
Analyzing Model......Page 362
Loop Permutation with Examples......Page 363
Loop Cost Model......Page 365
Missed Parallelization Opportunities......Page 367
Case 2: Array Privatization......Page 368
Summary and Future Work......Page 369
Trademarks and Copyright......Page 370
Introduction......Page 372
Suffix Arrays Background......Page 373
Collecting the Trace......Page 375
Control-Flow Information Accuracy......Page 376
KMR Algorithm......Page 377
Sorting Example......Page 378
Qualified BBWS for Hot Sub-paths......Page 380
Experimental Methodology......Page 381
Evaluation......Page 382
Application Example: Adaptive Cache Reconfiguration......Page 384
Conclusions......Page 386
Introduction......Page 388
The Value Evolution Graph (VEG)......Page 390
Formal Definition......Page 391
Value Evolution Graph Construction......Page 392
Queries on Value Evolution Graphs......Page 393
VEG Logic Inference and Conditional Pruning......Page 394
VEG Enabled Memory References Analysis......Page 395
Applications: Compiler Optimizations......Page 396
Implementation Details and Experimental Results......Page 397
Related Work......Page 399
Limitations and Future Work......Page 400
Introduction......Page 403
Shape Analysis Framework......Page 404
Loop-Carried Dependence Detection......Page 406
An Example......Page 411
Some Preliminary Results......Page 413
Related Works......Page 415
Conclusions and Future Works......Page 416
Introduction......Page 418
Type I -- IV......Page 420
Type VI -- VIII......Page 421
PVNRE Algorithm......Page 422
Value Numbering......Page 423
Transparency......Page 424
Conversion of Value Numbers......Page 425
Experimental Results......Page 428
Analysis Time......Page 429
Related Work......Page 430
Conclusion......Page 431
Introduction......Page 433
Overflow in SIMD Programs......Page 435
Overflow Controlled Wraparound SIMD Arithmetic......Page 437
Overflow Controlled Saturation SIMD Arithmetic......Page 440
Experiment Result......Page 443
Related Work......Page 444
Conclusion......Page 445
Introduction......Page 448
Decision Tree Patterns at the Source Level......Page 449
Representation of Decision Tree Implementations (Assembly Level)......Page 450
Itanium2 Branch Hardware......Page 451
Code Motion and Hoisting......Page 452
Cost Evaluation of the One-Way Comb Approach: Definition......Page 453
Cost Evaluation of the One-Way Comb Approach: Analytical......Page 454
Generating a Tree of Multiway Conditional Branches......Page 455
Generating Indirect Branches......Page 456
General One-Level Switch Structures: The Indirect Linear Method......Page 457
Comparison of the Code Generation Strategies......Page 458
Comparison for Switch Structures......Page 459
First Experimental Analysis......Page 460
Application to a Real Code......Page 461
Conclusion......Page 462
Introduction......Page 464
Motivating Example......Page 466
Assumptions and Definitions......Page 467
Reuse Vectors for SIV and MIV Subscripts......Page 468
Reuse Vectors and Scalar Replacement......Page 469
Computing the Number of Required Registers......Page 470
Code Generation......Page 472
Experimental Results......Page 473
Related Work......Page 477
Conclusion......Page 478
Introduction......Page 479
Configurable SP Architecture with Power Management......Page 481
Joint Variable-Voltage Scheduling with Power Gating for PEs......Page 482
Voltage/Speed Selection of Non-PEs......Page 486
Simulator and Experiment......Page 487
Conclusions......Page 490
back-matter......Page 494