This book surveys recent research and applications in performance analysis for scientific and high-performance computing. After a brief introduction to the field, the text provides comprehensive coverage of performance measurement and tools, performance modeling, and automatic performance tuning, and it presents performance tools and techniques applied to real-world scientific applications. Individual chapters address topics such as performance benchmarks, hardware performance counters, the PMaC modeling system, source-code-based performance modeling, climate modeling codes, automatic tuning with ATLAS, and more.
Editor(s): David H. Bailey, Robert F. Lucas, Samuel Williams
Series: Chapman & Hall/CRC Computational Science
Edition: 1
Publisher: CRC Press
Year: 2010
Language: English
Pages: 386
Contents......Page 6
1.1 Background......Page 11
1.2 "Twelve Ways to Fool the Masses"......Page 13
1.3 Examples from Other Scientific Fields......Page 15
1.4 Guidelines for Reporting High Performance......Page 17
1.5 Modern Performance Science......Page 18
1.6 Acknowledgments......Page 19
2. Parallel Computer Architecture......Page 21
2.1.1 Moore's Law and Little's Law......Page 22
2.2.1 Shared-Memory Parallel Architectures......Page 24
2.2.2 Distributed-Memory Parallel Architectures......Page 25
2.2.3 Hybrid Parallel Architectures......Page 26
2.3 Processor (Core) Architecture......Page 27
2.4.1 Latency- and Bandwidth-Avoiding Architectures......Page 32
2.4.2 Latency Hiding Techniques......Page 34
2.5.1 Routing......Page 36
2.5.2 Topology......Page 37
2.6 Heterogeneous Architectures......Page 38
2.8 Acknowledgments......Page 39
2.9 Glossary......Page 40
3. Software Interfaces to Hardware Counters......Page 43
3.1 Introduction......Page 44
3.3 Off-Core and Shared Counter Resources......Page 45
3.4.1 AMD......Page 46
3.4.3 IBM Blue Gene......Page 47
3.5.1 Perf Events......Page 48
3.5.2 Perfmon2......Page 49
3.6.1 Extension to Off-Processor Counters......Page 50
3.6.2 Countable Events......Page 51
3.7 Counter Usage Modes......Page 52
3.7.2 Sampling Mode......Page 53
3.8.1 Optimizing Cache Usage......Page 54
3.8.3 Optimizing Prefetching......Page 55
3.8.5 Other Uses......Page 56
3.9.2 Overhead......Page 57
3.11 Acknowledgment......Page 58
4. Measurement and Analysis of Parallel Program Performance Using TAU and HPCToolkit......Page 59
4.1 Introduction......Page 60
4.2 Terminology......Page 62
4.3.1 Instrumentation......Page 63
4.3.2 Asynchronous Sampling......Page 64
4.3.3 Contrasting Measurement Approaches......Page 66
4.3.4 Performance Measurement in Practice......Page 67
4.4.1 Design Principles......Page 68
4.4.2 Measurement......Page 71
4.4.3 Analysis......Page 73
4.4.4 Presentation......Page 76
Using hpcviewer......Page 78
4.4.5 Summary and Ongoing Work......Page 81
4.5 TAU Performance System......Page 82
4.5.1 TAU Performance System Design and Architecture......Page 83
4.5.2 TAU Instrumentation......Page 84
Source Instrumentation......Page 85
Compiler-Based Instrumentation......Page 86
4.5.3 TAU Measurement......Page 87
4.5.4 TAU Analysis......Page 90
4.5.5 Summary and Future Work......Page 94
4.6 Summary......Page 95
Acknowledgments......Page 96
5.1 Introduction......Page 97
5.2 Tracing and Its Motivation......Page 98
5.3 Challenges......Page 100
5.3.1 Scalability of Data Handling......Page 101
5.3.2 Scalability of Visualization......Page 105
5.4 Data Acquisition......Page 109
5.5.1 Signal Processing......Page 113
5.5.2 Clustering......Page 114
5.5.3 Mixing Instrumentation and Sampling......Page 117
5.6.1 Detailed Time Reconstruction......Page 118
5.6.2 Global Application Characterization......Page 121
5.6.3 Projection of Metrics......Page 125
5.7 Interoperability......Page 127
5.7.1 Profile and Trace Visualizers......Page 128
5.7.2 Visualizers and Simulators......Page 129
5.8 The Future......Page 131
Acknowledgments......Page 132
6. Large-Scale Numerical Simulations on High-End Computational Platforms......Page 133
6.2 HPC Platforms and Evaluated Applications......Page 134
6.3.1 Gyrokinetic Toroidal Code......Page 136
6.4 GTC Performance......Page 138
6.5 OLYMPUS: Unstructured FEM in Solid Mechanics......Page 141
6.5.2 Olympus Performance......Page 143
6.6.1 The Cactus Software Framework......Page 146
6.6.2 Computational Infrastructure: Mesh Refinement with Carpet......Page 147
6.6.3 Carpet Benchmark......Page 148
6.6.4 Carpet Performance......Page 149
6.7 CASTRO: Compressible Astrophysics......Page 151
6.7.2 CASTRO Performance......Page 153
6.8 MILC: Quantum Chromodynamics......Page 155
6.8.1 MILC Performance......Page 156
6.9 Summary and Conclusions......Page 158
6.10 Acknowledgments......Page 160
7.1 Introduction......Page 161
7.2 Applications of Performance Modeling......Page 162
7.3 Basic Methodology......Page 164
7.4 Performance Sensitivity Studies......Page 168
7.5 Summary......Page 172
7.6 Acknowledgments......Page 173
8. Analytic Modeling for Memory Access Patterns Based on Apex-MAP......Page 175
8.1 Introduction......Page 176
8.2.1 Patterns of Memory Access......Page 177
8.2.2 Performance Dependency on Memory Access Pattern......Page 178
8.2.3 Locality Definitions......Page 179
8.3.1 Characterizing Memory Access......Page 180
8.3.2 Generating Different Memory Patterns Using Apex-MAP......Page 181
8.4 Using Apex-MAP to Assess Processor Performance......Page 183
8.5.1 Modeling Communication for Remote Memory Access......Page 185
8.5.2 Assessing Machine Scaling Behavior Based on Apex-MAP......Page 187
8.6 Apex-MAP as an Application Proxy......Page 190
8.6.1 More Characterization Parameters......Page 191
8.6.2 The Kernel Changes for Apex-MAP......Page 192
8.6.3 Determining Characteristic Parameters for Kernel Approximations......Page 193
8.6.5 Case Studies of Applications Modeled by Apex-MAP......Page 194
8.6.6 Overall Results and Precision of Approximation......Page 200
8.8 Acknowledgment......Page 202
9. The Roofline Model......Page 205
9.1.1 Abstract Architecture Model......Page 206
9.1.2 Communication, Computation, and Locality......Page 207
9.1.4 Examples of Arithmetic Intensity......Page 208
9.2 The Roofline......Page 209
9.3.1 NUMA......Page 211
9.3.3 TLB Issues......Page 213
9.4.1 Instruction-Level Parallelism......Page 214
9.4.2 Functional Unit Heterogeneity......Page 215
9.4.3 Data-Level Parallelism......Page 216
9.4.4 Hardware Multithreading......Page 217
9.4.6 Combining Ceilings......Page 218
9.5 Arithmetic Intensity Walls......Page 219
9.5.3 Write Allocation Traffic......Page 220
9.5.6 Elimination of Superfluous Floating-Point Operations......Page 221
9.6.1 Hierarchical Architectural Model......Page 222
9.7 Summary......Page 223
9.9 Glossary......Page 224
10.1 Introduction......Page 227
10.2 Overview......Page 229
10.2.2 Need to Coordinate Auto-Tuners......Page 230
10.3.1 Application Programmers......Page 231
10.3.2 Compilers......Page 232
10.3.4 Performance Models......Page 233
10.4 Search......Page 234
10.4.1 Specification of Tunable Parameters......Page 235
10.4.2 Previous Search Algorithms......Page 236
10.4.3 Parallel Rank Ordering......Page 237
10.4.4 Parameter Space Reduction......Page 238
10.4.5 Constraining to Allowable Region......Page 239
10.4.6 Performance Variability......Page 240
10.5.1 Application Parameters — GS2......Page 241
10.5.2 Compiler Transformations — Computational Kernels......Page 242
10.5.3 Full Application — SMG2000......Page 244
10.7 Acknowledgment......Page 248
11. Languages and Compilers for Auto-Tuning......Page 249
11.1 Language and Compiler Technology......Page 250
11.1.2 Auto-Tuning a Library in Its Execution Context......Page 251
11.2 Interaction between Programmers and Compiler......Page 253
11.3 Triage......Page 255
11.3.1 Code Isolation......Page 256
11.4 Code Transformation......Page 257
11.4.1 Transformation Interfaces......Page 259
11.5 Higher-Level Capabilities......Page 261
11.7 Acknowledgments......Page 263
12. Empirical Performance Tuning of Dense Linear Algebra Software......Page 265
12.1.2 Dense Linear Algebra Performance Issues......Page 266
12.1.3 Idea of Empirical Tuning......Page 267
12.2 ATLAS......Page 268
12.2.3 Level 1 BLAS Support......Page 269
12.2.7 Use of Architectural Defaults......Page 270
12.3.1 Tuning Outer and Inner Block Sizes......Page 271
12.3.2 Validation of Pruned Search......Page 274
12.4.1 GEMM Auto-Tuner......Page 277
12.4.2 Performance Results......Page 279
12.6 Acknowledgments......Page 282
13. Auto-Tuning Memory-Intensive Kernels for Multicore......Page 283
13.1 Introduction......Page 284
13.2.1 AMD Opteron 2356 (Barcelona)......Page 285
13.3 Computational Kernels......Page 286
13.3.1 Laplacian Differential Operator (Stencil)......Page 287
13.3.2 Lattice Boltzmann Magnetohydrodynamics (LBMHD)......Page 288
13.3.3 Sparse Matrix-Vector Multiplication (SpMV)......Page 290
13.4.1 Parallelism......Page 292
13.4.2 Minimizing Memory Traffic......Page 293
13.4.3 Maximizing Memory Bandwidth......Page 296
13.4.5 Interplay between Benefit and Implementation......Page 297
13.5 Automatic Performance Tuning......Page 298
13.5.1 Code Generation......Page 299
13.5.2 Auto-Tuning Benchmark......Page 300
13.6 Results......Page 301
13.6.1 Laplacian Stencil......Page 302
13.6.2 Lattice Boltzmann Magnetohydrodynamics (LBMHD)......Page 303
13.6.3 Sparse Matrix-Vector Multiplication (SpMV)......Page 304
13.7 Summary......Page 305
13.8 Acknowledgments......Page 306
14. Flexible Tools Supporting a Scalable First-Principles MD Code......Page 307
14.1 Introduction......Page 308
14.2.1 Code Structure......Page 309
14.3 Experimental Setup and Baselines......Page 310
14.3.1 The Blue Gene/L Architecture......Page 311
14.3.2 Floating Point Operation Counts......Page 312
14.3.3 Test Problem......Page 313
14.3.4 Baseline Performance......Page 314
14.4.1 Dual Core Matrix Multiply......Page 315
14.4.2 Empirical Node Mapping Optimizations......Page 316
14.4.4 Systematic Communication Optimization......Page 318
14.5 Customizing Tool Chains with PNMPI......Page 320
14.5.1 Applying Switch Modules to Qbox......Page 321
14.5.2 Experiments and Results......Page 322
14.6 Summary......Page 323
14.7 Acknowledgment......Page 324
15.1 Introduction......Page 325
15.2 CCSM Overview......Page 326
15.3 Parallel Computing and the CCSM......Page 327
15.4 Case Study: Optimizing Interprocess Communication Performance in the Spectral Transform Method......Page 328
15.5 Performance Portability: Supporting Options and Delaying Decisions......Page 332
15.6 Case Study: Engineering Performance Portability into the Community Atmosphere Model......Page 333
15.7 Case Study: Porting the Parallel Ocean Program to the Cray X1......Page 338
15.8 Monitoring Performance Evolution......Page 342
15.9 Performance at Scale......Page 345
15.10 Summary......Page 347
15.11 Acknowledgments......Page 348
16.1 Introduction......Page 349
16.2 LS3DF Algorithm Description......Page 351
16.3 LS3DF Code Optimizations......Page 353
16.4 Test Systems......Page 356
16.5 Performance Results and Analysis......Page 357
16.6 Science Results......Page 361
16.7 Summary......Page 363
16.8 Acknowledgments......Page 364
Bibliography......Page 365