With contributions from some of the most notable experts in the field, Performance Tuning of Scientific Applications presents current research in performance analysis. The book focuses on the following areas:

- Performance monitoring: Describes the state of the art in hardware and software tools that are commonly used for monitoring and measuring performance and managing large quantities of data
- Performance analysis: Discusses modern approaches to computer performance benchmarking and presents results that offer valuable insight into these studies
- Performance modeling: Explains how researchers deduce accurate performance models from raw performance data or from other high-level characteristics of a scientific computation
- Automatic performance tuning: Explores ongoing research into automatic and semi-automatic techniques for optimizing computer programs to achieve superior performance on any computer platform
- Application tuning: Provides examples that show how the appropriate analysis of performance and some deft changes have resulted in extremely high performance

Performance analysis has grown into a full-fledged, sophisticated field of empirical science. Describing useful research in modern performance science and engineering, this book helps real-world users of parallel computer systems to better understand both the performance vagaries arising in scientific applications and the practical means for improving performance.
Editor(s): David H. Bailey, Robert F. Lucas, Samuel Williams
Edition: 1
Publisher: CRC Press
Year: 2010
Language: English
Pages: 399
Contents......Page 5
1.1 Background......Page 10
1.2 "Twelve Ways to Fool the Masses"......Page 12
1.3 Examples from Other Scientific Fields......Page 14
1.4 Guidelines for Reporting High Performance......Page 16
1.5 Modern Performance Science......Page 17
1.6 Acknowledgments......Page 18
2. Parallel Computer Architecture......Page 19
2.1.1 Moore's Law and Little's Law......Page 20
2.2.1 Shared-Memory Parallel Architectures......Page 22
2.2.2 Distributed-Memory Parallel Architectures......Page 23
2.2.3 Hybrid Parallel Architectures......Page 24
2.3 Processor (Core) Architecture......Page 25
2.4.1 Latency- and Bandwidth-Avoiding Architectures......Page 30
2.4.2 Latency Hiding Techniques......Page 32
2.5.1 Routing......Page 34
2.5.2 Topology......Page 35
2.6 Heterogeneous Architectures......Page 36
2.8 Acknowledgments......Page 37
2.9 Glossary......Page 38
3. Software Interfaces to Hardware Counters......Page 40
3.1 Introduction......Page 41
3.3 Off-Core and Shared Counter Resources......Page 42
3.4.1 AMD......Page 43
3.4.3 IBM Blue Gene......Page 44
3.5.1 Perf Events......Page 45
3.5.2 Perfmon2......Page 46
3.6.1 Extension to Off-Processor Counters......Page 47
3.6.2 Countable Events......Page 48
3.7 Counter Usage Modes......Page 49
3.7.2 Sampling Mode......Page 50
3.8.1 Optimizing Cache Usage......Page 51
3.8.3 Optimizing Prefetching......Page 52
3.8.5 Other Uses......Page 53
3.9.2 Overhead......Page 54
3.11 Acknowledgment......Page 55
4. Measurement and Analysis of Parallel Program Performance Using TAU and HPCToolkit......Page 56
4.1 Introduction......Page 57
4.2 Terminology......Page 59
4.3.1 Instrumentation......Page 60
4.3.2 Asynchronous Sampling......Page 61
4.3.3 Contrasting Measurement Approaches......Page 63
4.3.4 Performance Measurement in Practice......Page 64
4.4.1 Design Principles......Page 65
4.4.2 Measurement......Page 68
4.4.3 Analysis......Page 70
4.4.4 Presentation......Page 73
Using hpcviewer......Page 75
4.4.5 Summary and Ongoing Work......Page 78
4.5 TAU Performance System......Page 79
4.5.1 TAU Performance System Design and Architecture......Page 80
4.5.2 TAU Instrumentation......Page 81
Source Instrumentation......Page 82
Compiler-Based Instrumentation......Page 83
4.5.3 TAU Measurement......Page 84
4.5.4 TAU Analysis......Page 87
4.5.5 Summary and Future Work......Page 91
4.6 Summary......Page 92
Acknowledgments......Page 93
5.1 Introduction......Page 94
5.2 Tracing and Its Motivation......Page 95
5.3 Challenges......Page 97
5.3.1 Scalability of Data Handling......Page 98
5.3.2 Scalability of Visualization......Page 102
5.4 Data Acquisition......Page 106
5.5.1 Signal Processing......Page 110
5.5.2 Clustering......Page 111
5.5.3 Mixing Instrumentation and Sampling......Page 114
5.6.1 Detailed Time Reconstruction......Page 115
5.6.2 Global Application Characterization......Page 118
5.6.3 Projection of Metrics......Page 122
5.7 Interoperability......Page 124
5.7.1 Profile and Trace Visualizers......Page 125
5.7.2 Visualizers and Simulators......Page 126
5.8 The Future......Page 128
Acknowledgments......Page 129
6. Large-Scale Numerical Simulations on High-End Computational Platforms......Page 130
6.2 HPC Platforms and Evaluated Applications......Page 131
6.3.1 Gyrokinetic Toroidal Code......Page 133
6.4 GTC Performance......Page 135
6.5 OLYMPUS: Unstructured FEM in Solid Mechanics......Page 138
6.5.2 Olympus Performance......Page 140
6.6.1 The Cactus Software Framework......Page 143
6.6.2 Computational Infrastructure: Mesh Refinement with Carpet......Page 144
6.6.3 Carpet Benchmark......Page 145
6.6.4 Carpet Performance......Page 146
6.7 CASTRO: Compressible Astrophysics......Page 148
6.7.2 CASTRO Performance......Page 150
6.8 MILC: Quantum Chromodynamics......Page 152
6.8.1 MILC Performance......Page 153
6.9 Summary and Conclusions......Page 155
6.10 Acknowledgments......Page 157
7.1 Introduction......Page 158
7.2 Applications of Performance Modeling......Page 159
7.3 Basic Methodology......Page 161
7.4 Performance Sensitivity Studies......Page 165
7.5 Summary......Page 169
7.6 Acknowledgments......Page 170
8. Analytic Modeling for Memory Access Patterns Based on Apex-MAP......Page 171
8.1 Introduction......Page 172
8.2.1 Patterns of Memory Access......Page 173
8.2.2 Performance Dependency on Memory Access Pattern......Page 174
8.2.3 Locality Definitions......Page 175
8.3.1 Characterizing Memory Access......Page 176
8.3.2 Generating Different Memory Patterns Using Apex-MAP......Page 177
8.4 Using Apex-MAP to Assess Processor Performance......Page 179
8.5.1 Modeling Communication for Remote Memory Access......Page 181
8.5.2 Assessing Machine Scaling Behavior Based on Apex-MAP......Page 183
8.6 Apex-MAP as an Application Proxy......Page 186
8.6.1 More Characterization Parameters......Page 187
8.6.2 The Kernel Changes for Apex-MAP......Page 188
8.6.3 Determining Characteristic Parameters for Kernel Approximations......Page 189
8.6.5 Case Studies of Applications Modeled by Apex-MAP......Page 190
8.6.6 Overall Results and Precision of Approximation......Page 196
8.8 Acknowledgment......Page 198
9. The Roofline Model......Page 200
9.1.1 Abstract Architecture Model......Page 201
9.1.2 Communication, Computation, and Locality......Page 202
9.1.4 Examples of Arithmetic Intensity......Page 203
9.2 The Roofline......Page 204
9.3.1 NUMA......Page 206
9.3.3 TLB Issues......Page 208
9.4.1 Instruction-Level Parallelism......Page 209
9.4.2 Functional Unit Heterogeneity......Page 210
9.4.3 Data-Level Parallelism......Page 211
9.4.4 Hardware Multithreading......Page 212
9.4.6 Combining Ceilings......Page 213
9.5 Arithmetic Intensity Walls......Page 214
9.5.3 Write Allocation Traffic......Page 215
9.5.6 Elimination of Superfluous Floating-Point Operations......Page 216
9.6.1 Hierarchical Architectural Model......Page 217
9.7 Summary......Page 218
9.9 Glossary......Page 219
10.1 Introduction......Page 221
10.2 Overview......Page 223
10.2.2 Need to Coordinate Auto-Tuners......Page 224
10.3.1 Application Programmers......Page 225
10.3.2 Compilers......Page 226
10.3.4 Performance Models......Page 227
10.4 Search......Page 228
10.4.1 Specification of Tunable Parameters......Page 229
10.4.2 Previous Search Algorithms......Page 230
10.4.3 Parallel Rank Ordering......Page 231
10.4.4 Parameter Space Reduction......Page 232
10.4.5 Constraining to Allowable Region......Page 233
10.4.6 Performance Variability......Page 234
10.5.1 Application Parameters — GS2......Page 235
10.5.2 Compiler Transformations — Computational Kernels......Page 236
10.5.3 Full Application — SMG2000......Page 238
10.7 Acknowledgment......Page 242
11. Languages and Compilers for Auto-Tuning......Page 243
11.1 Language and Compiler Technology......Page 244
11.1.2 Auto-Tuning a Library in its Execution Context......Page 245
11.2 Interaction between Programmers and Compiler......Page 247
11.3 Triage......Page 249
11.3.1 Code Isolation......Page 250
11.4 Code Transformation......Page 251
11.4.1 Transformation Interfaces......Page 253
11.5 Higher-Level Capabilities......Page 255
11.7 Acknowledgments......Page 257
12. Empirical Performance Tuning of Dense Linear Algebra Software......Page 258
12.1.2 Dense Linear Algebra Performance Issues......Page 259
12.1.3 Idea of Empirical Tuning......Page 260
12.2 ATLAS......Page 261
12.2.3 Level 1 BLAS Support......Page 262
12.2.7 Use of Architectural Defaults......Page 263
12.3.1 Tuning Outer and Inner Block Sizes......Page 264
12.3.2 Validation of Pruned Search......Page 267
12.4.1 GEMM Auto-Tuner......Page 270
12.4.2 Performance Results......Page 272
12.6 Acknowledgments......Page 275
13. Auto-Tuning Memory-Intensive Kernels for Multicore......Page 276
13.1 Introduction......Page 277
13.2.1 AMD Opteron 2356 (Barcelona)......Page 278
13.3 Computational Kernels......Page 279
13.3.1 Laplacian Differential Operator (Stencil)......Page 280
13.3.2 Lattice Boltzmann Magnetohydrodynamics (LBMHD)......Page 281
13.3.3 Sparse Matrix-Vector Multiplication (SpMV)......Page 283
13.4.1 Parallelism......Page 285
13.4.2 Minimizing Memory Traffic......Page 286
13.4.3 Maximizing Memory Bandwidth......Page 289
13.4.5 Interplay between Benefit and Implementation......Page 290
13.5 Automatic Performance Tuning......Page 291
13.5.1 Code Generation......Page 292
13.5.2 Auto-Tuning Benchmark......Page 293
13.6 Results......Page 294
13.6.1 Laplacian Stencil......Page 295
13.6.2 Lattice Boltzmann Magnetohydrodynamics (LBMHD)......Page 296
13.6.3 Sparse Matrix-Vector Multiplication (SpMV)......Page 297
13.7 Summary......Page 298
13.8 Acknowledgments......Page 299
14. Flexible Tools Supporting a Scalable First-Principles MD Code......Page 300
14.1 Introduction......Page 301
14.2.1 Code Structure......Page 302
14.3 Experimental Setup and Baselines......Page 303
14.3.1 The Blue Gene/L Architecture......Page 304
14.3.2 Floating Point Operation Counts......Page 305
14.3.3 Test Problem......Page 306
14.3.4 Baseline Performance......Page 307
14.4.1 Dual Core Matrix Multiply......Page 308
14.4.2 Empirical Node Mapping Optimizations......Page 309
14.4.4 Systematic Communication Optimization......Page 311
14.5 Customizing Tool Chains with PNMPI......Page 313
14.5.1 Applying Switch Modules to Qbox......Page 314
14.5.2 Experiments and Results......Page 315
14.6 Summary......Page 316
14.7 Acknowledgment......Page 317
15.1 Introduction......Page 318
15.2 CCSM Overview......Page 319
15.3 Parallel Computing and the CCSM......Page 320
15.4 Case Study: Optimizing Interprocess Communication Performance in the Spectral Transform Method......Page 321
15.5 Performance Portability: Supporting Options and Delaying Decisions......Page 325
15.6 Case Study: Engineering Performance Portability into the Community Atmosphere Model......Page 326
15.7 Case Study: Porting the Parallel Ocean Program to the Cray X1......Page 331
15.8 Monitoring Performance Evolution......Page 335
15.9 Performance at Scale......Page 338
15.10 Summary......Page 340
15.11 Acknowledgments......Page 341
16.1 Introduction......Page 342
16.2 LS3DF Algorithm Description......Page 344
16.3 LS3DF Code Optimizations......Page 346
16.4 Test Systems......Page 349
16.5 Performance Results and Analysis......Page 350
16.6 Science Results......Page 354
16.7 Summary......Page 356
16.8 Acknowledgments......Page 357
Bibliography......Page 358