This book gives a comprehensive description of the architecture of microprocessors, from simple in-order short-pipeline designs to out-of-order superscalars. It discusses topics such as:
- the policies and mechanisms needed for out-of-order processing, such as register renaming, reservation stations, and reorder buffers
- optimizations for high performance, such as branch predictors, instruction scheduling, and load-store speculation
- design choices and enhancements to tolerate latency in the cache hierarchy of single and multiple processors
- state-of-the-art multithreading and multiprocessing, emphasizing single-chip implementations
Topics are presented as conceptual ideas, with metrics to assess the performance impact where appropriate, and with examples of realization. The emphasis is on how things work at a black-box and algorithmic level. The author also provides sufficient detail at the register-transfer level so that readers can appreciate how design features enhance performance as well as add complexity.
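To give a flavor of the metrics treated in Chapter 1 (Sections 1.2.1 and 1.2.2), the standard textbook definitions of IPC, execution time, speedup, and efficiency are sketched below. This is an illustrative summary, not notation taken from the book; the symbol names (N_instr, N_cycles, t_cycle, T_1, T_p) are chosen here for readability.

    IPC = N_instr / N_cycles                  (instructions retired per clock cycle)
    T_exec = (N_instr / IPC) * t_cycle        (execution time for N_instr instructions)
    Speedup(p) = T_1 / T_p                    (time on one processor over time on p processors)
    Efficiency(p) = Speedup(p) / p            (fraction of ideal linear speedup achieved)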
Author(s): Jean-Loup Baer
Edition: 1
Publisher: Cambridge University Press
Year: 2009
Language: English
Pages: 383
Half-title......Page 3
Title......Page 5
Copyright......Page 6
Dedication......Page 7
Contents......Page 9
Synopsis......Page 13
Acknowledgments......Page 15
1 Introduction......Page 17
1.1.1 The von Neumann Machine Model......Page 18
1.1.2 Technological Advances......Page 19
1.2.1 Instructions Per Cycle (IPC)......Page 22
1.2.2 Performance, Speedup, and Efficiency......Page 25
1.3.1 Benchmarks......Page 28
Benchmark Classification......Page 29
SPEC CPU2000 and SPEC CPU2006......Page 30
Reporting Benchmarking Results......Page 32
1.3.2 Performance Simulators......Page 34
1.4 Summary......Page 38
1.5 Further Reading and Bibliographical Notes......Page 39
EXERCISES......Page 40
REFERENCES......Page 44
2.1 Pipelining......Page 45
2.1.1 The Pipelining Process......Page 46
2.1.2 A Basic Five-stage Instruction Execution Pipeline......Page 47
2.1.3 Data Hazards and Forwarding......Page 50
2.1.4 Control Hazards: Branches, Exceptions, and Interrupts......Page 55
2.1.5 Alternative Five-Stage Pipeline Designs......Page 58
2.1.6 Pipelined Functional Units......Page 60
2.2 Caches......Page 62
2.2.1 Cache Organizations......Page 65
Write Strategies......Page 69
The Three C's......Page 70
Caches and I/O......Page 71
2.2.2 Cache Performance......Page 72
2.3 Virtual Memory and Paging......Page 75
2.3.1 Paging Systems......Page 76
2.3.2 Memory Hierarchy Performance Assessment......Page 82
2.5 Further Reading and Bibliographical Notes......Page 84
EXERCISES......Page 85
REFERENCES......Page 89
3.1 From Scalar to Superscalar Processors......Page 91
3.2.1 General Organization......Page 94
3.2.2 Front-end Pipeline......Page 96
3.2.3 Back-end: Memory Operations and Control Hazards......Page 98
3.2.4 Performance Assessment......Page 99
Sidebar: The Scoreboard of the CDC 6600......Page 101
3.3.1 Register Renaming......Page 105
3.3.2 Reorder Buffer......Page 107
3.3.3 Reservation Stations and/or Instruction Window......Page 109
Sidebar: The IBM System 360/91 Floating-point Unit and Tomasulo's Algorithm......Page 111
3.4.1 General Organization......Page 118
3.4.2 Front-end......Page 121
3.4.3 Back-end......Page 122
3.4.4 Recap: P6 Microarchitecture Evolution......Page 125
3.5.1 The VLIW/EPIC Design Philosophy......Page 127
Predication......Page 128
Control Speculation......Page 130
General Organization......Page 131
Front-end Operation......Page 133
Software Pipelining......Page 134
Register Stacking......Page 135
3.6 Summary......Page 137
3.7 Further Reading and Bibliographical Notes......Page 138
EXERCISES......Page 140
REFERENCES......Page 142
4 Front-End: Branch Prediction, Instruction Fetching, and Register Renaming......Page 145
4.1.1 Anatomy of a Branch Predictor......Page 146
4.1.3 Simple Dynamic Schemes......Page 148
4.1.4 Correlated Branch Prediction......Page 154
4.1.5 Two-level Branch Predictors......Page 157
4.1.6 Repair Mechanisms......Page 159
4.1.7 Branch Target Address Prediction......Page 161
Call–Return Mechanisms......Page 166
4.1.8 More Sophisticated and Hybrid Predictors......Page 167
Sidebar: The DEC Alpha 21264 Branch Predictor......Page 173
4.2.1 Impediments in Instruction Fetch Bandwidth......Page 174
4.2.2 Trace Caches......Page 177
4.3 Decoding......Page 180
4.4 Register Renaming (a Second Look)......Page 181
4.6 Further Reading and Bibliographical Notes......Page 186
EXERCISES......Page 187
REFERENCES......Page 190
5 Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters......Page 193
5.1.1 Centralized Instruction Window and Decentralized Reservation Stations......Page 194
5.1.3 Select Step......Page 196
5.1.4 Implementation Considerations and Optimizations......Page 198
5.2 Memory-Accessing Instructions......Page 200
5.2.1 Store Instructions and the Store Buffer......Page 201
5.2.2 Load Instructions and Load Speculation......Page 203
5.2.3 Load Speculation Evaluation......Page 210
5.3 Back-End Optimizations......Page 211
5.3.1 Value Prediction......Page 212
5.3.2 Critical Instructions......Page 214
5.3.3 Clustered Microarchitectures......Page 217
5.4 Summary......Page 219
5.5 Further Reading and Bibliographical Notes......Page 220
EXERCISES......Page 221
REFERENCES......Page 222
6 The Cache Hierarchy......Page 224
Physical Index and Physical Tags......Page 225
Virtual Index and Virtual Tags......Page 227
Virtual Index and Physical Tags......Page 228
6.1.2 “Faking” Associativity......Page 229
6.1.3 Code and Data Reordering......Page 233
6.2.1 Prefetching......Page 234
Software Prefetching......Page 236
Sequential Prefetching and Stream Buffers......Page 238
Stride Prefetching......Page 239
Correlation Prefetchers......Page 242
Stateless Prefetching......Page 243
Prefetching Summary......Page 244
6.2.2 Lockup-free Caches......Page 245
Write Strategies......Page 247
6.3.1 Multilevel Inclusion Property......Page 248
Variations on LRU......Page 250
Reducing Hardware Requirements for Hit and Miss Detection......Page 253
6.3.3 Sector Caches......Page 255
6.3.4 Nonuniform Cache Access......Page 256
Sidebar: The Cache Hierarchy of the IBM Power4 and Power5......Page 257
L2 Cache......Page 259
Prefetching......Page 260
6.4.1 From DRAMs to SDRAM and DDR......Page 261
6.4.2 Improving Memory Bandwidth......Page 264
6.4.3 Direct Rambus......Page 267
6.4.4 Error-correcting Codes......Page 268
6.5 Summary......Page 269
6.6 Further Reading and Bibliographical Notes......Page 270
EXERCISES......Page 271
Programming Projects......Page 273
REFERENCES......Page 274
7 Multiprocessors......Page 276
7.1.1 Flynn's Taxonomy......Page 277
7.1.2 Shared Memory: Uniform vs. Nonuniform Access (UMA vs. NUMA)......Page 279
Shared Bus......Page 281
Meshes......Page 283
7.2 Cache Coherence......Page 285
7.2.1 Snoopy Cache Coherence Protocols......Page 286
7.2.2 Directory Protocols......Page 291
7.2.3 Performance Considerations......Page 296
7.3 Synchronization......Page 297
7.3.1 Atomic Instructions......Page 298
7.3.2 Lock Contention......Page 300
Queuing Locks......Page 302
Full/Empty Bits......Page 303
Transactional Memory......Page 304
7.3.3 Barrier Synchronization......Page 305
7.4 Relaxed Memory Models......Page 306
7.5 Multimedia Instruction Set Extensions......Page 310
7.6 Summary......Page 312
7.7 Further Reading and Bibliographical Notes......Page 313
EXERCISES......Page 314
REFERENCES......Page 316
8 Multithreading and (Chip) Multiprocessing......Page 319
8.1.1 Fine-grained Multithreading......Page 320
8.1.2 Coarse-grained Multithreading......Page 322
8.1.3 Simultaneous Multithreading......Page 325
8.1.4 Multithreading Performance Assessment......Page 329
Run-ahead Execution......Page 331
Speculative Multithreading......Page 333
8.2 General-Purpose Multithreaded Chip Multiprocessors......Page 334
8.2.1 The Sun Niagara Multiprocessor......Page 336
8.2.2 Intel Multicores......Page 338
8.3.1 The IBM Cell Multiprocessor......Page 340
8.3.2 A Network Processor: The Intel IXP 2800......Page 344
8.4 Summary......Page 346
8.5 Further Reading and Bibliographical Notes......Page 347
EXERCISES......Page 348
REFERENCES......Page 349
9 Current Limitations and Future Challenges......Page 351
9.1.1 Dynamic and Leakage Power Consumption......Page 352
9.1.2 Dynamic Power and Thermal Management......Page 353
Dynamic Voltage and Frequency Scaling......Page 355
Resource Adaptation......Page 356
Wires......Page 359
Optimal Pipeline Depth......Page 360
9.3 Challenges for Chip Multiprocessors......Page 362
9.4 Summary......Page 364
REFERENCES......Page 365
Bibliography......Page 367
Index......Page 377