https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
Author(s): Paul E. McKenney
Edition: 2017-01
Year: 2017
Language: English
Pages: 694
Cover......Page 1
Title......Page 2
Contents......Page 4
1.1 Roadmap......Page 12
1.2 Quick Quizzes......Page 13
1.3 Alternatives to This Book......Page 14
1.5 Whose Book Is This?......Page 15
2.1 Historic Parallel Programming Difficulties......Page 18
2.2.1 Performance......Page 20
2.2.2 Productivity......Page 22
2.2.3 Generality......Page 23
2.3 Alternatives to Parallel Programming......Page 25
2.3.3 Performance Optimization......Page 26
2.4 What Makes Parallel Programming Hard?......Page 27
2.4.1 Work Partitioning......Page 28
2.4.3 Resource Partitioning and Replication......Page 29
2.4.6 How Do Languages and Environments Assist With These Tasks?......Page 30
2.5 Discussion......Page 31
3.1.1 Pipelined CPUs......Page 32
3.1.2 Memory References......Page 34
3.1.3 Atomic Operations......Page 35
3.1.4 Memory Barriers......Page 36
3.1.6 I/O Operations......Page 37
3.2.1 Hardware System Architecture......Page 38
3.2.2 Costs of Operations......Page 39
3.3 Hardware Free Lunch?......Page 41
3.3.2 Novel Materials and Processes......Page 42
3.3.4 Special-Purpose Accelerators......Page 43
3.4 Software Design Implications......Page 44
4.1 Scripting Languages......Page 46
4.2.1 POSIX Process Creation and Destruction......Page 47
4.2.2 POSIX Thread Creation and Destruction......Page 49
4.2.3 POSIX Locking......Page 50
4.2.4 POSIX Reader-Writer Locking......Page 53
4.2.5 Atomic Operations (gcc Classic)......Page 56
4.2.6 Atomic Operations (C11)......Page 57
4.3.2 Thread Creation, Destruction, and Control......Page 58
4.3.3 Locking......Page 60
4.3.5 Per-CPU Variables......Page 62
4.4 The Right Tool for the Job: How to Choose?......Page 64
5 Counting......Page 66
5.1 Why Isn't Concurrent Counting Trivial?......Page 67
5.2.1 Design......Page 69
5.2.2 Array-Based Implementation......Page 70
5.2.3 Eventually Consistent Implementation......Page 71
5.2.4 Per-Thread-Variable-Based Implementation......Page 74
5.3.1 Design......Page 75
5.3.2 Simple Limit Counter Implementation......Page 76
5.3.3 Simple Limit Counter Discussion......Page 82
5.4 Exact Limit Counters......Page 83
5.4.1 Atomic Limit Counter Implementation......Page 84
5.4.3 Signal-Theft Limit Counter Design......Page 88
5.4.4 Signal-Theft Limit Counter Implementation......Page 90
5.5 Applying Specialized Parallel Counters......Page 96
5.6.1 Parallel Counting Performance......Page 97
5.6.2 Parallel Counting Specializations......Page 98
5.6.3 Parallel Counting Lessons......Page 99
6.1.1 Dining Philosophers Problem......Page 102
6.1.2 Double-Ended Queue......Page 106
6.2 Design Criteria......Page 114
6.3.1 Sequential Program......Page 117
6.3.2 Code Locking......Page 118
6.3.3 Data Locking......Page 120
6.3.4 Data Ownership......Page 122
6.3.5 Locking Granularity and Performance......Page 123
6.4 Parallel Fastpath......Page 126
6.4.2 Hierarchical Locking......Page 127
6.4.3 Resource Allocator Caches......Page 128
6.5.1 Work-Queue Parallel Maze Solver......Page 134
6.5.2 Alternative Parallel Maze Solver......Page 137
6.5.3 Performance Comparison I......Page 138
6.5.4 Alternative Sequential Maze Solver......Page 141
6.5.5 Performance Comparison II......Page 142
6.5.6 Future Directions and Conclusions......Page 143
6.6 Partitioning, Parallelism, and Optimization......Page 144
7 Locking......Page 146
7.1 Staying Alive......Page 147
7.1.1 Deadlock......Page 148
7.1.2 Livelock and Starvation......Page 156
7.1.3 Unfairness......Page 157
7.2 Types of Locks......Page 158
7.2.3 Beyond Reader-Writer Locks......Page 159
7.2.4 Scoped Locking......Page 161
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange......Page 163
7.3.2 Other Exclusive-Locking Implementations......Page 164
7.4 Lock-Based Existence Guarantees......Page 166
7.5.2 Locking For Parallel Libraries: Just Another Tool......Page 168
7.5.3 Locking For Parallelizing Sequential Libraries: Villain!......Page 172
7.6 Summary......Page 174
8.1 Multiple Processes......Page 176
8.3 Function Shipping......Page 177
8.6 Other Uses of Data Ownership......Page 178
9.1 Running Example......Page 180
9.2 Reference Counting......Page 181
9.3 Hazard Pointers......Page 186
9.4 Sequence Locks......Page 189
9.5 Read-Copy Update (RCU)......Page 194
9.5.1 Introduction to RCU......Page 195
9.5.2 RCU Fundamentals......Page 198
9.5.3 RCU Usage......Page 208
9.5.4 RCU Linux-Kernel API......Page 223
9.5.5 ``Toy'' RCU Implementations......Page 229
9.5.6 RCU Exercises......Page 247
9.6 Which to Choose?......Page 248
9.7 What About Updates?......Page 249
10.1 Motivating Application......Page 252
10.2.2 Hash-Table Implementation......Page 253
10.2.3 Hash-Table Performance......Page 257
10.3 Read-Mostly Data Structures......Page 258
10.3.1 RCU-Protected Hash Table Implementation......Page 259
10.3.2 RCU-Protected Hash Table Performance......Page 260
10.3.3 RCU-Protected Hash Table Discussion......Page 263
10.4.1 Resizable Hash Table Design......Page 264
10.4.2 Resizable Hash Table Implementation......Page 266
10.4.3 Resizable Hash Table Discussion......Page 272
10.4.4 Other Resizable Hash Tables......Page 274
10.5 Other Data Structures......Page 277
10.6.2 Bits and Bytes......Page 278
10.6.3 Hardware Considerations......Page 279
10.7 Summary......Page 281
11.1 Introduction......Page 282
11.1.1 Where Do Bugs Come From?......Page 283
11.1.2 Required Mindset......Page 284
11.1.3 When Should Validation Start?......Page 285
11.1.4 The Open Source Way......Page 287
11.2 Tracing......Page 288
11.4 Static Analysis......Page 289
11.5.2 Walkthroughs......Page 290
11.5.3 Self-Inspection......Page 291
11.6 Probability and Heisenbugs......Page 293
11.6.1 Statistics for Discrete Testing......Page 294
11.6.3 Statistics for Continuous Testing......Page 296
11.6.4 Hunting Heisenbugs......Page 298
11.7 Performance Estimation......Page 301
11.7.2 Profiling......Page 302
11.7.4 Microbenchmarking......Page 303
11.7.5 Isolation......Page 304
11.7.6 Detecting Interference......Page 305
11.8 Summary......Page 308
12.1.1 Promela and Spin......Page 312
12.1.2 How to Use Promela......Page 316
12.1.3 Promela Example: Locking......Page 319
12.1.4 Promela Example: QRCU......Page 321
12.1.5 Promela Parable: dynticks and Preemptible RCU......Page 328
12.1.6 Validating Preemptible RCU and dynticks......Page 333
12.2 Special-Purpose State-Space Search......Page 353
12.2.1 Anatomy of a Litmus Test......Page 354
12.2.2 What Does This Litmus Test Mean?......Page 355
12.2.3 Running a Litmus Test......Page 356
12.2.4 PPCMEM Discussion......Page 357
12.3 Axiomatic Approaches......Page 358
12.5 Summary......Page 360
13.1.2 Counting Lookups......Page 364
13.2 Refurbish Reference Counting......Page 365
13.2.1 Implementation of Reference-Counting Categories......Page 366
13.2.2 Linux Primitives Supporting Reference Counting......Page 371
13.3 RCU Rescues......Page 372
13.3.1 RCU and Per-Thread-Variable-Based Statistical Counters......Page 373
13.3.2 RCU and Counters for Removable I/O Devices......Page 375
13.3.3 Array and Length......Page 376
13.3.4 Correlated Fields......Page 377
13.4.1 Correlated Data Elements......Page 378
13.4.2 Update-Friendly Hash-Table Traversal......Page 379
14.1 Avoiding Locks......Page 380
14.2.1 Memory Ordering and Memory Barriers......Page 381
14.2.2 If B Follows A, and C Follows B, Why Doesn't C Follow A?......Page 382
14.2.3 Variables Can Have More Than One Value......Page 384
14.2.4 What Can You Trust?......Page 385
14.2.6 A Few Simple Rules......Page 393
14.2.7 Abstract Memory Access Model......Page 394
14.2.8 Device Operations......Page 395
14.2.9 Guarantees......Page 396
14.2.10 What Are Memory Barriers?......Page 397
14.2.11 Locking Constraints......Page 410
14.2.12 Memory-Barrier Examples......Page 411
14.2.13 The Effects of the CPU Cache......Page 413
14.3 Non-Blocking Synchronization......Page 415
14.3.1 Simple NBS......Page 417
14.3.2 NBS Discussion......Page 418
15.1.1 Soft Real Time......Page 420
15.1.2 Hard Real Time......Page 421
15.1.3 Real-World Real Time......Page 422
15.2 Who Needs Real-Time Computing?......Page 426
15.3 Who Needs Parallel Real-Time Computing?......Page 427
15.4 Implementing Parallel Real-Time Systems......Page 428
15.4.1 Implementing Parallel Real-Time Operating Systems......Page 429
15.4.2 Implementing Parallel Real-Time Applications......Page 442
15.4.3 The Role of RCU......Page 445
15.5 Real Time vs. Real Fast: How to Choose?......Page 446
16.2 Rusty Scale for API Design......Page 448
16.3 Shaving the Mandelbrot Set......Page 450
17.1.1 Uniprocessor Über Alles......Page 452
17.1.2 Multithreaded Mania......Page 453
17.1.3 More of the Same......Page 454
17.1.4 Crash Dummies Slamming into the Memory Wall......Page 455
17.2.1 Outside World......Page 458
17.2.2 Process Modification......Page 462
17.2.3 Synchronization......Page 467
17.2.4 Discussion......Page 471
17.3 Hardware Transactional Memory......Page 473
17.3.1 HTM Benefits WRT to Locking......Page 474
17.3.2 HTM Weaknesses WRT Locking......Page 476
17.3.3 HTM Weaknesses WRT to Locking When Augmented......Page 482
17.3.4 Where Does HTM Best Fit In?......Page 485
17.3.5 Potential Game Changers......Page 486
17.3.6 Conclusions......Page 488
17.4 Functional Programming for Parallelism......Page 489
A.1 What Does ``After'' Mean?......Page 492
A.2 What is the Difference Between ``Concurrent'' and ``Parallel''?......Page 495
A.3 What Time Is It?......Page 496
B.1 Cache Structure......Page 498
B.2 Cache-Coherence Protocols......Page 500
B.2.2 MESI Protocol Messages......Page 501
B.2.3 MESI State Diagram......Page 502
B.2.4 MESI Protocol Example......Page 504
B.3.1 Store Buffers......Page 505
B.3.2 Store Forwarding......Page 506
B.3.3 Store Buffers and Memory Barriers......Page 507
B.4.1 Invalidate Queues......Page 510
B.4.3 Invalidate Queues and Memory Barriers......Page 511
B.5 Read and Write Memory Barriers......Page 514
B.6.1 Ordering-Hostile Architecture......Page 515
B.6.3 Example 2......Page 516
B.6.4 Example 3......Page 517
B.7 Memory-Barrier Instructions For Specific CPUs......Page 518
B.7.1 Alpha......Page 520
B.7.3 ARMv7-A/R......Page 522
B.7.4 IA64......Page 523
B.7.5 MIPS......Page 524
B.7.7 POWER / PowerPC......Page 525
B.7.8 SPARC RMO, PSO, and TSO......Page 526
B.7.9 x86......Page 527
B.8 Are Memory Barriers Forever?......Page 528
B.9 Advice to Hardware Designers......Page 529
C.1 How To Use This Book......Page 532
C.2 Introduction......Page 533
C.3 Hardware and its Habits......Page 539
C.4 Tools of the Trade......Page 543
C.5 Counting......Page 551
C.6 Partitioning and Synchronization Design......Page 571
C.7 Locking......Page 577
C.8 Data Ownership......Page 587
C.9 Deferred Processing......Page 590
C.10 Data Structures......Page 614
C.11 Validation......Page 618
C.12 Formal Verification......Page 625
C.13 Putting It All Together......Page 632
C.14 Advanced Synchronization......Page 636
C.15 Parallel Real-Time Computing......Page 641
C.16 Ease of Use......Page 643
C.17 Conflicting Visions of the Future......Page 644
C.18 Important Questions......Page 648
C.19 Why Memory Barriers?......Page 649
D Glossary and Bibliography......Page 654
E.2 Reviewers......Page 690
E.5 Figure Credits......Page 691
E.6 Other Support......Page 693
Back......Page 694