Performance Analysis and Tuning on Modern CPUs

Author(s): Denis Bakhvalov
Edition: 1
Publisher: easyperf.net
Year: 2020

Language: English
Tags: CPU, performance

Table of Contents
1 Introduction
1.1 Why Do We Still Need Performance Tuning?
1.2 Who Needs Performance Tuning?
1.3 What Is Performance Analysis?
1.4 What Is Discussed in This Book?
1.5 What Is Not in This Book?
1.6 Chapter Summary
Part 1. Performance Analysis on a Modern CPU
2 Measuring Performance
2.1 Noise In Modern Systems
2.2 Measuring Performance In Production
2.3 Automated Detection of Performance Regressions
2.4 Manual Performance Testing
2.5 Software and Hardware Timers
2.6 Microbenchmarks
2.7 Chapter Summary
3 CPU Microarchitecture
3.1 Instruction Set Architecture
3.2 Pipelining
3.3 Exploiting Instruction Level Parallelism (ILP)
3.3.1 OOO Execution
3.3.2 Superscalar Engines and VLIW
3.3.3 Speculative Execution
3.4 Exploiting Thread Level Parallelism
3.4.1 Simultaneous Multithreading
3.5 Memory Hierarchy
3.5.1 Cache Hierarchy
3.5.1.1 Placement of data within the cache.
3.5.1.2 Finding data in the cache.
3.5.1.3 Managing misses.
3.5.1.4 Managing writes.
3.5.1.5 Other cache optimization techniques.
3.5.2 Main Memory
3.6 Virtual Memory
3.7 SIMD Multiprocessors
3.8 Modern CPU design
3.8.1 CPU Front-End
3.8.2 CPU Back-End
3.9 Performance Monitoring Unit
3.9.1 Performance Monitoring Counters
4 Terminology and Metrics in Performance Analysis
4.1 Retired vs. Executed Instruction
4.2 CPU Utilization
4.3 CPI & IPC
4.4 UOPs (micro-ops)
4.5 Pipeline Slot
4.6 Core vs. Reference Cycles
4.7 Cache Miss
4.8 Mispredicted Branch
5 Performance Analysis Approaches
5.1 Code Instrumentation
5.2 Tracing
5.3 Workload Characterization
5.3.1 Counting Performance Events
5.3.2 Manual performance counters collection
5.3.3 Multiplexing and scaling events
5.4 Sampling
5.4.1 User-Mode And Hardware Event-based Sampling
5.4.2 Finding Hotspots
5.4.3 Collecting Call Stacks
5.4.4 Flame Graphs
5.5 Roofline Performance Model
5.6 Static Performance Analysis
5.6.1 Static vs. Dynamic Analyzers
5.7 Compiler Optimization Reports
5.8 Chapter Summary
6 CPU Features For Performance Analysis
6.1 Top-Down Microarchitecture Analysis
6.1.1 TMA in Intel® VTune™ Profiler
6.1.2 TMA in Linux Perf
6.1.3 Step 1: Identify the bottleneck
6.1.4 Step 2: Locate the place in the code
6.1.5 Step 3: Fix the issue
6.1.6 Summary
6.2 Last Branch Record
6.2.1 Collecting LBR stacks
6.2.2 Capture call graph
6.2.3 Identify hot branches
6.2.4 Analyze branch misprediction rate
6.2.5 Precise timing of machine code
6.2.6 Estimating branch outcome probability
6.2.7 Other use cases
6.3 Processor Event-Based Sampling
6.3.1 Precise events
6.3.2 Lower sampling overhead
6.3.3 Analyzing memory accesses
6.4 Intel Processor Traces
6.4.1 Workflow
6.4.2 Timing Packets
6.4.3 Collecting and Decoding Traces
6.4.4 Usages
6.4.5 Disk Space and Decoding Time
6.5 Chapter Summary
Part 2. Source Code Tuning For CPU
7 CPU Front-End Optimizations
7.1 Machine code layout
7.2 Basic Block
7.3 Basic block placement
7.4 Basic block alignment
7.5 Function splitting
7.6 Function grouping
7.7 Profile Guided Optimizations
7.8 Optimizing for ITLB
7.9 Chapter Summary
8 CPU Back-End Optimizations
8.1 Memory Bound
8.1.1 Cache-Friendly Data Structures
8.1.1.1 Access data sequentially.
8.1.1.2 Use appropriate containers.
8.1.1.3 Packing the data.
8.1.1.4 Aligning and padding.
8.1.1.5 Dynamic memory allocation.
8.1.1.6 Tune the code for memory hierarchy.
8.1.2 Explicit Memory Prefetching
8.1.3 Optimizing For DTLB
8.1.3.1 Explicit Hugepages.
8.1.3.2 Transparent Hugepages.
8.1.3.3 Explicit vs. Transparent Hugepages.
8.2 Core Bound
8.2.1 Inlining Functions
8.2.2 Loop Optimizations
8.2.2.1 Low-level optimizations.
8.2.2.2 High-level optimizations.
8.2.2.3 Discovering loop optimization opportunities.
8.2.2.4 Use Loop Optimization Frameworks
8.2.3 Vectorization
8.2.3.1 Compiler Autovectorization.
8.2.3.2 Discovering vectorization opportunities.
8.2.3.3 Vectorization is illegal.
8.2.3.4 Vectorization is not beneficial.
8.2.3.5 Loop vectorized but scalar version used.
8.2.3.6 Loop vectorized in a suboptimal way.
8.2.3.7 Use languages with explicit vectorization.
8.3 Chapter Summary
9 Optimizing Bad Speculation
9.1 Replace branches with lookup
9.2 Replace branches with predication
9.3 Chapter Summary
10 Other Tuning Areas
10.1 Compile-Time Computations
10.2 Compiler Intrinsics
10.3 Cache Warming
10.4 Detecting Slow FP Arithmetic
10.5 System Tuning
11 Optimizing Multithreaded Applications
11.1 Performance Scaling And Overhead
11.2 Parallel Efficiency Metrics
11.2.1 Effective CPU Utilization
11.2.2 Thread Count
11.2.3 Wait Time
11.2.4 Spin Time
11.3 Analysis With Intel VTune Profiler
11.3.1 Find Expensive Locks
11.3.2 Platform View
11.4 Analysis with Linux Perf
11.4.1 Find Expensive Locks
11.5 Analysis with Coz
11.6 Analysis with eBPF and GAPP
11.7 Detecting Coherence Issues
11.7.1 Cache Coherency Protocols
11.7.2 True Sharing
11.7.3 False Sharing
11.8 Chapter Summary
Epilog
Glossary
References
Appendix A. Reducing Measurement Noise
Appendix B. The LLVM Vectorizer