Since most applications today are distributed in some fashion, monitoring their health and performance requires a new approach. Enter distributed tracing, a method of profiling and monitoring distributed applications — particularly those that use microservice architectures. There’s just one problem: distributed tracing can be hard. But it doesn’t have to be.
With this guide, you’ll learn what distributed tracing is and how to use it to understand the performance and operation of your software. Key players at LightStep and other organizations walk you through instrumenting your code for tracing, collecting the data that your instrumentation produces, and turning it into useful operational insights. If you want to implement distributed tracing, this book tells you what you need to know.
You’ll learn:
• The pieces of a distributed tracing deployment: instrumentation, data collection, and analysis
• Best practices for instrumentation: methods for generating trace data from your services
• How to deal with (or avoid) overhead using sampling and other techniques
• How to use distributed tracing to improve baseline performance and to mitigate regressions quickly
• Where distributed tracing is headed in the future
Author(s): Austin Parker, Daniel Spoonhower, Jonathan Mace, Ben Sigelman, Rebecca Isaacs
Edition: 1
Publisher: O'Reilly Media
Year: 2020
Language: English
Pages: 330
City: Sebastopol, CA
Tags: Monitoring; Logging; Microservices; Distributed Applications; Performance; Tracing; Dapper; X-Ray; Zipkin; OpenTelemetry; OpenTracing; OpenCensus; X-Trace; Magpie; Context Propagation
Copyright
Table of Contents
Foreword
Introduction: What Is Distributed Tracing?
Distributed Architectures and You
Deep Systems
The Difficulties of Understanding Distributed Architectures
How Does Distributed Tracing Help?
Distributed Tracing and You
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. The Problem with Distributed Tracing
The Pieces of a Distributed Tracing Deployment
Distributed Tracing, Microservices, Serverless, Oh My!
The Benefits of Tracing
Setting the Table
Chapter 2. An Ontology of Instrumentation
White Box Versus Black Box
Application Versus System
Agents Versus Libraries
Propagating Context
Interprocess Propagation
Intraprocess Propagation
The Shape of Distributed Tracing
Tracing-Friendly Microservices and Serverless
Tracing in a Monolith
Tracing in Web and Mobile Clients
Chapter 3. Open Source Instrumentation: Interfaces, Libraries, and Frameworks
The Importance of Abstract Instrumentation
OpenTelemetry
OpenTracing and OpenCensus
OpenTracing
OpenCensus
Other Notable Formats and Projects
X-Ray
Zipkin
Interoperability and Migration Strategies
Why Use Open Source Instrumentation?
Interoperability
Portability
Ecosystem and Implicit Visibility
Chapter 4. Best Practices for Instrumentation
Tracing by Example
Installing the Sample Application
Adding Basic Distributed Tracing
Custom Instrumentation
Where to Start—Nodes and Edges
Framework Instrumentation
Service Mesh Instrumentation
Creating Your Service Graph
What’s in a Span?
Effective Naming
Effective Tagging
Effective Logging
Understanding Performance Considerations
Trace-Driven Development
Developing with Traces
Testing with Traces
Creating an Instrumentation Plan
Making the Case for Instrumentation
Instrumentation Quality Checklist
Knowing When to Stop Instrumenting
Smart and Sustainable Instrumentation Growth
Chapter 5. Deploying Tracing
Organizational Adoption
Start Close to Your Users
Start Centrally: Load Balancers and Gateways
Leverage Infrastructure: RPC Frameworks and Service Meshes
Make Adoption Repeatable
Tracer Architecture
In-Process Libraries
Sidecars and Agents
Collectors
Centralized Storage and Analysis
Incremental Deployment
Data Provenance, Security, and Federation
Frontend Service Telemetry
Server-Side Telemetry for Managed Services
Summary
Chapter 6. Overhead, Costs, and Sampling
Application Overhead
Latency
Throughput
Infrastructure Costs
Network
Storage
Sampling
Minimum Requirements
Strategies
Selecting Traces
Off-the-Shelf ETL Solutions
Summary
Chapter 7. A New Observability Scorecard
The Three Pillars Defined
Metrics
Logging
Distributed Tracing
Fatal Flaws of the Three Pillars
Design Goals
Assessing the Three Pillars
Three Pipes (Not Pillars)
Observability Goals and Activities
Two Goals in Observability
Two Fundamental Activities in Observability
A New Scorecard
The Path Ahead
Chapter 8. Improving Baseline Performance
Measuring Performance
Percentiles
Histograms
Defining the Critical Path
Approaches to Improving Performance
Individual Traces
Biased Sampling and Trace Comparison
Trace Search
Multimodal Analysis
Aggregate Analysis
Correlation Analysis
Summary
Chapter 9. Restoring Baseline Performance
Defining the Problem
Human Factors
(Avoiding) Finger-Pointing
“Suppressing” the Messenger
Incident Hand-off
Good Postmortems
Approaches to Restoring Performance
Integration with Alerting Workflows
Individual Traces
Biased Sampling
Real-Time Response
Knowing What’s Normal
Aggregate and Correlation Root Cause Analysis
Summary
Chapter 10. Are We There Yet? The Past and Present
Distributed Tracing: A History of Pragmatism
Request-Based Systems
Response Time Matters
Request-Oriented Information
Notable Work
Pinpoint
Magpie
X-Trace
Dapper
Where to Next?
Chapter 11. Beyond Individual Requests
The Value of Traces in Aggregate
Example 1: Is Network Congestion Affecting My Application?
Example 2: What Services Are Required to Serve an API Endpoint?
Organizing the Data
A Strawperson Solution
What About the Trade-offs?
Sampling for Aggregate Analysis
The Processing Pipeline
Incorporating Heterogeneous Data
Custom Functions
Joining with Other Data Sources
Recap and Case Study
The Value of Traces in Aggregate
Organizing the Data
Sampling for Aggregate Analysis
The Processing Pipeline
Incorporating Heterogeneous Data
Chapter 12. Beyond Spans
Why Spans Have Prevailed
Visibility
Pragmatism
Portability
Compatibility
Flexibility
Why Spans Aren’t Enough
Graphs, Not Trees
Inter-Request Dependencies
Decoupled Dependencies
Distributed Dataflow
Machine Learning
Low-Level Performance Metrics
New Abstractions
Seeing Causality
Chapter 13. Beyond Distributed Tracing
Limitations of Distributed Tracing
Challenge 1: Anticipating Problems
Challenge 2: Completeness Versus Costs
Challenge 3: Open-Ended Use Cases
Other Tools Like Distributed Tracing
Census
A Motivating Example
A Distributed Tracing Solution?
Tag Propagation and Local Metric Aggregation
Comparison to Distributed Tracing
Pivot Tracing
Dynamic Instrumentation
Recurring Problems
How Does It Work?
Dynamic Context
Comparison to Distributed Tracing
Pythia
Performance Regressions
Design
Overheads
Comparison to Distributed Tracing
Summary
Chapter 14. The Future of Context Propagation
Cross-Cutting Tools
Use Cases
Distributed Tracing
Cross-Component Metrics
Cross-Component Resource Management
Managing Data Quality Trade-offs
Failure Testing of Microservices
Enforcing Cross-System Consistency
Request Duplication
Record Lineage in Stream Processing Systems
Auditing Security Policies
Testing in Production
Common Themes
Should You Care?
The Tracing Plane
Is Baggage Enough?
Beyond Key-Value Pairs
Compiling BDL
BaggageContext
Merging
Overheads
Summary
Appendix A. The State of Distributed Tracing Circa 2020
Open Source Tracers and Trace Analysis
Commercial Tracers and Trace Analyzers
Language-Specific Tracing Features
Java and C#
Go, Rust, and C++
Python, JavaScript, and Other Dynamic Languages
Appendix B. Context Propagation in OpenTelemetry
Why a Separate Context Model?
The OpenTelemetry Context Model
W3C CorrelationContext and the Correlations API
Distributed and Local Context
Examples and Potential Applications
Bibliography
Index
About the Authors
Colophon