Observability Engineering: Achieving Production Excellence

Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand the experience of each and every user. This practical book explains the value of observable systems and shows you how to practice observability-driven development. Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).

You'll explore:
• How the concept of observability applies to managing software at scale
• The value of practicing observability when delivering complex cloud native applications and systems
• The impact observability has across the entire software development lifecycle
• How and why different functional teams use observability with service-level objectives
• How to instrument your code to help future engineers understand the code you wrote today
• How to produce quality code for context-aware system debugging and maintenance
• How data-rich analytics can help you debug elusive issues

Author(s): Charity Majors, Liz Fong-Jones, George Miranda
Edition: 1
Publisher: O'Reilly Media
Year: 2022

Language: English
Commentary: Vector PDF
Pages: 318
City: Sebastopol, CA
Tags: DevOps; Management; Debugging; Monitoring; Logging; Microservices; Pipelines; Scalability; Business Intelligence; Site Reliability Engineering; Storage Management; Team Management; OpenTelemetry; Observability; Cloud-Native Applications; Metrics; Sampling; Telemetry; Observability-Driven Development; Return on Investment

Cover
Copyright
Table of Contents
Foreword
Preface
Who This Book Is For
Why We Wrote This Book
What You Will Learn
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. The Path to Observability
Chapter 1. What Is Observability?
The Mathematical Definition of Observability
Applying Observability to Software Systems
Mischaracterizations About Observability for Software
Why Observability Matters Now
Is This Really the Best Way?
Why Are Metrics and Monitoring Not Enough?
Debugging with Metrics Versus Observability
The Role of Cardinality
The Role of Dimensionality
Debugging with Observability
Observability Is for Modern Systems
Conclusion
Chapter 2. How Debugging Practices Differ Between Observability and Monitoring
How Monitoring Data Is Used for Debugging
Troubleshooting Behaviors When Using Dashboards
The Limitations of Troubleshooting by Intuition
Traditional Monitoring Is Fundamentally Reactive
How Observability Enables Better Debugging
Conclusion
Chapter 3. Lessons from Scaling Without Observability
An Introduction to Parse
Scaling at Parse
The Evolution Toward Modern Systems
The Evolution Toward Modern Practices
Shifting Practices at Parse
Conclusion
Chapter 4. How Observability Relates to DevOps, SRE, and Cloud Native
Cloud Native, DevOps, and SRE in a Nutshell
Observability: Debugging Then Versus Now
Observability Empowers DevOps and SRE Practices
Conclusion
Part II. Fundamentals of Observability
Chapter 5. Structured Events Are the Building Blocks of Observability
Debugging with Structured Events
The Limitations of Metrics as a Building Block
The Limitations of Traditional Logs as a Building Block
Unstructured Logs
Structured Logs
Properties of Events That Are Useful in Debugging
Conclusion
Chapter 6. Stitching Events into Traces
Distributed Tracing and Why It Matters Now
The Components of Tracing
Instrumenting a Trace the Hard Way
Adding Custom Fields into Trace Spans
Stitching Events into Traces
Conclusion
Chapter 7. Instrumentation with OpenTelemetry
A Brief Introduction to Instrumentation
Open Instrumentation Standards
Instrumentation Using Code-Based Examples
Start with Automatic Instrumentation
Add Custom Instrumentation
Send Instrumentation Data to a Backend System
Conclusion
Chapter 8. Analyzing Events to Achieve Observability
Debugging from Known Conditions
Debugging from First Principles
Using the Core Analysis Loop
Automating the Brute-Force Portion of the Core Analysis Loop
This Misleading Promise of AIOps
Conclusion
Chapter 9. How Observability and Monitoring Come Together
Where Monitoring Fits
Where Observability Fits
System Versus Software Considerations
Assessing Your Organizational Needs
Exceptions: Infrastructure Monitoring That Can’t Be Ignored
Real-World Examples
Conclusion
Part III. Observability for Teams
Chapter 10. Applying Observability Practices in Your Team
Join a Community Group
Start with the Biggest Pain Points
Buy Instead of Build
Flesh Out Your Instrumentation Iteratively
Look for Opportunities to Leverage Existing Efforts
Prepare for the Hardest Last Push
Conclusion
Chapter 11. Observability-Driven Development
Test-Driven Development
Observability in the Development Cycle
Determining Where to Debug
Debugging in the Time of Microservices
How Instrumentation Drives Observability
Shifting Observability Left
Using Observability to Speed Up Software Delivery
Conclusion
Chapter 12. Using Service-Level Objectives for Reliability
Traditional Monitoring Approaches Create Dangerous Alert Fatigue
Threshold Alerting Is for Known-Unknowns Only
User Experience Is a North Star
What Is a Service-Level Objective?
Reliable Alerting with SLOs
Changing Culture Toward SLO-Based Alerts: A Case Study
Conclusion
Chapter 13. Acting on and Debugging SLO-Based Alerts
Alerting Before Your Error Budget Is Empty
Framing Time as a Sliding Window
Forecasting to Create a Predictive Burn Alert
The Lookahead Window
The Baseline Window
Acting on SLO Burn Alerts
Using Observability Data for SLOs Versus Time-Series Data
Conclusion
Chapter 14. Observability and the Software Supply Chain
Why Slack Needed Observability
Instrumentation: Shared Client Libraries and Dimensions
Case Studies: Operationalizing the Supply Chain
Understanding Context Through Tooling
Embedding Actionable Alerting
Understanding What Changed
Conclusion
Part IV. Observability at Scale
Chapter 15. Build Versus Buy and Return on Investment
How to Analyze the ROI of Observability
The Real Costs of Building Your Own
The Hidden Costs of Using “Free” Software
The Benefits of Building Your Own
The Risks of Building Your Own
The Real Costs of Buying Software
The Hidden Financial Costs of Commercial Software
The Hidden Nonfinancial Costs of Commercial Software
The Benefits of Buying Commercial Software
The Risks of Buying Commercial Software
Buy Versus Build Is Not a Binary Choice
Conclusion
Chapter 16. Efficient Data Storage
The Functional Requirements for Observability
Time-Series Databases Are Inadequate for Observability
Other Possible Data Stores
Data Storage Strategies
Case Study: The Implementation of Honeycomb’s Retriever
Partitioning Data by Time
Storing Data by Column Within Segments
Performing Query Workloads
Querying for Traces
Querying Data in Real Time
Making It Affordable with Tiering
Making It Fast with Parallelism
Dealing with High Cardinality
Scaling and Durability Strategies
Notes on Building Your Own Efficient Data Store
Conclusion
Chapter 17. Cheap and Accurate Enough: Sampling
Sampling to Refine Your Data Collection
Using Different Approaches to Sampling
Constant-Probability Sampling
Sampling on Recent Traffic Volume
Sampling Based on Event Content (Keys)
Combining per Key and Historical Methods
Choosing Dynamic Sampling Options
When to Make a Sampling Decision for Traces
Translating Sampling Strategies into Code
The Base Case
Fixed-Rate Sampling
Recording the Sample Rate
Consistent Sampling
Target Rate Sampling
Having More Than One Static Sample Rate
Sampling by Key and Target Rate
Sampling with Dynamic Rates on Arbitrarily Many Keys
Putting It All Together: Head and Tail per Key Target Rate Sampling
Conclusion
Chapter 18. Telemetry Management with Pipelines
Attributes of Telemetry Pipelines
Routing
Security and Compliance
Workload Isolation
Data Buffering
Capacity Management
Data Filtering and Augmentation
Data Transformation
Ensuring Data Quality and Consistency
Managing a Telemetry Pipeline: Anatomy
Challenges When Managing a Telemetry Pipeline
Performance
Correctness
Availability
Reliability
Isolation
Data Freshness
Use Case: Telemetry Management at Slack
Metrics Aggregation
Logs and Trace Events
Open Source Alternatives
Managing a Telemetry Pipeline: Build Versus Buy
Conclusion
Part V. Spreading Observability Culture
Chapter 19. The Business Case for Observability
The Reactive Approach to Introducing Change
The Return on Investment of Observability
The Proactive Approach to Introducing Change
Introducing Observability as a Practice
Using the Appropriate Tools
Instrumentation
Data Storage and Analytics
Rolling Out Tools to Your Teams
Knowing When You Have Enough Observability
Conclusion
Chapter 20. Observability’s Stakeholders and Allies
Recognizing Nonengineering Observability Needs
Creating Observability Allies in Practice
Customer Support Teams
Customer Success and Product Teams
Sales and Executive Teams
Using Observability Versus Business Intelligence Tools
Query Execution Time
Accuracy
Recency
Structure
Time Windows
Ephemerality
Using Observability and BI Tools Together in Practice
Conclusion
Chapter 21. An Observability Maturity Model
A Note About Maturity Models
Why Observability Needs a Maturity Model
About the Observability Maturity Model
Capabilities Referenced in the OMM
Respond to System Failure with Resilience
Deliver High-Quality Code
Manage Complexity and Technical Debt
Release on a Predictable Cadence
Understand User Behavior
Using the OMM for Your Organization
Conclusion
Chapter 22. Where to Go from Here
Observability, Then Versus Now
Additional Resources
Predictions for Where Observability Is Going
Index
About the Authors
Colophon