Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets

Although service-level objectives (SLOs) continue to grow in importance, there’s a distinct lack of information about how to implement them. Practical advice that does exist usually assumes that your team already has the infrastructure, tooling, and culture in place. In this book, recognized SLO expert Alex Hidalgo explains how to build an SLO culture from the ground up. Ideal as a primer and daily reference for anyone creating both the culture and tooling necessary for SLO-based approaches to reliability, this guide provides detailed analysis of advanced SLO and service-level indicator (SLI) techniques. Armed with mathematical models and statistical knowledge to help you get the most out of an SLO-based approach, you’ll learn how to build systems capable of measuring meaningful SLIs with buy-in across all departments of your organization.

• Define SLIs that meaningfully measure the reliability of a service from a user’s perspective
• Choose appropriate SLO targets, including how to perform statistical and probabilistic analysis
• Use error budgets to help your team have better discussions and make better data-driven decisions
• Build supportive tooling and resources required for an SLO-based approach
• Use SLO data to present meaningful reports to leadership and your users

Author(s): Alex Hidalgo
Edition: 1
Publisher: O'Reilly Media
Year: 2020

Language: English
Pages: 404
City: Sebastopol, CA
Tags: DevOps; Management; Reliability; Monitoring; Microservices; Statistics; Metric Analysis; Culture; Software Architecture; Team Management; Performance; Reporting; Service Level Objectives

Copyright
Table of Contents
Foreword
Preface
You Don’t Have to Be Perfect
How to Read This Book
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. SLO Development
Chapter 1. The Reliability Stack
Service Truths
The Reliability Stack
Service Level Indicators
Service Level Objectives
Error Budgets
What Is a Service?
Example Services
Things to Keep in Mind
SLOs Are Just Data
SLOs Are a Process, Not a Project
Iterate Over Everything
The World Will Change
It’s All About Humans
Summary
Chapter 2. How to Think About Reliability
Reliability Engineering
Past Performance and Your Users
Implied Agreements
Making Agreements
A Worked Example of Reliability
How Reliable Should You Be?
100% Isn’t Necessary
Reliability Is Expensive
How to Think About Reliability
Summary
Chapter 3. Developing Meaningful Service Level Indicators
What Meaningful SLIs Provide
Happier Users
Happier Engineers
A Happier Business
Caring About Many Things
A Request and Response Service
Measuring Many Things by Measuring Only a Few
A Written Example
Something More Complex
Measuring Complex Service User Reliability
Another Written Example
Business Alignment and SLIs
Summary
Chapter 4. Choosing Good Service Level Objectives
Reliability Targets
User Happiness
The Problem of Being Too Reliable
The Problem with the Number Nine
The Problem with Too Many SLOs
Service Dependencies and Components
Service Dependencies
Service Components
Reliability for Things You Don’t Own
Open Source or Hosted Services
Measuring Hardware
Choosing Targets
Past Performance
Basic Statistics
Metric Attributes
Percentile Thresholds
What to Do Without a History
Summary
Chapter 5. How to Use Error Budgets
Error Budgets in Practice
To Release New Features or Not?
Project Focus
Examining Risk Factors
Experimentation and Chaos Engineering
Load and Stress Tests
Blackhole Exercises
Purposely Burning Budget
Error Budgets for Humans
Error Budget Measurement
Establishing Error Budgets
Decision Making
Error Budget Policies
Summary
Part II. SLO Implementation
Chapter 6. Getting Buy-In
Engineering Is More than Code
Key Stakeholders
Engineering
Product
Operations
QA
Legal
Executive Leadership
Making It So
Order of Operation
Common Objections and How to Overcome Them
Your First Error Budget Policy (and Your First Critical Test)
Lessons Learned the Hard Way
Summary
Chapter 7. Measuring SLIs and SLOs
Design Goals
Flexible Targets
Testable Targets
Freshness
Cost
Reliability
Organizational Constraints
Common Machinery
Centralized Time Series Statistics (Metrics)
Structured Event Databases (Logging)
Common Cases
Latency-Sensitive Request Processing
Low-Lag, High-Throughput Batch Processing
Mobile and Web Clients
The General Case
Other Considerations
Integration with Distributed Tracing
SLI and SLO Discoverability
Summary
Chapter 8. SLO Monitoring and Alerting
Motivation: What Is SLO Alerting, and Why Should You Do It?
The Shortcomings of Simple Threshold Alerting
A Better Way
How to Do SLO Alerting
Choosing a Target
Error Budgets and Response Time
Error Budget Burn Rate
Rolling Windows
Putting It Together
Troubleshooting with SLO Alerting
Corner Cases
SLO Alerting in a Brownfield Setup
Parting Recommendations
Summary
Chapter 9. Probability and Statistics for SLIs and SLOs
On Probability
SLI Example: Availability
SLI Example: Low QPS
On Statistics
Maximum Likelihood Estimation
Maximum a Posteriori
Bayesian Inference
SLI Example: Queueing Latency
Batch Latency
SLI Example: Durability
Further Reading
Summary
Chapter 10. Architecting for Reliability
Example System: Image-Serving Service
Architectural Considerations: Hardware
Architectural Considerations: Monolith or Microservices
Architectural Considerations: Anticipating Failure Modes
Architectural Considerations: Three Types of Requests
Systems and Building Blocks
Quantitative Analysis of Systems
Instrumentation! The System Also Needs Instrumentation!
Architectural Considerations: Hardware, Revisited
SLOs as a Result of System SLIs
The Importance of Identifying and Understanding Dependencies
Summary
Chapter 11. Data Reliability
Data Services
Designing Data Applications
Users of Data Services
Setting Measurable Data Objectives
Data and Data Application Reliability
Data Properties
Data Application Properties
System Design Concerns
Data Application Failures
Other Qualities
Data Lineage
Summary
Chapter 12. A Worked Example
Dogs Deserve Clothes
How a Service Grows
The Design of a Service
SLIs and SLOs as User Journeys
Customers: Finding and Browsing Products
Other Services as Users: Buying Products
Internal Users
Platforms as Services
Summary
Part III. SLO Culture
Chapter 13. Building an SLO Culture
A Culture of No SLOs
Strategies for Shifting Culture
Path to a Culture of SLOs
Getting Buy-in
Prioritizing SLO Work
Implementing Your SLO
What Will Your SLIs Be?
What Will Your SLOs Be?
Using Your SLO
Iterating on Your SLO
Determining When Your SLOs Are Good Enough
Advocating for Others to Use SLOs
Summary
Chapter 14. SLO Evolution
SLO Genesis
The First Pass
Listening to Users
Periodic Revisits
Usage Changes
Increased Utilization Changes
Decreased Utilization Changes
Functional Utilization Changes
Dependency Changes
Service Dependency Changes
Platform Changes
Dependency Introduction or Retirement
Failure-Induced Changes
User Expectation and Requirement Changes
User Expectation Changes
User Requirement Changes
Tooling Changes
Measurement Changes
Calculation Changes
Intuition-Based Changes
Setting Aspirational SLOs
Identifying Incorrect SLOs
Listening to Users (Redux)
Paying Attention to Failures
How to Change SLOs
Revisit Schedules
Summary
Chapter 15. Discoverable and Understandable SLOs
Understandability
SLO Definition Documents
Phraseology
Discoverability
Document Repositories
Discoverability Tooling
SLO Reports
Dashboards
Summary
Chapter 16. SLO Advocacy
Crawl
Do Your Research
Prepare Your Sales Pitch
Create Your Supporting Artifacts
Run Your First Training and Workshop
Implement an SLO Pilot with a Single Service
Spread Your Message
Learn How to Handle Challenges
Walk
Work with Early Adopters to Implement SLOs for More Services
Celebrate Achievements and Build Confidence
Create a Library of Case Studies
Scale Your Training Program by Adding More Trainers
Scale Your Communications
Run
Share Your Library of SLO Case Studies
Create a Community of SLO Experts
Continuously Improve
Summary
Chapter 17. Reliability Reporting
Basic Reporting
Counting Incidents
Severity Levels
The Problem with Mean Time to X
SLOs for Basic Reporting
Advanced Reporting
SLO Status
Error Budget Status
Summary
Appendix A. SLO Definition Template
SLO Definition: Service Name
Service Overview
SLIs and SLOs
Rationale
Revisit Schedule
Error Budget Policy
External Links
Appendix B. Proofs for Chapter 9
Theorem 1
Proof
Theorem 2
Proof
Theorem 3
Proof
Theorem 4
Proof
Theorem 5
Proof
Theorem 6
Proof
Theorem 7
Proof
Index
About the Author
About the Contributors
Colophon