Improve Your Service Scalability and Reliability with SRE
"The techniques and principles of SRE are not only clearly defined here, but also the rationale behind them is explained in a way that will stick. This is not some dry definition, this is practical, usable understanding. . . . I can whole-heartedly recommend this book without any reservation. This is a very good book on an important topic that helps to move the game forward for our discipline!"
--From the Foreword by David Farley, Founder and CEO of Continuous Delivery Ltd.
Pioneered by Google to create more scalable and reliable large-scale systems, Site Reliability Engineering (SRE) has become one of today's most valuable software innovation opportunities. Establishing SRE Foundations is a concise, practical guide that shows how to drive successful SRE adoption in your own organization. Dr. Vladyslav Ukis presents a step-by-step approach to establishing the right cultural, organizational, and technical process foundations, quickly achieving a "minimum viable SRE" and continually improving from there.
Dr. Ukis draws extensively on his own experiences leading an SRE transformation journey at a major healthcare company. Throughout, he answers specific questions that organizations ask about SRE, identifies pitfalls, and shows how to avoid or overcome them. Whatever your role in software development, engineering, or operations, this guide will help you apply SRE to improve what matters most: user and customer experience.
• Understand how SRE works, its role in software operations, and the challenges of SRE transformation
• Assess your organization's current operations and readiness for SRE transformation
• Achieve organizational buy-in and initiate foundational activities, including SLO definitions, alerting, on-call rotations, incident response, and error budget-based decision-making
• Align organizational structures to support a full SRE transformation
• Measure the progress and success of your SRE initiative
• Sustain and advance your SRE transformation beyond the foundations
Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.
Author(s): Vladyslav Ukis
Edition: 1
Publisher: Addison-Wesley Professional
Year: 2022
Language: English
Commentary: Vector PDF (Compressed without quality loss)
Pages: 560
City: Boston, MA
Tags: DevOps; Management; Best Practices; Product Management; Budgeting; Site Reliability Engineering; ITIL; COBIT
Cover
Half Title
Title Page
Copyright Page
Table of Contents
Foreword
Preface
Acknowledgments
About the Author
Part I: Foundations
Chapter 1 Introduction to SRE
1.1 Why SRE?
1.1.1 ITIL
1.1.2 COBIT
1.1.3 Modeling
1.1.4 DevOps
1.1.5 SRE
1.1.6 Comparison
1.2 Alignment Using SRE
1.3 Why Does SRE Work?
1.4 Summary
Chapter 2 The Challenge
2.1 Misalignment
2.2 Collective Ownership
2.3 Ownership Using SRE
2.3.1 Product Development
2.3.2 Product Operations
2.3.3 Product Management
2.3.4 Benefits and Costs
2.4 The Challenge Statement
2.5 Coaching
2.6 Summary
Chapter 3 SRE Basic Concepts
3.1 Service Level Indicators
3.2 Service Level Objectives
3.3 Error Budgets
3.3.1 Availability Error Budget Example
3.3.2 Error Budget of Zero
3.3.3 Latency Error Budget Example
3.4 Error Budget Policies
3.5 SRE Concept Pyramid
3.6 Alignment Using the SRE Concept Pyramid
3.7 Summary
Chapter 4 Assessing the Status Quo
4.1 Where Is the Organization?
4.1.1 Organizational Structure
4.1.2 Organizational Alignment
4.1.3 Formal and Informal Leadership
4.2 Where Are the People?
4.3 Where Is the Tech?
4.4 Where Is the Culture?
4.4.1 Is There High Cooperation?
4.4.2 Are Messengers Trained?
4.4.3 Are Risks Shared?
4.4.4 Is Bridging Encouraged?
4.4.5 Does Failure Lead to Inquiry?
4.4.6 Is Novelty Implemented?
4.5 Where Is the Process?
4.6 SRE Maturity Model
4.7 Posing Hypotheses
4.8 Summary
Part II: Running the Transformation
Chapter 5 Achieving Organizational Buy-In
5.1 Getting People Behind SRE
5.2 SRE Marketing Funnel
5.2.1 Awareness
5.2.2 Interest
5.2.3 Understanding
5.2.4 Agreement
5.2.5 Engagement
5.3 SRE Coaches
5.3.1 Qualities
5.3.2 Responsibilities
5.4 Top-Down Buy-In
5.4.1 Stakeholder Chart
5.4.2 Engaging the Head of Development
5.4.3 Engaging the Head of Operations
5.4.4 Engaging the Head of Product Management
5.4.5 Achieving Joint Buy-In
5.4.6 Getting SRE into the Portfolio
5.5 Bottom-Up Buy-In
5.5.1 Engaging the Operations Teams
5.5.2 Engaging the Development Teams
5.6 Lateral Buy-In
5.7 Buy-In Staggering
5.8 Team Coaching
5.9 Traversing the Organization
5.9.1 Grouping the Organization
5.9.2 Traversing the Organization Versus SRE Infrastructure Demand
5.9.3 Team Engagements Over Time
5.10 Organizational Coaching
5.11 Summary
Chapter 6 Laying Down the Foundations
6.1 Introductory Talks by Team
6.2 Conveying the Basics
6.2.1 SLO as a Contract
6.2.2 SLO as a Proxy Measure of Customer Happiness
6.2.3 User Personas
6.2.4 User Story Mapping
6.2.5 Motivation to Fix SLO Breaches
6.2.6 SLOs Are Not About Technicalities
6.2.7 Causes of SLO Breaches
6.2.8 On Call for SLO Breaches
6.3 SLI Standardization
6.3.1 Application Performance Management Facility
6.3.2 Availability
6.3.3 Latency
6.3.4 Prioritization
6.4 Enabling Logging
6.5 Teaching the Log Query Language
6.6 Defining Initial SLOs
6.6.1 What Makes a Good SLO?
6.6.2 Iterating on an SLO
6.6.3 Revising SLOs
6.7 Default SLOs
6.8 Providing Basic Infrastructure
6.8.1 Dashboards
6.8.2 Alert Content
6.9 Engaging Champions
6.10 Dealing with Detractors
6.10.1 Issues with the Cause
6.10.2 Issues with Alerting
6.10.3 Issues with Tooling
6.10.4 Issues with Product Owner Involvement
6.10.5 Issues with Team Motivation
6.11 Creating Documentation
6.12 Broadcast Success
6.13 Summary
Chapter 7 Reacting to Alerts on SLO Breaches
7.1 Environment Selection
7.2 Responsibilities
7.2.1 Dev Versus Ops Responsibilities
7.2.2 Operational Responsibilities
7.2.3 Splitting Operational Responsibilities
7.3 Ways of Working
7.3.1 Interruption-Based Working Mode
7.3.2 Focus-Based Working Mode
7.4 Setting Up On-Call Rotations
7.4.1 Initial Rotation Period
7.4.2 One Person On Call
7.4.3 Two People On Call
7.4.4 Three People On Call
7.5 On-Call Management Tools
7.5.1 Posting SLO Breaches
7.5.2 Scheduling
7.5.3 Professional On-Call Management Tools
7.6 Out-of-Hours On-Call
7.6.1 Using Availability Targets and Product Demand
7.6.2 Trade-offs
7.7 Systematic Knowledge Sharing
7.7.1 Knowledge-Sharing Needs
7.7.2 Knowledge-Sharing Pyramid
7.7.3 On-Call Training
7.7.4 Runbooks
7.7.5 Internal Stack Overflow
7.7.6 SRE Community of Practice
7.8 Broadcast Success
7.9 Summary
Chapter 8 Implementing Alert Dispatching
8.1 Alert Escalation
8.2 Defining an Alert Escalation Policy
8.3 Defining Stakeholder Groups
8.4 Triggering Stakeholder Notifications
8.5 Defining Stakeholder Rings
8.6 Defining Effective Stakeholder Notifications
8.7 Getting the Stakeholders Subscribed
8.7.1 Subscribing Using the On-Call Management Tool
8.7.2 Subscribing Using Other Means
8.8 Broadcast Success
8.9 Summary
Chapter 9 Implementing Incident Response
9.1 Incident Response Foundations
9.2 Incident Priorities
9.2.1 SLO Breaches Versus Incidents
9.2.2 Changing Incident Priority During an Incident
9.2.3 Defining Generic Incident Priorities
9.2.4 Mapping SLOs to Incident Priorities
9.2.5 Mapping Error Budgets to Incident Priorities
9.2.6 Mapping Resource-Based Alerts to Incident Priorities
9.2.7 Uncovering New Use Cases for Incident Priorities
9.2.8 Adjusting Incident Priorities Based on Stakeholder Feedback
9.2.9 Extending the SLO Definition Process
9.2.10 Infrastructure
9.2.11 Deduplication
9.3 Complex Incident Coordination
9.3.1 What Is a Complex Incident?
9.3.2 Existing Incident Coordination Systems
9.3.3 Incident Classification
9.3.4 Defining Generic Incident Severities
9.3.5 Social Dimension of Incident Classification
9.3.6 Incident Priority Versus Incident Severity
9.3.7 Defining Roles
9.3.8 Roles Required by Incident Severity
9.3.9 Roles On Call
9.3.10 Incident Response Process Evaluation
9.3.11 Incident Response Process Dynamics
9.3.12 Incident Response Team Well-Being
9.4 Incident Postmortems
9.5 Effective Postmortem Criteria
9.5.1 Initiating a Postmortem
9.5.2 Postmortem Lifecycle
9.5.3 Before the Postmortem
9.5.4 During the Postmortem
9.5.5 After the Postmortem
9.5.6 Analyzing the Postmortem Process
9.5.7 Postmortem Template
9.5.8 Facilitating Learning from Postmortems
9.5.9 Successful Postmortem Practice
9.5.10 Example Postmortems
9.6 Mashing Up the Tools
9.6.1 Connecting to the On-Call Management Tool
9.6.2 Connections Among Other Tools
9.6.3 Mobile Integrations
9.6.4 Example Tool Landscapes
9.7 Service Status Broadcast
9.8 Documenting the Incident Response Process
9.9 Broadcast Success
9.10 Summary
Chapter 10 Setting Up an Error Budget Policy
10.1 Motivation
10.2 Terminology
10.3 Error Budget Policy Structure
10.4 Error Budget Policy Conditions
10.5 Error Budget Policy Consequences
10.6 Error Budget Policy Governance
10.7 Extending the Error Budget Policy
10.8 Agreeing to the Error Budget Policy
10.9 Storing the Error Budget Policy
10.10 Enacting the Error Budget Policy
10.11 Reviewing the Error Budget Policy
10.12 Related Concepts
10.13 Summary
Chapter 11 Enabling Error Budget–Based Decision-Making
11.1 Reliability Decision-Making Taxonomy
11.2 Implementing SRE Indicators
11.2.1 Dimensions of SRE Indicators
11.2.2 “SLOs by Service” Indicator
11.2.3 SLO Adherence Indicator
11.2.4 SLO Error Budget Depletion Indicator
11.2.5 Premature SLO Error Budget Exhaustion Indicator
11.2.6 “SLAs by Service” Indicator
11.2.7 SLA Error Budget Depletion Indicator
11.2.8 SLA Adherence Indicator
11.2.9 Customer Support Ticket Trend Indicator
11.2.10 “On-Call Rotations by Team” Indicator
11.2.11 Incident Time to Recovery Trend Indicator
11.2.12 Least Available Service Endpoints Indicator
11.2.13 Slowest Service Endpoints Indicator
11.3 Process Indicators, Not People KPIs
11.4 Decisions Versus Indicators
11.5 Decision-Making Workflows
11.5.1 API Consumption Decision Workflow
11.5.2 Tightening a Dependency’s SLO Decision Workflow
11.5.3 Features Versus Reliability Prioritization Workflow
11.5.4 Setting an SLO Decision Workflow
11.5.5 Setting an SLA Decision Workflow
11.5.6 Allocating SRE Capacity to a Team Decision Workflow
11.5.7 Chaos Engineering Hypotheses Selection Workflow
11.6 Summary
Chapter 12 Implementing Organizational Structure
12.1 SRE Principles Versus Organizational Structure
12.2 Who Builds It, Who Runs It?
12.2.1 “Who Builds It, Who Runs It?” Spectrum
12.2.2 Hybrid Models
12.2.3 Reliability Incentives
12.2.4 Model Comparison Criteria
12.2.5 Model Comparison
12.3 You Build It, You Run It
12.4 You Build It, You and SRE Run It
12.4.1 SRE Team Within the Development Organization
12.4.2 SRE Team Within the Operations Organization
12.4.3 SRE Team in a Dedicated SRE Organization
12.4.4 Comparison
12.4.5 SRE Team Incentives, Identity, and Pride
12.4.6 SRE Team Head Count and Budget
12.4.7 SRE Team Cost Accounting
12.4.8 SRE Team KPIs
12.5 You Build It, SRE Run It
12.5.1 SRE Team Within a Development Organization
12.5.2 SRE Team Within an Operations Organization
12.5.3 SRE Team in a Dedicated SRE Organization
12.6 Cost Optimization
12.7 Team Topologies
12.7.1 Reporting Lines
12.7.2 SRE Identity Triangle
12.7.3 Holacracy: No Reporting Lines
12.8 Choosing a Model
12.8.1 Model Transformation Options
12.8.2 Decision Dimensions
12.8.3 Reporting Options
12.8.4 Positioning the SRE Organization
12.8.5 Conveying the Value to Executives
12.9 A New Role: SRE
12.9.1 Why Is a New Role Needed?
12.9.2 Role Definition
12.9.3 Role Naming
12.9.4 Role Assignment
12.9.5 Role Fulfillment
12.10 SRE Career Path
12.10.1 SRE Role Progressions
12.10.2 SRE Role Transitions
12.10.3 Cultural Importance
12.11 Communicating the Chosen Model
12.12 Introducing the Chosen Model
12.12.1 Organization Changes
12.12.2 Reporting Structure Changes
12.12.3 Role Changes
12.13 Summary
Part III: Measuring and Sustaining the Transformation
Chapter 13 Measuring the SRE Transformation
13.1 Testing Transformation Hypotheses
13.2 Outages Not Detected Internally
13.3 Services Exhausting Error Budgets Prematurely
13.4 Executives’ Perceptions
13.5 Reliability Perception by Users and Partners
13.6 Summary
Chapter 14 Sustaining the SRE Movement
14.1 Maturing the SRE CoP
14.2 SRE Minutes
14.3 Availability Newsletter
14.4 SRE Column in the Engineering Blog
14.5 Promote Long-Form SRE Wiki Articles
14.6 SRE Broadcasting
14.7 Combining SRE and CD Indicators
14.7.1 CD Versus SRE Indicators
14.7.2 Bottleneck Analysis
14.8 SRE Feedback Loops
14.9 New Hypotheses
14.10 Providing Learning Opportunities
14.11 Supporting SRE Coaches
14.12 Summary
Chapter 15 The Road Ahead
15.1 Service Catalog
15.2 SLAs
15.3 Regulatory Compliance
15.4 SRE Infrastructure
15.5 Game Days
Appendix Topics for Quick Reference
Index