In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today--and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You'll learn:
- How to run reliable services in environments you don't completely control--like the cloud
- Practical applications of how to create, monitor, and run your services via Service Level Objectives
- How to convert existing ops teams to SRE--including how to dig out of operational overload
- Methods for starting SRE from either greenfield or brownfield
Author(s): Betsy Beyer et al. (eds.)
Publisher: O’Reilly
Year: 2018
Language: English
Pages: 512
Foreword I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Foreword II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii
Part I. Foundations
1. How SRE Relates to DevOps. . . . . . . . . . . . . . 1
Background on DevOps 2
No More Silos 2
Accidents Are Normal 3
Change Should Be Gradual 3
Tooling and Culture Are Interrelated 3
Measurement Is Crucial 4
Background on SRE 4
Operations Is a Software Problem 4
Manage by Service Level Objectives (SLOs) 5
Work to Minimize Toil 5
Automate This Year’s Job Away 6
Move Fast by Reducing the Cost of Failure 6
Share Ownership with Developers 6
Use the Same Tooling, Regardless of Function or Job Title 7
Compare and Contrast 7
Organizational Context and Fostering Successful Adoption 9
Narrow, Rigid Incentives Narrow Your Success 9
It’s Better to Fix It Yourself; Don’t Blame Someone Else 10
Consider Reliability Work as a Specialized Role 10
When Can Substitute for Whether 11
Strive for Parity of Esteem: Career and Financial 12
2. Implementing SLOs. . . . . . . . . . . . . . . . . . . 17
Why SREs Need SLOs 17
Getting Started 18
Reliability Targets and Error Budgets 19
What to Measure: Using SLIs 20
A Worked Example 23
Moving from SLI Specification to SLI Implementation 25
Measuring the SLIs 26
Using the SLIs to Calculate Starter SLOs 28
Choosing an Appropriate Time Window 29
Getting Stakeholder Agreement 30
Establishing an Error Budget Policy 31
Documenting the SLO and Error Budget Policy 32
Dashboards and Reports 33
Continuous Improvement of SLO Targets 34
Improving the Quality of Your SLO 35
Decision Making Using SLOs and Error Budgets 37
Advanced Topics 38
Modeling User Journeys 39
Grading Interaction Importance 39
Modeling Dependencies 40
Experimenting with Relaxing Your SLOs 41
Conclusion 42
3. SLO Engineering Case Studies. . . . . . . . . . 43
Evernote’s SLO Story 43
Why Did Evernote Adopt the SRE Model? 44
Introduction of SLOs: A Journey in Progress 45
Breaking Down the SLO Wall Between Customer and Cloud Provider 48
Current State 49
The Home Depot’s SLO Story 49
The SLO Culture Project 50
Our First Set of SLOs 52
Evangelizing SLOs 54
Automating VALET Data Collection 55
The Proliferation of SLOs 57
Applying VALET to Batch Applications 57
Using VALET in Testing 58
Future Aspirations 58
Summary 59
Conclusion 60
4. Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . 61
Desirable Features of a Monitoring Strategy 62
Speed 62
Calculations 62
Interfaces 63
Alerts 64
Sources of Monitoring Data 64
Examples 65
Managing Your Monitoring System 67
Treat Your Configuration as Code 67
Encourage Consistency 68
Prefer Loose Coupling 68
Metrics with Purpose 69
Intended Changes 70
Dependencies 70
Saturation 71
Status of Served Traffic 72
Implementing Purposeful Metrics 72
Testing Alerting Logic 72
Conclusion 73
5. Alerting on SLOs. . . . . . . . . . . . . . . . . . . . . 75
Alerting Considerations 75
Ways to Alert on Significant Events 76
1: Target Error Rate ≥ SLO Threshold 76
2: Increased Alert Window 78
3: Incrementing Alert Duration 79
4: Alert on Burn Rate 80
5: Multiple Burn Rate Alerts 82
6: Multiwindow, Multi-Burn-Rate Alerts 84
Low-Traffic Services and Error Budget Alerting 86
Generating Artificial Traffic 87
Combining Services 87
Making Service and Infrastructure Changes 87
Lowering the SLO or Increasing the Window 88
Extreme Availability Goals 89
Alerting at Scale 89
Conclusion 91
6. Eliminating Toil. . . . . . . . . . . . . . . . . . . . . . 93
What Is Toil? 94
Measuring Toil 96
Toil Taxonomy 98
Business Processes 98
Production Interrupts 99
Release Shepherding 99
Migrations 99
Cost Engineering and Capacity Planning 100
Troubleshooting for Opaque Architectures 100
Toil Management Strategies 101
Identify and Measure Toil 101
Engineer Toil Out of the System 101
Reject the Toil 101
Use SLOs to Reduce Toil 102
Start with Human-Backed Interfaces 102
Provide Self-Service Methods 102
Get Support from Management and Colleagues 103
Promote Toil Reduction as a Feature 103
Start Small and Then Improve 103
Increase Uniformity 103
Assess Risk Within Automation 104
Automate Toil Response 104
Use Open Source and Third-Party Tools 105
Use Feedback to Improve 105
Case Studies 106
Case Study 1: Reducing Toil in the Datacenter with Automation 107
Background 107
Problem Statement 110
What We Decided to Do 110
Design First Effort: Saturn Line-Card Repair 110
Implementation 111
Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair 113
Implementation 114
Lessons Learned 118
Case Study 2: Decommissioning Filer-Backed Home Directories 121
Background 121
Problem Statement 121
What We Decided to Do 122
Design and Implementation 123
Key Components 124
Lessons Learned 127
Conclusion 129
7. Simplicity. . . . . . . . . . . . . . . . . . . . . . . . . . 131
Measuring Complexity 131
Simplicity Is End-to-End, and SREs Are Good for That 133
Case Study 1: End-to-End API Simplicity 134
Case Study 2: Project Lifecycle Complexity 134
Regaining Simplicity 135
Case Study 3: Simplification of the Display Ads Spiderweb 137
Case Study 4: Running Hundreds of Microservices on a Shared Platform 139
Case Study 5: pDNS No Longer Depends on Itself 140
Conclusion 141
Part II. Practices
8. On-Call. . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Recap of “Being On-Call” Chapter of First SRE Book 148
Example On-Call Setups Within Google and Outside Google 149
Google: Forming a New Team 149
Evernote: Finding Our Feet in the Cloud 153
Practical Implementation Details 156
Anatomy of Pager Load 156
On-Call Flexibility 167
On-Call Team Dynamics 171
Conclusion 173
9. Incident Response. . . . . . . . . . . . . . . . . . . 175
Incident Management at Google 176
Incident Command System 176
Main Roles in Incident Response 177
Case Studies 177
Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home 177
Case Study 2: Service Fault—Cache Me If You Can 180
Case Study 3: Power Outage—Lightning Never Strikes Twice… Until It Does 185
Case Study 4: Incident Response at PagerDuty 188
Putting Best Practices into Practice 191
Incident Response Training 191
Prepare Beforehand 192
Drills 193
Conclusion 194
10. Postmortem Culture: Learning from Failure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Case Study 196
Bad Postmortem 197
Why Is This Postmortem Bad? 199
Good Postmortem 203
Why Is This Postmortem Better? 212
Organizational Incentives 214
Model and Enforce Blameless Behavior 214
Reward Postmortem Outcomes 215
Share Postmortems Openly 217
Respond to Postmortem Culture Failures 218
Tools and Templates 220
Postmortem Templates 220
Postmortem Tooling 221
Conclusion 223
11. Managing Load. . . . . . . . . . . . . . . . . . . . . 225
Google Cloud Load Balancing 225
Anycast 226
Maglev 227
Global Software Load Balancer 229
Google Front End 229
GCLB: Low Latency 230
GCLB: High Availability 231
Case Study 1: Pokémon GO on GCLB 231
Autoscaling 236
Handling Unhealthy Machines 236
Working with Stateful Systems 237
Configuring Conservatively 237
Setting Constraints 2
12. Introducing Non-Abstract Large System Design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
What Is NALSD? 245
Why “Non-Abstract”? 246
AdWords Example 246
Design Process 246
Initial Requirements 247
One Machine 248
Distributed System 251
Conclusion 260
13. Data Processing Pipelines. . . . . . . . . . . . 263
Pipeline Applications 264
Event Processing/Data Transformation to Order or Structure Data 264
Data Analytics 265
Machine Learning 265
Pipeline Best Practices 268
Define and Measure Service Level Objectives 268
Plan for Dependency Failure 270
Create and Maintain Pipeline Documentation 271
Map Your Development Lifecycle 272
Reduce Hotspotting and Workload Patterns 275
Implement Autoscaling and Resource Planning 276
Adhere to Access Control and Security Policies 277
Plan Escalation Paths 277
Pipeline Requirements and Design 277
What Features Do You Need? 278
Idempotent and Two-Phase Mutations 279
Checkpointing 279
Code Patterns 280
Pipeline Production Readiness 281
Pipeline Failures: Prevention and Response 284
Potential Failure Modes 284
Potential Causes 286
Case Study: Spotify 287
Event Delivery 288
Event Delivery System Design and Architecture 289
Event Delivery System Operation 290
Customer Integration and Support 293
Summary 298
Conclusion 299
14. Configuration Design and Best Practices. . . . . . . . . . . . . 301
What Is Configuration? 301
Configuration and Reliability 302
Separating Philosophy and Mechanics 303
Configuration Philosophy 303
Configuration Asks Users Questions 305
Questions Should Be Close to User Goals 305
Mandatory and Optional Questions 306
Escaping Simplicity 308
Mechanics of Configuration 308
Separate Configuration and Resulting Data 308
Importance of Tooling 310
Ownership and Change Tracking 312
Safe Configuration Change Application 312
Conclusion 313
15. Configuration Specifics. . . . . . . . . . . . . . . 315
Configuration-Induced Toil 315
Reducing Configuration-Induced Toil 316
Critical Properties and Pitfalls of Configuration Systems 317
Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem 317
Pitfall 2: Designing Accidental or Ad Hoc Language Features 318
Pitfall 3: Building Too Much Domain-Specific Optimization 318
Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects” 319
Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua 319
Integrating a Configuration Language 320
Generating Config in Specific Formats 320
Driving Multiple Applications 321
Integrating an Existing Application: Kubernetes 322
What Kubernetes Provides 322
Example Kubernetes Config 322
Integrating the Configuration Language 323
Integrating Custom Applications (In-House Software) 326
Effectively Operating a Configuration System 329
Versioning 329
Source Control 330
Tooling 330
Testing 330
When to Evaluate Configuration 331
Very Early: Checking in the JSON 331
Middle of the Road: Evaluate at Build Time 332
Late: Evaluate at Runtime 332
Guarding Against Abusive Configuration 333
Conclusion 334
16. Canarying Releases. . . . . . . . . . . . . . . . . . 335
Release Engineering Principles 336
Balancing Release Velocity and Reliability 337
What Is Canarying? 338
Release Engineering and Canarying 338
Requirements of a Canary Process 339
Our Example Setup 339
A Roll Forward Deployment Versus a Simple Canary Deployment 340
Canary Implementation 342
Minimizing Risk to SLOs and the Error Budget 343
Choosing a Canary Population and Duration 343
Selecting and Evaluating Metrics 345
Metrics Should Indicate Problems 345
Metrics Should Be Representative and Attributable 346
Before/After Evaluation Is Risky 347
Use a Gradual Canary for Better Metric Selection 347
Dependencies and Isolation 348
Canarying in Noninteractive Systems 348
Requirements on Monitoring Data 349
Related Concepts 350
Blue/Green Deployment 350
Artificial Load Generation 350
Traffic Teeing 351
Conclusion 351
Part III. Processes
17. Identifying and Recovering from Overload. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
From Load to Overload 356
Case Study 1: Work Overload When Half a Team Leaves 358
Background 358
Problem Statement 358
What We Decided to Do 359
Implementation 359
Lessons Learned 360
Case Study 2: Perceived Overload After Organizational and Workload Changes 360
Background 360
Problem Statement 361
What We Decided to Do 362
Implementation 363
Effects 365
Lessons Learned 365
Strategies for Mitigating Overload 366
Recognizing the Symptoms of Overload 366
Reducing Overload and Restoring Team Health 367
Conclusion 369
18. SRE Engagement Model. . . . . . . . . . . . . . 371
The Service Lifecycle 372
Phase 1: Architecture and Design 372
Phase 2: Active Development 373
Phase 3: Limited Availability 373
Phase 4: General Availability 374
Phase 5: Deprecation 374
Phase 6: Abandoned 374
Phase 7: Unsupported 374
Setting Up the Relationship 375
Communicating Business and Production Priorities 375
Identifying Risks 375
Aligning Goals 375
Setting Ground Rules 379
Planning and Executing 379
Sustaining an Effective Ongoing Relationship 380
Investing Time in Working Better Together 380
Maintaining an Open Line of Communication 380
Performing Regular Service Reviews 381
Reassessing When Ground Rules Start to Slip 381
Adjusting Priorities According to Your SLOs and Error Budget 381
Handling Mistakes Appropriately 382
Scaling SRE to Larger Environments 382
Supporting Multiple Services with a Single SRE Team 382
Structuring a Multiple SRE Team Environment 383
Adapting SRE Team Structures to Changing Circumstances 384
Running Cohesive Distributed SRE Teams 384
Ending the Relationship 385
Case Study 1: Ares 385
Case Study 2: Data Analysis Pipeline 387
Conclusion 389
19. SRE: Reaching Beyond Your Walls. . . . . . 391
Truths We Hold to Be Self-Evident 391
Reliability Is the Most Important Feature 391
Your Users, Not Your Monitoring, Decide Your Reliability 392
If You Run a Platform, Then Reliability Is a Partnership 392
Everything Important Eventually Becomes a Platform 393
When Your Customers Have a Hard Time, You Have to Slow Down 393
You Will Need to Practice SRE with Your Customers 393
How to: SRE with Your Customers 394
Step 1: SLOs and SLIs Are How You Speak 394
Step 2: Audit the Monitoring and Build Shared Dashboards 395
Step 3: Measure and Renegotiate 396
Step 4: Design Reviews and Risk Analysis 396
Step 5: Practice, Practice, Practice 397
Be Thoughtful and Disciplined 397
Conclusion 398
20. SRE Team Lifecycles. . . . . . . . . . . . . . . . . 399
SRE Practices Without SREs 399
Starting an SRE Role 400
Finding Your First SRE 400
Placing Your First SRE 401
Bootstrapping Your First SRE 402
Distributed SREs 403
Your First SRE Team 403
Forming 404
Storming 405
Norming 408
Performing 411
Making More SRE Teams 413
Service Complexity 413
SRE Rollout 414
Geographical Splits 414
Suggested Practices for Running Many Teams 418
Mission Control 418
SRE Exchange 419
Training 419
Horizontal Projects 419
SRE Mobility 419
Travel 420
Launch Coordination Engineering Teams 420
Production Excellence 421
SRE Funding and Hiring 421
Conclusion 421
21. Organizational Change Management in SRE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
SRE Embraces Change 423
Introduction to Change Management 424
Lewin’s Three-Stage Model 424
McKinsey’s 7-S Model 424
Kotter’s Eight-Step Process for Leading Change 425
The Prosci ADKAR Model 425
Emotion-Based Models 426
The Deming Cycle 426
How These Theories Apply to SRE 427
Case Study 1: Scaling Waze—From Ad Hoc to Planned Change 427
Background 427
The Messaging Queue: Replacing a System While Maintaining Reliability 427
The Next Cycle of Change: Improving the Deployment Process 429
Lessons Learned 431
Case Study 2: Common Tooling Adoption in SRE 432
Background 432
Problem Statement 433
What We Decided to Do 434
Design 434
Implementation: Monitoring 436
Lessons Learned 436
Conclusion 439
Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
A. Example SLO Document. . . . . . . . . . . . . . . 445
B. Example Error Budget Policy. . . . . . . . . . . 449
C. Results of Postmortem Analysis. . . . . . . . . 453
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455