A web application involves many specialists, but it takes people in web ops to ensure that everything works together throughout an application's lifetime. It's the expertise you need when your start-up gets an unexpected spike in web traffic, or when a new feature causes your mature application to fail. In this collection of essays and interviews, web veterans such as Theo Schlossnagle, Baron Schwartz, and Alistair Croll offer insights into this evolving field. You'll hear stories from the trenches, told by builders of some of the biggest sites on the Web, about what it takes to help a site thrive.
- Learn the skills needed in web operations, and why they're gained through experience rather than schooling
- Understand why it's important to gather metrics from both your application and infrastructure
- Consider common approaches to database architectures and the pitfalls that come with increasing scale
- Learn how to handle the human side of outages and degradations
- Find out how one company avoided disaster after a huge traffic deluge
- Learn how to determine what went wrong after a problem occurs, and how to prevent it from happening again
Contributors include:
John Allspaw, Heather Champ, Michael Christian, Richard Cook, Alistair Croll, Patrick Debois, Eric Florenzano, Paul Hammond, Justin Huff, Adam Jacob, Jacob Loomis, Matt Massie, Brian Moon, Anoop Nagwani, Sean Power, Eric Ries, Theo Schlossnagle, Baron Schwartz, Andrew Shafer
Foreword
Preface
Theo Schlossnagle
Why Does Web Operations Have It Tough?
From Apprentice to Master
Conclusion
Justin Huff
Where the Cloud Fits (and Why!)
Conclusion
John Allspaw, with Matt Massie
Time Resolution and Retention Concerns
Locality of Metrics Collection and Storage
Layers of Metrics
Providing Context for Anomaly Detection and Alerts
Log Lines Are Metrics, Too
Correlation with Change Management and Incident Timelines
Making Metrics Available to Your Alerting Mechanisms
Using Metrics to Guide Load-Feedback Mechanisms
A Metrics Collection System, Illustrated: Ganglia
Conclusion
Small Batches Mean Faster Feedback
Small Batches Reduce Risk
Small Batches Reduce Overhead
The Quality Defenders’ Lament
Getting Started
Continuous Deployment Is for Mission-Critical Applications
Conclusion
Adam Jacob
Service-Oriented Architecture
Conclusion
Story: “The Start of a Journey”
Step 1: Understand What You Are Monitoring
Step 2: Understand Normal Behavior
Step 3: Be Prepared and Learn
Conclusion
John Allspaw and Richard Cook
How Complex Systems Fail
Further Reading
Heather Champ and John Allspaw
How It All Started
Alarms Abound
Putting Out the Fire
Surviving the Weekend
CDN to the Rescue
Corralling the Stampede
Streamlining the Codebase
How Do We Know It Works?
The Real Test
Improvements Since Then
Paul Hammond
Deployment
Shared, Open Infrastructure
Trust
On-call Developers
Avoiding Blame
Conclusion
Alistair Croll and Sean Power
Why Collect User-Facing Metrics?
What Makes a Site Slow?
Measuring Delay
Building an SLA
Visitor Outcomes: Analytics
Other Metrics Marketing Cares About
How User Experience Affects Web Ops
The Future of Web Monitoring
Conclusion
Baron Schwartz
Requirements for Web Databases
How Typical Web Databases Grow
The Yearning for a Cluster
Database Strategy
Database Tactics
Conclusion
Jake Loomis
The Worst Postmortem
What Is a Postmortem?
When to Conduct a Postmortem
Running a Postmortem
Postmortem Follow-Up
Conclusion
Data Asset Inventory
Data Protection
Capacity Planning
Storage Sizing
Operations
Conclusion
Eric Florenzano
NoSQL Database Overview
Some Systems in Detail
Conclusion
Andrew Clay Shafer
Agile Infrastructure
So, What’s the Problem?
Trading Zones and Apologies
Conclusion
Mike Christian
Definitions
How Many 9s?
Impact Duration Versus Incident Duration
Datacenter Footprint
Gradual Failures
Failover Testing
Monitoring and History of Patterns
Getting a Good Night’s Sleep
Contributors
Index