97 Things Every SRE Should Know: Collective Wisdom from the Experts

Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.

Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.

Some of the 97 things you should know:
• "Test Your Disaster Plan"--Tanya Reilly
• "Integrating Empathy into SRE Tools"--Daniella Niyonkuru
• "The Best Advice I Can Give to Teams"--Nicole Forsgren
• "Where to SRE"--Fatema Boxwala
• "Facing That First Page"--Andrew Louis
• "I Have an Error Budget, Now What?"--Alex Hidalgo
• "Get Your Work Recognized: Write a Brag Document"--Julia Evans and Karla Burnett

Author(s): Emil Stolarsky, Jaime Woo
Edition: 1
Publisher: O'Reilly Media
Year: 2020

Language: English
Commentary: Vector PDF
Pages: 252
City: Sebastopol, CA
Tags: Management; Debugging; Security; Best Practices; Site Reliability Engineering; Resilience; TCP; Disaster Recovery; Observability; Metrics; Runbooks

Cover
Copyright
Table of Contents
Preface
How We Structured the Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. New to SRE
Chapter 1. Site Reliability Engineering in Six Words
Alex Hidalgo
Chapter 2. Do We Know Why We Really Want Reliability?
Niall Murphy
Chapter 3. Building Self-Regulating Processes
Denise Yu
Chapter 4. Four Engineers of an SRE Seder
Jacob Scott
Chapter 5. The Reliability Stack
Alex Hidalgo
Chapter 6. Infrastructure: It’s Where the Power Is
Charity Majors
Chapter 7. Thinking About Resilience
Justin Li
Chapter 8. Observability in the Development Cycle
Charity Majors and Liz Fong-Jones
Chapter 9. There Is No Magic
Bouke van der Bijl
Chapter 10. How Wikipedia Is Served to You
Effie Mouzeli
Chapter 11. Why You Should Understand (a Little) About TCP
Julia Evans
Chapter 12. The Importance of a Management Interface
Salim Virji
Chapter 13. When It Comes to Storage, Think Distributed
Salim Virji
Chapter 14. The Role of Cardinality
Charity Majors and Liz Fong-Jones
Chapter 15. Security Is like an Onion
Lucas Fontes
Chapter 16. Use Your Words
Tanya Reilly
Chapter 17. Where to SRE
Fatema Boxwala
Chapter 18. Dear Future Team
Frances Rees
Chapter 19. Sustainability and Burnout
Denise Yu
Chapter 20. Don’t Take Advice from Graybeards
John Looney
Chapter 21. Facing That First Page
Andrew Louis
Part II. Zero to One
Chapter 22. SRE, at Any Size, Is Cultural
Matthew Huxtable
Chapter 23. Everyone Is an SRE in a Small Organization
Matthew Huxtable
Chapter 24. Auditing Your Environment for Improvements
Joan O’Callaghan
Chapter 25. With Incident Response, Start Small
Thai Wood
Chapter 26. Solo SRE: Effecting Large-Scale Change as a Single Individual
Ashley Poole
Chapter 27. Design Goals for SLO Measurement
Ben Sigelman
Chapter 28. I Have an Error Budget—Now What?
Alex Hidalgo
Chapter 29. How to Change Things
Joan O’Callaghan
Chapter 30. Methodological Debugging
Avishai Ish-Shalom and Nati Cohen
Chapter 31. How Startups Can Build an SRE Mindset
Tamara Miner
Chapter 32. Bootstrapping SRE in Enterprises
Vanessa Yiu
Chapter 33. It’s Okay Not to Know, and It’s Okay to Be Wrong
Todd Palino
Chapter 34. Storytelling Is a Superpower
Anita Clarke
Chapter 35. Get Your Work Recognized: Write a Brag Document
Julia Evans and Karla Burnett
Part III. One to Ten
Chapter 36. Making Work Visible
Lorin Hochstein
Chapter 37. An Overlooked Engineering Skill
Murali Suriar
Chapter 38. Unpacking the On-Call Divide
Jason Hand
Chapter 39. The Maestros of Incident Response
Andrew Louis
Stop the Bleeding
What’s Everyone Doing?
Chapter 40. Effortless Incident Management
Suhail Patel, Miles Bryant, and Chris Evans
Chapter 41. If You’re Doing Runbooks, Do Them Well
Spike Lindsey
Chapter 42. Why I Hate Our Playbooks
Frances Rees
Chapter 43. What Machines Do Well
Michelle Brush
Chapter 44. Integrating Empathy into SRE Tools
Daniella Niyonkuru
Chapter 45. Using ChatOps to Implement Empathy
Daniella Niyonkuru
Chapter 46. Move Fast to Unbreak Things
Michelle Brush
Chapter 47. You Don’t Know for Sure Until It Runs in Production
Ingrid Epure
Chapter 48. Sometimes the Fix Is the Problem
Jake Pittis
Chapter 49. Legendary
Elise Gale
Chapter 50. Metrics Are Not SLIs (The Measure Everything Trap)
Brian Murphy
Chapter 51. When SLOs Attack: Pathological SLOs and How to Fix Them
Narayan Desai
Chapter 52. Holistic Approach to Product Reliability
Kristine Chen and Bart Ponurkiewicz
Chapter 53. In Search of the Lost Time
Ingrid Epure
Chapter 54. Unexpected Lessons from Office Hours
Tamara Miner
Chapter 55. Building Tools for Internal Customers that They Actually Want to Use
Vinessa Wan
Chapter 56. It’s About the Individuals and Interactions
Vinessa Wan
Chapter 57. The Human Baseline in SRE
Effie Mouzeli
Chapter 58. Remotely Productive or Productively Remote
Avleen Vig
Chapter 59. Of Margins and Individuals
Kurt Andersen
Chapter 60. The Importance of Margins in Systems
Kurt Andersen
Chapter 61. Fewer Spreadsheets, More Napkins
Jacob Bednarz
Chapter 62. Sneaking in Your DevOps Deliciously
Vinessa Wan
Chapter 63. Effecting SRE Cultural Changes in Enterprises
Vanessa Yiu
Chapter 64. To All the SREs I’ve Loved
Felix Glaser
Chapter 65. Complex: The Most Overloaded Word in Technology
Laura Nolan
Part IV. Ten to Hundred
Chapter 66. The Best Advice I Can Give to Teams
Nicole Forsgren
Chapter 67. Create Your Supporting Artifacts
Daria Barteneva and Eva Parish
Chapter 68. The Order of Operations for Getting SLO Buy-In
David K. Rensin
Chapter 69. Heroes Are Necessary, but Hero Culture Is Not
Lei Lopez
Chapter 70. On-Call Rotations that People Want to Join
Miles Bryant, Chris Evans, and Suhail Patel
Chapter 71. Study of Human Factors and Team Culture to Improve Pager Fatigue
Daria Barteneva
Chapter 72. Optimize for MTTBTB (Mean Time to Back to Bed)
Spike Lindsey
Chapter 73. Mitigating and Preventing Cascading Failures
Rita Lu
Chapter 74. On-Call Health: The Metric You Could Be Measuring
Caitie McCaffrey
Chapter 75. Helping Leaders Prioritize On-Call Health
Caitie McCaffrey
Bring Quantitative Data
Link SLAs to On-Call Health
Treat On-Call Health like a Feature
Measure Attrition
Chapter 76. The SRE as a Diplomat
Johnny Boursiquot
Chapter 77. The Forward-Deployed SRE
Johnny Boursiquot
Chapter 78. Test Your Disaster Plan
Tanya Reilly
Chapter 79. Why Training Matters to an SRE Practice and SRE Matters to Your Training Program
Jennifer Petoff
Chapter 80. The Power of Uniformity
Chris Evans, Suhail Patel, and Miles Bryant
Chapter 81. Bytes per User Value
Arshia Mufti
Chapter 82. Make Your Engineering Blog a Priority
Anita Clarke
Chapter 83. Don’t Let Anyone Run Code in Your Context
John Looney
Chapter 84. Trading Places: SRE and Product
Shubheksha Jalan
Chapter 85. You See Teams, I See Product
Avleen Vig
Chapter 86. The Performance Emergency Fund
Dawn Parzych
Chapter 87. Important but Not Urgent: Roadmaps for SREs
Laura Nolan
Part V. The Future of SRE
Chapter 88. That 50% Thing
Tanya Reilly
Chapter 89. Following the Path of Safety-Critical Systems
Heidy Khlaaf
Chapter 90. Applicable and Achievable Static Analysis
Heidy Khlaaf
Chapter 91. The Importance of Formal Specification
Hillel Wayne
Chapter 92. Risk and Rot in Sociotechnical Systems
Laura Nolan
Chapter 93. SRE in Crisis
Niall Murphy
Chapter 94. Expected Risk Limitations
Blake Bisset
Chapter 95. Beyond Local Risk: Accounting for Angry Birds
Blake Bisset
Chapter 96. A Word from Software Safety Nerds
J. Paul Reed
Chapter 97. Incidents: A Window into Gaps
Lorin Hochstein
Chapter 98. The Third Age of SRE
Björn “Beorn” Rabenstein
Contributors
Kurt Andersen
Daria Barteneva
Jacob Bednarz
Bouke van der Bijl
Blake Bisset
Johnny Boursiquot
Fatema Boxwala
Michelle Brush
Miles Bryant
Karla Burnett
Kristine Chen
Anita Clarke
Nati Cohen
Narayan Desai
Ingrid Epure
Chris Evans
Julia Evans
Liz Fong-Jones
Lucas Fontes
Dr. Nicole Forsgren
Elise Gale
Felix Glaser
Jason Hand
Alex Hidalgo
Lorin Hochstein
Matthew Huxtable
Avishai Ish-Shalom
Shubheksha Jalan
Heidy Khlaaf
Justin Li
Spike Lindsey
John Looney
Lei Lopez
Andrew Louis
Rita Lu
Charity Majors
Caitie McCaffrey
Tamara Miner
Effie Mouzeli
Arshia Mufti
Brian Murphy
Niall Murphy
Daniella Niyonkuru
Laura Nolan
Joan O’Callaghan
Todd Palino
Eva Parish
Dawn Parzych
Suhail Patel
Jennifer Petoff
Jake Pittis
Bart Ponurkiewicz
Ashley Poole
Björn “Beorn” Rabenstein
J. Paul Reed
Frances Rees
Tanya Reilly
David K. Rensin
Jacob Scott
Ben Sigelman
Murali Suriar
Avleen Vig
Salim Virji
Vinessa Wan
Hillel Wayne
Thai Wood
Vanessa Yiu
Denise Yu
Index
About the Editors
Emil Stolarsky and Jaime Woo