Becoming SRE: First Steps Toward Reliability for You and Your Organization

Do you wish the existing books on site reliability engineering started at the beginning? Do you wish someone would walk you through how to become an SRE, how to think like an SRE, or how to build and grow a successful SRE function in your organization?

Becoming SRE addresses all of these needs and more with three interconnected sections: the essential groundwork for understanding SRE and SRE culture, advice for individuals on becoming an SRE, and guidance for organizations on creating and developing a thriving SRE practice.

Acting as your personal and personable guide, author David Blank-Edelman takes you through subjects like:

  • SRE mindset, SRE culture, and SRE advocacy
  • What you need to get started and hired in SRE and what the job will be like when you get there
  • What you need to bring SRE into an organization and what is required for a good organizational fit so it can thrive there
  • How to work with your business folks and management around SRE
  • How SRE can grow and mature in an organization over time

Ready to become an SRE or introduce SRE into your organization? This book is here to help.

Author(s): David N. Blank-Edelman
Edition: 1
Publisher: O'Reilly Media
Year: 2024

Language: English
Commentary: Publisher PDF | Published: February 2024 | First Edition: First Release
Pages: 200
City: Sebastopol, CA
Tags: Site Reliability Engineering; SRE

Cover
Copyright
Table of Contents
Preface
Where Are You Right Now?
Navigating This Book
We Are Going to Need a Bigger Boat
I’m Not the Lorax
Ready?
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Coping
Part I. Introduction to SRE
Chapter 1. First Things First
What Is SRE?
Reliability
Appropriate
Sustainable
(Other Words)
Origin Story
SRE and Its Relationship to DevOps
Part 1: SRE Implements Class DevOps
Part 2: SRE Is to Reliability as DevOps Is to Delivery
Part 3: It’s All About the Direction of Attention
Onward to SRE Fundamentals
Chapter 2. SRE Mindset
Zooming Out to Maintain a Systems Perspective
Creating and Nurturing Feedback Loops
Keeping the Focus on the Customer
Relationships (to People and Things)
SRE’s Relationship to (Other) People
SRE’s Relationship to Failure and Errors
The Mindset in Motion
Chapter 3. SRE Culture
Happy Fish, um, People
How to Create a Supportive Culture for SRE
Culture as a Vehicle or a Lever
What Do You Want SRE to Be/Do?
Thinking About Assembling the Culture You Want and Need
I Still Don’t Know Where to Start
Nurturing Your Nascent SRE Culture
Keep On Keeping On
Chapter 4. Talking About SRE (SRE Advocacy)
Why It Matters, Even Early in Your Experience with SRE
When It Matters
Get Your Story (and Audience) Straight
Some Story Ideas
Other People’s Stories
Secondary Stories
The Challenges the Stories Present
One Last Tip
Part II. Becoming SRE for the Individual
Chapter 5. Preparing to Become an SRE
Do You Need to Know How to Code?
Do You Need a Computer Science Degree?
Fundamentals
Single/Basic Systems (and Their Failure Modes)
Distributed Systems (and Their Failure Modes)
Statistics and Data Visualization
Storytelling
Be a Good Person
Bonus Round
Non-Abstract Large System Design (NALSD)
Resilience Engineering
Chaos Engineering and Performance Engineering
Machine Learning and Artificial Intelligence
What Else?
Chapter 6. Getting to SRE from…
Are You Already an SRE?
From Student to SRE
From Dev/SWE to SRE
From Sysadmin/IT to SRE
Generic Advice
Technical Role X to SRE
Nontechnical Role X to SRE
Track Your Progress to Keep On Keeping On
Chapter 7. Hints for Getting Hired as an SRE
Scrutinizing the Job Posting
Preparing for an SRE Interview
What to Ask at the SRE Interview
Win!
Chapter 8. A Day in the Life of an SRE
Modes of an SRE’s Day
Incident/Outage Mode
Postincident Learning Mode
Builder/Project/Learn Mode
Architecture Mode
Management Mode
Planning Mode
Collaboration Mode
Recovery and Self-Care Mode
Balance
Make a Day in the Life a Good Day
Chapter 9. Establishing a Relationship to Toil
Defining Toil with More Precision
Whose Toil Are We Talking About?
Why Do SREs Care About Toil?
The Dynamics of Toil: Early Versus Established
Dealing with Toil
Intermediate to Advanced Toil Reduction
What Are You Going to Do About It?
Chapter 10. Learning from Failure
Talking About Failure
Postincident Reviews
Postincident Reviews: The Basics
Postincident Reviews: The Process
Postincident Reviews: Common Traps
Learning from Failure Through Resilience Engineering
Learning from Failure via Chaos Engineering
Learning from Failure: Next Steps
Part III. Becoming SRE for the Organization
Chapter 11. Organizational Factors for Success
Contributing Factor 1: What’s the Problem?
Contributing Factor 2: What Is the Org Willing to Do to Get There?
Contributing Factor 3: Does the Org Have the Requisite Patience?
Contributing Factor 4: Can We Collaborate?
Contributing Factor 5: Does the Org Make Decisions Based on Data?
Contributing Factor 6: Can the Org Learn and Act on What It Learns?
Contributing Factor 7: Can You Make a Difference?
Contributing Factor 8: Can You See (and Address) the Friction in the System?
The Fine Print
It’s All About Organizational Values
Chapter 12. How SRE Can Fail
Contributing Factor 1: Title Flipping to Create SREs
Contributing Factor 2: Converting Tier 3 Support to SRE
Contributing Factor 3: On Call and That’s All
Contributing Factor 4: Wrong Org Chart
Contributing Factor 5: SRE by Rote
Contributing Factor 6: Gatekeeping
Contributing Factor 7: Death Through Success
Contributing Factor 8: A Collection of Smaller Factors
How to “SRE” Your SRE Failure
Chapter 13. SRE from a Business Perspective
Communicating About SRE
Talking to the Business About Reliability
Selling SRE
Communicating Success Back to the Business
Proving the Success of an SRE Group to Others
Budgeting for SRE
First Budget Request
Talking About Funding
Re-Up Conversations
Funding Models
SRE Alignment
Models for Engagement
Why Not the Embedded Model? Why a Separate Org?
Avoiding the Pager Monkey or Toil Bucket Traps
SRE Teams
Choosing Headcount Sizes
How Do You Know When an SRE Team Might Be in Trouble?
Alert Noise as a Signal of Team Health
SRE Promotions
Turning Teams Down
From the Author: I Would Like to Hear from You
Chapter 14. The Dickerson Hierarchy of Reliability (A Good Place to Start)
The Dickerson Hierarchy of Reliability
Level 1: Monitoring/Observability
Level 2: Incident Response
Level 3: Postincident Review
Level 4: Testing/Release (Deployment)
Level 5: Provisioning/Capacity Planning
Levels 6 and 7: Development Process and Product Design
Wrong Turns
You Know You’ve Taken a Wrong Turn When…
Positive Signs
Chapter 15. Fitting SRE into Your Organization
Pre-role and Pre-team Practices
Integration Models
Centralized/Partnered Model
Distributed/Embedded Model
Hybrid Model
How to Choose Between These Models
Creating and Nurturing the Right Feedback Loops
Feedback Loops and Data
Feedback Loops and Iteration
Feedback Loops and Planning for Iteration
How and Where to Insert These Feedback Loops into the Organization
Signs of Success
Chapter 16. SRE Organizational Evolutionary Stages
Stage 1: The Firefighter
Stage 2: The Gatekeeper
Stage 3: The Advocate
Stage 4: The Partner
Stage 5: The Engineer
Caveat Implementer
Chapter 17. Growing SRE in Your Org
How Do You Know When to Scale?
Scaling 0 to 1
Scaling 1 to 6
Scaling 6 to 18
Scaling 18 to 48
Scaling 48 to 108 (and Beyond)
Growing SRE’s Leadership Representation
Chapter 18. Conclusion
What’s Next?
Appendix A. Letters to a Young SRE (Apologies to Rilke)
John Amori
Fred Hebert
Aju Tamang
Daniel Gentleman
Joanna Wijntjes
Fabrizio Waldner
Graham Poulter
Jamie Wilkinson
Andrew Howden
Pedro Alves
Balasundaram N
Eduardo Spotti
Ian Bartholomew
Olivier Duquesne
Ralph Pritchard
David Caudill
Alex Hidalgo
Effie Mouzeli
Appendix B. Advice from Former SREs
Dina Levitan
Sara Smollett
Andrew Fong
Scott MacFiggen
Appendix C. SRE Resources
Core Books
“SRE and…” Books
Events
SREcon
Vendor SRE Single-Day Events
DevOps Event Tracks/Sessions
SRE-Adjacent Niche Events
SRE Video Content
SRE-Specific Podcasts
SRE-Specific Email Newsletters
Online Forums
Historical Document
Curated Link Collections
Index
About the Author
Colophon