Cybersecurity is broken. Year after year, attackers remain unchallenged and undeterred, while engineering teams feel pressure to design, build, and operate "secure" systems. Failure can't be prevented, mental models of systems are incomplete, and our digital world constantly evolves. How can we verify that our systems behave the way we expect? What can we do to improve our systems' resilience?
In this comprehensive guide, authors Kelly Shortridge and Aaron Rinehart help you navigate the challenges of sustaining resilience in complex software systems by using the principles and practices of security chaos engineering. By preparing for adverse events, you can ensure they don't disrupt your ability to innovate, move quickly, and achieve your engineering and business goals.
- Learn how to design a modern security program
- Make informed decisions at each phase of software delivery to nurture resilience and adaptive capacity
- Understand the complex systems dynamics upon which resilience outcomes depend
- Navigate technical and organizational trade-offs that distort decision making in systems
- Explore chaos experimentation to verify critical assumptions about software quality and security
- Learn how major enterprises leverage security chaos engineering
Author(s): Kelly Shortridge, Aaron Rinehart
Edition: 1
Publisher: O'Reilly Media
Year: 2023
Language: English
Pages: 340
City: Sebastopol, CA
Tags: Software Engineering; Cybersecurity; Secure Systems; Software Systems; Security Chaos Engineering
Preface
Who Should Read This Book?
Scope of This Book
Outline of This Book
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Resilience in Software and Systems
What Is a Complex System?
Variety Defines Complex Systems
Complex Systems Are Adaptive
The Holistic Nature of Complex Systems
What Is Failure?
Acute and Chronic Stressors in Complex Systems
Surprises in Complex Systems
What Is Resilience?
Critical Functionality
Safety Boundaries (Thresholds)
Interactions Across Space-Time
Feedback Loops and Learning Culture
Flexibility and Openness to Change
Resilience Is a Verb
Resilience: Myth Versus Reality
Myth: Robustness = Resilience
Myth: We Can and Should Prevent Failure
Myth: The Security of Each Component Adds Up to Resilience
Myth: Creating a “Security Culture” Fixes Human Error
Chapter Takeaways
2. Systems-Oriented Security
Mental Models of System Behavior
How Attackers Exploit Our Mental Models
Refining Our Mental Models
Resilience Stress Testing
The E&E Resilience Assessment Approach
Evaluation: Tier 1 Assessment
Mapping Flows to Critical Functionality
Document Assumptions About Safety Boundaries
Making Attacker Math Work for You
Starting the Feedback Flywheel with Decision Trees
Moving Toward Tier 2: Experimentation
Experimentation: Tier 2 Assessment
The Value of Experimental Evidence
Sustaining Resilience Assessments
Fail-Safe Versus Safe-to-Fail
Uncertainty Versus Ambiguity
Fail-Safe Neglects the Systems Perspective
The Fragmented World of Fail-Safe
SCE Versus Security Theater
What Is Security Theater?
How Does SCE Differ from Security Theater?
How to RAVE Your Way to Resilience
Repeatability: Handling Complexity
Accessibility: Making Security Easier for Engineers
Variability: Supporting Evolution
Chapter Takeaways
3. Architecting and Designing
The Effort Investment Portfolio
Allocating Your Effort Investment Portfolio
Investing Effort Based on Local Context
The Four Failure Modes Resulting from System Design
The Two Key Axes of Resilient Design: Coupling and Complexity
Designing to Preserve Possibilities
Coupling in Complex Systems
The Tight Coupling Trade-Off
The Dangers of Tight Coupling: Taming the Forest
Investing in Loose Coupling in Software Systems
Chaos Experiments Expose Coupling
Complexity in Complex Systems
Understanding Complexity: Essential and Accidental
Complexity and Mental Models
Introducing Linearity into Our Systems
Designing for Interactivity: Identity and Access Management
Navigating Flawed Mental Models
Chapter Takeaways
4. Building and Delivering
Mental Models When Developing Software
Who Owns Application Security (and Resilience)?
Lessons We Can Learn from Database Administration Going DevOps
Decisions on Critical Functionality Before Building
Defining System Goals and Guidelines on “What to Throw Out the Airlock”
Code Reviews and Mental Models
“Boring” Technology Is Resilient Technology
Standardization of Raw Materials
Developing and Delivering to Expand Safety Boundaries
Anticipating Scale and SLOs
Automating Security Checks via CI/CD
Standardization of Patterns and Tools
Dependency Analysis and Prioritizing Vulnerabilities
Observe System Interactions Across Space-Time (or Make More Linear)
Configuration as Code
Fault Injection During Development
Integration Tests, Load Tests, and Test Theater
Beware Premature and Improper Abstractions
Fostering Feedback Loops and Learning During Build and Deliver
Test Automation
Documenting Why and When
Distributed Tracing and Logging
Refining How Humans Interact with Build and Delivery Practices
Flexibility and Willingness to Change
Iteration to Mimic Evolution
Modularity: Humanity’s Ancient Tool for Resilience
Feature Flags and Dark Launches
Preserving Possibilities for Refactoring: Typing
The Strangler Fig Pattern
Chapter Takeaways
5. Operating and Observing
What Does Operating and Observing Involve?
Operational Goals in SCE
The Overlap of SRE and Security
Measuring Operational Success
Crafting Success Metrics like Attackers
The DORA Metrics
SLOs, SLAs, and Principled Performance Analytics
Embracing Confidence-Based Security
Observability for Resilience and Security
Thresholding to Uncover Safety Boundaries
Attack Observability
Scalable Is Safer
Navigating Scalability
Automating Away Toil
Chapter Takeaways
6. Responding and Recovering
Responding to Surprises in Complex Systems
Incident Response and the Effort Investment Portfolio
Action Bias in Incident Response
Practicing Response Activities
Recovering from Surprises
Blameless Culture
Blaming Human Error
Hindsight Bias and Outcome Bias
The Just-World Hypothesis
Neutral Practitioner Questions
Chapter Takeaways
7. Platform Resilience Engineering
Production Pressures and How They Influence System Behavior
What Is Platform Engineering?
Defining a Vision
Defining a User Problem
Local Context Is Critical
User Personas, Stories, and Journeys
Understanding How Humans Make Trade-Offs Under Pressure
Designing a Solution
The Ice Cream Cone Hierarchy of Security Solutions
System Design and Redesign to Eliminate Hazards
Substitute Less Hazardous Methods or Materials
Incorporate Safety Devices and Guards
Provide Warning and Awareness Systems
Apply Administrative Controls Including Guidelines and Training
Two Paths: The Control Strategy or the Resilience Strategy
Experimentation and Feedback Loops for Solution Design
Implementing a Solution
Fostering Consensus
Planning for Migration
Success Metrics
Chapter Takeaways
8. Security Chaos Experiments
Lessons Learned from Early Adopters
Lesson #1. Start in Nonproduction Environments; You Can Still Learn a Lot
Lesson #2. Use Past Incidents as a Source of Experiments
Lesson #3. Publish and Evangelize Experimental Findings
Setting Experiments Up for Success
Designing a Hypothesis
Designing an Experiment
Experiment Design Specifications
Conducting Experiments
Collecting Evidence
Analyzing and Documenting Evidence
Capturing Knowledge for Feedback Loops
Document Experiment Release Notes
Automating Experiments
Easing into Chaos: Game Days
Example Security Chaos Experiments
Security Chaos Experiments for Production Infrastructure
Security Chaos Experiments for Build Pipelines
Security Chaos Experiments in Cloud Native Environments
Security Chaos Experiments in Windows Environments
Chapter Takeaways
9. Security Chaos Engineering in the Wild
Experience Report: The Existence of Order Through Chaos (UnitedHealth Group)
The Story of ChaoSlingr
Step-by-Step Example: PortSlingr
Experience Report: A Quest for Stronger Reliability (Verizon)
The Bigger They Are…
All Hands on Deck Means No Hands on the Helm
Assert Your Hypothesis
Reliability Experiments
Cost Experiments
Performance Experiments
Risk Experiments
More Traditionally Known Experiments
Changing the Paradigm to Continuous
Lessons Learned
Experience Report: Security Monitoring (OpenDoor)
Experience Report: Applied Security (Cardinal Health)
Building the SCE Culture
The Mission of Applied Security
The Method: Continuous Verification and Validation (CVV)
The CVV Process Includes Four Steps
Experience Report: Balancing Reliability and Security via SCE (Accenture Global)
Our Roadmap to SCE Enterprise Capability
Our Process for Adoption
Experience Report: Cyber Chaos Engineering (Capital One)
What Does All This Have to Do with SCE?
What Is Secure Today May Not Be Secure Tomorrow
How We Started
How We Did This in Ye Olden Days
Things I’ve Learned Along the Way
A Reduction of Guesswork
Driving Value
Conclusion
Chapter Takeaways
Index
About the Authors