Chaos Engineering: Site reliability through controlled disruption

Auto engineers test the safety of a car by intentionally crashing it and carefully observing the results. Chaos engineering applies the same principles to software systems. In Chaos Engineering: Site reliability through controlled disruption, you’ll learn to run your applications and infrastructure through a series of tests that simulate real-life failures. You'll maximize the benefits of chaos engineering by learning to think like a chaos engineer, and how to design the proper experiments to ensure the reliability of your software. With examples that cover a whole spectrum of software, you'll be ready to run an intensive testing regime on anything from a simple WordPress site to a massive distributed system running on Kubernetes.

About the technology
Can your network survive a devastating failure? Could an accident bring your day-to-day operations to a halt? Chaos engineering simulates infrastructure outages, component crashes, and other calamities to show how systems and staff respond. Testing systems in distress is the best way to ensure their future resilience, which is especially important for complex, large-scale applications with little room for downtime.

About the book
Chaos Engineering teaches you to design and execute controlled experiments that uncover hidden problems. Learn to inject system-shaking failures that disrupt system calls, networking, APIs, and Kubernetes-based microservices infrastructures. To help you practice, the book includes a downloadable Linux VM image with a suite of preconfigured tools so you can experiment quickly, without risk.

What's inside
• Inject failure into processes, applications, and virtual machines
• Test software running on Kubernetes
• Work with both open source and legacy software
• Simulate database connection latency
• Test and improve your team’s failure response

About the reader
Assumes Linux servers. Basic scripting skills required.

About the author
Mikolaj Pawlikowski is a recognized authority on chaos engineering. He is the creator of the Kubernetes chaos engineering tool PowerfulSeal, and the networking visibility tool Goldpinger.

Author(s): Mikolaj Pawlikowski
Edition: 1
Publisher: Manning Publications
Year: 2021

Language: English
Commentary: Vector PDF
Pages: 424
City: Shelter Island, NY
Tags: Databases; Python; Java; Web Applications; Microservices; Docker; Redis; Application Development; Kubernetes; WordPress; Site Reliability Engineering; Software Architecture; Testing; Chaos

Chaos Engineering
brief contents
contents
foreword
foreword
preface
acknowledgments
about this book
Who should read this book
How this book is organized: a roadmap
About the code
liveBook discussion forum
about the author
about the cover illustration
1 Into the world of chaos engineering
1.1 What is chaos engineering?
1.2 Motivations for chaos engineering
1.2.1 Estimating risk and cost, and setting SLIs, SLOs, and SLAs
1.2.2 Testing a system as a whole
1.2.3 Finding emergent properties
1.3 Four steps to chaos engineering
1.3.1 Ensure observability
1.3.2 Define a steady state
1.3.3 Form a hypothesis
1.3.4 Run the experiment and prove (or refute) your hypothesis
1.4 What chaos engineering is not
1.5 A taste of chaos engineering
1.5.1 FizzBuzz as a service
1.5.2 A long, dark night
1.5.3 Postmortem
1.5.4 Chaos engineering in a nutshell
Summary
Part 1—Chaos engineering fundamentals
2 First cup of chaos and blast radius
2.1 Setup: Working with the code in this book
2.2 Scenario
2.3 Linux forensics 101
2.3.1 Exit codes
2.3.2 Killing processes
2.3.3 Out-Of-Memory Killer
2.4 The first chaos experiment
2.4.1 Ensure observability
2.4.2 Define a steady state
2.4.3 Form a hypothesis
2.4.4 Run the experiment
2.5 Blast radius
2.6 Digging deeper
2.6.1 Saving the world
Summary
3 Observability
3.1 The app is slow
3.2 The USE method
3.3 Resources
3.3.1 System overview
3.3.2 Block I/O
3.3.3 Networking
3.3.4 RAM
3.3.5 CPU
3.3.6 OS
3.4 Application
3.4.1 cProfile
3.4.2 BCC and Python
3.5 Automation: Using time series
3.5.1 Prometheus and Grafana
3.6 Further reading
Summary
4 Database trouble and testing in production
4.1 We’re doing WordPress
4.2 Weak links
4.2.1 Experiment 1: Slow disks
4.2.2 Experiment 2: Slow connection
4.3 Testing in production
Summary
Part 2—Chaos engineering in action
5 Poking Docker
5.1 My (Dockerized) app is slow!
5.1.1 Architecture
5.2 A brief history of Docker
5.2.1 Emulation, simulation, and virtualization
5.2.2 Virtual machines and containers
5.3 Linux containers and Docker
5.4 Peeking under Docker’s hood
5.4.1 Uprooting processes with chroot
5.4.2 Implementing a simple container(-ish) part 1: Using chroot
5.4.3 Experiment 1: Can one container prevent another one from writing to disk?
5.4.4 Isolating processes with Linux namespaces
5.4.5 Docker and namespaces
5.5 Experiment 2: Killing processes in a different PID namespace
5.5.1 Implementing a simple container(-ish) part 2: Namespaces
5.5.2 Limiting resource use of a process with cgroups
5.6 Experiment 3: Using all the CPU you can find!
5.7 Experiment 4: Using too much RAM
5.7.1 Implementing a simple container(-ish) part 3: Cgroups
5.8 Docker and networking
5.8.1 Capabilities and seccomp
5.9 Docker demystified
5.10 Fixing my (Dockerized) app that’s being slow
5.10.1 Booting up Meower
5.10.2 Why is the app slow?
5.11 Experiment 5: Network slowness for containers with Pumba
5.11.1 Pumba: Docker chaos engineering tool
5.11.2 Chaos experiment implementation
5.12 Other parts of the puzzle
5.12.1 Docker daemon restarts
5.12.2 Storage for image layers
5.12.3 Advanced networking
5.12.4 Security
Summary
6 Who you gonna call? Syscall-busters!
6.1 Scenario: Congratulations on your promotion!
6.1.1 System X: If everyone is using it, but no one maintains it, is it abandonware?
6.2 A brief refresher on syscalls
6.2.1 Finding out about syscalls
6.2.2 Using the standard C library and glibc
6.3 How to observe a process’s syscalls
6.3.1 strace and sleep
6.3.2 strace and System X
6.3.3 strace’s problem: Overhead
6.3.4 BPF
6.3.5 Other options
6.4 Blocking syscalls for fun and profit part 1: strace
6.4.1 Experiment 1: Breaking the close syscall
6.4.2 Experiment 2: Breaking the write syscall
6.5 Blocking syscalls for fun and profit part 2: Seccomp
6.5.1 Seccomp the easy way with Docker
6.5.2 Seccomp the hard way with libseccomp
Summary
7 Injecting failure into the JVM
7.1 Scenario
7.1.1 Introducing FizzBuzzEnterpriseEdition
7.1.2 Looking around FizzBuzzEnterpriseEdition
7.2 Chaos engineering and Java
7.2.1 Experiment idea
7.2.2 Experiment plan
7.2.3 Brief introduction to JVM bytecode
7.2.4 Experiment implementation
7.3 Existing tools
7.3.1 Byteman
7.3.2 Byte-Monkey
7.3.3 Chaos Monkey for Spring Boot
7.4 Further reading
Summary
8 Application-level fault injection
8.1 Scenario
8.1.1 Implementation details: Before chaos
8.2 Experiment 1: Redis latency
8.2.1 Experiment 1 plan
8.2.2 Experiment 1 steady state
8.2.3 Experiment 1 implementation
8.2.4 Experiment 1 execution
8.2.5 Experiment 1 discussion
8.3 Experiment 2: Failing requests
8.3.1 Experiment 2 plan
8.3.2 Experiment 2 implementation
8.3.3 Experiment 2 execution
8.4 Application vs. infrastructure
Summary
9 There’s a monkey in my browser!
9.1 Scenario
9.1.1 Pgweb
9.1.2 Pgweb implementation details
9.2 Experiment 1: Adding latency
9.2.1 Experiment 1 plan
9.2.2 Experiment 1 steady state
9.2.3 Experiment 1 implementation
9.2.4 Experiment 1 run
9.3 Experiment 2: Adding failure
9.3.1 Experiment 2 implementation
9.3.2 Experiment 2 run
9.4 Other good-to-know topics
9.4.1 Fetch API
9.4.2 Throttling
9.4.3 Tooling: Greasemonkey and Tampermonkey
Summary
Part 3—Chaos engineering in Kubernetes
10 Chaos in Kubernetes
10.1 Porting things onto Kubernetes
10.1.1 High-Profile Project documentation
10.1.2 What’s Goldpinger?
10.2 What’s Kubernetes (in 7 minutes)?
10.2.1 A very brief history of Kubernetes
10.2.2 What can Kubernetes do for you?
10.3 Setting up a Kubernetes cluster
10.3.1 Using Minikube
10.3.2 Starting a cluster
10.4 Testing out software running on Kubernetes
10.4.1 Running the ICANT Project
10.4.2 Experiment 1: Kill 50% of pods
10.4.3 Party trick: Kill pods in style
10.4.4 Experiment 2: Introduce network slowness
Summary
11 Automating Kubernetes experiments
11.1 Automating chaos with PowerfulSeal
11.1.1 What’s PowerfulSeal?
11.1.2 PowerfulSeal installation
11.1.3 Experiment 1b: Killing 50% of pods
11.1.4 Experiment 2b: Introducing network slowness
11.2 Ongoing testing and service-level objectives
11.2.1 Experiment 3: Verifying pods are ready within (n) seconds of being created
11.3 Cloud layer
11.3.1 Cloud provider APIs, availability zones
11.3.2 Experiment 4: Taking VMs down
Summary
12 Under the hood of Kubernetes
12.1 Anatomy of a Kubernetes cluster and how to break it
12.1.1 Control plane
12.1.2 Kubelet and pause container
12.1.3 Kubernetes, Docker, and container runtimes
12.1.4 Kubernetes networking
12.2 Summary of key components
Summary
13 Chaos engineering (for) people
13.1 Chaos engineering mindset
13.1.1 Failure is not a maybe: It will happen
13.1.2 Failing early vs. failing late
13.2 Getting buy-in
13.2.1 Management
13.2.2 Team members
13.2.3 Game days
13.3 Teams as distributed systems
13.3.1 Finding knowledge single points of failure: Staycation
13.3.2 Misinformation and trust within the team: Liar, Liar
13.3.3 Bottlenecks in the team: Life in the Slow Lane
13.3.4 Testing your processes: Inside Job
Summary
13.4 Where to go from here?
appendix A—Installing chaos engineering tools
A.1 Prerequisites
A.2 Installing the Linux tools
A.2.1 Pumba
A.2.2 Python 3.7 with DTrace option
A.2.3 Pgweb
A.2.4 Pip dependencies
A.2.5 Example data to look at for pgweb
A.3 Configuring WordPress
A.4 Checking out the source code for this book
A.5 Installing Minikube (Kubernetes)
A.5.1 Linux
A.5.2 macOS
A.5.3 Windows
appendix B—Answers to the pop quizzes
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
appendix C—Director’s cut (aka the bloopers)
C.1 Cloud
C.2 Chaos engineering tools comparison
C.3 Windows
C.4 Runtimes
C.5 Node.js
C.6 Architecture problems
C.7 The four steps to a chaos experiment
C.8 You should have included !
C.9 Real-world failure examples!
C.10 “Chaos engineering” is a terrible name!
C.11 Wrap!
appendix D—Chaos-engineering recipes
D.1 SRE (’ShRoomEee) burger
D.1.1 Ingredients
D.1.2 Hidden dependencies
D.1.3 Making the patty
D.1.4 Assembling the finished product
D.2 Chaos pizza
D.2.1 Ingredients
D.2.2 Preparation
index