Early system administration required in-depth knowledge of a variety of services on individual systems. Now the job is increasingly complex and differs from one company to the next, with an ever-growing list of technologies and third-party services to integrate. How does any one individual stay relevant in systems and services? This practical guide helps anyone in operations (sysadmins, automation engineers, IT professionals, and site reliability engineers) understand the essential concepts of the role today.
Collaboration, automation, and the evolution of systems are changing the fundamentals of operations work. No matter where you are in your journey, this book gives you the information you need to chart a path toward advancing your essential system administration skills. Author Jennifer Davis provides examples of modern practices and tools, along with recommended materials for building your skills further.
Topics include:
Development and testing: Version control, fundamentals of...
Author(s): Jennifer Davis
Publisher: O'Reilly Media
Year: 2022
Language: English
Pages: 325
Foreword
Preface
Who Should Read This Book?
What This Book Is Not
Scope of This Book
If I Could Tell You Only One Thing
If I Could Tell You Only One More Thing
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Introducing Modern System Administration
Map Your Journey
Embrace a Mindset Shift
What Is the Job?
Flavors of System Administration
Embrace Evolving Practices
Embrace Collaboration
Embrace Sustainability
Wrapping Up
I. Reasoning About Systems
1. Patterns and Interconnections
How to Connect Things
How Things Communicate
Application Layer
Transport Layer
Network Layer
Data Link Layer
Physical Layer
Wrapping Up
2. Computing Environments
Common Workloads
Choosing the Location of Your Workloads
On-Prem
Cloud Computing
Compute Options
Serverless
Unikernels
Functions
App services
Containers
Virtual Machines
Guidelines for Choosing Compute
Wrapping Up
3. Storage
Why Care About Storage?
Key Characteristics
Storage Categories
Block Storage
File Storage
Object Storage
Database Storage
Considerations for Your Storage Strategy
Anticipate Your Capacity and Latency Requirements
Retain Your Data as Long as Is Reasonably Necessary
Respect the Privacy Concerns of Your Users
Defend Your Data
Be Prepared to Handle Disaster Recovery Situations
Wrapping Up
4. Network
Caring About Networks
Key Characteristics of Networks
Build a Network
Virtualization
Software-Defined Networks
Content Distribution Networks
Guidelines to Your Network Strategy
Wrapping Up
II. Practices
5. Sysadmin Toolkit
What Is Your Digital Toolkit?
The Components of Your Toolkit
Choosing an Editor
Integrated static code analysis
Code completion
Establish and validate team conventions
Integrate workflow with Git
Choosing Programming Languages
Frameworks and Libraries
Other Helpful Utilities
Wrapping Up
6. Version Control
What Is Version Control?
Benefits of Version Control
Organizing Infra Projects
Wrapping Up
7. Testing
You’re Already Testing
Common Types of Testing
Linting
Unit Tests
Integration Tests
End-to-End Tests
Explicit Testing Strategy
Improving Your Tests; Learning from Failure
Next Steps
Wrapping Up
8. Infrastructure Security
What Is Infrastructure Security?
Share Security Responsibilities
Borrow the Attacker Lens
Design for Security Operability
Categorize Discovered Issues
Wrapping Up
9. Documentation
Know Your Audience
Dimensions of Documentation
Organization Practices
Organizing a Topic
Organizing a Site
Recommendations for Quality Documentation
Wrapping Up
10. Presentations
Know Your Audience
Choose Your Channel
Choose Your Story Type
Storytelling in Practice
Case #1: Charts Are Worth a Thousand Words
Case #2: Telling the Same Story with a Different Audience
Team dashboard
Manager dashboard
Customer dashboards
The Key Takeaways
Know Your Visuals
Visual Cues
Chart Types
Data tables
Bar charts
Line charts
Area charts
Heat maps
Flame graphs
Treemaps
Recommended Visualization Practices
Wrapping Up
III. Assembling the System
11. Scripting Infrastructure
Why Script Your Infrastructure?
Three Lenses to Model Your Infrastructure
Code to Build Machine Images
Code to Provision Infrastructure
Code to Configure Infrastructure
Getting Started
Wrapping Up
12. Managing Your Infrastructure
Infrastructure as Code
Treating Your Infrastructure as Data
Getting Started with Infrastructure Management
Linting
Writing Unit Tests
Writing Integration Tests
Writing End-to-End Tests
Wrapping Up
13. Securing Your Infrastructure
Assessing Attack Vectors
Manage Identity and Access
How Should You Control Access to Your System?
Who Should Have Access to Your System?
Manage Secrets
Password Managers and Secret Management Software
Defending Secrets and Monitoring Usage
Securing Your Computing Environment
Securing Your Network
Security Recommendations for Your Infrastructure Management
Wrapping Up
IV. Monitoring the System
14. Monitoring Theory
Why Monitor?
How Do Monitoring and Observability Differ?
Monitoring Building Blocks
Events
Monitors
Data: Metrics, Logs, and Tracing
First-Level Monitoring
Event Detection
Data Collection
Data Reduction
Data Analysis
Data Presentation
Second-Level Monitoring
Wrapping Up
15. Compute and Software Monitoring in Practice
Identify Your Desired Outputs
What Should You Monitor?
Do What You Can Now
Monitors That Matter
Plan for a Monitoring Project
What Alerts Should You Set?
Examine Monitoring Platforms
Choose a Monitoring Tool or Platform
Wrapping Up
16. Managing Monitoring Data
What Is Monitoring Data?
Metrics
Logs
Structured Logs
Tracing
Distributed Tracing
Choose Your Data Types
Retain Log Data
Analyze Log Data
Monitoring Data at Scale
Wrapping Up
17. Monitor Your Work
Why Should You Monitor Your Work?
Manage Your Work with Kanban
Choose a Platform
Find the Interesting Information
Wrapping Up
V. Scaling the System
18. Capacity Management
What Is Capacity?
The Capacity Management Model
Resource Procurement
Justification
Management
Monitoring
The Framework for Capacity Planning
Do You Need Capacity Planning with Cloud Computing?
Wrapping Up
19. Developing On-Call Resilience
What Is On-Call?
Humane On-Call Processes
Check Your On-Call Policies
Preparing for On-Call
One Week Out
The Night Before
Your On-Call Rotation
On-Call Handoff
The Day After On-Call
Monitor the On-Call Experience
Wrapping Up
20. Managing Incidents
What Is an Incident?
What Is Incident Management?
Planning and Preparing for Incidents
Set Up and Document Communication Channels
Train for Effective Communication
Create Templates
Maintain Documentation
Document the Risks
Practice Failure
Understand Your Tools
Clearly Define Roles and Responsibilities
Understand Severity Levels and Escalation Protocols
Responding to Incidents
Learning from the Incident
How Deep Should You Dig?
Aiding Discovery
Documenting Incidents Effectively
Distributing the Information
Next Steps
Wrapping Up
21. Leading Sustainable Teams
Collective Leadership
Adopt a Whole-Team Approach
Build Resilient On-Call Teams
Update On-Call Processes
Monitor the Team’s Work
Why Monitor the Team?
What Should You Monitor?
What are the team’s objectives?
What is the team’s definition of a task?
What is the team’s definition of a project?
What is the service catalog that your team offers?
Examine the work
Measure Impact on the Team
Support Team Infrastructure with Documentation
Budget a Learning Culture
Adapt to Challenges
Wrapping Up
Conclusion
A. Protocols in Practice
Hypertext Transfer Protocol
QUIC
Domain Name System
B. Resolving Test Failures
Test Failure Type #1: Environment Problems
Test Failure Type #2: Flawed Test Logic
Test Failure Type #3: Changing Assumptions
Test Failure Type #4: Flaky Tests
Test Failure Type #5: Code Defects
Index