The Practice of Cloud System Administration

The Practice of Cloud System Administration : DevOps and SRE Practices for Web Services, Volume 2

4.38 (201 ratings by Goodreads)
By (author) Thomas A. Limoncelli, By (author) Strata R. Chalup, By (author) Christina J. Hogan

Description

"There's an incredible amount of depth and thinking in the practices described here, and it's impressive to see it all in one place."

-Win Treese, coauthor of Designing Systems for Internet Commerce



The Practice of Cloud System Administration, Volume 2, focuses on "distributed" or "cloud" computing and brings a DevOps/SRE sensibility to the practice of system administration. Unsatisfied with books that cover either design or operations in isolation, the authors created this authoritative reference centered on a comprehensive approach.



Case studies and examples from Google, Etsy, Twitter, Facebook, Netflix, Amazon, and other industry giants are explained in practical ways that are useful to all enterprises. The new companion to the best-selling first volume, The Practice of System and Network Administration, Second Edition, this guide offers expert coverage of the following and many other crucial topics:



Designing and building modern web and distributed systems



Fundamentals of large system design
Understand the new software engineering implications of cloud administration
Make systems that are resilient to failure and grow and scale dynamically
Implement DevOps principles and cultural changes
IaaS/PaaS/SaaS and virtual platform selection

Operating and running systems using the latest DevOps/SRE strategies



Upgrade production systems with zero downtime
What and how to automate; how to decide what not to automate
On-call best practices that improve uptime
Why distributed systems require fundamentally different system administration techniques
Identify and resolve resiliency problems before they surprise you

Assessing and evaluating your team's operational effectiveness



Manage the scientific process of continuous improvement
A forty-page, pain-free assessment system you can start using today

Product details

  • Paperback | 560 pages
  • 181 x 231 x 29mm | 892g
  • Addison-Wesley Educational Publishers Inc
  • New Jersey, United States
  • English
  • 032194318X
  • 9780321943187
  • 129,333

Table of contents

Preface xxiii

About the Authors xxix





Introduction 1



Part I: Design: Building It 7





Chapter 1: Designing in a Distributed World 9



1.1 Visibility at Scale 10

1.2 The Importance of Simplicity 11

1.3 Composition 12

1.4 Distributed State 17

1.5 The CAP Principle 21

1.6 Loosely Coupled Systems 24

1.7 Speed 26

1.8 Summary 29

Exercises 30



Chapter 2: Designing for Operations 31



2.1 Operational Requirements 31

2.2 Implementing Design for Operations 45

2.3 Improving the Model 48

2.4 Summary 49

Exercises 50



Chapter 3: Selecting a Service Platform 51



3.1 Level of Service Abstraction 52

3.2 Type of Machine 56

3.3 Level of Resource Sharing 62

3.4 Colocation 65

3.5 Selection Strategies 66

3.6 Summary 68

Exercises 68



Chapter 4: Application Architectures 69



4.1 Single-Machine Web Server 70

4.2 Three-Tier Web Service 71

4.3 Four-Tier Web Service 77

4.4 Reverse Proxy Service 80

4.5 Cloud-Scale Service 80

4.6 Message Bus Architectures 85

4.7 Service-Oriented Architecture 90

4.8 Summary 92

Exercises 93



Chapter 5: Design Patterns for Scaling 95



5.1 General Strategy 96

5.2 Scaling Up 98

5.3 The AKF Scaling Cube 99

5.4 Caching 104

5.5 Data Sharding 110

5.6 Threading 112

5.7 Queueing 113

5.8 Content Delivery Networks 114

5.9 Summary 116

Exercises 116



Chapter 6: Design Patterns for Resiliency 119



6.1 Software Resiliency Beats Hardware Reliability 120

6.2 Everything Malfunctions Eventually 121

6.3 Resiliency through Spare Capacity 124

6.4 Failure Domains 126

6.5 Software Failures 128

6.6 Physical Failures 131

6.7 Overload Failures 138

6.8 Human Error 141

6.9 Summary 142

Exercises 143



Part II: Operations: Running It 145



Chapter 7: Operations in a Distributed World 147



7.1 Distributed Systems Operations 148

7.2 Service Life Cycle 155

7.3 Organizing Strategy for Operational Teams 160

7.4 Virtual Office 166

7.5 Summary 167

Exercises 168



Chapter 8: DevOps Culture 171



8.1 What Is DevOps? 172

8.2 The Three Ways of DevOps 176

8.3 History of DevOps 180

8.4 DevOps Values and Principles 181

8.5 Converting to DevOps 186

8.6 Agile and Continuous Delivery 188

8.7 Summary 192

Exercises 193



Chapter 9: Service Delivery: The Build Phase 195



9.1 Service Delivery Strategies 197

9.2 The Virtuous Cycle of Quality 200

9.3 Build-Phase Steps 202

9.4 Build Console 205

9.5 Continuous Integration 205

9.6 Packages as Handoff Interface 207

9.7 Summary 208

Exercises 209



Chapter 10: Service Delivery: The Deployment Phase 211



10.1 Deployment-Phase Steps 211

10.2 Testing and Approval 214

10.3 Operations Console 217

10.4 Infrastructure Automation Strategies 217

10.5 Continuous Delivery 221

10.6 Infrastructure as Code 221

10.7 Other Platform Services 222

10.8 Summary 222

Exercises 223



Chapter 11: Upgrading Live Services 225



11.1 Taking the Service Down for Upgrading 225

11.2 Rolling Upgrades 226

11.3 Canary 227

11.4 Phased Roll-outs 229

11.5 Proportional Shedding 230

11.6 Blue-Green Deployment 230

11.7 Toggling Features 230

11.8 Live Schema Changes 234

11.9 Live Code Changes 236

11.10 Continuous Deployment 236

11.11 Dealing with Failed Code Pushes 239

11.12 Release Atomicity 240

11.13 Summary 241

Exercises 241



Chapter 12: Automation 243



12.1 Approaches to Automation 244

12.2 Tool Building versus Automation 250

12.3 Goals of Automation 252

12.4 Creating Automation 255

12.5 How to Automate 258

12.6 Language Tools 258

12.7 Software Engineering Tools and Techniques 262

12.8 Multitenant Systems 270

12.9 Summary 271

Exercises 272



Chapter 13: Design Documents 275



13.1 Design Documents Overview 275

13.2 Design Document Anatomy 277

13.3 Template 279

13.4 Document Archive 279

13.5 Review Workflows 280

13.6 Adopting Design Documents 282

13.7 Summary 283

Exercises 284



Chapter 14: Oncall 285



14.1 Designing Oncall 285

14.2 Being Oncall 294

14.3 Between Oncall Shifts 299

14.4 Periodic Review of Alerts 302

14.5 Being Paged Too Much 304

14.6 Summary 305

Exercises 306



Chapter 15: Disaster Preparedness 307



15.1 Mindset 308

15.2 Individual Training: Wheel of Misfortune 311

15.3 Team Training: Fire Drills 312

15.4 Training for Organizations: Game Day/DiRT 315

15.5 Incident Command System 323

15.6 Summary 329

Exercises 330



Chapter 16: Monitoring Fundamentals 331



16.1 Overview 332

16.2 Consumers of Monitoring Information 334

16.3 What to Monitor 336

16.4 Retention 338

16.5 Meta-monitoring 339

16.6 Logs 340

16.7 Summary 342

Exercises 342



Chapter 17: Monitoring Architecture and Practice 345



17.1 Sensing and Measurement 346

17.2 Collection 350

17.3 Analysis and Computation 353

17.4 Alerting and Escalation Manager 354

17.5 Visualization 358

17.6 Storage 362

17.7 Configuration 362

17.8 Summary 363

Exercises 364



Chapter 18: Capacity Planning 365



18.1 Standard Capacity Planning 366

18.2 Advanced Capacity Planning 371

18.3 Resource Regression 381

18.4 Launching New Services 382

18.5 Reduce Provisioning Time 384

18.6 Summary 385

Exercises 386



Chapter 19: Creating KPIs 387



19.1 What Is a KPI? 388

19.2 Creating KPIs 389

19.3 Example KPI: Machine Allocation 393

19.4 Case Study: Error Budget 396

19.5 Summary 399

Exercises 399



Chapter 20: Operational Excellence 401



20.1 What Does Operational Excellence Look Like? 401

20.2 How to Measure Greatness 402

20.3 Assessment Methodology 403

20.4 Service Assessments 407

20.5 Organizational Assessments 411

20.6 Levels of Improvement 412

20.7 Getting Started 413

20.8 Summary 414

Exercises 415



Epilogue 416



Part III: Appendices 419



Appendix A: Assessments 421



A.1 Regular Tasks (RT) 423

A.2 Emergency Response (ER) 426

A.3 Monitoring and Metrics (MM) 428

A.4 Capacity Planning (CP) 431

A.5 Change Management (CM) 433

A.6 New Product Introduction and Removal (NPI/NPR) 435

A.7 Service Deployment and Decommissioning (SDD) 437

A.8 Performance and Efficiency (PE) 439

A.9 Service Delivery: The Build Phase 442

A.10 Service Delivery: The Deployment Phase 444

A.11 Toil Reduction 446

A.12 Disaster Preparedness 448



Appendix B: The Origins and Future of Distributed Computing and Clouds 451



B.1 The Pre-Web Era (1985-1994) 452

B.2 The First Web Era: The Bubble (1995-2000) 455

B.3 The Dot-Bomb Era (2000-2003) 459

B.4 The Second Web Era (2003-2010) 465

B.5 The Cloud Computing Era (2010-present) 469

B.6 Conclusion 472

Exercises 473



Appendix C: Scaling Terminology and Concepts 475



C.1 Constant, Linear, and Exponential Scaling 475

C.2 Big O Notation 476

C.3 Limitations of Big O Notation 478



Appendix D: Templates and Examples 481



D.1 Design Document Template 481

D.2 Design Document Example 482

D.3 Sample Postmortem Template 484



Appendix E: Recommended Reading 487





Bibliography 491

Index 499

About the authors

Thomas A. Limoncelli is an internationally recognized author, speaker, and system administrator with more than twenty years of experience at companies like Google, Bell Labs, and StackExchange.com.



Strata R. Chalup has more than twenty-five years of experience in Silicon Valley, focusing on IT strategy, best practices, and scalable infrastructures at firms that include Apple, Sun, Cisco, McAfee, and Palm.



Christina J. Hogan has more than twenty years of experience in system administration and network engineering, from Silicon Valley to Italy and Switzerland. She has a master's degree in computer science, a doctorate in aeronautical engineering, and has been part of a Formula 1 racing team.

Rating details

201 ratings
4.38 out of 5 stars
5 54% (108)
4 34% (68)
3 10% (20)
2 2% (5)
1 0% (0)