This document discusses Site Reliability Engineering (SRE), which is Google's approach to service management. It outlines the key tenets of SRE, which include ensuring a durable focus on engineering, pursuing maximum change velocity without violating service-level objectives, monitoring, emergency response, change management, demand forecasting and capacity planning, provisioning, and efficiency and performance. The document also discusses best practices for incident management in SRE and how DevOps and SRE can be applied in the enterprise.
Cloud expo 2018: From Apollo 13 to Google SRE - When DevOps meets SRE
1. FROM APOLLO 13 TO GOOGLE SRE
WHEN DEVOPS MET SRE
Sanjeev Sharma
@sd_architect | http://sdarchitect.blog
2. #WHOAMI
• 20+ Years in Software Development and Delivery
• Past: IBM Distinguished Engineer and CTO for DevOps Adoption
• Now: Global Practice Lead for Data Transformation at Delphix
• Author of two DevOps books:
• DevOps For Dummies: https://ibm.biz/BdsPMX
• The DevOps Adoption Playbook: http://amzn.to/2hH7rt2
• Blog: https://sdarchitect.blog
• Tweets: @sd_architect
3. WHAT IS SRE?
“SRE is what happens when you ask a software engineer to design an operations team.”
Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, “Site Reliability Engineering”
Site Reliability Engineering (SRE): Google’s approach to Service Management
4. RELIABILITY: THE REAL AVAILABILITY NUMBERS!
How much downtime does 5-nines 99.999% availability translate to?
• Daily: 0.9s
• Weekly: 6.0s
• Monthly: 26.3s
• Yearly: 5m 15.6s
4-nines or 99.99% translates to downtime of:
• Daily: 8.6s
• Weekly: 1m 0.5s
• Monthly: 4m 23.0s
• Yearly: 52m 35.7s
Even the more common 99.95% availability SLO allows a mere 43 seconds/day, or about 5 minutes 2 seconds/week.
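The downtime figures on this slide follow from multiplying the unavailable fraction (1 minus availability) by the length of the period. The sketch below reproduces that arithmetic. The function names are illustrative, and it assumes a 365.25-day year with a month taken as one twelfth of that, which is the convention that matches the monthly and yearly numbers above.

```python
# Allowed downtime for an availability target over common periods.
# Assumption: year = 365.25 days, month = 1/12 of a year.

SECONDS_PER_DAY = 24 * 60 * 60

PERIODS_IN_DAYS = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": 365.25 / 12,
    "Yearly": 365.25,
}


def allowed_downtime_seconds(availability_percent, period_days):
    """Seconds of downtime permitted in the period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


def format_duration(seconds):
    """Render seconds as 'Xm Y.Ys', or 'Y.Ys' when under a minute."""
    minutes, secs = divmod(seconds, 60)
    if minutes:
        return f"{int(minutes)}m {secs:.1f}s"
    return f"{secs:.1f}s"


for target in (99.999, 99.99, 99.95):
    print(f"{target}% availability:")
    for name, days in PERIODS_IN_DAYS.items():
        downtime = allowed_downtime_seconds(target, days)
        print(f"  {name}: {format_duration(downtime)}")
```

For a 99.95% target this works out to 43.2 seconds per day and 302.4 seconds (just over 5 minutes) per week.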
5. APOLLO 13 – THE REAL HEROES
Image Courtesy:
Universal Pictures, NASA
6. EIGHT TENETS OF GOOGLE SRE
1. Ensuring a Durable Focus on Engineering
2. Pursuing Maximum Change Velocity Without Violating a Service’s SLO
3. Monitoring
4. Emergency Response
5. Change Management
6. Demand Forecasting and Capacity Planning
7. Provisioning
8. Efficiency and Performance
7. BEST PRACTICES OF INCIDENT MANAGEMENT
1. Prioritize
2. Prepare
3. Trust
4. Introspect
5. Consider alternatives
6. Practice
7. Change it around
Image Courtesy:
Universal Pictures, NASA
8. DevOps + SRE in the Enterprise
[Diagram: four delivery pipelines, each with the stages Development, SCM, Build, Package, Repo, Deploy, followed by Test, Stage, Production. Applications shown: Mainframe Hosted App, Mobile App, App Server Monolithic App, Cloud Native App. Other labels: Enterprise Release, Business Capability.]
Agile/Innovation Edge: Rapid Delivery for Innovation • Agile • Antifragile • Experimentation • New and Innovative • Hybrid Cloud • IaaS/PaaS • Containers
Industrialized Core: Deliver at regular cadence • Agile • Stability • Predictability • Lean Delivery pipeline • Core and Legacy Systems
Hybrid Infrastructure – Physical, Cloud • IaaS/PaaS • Containers
9. Standardization Across Delivery Pipelines
[Diagram: four delivery pipelines, each with the stages Development, SCM, Build, Package, Repo, Deploy, followed by Test, Stage, Production. Applications shown: Application A, Application B, Application C, Application N. Other labels: Enterprise Release, Business Capability.]
Agile/Innovation Edge: Rapid Delivery for Innovation • Agile • Antifragile • Experimentation • New and Innovative • Hybrid Cloud • IaaS/PaaS • Containers
Industrialized Core: Deliver at regular cadence • Agile • Stability • Predictability • Lean Delivery pipeline • Core and Legacy Systems
Hybrid Infrastructure – Physical, Cloud • IaaS/PaaS • Containers
• Deployment Automation and Orchestration
• Service and Test Environment Virtualization
• Architecture Planning
• Release Management
• Operational Readiness
10. Your Delivery Pipeline will be as fast as the slowest Delivery Pipeline it is dependent on
Data Friction is usually the last challenge to be addressed
PLANNING
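One way to read the first statement on this slide is as a bottleneck calculation: a pipeline's effective lead time is the largest lead time among itself and every pipeline it depends on, directly or transitively. The sketch below models that reading. The pipeline names, lead times, and dependency map are hypothetical values chosen for illustration, not data from the talk.

```python
# Bottleneck model: a pipeline is only as fast as its slowest dependency.
# All names and numbers below are hypothetical.


def effective_lead_time(pipeline, lead_time_days, depends_on, _seen=None):
    """Return the slowest lead time among a pipeline and all its dependencies."""
    seen = set() if _seen is None else _seen
    if pipeline in seen:
        return 0  # already counted; also guards against dependency cycles
    seen.add(pipeline)
    slowest = lead_time_days[pipeline]
    for dependency in depends_on.get(pipeline, []):
        slowest = max(
            slowest,
            effective_lead_time(dependency, lead_time_days, depends_on, seen),
        )
    return slowest


# Hypothetical lead times (days from commit to production) per pipeline.
lead_time_days = {"mobile_app": 2, "app_server": 5, "mainframe": 30}

# Hypothetical dependencies: the mobile app calls the app server,
# which in turn calls the mainframe-hosted application.
depends_on = {"mobile_app": ["app_server"], "app_server": ["mainframe"]}

for name in lead_time_days:
    days = effective_lead_time(name, lead_time_days, depends_on)
    print(f"{name}: own {lead_time_days[name]}d, effective {days}d")
```

Under these assumed numbers, the mobile app can ship on its own every 2 days, but any change that also needs the mainframe waits on the 30-day pipeline.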
12. Developers are paid to write code, not maintain deployment and configuration scripts.
DBAs are paid to Manage Data and Datastores, not generate Test Data sets
APPLICATION DEPLOYMENT AND ENVIRONMENT ORCHESTRATION
13. If you are doing 2-week Sprints, but it takes 3 weeks to get a Test Environment and Test Data sets, how long are your Sprints?
TEST SERVICE AND ENVIRONMENT VIRTUALIZATION
14. It is not possible to patch the software of a missile AFTER it has been launched
RELEASE MANAGEMENT
17. WHEN DEVOPS MEETS SRE
DevOps: “Everyone is responsible for delivering Business Value.”
SRE: “(Everyone) is responsible for delivering Continuous Business Value”