SlideShare a Scribd company logo
1 of 42
Anatomy of
Three Incidents
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
Background
@randyshoup
@randyshoup
App Engine Outage - Oct 2012
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
App Engine Outage - Oct 2012
App Engine Reliability Fixit
• Step 1: Identify the Problem
o All team leads and senior engineers met in a room with a whiteboard
o Enumerated all known and suspected reliability issues
o Too much technical debt had accumulated
o Reliability issues had not been prioritized
o Identify 8-10 themes
@randyshoup
• Step 2: Understand the Problem
o Each theme assigned to a senior engineer to investigate
o Timeboxed for 1 week
o After 1 week, all leads came back with
• Detailed list of issues
• Recommended steps to address them
• Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.)
App Engine Reliability Fixit
@randyshoup
• Step 3: Consensus and Prioritization
o Leads discussed themes and prioritized work
o Assigned engineers to tasks
App Engine Reliability Fixit
@randyshoup
• Step 4: Implementation and Follow-up
o Engineers worked on assigned tasks
o Simple spreadsheet of task status, which engineers updated weekly
o Minimal effort from management (~1 hour / week) to summarize progress at
weekly team meeting
App Engine Reliability Fixit
@randyshoup
•  Results
o 10x reduction in reliability issues
o Improved team cohesion and camaraderie
o Broader participation and ownership of the future health of the platform
o Still remembered several years later
App Engine Reliability Fixit
@randyshoup
@randyshoup
Stitch Fix – Oct / Nov 2016
• (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database]
• (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database]
• (10/25/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/24/2016) All systems unavailable for ~5 minutes [Shared Database]
• (10/21/2016) All systems unavailable for ~3 ½ hours [DDOS attack]
• (10/18/2016) All systems unavailable for ~3 minutes [Shared Database]
• (10/17/2016) All systems unavailable for ~20 minutes [Shared Database]
• (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage]
• (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage]
• (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage]
• (10/10/2016) All systems unavailable for ~10 minutes [Shared Database]
@randyshoup
Database Stability Problems
• 1. Applications contended on common tables
• 2. Scalability limited by database connections
• 3. One application could take down entire company
@randyshoup
Stability Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
•  Results
@randyshoup
Stability Solutions
• 1. Focus on expensive queries
o Log
o Eliminate
o Rewrite
o Reduce
• 2. Manage database connections via connection concentrator
• 3. Stability and Scalability Program
o Ongoing 25% investment in services migration
@randyshoup
@randyshoup
Login Issues - 2019
• Problem: Some members unable to log in
• Inconsistent representations across different services in the
system
• Over time, simple system interactions grew increasingly
complex and convoluted
• Not enough graceful degradation or automated repair
@randyshoup
Login Retrospective
• Step 1: Identify the Problem
• Step 2: Understand the Problem
• Step 3: Consensus and Prioritization
• Step 4: Implementation and Follow-Up
@randyshoup
Login Solutions
• 1. Clean up user data
o Find inconsistencies
o Track inconsistency metrics
o Identify and fix contributing processes and applications
• 2. User state machines
o Define user journeys as explicit state machines
o Refine and correct via cross-functional feedback
o Implement state machines in code
• 3. “Pandora” Program
o Rewrite core identity system into set of user capabilities
@randyshoup
Common Elements
• Unintentional, long-term accumulation of
small, individually reasonable decisions
• “Compelling event” catalyzes long-term
change
• Blameless culture makes learning and
improvement possible
• Structured post-incident approach
@randyshoup
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Vicious Cycle of Technical Debt
Technical
Debt
“No time
to do it
right”
Quick-
and-dirty
“Do you have time to do it
twice?”
“We don’t have time to do it
right!”
@randyshoup
The more constrained you are
on time or resources, the more
important it is to get it done
the first time.
@randyshoup
Negotiating Tradeoffs
Scope
Time
Quality
@randyshoup
Virtuous Cycle of Investment
Solid
Foundation
Confidence
Faster and
Better
Quality
Investment
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
During the Incident
• Focus on restoring service
o Everything else is secondary, and should wait
• Shield the team
• Clear, structured communication
o Even when there is nothing to report!
@randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
After the Incident
• Blameless postmortem
• Identify and understand the
contributing factors
• Action items and Learnings
• Follow Up!
@randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
Psychological Safety
• Team is safe for interpersonal
risk-taking
• “Being able to show and employ
one’s self without fear of
negative consequences”
• More important than any other
factor in team success
“Finally we can prioritize
fixing that broken system!”
@randyshoup
Inclusive Decisionmaking
• Make better business decisions
87% of the time
• Make decisions 2x faster with
1/2 the meetings
• Deliver 60% better business
results
Cloverpop Inclusive Decisionmaking study, 2016
As we improve diversity, decisionmaking improves
@randyshoup
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Frame the Problem:
Quality and reliability are
business concerns
@randyshoup
Use Common Currency
Time
Money People
@randyshoup
15 Million
“Never let a
good crisis go
to waste.”
@randyshoup
“Incidents are unplanned
investments, and they are also
opportunities. Your challenge
is to maximize the ROI on the
sunk cost.”
@randyshoup
-- John Allspaw, Adaptive Capacity Labs
Improvement Budget
• Explicit resource investment
o Agree on an up-front investment
(e.g., 25%, 30% of engineering efforts)
• Retain autonomy, Provide transparency
o Making these decisions is exactly why they hired you
@randyshoup
Lessons
•Engineering Tradeoffs
•Compelling Event
•Driving Improvement
Incident Response Patterns
• Incident Roles
• Incident Triggers
• On-Call Rotation and Onboarding
• Incident Command Training
• Incident Communication Plan
• Periodic Incident Updates
• Shared Incident State Doc
• Incident Call Recording
• Incident Swarming
• Local / Global Incident Reviews
• Post-Review Improvement Items
@randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
Thank you!
@randyshoup
linkedin.com/in/randyshoup
medium.com/@randyshoup

More Related Content

What's hot

DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps TransitionDOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps TransitionGene Kim
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayRandy Shoup
 
A CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling OrganizationsA CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling OrganizationsRandy Shoup
 
Service Architectures at Scale
Service Architectures at ScaleService Architectures at Scale
Service Architectures at ScaleRandy Shoup
 
Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015Randy Shoup
 
An Agile Approach to Machine Learning
An Agile Approach to Machine LearningAn Agile Approach to Machine Learning
An Agile Approach to Machine LearningRandy Shoup
 
Monoliths, Migrations, and Microservices
Monoliths, Migrations, and MicroservicesMonoliths, Migrations, and Microservices
Monoliths, Migrations, and MicroservicesRandy Shoup
 
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a StartupMinimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a StartupRandy Shoup
 
Managing Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and EventsManaging Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and EventsRandy Shoup
 
DevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of OperationsDevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of OperationsRandy Shoup
 
The agile elephant in the room
The agile elephant in the roomThe agile elephant in the room
The agile elephant in the roomAgileDenver
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsRandy Shoup
 
Why Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the CloudWhy Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the CloudRandy Shoup
 
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...DevOpsDays Tel Aviv
 
Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014Chris Riley ☁
 
Lean Canvas for Internal Product Owners
Lean Canvas for Internal Product OwnersLean Canvas for Internal Product Owners
Lean Canvas for Internal Product OwnersKeith Klundt
 
Continuous Delivery in a Legacy Shop—One Step at a Time
Continuous Delivery in a Legacy Shop—One Step at a TimeContinuous Delivery in a Legacy Shop—One Step at a Time
Continuous Delivery in a Legacy Shop—One Step at a TimeTechWell
 
IAM - One Year Later
IAM - One Year LaterIAM - One Year Later
IAM - One Year LaterDave Shields
 
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...Business of Software Conference
 

What's hot (20)

DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps TransitionDOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
DOES15 - Randy Shoup - Ten (Hard-Won) Lessons of the DevOps Transition
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBay
 
A CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling OrganizationsA CTO's Guide to Scaling Organizations
A CTO's Guide to Scaling Organizations
 
Service Architectures at Scale
Service Architectures at ScaleService Architectures at Scale
Service Architectures at Scale
 
Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015Service Architectures At Scale - QCon London 2015
Service Architectures At Scale - QCon London 2015
 
An Agile Approach to Machine Learning
An Agile Approach to Machine LearningAn Agile Approach to Machine Learning
An Agile Approach to Machine Learning
 
Monoliths, Migrations, and Microservices
Monoliths, Migrations, and MicroservicesMonoliths, Migrations, and Microservices
Monoliths, Migrations, and Microservices
 
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a StartupMinimum Viable Architecture -- Good Enough is Good Enough in a Startup
Minimum Viable Architecture -- Good Enough is Good Enough in a Startup
 
Managing Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and EventsManaging Data at Scale - Microservices and Events
Managing Data at Scale - Microservices and Events
 
DevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of OperationsDevOpsDays Silicon Valley 2014 - The Game of Operations
DevOpsDays Silicon Valley 2014 - The Game of Operations
 
The agile elephant in the room
The agile elephant in the roomThe agile elephant in the room
The agile elephant in the room
 
Scaling Your Architecture with Services and Events
Scaling Your Architecture with Services and EventsScaling Your Architecture with Services and Events
Scaling Your Architecture with Services and Events
 
Why Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the CloudWhy Enterprises Are Embracing the Cloud
Why Enterprises Are Embracing the Cloud
 
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
ONE-SIZE DOESN'T FIT ALL - EFFECTIVELY (RE)EVALUATE A DATA SOLUTION FOR YOUR ...
 
Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014Enterprise DevOps fact or fiction - DevOps Summit 2014
Enterprise DevOps fact or fiction - DevOps Summit 2014
 
Lean Canvas for Internal Product Owners
Lean Canvas for Internal Product OwnersLean Canvas for Internal Product Owners
Lean Canvas for Internal Product Owners
 
Continuous Delivery in a Legacy Shop—One Step at a Time
Continuous Delivery in a Legacy Shop—One Step at a TimeContinuous Delivery in a Legacy Shop—One Step at a Time
Continuous Delivery in a Legacy Shop—One Step at a Time
 
IAM - One Year Later
IAM - One Year LaterIAM - One Year Later
IAM - One Year Later
 
Lazar Milovic - No estimates
Lazar Milovic - No estimatesLazar Milovic - No estimates
Lazar Milovic - No estimates
 
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
 

Similar to Anatomy of Three Incidents -- Commonalities and Lessons

Visualisation&agile practices ai2014
Visualisation&agile practices ai2014Visualisation&agile practices ai2014
Visualisation&agile practices ai2014Balaji Muniraja
 
Effective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldEffective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldRandy Shoup
 
Software Release Orchestration and the Enterprise
Software Release Orchestration and the EnterpriseSoftware Release Orchestration and the Enterprise
Software Release Orchestration and the EnterpriseXebiaLabs
 
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...AXIA Consulting Inc.
 
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...Formulatedby
 
How to overcome challenges in it system evolution
How to overcome challenges in it system evolutionHow to overcome challenges in it system evolution
How to overcome challenges in it system evolutionGrupa Unity
 
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...AppDynamics
 
Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFuHanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFuDevConFu
 
How Microsoft ALM Tools Can Improve Your Bottom Line
How Microsoft ALM Tools Can Improve Your Bottom LineHow Microsoft ALM Tools Can Improve Your Bottom Line
How Microsoft ALM Tools Can Improve Your Bottom LineImaginet
 
Doing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentTasktop
 
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...AppDynamics
 
A Journey Through Agile in the Government
A Journey Through Agile in the GovernmentA Journey Through Agile in the Government
A Journey Through Agile in the GovernmentRichard Cheng
 
Turning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalTurning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalJohn Willis
 
Ahmed Jassat Oracle Customer Day Presentation at Monte Casino
Ahmed Jassat Oracle Customer Day Presentation at Monte CasinoAhmed Jassat Oracle Customer Day Presentation at Monte Casino
Ahmed Jassat Oracle Customer Day Presentation at Monte CasinoZahid02
 
The Web is Not a Project
The Web is Not a ProjectThe Web is Not a Project
The Web is Not a ProjectMark Greenfield
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityRandy Shoup
 

Similar to Anatomy of Three Incidents -- Commonalities and Lessons (20)

Effective Scrum
Effective ScrumEffective Scrum
Effective Scrum
 
Visualisation&agile practices ai2014
Visualisation&agile practices ai2014Visualisation&agile practices ai2014
Visualisation&agile practices ai2014
 
Effective Microservices In a Data-centric World
Effective Microservices In a Data-centric WorldEffective Microservices In a Data-centric World
Effective Microservices In a Data-centric World
 
Software Release Orchestration and the Enterprise
Software Release Orchestration and the EnterpriseSoftware Release Orchestration and the Enterprise
Software Release Orchestration and the Enterprise
 
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
Performing an R12 Upgrade in a Highly Customized Environment with a Worldwide...
 
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
 
Utils_Presentation_Richard U
Utils_Presentation_Richard UUtils_Presentation_Richard U
Utils_Presentation_Richard U
 
How to overcome challenges in it system evolution
How to overcome challenges in it system evolutionHow to overcome challenges in it system evolution
How to overcome challenges in it system evolution
 
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
Standard Bank: How APM Supports DevOps, Agile and Engineering Transformation ...
 
Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFuHanno Jarvet - VSM, Planning and Problem Solving - ConFu
Hanno Jarvet - VSM, Planning and Problem Solving - ConFu
 
How Microsoft ALM Tools Can Improve Your Bottom Line
How Microsoft ALM Tools Can Improve Your Bottom LineHow Microsoft ALM Tools Can Improve Your Bottom Line
How Microsoft ALM Tools Can Improve Your Bottom Line
 
Doing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics Environment
 
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
Standard Bank: Agile, DevOps, Engineering Transformation and the Role of AppD...
 
A Journey Through Agile in the Government
A Journey Through Agile in the GovernmentA Journey Through Agile in the Government
A Journey Through Agile in the Government
 
Turning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalTurning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational Capital
 
Ahmed Jassat Oracle Customer Day Presentation at Monte Casino
Ahmed Jassat Oracle Customer Day Presentation at Monte CasinoAhmed Jassat Oracle Customer Day Presentation at Monte Casino
Ahmed Jassat Oracle Customer Day Presentation at Monte Casino
 
Gaurav_CV
Gaurav_CVGaurav_CV
Gaurav_CV
 
Agile Bureaucracy
Agile BureaucracyAgile Bureaucracy
Agile Bureaucracy
 
The Web is Not a Project
The Web is Not a ProjectThe Web is Not a Project
The Web is Not a Project
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
 

More from Randy Shoup

Breaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building TeamsBreaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building TeamsRandy Shoup
 
Ten Lessons of the DevOps Transition
Ten Lessons of the DevOps TransitionTen Lessons of the DevOps Transition
Ten Lessons of the DevOps TransitionRandy Shoup
 
Managing Data in Microservices
Managing Data in MicroservicesManaging Data in Microservices
Managing Data in MicroservicesRandy Shoup
 
Pragmatic Microservices
Pragmatic MicroservicesPragmatic Microservices
Pragmatic MicroservicesRandy Shoup
 
From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015Randy Shoup
 
Concurrency at Scale: Evolution to Micro-Services
Concurrency at Scale:  Evolution to Micro-ServicesConcurrency at Scale:  Evolution to Micro-Services
Concurrency at Scale: Evolution to Micro-ServicesRandy Shoup
 
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYEQCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYERandy Shoup
 
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...Randy Shoup
 
The Importance of Culture: Building and Sustaining Effective Engineering Org...
The Importance of Culture:  Building and Sustaining Effective Engineering Org...The Importance of Culture:  Building and Sustaining Effective Engineering Org...
The Importance of Culture: Building and Sustaining Effective Engineering Org...Randy Shoup
 

More from Randy Shoup (9)

Breaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building TeamsBreaking Codes, Designing Jets, and Building Teams
Breaking Codes, Designing Jets, and Building Teams
 
Ten Lessons of the DevOps Transition
Ten Lessons of the DevOps TransitionTen Lessons of the DevOps Transition
Ten Lessons of the DevOps Transition
 
Managing Data in Microservices
Managing Data in MicroservicesManaging Data in Microservices
Managing Data in Microservices
 
Pragmatic Microservices
Pragmatic MicroservicesPragmatic Microservices
Pragmatic Microservices
 
From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015From the Monolith to Microservices - CraftConf 2015
From the Monolith to Microservices - CraftConf 2015
 
Concurrency at Scale: Evolution to Micro-Services
Concurrency at Scale:  Evolution to Micro-ServicesConcurrency at Scale:  Evolution to Micro-Services
Concurrency at Scale: Evolution to Micro-Services
 
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYEQCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
QCon New York 2014 - Scalable, Reliable Analytics Infrastructure at KIXEYE
 
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
QCon Tokyo 2014 - Virtuous Cycles of Velocity: What I Learned About Going Fas...
 
The Importance of Culture: Building and Sustaining Effective Engineering Org...
The Importance of Culture:  Building and Sustaining Effective Engineering Org...The Importance of Culture:  Building and Sustaining Effective Engineering Org...
The Importance of Culture: Building and Sustaining Effective Engineering Org...
 

Recently uploaded

Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROmotivationalword821
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 

Recently uploaded (20)

Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTRO
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 

Anatomy of Three Incidents -- Commonalities and Lessons

  • 1. Anatomy of Three Incidents Randy Shoup @randyshoup linkedin.com/in/randyshoup
  • 4. App Engine Outage - Oct 2012
  • 6. App Engine Reliability Fixit • Step 1: Identify the Problem o All team leads and senior engineers met in a room with a whiteboard o Enumerated all known and suspected reliability issues o Too much technical debt had accumulated o Reliability issues had not been prioritized o Identify 8-10 themes @randyshoup
  • 7. • Step 2: Understand the Problem o Each theme assigned to a senior engineer to investigate o Timeboxed for 1 week o After 1 week, all leads came back with • Detailed list of issues • Recommended steps to address them • Estimated order-of-magnitude of effort (1 day, 1 week, 1 month, etc.) App Engine Reliability Fixit @randyshoup
  • 8. • Step 3: Consensus and Prioritization o Leads discussed themes and prioritized work o Assigned engineers to tasks App Engine Reliability Fixit @randyshoup
  • 9. • Step 4: Implementation and Follow-up o Engineers worked on assigned tasks o Simple spreadsheet of task status, which engineers updated weekly o Minimal effort from management (~1 hour / week) to summarize progress at weekly team meeting App Engine Reliability Fixit @randyshoup
  • 10. •  Results o 10x reduction in reliability issues o Improved team cohesion and camaraderie o Broader participation and ownership of the future health of the platform o Still remembered several years later App Engine Reliability Fixit @randyshoup
  • 12. Stitch Fix – Oct / Nov 2016 • (11/08/2016) Spectre unavailable for ~3 minutes [Shared Database] • (11/05/2016) Spectre unavailable for ~5 minutes [Shared Database] • (10/25/2016) All systems unavailable for ~5 minutes [Shared Database] • (10/24/2016) All systems unavailable for ~5 minutes [Shared Database] • (10/21/2016) All systems unavailable for ~3 ½ hours [DDOS attack] • (10/18/2016) All systems unavailable for ~3 minutes [Shared Database] • (10/17/2016) All systems unavailable for ~20 minutes [Shared Database] • (10/13/2016) Minx escalation broken for ~2 hours [Zendesk outage] • (10/11/2016) Label printing unavailable for ~10 minutes [FedEx outage] • (10/10/2016) Label printing unavailable for ~15 minutes [FedEx outage] • (10/10/2016) All systems unavailable for ~10 minutes [Shared Database] @randyshoup
  • 13. Database Stability Problems • 1. Applications contended on common tables • 2. Scalability limited by database connections • 3. One application could take down entire company @randyshoup
  • 14. Stability Retrospective • Step 1: Identify the Problem • Step 2: Understand the Problem • Step 3: Consensus and Prioritization • Step 4: Implementation and Follow-Up •  Results @randyshoup
  • 15. Stability Solutions • 1. Focus on expensive queries o Log o Eliminate o Rewrite o Reduce • 2. Manage database connections via connection concentrator • 3. Stability and Scalability Program o Ongoing 25% investment in services migration @randyshoup
  • 17. Login Issues - 2019 • Problem: Some members unable to log in • Inconsistent representations across different services in the system • Over time, simple system interactions grew increasingly complex and convoluted • Not enough graceful degradation or automated repair @randyshoup
  • 18. Login Retrospective • Step 1: Identify the Problem • Step 2: Understand the Problem • Step 3: Consensus and Prioritization • Step 4: Implementation and Follow-Up @randyshoup
  • 19. Login Solutions • 1. Clean up user data o Find inconsistencies o Track inconsistency metrics o Identify and fix contributing processes and applications • 2. User state machines o Define user journeys as explicit state machines o Refine and correct via cross-functional feedback o Implement state machines in code • 3. “Pandora” Program o Rewrite core identity system into set of user capabilities @randyshoup
  • 20. Common Elements • Unintentional, long-term accumulation of small, individually reasonable decisions • “Compelling event” catalyzes long-term change • Blameless culture makes learning and improvement possible • Structured post-incident approach @randyshoup
  • 23. Vicious Cycle of Technical Debt Technical Debt “No time to do it right” Quick- and-dirty
  • 24. “Do you have time to do it twice?” “We don’t have time to do it right!” @randyshoup
  • 25. The more constrained you are on time or resources, the more important it is to get it done the first time. @randyshoup
  • 27. Virtuous Cycle of Investment Solid Foundation Confidence Faster and Better Quality Investment
  • 29. During the Incident • Focus on restoring service o Everything else is secondary, and should wait • Shield the team • Clear, structured communication o Even when there is nothing to report! @randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  • 30. After the Incident • Blameless postmortem • Identify and understand the contributing factors • Action items and Learnings • Follow Up! @randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response
  • 31. Psychological Safety • Team is safe for interpersonal risk-taking • “Being able to show and employ one’s self without fear of negative consequences” • More important than any other factor in team success
  • 32. “Finally we can prioritize fixing that broken system!” @randyshoup
  • 33. Inclusive Decisionmaking • Make better business decisions 87% of the time • Make decisions 2x faster with 1/2 the meetings • Deliver 60% better business results Cloverpop Inclusive Decisionmaking study, 2016 As we improve diversity, decisionmaking improves @randyshoup
  • 35. Frame the Problem: Quality and reliability are business concerns @randyshoup
  • 36. Use Common Currency Time Money People @randyshoup
  • 37. 15 Million “Never let a good crisis go to waste.” @randyshoup
  • 38. “Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost.” @randyshoup -- John Allspaw, Adaptive Capacity Labs
  • 39. Improvement Budget • Explicit resource investment o Agree on an up-front investment (e.g., 25%, 30% of engineering efforts) • Retain autonomy, Provide transparency o Making these decisions is exactly why they hired you @randyshoup
  • 41. Incident Response Patterns • Incident Roles • Incident Triggers • On-Call Rotation and Onboarding • Incident Command Training • Incident Communication Plan • Periodic Incident Updates • Shared Incident State Doc • Incident Call Recording • Incident Swarming • Local / Global Incident Reviews • Post-Review Improvement Items @randyshoup https://myresources.itrevolution.com/id006657105/A-Framework-for-Incident-Response