Amazon Major Cloud Outage Analysis
Author: Rahul Tyagi
2
The Agenda
• The Issue
• The Goals
• Analysis Methodology
• The Analysis
3
The Issue
• Because Amazon's cloud is deeply embedded in enterprises, major Amazon cloud outages have widespread impact…
• Organizations such as Netflix, Dropbox, Airbnb, and Pinterest have been affected by Amazon cloud outages
4
The Issue
• Major cloud outages have been fairly regular events in the recent past. Some of the major outages:
• Dec/24/2012
• Oct/22/2012
• Jun/29/2012
• Apr/21/2011
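To put "fairly regular" into numbers, the gaps between the four listed outage dates can be computed directly. This is a minimal sketch in Python. Only the four dates come from the slide, and the variable names are illustrative.

```python
from datetime import date

# Outage dates as listed on the slide, oldest first.
outages = [
    date(2011, 4, 21),
    date(2012, 6, 29),
    date(2012, 10, 22),
    date(2012, 12, 24),
]

# Days elapsed between each pair of consecutive outages.
gaps = [(later - earlier).days for earlier, later in zip(outages, outages[1:])]

for earlier, later, gap in zip(outages, outages[1:], gaps):
    print(f"{earlier} -> {later}: {gap} days")
```

The gaps come out to 435, 115, and 63 days, so the listed outages moved closer together over the course of 2012.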
5
The Goals
• We want to analyze the chain of events behind major Amazon cloud outages (based on official Amazon statements)…
• We analyzed major outages from the past 2 years…
• The goal is to identify probable root causes and areas with room for improvement…
6
Analysis Methodology
We use the “Analytical Hierarchy Process” to identify root causes…
7
Analysis Methodology
Analyze Amazon’s statements about the outage → Identify the “chain of events” causing the outage → Categorize the “chain of events” → Analysis and conclusion
8
The Analysis > Analyze Amazon’s Statements about Outages
Outage Date Amazon’s Statement
Dec/24/2012 http://aws.amazon.com/message/680587/
Oct/22/2012 http://aws.amazon.com/message/680342/
Jun/29/2012 http://aws.amazon.com/message/67457/
Apr/21/2011 http://aws.amazon.com/message/65648/
We analyzed the following official Amazon statements…
9
The Analysis > Identify “Chain of Events” causing outages
Outage Core Issue
Dec-12 “The [ELB State] data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
Oct-12 “The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers”
Jun-12 “In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT”
Apr-11 “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”
The statements in double quotes are from Amazon’s press releases…
10
The Analysis > Identify “Chain of Events” causing outages
Outage Chain of Events
Dec-12
• "Maintenance process inadvertently run against production ELB state data"
• Process for incident approval had loose ends
• Validation of the output of the maintenance process (which ran inadvertently) was missing
• "load balancers that were modified were improperly configured by the control plane"
Oct-12
• "latent bug in an (EBS) operational data collection agent"
• "latent memory leak bug in the reporting agent"; monitoring for the memory leak was nonexistent
• "the DNS update did not successfully propagate to all of the internal DNS servers"
• "the (aggressive) throttling policy that was put in place was too aggressive"
Jun-12
• "datacenter that did not successfully transfer to the generator backup"
• "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT"
• "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug"
• "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before"
Apr-11
• “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”
• "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures"
• "We will audit our change process and increase the automation to prevent this mistake from happening in the future"
• "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster"
11
The Analysis > Categorize “Chain of Events”
Outage Chain of Events Hardware Software Automation Process
Dec-12
• "Maintenance process inadvertently run against production ELB state data" X X
• Process for incident approval had loose ends X
• Validation of the output of the maintenance process (which ran inadvertently) was missing X X X
• "load balancers that were modified were improperly configured by the control plane" X
Oct-12
• "latent bug in an (EBS) operational data collection agent" X X
• "latent memory leak bug in the reporting agent"; monitoring for the memory leak was nonexistent X X
• "the DNS update did not successfully propagate to all of the internal DNS servers" X X
• "the (aggressive) throttling policy that was put in place was too aggressive" X X
Jun-12
• "datacenter that did not successfully transfer to the generator backup" X
• "As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT" X
• "a small number of Multi-AZ RDS instances did not complete failover, due to a software bug" X X
• "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before" X X
Apr-11
• “The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.” X
• "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures" X
• "We will audit our change process and increase the automation to prevent this mistake from happening in the future" X
• "We will also invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster" X X
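As a consistency check on the table above, the X marks on each row can be counted and compared with the category totals reported on the chart slide (Software 8, Automation 4, Process 14). This is a rough sketch. The dictionary layout and names are assumptions made for illustration, and the numbers are simply how many X marks appear on each row as transcribed, not an assignment of marks to columns.

```python
# Number of X marks on each row of the categorization table, per outage,
# in the order the rows appear. Column positions are not recorded here.
marks_per_row = {
    "Dec-12": [2, 1, 3, 1],
    "Oct-12": [2, 2, 2, 2],
    "Jun-12": [1, 1, 2, 2],
    "Apr-11": [1, 1, 1, 2],
}

# Category totals as reported on the chart slide.
chart_totals = {"Software": 8, "Automation": 4, "Process": 14}

# Marks per outage, and across the whole table.
marks_per_outage = {outage: sum(rows) for outage, rows in marks_per_row.items()}
total_marks = sum(marks_per_outage.values())

print(marks_per_outage)
print(total_marks, sum(chart_totals.values()))
```

Both totals come to 26, so the number of marks in the table matches the sum of the three counts shown on the chart.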
12
The Analysis > Analysis and Conclusions
Process issues are a common theme in major outages of the Amazon cloud…
13
The Analysis > Analysis and Conclusions
[Chart: Amazon Cloud Major Outage - Issue Categories (# of issues): Process, 14; Software, 8; Automation, 4]
Process and Software are leading contributing
factors to major outages at Amazon…
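The chart's counts can be turned into shares to see how dominant each category is. This is a small sketch in Python. The three counts are taken from the chart, and the names and the rounding to one decimal place are illustrative choices.

```python
# Issue counts per category, as shown on the chart.
issue_counts = {"Software": 8, "Automation": 4, "Process": 14}

total = sum(issue_counts.values())

# Share of all categorized issues, as a percentage rounded to one decimal.
shares = {cat: round(100 * n / total, 1) for cat, n in issue_counts.items()}

# Print categories from largest to smallest share.
for cat, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat}: {issue_counts[cat]} of {total} ({share}%)")
```

Process accounts for just over half of the 26 categorized issues (53.8%), and Process and Software together make up 22 of the 26.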
14
The Analysis > Analysis and Conclusions
• The majority of issues contributing to outages are related to process or software
• “Process” rigor in cloud operations and the SDLC at Amazon appears to have room for improvement
• Culture? We have heard that Amazon has a “Just-Do-It” culture; process rigor may require more than “just do it”
15
Thank You! You are Awesome! You deserve applause!!
