Analyzing a Complex Cloud Outage
@botchagalupe
VP of Services, enStratus



Saturday, December 1, 12
John Willis
Call me Botchagalupe
VP of Services
WHO AM I




30 years in IT
Ubuntu cloud evangelist, startup
Opscode
DTO: awesome dudes
enStratus: GR called...
GOALS




    • Look at a complex cloud outage.
    • Understanding complexity.
    • Analyze a complex cloud outage.



Review bullets...
Amazon’s EBS Outage 10/22/2012




Fed Reserve story, just the WSJ part
Amazon outages are a big deal
Minecraft
Amazon’s EBS Outage 10/22/2012




#Let’s take a look at the value stream (VS) of the service that failed on 10/22.
#If we look in the middle we see the green storage server.
#This box is a simplified view of a larger service (meaning many servers; KISS for now).
#We always try to look at a VS from right to left (from the customer back).
#In this example customers use something called EBS (block storage, a cloud-based SAN).
#Next, the thing that is often left out of most value streams: the humans. In this example they were an integral part of the system.
#Next we have an operations monitoring database (disk): things the humans need to know about.
#Could talk about autonomation (pre-automation): Sakichi Toyoda's loom stopped automatically if a thread broke or ran out.
#Next there is the EBS server failover machine. Most production systems in large IT centers will have failover (FA).
#It is important to realize that there are core services (e.g., EBS) and non-core services on this box.
##Non-core services are things like the operations monitoring agent that feeds into the monitoring disk DB.
##Also, as we will see in a minute, there is a hardware agent on this EBS server for detecting hardware failures.
#Next is a fleet monitor server (FMS): basically a hardware monitor that can phone home or auto-order replacements for defective parts from the manufacturer (in large infrastructures like Amazon, Google, Facebook this is common).
#It has a FA server.
Amazon’s EBS Outage 10/22/2012


                                                           The EBS System




Amazon’s EBS Outage 10/22/2012


                                                           Server Failure




#On this one fine afternoon the fleet monitor server died. Remember, it receives data from the HW agents on the EBS server.
Amazon’s EBS Outage 10/22/2012


                                                        Server Failover




#At this point there is most likely automated failover (FA/HA).
#We see the FA server, which is now supposed to be logically in the VS (the new arrow).
#Any systems thinkers out there see the first problem with this red circle?
Amazon’s EBS Outage 10/22/2012


                                                   DNS Propagation Failure




#The FA/HA seems to work flawlessly, from the fleet monitor (FM) guys' perspective.
#However, our second problem happens: DNS does not update its records correctly.
#Therefore the HW agent running on the EBS server is still pointing to the downed fleet server. (Everyone see that?)
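The check that was missing at this step can be sketched in a few lines. This is only an illustration, not Amazon's tooling: the hostname `fleet-monitor.internal` and the address `10.0.0.2` are hypothetical, and the resolver is injectable so the logic can be exercised without real DNS.

```python
import socket
from typing import Callable, Set


def resolved_addresses(hostname: str) -> Set[str]:
    """Return the set of addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def failover_visible(
    hostname: str,
    expected_ip: str,
    resolver: Callable[[str], Set[str]] = resolved_addresses,
) -> bool:
    """True only if clients resolving hostname would reach the failover server."""
    return expected_ip in resolver(hostname)


# Hypothetical usage after a failover:
#   failover_visible("fleet-monitor.internal", "10.0.0.2")
```

Running a check like this from the client's side (here, the EBS server hosting the HW agent) rather than from the failed-over service is the point: the failover can look healthy from one seat and invisible from another.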
Amazon’s EBS Outage 10/22/2012


                                                         Agent Memory Leak




#Now a third problem jumps in (the yellow box).
#When the HW agent tries to write back to the wrong FMS (the dead one), some not-well-tested code fails and creates a memory leak.
#To make matters worse, this particular agent is designed to be fault tolerant. In other words, it should fail silently and not disrupt any core service. E.g., it is designed to be OK with failing if it can’t send to the FMS. The assumption is that it will get it next time.
#Now you can start seeing the first level of a complex system emerge.
##The first issue seems to be fixed (the FMS FA).
##DNS isn’t showing up as a failure on anybody’s dashboard.
##And we have a silent error occurring on one of our core services (a customer service).
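One way a "fail quietly, retry next time" agent can avoid growing without bound is to cap what it holds while the collector is unreachable, and to count the failures so they can show up on a dashboard. The sketch below is an assumption-laden illustration, not the actual agent: `ReportBuffer`, `send`, and `max_pending` are hypothetical names.

```python
from collections import deque
from typing import Any, Callable


class ReportBuffer:
    """Holds unsent reports in a bounded queue and counts failed sends."""

    def __init__(self, send: Callable[[Any], None], max_pending: int = 100) -> None:
        self._send = send
        # A deque with maxlen silently drops the oldest item when full,
        # so memory use stays bounded even if the collector never comes back.
        self._pending: deque = deque(maxlen=max_pending)
        self.failed_sends = 0  # expose this as a metric so the failure is not silent

    @property
    def pending(self) -> int:
        return len(self._pending)

    def report(self, item: Any) -> None:
        self._pending.append(item)
        self.flush()

    def flush(self) -> None:
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                # Collector unreachable: keep the item, record the failure, retry later.
                self.failed_sends += 1
                return
            self._pending.popleft()
```

The agent still never disrupts the core service, but the cost of the dead collector is now capped at `max_pending` reports and the `failed_sends` counter gives monitoring something to alarm on.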
Amazon’s EBS Outage 10/22/2012


                                              EBS Service is slowing down




#The memory leak continues undetected and eventually it starts slowing down the EBS service because of low memory.
#Key point here is that it is probably still not detected by the IT staff, and maybe it’s just starting to annoy customers, but not enough to turn the customer box yellow (yet).
Amazon’s EBS Outage 10/22/2012


                                                       Throttling the API




#At some point the IT staff notices the slowdown. We would hope before the customers complain, and in AMZN’s case that is probably true (they are pretty good).
#However, as we said earlier, they still don’t know why it’s slow...
#Another bit of complexity is introduced here.
##The EBS servers always run hot (high) on memory. Therefore the undetected memory leak most likely goes unnoticed at this point. (We will discuss this in detail when we get to the analysis part of this preso.)
##From AMZN’s RCA it was pretty clear that this was the case: they had not detected the mem leak.
##Next a human action is taken: they (the humans) decide to activate a throttling tool.
##They use this to throttle customer API requests as a stopgap to give them time to figure out and hopefully fix the issue (the slowdown).
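For readers who have not seen API throttling up close, a token bucket is one common way to implement it. The notes do not say which mechanism Amazon's throttling tool uses, so the sketch below is a generic illustration with hypothetical names (`TokenBucket`, `rate_per_sec`, `capacity`); the clock is injectable so the behavior can be checked deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` requests, refilled at `rate_per_sec`."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it is throttled."""
        now = self._clock()
        # Refill for the time elapsed since the last call, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Lowering `rate_per_sec` buys the operators time, but every rejected call is a customer-visible failure, which is why throttling without knowing the cause is a stopgap and not a fix.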
Amazon’s EBS Outage 10/22/2012


                                                       Customer Issues




#By now the customer is getting a double whammy.
##One, they were already experiencing slow responses from the service.
##Two, now the throttling has really made it worse for them.
Amazon’s EBS Outage 10/22/2012


                                                          EBS Failover




#This situation continues, with the IT staff still not knowing why the EBS service is having issues.
#The customers' situation gets worse.
#And now the IT guys decide to punt again (like throttling).
##They force a FA/HA of the EBS service (servers).
##Keep in mind they still don’t know what is wrong... grasping at straws twice now. Remember this for later.
##Notice now that the new HA/FA server is in place (show the arrow).
##Complexity strikes again. This is a classic IT outage scenario, where something seems to be fixed when it really isn’t.
##The new FA server seems to have solved the problem. The new server is not slow at first...
##However, what they don’t know is that all they have done is delay the inevitable.
#The mem leak just starts all over again on the FA server.
##Customer is still orange, mainly because of throttling...
Amazon’s EBS Outage 10/22/2012


                                                             Twitter Effect




#At this point we start getting what are called indirect effects.
##The first effect (and this was in the RCA) is that users tend to use more services when a potential outage is perceived. They start testing more services, trying other services.
##The next indirect effect is what I call the Twitter effect. That is, the outage starts trending on Twitter and everyone in the extended system starts trying to kick the tires on AWS: "Let’s start up Netflix", "I wonder if GitHub is working OK"...
Amazon’s EBS Outage 10/22/2012


                                                             Failover Server Dies




#And of course the FA server eventually gets to the same state as the original EBS server.
#Meanwhile it is very likely that the IT staff still does not know why this is all happening.
Amazon’s EBS Outage 10/22/2012


                                                             Systemic Outage




#Now our complete system is in a systemic failure...
#Ironically the original failed-over FMS is just fine (no red there). No one is using it... remember why?
10 Minutes
Understanding Complexity




#So let’s talk about complexity from a theoretical standpoint.
#Typically humans think linearly. Our first instinct is that it’s always an X->Y.
#One variable X will change the outcome Y (Y is the dependent variable).
--That is for a new improvement (change, bug fix, maintenance, etc.)
--An emergency (like the Amazon issues)
--A new product, feature, etc.
Understanding Complexity




#However, in real life it’s never really X->Y. In real life you get many variables.
#In statistics this is referred to as "don’t confuse correlation with causation".
#X->Y is correlation, but it’s dangerous to assume it’s causation.
#Real life is not that simple... we call it the messiness of life.

#X1 a simple server failure
#X2 The failover
#X3 The DNS

Deming wrote of Chanticleer, the barnyard rooster who had a theory. He crowed every
morning, putting forth all his energy, flapped his wings. The sun came up. The
connexion was clear: His crowing caused the sun to come up. There was no question
about his importance.
There came a snag. He forgot one morning to crow. The sun came up anyhow.
Understanding Complexity




                           T1             T2


#You are also more likely to get time-dependent variables that add to the complexity.
#X1-X3 happen at T1 and X4-X5 happen at T2.

#X4 is the memory leak
#X5 The dreaded throttling...
Understanding Complexity




                           T1             T2


#There are also indirect effects on the dependent variable (Y).
#For example, X1 in concert with X4 can conjointly affect the dependent variable Y.
##X1 changes X4, and the combined effect on Y is different.
##Same with X5.
#This is a different model than a simple X->Y.

#X? The customers respond with more usage
#X? The Twitter effect
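A toy numeric model makes the difference between a simple X->Y and a conjoint effect concrete. The coefficients below are invented purely for illustration and do not come from the outage data; the only point is that with an interaction term, the effect of X4 on Y depends on the value of X1.

```python
def outcome(x1: float, x4: float, b1: float = 1.0, b4: float = 1.0, b14: float = 3.0) -> float:
    """Toy model: Y = b1*X1 + b4*X4 + b14*X1*X4 (the last term is the interaction)."""
    return b1 * x1 + b4 * x4 + b14 * x1 * x4


def effect_of_x4(x1: float) -> float:
    """How much Y moves when X4 goes from 0 to 1, for a given X1."""
    return outcome(x1, 1.0) - outcome(x1, 0.0)


# With X1 absent the effect of X4 is 1.0; with X1 present it is 4.0.
no_x1 = effect_of_x4(0.0)
with_x1 = effect_of_x4(1.0)
```

In a purely additive X->Y model the two numbers would be equal, and fixing one variable at a time would work. With the interaction term they differ, which is why changing one thing in isolation can mislead.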

15 Minutes
W. Edwards Deming (1900 – 1993)




     • Father of Quality
     • Understanding of the system
     • Understanding variation
     • Understanding human behavior
     • Introduced sampling into US Census
     • WWII success credited to his quality approach
     • Taught Japan after WWII and transformed quality
     • In 1980 Transformed American quality revolution
     • The Foundations of Six Sigma


#There is a tool that has been used by successful companies like Toyota (lean) and many others.
#Dr. W. Edwards Deming gave us such a lens to break down complexity
##(to bring the real world into focus, just like a camera does).

20 Minutes
System of Profound Knowledge (SoPK)




#Let’s say we need a lens for improvement of something (an enhancement, a bug fix, a new product idea).
#An outcome X->Y.
#Dr. Deming gave us a tool called “The System of Profound Knowledge”.

#Just common sense... However, as Mark Twain said, there is nothing common about common sense.

#SoPK is a lens to break down complexity and give ourselves an advantage by not oversimplifying what we are trying to do.
#In other words, clear up the messiness of real life just like a camera lens does.
Knowledge of a System




     • Systems Thinking
     • End to End Value Stream
     • What is the Aim of a System?
     • The Purpose of the System
     • Global Optimization
     • Not Local Optimization




(S) Appreciation of a System - Systems thinking - Deming would say understanding the AIM of a system.
Deming said every system must have an AIM.
Is your AIM to keep a server up or to protect a customer SLA? (They might not be the same thing, as we will soon see.)
Eli Goldratt (TOC) would say global optimization over local optimization, even if that leaves a subsystem sub-optimal. Understanding subsystems and dependent systems.
Knowledge of a System




One big exercise in non-systems thinking...
Clearly there were independent views of the system.
What was the AIM of this system?
Did the hardware guys have the same aim as the core services guys?

Lens #1 Not having a systems view. Not seeing this as dependent systems. You might say surely they had automation for DNS. However, I would say no. Because...

Lens #2 The hardware guys should know that they are part of a bigger system than just the hardware monitor. They had code on a core service: was it smoke tested, immune tested? Was there a systems view for QA and smoke testing of agent code changes?

Lens #3 (X->Y) Humans try to correct the memory issue with throttling and they don’t understand hardware monitoring as a subsystem... local optimization.
Knowledge of Variation




      • There is always Variation
      • Special Cause Variation
      • Common Cause Variation
      • Understanding Variation




Continuous improvement requires the understanding of variation.

You have a power outage and it takes key personnel a long time to get to the data center. That is special cause variation. A bad reaction would be to create a new policy that all personnel live within 5 minutes of the data center (i.e., treating it like common cause).
Conversely, firing a new programmer who brings down a production system would be treating common cause as a special cause situation. More than likely it was bad safeguards, insufficient training...

(V) Variation - Not understanding variation is the root of all evil. Deming would get mad at people for knee-jerk reactions due to not understanding the kind of variation. How do you understand variation? Statistics (primarily standard deviation and its relationship to a process, i.e., its distribution).
To give you an example: a large cloud provider rate-limits API calls at 100 (why 100?) per (x). For most customers that’s fine; others, however, get treated as a DDoS. Where did they get 100? It had to be a guess. If they understood SPC (variation) they might derive the number from data and have a CI process in place for when they find special cause variation.
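As a rough sketch of what "derive the number from data" could look like, the snippet below computes an upper limit as mean plus three standard deviations of observed call rates and flags observations above it. The sample values and function names are hypothetical, and using the plain standard deviation is a simplification: individuals control charts are normally built from moving ranges.

```python
import statistics
from typing import List, Sequence


def upper_control_limit(samples: Sequence[float], sigmas: float = 3.0) -> float:
    """Mean plus `sigmas` population standard deviations of the baseline data."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    return mean + sigmas * sd


def special_cause_points(baseline: Sequence[float], new_points: Sequence[float]) -> List[float]:
    """Return the new observations that fall above the baseline's upper limit."""
    limit = upper_control_limit(baseline)
    return [x for x in new_points if x > limit]


# Hypothetical calls-per-interval observed for one customer:
baseline = [90.0, 100.0, 110.0]
limit = upper_control_limit(baseline)
flagged = special_cause_points(baseline, [105.0, 120.0, 300.0])
```

A limit derived this way is per process (here, per customer), so a heavy but stable caller is not mistaken for an attack while a sudden jump still stands out.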

Control Chart




 • Approximately 99% - 100% of the values will fall within 3 standard deviations of the mean
 • Approximately 90% - 98% of the values will fall within 2 standard deviations of the mean
 • Approximately 60% - 78% of the values will fall within 1 standard deviation of the mean
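These rules of thumb are easy to check against any data set. The helper below is a minimal sketch (the function name and sample data are made up for illustration) that reports what fraction of the values fall within k standard deviations of the mean.

```python
import statistics
from typing import Sequence


def fraction_within(samples: Sequence[float], k: float) -> float:
    """Fraction of samples whose distance from the mean is at most k standard deviations."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    if sd == 0:
        return 1.0  # every value equals the mean
    inside = sum(1 for x in samples if abs(x - mean) <= k * sd)
    return inside / len(samples)


# Made-up data: the integers 1 through 10.
data = list(range(1, 11))
within_1 = fraction_within(data, 1)
within_2 = fraction_within(data, 2)
```

If a process that normally keeps nearly everything within 3 standard deviations starts producing points outside that band, that is the signal to go looking for a special cause.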
Knowledge of Variation




The biggest issue here is the knee-jerk reactions: throttling and the forced EBS server failover. They didn’t understand the type of variation.

Lens #1 The systems guys don’t understand common vs. special cause variation... they react to an “S” that should have been a “C”.
Turns out the issue is about monitoring sub-processes, i.e., looking at individual monitors. E.g., aggregate memory might have gone from 95% to 96%, which caused the issue. However, if they were looking at the individual agent memory they...
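The difference between aggregate and per-process monitoring can be sketched as follows. The process names and memory figures are hypothetical, and "strictly increasing over the last few samples" is a deliberately crude leak heuristic chosen for illustration.

```python
from typing import Dict, List, Sequence


def steadily_growing(samples: Sequence[float], window: int = 5) -> bool:
    """True if the last `window` samples are strictly increasing."""
    recent = samples[-window:]
    if len(recent) < window:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def leaking_processes(history: Dict[str, Sequence[float]], window: int = 5) -> List[str]:
    """Names of processes whose memory has grown on every recent sample."""
    return sorted(name for name, samples in history.items() if steadily_growing(samples, window))


# Hypothetical memory use in MB per sampling interval:
history = {
    "ebs_service": [900, 905, 898, 910, 902],  # large and noisy, but not trending
    "hw_agent": [10, 14, 19, 25, 32],          # small, but growing every interval
}
suspects = leaking_processes(history)
```

In this made-up data the server total moves only from 910 to 934 MB, which is easy to dismiss on a box that always runs hot, while the agent's own series has more than tripled.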
Theory of Knowledge




     • Scientific Method
     • Knowledge Must Have Theory
     • Theory Must Have Prediction
     • Prediction Must Have Tests
     • Aim-->Measure-->Change




(K) is the simplest but the hardest for most people to understand. Simply put, it is applying the scientific method to everything you do. Deming says you must have theory to have knowledge, you can’t have knowledge without prediction, and a prediction without a test is useless.

PDSA; others call it AMC: Aim, Measure (a. what process are you going to change, b. measure if the change worked), Change. You have to test any improvement to see if it worked, failed, or did nothing. Imagine someone starting a failover system with automation but not testing to see if it really worked (could never happen).
Theory of Knowledge




Lens #1 (K) Theory cannot be an unmeasured guess. Whoever did the failover (automation or manual) apparently didn’t have a proper measure for success. They should have verified that they were actually using the new server (duh).

Lens #2 Measures without results are not fixes (throttling). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened?
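The "three potential outcomes" test can be written down directly. This is a minimal sketch under stated assumptions: the metric is one where lower is better (latency, say), the 5% tolerance is an arbitrary illustrative choice, and the function name is hypothetical.

```python
import statistics
from typing import Sequence


def classify_change(before: Sequence[float], after: Sequence[float], tolerance: float = 0.05) -> str:
    """Compare mean metric values before and after a change (lower is better).

    Returns "better", "same", or "worse" based on the relative change in the mean.
    """
    base = statistics.fmean(before)
    new = statistics.fmean(after)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    change = (new - base) / base
    if change < -tolerance:
        return "better"
    if change > tolerance:
        return "worse"
    return "same"


# Hypothetical request latencies (ms) before and after turning on throttling:
verdict = classify_change([100.0, 100.0], [150.0, 150.0])
```

The point is less the arithmetic than the habit: state the prediction before the change, measure afterwards, and be prepared for the answer to be "worse".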
Theory of Psychology




     • Understanding Behavior
     • Understanding Tribes
     • Understanding Worldviews




(P) Another easy one, but the hardest to implement. Understanding behavior: why people do the things they do. Tribal behavior: things that are important to one group might not be important to other groups. Understanding human behavior (another lens factor). Worldviews: imagine a server that has software on it from two totally different dev groups. Further imagine these two groups’ worldviews are far apart. One does agile, CI, TDD, BDD, CD; the other has never even heard of those things. (Groupthink experiment)...
Theory of Psychology




Lens #1 (P) We could argue that maybe because the fleet servers are managed by hardware guys and DNS by systems guys, they are different cultural tribes and don’t understand the importance of each other. Maybe they don’t go to lunch together.

Lens #2 (P) Hardware developers (agent): are they doing TDD/CD like the EBS devs do? Do they have the same theories? Do the EBS guys do CD smoke testing with the hardware monitoring agents?

Lens #3 (P) Tribal understanding of behavior differences between hardware guys and systems guys.

Lens #4 (P) Not understanding customer behavior. Customers increase their actions (API calls): testing services, QA, smoke tests.
Amazon’s Outage 10/22/2012

                           Let’s Review




#X1 A simple server failure
#X2 The failover
#X3 The DNS
#X4 The memory leak
#X5 Bad TDD hygiene by FMS eng/dev
#X6 The dreaded throttling...
#X7 The customers respond with more usage
#X8 EBS server failover
#X9 The Twitter effect

#The complexity was masked.
#This was not an X->Y.
#Too bad they had not read Deming...

More Related Content

More from John Willis

Automated Governance
Automated GovernanceAutomated Governance
Automated GovernanceJohn Willis
 
Devops Long Strange Trip
Devops Long Strange Trip Devops Long Strange Trip
Devops Long Strange Trip John Willis
 
I Got 99 Problems and a Bash DSL Ain't One of Them
I Got 99 Problems and a Bash DSL Ain't One of ThemI Got 99 Problems and a Bash DSL Ain't One of Them
I Got 99 Problems and a Bash DSL Ain't One of ThemJohn Willis
 
The 7 deadly diseases of DevOps 2019
The 7 deadly diseases of DevOps 2019The 7 deadly diseases of DevOps 2019
The 7 deadly diseases of DevOps 2019John Willis
 
Next Generation Infrastructure - Devops Enterprise Summit 2018
Next Generation Infrastructure - Devops Enterprise Summit 2018Next Generation Infrastructure - Devops Enterprise Summit 2018
Next Generation Infrastructure - Devops Enterprise Summit 2018John Willis
 
swampUP - 2018 - The Divine and Felonious Nature of Cyber Security
swampUP - 2018 - The Divine and Felonious Nature of Cyber SecurityswampUP - 2018 - The Divine and Felonious Nature of Cyber Security
swampUP - 2018 - The Divine and Felonious Nature of Cyber SecurityJohn Willis
 
Divine and felonios cyber security devopsdays austin 2018
Divine and felonios cyber security  devopsdays austin 2018Divine and felonios cyber security  devopsdays austin 2018
Divine and felonios cyber security devopsdays austin 2018John Willis
 
Devops - A Long Strange Trip It's Been
Devops - A Long Strange Trip It's BeenDevops - A Long Strange Trip It's Been
Devops - A Long Strange Trip It's BeenJohn Willis
 
DevopsdaysNYC - Almost 10 Years - What A Strange Long Trip It's Been
DevopsdaysNYC - Almost 10 Years - What A Strange Long Trip It's BeenDevopsdaysNYC - Almost 10 Years - What A Strange Long Trip It's Been
DevopsdaysNYC - Almost 10 Years - What A Strange Long Trip It's BeenJohn Willis
 
You build it - Cyber Chicago Keynote
You build it -  Cyber Chicago KeynoteYou build it -  Cyber Chicago Keynote
You build it - Cyber Chicago KeynoteJohn Willis
 
Art of the Possible - Serverless Conference NYC 2017
Art of the Possible - Serverless Conference NYC 2017 Art of the Possible - Serverless Conference NYC 2017
Art of the Possible - Serverless Conference NYC 2017 John Willis
 
Why Executives Can't Change
Why Executives Can't Change Why Executives Can't Change
Why Executives Can't Change John Willis
 
Devops Kaizen - DevopsDays Dallas 2017
Devops Kaizen - DevopsDays Dallas 2017 Devops Kaizen - DevopsDays Dallas 2017
Devops Kaizen - DevopsDays Dallas 2017 John Willis
 
Evolve 2017 - Vegas - Devops, Docker and Security
Evolve 2017 - Vegas - Devops, Docker and Security Evolve 2017 - Vegas - Devops, Docker and Security
Evolve 2017 - Vegas - Devops, Docker and Security John Willis
 
Alibaba Cloud Conference 2016 - Docker Open Source
Alibaba Cloud Conference   2016 - Docker Open Source Alibaba Cloud Conference   2016 - Docker Open Source
Alibaba Cloud Conference 2016 - Docker Open Source John Willis
 
Alibaba Cloud Conference 2016 - Docker Enterprise
Alibaba Cloud Conference   2016 - Docker EnterpriseAlibaba Cloud Conference   2016 - Docker Enterprise
Alibaba Cloud Conference 2016 - Docker EnterpriseJohn Willis
 
Breaking Bad Equilibrium - Devops Connect 2017 RSAC
Breaking Bad Equilibrium - Devops Connect 2017 RSACBreaking Bad Equilibrium - Devops Connect 2017 RSAC
Breaking Bad Equilibrium - Devops Connect 2017 RSACJohn Willis
 
Breaking Bad Equilibrium - Devops Connect 2016 LA
Breaking Bad Equilibrium - Devops Connect 2016 LABreaking Bad Equilibrium - Devops Connect 2016 LA
Breaking Bad Equilibrium - Devops Connect 2016 LAJohn Willis
 
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
All daydevops   2016 - Turning Human Capital into High Performance Organizati...All daydevops   2016 - Turning Human Capital into High Performance Organizati...
All daydevops 2016 - Turning Human Capital into High Performance Organizati...John Willis
 

More from John Willis (20)

Automated Governance
Automated GovernanceAutomated Governance
Automated Governance
 
Devops Long Strange Trip
Devops Long Strange Trip Devops Long Strange Trip
Devops Long Strange Trip
 
I Got 99 Problems and a Bash DSL Ain't One of Them
I Got 99 Problems and a Bash DSL Ain't One of ThemI Got 99 Problems and a Bash DSL Ain't One of Them
I Got 99 Problems and a Bash DSL Ain't One of Them
 
Math is cool
Math is coolMath is cool
Math is cool
 
The 7 deadly diseases of DevOps 2019
The 7 deadly diseases of DevOps 2019The 7 deadly diseases of DevOps 2019
The 7 deadly diseases of DevOps 2019
 
Next Generation Infrastructure - Devops Enterprise Summit 2018
Next Generation Infrastructure - Devops Enterprise Summit 2018Next Generation Infrastructure - Devops Enterprise Summit 2018
Next Generation Infrastructure - Devops Enterprise Summit 2018
 
swampUP - 2018 - The Divine and Felonious Nature of Cyber Security
swampUP - 2018 - The Divine and Felonious Nature of Cyber SecurityswampUP - 2018 - The Divine and Felonious Nature of Cyber Security
swampUP - 2018 - The Divine and Felonious Nature of Cyber Security
 
Divine and felonios cyber security devopsdays austin 2018
Divine and felonios cyber security  devopsdays austin 2018Divine and felonios cyber security  devopsdays austin 2018
Divine and felonios cyber security devopsdays austin 2018
 
Devops - A Long Strange Trip It's Been
Devops - A Long Strange Trip It's BeenDevops - A Long Strange Trip It's Been
Devops - A Long Strange Trip It's Been
 
DevopsdaysNYC - Almost 10 Years - What A Strange Long Trip It's Been
DevopsdaysNYC - Almost 10 Years - What A Strange Long Trip It's BeenDevopsdaysNYC - Almost 10 Years - What A Strange Long Trip It's Been
DevopsdaysNYC - Almost 10 Years - What A Strange Long Trip It's Been
 
You build it - Cyber Chicago Keynote
You build it -  Cyber Chicago KeynoteYou build it -  Cyber Chicago Keynote
You build it - Cyber Chicago Keynote
 
Art of the Possible - Serverless Conference NYC 2017
Art of the Possible - Serverless Conference NYC 2017 Art of the Possible - Serverless Conference NYC 2017
Art of the Possible - Serverless Conference NYC 2017
 
Why Executives Can't Change
Why Executives Can't Change Why Executives Can't Change
Why Executives Can't Change
 
Devops Kaizen - DevopsDays Dallas 2017
Devops Kaizen - DevopsDays Dallas 2017 Devops Kaizen - DevopsDays Dallas 2017
Devops Kaizen - DevopsDays Dallas 2017
 
Evolve 2017 - Vegas - Devops, Docker and Security
Evolve 2017 - Vegas - Devops, Docker and Security Evolve 2017 - Vegas - Devops, Docker and Security
Evolve 2017 - Vegas - Devops, Docker and Security
 
Alibaba Cloud Conference 2016 - Docker Open Source
Alibaba Cloud Conference   2016 - Docker Open Source Alibaba Cloud Conference   2016 - Docker Open Source
Alibaba Cloud Conference 2016 - Docker Open Source
 
Alibaba Cloud Conference 2016 - Docker Enterprise
Alibaba Cloud Conference   2016 - Docker EnterpriseAlibaba Cloud Conference   2016 - Docker Enterprise
Alibaba Cloud Conference 2016 - Docker Enterprise
 
Breaking Bad Equilibrium - Devops Connect 2017 RSAC
Breaking Bad Equilibrium - Devops Connect 2017 RSACBreaking Bad Equilibrium - Devops Connect 2017 RSAC
Breaking Bad Equilibrium - Devops Connect 2017 RSAC
 
Breaking Bad Equilibrium - Devops Connect 2016 LA
Breaking Bad Equilibrium - Devops Connect 2016 LABreaking Bad Equilibrium - Devops Connect 2016 LA
Breaking Bad Equilibrium - Devops Connect 2016 LA
 
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
All daydevops   2016 - Turning Human Capital into High Performance Organizati...All daydevops   2016 - Turning Human Capital into High Performance Organizati...
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
 

Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

  • 1. Analyzing a Complex Cloud Outage
      @botchagalupe, VP of Services, enStratus
      Saturday, December 1, 12
      John Willis. Call me Botchagalupe. VP of Services.
  • 2. WHO AM I
      30 yrs IT
      Ubuntu cloud evangelist startup
      Opscode
      DTO awesome dudes
      enStratus GR called..
  • 3. GOALS
      • Look at a complex cloud outage.
      • Understanding complexity.
      • Analyze a complex cloud outage.
      Review bullets...
  • 4. Amazon’s EBS Outage 10/22/2012
      Fed reserve story, just the WSJ part
      Amazon outages are a big deal
      Minecraft
  • 5. Amazon’s EBS Outage 10/22/2012
  • 6. Amazon’s EBS Outage 10/22/2012
      # Let’s take a look at the value stream (VS) of the service that failed on 10/22.
      # If we look in the middle we see the green storage server.
      # This box is a simplified view of a larger service (meaning many servers.. KISS for now).
      # We always try to look at a VS from right to left (from the customer back).
      # In this example customers use something called EBS (block storage, a cloud-based SAN).
      # Next, the thing that is often left out of most VSs: the humans. In this example they were an integral part of the system.
      # Next we have an operations monitoring database (disk).. things the humans need to know about.
      # Could talk about autonomation (pre-automation): Sakichi Toyoda’s loom stopped automatically if a thread broke or ran out.
      # Next there is the EBS server failover machine... Most production systems in large IT centers will have failover (FA).
      # It is important to realize that there are core services on this box (e.g., EBS) and non-core services.
      ## Non-core services are things like the operations monitoring agent that feeds into the monitor disk DB.
      ## Also, as we will see in a minute, there is a hardware agent on this EBS server for detecting hardware failures.
      # Next is a fleet monitor server... basically a hardware monitor that can phone home or auto-order replacements for defective parts from the manufacturer (in large infrastructures like Amazon, Google, Facebook this is common).
      # It has a FA server.
  • 7. Amazon’s EBS Outage 10/22/2012: The EBS System (notes repeat slide 6)
  • 8. Amazon’s EBS Outage 10/22/2012: Server Failure
      # On this one fine afternoon the fleet server died.. remember, it receives data from the HW agents on the EBS server.
  • 9. Amazon’s EBS Outage 10/22/2012: Server Failover
      # At this point there is most likely automated FA/HA.
      # We see that the FA server is now supposed to be logically in the VS (the new arrow).
      # Any systems thinkers out there see the first problem with this red circle?
  • 10. Amazon’s EBS Outage 10/22/2012: DNS Propagation Failure
      # The FA/HA seems to work flawlessly, from the fleet monitoring (FM) guys’ perspective.
      # However, our second problem happens: DNS does not update its records correctly.
      # Therefore the HW agent running on the EBS server is still pointing to the down fleet server. (Everyone see that?)
  • 11. Amazon’s EBS Outage 10/22/2012: Agent Memory Leak
      # Now a third problem jumps in.. (the yellow box)
      # When the HW agent tries to write back to the wrong fleet monitoring server (FMS), the dead one, some not-well-tested code fails and creates a memory leak...
      # To make matters worse, this particular agent is designed to be fault tolerant. In other words, it should die silently and not disrupt any core service. E.g., it is designed to be OK to fail if it can’t send to the FMS. The assumption is that it will get it next time.
      # Now you can start seeing the first level of a complex system emerge.
      ## The first issue seems to be fixed (the FMS FA).
      ## DNS isn’t showing up as a failure on anybody’s dashboard.
      ## And we have a silent error occurring on one of our core services (a customer service).
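A minimal sketch of what "OK to fail, get it next time" could look like without leaking memory, assuming (this is not in the RCA) that the agent queues unsent reports in memory. `BoundedReportQueue` and `unreachable_server` are hypothetical names made up for this illustration; the point is only that the backlog has a hard cap, so a dead FMS cannot eat the EBS server’s memory.

```python
from collections import deque


class BoundedReportQueue:
    """Holds unsent hardware reports, discarding the oldest once full."""

    def __init__(self, max_pending=1000):
        # deque(maxlen=...) drops the oldest item automatically on overflow.
        self._pending = deque(maxlen=max_pending)
        self.dropped = 0

    def add(self, report):
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1  # count what the overflow is about to discard
        self._pending.append(report)

    def flush(self, send):
        """Try to send everything; stop quietly at the first failure."""
        sent = 0
        while self._pending:
            try:
                send(self._pending[0])
            except OSError:
                break  # server unreachable: keep the rest for next time
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)


def unreachable_server(report):
    """Stand-in for a send to a dead fleet monitoring server."""
    raise ConnectionError("fleet monitoring server unreachable")
```

With a cap in place the agent still fails silently, but the `dropped` counter gives the humans something to put on a dashboard.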
  • 12. Amazon’s EBS Outage 10/22/2012: EBS Service Is Slowing Down
      # The memory leak continues undetected and eventually starts slowing down the EBS service because of low memory...
      # Key point here is that it is probably still not detected by the IT staff, and maybe it’s just starting to annoy customers, but not enough to turn the customer box yellow (yet).
  • 13. Amazon’s EBS Outage 10/22/2012: Throttling the API
      # At some point the IT staff notices the slowdown. We would hope before the customers complain, and in AMZN’s case that is probably true (they are pretty good).
      # However, as we said earlier, they still don’t know why it’s slow...
      # Another bit of complexity is introduced here.
      ## The EBS servers always run hot (high) on memory. Therefore the undetected memory leak most likely goes unnoticed at this point. (We will discuss this in detail when we get to the analysis part of this preso.)
      ## From AMZN’s RCA it was pretty clear that they had not detected the memory leak.
      ## Next a human interaction takes place: they (the humans) decide to activate a throttling tool.
      ## They use this to throttle customer API requests as a stopgap, to give themselves time to figure out and hopefully fix the issue (the slowdown).
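For anyone who has not seen API throttling up close, here is a rough sketch of one common way to do it, a token bucket. This is an assumption for illustration only; the RCA does not say how AMZN’s throttling tool works. `TokenBucket`, `rate` and `capacity` are hypothetical names.

```python
import time


class TokenBucket:
    """Allows about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so the behavior can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False  # this request is throttled
```

Notice what the sketch does not contain: any knowledge of why the service is slow. Turning `rate` down protects the servers, and the customer pays for it.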
  • 14. Amazon’s EBS Outage 10/22/2012: Customer Issues
      # By now the customer is getting a double whammy.
      ## One, they were already experiencing slow responses from the service.
      ## Two, now the throttling has really made it worse for them.
  • 15. Amazon’s EBS Outage 10/22/2012: EBS Failover
      # This situation continues, and the IT staff still doesn’t know why the EBS service is having issues.
      # The customers’ situation gets worse.
      # And now the IT guys decide to punt again (like throttling).
      ## They force a FA/HA of the EBS service (servers).
      ## Keep in mind they still don’t know what is wrong... grasping at straws twice now.. remember this for later.
      ## Notice that the new HA/FA server is now in place (show the arrow).
      ## Complexity strikes again... This is a classic IT outage scenario, where something seems to be fixed when it really isn’t.
      ## The new FA server seems to have solved the problem. The new server is not slow at first...
      ## However, what they don’t know is that all they have done is delay the inevitable.
      # The memory leak just starts all over again on the FA server.
      ## Customer is still orange, mainly because of throttling...
  • 16. Amazon’s EBS Outage 10/22/2012: Twitter Effect
      # At this point we start getting what are called indirect effects.
      ## The first effect (and this was in the RCA) is that users tend to use more services when a potential outage is perceived. They start testing more services, trying other services.
      ## The next indirect effect is what I call the Twitter effect. That is, the outage starts trending on Twitter and everyone in the extended system starts trying to kick the tires on AWS.. Let’s start up Netflix, I wonder if GitHub is working OK...
  • 17. Amazon’s EBS Outage 10/22/2012: Failover Server Dies
      # And of course the FA server eventually gets to the same state as the original EBS server.
      # Meanwhile it is very likely that the IT staff still does not know why this is all happening.
  • 18. Amazon’s EBS Outage 10/22/2012: Systemic Outage
      # Now our complete system is in a systemic failure...
      # Ironically the original failed-over FMS is just fine (no red there). No one is using it.. remember why?
      10 Minutes
  • 19. Understanding Complexity
      # So let’s talk about complexity from a theoretical standpoint.
      # Typically humans think linearly. Our first instinct is that it’s always an X->Y.
      # One variable X will change the outcome Y (Y is the dependent variable).
      -- That goes for a new improvement (change, bug fix, maintenance, etc.)
      -- An emergency (like the Amazon issues)
      -- A new product, feature, etc.
  • 20. Understanding Complexity (notes repeat slide 19)
  • 21. Understanding Complexity (notes repeat slide 19)
  • 22. Understanding Complexity (notes repeat slide 19)
  • 23. Understanding Complexity
      # However, in real life it’s never really X->Y. In real life you get many variables.
      # In statistics this is referred to as "don’t confuse correlation with causation."
      # X->Y is correlation, but it’s dangerous to assume it’s causation..
      # Real life is not that simple... we call it the messiness of life.
      # X1 A simple server failure
      # X2 The failover
      # X3 The DNS
      Deming wrote of Chanticleer, the barnyard rooster who had a theory. He crowed every morning, putting forth all his energy, flapped his wings. The sun came up. The connexion was clear: His crowing caused the sun to come up. There was no question about his importance. There came a snag. He forgot one morning to crow. The sun came up anyhow.
  • 24. Understanding Complexity (T1, T2)
      # You are also likely to get time-dependent variables that add to the complexity.
      # X1-X3 happen at T1 and X4-X5 happen at T2.
      # X4 is the memory leak.
      # X5 The dreaded throttling...
  • 25. Understanding Complexity (T1, T2)
      # There are also indirect effects on the dependent variable (Y).
      # For example, X1 in concert with X4 can jointly affect the dependent variable Y.
      ## X1 changes X4, and the combined effect on Y is different.
      ## Same with X5.
      # This is a different model than a simple X->Y.
      # X? The customers respond with more usage
      # X? The Twitter effect
      15 Minutes
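One way to write down the difference between the two models, as a sketch only. The coefficients (the betas) and the error term are illustrative notation that is not on the slides, and the interaction shown is just the X1-with-X4 example from the notes above.

```latex
\begin{align}
\text{simple view:}\quad & Y = \beta_0 + \beta_1 X_1 + \varepsilon \\
\text{with an interaction:}\quad & Y = \beta_0 + \sum_{i=1}^{5} \beta_i X_i + \beta_{14}\, X_1 X_4 + \varepsilon
\end{align}
```

In the second line the effect of X4 on Y depends on the value of X1, which is exactly what a plain X->Y reading misses.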
  • 26. W. Edwards Deming (1900 – 1993)
      • Father of Quality
      • Understanding of the system
      • Understanding variation
      • Understanding human behavior
      • Introduced sampling into the US Census
      • WWII success credited to his quality approach
      • Taught Japan after WWII and transformed quality
      • In 1980 sparked the American quality revolution
      • The foundations of Six Sigma
      # There is a tool that has been used by successful companies like Toyota (lean) and many others.
      # Dr. Edwards Deming gave us such a lens to break down complexity.
      ## (It brings the real world into focus, just like a camera lens does.)
      20 Minutes
  • 27. System of Profound Knowledge (SoPK)
      # Let’s say, a lens for improvement of something (an enhancement, a bug fix, a new product idea).
      # An outcome X->Y.
      # Dr. Deming gave us a tool called "The System of Profound Knowledge."
      # Just common sense... However, Mark Twain said there is nothing common about common sense.
      # SoPK is a lens to break down complexity and give ourselves an advantage, so we do not oversimplify what we are trying to do. In other words, it clears up the messiness of real life, just like a camera lens does.
  • 28. Knowledge of a System
      • Systems Thinking
      • End to End Value Stream
      • What is the Aim of a System?
      • The Purpose of the System
      • Global Optimization
      • Not Local Optimization
      (S) Appreciation of a System. Systems thinking. Deming would say: understand the AIM of a system. Deming said every system must have an AIM. Is your AIM to keep a server up or to protect a customer SLA? (They might not be the same thing, as we will soon see.) Eli Goldratt (TOC) would say global optimization over local optimization, even if that leaves a subsystem suboptimal. Understanding subsystems and dependent systems.
  • 29. Knowledge of a System
      One big exercise in non-systems thinking... Clearly there were independent views of the system. What was the AIM of this system? Did the hardware guys have the same aim as the core services guys?
      # Lens #1 Not having a systems view. Not seeing this as dependent systems. You might say surely they had automation for DNS. However, I would say no. Because...
      # Lens #2 The hardware guys should know that they are part of a bigger system than just the hardware monitor. They had code on a core service: was it smoke tested, immune tested? Was there a systems view for QA and smoke testing of agent code changes?
      # Lens #3 (X->Y) Humans try to correct the memory issue with throttling, and they don’t understand hardware monitoring as a subsystem.. local optimization....
  • 30. Knowledge of a System (notes repeat slide 29)
  • 31. Knowledge of Variation
      • There is always Variation
      • Special Cause Variation
      • Common Cause Variation
      • Understanding Variation
      Continuous improvement requires the understanding of variation. You have a power outage and it takes key personnel a long time to get to the data center. That is special cause variation. A bad reaction would be to create a new policy that all personnel live within 5 minutes of the data center (i.e., treating it like common cause). Conversely, firing a new programmer who brings down a production system would be treating common cause as a special cause situation. More than likely it was bad safeguards, insufficient training....
      (V) Variation. Not understanding variation is the root of all evil. Deming would get mad at people for knee-jerk reactions due to not understanding the kind of variation. How do you understand variation? Statistics (primarily standard deviation and its relationship to a process, i.e., its distribution).
      To give you an example: a large cloud provider rate-limits API calls at 100 (why 100?) per (x). For most customers that’s fine; others, however, get treated as a DDoS. Where did they get 100? It had to be a guess. If they understood SPC (variation) they might derive the number from data and have a CI process in place for when they found special cause variation.
  • 32. Control Chart
      • Approximately 99% - 100% of the values will fall within 3 standard deviations of the mean
      • Approximately 90% - 98% of the values will fall within 2 standard deviations of the mean
      • Approximately 60% - 78% of the values will fall within 1 standard deviation of the mean
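A rough sketch of the control-chart idea, assuming you have a baseline of measurements taken while the process was behaving (for example API calls per interval, or agent memory). `control_limits` and `special_cause_points` are hypothetical helper names, and the 3-sigma default is the usual control-chart convention. For data that is roughly normal, about 99.7% of values fall inside those limits.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3):
    """Return (lower, center, upper) limits from in-control baseline data."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center, center + sigmas * spread


def special_cause_points(values, limits):
    """Indexes of values that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]
```

Points inside the limits are treated as common cause variation (leave the process alone); points outside are candidates for special cause and worth a look. A limit like "100 calls per (x)" could then come from measured data instead of a guess.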
  • 33. Knowledge of Variation
      The biggest issue here is the knee-jerk reactions... Throttling and the forced EBS server failover. They didn’t understand the type of variation.
      # Lens #1 The systems guys don’t understand common vs. special cause variation.. they react to an "S" that should have been a "C". Turns out ... monitoring sub-processes, looking at individual monitors... e.g., they might have gone from 95% to 96%, which caused the issue. However, if they were looking at the individual agent memory they ...
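A sketch of why the individual monitor matters, under the assumption that per-process memory samples are available at regular intervals (the RCA does not describe AMZN’s monitoring in this detail). A server that always runs hot hides a one-point move from 95% to 96%, while the agent’s own memory shows a steady climb. `growth_per_sample` and `looks_like_leak` are hypothetical names, and the threshold is something you would have to choose from your own data.

```python
def growth_per_sample(samples):
    """Least-squares slope of the samples against their index."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples, min_growth):
    """True if memory grows by at least `min_growth` units per sample."""
    return growth_per_sample(samples) >= min_growth
```

For example, agent memory of 50, 60, 70, 80, 90, 100 MB has a slope of 10 MB per sample, while the whole-server percentage over the same window barely moves.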
  • 34. Theory of Knowledge
      • Scientific Method
      • Knowledge Must Have Theory
      • Theory Must Have Prediction
      • Prediction Must Have Tests
      • Aim-->Measure-->Change
      (K) is the simplest but the hardest for most people to understand. Simply put, it is applying the scientific method to everything you do. Deming says you must have theory to have knowledge, you can’t have knowledge without prediction, and a prediction without a test is useless. PDSA; others call it AMC: Aim, Measure (a. what process are you going to change, b. measure if the change worked), Change. You have to test any improvement to see if it worked, failed or did nothing. Imagine someone starting a failover system with automation but not testing to see if it really worked (could never happen).
  • 35. Theory of Knowledge
      # Lens #1 (K) Theory cannot be an unmeasured guess. Whoever did the failover (automation or manual) apparently didn’t have a proper measure for success. They should have verified that the agents were actually using the new server (duh).
      # Lens #2 Measures without results are not fixes (throttling). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened?
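A minimal sketch of "a proper measure for success" for the fleet server failover: predict that the name now resolves to the failover server, then test the prediction. The hostname and addresses in the usage note are made up, and `failover_verified` is a hypothetical helper; only `socket.gethostbyname` is a real standard-library call.

```python
import socket


def failover_verified(hostname, expected_ip, resolve=socket.gethostbyname):
    """True only if `hostname` currently resolves to the failover server."""
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # A name that does not resolve is not a verified failover.
        return False
```

Usage would look like `failover_verified("fleet.example.internal", "10.0.0.2")`. To mean anything, the check should run where the HW agent runs (on the EBS server), since that is the view of DNS that was stale in this story.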
  • 36. Theory of Psychology
      • Understanding Behavior
      • Understanding Tribes
      • Understanding Worldviews
      (P) Another easy one to state but the hardest to implement. Understanding behavior: why people do the things they do. Tribal behavior: things that are important to one group might not be important to other groups. Understanding human behavior (another lens factor). Worldviews: imagine a server that has software on it from two totally different dev groups. Further imagine these two groups’ worldviews are far apart. One does agile, CI, TDD, BDD, CD; the other has never even heard of those things. (Groupthink experiment)...
  • 37. Theory of Psychology
      # Lens #1 (P) We could argue that because the fleet servers are managed by hardware guys and DNS by systems guys, maybe they are different cultural tribes and don’t understand the importance of each other’s work. Maybe they don’t go to lunch together.
      # Lens #2 (P) Hardware developers (agent): are they doing TDD/CD like the EBS devs? Do they have the same theories? Do the EBS guys do CD smoke testing with the hardware monitoring agents?
      # Lens #3 (P) Tribal understanding of behavior differences between hardware guys and systems guys.
      # Lens #4 (P) Not understanding customer behavior.. Customers increase their actions (API calls): testing services, QA, smoke tests.
  • 38. Amazon’s Outage 10/22/2012: Let’s Review
      # X1 A simple server failure
      # X2 The failover
      # X3 The DNS
      # X4 The memory leak
      # X5 Bad TDD hygiene by FMS eng/dev
      # X6 The dreaded throttling...
      # X7 The customers respond with more usage
      # X8 EBS server failover
      # X9 The Twitter effect
      # The complexity was masked.
      # This was not an X->Y.
      # Too bad they had not read Deming...
  • 39. Amazon’s Outage 10/22/2012: Let’s Review X->Y (notes repeat slide 38)