Analyzing a Complex Cloud Outage - CloudStack Collaboration Conference - Vegas

Analyzing a Complex Cloud Outage
@botchagalupe
VP of Services enStratus

1

Saturday, December 1, 12
John Willis
Call me Botchagalupe
VP of Services

WHO AM I

2

30 yrs in IT
Ubuntu cloud evangelist startup
Opscode
DTO awesome dudes
enStratus GR called..

GOALS

• Look at a complex cloud outage.
• Understanding complexity.
• Analyze a complex cloud outage.

3

Review bullets...

Amazon’s EBS Outage 10/22/2012

4

Fed Reserve story, just the WSJ part
Amazon outages are a big deal
Minecraft


5



The EBS System

6

#Let’s take a look at the value stream (VS) of the service that failed on 10/22.
#In the middle we see the green storage server.
#This box is a simplified view of a larger service (meaning many servers; keep it simple for now).
#We always try to look at a VS from right to left (from the customer back).
#In this example customers use something called EBS (block storage, a cloud-based SAN).
#Next, the thing that is often left out of most value streams: the humans. In this example they were an integral part of the system.
#Next we have an operations monitoring database (disk): things the humans need to know about.
#Could talk about autonomation (pre-automation): Sakichi Toyoda’s loom stopped automatically if threads broke or ran out.
#Next there is the EBS server failover machine. Most production systems in large IT centers will have failover (FA).
#It is important to realize that there are core services on this box (e.g., EBS) and non-core services.
##Non-core services are things like the operations monitoring agent that feeds into the monitoring disk DB.
##Also, as we will see in a minute, there is a hardware agent on this EBS server for detecting hardware failures.
#Next is a fleet monitoring server (FMS): basically a hardware monitor that can phone home or auto-order defective parts from the manufacturer (in large infrastructures like Amazon, Google, Facebook this is common).
#It has a FA server too.


Server Failure

7

#On this one fine afternoon the fleet server died. Remember, it is receiving data from the HW agents on the EBS server.


Server Failover

8

#At this point there is most likely an automated failover (FA/HA).
#We see that the FA server is now supposed to be logically in the VS (the new arrow).
#Any systems thinkers out there see the first problem with this red circle?


DNS Propagation Failure

9

#The FA/HA seems to work flawlessly, from the fleet monitoring (FM) guys’ perspective.
#However, our second problem happens: DNS does not update its records correctly.
#Therefore the HW agent running on the EBS server is still pointing to the down fleet server. (Everyone see that?)


Agent Memory Leak

10

#Now a third problem jumps in (the yellow box).
#When the HW agent tries to write back to the wrong FMS (the dead one), some not-well-tested code fails and creates a memory leak.
#To make matters worse, this particular agent is designed to be fault tolerant. In other words, it should fail silently and not disrupt any core service. E.g., it is designed so that it is OK to fail if it can’t send to the FMS. The assumption is that it will get it next time.
#Now you can start seeing the first level of a complex system emerge.
##The first issue seems to be fixed (the FMS FA).
##DNS isn’t showing up as a failure on anybody’s dashboard.
##And we have a silent error occurring on one of our core servers (a customer service).
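
A minimal sketch of how "fail silently, get it next time" can be kept from eating memory. The RCA does not describe the agent's code, so the class, its names, and the limit of 1000 are all hypothetical. The idea is only that undelivered reports live in a bounded queue and that drops are counted, so the silent failure still shows up on somebody's dashboard.

```python
from collections import deque


class ReportBuffer:
    """Holds hardware reports that could not be delivered yet (hypothetical)."""

    def __init__(self, max_pending: int = 1000):
        # A deque with maxlen discards the oldest entry instead of growing forever.
        self.pending = deque(maxlen=max_pending)
        self.dropped = 0  # exposed so monitoring can see the silent failure

    def add(self, report: str) -> None:
        if len(self.pending) == self.pending.maxlen:
            self.dropped += 1  # the oldest report is about to be discarded
        self.pending.append(report)

    def flush(self, send) -> int:
        """Try to deliver pending reports; stop quietly at the first failure."""
        sent = 0
        while self.pending:
            try:
                send(self.pending[0])
            except OSError:
                break  # fail silently, but memory use stays bounded
            self.pending.popleft()
            sent += 1
        return sent
```

With a dead FMS, flush() returns 0 on every attempt, the queue never holds more than max_pending reports, and dropped keeps climbing, which is a number someone can alert on.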


EBS Service is slowing down

11

#The memory leak continues undetected and eventually starts slowing down the EBS service because of low memory.
#Key point here: this is probably still not detected by the IT staff, and maybe it’s just starting to annoy customers, but not enough to turn the customer box yellow (yet).


Throttling the API

12

#At some point the IT staff notices the slowdown. We would hope before the customers complain, and in AMZN’s case that is probably true (they are pretty good).
#However, as we said earlier, they still don’t know why it’s slow.
#Another bit of complexity is introduced here.
##The EBS servers always run hot (high) on memory. Therefore the undetected memory leak most likely goes unnoticed at this point. (We will discuss this in detail when we get to the analysis part of this preso.)
##From AMZN’s RCA it was pretty clear that this was the case: they had not detected the mem leak.
##Next a human interaction takes place: they (the humans) decide to activate a throttling tool.
##They use this to throttle customer API requests as a stopgap, to give them time to figure out and hopefully fix the issue (the slowdown).


Customer Issues

13

#By now the customer is getting a double whammy.
##One, they were already experiencing slow responses from the service.
##Two, now the throttling has really made it worse for them.


EBS Failover

14

#This situation continues: the IT staff still doesn’t know why the EBS service is having issues.
#The customers’ situation gets worse.
#And now the IT guys decide to punt again (like throttling).
##They force a FA/HA of the EBS service (servers).
##Keep in mind they still don’t know what is wrong. Grasping at straws twice now. Remember this for later.
##Notice that the new HA/FA server is now in place (show the arrow).
##Complexity strikes again. This is a classic IT outage scenario, where something seems to be fixed when it really isn’t.
##The new FA server seems to have solved the problem. The new server is not slow at first.
##However, what they don’t know is that all they have done is delay the inevitable.
#The mem leak just starts all over again on the FA server.
##The customer is still orange, mainly because of throttling.


Twitter Effect

15

#At this point we start getting what are called indirect effects.
##The first effect (and this was in the RCA) is that users tend to use more services when a potential outage is perceived. They start testing more services, trying other services.
##The next indirect effect is what I call the Twitter effect. The outage starts trending on Twitter and everyone in the extended system starts trying to kick the tires on AWS. Let’s start up Netflix, I wonder if GitHub is working OK...


Failover Server Dies

16

#And of course the FA server eventually gets to the same state as the original EBS server.
#Meanwhile it is very likely that the IT staff still does not know why this is all happening.


Systemic Outage

17

#Now our complete system is in a systemic failure.
#Ironically the original failed-over FMS is just fine (no red there). No one is using it. Remember why?
10 Minutes

Understanding Complexity

18

#So let’s talk about complexity from a theoretical standpoint.
#Typically humans think linearly. Our first instinct is that it’s always an X->Y.
#One variable X will change the outcome Y (Y is the dependent variable).
--That is for a new improvement (change, bug fix, maintenance, etc.)
--An emergency (like the Amazon issues)
--A new product, feature, etc.


19

#However, in real life it’s never really X->Y. In real life you get many variables.
#In statistics this is referred to as “don’t confuse correlation with causation”.
#X->Y is correlation, but it’s dangerous to assume it’s causation.
#Real life is not that simple. We call it the messiness of life.

#X1 a simple server failure
#X2 The failover
#X3 The DNS

Deming wrote of Chanticleer, the barnyard rooster who had a theory. He crowed every
morning, putting forth all his energy, flapped his wings. The sun came up. The
connexion was clear: His crowing caused the sun to come up. There was no question
about his importance.
There came a snag. He forgot one morning to crow. The sun came up anyhow.


T1 T2

20

#You also more likely get time-dependent variables that add to the complexity.
#X1-X3 happen at T1 and X4-X5 happen at T2.

#X4 The memory leak
#X5 The dreaded throttling...


T1 T2

21

#There are also indirect effects on the dependent variable (Y).
#For example, X1 in concert with X4 can conjointly affect the dependent variable Y.
##X1 changes X4, and the combined effect on Y is different.
##Same with X5.
#This is a different model than a simple X->Y.

#X? The customers respond with more usage
#X? The Twitter effect
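
One way to write the difference down, as a sketch only. The linear form and the coefficients are assumptions chosen for illustration, not something taken from the RCA. X_1 is the server failure and X_4 the memory leak, as labeled in the notes.

```latex
% Simple X->Y thinking: one cause, one outcome
Y = \beta_0 + \beta_1 X_1 + \varepsilon

% With an interaction between X_1 (server failure) and X_4 (memory leak)
Y = \beta_0 + \beta_1 X_1 + \beta_4 X_4 + \beta_{14} X_1 X_4 + \varepsilon

% The effect of X_4 on Y now depends on the value of X_1
\frac{\partial Y}{\partial X_4} = \beta_4 + \beta_{14} X_1
```

In the second model you cannot say what X_4 does to Y without also knowing X_1, which is the sense in which the variables act conjointly.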

15 Minutes

W. Edwards Deming (1900 – 1993)

• Father of Quality
• Understanding of the system
• Understanding variation
• Understanding human behavior
• Introduced sampling into US Census
• WWII success credited to his quality approach
• Taught Japan after WWII and transformed quality
• In 1980 Transformed American quality revolution
• The Foundations of Six Sigma

22

#There is a tool that has been used by successful companies like Toyota (lean) and many others.
#Dr. W. Edwards Deming gave us such a lens to break down complexity
##(to bring the real world into focus, just like a camera lens does).

20 Minutes

System of Profound Knowledge (SoPK)

23

#Let’s say a lens for improvement of something (an enhancement, a bug fix, a new product idea).
#An outcome X->Y.
#Dr. Deming gave us a tool called “The System of Profound Knowledge”.

#Just common sense. However, Mark Twain said there is nothing common about common sense.

#SoPK is a lens to break down complexity and give ourselves an advantage so we don’t oversimplify what we are trying to do.
#In other words, clear up the messiness of real life just like a camera lens does.

Knowledge of a System

• Systems Thinking
• End to End Value Stream
• What is the Aim of a System?
• The Purpose of the System
• Global Optimization
• Not Local Optimization

24

(S) Appreciation of a System: systems thinking. Deming would say understanding the AIM of a system.
Deming said every system must have an AIM.
Is your AIM to keep a server up or to protect a customer SLA? (They might not be the same thing, as we will soon see.)
Eli Goldratt (TOC) would say global optimization over local optimization, even if that leaves a subsystem suboptimal. Understanding subsystems and dependent systems.

Knowledge of a System

25

One big exercise in non-systems thinking...
Clearly there were independent views of the system.
What was the AIM of this system?
Did the hardware guys have the same aim as the core services guys?

Lens #1 Not having a systems view. Not seeing this as dependent systems. You might say surely they had automation for DNS. However, I would say no, because...

Lens #2 The hardware guys should know that they are part of a bigger system than just the hardware monitor. They had code on a core service. Was it smoke tested, immune tested? Was there a systems view for QA and smoke testing of agent code changes?

Lens #3 (X->Y) Humans try to correct the memory issue with throttling, and they don’t understand hardware monitoring as a subsystem. Local optimization...

Knowledge of Variation

• There is always Variation
• Special Cause Variation
• Common Cause Variation
• Understanding Variation

26

Continuous improvement requires the understanding of variation.

You have a power outage and it takes key personnel a long time to get to the data center. That is special cause variation. A bad reaction would be to create a new policy that all personnel live within 5 minutes of the data center (i.e., treating it like common cause).
Conversely, firing a new programmer who brings down a production system would be treating common cause as a special cause situation. More than likely it was bad safeguards, insufficient training...

(V) Variation: not understanding variation is the root of all evil. Deming would get mad at ppl. Knee-jerk reactions come from not understanding the kind of variation. How do you understand variation? Statistics (primarily the standard deviation and its relationship to a process, i.e., its distribution).
To give you an example: a large cloud provider rate-limits API calls at 100 (why 100?) per (x). For most customers that’s fine; others, however, get treated as a DDoS. Where did they get 100? It had to be a guess. If they understood SPC (variation), they might come up with the number from data and have a CI process in place for when they found special cause variation.

Control Chart

27

• Approximately 99% - 100% of the values will fall within 3 standard deviations of the mean

• Approximately 90% - 98% of the values will fall within 2 standard deviations of the
mean

• Approximately 60% - 78% of the values will fall within 1 standard deviation of the
mean

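
The 3-standard-deviation rule can be sketched in a few lines. This is a simplified illustration: it takes the sample standard deviation of a baseline period as sigma, whereas textbook individuals charts usually estimate sigma from moving ranges. The function names and the sample numbers are made up.

```python
from statistics import mean, stdev


def control_limits(baseline):
    """Return (lower limit, center line, upper limit) at 3 sigma."""
    center = mean(baseline)
    sigma = stdev(baseline)  # simplification: sample standard deviation
    return center - 3 * sigma, center, center + 3 * sigma


def special_cause_points(baseline, samples):
    """Return (index, value) for samples outside the 3-sigma limits."""
    lower, _, upper = control_limits(baseline)
    return [(i, x) for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical measurements from a stable period, then four new readings.
baseline = [50, 52, 49, 51, 50, 48, 50, 51, 49, 50]
flagged = special_cause_points(baseline, [51, 49, 58, 50])
```

Here only the reading of 58 is flagged. The other three sit inside the limits, so they are common cause variation and reacting to them individually would be tampering.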

Knowledge of Variation

28

The biggest issue here is the knee-jerk reactions: throttling and the forced EBS server failover. They didn’t understand the type of variation.

Lens #1 The systems guys don’t understand common vs. special cause variation. They react to an “S” that should have been a “C”.
Turns out it comes down to monitoring sub-processes, looking at individual monitors. E.g., overall memory might have gone from 95% to 96%, which caused the issue. However, if they were looking at the individual agent memory they...
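
The point about individual agent memory can be sketched as a trend check. A leak is a steady climb, so the sketch compares the growth rate of the whole-host number with that of one process. All figures are invented for illustration and the function names are hypothetical. Each series needs at least two samples and a non-zero first value.

```python
def slope(series):
    """Least-squares slope of the series against its sample index."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def relative_growth(series):
    """Growth per sample as a fraction of the starting value."""
    return slope(series) / series[0]


# Invented numbers: host memory in percent, agent memory in MB.
host_pct = [95.0, 95.2, 95.4, 95.6, 95.8, 96.0]
agent_mb = [50.0, 100.0, 150.0, 200.0, 250.0, 300.0]

host_rate = relative_growth(host_pct)    # a fraction of a percent per sample
agent_rate = relative_growth(agent_mb)   # the agent doubles its start size each sample
```

On a box that always runs hot, the host-level series barely moves, while the per-process series for the agent is climbing fast. The same data, looked at one level down, tells a very different story.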

Theory of Knowledge

• Scientific Method
• Knowledge Must Have Theory
• Theory Must Have Prediction
• Prediction Must Have Tests
• Aim-->Measure-->Change

29

(K) is the simplest but the hardest for most ppl to understand. Simply put, it is applying the scientific method to everything you do. Deming says you must have theory to have knowledge, you can’t have knowledge without prediction, and a prediction without a test is useless.

PDSA; others call it AMC: Aim, Measure (a. what process are you going to change, b. measure whether the change worked), Change. You have to test any improvement to see if it worked, failed, or did nothing. Imagine someone starting a failover system with automation but not testing to see if it really worked (could never happen).
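
A minimal sketch of such a test, assuming the agents find the fleet monitoring server through a DNS name. The hostname and IP addresses are made up, and nothing here is taken from Amazon's tooling. The check passes only if the name the agents use actually resolves to the standby.

```python
import socket
from typing import Callable


def resolves_to(hostname: str, expected_ip: str,
                resolver: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True only if hostname currently resolves to expected_ip."""
    try:
        return resolver(hostname) == expected_ip
    except OSError:
        # A failed lookup also counts as a failed failover for the agent.
        return False


def stale_dns(name: str) -> str:
    """Stand-in resolver that still answers with the dead server's address."""
    return "10.0.0.1"


# Hypothetical check after failing over to the standby at 10.0.0.2.
failover_ok = resolves_to("fms.internal.example", "10.0.0.2", resolver=stale_dns)
```

The resolver is a parameter so the check can be exercised without a real DNS server. In this example failover_ok is False: the failover "worked" from the fleet monitoring side, but the agents would still be talking to the dead server.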

Theory of Knowledge

30

Lens #1 (K) Theory cannot be an unmeasured guess. Whoever did the failover (automation or manual) apparently didn’t have a proper measure for success. They should have verified that they were actually using the new server (duh).

Lens #2 Measures without results are not fixes (throttling). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened?
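
The three outcomes can be turned into an explicit check. This is a sketch under assumptions: the metric is something like request latency where lower is better, the baseline is positive, and the 5% tolerance for "stays the same" is an arbitrary choice. The function name and all numbers are hypothetical.

```python
from statistics import mean


def classify_change(before, after, tolerance=0.05, lower_is_better=True):
    """Compare a metric before and after a change: 'better', 'same' or 'worse'."""
    baseline = mean(before)  # assumed to be positive
    change = (mean(after) - baseline) / baseline
    if abs(change) <= tolerance:
        return "same"
    improved = change < 0 if lower_is_better else change > 0
    return "better" if improved else "worse"


# Invented latencies in milliseconds, before and after a stopgap is applied.
outcome = classify_change(before=[100, 110, 105], after=[200, 210, 190])
```

With these made-up numbers the outcome is "worse", which is the signal to back the change out instead of leaving it in place and moving on to the next guess.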

Theory of Psychology

• Understanding Behavior
• Understanding Tribes
• Understanding Worldviews

31

(P) Another easy one, but the hardest to implement. Understanding behavior. Why ppl do the things they do. Tribal behavior. Things that are important to one group might not be important to other groups. Understanding human behavior (another lens factor). Worldviews. Imagine a server that has software on it from two totally different dev groups. Further imagine these two groups’ worldviews are far apart. One does agile, CI, TDD, BDD, CD; the other has never even heard of those things. (groupthink experiment)...

Theory of Psychology

32

Lens #1 (P) We could argue that because the fleet servers are managed by hardware guys and DNS by systems guys, maybe they are different cultural tribes and don’t understand the importance of each other. Maybe they don’t go to lunch together.

Lens #2 (P) Hardware developers (agent): are they doing TDD/CD like the EBS devs do? Do they have the same theories? Do the EBS guys do CD smoke testing with the hardware monitoring agents?

Lens #3 (P) Tribal understanding of behavior differences between hardware guys and systems guys.

Lens #4 (P) Not understanding customer behavior. Customers increase their actions (API calls): testing services, QA, smoke tests.

Amazon’s Outage 10/22/2012

Let’s Review

33

#X2 The failover
#X3 The DNS
#X5 Bad TDD hygiene by FMS eng/dev
#X6 The dreaded throttling...
#X7 The customers respond with more usage
#X8 EBS server failover
#X9 The Twitter effect

#The complexity was masked.
#This was not an X->Y.
#Too bad they had not read Deming...

