On Failure and Resilience


             Mike Brittain
             DIRECTOR OF ENGINEERING, ETSY
             @mikebrittain
             Presented at 37signals on Aug 21, 2012
“Software Infrastructure”
“Framework” code, caching, ORM, file storage tier,
developer tools, CI/deployment, site performance,
             front-end architecture.
Managing failures and building
resilience into systems, applications,
         process, and people.
$61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views




Photo: http://www.etsy.com/shop/TheOldTimeJunkShop
Source: http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
Architecture
Linux, Apache, MySQL, PHP, Postgres,
Solr, Gearman, Memcache, Chef,
Hadoop, EC2/S3/EMR

                      30+ Logical data stores
                 (23 shards + more functionally partitioned)


   Search and storage tiers as “services”
150 Engineers + Designers + Product
                                (this was 20 in Feb 2010)




credit: martin_heigan (flickr)
Buyers, sellers, support,
developer api, i18n,
core infrastructure, storage,
payments, security, fraud
detection, big data and BI,
email delivery, corp IT,
operations, developer tools,
continuous integration and
testing, site performance,
search, advertising, seller
economics, mobile web,
iOS.
Zero Release Managers
There Will Be Fail



                     Credit: wilkee.deviantart.com
We cannot comprehend all of the ways in
which the individual parts of a complex
system will interact. We cannot know all
of the states and scenarios.

We cannot prevent failures.
Yet, we can mitigate them.

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
“Uptime” is not binary.
Functionally Partitioned
[Diagram: separate databases for Convos, Async Tasks, Ads, and Auth]
Master-Master Replication
[Diagram: Convos, Async tasks, Ads, and Auth each shown as a pair of databases]
Sharded Tables
[Diagram: shard1 through shard4, each shown as a pair of databases]
~4% of listing data is stored on shard3.
Outage is limited to ~4% of data set.
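The sharding point, that losing one shard only affects the rows stored there, can be sketched in a few lines of PHP. This is an illustration only: the function names and the modulo placement rule are assumptions, not the lookup scheme from the talk.

```php
<?php
// Hypothetical sketch: names and the modulo placement rule are assumptions.

/** Maps a listing ID to a shard name (assumed: simple modulo placement). */
function shard_for_listing(int $listingId, int $shardCount): string {
    return 'shard' . (($listingId % $shardCount) + 1);
}

/**
 * Returns the shard to read from, or null when that shard is marked down,
 * so callers can degrade only the affected listings.
 */
function readable_shard(int $listingId, int $shardCount, array $downShards): ?string {
    $shard = shard_for_listing($listingId, $shardCount);
    return in_array($shard, $downShards, true) ? null : $shard;
}

// With shard3 down, only listings that live on shard3 are affected.
$down = ['shard3'];
foreach ([100, 101, 102, 103] as $id) {
    $shard = readable_shard($id, 4, $down);
    echo $id, ': ', $shard ?? 'temporarily unavailable', "\n";
}
```

Listings on the healthy shards keep working; only the caller that hits the down shard has to show a fallback.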
“Uptime” is not binary.
Uptime of the application is the
responsibility of our Operations team.
Uptime of the application is the
responsibility of our Operations, Engineering,
Product, and Design teams.

If you are committing code, you are
operating the site.
Branching in Code
“All existing revision control systems
 were built by people who build
 installed software”


                                  Always Ship Trunk
                                      Paul Hammond
                                   Velocity Conf 2010
Config Flags
Enable and disable features quickly.
Features for staff or for beta groups.
Percentage ramp-up of users or requests.
A/B “experiments.”
$cfg['new_search'] = array('enabled' => 'on');
$cfg['sign_in']    = array('enabled' => 'on');
$cfg['checkout']   = array('enabled' => 'on');
$cfg['homepage']   = array('enabled' => 'on');

// Meanwhile...

if ($cfg['new_search']['enabled'] == 'on') {
  # New hotness
  $results = do_solr();
} else {
  # old and boring
  $results = do_grep();
}
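Percentage ramp-up can be layered on the same kind of flag. A minimal sketch, assuming a 'percent' key and a feature_enabled() helper (both made up for illustration; the deck does not show how ramp-ups are implemented):

```php
<?php
// Hypothetical sketch: the 'percent' key and feature_enabled() are assumed names.

$cfg = [];
$cfg['new_search'] = ['enabled' => 'on', 'percent' => 10];
$cfg['checkout']   = ['enabled' => 'off'];

function feature_enabled(array $cfg, string $feature, int $userId): bool {
    $flag = $cfg[$feature] ?? null;
    if ($flag === null || ($flag['enabled'] ?? 'off') !== 'on') {
        return false;
    }
    $percent = $flag['percent'] ?? 100;
    // Hash feature name + user ID so each user lands in a stable bucket (0-99)
    // and different features ramp up on different subsets of users.
    $bucket = (crc32($feature . ':' . $userId) & 0x7fffffff) % 100;
    return $bucket < $percent;
}

foreach ([1, 2, 3] as $userId) {
    $backend = feature_enabled($cfg, 'new_search', $userId) ? 'solr' : 'grep';
    echo "user $userId -> $backend\n";
}
```

Because the bucket depends only on the feature name and user ID, a given user keeps the same experience as the percentage is raised.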
But...
“Doesn’t that mean you have conditionals
 all over your code?”
                  Yes.
                 “Does anyone ever clean those up?”
               Sometimes.
   “That sounds like it sucks.”
                     Really?
   “Wait a minute... all of the counter arguments are
    in Comic Sans. WTF?!?”
                         Oh, you noticed? ;)
DB Server Maintenance, June 16, 2012
00:00   Site down for maintenance
+01:47  Site up, disabled login and registration
+06:40  Site up, some seller tools disabled
+07:41  All features restored
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
“Uptime” is not binary.
Features are launched by flipping a
   config flag, not by deploying
    hundreds of lines of code.
“If Engineering at Etsy has a religion,
 it’s the Church of Graphs.”
                     Ian Malpass, Code as Craft
                                http://etsy.me/ePkoZB
THIS IS HOW YOU RUN A COMPLEX SYSTEM
Config flags, Operator, Metrics
Photo: http://www.flickr.com/photos/flyforfun/2694158656/
Oh, you want to talk about how we collect
metrics and make graphs?


                http://www.slideshare.net/mikebrittain/metricsdriven-engineering
Resilient User Interfaces
Interfaces and user experiences
that adapt to technical and
architectural failure.
class DBConnection_Exception extends Exception {}

class DBConnection extends mysqli {
    /**
     * Creates a database connection.
     */
    public function __construct($host, $user, $pass, $db) {
        parent::__construct($host, $user, $pass, $db);

        if (mysqli_connect_error()) {
            throw new DBConnection_Exception(
                sprintf("Error: %s, %s",
                    mysqli_connect_errno(),
                    mysqli_connect_error()));
        }
    }
}
try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user',
                             'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {

    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}
Site navigation
Logo
Cute Picture
Generic, catch-all error messaging...
Every back-end service is an
  opportunity for failure.
[Figure: diagram with numbered callouts 1 through 14]
Critical Path
#srsly?
< 400 ms
Non-blocking Ajax
Google Calendar
Google Docs
GMail
“Oops, we aren’t able to
access click metrics right
now, do not worry — your
      data is safe.”
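A message like the one above can come from treating a non-critical back-end call as optional rather than letting its exception take down the whole page. A rough PHP sketch, where render_click_metrics() and the fetcher callables are hypothetical names used only for illustration:

```php
<?php
// Hypothetical sketch: render_click_metrics() and the fetchers are assumed names.

function render_click_metrics(callable $fetchClicks): string {
    try {
        $clicks = $fetchClicks();
        return 'Clicks this week: ' . $clicks;
    } catch (Exception $e) {
        // Non-critical module: log the failure, then show a local message
        // instead of rethrowing and failing the entire page.
        error_log('click metrics unavailable: ' . $e->getMessage());
        return "Oops, we aren't able to access click metrics right now, "
             . 'do not worry, your data is safe.';
    }
}

$healthy = function (): int { return 42; };
$broken  = function (): int { throw new RuntimeException('views db down'); };

echo render_click_metrics($healthy), "\n";
echo render_click_metrics($broken), "\n";
```

The rest of the page renders either way; only the one module changes what it shows.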
Product design doesn’t stop
    at 100% availability.
Dev   Ops
Dev         Ops


  Product
Operability Reviews
“What could possibly go wrong?”

What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts?
How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?
“GameDay” Exercises
Pedro
Homepage (95th perc.)

Surprise!!! Turning off multi-language support improves our page
generation times by up to 25%.
(Blameless) Post-Mortems
How could this have gone better?

How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?

What steps do we take to reduce the chance of this
happening again in the future?
“... an engineer who thinks they’re going to be
reprimanded are disincentivized to give the details
necessary to get an understanding of the mechanism,
pathology, and operation of the failure.

This lack of understanding of how the accident occurred
all but guarantees that it will repeat. If not with the
original engineer, another one in the future.”

John Allspaw
VP, Technical Operations, Etsy
http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
We should try to learn not only what went
wrong, but also what went right.
DB Server Maintenance, June 16, 2012
00:00   Site down for maintenance
+01:47  Site up, disabled login and registration
+06:40  Site up, some seller tools disabled
+07:41  All features restored
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
Operational Mindset



Dev   Ops              Product


        Business Priorities
Introspection
page views for error template
...or, how are we screwing our users?
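Counting views of the error template needs only a counter incremented wherever that template is rendered. A sketch assuming a StatsD-style collector listening on UDP: the `name:value|c` counter format is StatsD's, while the metric name, host, and helper functions here are made up for illustration.

```php
<?php
// Hypothetical sketch: metric name, host, and helper names are assumptions.

/** Formats a StatsD-style counter packet, e.g. "pages.error_template.views:1|c". */
function statsd_counter_packet(string $metric, int $count = 1): string {
    return sprintf('%s:%d|c', $metric, $count);
}

/** Fire-and-forget UDP send; metrics must never break the page. */
function send_metric(string $packet, string $host = '127.0.0.1', int $port = 8125): void {
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return;
    }
    @fwrite($socket, $packet);
    fclose($socket);
}

function render_error_template(): string {
    // Count every time a user is shown the generic error page.
    send_metric(statsd_counter_packet('pages.error_template.views'));
    return 'Something went wrong. Please try again in a moment.';
}

echo render_error_template(), "\n";
```

Graphing that counter next to deploys and config changes shows how often users actually land on the error page.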
Risk mitigation in a complex system

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
Thank you.
  Mike Brittain
 mike@etsy.com
 @mikebrittain
PHOTO CREDITS
Flickr: roboppy, http://www.flickr.com/photos/51035735481@N01/163374138/
Flickr: jamesjyu, http://www.flickr.com/photos/32593095@N00/3465022/
Flickr: circulating, http://www.flickr.com/photos/26835318@N00/2318226026/

UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 

On Failure and Resilience

  • 1. On Failure and Resilience. Mike Brittain, Director of Engineering, Etsy (@mikebrittain). Presented at 37signals on Aug 21, 2012.
  • 2. “Software Infrastructure”: “framework” code, caching, ORM, file storage tier, developer tools, CI/deployment, site performance, front-end architecture.
  • 3. Managing failures and building resilience into systems, applications, process, and people.
  • 5. $61 M in goods sold in the marketplace. 2.9 M items sold. 1.2 B page views. Photo: http://www.etsy.com/shop/TheOldTimeJunkShop Source: http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
  • 6. Architecture: Linux, Apache, MySQL, PHP, Postgres, Solr, Gearman, Memcache, Chef, Hadoop, EC2/S3/EMR. 30+ logical data stores (23 shards + more functionally partitioned). Search and storage tiers as “services.”
  • 7. 150 Engineers + Designers + Product (this was 20 in Feb 2010). Photo credit: martin_heigan (Flickr)
  • 8. Buyers, sellers, support, developer api, i18n, core infrastructure, storage, payments, security, fraud detection, big data and BI, email delivery, corp IT, operations, developer tools, continuous integration and testing, site performance, search, advertising, seller economics, mobile web, iOS.
  • 11. There Will Be Fail. Credit: wilkee.deviantart.com
  • 12. We cannot comprehend all of the ways in which the individual parts of a complex system will interact. We cannot know all of the states and scenarios. We cannot prevent failures.
  • 13. Yet, we can mitigate them. Redundant system architectures. Small, well-understood changes to production. Control application using config flags. Gratuitous metrics collection. Resilient user interfaces. GameDay exercises.
  • 15. Functionally Partitioned: Async, Convos, Ads, Auth, Tasks
  • 17. Master-Master Replication (diagram: two servers each for Async, Convos, Ads, Auth, and Tasks)
  • 20. Sharded Tables: shard1, shard2, shard3, shard4 (diagram: two servers per shard). ~4% of listing data is stored on shard3.
  • 22. Sharded Tables: Outage is limited to ~4% of data set.
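The "~4%" figure is consistent with the 23 shards mentioned on the architecture slide, since 1/23 is about 4.3%. A minimal sketch of why a single shard outage stays contained, assuming listings are spread evenly across shards. The modulo placement and every function name here are hypothetical, chosen only for illustration, and are not Etsy's actual shard-lookup scheme:

```php
<?php
// Assumption: modulo placement stands in for whatever lookup the real system uses.
// The shard count (23) comes from the architecture slide.
const SHARD_COUNT = 23;

/** Maps a listing ID to a shard number in 1..SHARD_COUNT. */
function shard_for_listing(int $listingId): int
{
    return ($listingId % SHARD_COUNT) + 1;
}

/** True if the listing's shard is not among the shards that are down. */
function listing_available(int $listingId, array $downShards): bool
{
    return !in_array(shard_for_listing($listingId), $downShards, true);
}

/** Share of the data set affected by an outage, assuming even distribution. */
function fraction_unavailable(array $downShards): float
{
    return count(array_unique($downShards)) / SHARD_COUNT;
}

$down = [3]; // shard3 is out
printf("%.1f%% of listings unavailable\n", 100 * fraction_unavailable($down));
var_dump(listing_available(2, $down)); // listing 2 maps to shard 3
var_dump(listing_available(3, $down)); // listing 3 maps to shard 4
```

Requests for listings on the other 22 shards never touch the failed one, which is what keeps the outage partial rather than site-wide.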
  • 24. Uptime of the application is the responsibility of our Operations team.
  • 26. Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams. If you are committing code, you are operating the site.
  • 28. “All existing revision control systems were built by people who build installed software.” Paul Hammond, “Always Ship Trunk,” Velocity Conf 2010
  • 29. Config Flags: Enable and disable features quickly. Features for staff or for beta groups. Percentage ramp-up of users or requests. A/B “experiments.”
  • 30. $cfg['new_search'] = array('enabled' => 'on');
        $cfg['sign_in'] = array('enabled' => 'on');
        $cfg['checkout'] = array('enabled' => 'on');
        $cfg['homepage'] = array('enabled' => 'on');
  • 31. $cfg['new_search'] = array('enabled' => 'on');

        // Meanwhile...
        // Check the flag's value: the array itself is always truthy.
        if ($cfg['new_search']['enabled'] == 'on') {
            # New hotness
            $results = do_solr();
        } else {
            # old and boring
            $results = do_grep();
        }
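The Config Flags slide also lists staff-only features and percentage ramp-ups, which the on/off example does not show. One way those could work is sketched below. The 'staff' and '25%' values, the `feature_enabled()` helper, and the hashing scheme are all assumptions made for illustration. Only 'on' appears in the slides.

```php
<?php
// Hypothetical flag format: 'enabled' may be 'on', 'off', 'staff',
// or a percentage string such as '25%'.
$cfg = [
    'new_search' => ['enabled' => '25%'],
    'checkout'   => ['enabled' => 'on'],
    'new_header' => ['enabled' => 'staff'],
];

function feature_enabled(array $cfg, string $name, int $userId, bool $isStaff = false): bool
{
    $setting = $cfg[$name]['enabled'] ?? 'off'; // unknown flags default to off

    if ($setting === 'on') {
        return true;
    }
    if ($setting === 'staff') {
        return $isStaff;
    }
    if (is_string($setting) && preg_match('/^(\d{1,3})%$/', $setting, $m)) {
        // Hash the flag name together with the user ID so each user lands in
        // a stable bucket (0-99) per feature and features ramp independently.
        $bucket = (crc32($name . ':' . $userId) & 0x7fffffff) % 100;
        return $bucket < (int) $m[1];
    }
    return false; // 'off' or anything unrecognized
}

if (feature_enabled($cfg, 'new_search', 1234)) {
    echo "new search\n";
} else {
    echo "old search\n";
}
```

Because the bucket is derived from the user ID rather than a random draw, a given user sees the same variant on every request, and raising the percentage only ever adds users to the enabled group.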
  • 36. “Doesn’t that mean you have conditionals all over your code?” Yes. “Does anyone ever clean those up?” Sometimes. “That sounds like it sucks.” Really? “Wait a minute... all of the counter arguments are in Comic Sans. WTF?!?” Oh, you noticed? ;)
  • 37. DB Server Maintenance, June 16, 2012. 00:00 Site down for maintenance. +01:47 Site up, disabled login and registration. +06:40 Site up, some seller tools disabled. +07:41 All features restored. http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
  • 39. Features are launched by flipping a config flag, not by deploying hundreds of lines of code.
  • 40. “If Engineering at Etsy has a religion, it’s the Church of Graphs.” Ian Malpass, Code as Craft. http://etsy.me/ePkoZB
  • 43. THIS IS HOW YOU RUN A COMPLEX SYSTEM http://www.flickr.com/photos/flyforfun/2694158656/
  • 44. Config flags Operator Metrics http://www.flickr.com/photos/flyforfun/2694158656/
  • 45. Oh, you want to talk about how we collect metrics and make graphs? http://www.slideshare.net/mikebrittain/metricsdriven-engineering
  • 47. Interfaces and user experiences that adapt to technical and architectural failure.
  • 54. /**
         * Creates a database connection.
         */
        public function __construct($host, $user, $pass, $db) {
            parent::__construct($host, $user, $pass, $db);
            if (mysqli_connect_error()) {
                throw new DBConnection_Exception(
                    sprintf("Error: %s, %s", mysqli_connect_errno(), mysqli_connect_error()));
            }
        }
  • 55. try {
            $conn = new DBConnection('viewsdb.host', 'db_read_user', 'ssssshh!', 'views_db');
        } catch (DBConnection_Exception $e) {
            // TODO: Someone should figure out what to do if
            // we can't connect to the views db.
            throw $e;
        }
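Rethrowing from the catch block turns an unavailable views database into a failed page. A minimal sketch of the alternative the talk argues for, in which the page degrades instead of failing. The `listing_view_count()` and `render_listing_stats()` helpers, the fallback message, and the simulated failure are hypothetical. Only the `DBConnection_Exception` name comes from the slides.

```php
<?php
// Stand-in for the exception class shown on the slides.
class DBConnection_Exception extends Exception {}

/**
 * Returns the view count for a listing, or null if the views database is
 * unavailable. $fetchViews is a hypothetical callable that runs the query.
 */
function listing_view_count(callable $fetchViews, int $listingId): ?int
{
    try {
        return $fetchViews($listingId);
    } catch (DBConnection_Exception $e) {
        error_log('views db unavailable: ' . $e->getMessage());
        return null; // degrade instead of rethrowing
    }
}

/** Renders the stats line, with a friendly fallback when the count is unknown. */
function render_listing_stats(?int $views): string
{
    return $views === null
        ? 'Views are temporarily unavailable.'
        : sprintf('%d views', $views);
}

// Simulate the views database being down.
$brokenFetch = function (int $id): int {
    throw new DBConnection_Exception('Error: 2002, Connection refused');
};
echo render_listing_stats(listing_view_count($brokenFetch, 123)), "\n";
```

The rest of the page renders normally. Only the one module that depends on the failed back end changes its output.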
  • 58. Site navigation Logo Cute Picture Generic, catch-all error messaging....
  • 60. Every back-end service is an opportunity for failure.
  • 64. 1 4 9 5 8 6 2 3 10 4 11 7 14 7 13 12
  • 73. Google Calendar Google Docs
  • 74. GMail
  • 75. “Oops, we aren’t able to access click metrics right now, do not worry — your data is safe.”
  • 76. Product design doesn’t stop at 100% availability.
  • 77. Dev Ops
  • 78. Dev Ops Product
  • 79. 1 4 9 5 8 6 2 3 10 4 11 7 14 7 13 12
  • 82. “What could possibly go wrong?” What is changing about the architecture? What kind of data access patterns are we using? How much traffic, how many queries? What metrics are we collecting? Are there automated alerts? How do we know the thresholds are right? How do we turn it off? ...and what happens when we do?
  • 85. Pedro
  • 86. Homepage (95th perc.): Surprise!!! Turning off multi-language support improves our page generation times by up to 25%.
  • 88. How could this have gone better? How quickly did we find out that something was wrong? Did we communicate well to our visitors and each other? Why did we have confidence that what we were doing was OK? Did we have the right tools, did we use them properly? Did we collect metrics, and could we find them? Where did we make the wrong decisions? What steps do we take to reduce the chance of this happening again in the future?
  • 89. “... an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.” John Allspaw, VP, Technical Operations, Etsy. http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
  • 90. We should try to learn not only what went wrong, but also what went right.
  • 91. DB Server Maintenance, June 16, 2012. 00:00 Site down for maintenance. +01:47 Site up, disabled login and registration. +06:40 Site up, some seller tools disabled. +07:41 All features restored. http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
  • 92. Operational Mindset Dev Ops Product
  • 93. Operational Mindset Dev Ops Product Business Priorities
  • 96. page views for error template ...or, how are we screwing our users?
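Counting how often the error template is served takes only a few lines. The sketch below assumes a StatsD-style collector listening on UDP port 8125 and uses the StatsD plain-text counter format (`name:count|c`). The metric name, the helper functions, and the choice of StatsD itself are assumptions, since the slides do not say how this number was collected.

```php
<?php
/** Formats a counter increment in the StatsD plain-text format: "<name>:<n>|c". */
function statsd_counter_packet(string $metric, int $count = 1): string
{
    return sprintf('%s:%d|c', $metric, $count);
}

/** Sends a packet over UDP, fire-and-forget. Returns false if it could not be sent. */
function statsd_send(string $packet, string $host = '127.0.0.1', int $port = 8125): bool
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return false; // never let metrics collection break the page
    }
    $written = @fwrite($socket, $packet);
    fclose($socket);
    return $written !== false;
}

/** Renders the generic error page and counts the view. */
function render_error_template(): string
{
    // Hypothetical metric name.
    statsd_send(statsd_counter_packet('errors.template.page_views'));
    return 'Oops, something went wrong.';
}

echo render_error_template(), "\n";
```

UDP suits this because the sender does not wait for the collector, so a metrics outage costs the page almost nothing.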
  • 97. Risk mitigation in a complex system: Redundant system architectures. Small, well-understood changes to production. Control application using config flags. Gratuitous metrics collection. Resilient user interfaces. GameDay exercises.
  • 98. Thank you. Mike Brittain mike@etsy.com @mikebrittain
  • 102. Photo credits: Flickr: roboppy, http://www.flickr.com/photos/51035735481@N01/163374138/; Flickr: jamesjyu, http://www.flickr.com/photos/32593095@N00/3465022/; Flickr: circulating, http://www.flickr.com/photos/26835318@N00/2318226026/