On Failure and Resilience


             Mike Brittain
             DIRECTOR OF ENGINEERING, ETSY
             @mikebrittain
             Presented at 37signals on Aug 21, 2012
“Software Infrastructure”
“Framework” code, caching, ORM, file storage tier,
developer tools, CI/deployment, site performance,
             front-end architecture.
Managing failures and building
resilience into systems, applications,
         process, and people.
$61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views




                                                           Photo: http://www.etsy.com/shop/TheOldTimeJunkShop

  http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
Architecture
Linux, Apache, MySQL, PHP, Postgres,
Solr, Gearman, Memcache, Chef,
Hadoop, EC2/S3/EMR

                      30+ Logical data stores
                 (23 shards + more functionally partitioned)


   Search and storage tiers as “services”
150 Engineers + Designers + Product
                                (this was 20 in Feb 2010)




credit: martin_heigan (flickr)
Buyers, sellers, support,
developer api, i18n,
core infrastructure, storage,
payments, security, fraud
detection, big data and BI,
email delivery, corp IT,
operations, developer tools,
continuous integration and
testing, site performance,
search, advertising, seller
economics, mobile web,
iOS.
Zero Release Managers
There Will Be Fail



                     Credit: wilkee.deviantart.com
We cannot comprehend all of the ways in
which the individual parts of a complex
system will interact. We cannot know all
of the states and scenarios.

We cannot prevent failures.
Yet, we can mitigate them.

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
“Uptime” is not binary.
[Diagram: separate databases for Convos, Async Tasks, Ads, and Auth]
Functionally Partitioned
[Diagram: Convos, Async tasks, Ads, and Auth, each shown as a pair of servers]
Master-Master Replication
[Diagram: shard1 through shard4, each shown as a pair of servers]
~4% of listing data is stored on shard3
Outage is limited to ~4% of data set
Sharded Tables
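A minimal sketch of why a single shard outage stays small. The modulo placement rule and the function names `shard_for` and `locate_listing` are assumptions for illustration; the slides do not say how rows are assigned to shards. The shard count of 23 comes from the architecture slide, and with an even spread each shard would hold roughly 1/23 of the data, which is about 4%.

```php
<?php
// Hypothetical placement rule: listing id modulo shard count.
// The real assignment scheme is not described in the slides.
function shard_for($listing_id, $shard_count) {
    return 'shard' . (($listing_id % $shard_count) + 1);
}

// Returns the listing's shard name, or null when that shard is marked down,
// so callers can degrade only the affected listings.
function locate_listing($listing_id, array $down_shards, $shard_count = 23) {
    $shard = shard_for($listing_id, $shard_count);
    return in_array($shard, $down_shards, true) ? null : $shard;
}

// With shard3 down, count how many of 2300 sample listings are unreachable.
$unreachable = 0;
for ($id = 1; $id <= 2300; $id++) {
    if (locate_listing($id, array('shard3')) === null) {
        $unreachable++;
    }
}
printf("%d of 2300 listings unreachable (%.1f%%)\n",
    $unreachable, 100 * $unreachable / 2300);
// prints "100 of 2300 listings unreachable (4.3%)"
```

Everything on the other 22 shards keeps working, which is the sense in which uptime is not binary.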
“Uptime” is not binary.
Uptime of the application is the
responsibility of our Operations team.
Uptime of the application is the
responsibility of our Operations, Engineering,
Product, and Design teams.

If you are committing code, you are
operating the site.
Branching in Code
“All existing revision control systems
 were built by people who build
 installed software”


                                  Always Ship Trunk
                                      Paul Hammond
                                   Velocity Conf 2010
Config Flags
Enable and disable features quickly.
Features for staff or for beta groups.
Percentage ramp-up of users or requests.
A/B “experiments.”
$cfg['new_search']   =   array('enabled'   =>   'on');
$cfg['sign_in']      =   array('enabled'   =>   'on');
$cfg['checkout']     =   array('enabled'   =>   'on');
$cfg['homepage']     =   array('enabled'   =>   'on');
$cfg['new_search'] = array('enabled' => 'on');

// Meanwhile...

// Compare the flag value; the array itself is always truthy.
if ($cfg['new_search']['enabled'] === 'on') {
  # New hotness
  $results = do_solr();
} else {
  # old and boring
  $results = do_grep();
}
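A minimal sketch of how a flag check could grow to cover staff-only features and percentage ramp-up. The `staff` and `percent` modes, the `percent` key, and the helper `feature_enabled` are assumptions for illustration, not the config format actually used at Etsy.

```php
<?php
// Hypothetical flag definitions; only the 'enabled' => 'on' form
// appears in the slides.
$cfg = array(
    'new_search'   => array('enabled' => 'on'),
    'new_checkout' => array('enabled' => 'staff'),
    'new_homepage' => array('enabled' => 'percent', 'percent' => 10),
);

function feature_enabled(array $cfg, $name, $user_id, $is_staff = false) {
    if (!isset($cfg[$name]['enabled'])) {
        return false; // unknown flags default to off
    }
    $flag = $cfg[$name];
    switch ($flag['enabled']) {
        case 'on':
            return true;
        case 'staff':
            return $is_staff;
        case 'percent':
            // Hash flag name plus user id so each user lands in a
            // stable bucket from 0 to 99 for this feature.
            $bucket = abs(crc32($name . ':' . $user_id)) % 100;
            return $bucket < $flag['percent'];
        default:
            return false; // 'off' or anything unrecognized
    }
}

if (feature_enabled($cfg, 'new_search', 42)) {
    echo "new search\n";
}
```

Hashing on the user id keeps a given user on the same side of the ramp between requests, so raising the percentage only adds users and never flips existing ones back.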
But...
“Doesn’t that mean you have conditionals
 all over your code?”
                  Yes.
                 “Does anyone ever clean those up?”
               Sometimes.
   “That sounds like it sucks.”
                     Really?
   “Wait a minute... all of the counter arguments are
    in Comic Sans. WTF?!?”
                         Oh, you noticed? ;)
00:00   Site down for maintenance
+01:47  Site up, disabled login and registration
+06:40  Site up, some seller tools disabled
+07:41  All features restored

DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
“Uptime” is not binary.
Features are launched by flipping a
   config flag, not by deploying
    hundreds of lines of code.
“If Engineering at Etsy has a religion,
           it’s the Church of Graphs.”
                     Ian Malpass, Code as Craft
                                http://etsy.me/ePkoZB
THIS IS HOW YOU RUN A COMPLEX SYSTEM
http://www.flickr.com/photos/flyforfun/2694158656/
Config flags
                        Operator                       Metrics




http://www.flickr.com/photos/flyforfun/2694158656/
Oh, you want to talk about how we collect
metrics and make graphs?


                http://www.slideshare.net/mikebrittain/metricsdriven-engineering
Resilient User Interfaces
Interfaces and user experiences
that adapt to technical and
architectural failure.
/**
 * Creates a database connection.
 */
public function __construct($host, $user, $pass, $db) {
    parent::__construct($host, $user, $pass, $db);

    if (mysqli_connect_error()) {
        throw new DBConnection_Exception(
            sprintf("Error: %s, %s",
                mysqli_connect_errno(),
                mysqli_connect_error()));
    }
}
try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user',
                             'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {

    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}
Site navigation
           Logo

          Cute Picture

Generic, catch-all
error messaging....
Every back-end service is an
  opportunity for failure.
[Figure: diagram with numbered callouts 1 through 14]
Critical Path
#srsly?
< 400 ms
Non-blocking Ajax
Google Calendar




   Google Docs
GMail
“Oops, we aren’t able to
access click metrics right
now, do not worry — your
      data is safe.”
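A minimal sketch of that kind of fallback: when a non-critical back-end fails, the page region renders a specific, reassuring message instead of rethrowing and taking the whole page down. The exception class `Metrics_Unavailable_Exception` and the function `render_click_metrics` are hypothetical names for illustration.

```php
<?php
// Hypothetical exception raised when the click metrics store is unreachable.
class Metrics_Unavailable_Exception extends Exception {}

// Renders one region of the page. $fetch_clicks is any callable that
// returns a click count or throws Metrics_Unavailable_Exception.
function render_click_metrics(callable $fetch_clicks) {
    try {
        $clicks = $fetch_clicks();
        return sprintf('<p>%d clicks this week</p>', $clicks);
    } catch (Metrics_Unavailable_Exception $e) {
        // Degrade only this region; the rest of the page still renders.
        return "<p>Oops, we aren't able to access click metrics right now. "
             . "Do not worry, your data is safe.</p>";
    }
}

// Usage: a healthy back-end and a failing one.
echo render_click_metrics(function () { return 12; }), "\n";
echo render_click_metrics(function () {
    throw new Metrics_Unavailable_Exception('views db unreachable');
}), "\n";
```

Only the expected failure is caught here; anything else still propagates, so unexpected bugs are not silently hidden behind the friendly message.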
Product design doesn’t stop
    at 100% availability.
Dev         Ops


  Product
Operability Reviews
“What could possibly go wrong?”

What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts?
How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?
“GameDay” Exercises
Pedro
Homepage (95th perc.)

                        Surprise!!!
                        Turning off multi-
                        language support
                        improves our page
                        generation times by
                        up to 25%.
(Blameless) Post-Mortems
How could this have gone better?

How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?

What steps do we take to reduce the chance of this
happening again in the future?
“... an engineer who thinks they’re going to be
reprimanded are disincentivized to give the details
necessary to get an understanding of the mechanism,
pathology, and operation of the failure.

This lack of understanding of how the accident occurred
all but guarantees that it will repeat. If not with the
original engineer, another one in the future.”

                                                                    John Allspaw
                                                       VP, Technical Operations, Etsy


         http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
We should try to learn not only what went
wrong, but also what went right.
00:00   Site down for maintenance
+01:47  Site up, disabled login and registration
+06:40  Site up, some seller tools disabled
+07:41  All features restored

DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
Operational Mindset



Dev   Ops              Product


        Business Priorities
Introspection
page views for error template
...or, how are we screwing our users?
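A minimal sketch of counting error template views so they can be graphed. It assumes a StatsD-style collector listening for UDP counters in the `bucket:value|c` line format; the bucket name, host, port, and both function names are assumptions for illustration, not details from the slides.

```php
<?php
// Formats one counter increment in the StatsD line format "bucket:delta|c".
function statsd_counter_line($bucket, $delta = 1) {
    return sprintf('%s:%d|c', $bucket, $delta);
}

// Fire-and-forget send over UDP. Returns false when the socket cannot be
// opened; a metrics problem should never break the page being rendered.
function send_metric($line, $host = '127.0.0.1', $port = 8125) {
    $sock = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($sock === false) {
        return false;
    }
    fwrite($sock, $line);
    fclose($sock);
    return true;
}

// Called from the error template (hypothetical bucket name).
send_metric(statsd_counter_line('errors.template.views'));
```

UDP suits this because the sender does not wait for an acknowledgement, so a slow or absent collector adds almost nothing to page generation time.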
Risk mitigation in a complex system

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
Thank you.
  Mike Brittain
 mike@etsy.com
 @mikebrittain
PHOTO CREDITS
Flickr: roboppy
http://www.flickr.com/photos/51035735481@N01/163374138/
Flickr: jamesjyu
http://www.flickr.com/photos/32593095@N00/3465022/
Flickr: circulating
http://www.flickr.com/photos/26835318@N00/2318226026/

A Framework for Development in the AI Age
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 

On Failure and Resilience

  • 1. On Failure and Resilience Mike Brittain DIRECTOR OF ENGINEERING, ETSY @mikebrittain Presented at 37signals on Aug 21, 2012
  • 2. “Software Infrastructure” “Framework” code, caching, ORM, file storage tier, developer tools, CI/deployment, site performance, front-end architecture.
  • 3. Managing failures and building resilience into systems, applications, process, and people.
  • 5. $61 M in goods sold in the marketplace; 2.9 M items sold; 1.2 B page views. Photo: http://www.etsy.com/shop/TheOldTimeJunkShop Source: http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
  • 6. Architecture Linux, Apache, MySQL, PHP, Postgres, Solr, Gearman, Memcache, Chef, Hadoop, EC2/S3/EMR 30+ Logical data stores (23 shards + more functionally partitioned) Search and storage tiers as “services”
  • 7. 150 Engineers + Designers + Product (this was 20 in Feb 2010) credit: martin_heigan (flickr)
  • 8. Buyers, sellers, support, developer api, i18n, core infrastructure, storage, payments, security, fraud detection, big data and BI, email delivery, corp IT, operations, developer tools, continuous integration and testing, site performance, search, advertising, seller economics, mobile web, iOS.
  • 11. There Will Be Fail Credit: wilkee.deviantart.com
  • 12. We cannot comprehend all of the ways in which the individual parts of a complex system will interact. We cannot know all of the states and scenarios. We cannot prevent failures.
  • 13. Yet, we can mitigate them. Redundant system architectures. Small, well-understood changes to production. Control application using config flags. Gratuitous metrics collection. Resilient user interfaces. GameDay exercises.
  • 15. Functionally Partitioned: Async, Convos, Ads, Auth, Tasks
  • 17. Master-Master Replication: Async, Convos, Ads, Auth, Tasks (each label appears twice in the diagram)
  • 20. Sharded Tables: shard1, shard2, shard3, shard4 (each label appears twice in the diagram). ~4% of listing data is stored on shard3.
  • 22. Sharded Tables: Outage is limited to ~4% of data set.
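
The "~4%" figure on the sharding slides appears consistent with the 23 shards mentioned on slide 6, since 1/23 is roughly 4.3%. A minimal sketch of that arithmetic follows. The modulo placement and the function names are assumptions for illustration only; the slides do not say how listings are actually assigned to shards.

```php
<?php
// Hypothetical sketch: modulo placement is an assumption for illustration,
// not a description of the production lookup scheme.

const SHARD_COUNT = 23; // from the "23 shards" figure on slide 6

/** Map a listing id to a shard number in the range 1..$shardCount. */
function shard_for_listing(int $listingId, int $shardCount = SHARD_COUNT): int
{
    return ($listingId % $shardCount) + 1;
}

/** Fraction of evenly spread data that is unavailable when one shard is down. */
function outage_fraction(int $shardCount = SHARD_COUNT): float
{
    return 1 / $shardCount;
}

printf("listing 1000 lives on shard%d\n", shard_for_listing(1000));
printf("one shard down affects about %.1f%% of listings\n", outage_fraction() * 100);
```

With data spread evenly, losing one shard leaves the other 22 serving their listings, which is the point of slide 22.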
  • 24. Uptime of the application is the responsibility of our Operations team.
  • 25. Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams.
  • 26. Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams. If you are committing code, you are operating the site.
  • 28. “All existing revision control systems were built by people who build installed software” Always Ship Trunk Paul Hammond Velocity Conf 2010
  • 29. Config Flags Enable and disable features quickly. Features for staff or for beta groups. Percentage ramp-up of users or requests. A/B “experiments.”
  • 30. $cfg['new_search'] = array('enabled' => 'on'); $cfg['sign_in'] = array('enabled' => 'on'); $cfg['checkout'] = array('enabled' => 'on'); $cfg['homepage'] = array('enabled' => 'on');
  • 31. $cfg['new_search'] = array('enabled' => 'on');
        // Meanwhile...
        if ($cfg['new_search']['enabled'] === 'on') {
            # New hotness
            $results = do_solr();
        } else {
            # old and boring
            $results = do_grep();
        }
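
Slide 29 lists staff-only features and percentage ramp-ups, but the slides only show the 'on' case. One way those modes could be expressed is sketched below. The 'staff' and 'rampup' values, the 'percent' key, the flag names 'new_nav' and 'big_photos', and the function feature_enabled() are all hypothetical and not taken from the talk.

```php
<?php
// Hypothetical flag shapes; only 'enabled' => 'on' appears on the slides.
$cfg = array(
    'new_search' => array('enabled' => 'on'),
    'checkout'   => array('enabled' => 'off'),
    'new_nav'    => array('enabled' => 'staff'),
    'big_photos' => array('enabled' => 'rampup', 'percent' => 10),
);

/**
 * Decide whether a feature is active for one user.
 * A user always lands in the same bucket for a given flag, so raising
 * the percentage only adds users and never switches existing ones off.
 */
function feature_enabled(array $cfg, string $name, int $userId, bool $isStaff = false): bool
{
    if (!isset($cfg[$name]['enabled'])) {
        return false; // unknown flags fail closed
    }
    switch ($cfg[$name]['enabled']) {
        case 'on':
            return true;
        case 'staff':
            return $isStaff;
        case 'rampup':
            $percent = isset($cfg[$name]['percent']) ? (int) $cfg[$name]['percent'] : 0;
            // The mask keeps the hash non-negative on 32-bit builds.
            $bucket = (crc32($name . ':' . $userId) & 0x7fffffff) % 100;
            return $bucket < $percent;
        default:
            return false; // 'off' and anything unrecognized
    }
}

$results = feature_enabled($cfg, 'new_search', 42) ? 'solr' : 'grep';
echo $results, "\n";
```

Hashing the flag name together with the user id means a user who falls in the first 10% for one feature is not automatically in the first 10% for every other feature.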
  • 36. “Doesn’t that mean you have conditionals all over your code?” Yes. “Does anyone ever clean those up?” Sometimes. “That sounds like it sucks.” Really? “Wait a minute... all of the counter arguments are in Comic Sans. WTF?!?” Oh, you noticed? ;)
  • 37. DB Server Maintenance, June 16, 2012: 00:00 Site down for maintenance; +01:47 Site up, disabled login and registration; +06:40 Site up, some seller tools disabled; +07:41 All features restored. http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
  • 39. Features are launched by flipping a config flag, not by deploying hundreds of lines of code.
  • 40. “If Engineering at Etsy has a religion, it’s the Church of Graphs.” Ian Malpass, Code as Craft http://etsy.me/ePkoZB
  • 43. THIS IS HOW YOU RUN A COMPLEX SYSTEM http://www.flickr.com/photos/flyforfun/2694158656/
  • 44. Config flags Operator Metrics http://www.flickr.com/photos/flyforfun/2694158656/
  • 45. Oh, you want to talk about how we collect metrics and make graphs? http://www.slideshare.net/mikebrittain/metricsdriven-engineering
  • 47. Interfaces and user experiences that adapt to technical and architectural failure.
  • 54. /**
         * Creates a database connection.
         */
        public function __construct($host, $user, $pass, $db) {
            parent::__construct($host, $user, $pass, $db);
            if (mysqli_connect_error()) {
                throw new DBConnection_Exception(
                    sprintf("Error: %s, %s", mysqli_connect_errno(), mysqli_connect_error()));
            }
        }
  • 55. try {
            $conn = new DBConnection('viewsdb.host', 'db_read_user', 'ssssshh!', 'views_db');
        } catch (DBConnection_Exception $e) {
            // TODO: Someone should figure out what to do if
            // we can't connect to the views db.
            throw $e;
        }
  • 58. Site navigation Logo Cute Picture Generic, catch-all error messaging....
  • 60. Every back-end service is an opportunity for failure.
  • 73. Google Calendar Google Docs
  • 74. GMail
  • 75. “Oops, we aren’t able to access click metrics right now, do not worry — your data is safe.”
  • 76. Product design doesn’t stop at 100% availability.
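
The TODO on slide 55 rethrows, which turns one unreachable back end into a failed page. A minimal sketch of the alternative the later slides argue for is to catch the failure and drop only the affected fragment. The function render_view_count(), the markup, and the stand-in exception class below are hypothetical; the talk shows the problem, not this code.

```php
<?php
// Hypothetical stand-in for the exception class used on slides 54 and 55.
class DBConnection_Exception extends Exception {}

/**
 * Render the "views" fragment of a page.
 * $fetchViews is any callable returning an integer view count; it may throw
 * DBConnection_Exception when the views database is unreachable.
 */
function render_view_count(callable $fetchViews): string
{
    try {
        $views = $fetchViews();
        return sprintf('<span class="views">%d views</span>', $views);
    } catch (DBConnection_Exception $e) {
        // Degrade: log for operators, hide the fragment from the visitor.
        error_log('views db unavailable: ' . $e->getMessage());
        return '';
    }
}

// Healthy back end.
echo render_view_count(function () { return 1234; }), "\n";

// Failing back end: the page still renders, just without the counter.
echo render_view_count(function () {
    throw new DBConnection_Exception('Error: 2002, Connection refused');
}), "\n";
```

Whether to hide the fragment or show a short message like the one on slide 75 is a product decision, which is why the talk places Product next to Dev and Ops.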
  • 77. Dev Ops
  • 78. Dev Ops Product
  • 82. “What could possibly go wrong?” What is changing about the architecture? What kind of data access patterns are we using? How much traffic, how many queries? What metrics are we collecting? Are there automated alerts? How do we know the thresholds are right? How do we turn it off? ...and what happens when we do?
  • 85. Pedro
  • 86. Homepage (95th perc.) Surprise!!! Turning off multi-language support improves our page generation times by up to 25%.
  • 88. How could this have gone better? How quickly did we find out that something was wrong? Did we communicate well to our visitors and each other? Why did we have confidence that what we were doing was OK? Did we have the right tools, did we use them properly? Did we collect metrics, and could we find them? Where did we make the wrong decisions? What steps do we take to reduce the chance of this happening again in the future?
  • 89. “... an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.” John Allspaw VP, Technical Operations, Etsy http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
  • 90. We should try to learn not only what went wrong, but also what went right.
  • 91. DB Server Maintenance, June 16, 2012: 00:00 Site down for maintenance; +01:47 Site up, disabled login and registration; +06:40 Site up, some seller tools disabled; +07:41 All features restored. http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
  • 92. Operational Mindset Dev Ops Product
  • 93. Operational Mindset Dev Ops Product Business Priorities
  • 95. page views for error template
  • 96. page views for error template ...or, how are we screwing our users?
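
Counting how often the error template is rendered needs only one counter increment per render. The sketch below formats and sends such an increment using the StatsD line protocol (name:value|c over UDP). Using StatsD here is an assumption, as are the metric name, host, port, and both function names; the slides only show the resulting graph.

```php
<?php
/** Format one counter increment in the StatsD line protocol: name:value|c[|@rate]. */
function statsd_counter_line(string $metric, int $count = 1, float $sampleRate = 1.0): string
{
    $line = sprintf('%s:%d|c', $metric, $count);
    if ($sampleRate < 1.0) {
        $line .= '|@' . $sampleRate;
    }
    return $line;
}

/** Fire-and-forget send over UDP; failures are swallowed so metrics never break a page. */
function statsd_send(string $line, string $host = '127.0.0.1', int $port = 8125): void
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return;
    }
    @fwrite($socket, $line);
    fclose($socket);
}

// Called from the (hypothetical) error template, once per render.
statsd_send(statsd_counter_line('pages.error_template.views'));
```

Because the send is UDP and errors are ignored, a missing metrics collector costs a lost data point rather than a second failure on a page that is already an error page.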
  • 97. Risk mitigation in a complex system Redundant system architectures. Small, well-understood changes to production. Control application using config flags. Gratuitous metrics collection. Resilient user interfaces. GameDay exercises.
  • 98. Thank you. Mike Brittain mike@etsy.com @mikebrittain
  • 102. PHOTO CREDITS Flickr: roboppy http://www.flickr.com/photos/51035735481@N01/163374138/ Flickr: jamesjyu http://www.flickr.com/photos/32593095@N00/3465022/ Flickr: circulating http://www.flickr.com/photos/26835318@N00/2318226026/