Resilient Response in Complex Systems
                                  John Allspaw
                                 SVP, Tech Ops
                              QCon London 2012

Sunday, March 11, 12
OPERABILITY




PRODUCTION




http://whoownsmyavailability.com




How important is this?


How important is this?


How Can This Happen?


Complicated?
                          Complex?




Complex Systems
    •      Cascading Failures
    •      Difficult to determine boundaries
    •      Complex systems may be open
    •      Complex systems may have a memory
    •      Complex systems may be nested
    •      Dynamic network of multiplicity
    •      May produce emergent phenomena
    •      Relationships are non-linear
    •      Relationships contain feedback loops
1998

How Can This Happen?
                     It does happen.
                    And it will again.
                        And again.
Optimization
                       MTBF
                       MTTR
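
The slide names MTBF (mean time between failures) and MTTR (mean time to repair) as the two quantities to optimize. A minimal sketch of how they trade off, assuming the standard steady-state availability formula MTBF / (MTBF + MTTR), which the slide itself does not state. The function name and all numbers are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    Assumes the textbook formula MTBF / (MTBF + MTTR); this formula is
    not taken from the slides.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only for illustration.
baseline = availability(mtbf_hours=500.0, mttr_hours=2.0)
fewer_failures = availability(mtbf_hours=1000.0, mttr_hours=2.0)  # MTBF doubled
faster_recovery = availability(mtbf_hours=500.0, mttr_hours=1.0)  # MTTR halved

print(f"baseline:        {baseline:.6f}")
print(f"fewer failures:  {fewer_failures:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
```

Under this formula, halving MTTR yields the same availability as doubling MTBF, so the two levers are interchangeable on paper even though they call for very different engineering work.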
http://www.flickr.com/photos/sparktography/75499095/
How does team
                       troubleshooting
                           happen?
[Figure: incident timeline, horizontal axis: Time]
Problem Starts -> Detection -> Evaluation -> Response -> Stable -> Confirmation -> All Clear -> PostMortem
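[Figure: incident timeline, horizontal axis: Time, with a "Stress" label added above the response phases]
Problem Starts -> Detection -> Evaluation -> Response -> Stable -> Confirmation -> All Clear -> PostMortem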
Forced beyond learned roles
         Actions whose consequences are both important and
         difficult to see
         Cognitively and perceptually noisy
         Coordinative load increases exponentially
So What
                       Can We Do?

We Learn From
                          Others

Characteristics of response to
    escalating scenarios




Characteristics of response to
    escalating scenarios
                       ...tend to neglect how processes
                       develop within time (awareness of
                       rates) versus assessing how things
                       are in the moment



      “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980


Characteristics of response to
    escalating scenarios
                       ...have difficulty in dealing with
                       exponential developments (hard to
                       imagine how fast something can
                       change, or accelerate)



      “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
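
To make the point about exponential developments concrete, here is a small sketch comparing two guesses at how long a growing backlog takes to hit a limit. Everything in it is a hypothetical illustration, not from the talk: the queue-depth readings, the threshold, and the function names are assumptions.

```python
import math


def minutes_to_threshold_linear(start: float, per_minute: float, threshold: float) -> int:
    """Minutes until the threshold if growth continues at a constant rate."""
    return math.ceil((threshold - start) / per_minute)


def minutes_to_threshold_doubling(start: float, threshold: float) -> int:
    """Minutes until the threshold if the value doubles every minute."""
    return math.ceil(math.log2(threshold / start))


# Hypothetical readings: queue depth of 100, then 200 one minute later.
first_reading, second_reading = 100, 200
threshold = 10_000

# A responder extrapolating in a straight line from the two readings.
linear_guess = minutes_to_threshold_linear(
    first_reading, second_reading - first_reading, threshold
)

# What happens if the queue is in fact doubling each minute.
doubling_actual = minutes_to_threshold_doubling(first_reading, threshold)

print(f"linear extrapolation: {linear_guess} minutes")
print(f"doubling growth:      {doubling_actual} minutes")
```

The same two readings are consistent with both models, which is why watching rates over time matters more than a single snapshot.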


Characteristics of response to
    escalating scenarios
                       ...inclined to think in causal series,
                       instead of causal nets.
                       A therefore B,
                       instead of
                       A, therefore B and C (therefore D and
                       E), etc.
      “On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980


Pitfalls

Thematic
Vagabonding

Pitfalls

Goal Fixation
(encystment)

Pitfalls

Refusal to make
decisions

Heroism
         Non-communicating lone wolf-isms


Distraction
         Irrelevant noise in comm channels


Jens Rasmussen, 1983
                                                       Senior Member, IEEE




         “Skills, Rules, and Knowledge; Signals, Signs,
         and Symbols, and Other Distinctions in Human
         Performance Models”
         IEEE Transactions On Systems, Man, and Cybernetics, May 1983




SKILL-BASED: Simple, routine
RULE-BASED: Knowable, but unfamiliar
KNOWLEDGE-BASED: WTF IS GOING ON?

       (Reason, 1990)
Team Dynamics



High Reliability Organizations

      • Air Traffic Control
      • Naval Air Operations At Sea
      • Electrical Power Systems
      • Etc.

      • Complex Socio-Technical systems
      • Efficiency <-> Thoroughness
      • Time/Resource Constrained
      • Engineering-driven
“The Self-Designing High-Reliability Organization:
         Aircraft Carrier Flight Operations at Sea”
         Rochlin, La Porte, and Roberts. Naval War College Review 1987

         http://govleaders.org/reliability.htm




"So you want to understand an aircraft carrier? Well, just
                       imagine that it's a busy day, and you shrink San Francisco
                       Airport to only one short runway and one ramp and gate. Make
                       planes take off and land at the same time, at half the present
                       time interval, rock the runway from side to side, and require that
                       everyone who leaves in the morning returns that same day.
                       Make sure the equipment is so close to the edge of the envelope
                       that it's fragile. Then turn off the radar to avoid detection,
                       impose strict controls on radios, fuel the aircraft in place with
                       their engines running, put an enemy in the air, and scatter live
                       bombs and rockets around. Now wet the whole thing down with
                       salt water and oil, and man it with 20-year-olds, half of whom
                       have never seen an airplane close-up.

                       Oh, and by the way, try not to kill anyone."
                                                                   -- Senior officer, Air Division



Close interdependence
         between groups




Close reciprocal
            coordination and
            information sharing,
            resulting in overlapping
            knowledge




High redundancy: multiple
            people observing the same
            event and sharing
            information



Broad definition of who
         belongs to the team.


Teammates are included in
         the communication loops
         rather than excluded.

Lots of error correction.



High levels of situation
         comprehension: maintain
         constant awareness of the
         possibility of accidents.

High levels of interpersonal
         skills


Maintenance of detailed
         records of past incidents
         that are closely examined
         with a view to learning from
         them.
Patterns of authority are
         changed to meet the
         demands of the events:
         organizational flexibility.

The reporting of errors and
         faults is rewarded, not
         punished.

So What Else
                       Can We Do?

We Drill

We GameDay

We Learn To Improvise



IMPROVISATION



IMPROVISATION



We Learn From Our
                           Mistakes

Postmortems
      • Full timelines: What happened, when
      • Review in public, everyone invited
      • Search for “second stories” instead of “human error”
      • Cultivating a blameless environment
      • Giving requisite authority to individuals to improve things
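
As one way to capture a "full timeline" of what happened and when, the sketch below records timestamped entries and derives how long each phase lasted. It reuses phase names from the incident timeline shown earlier in the talk. The `TimelineEntry` type, the `phase_durations` helper, and the sample incident are hypothetical, not tooling described by the speaker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEntry:
    """One observed moment in an incident: when, which phase, what was seen."""
    at: datetime
    phase: str
    note: str


def phase_durations(entries):
    """Return (from_phase, to_phase, elapsed) for each consecutive pair of entries."""
    ordered = sorted(entries, key=lambda entry: entry.at)
    return [
        (earlier.phase, later.phase, later.at - earlier.at)
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Hypothetical incident, invented for illustration only.
start = datetime(2012, 3, 11, 14, 0)
timeline = [
    TimelineEntry(start, "Problem Starts", "error rate begins to climb"),
    TimelineEntry(start + timedelta(minutes=4), "Detection", "alert fires"),
    TimelineEntry(start + timedelta(minutes=9), "Evaluation", "team gathers in chat"),
    TimelineEntry(start + timedelta(minutes=21), "Response", "change rolled back"),
    TimelineEntry(start + timedelta(minutes=30), "Stable", "error rate back to normal"),
]

durations = phase_durations(timeline)
for from_phase, to_phase, elapsed in durations:
    print(f"{from_phase} -> {to_phase}: {elapsed}")
```

Keeping the free-text note next to each timestamp leaves room for the "second stories" the slide asks for, rather than reducing the record to durations alone.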
Qualifying Response
        High signal:noise in comm channels?
        Troubleshooting fatigue?
        Troubleshooting handoff?
        All tools on-hand?
        Improvised tooling or solutions?
        Metrics visibility?
        Collaborative and skillful communication?
Remediation



Mature Role of Automation

       “Ironies of Automation” - Lisanne Bainbridge
          http://www.bainbrdg.demon.co.uk/Papers/Ironies.html




Mature Role of Automation
        •       Moves humans from manual operator to supervisor
        •       Extends and augments human abilities, doesn’t replace them
        •       Doesn’t remove “human error”
        •       Are brittle
        •       Recognizes that there is always discretionary space for humans
        •       Recognizes the Law of Stretched Systems

Law of Stretched Systems
         “Every system is stretched to operate at its
         capacity; as soon as there is some
         improvement, for example, in the form of
         new technology, it will be exploited to
         achieve a new intensity and tempo of
         activity”

  D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns”, 2006

We Share Near-Miss
                            Events

Near Misses
                       Hey everybody -
                       Don’t be like me. I tried to X, but
                       that wasn’t a good idea.
                       It almost exploded everyone.

                       So, don’t do: (details about X)
                                                     Love,
                                                      Joe
Near Misses
      • Can act like “vaccines” - help system safety without actually
              hurting anything
      • Happen more often, so provide more data on latent failures
      • Powerful reminder of hazards, and slows down the process of
              forgetting to be afraid

A parting word
    A parting challenge


Two Propositions



100 changes
         6 change-related issues
100 > 6
Proposition #1
         “Ways in which things go right are special cases
         of the ways in which things go wrong.”




Proposition #1
                       Successes = failures gone wrong
                       Study the failures, generalize from that.
                         Potential data sources: 6 out of 100

Proposition #2
         “Ways in which things go wrong are special
         cases of the ways in which things go right.”




Proposition #2

                          Failures = successes gone wrong
                          Study the successes, generalize from that
                          Potential data sources: 94 out of 100

94/100 ?
   OR
6/100 ?

What and WHY Do Things
     Go RIGHT?
Not just:
                           why did we fail?

        But also:
                       why did we succeed?
Resilient Response
  •      Can learn from other fields
  •      Can train for outages
  •      Can learn from mistakes
  •      Can learn from successes as well as failures
http://www.flickr.com/photos/sparktography/75499095/
THE END





A Journey Into the Emotions of Software Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 

Resilient Response In Complex Systems

  • 1. Resilient Response in Complex Systems John Allspaw SVP, Tech Ops Qcon London 2012 Sunday, March 11, 12
  • 6. How important is this?
  • 19. How important is this?
  • 20. How Can This Happen?
  • 21. Complicated? Complex?
  • 22. Complex Systems • Cascading Failures • Difficult to determine boundaries • Complex systems may be open • Complex systems may have a memory • Complex systems may be nested • Dynamic network of multiplicity • May produce emergent phenomena • Relationships are non-linear • Relationships contain feedback loops
  • 24. How Can This Happen? It does happen. And it will again. And again.
  • 26. Optimization MTBF MTTR
  • 28. How does team troubleshooting happen?
  • 29. Problem Starts, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear (axis: Time)
  • 30. Problem Starts, Stress, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear (axis: Time)
  • 31. Forced beyond learned roles; actions whose consequences are both important and difficult to see; cognitively and perceptively noisy; coordinative load increases exponentially
  • 33. So What Can We Do?
  • 34. We Learn From Others
  • 35. Characteristics of response to escalating scenarios
  • 36. Characteristics of response to escalating scenarios: ...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment. (“On the Difficulties People Have in Dealing With Complexity,” Dietrich Doerner, 1980)
  • 37. Characteristics of response to escalating scenarios: ...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate). (“On the Difficulties People Have in Dealing With Complexity,” Dietrich Doerner, 1980)
  • 38. Characteristics of response to escalating scenarios: ...inclined to think in causal series, instead of causal nets. A therefore B, instead of A, therefore B and C (therefore D and E), etc. (“On the Difficulties People Have in Dealing With Complexity,” Dietrich Doerner, 1980)
  • 42. Heroism: non-communicating lone wolf-isms
  • 43. Distraction: irrelevant noise in comm channels
  • 44. Jens Rasmussen, Senior Member, IEEE: “Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models,” IEEE Transactions On Systems, Man, and Cybernetics, May 1983
  • 45. SKILL-BASED: Simple, routine • RULE-BASED: Knowable, but unfamiliar • KNOWLEDGE-BASED: WTF IS GOING ON? (Reason, 1990)
  • 47. High Reliability Organizations • Air Traffic Control • Naval Air Operations At Sea • Electrical Power Systems • Etc. • Complex Socio-Technical systems • Efficiency <-> Thoroughness • Time/Resource Constrained • Engineering-driven
  • 49. “The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea” Rochlin, La Porte, and Roberts. Naval War College Review 1987 http://govleaders.org/reliability.htm
  • 50. "So you want to understand an aircraft carrier? Well, just imagine that it's a busy day, and you shrink San Francisco Airport to only one short runway and one ramp and gate. Make planes take off and land at the same time, at half the present time interval, rock the runway from side to side, and require that everyone who leaves in the morning returns that same day. Make sure the equipment is so close to the edge of the envelope that it's fragile. Then turn off the radar to avoid detection, impose strict controls on radios, fuel the aircraft in place with their engines running, put an enemy in the air, and scatter live bombs and rockets around. Now wet the whole thing down with salt water and oil, and man it with 20-year-olds, half of whom have never seen an airplane close-up. Oh, and by the way, try not to kill anyone." -- Senior officer, Air Division
  • 51. Close interdependence between groups
  • 52. Close reciprocal coordination and information sharing, resulting in overlapping knowledge
  • 53. High redundancy: multiple people observing the same event and sharing information
  • 54. Broad definition of who belongs to the team.
  • 55. Teammates are included in the communication loops rather than excluded.
  • 56. Lots of error correction.
  • 57. High levels of situation comprehension: maintain constant awareness of the possibility of accidents.
  • 58. High levels of interpersonal skills
  • 59. Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.
  • 60. Patterns of authority are changed to meet the demands of the events: organizational flexibility.
  • 61. The reporting of errors and faults is rewarded, not punished.
  • 62. So What Else Can We Do?
  • 66. We Learn To Improvise
  • 69. We Learn From Our Mistakes
  • 70. Postmortems • Full timelines: What happened, when • Review in public, everyone invited • Search for “second stories” instead of “human error” • Cultivating a blameless environment • Giving requisite authority to individuals to improve things
  • 71. Qualifying Response High signal:noise in comm channels? Troubleshooting fatigue? Troubleshooting handoff? All tools on-hand? Improvised tooling or solutions? Metrics visibility? Collaborative and skillful communication?
  • 73. Mature Role of Automation “Ironies of Automation” - Lisanne Bainbridge http://www.bainbrdg.demon.co.uk/Papers/Ironies.html
  • 74. Mature Role of Automation • Moves humans from manual operator to supervisor • Extends and augments human abilities, doesn’t replace them • Doesn’t remove “human error” • Are brittle • Recognizes that there is always discretionary space for humans • Recognizes the Law of Stretched Systems
  • 75. Law of Stretched Systems “Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity” D.Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns” 2006
  • 76. We Share Near-Miss Events
  • 77. Near Misses Hey everybody - Don’t be like me. I tried to X, but that wasn’t a good idea. It almost exploded everyone. So, don’t do: (details about X) Love, Joe
  • 78. Near Misses • Can act like “vaccines” - help system safety without actually hurting anything • Happen more often, so provide more data on latent failures • Powerful reminder of hazards, and slows down the process of forgetting to be afraid
  • 79. A parting word A parting challenge
  • 81. 100 changes 6 change-related issues
  • 82. 100 > 6
  • 83. Proposition #1 “Ways in which things go right are special cases of the ways in which things go wrong.”
  • 84. Proposition #1 Successes = failures gone wrong Study the failures, generalize from that. Potential data sources: 6 out of 100
  • 85. Proposition #2 “Ways in which things go wrong are special cases of the ways in which things go right.”
  • 86. Proposition #2 Failures = successes gone wrong Study the successes, generalize from that. Potential data sources: 94 out of 100
  • 87. 94/100? OR 6/100?
  • 88. What and WHY Do Things Go RIGHT?
  • 89. Not just: why did we fail? But also: why did we succeed?
  • 90. Resilient Response • Can learn from other fields • Can train for outages • Can learn from mistakes • Can learn from successes as well as failures
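
Slide 26 names MTBF and MTTR as the two quantities to optimize. As a minimal sketch that is not taken from the talk, the Python below estimates both from a list of incidents and derives steady-state availability with the standard relation MTBF / (MTBF + MTTR). The `Incident` record, its field names (borrowed from the "Problem Starts" and "Stable" phases on slide 29), and the idea of a fixed observation window are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the fields mirror two phases from slide 29.
    started: datetime  # "Problem Starts"
    stable: datetime   # "Stable"


def total_downtime(incidents):
    # Sum of (stable - started) over all incidents.
    return sum((i.stable - i.started for i in incidents), timedelta())


def mean_time_to_repair(incidents):
    # Average time from problem start until the system is stable again.
    return total_downtime(incidents) / len(incidents)


def mean_time_between_failures(incidents, window_start, window_end):
    # Time spent working during the window, divided by the number of failures.
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(mtbf, mttr):
    # Steady-state availability as a fraction between 0 and 1.
    return mtbf / (mtbf + mttr)
```

With two one-hour incidents in a 100-hour window, this gives an MTTR of 1 hour, an MTBF of 49 hours, and an availability of 0.98. Halving MTTR raises availability about as much as doubling MTBF, which is one way to read the talk's emphasis on response rather than prevention alone.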