SlideShare a Scribd company logo
1 of 17
Cloud Native and Epidemic
Failures
April 2014
Adrian Cockcroft
@adrianco @BatteryVentures
http://www.linkedin.com/in/adriancockcroft
Cloud Native?
Epidemic Failures
Automated Diversity
Cloud Native
Construct a highly agile and highly
available service from ephemeral and
often broken components
Inspiration
Numquam ponenda est pluralitas
sine necessitate
Plurality must never be posited
without necessity
Occam’s Razor
Monoculture
Replicate “the best” as patterns
Reduce interaction complexity
Epidemic single point of failure
Pattern Failures
Infrastructure Pattern Failures
Software Stack Pattern Failures
Application Pattern Failures
Infrastructure Pattern Failures
• Device failures – bad batch of disks, PSUs, etc.
• CPU failures – cache corruption, math errors
• Datacenter failures – power, network, disaster
• Routing failures – DNS, Internet/ISP path
Software Stack Pattern Failures
• Time bombs – Counter wrap, memory leak
• Date bombs - Leap year, leap second, epoch
• Expiration – Certs timing out
• Trust revocation – Certificate Authority fails
• Security exploit – e.g. heartbleed
• Language bugs – compile time
• Runtime bugs – JVM, Linux, Hypervisor
• Network bugs – routers, firewalls, protocols
Application Pattern Failures
• Time bombs – Counter wrap, memory leak
• Date bombs - Leap year, leap second, epoch
• Content bombs – Data dependent failure
• Configuration – wrong/bad syntax
• Versioning – incompatible mixes
• Cascading failures – error handling bugs etc.
• Cascading overload – excessive logging etc.
What to do?
Automated diversity management
Diversified automation
Efficient vs. Antifragile
Specific Ideas
• Automate running a mixture
– Diversity as default for any service stack
– No developer overhead, stay agile, low cost
• Support oldest and newest versions together
– Automate running 50/50 mix CentOS/Ubuntu
– Mix versions of JDK, Tomcat, etc.
• Vendor diversity
– Multiple DNS vendors, cloud regions, costs more
– Multiple cloud vendors? Much higher cost.
Generate Permutations
> epi <-
data.frame(java=gl(2,1,8,c("java6","java7")), linux=gl(2,2,8,c("centos","ubuntu"))
, codeversion=gl(2,4,8,c("v34","v35")))
> epi
java linux codeversion
1 java6 centos v34
2 java7 centos v34
3 java6 ubuntu v34
4 java7 ubuntu v34
5 java6 centos v35
6 java7 centos v35
7 java6 ubuntu v35
8 java7 ubuntu v35
Deployment
• Builds
– Manual to test, automate if it works
– Modify build to generate permutation AMIs
– Modify Asgard to auto-deploy permutations
• Data collection
– Tag each instance with its permutation
– Gather metrics by permutation per instance
– Do R-based Design of Experiments analysis
Analysis
• As a function of permutations
– Error rate
– Response time
– CPU Utilization
• Interactions
– E.g. interaction between linux and java
– Contrasts identify components with issues
– Small changes with high statistical significance
GCS Total API Outage for ~1hr
Takeaway
Watch out for monocultures
A|B Testing – it’s not just for personalization
http://perfcap.blogspot.com
http://slideshare.net/adrianco – Netflix
http://slideshare.net/adriancockcroft - Battery
http://www.linkedin.com/in/adriancockcroft
@adrianco @BatteryVentures

More Related Content

What's hot

Dockerizing CS50: From Cluster to Cloud to Appliance to Container by David Ma...
Dockerizing CS50: From Cluster to Cloud to Appliance to Container by David Ma...Dockerizing CS50: From Cluster to Cloud to Appliance to Container by David Ma...
Dockerizing CS50: From Cluster to Cloud to Appliance to Container by David Ma...
Docker, Inc.
 
The hardest part of microservices: your data
The hardest part of microservices: your dataThe hardest part of microservices: your data
The hardest part of microservices: your data
Christian Posta
 
Successfully Implementing DEV-SEC-OPS in the Cloud
Successfully Implementing DEV-SEC-OPS in the CloudSuccessfully Implementing DEV-SEC-OPS in the Cloud
Successfully Implementing DEV-SEC-OPS in the Cloud
Amazon Web Services
 

What's hot (20)

Microservices: What's Missing - O'Reilly Software Architecture New York
Microservices: What's Missing - O'Reilly Software Architecture New YorkMicroservices: What's Missing - O'Reilly Software Architecture New York
Microservices: What's Missing - O'Reilly Software Architecture New York
 
Microxchg Microservices
Microxchg MicroservicesMicroxchg Microservices
Microxchg Microservices
 
Software Architecture Conference - Monitoring Microservices - A Challenge
Software Architecture Conference -  Monitoring Microservices - A ChallengeSoftware Architecture Conference -  Monitoring Microservices - A Challenge
Software Architecture Conference - Monitoring Microservices - A Challenge
 
From Monolith to Microservices – and Beyond!
From Monolith to Microservices – and Beyond!From Monolith to Microservices – and Beyond!
From Monolith to Microservices – and Beyond!
 
Microservices the Good Bad and the Ugly
Microservices the Good Bad and the UglyMicroservices the Good Bad and the Ugly
Microservices the Good Bad and the Ugly
 
Goto Berlin - Migrating to Microservices (Fast Delivery)
Goto Berlin - Migrating to Microservices (Fast Delivery)Goto Berlin - Migrating to Microservices (Fast Delivery)
Goto Berlin - Migrating to Microservices (Fast Delivery)
 
Microxchg Analyzing Response Time Distributions for Microservices
Microxchg Analyzing Response Time Distributions for MicroservicesMicroxchg Analyzing Response Time Distributions for Microservices
Microxchg Analyzing Response Time Distributions for Microservices
 
Microservices Workshop - Craft Conference
Microservices Workshop - Craft ConferenceMicroservices Workshop - Craft Conference
Microservices Workshop - Craft Conference
 
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSChaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWS
 
Dockerizing CS50: From Cluster to Cloud to Appliance to Container by David Ma...
Dockerizing CS50: From Cluster to Cloud to Appliance to Container by David Ma...Dockerizing CS50: From Cluster to Cloud to Appliance to Container by David Ma...
Dockerizing CS50: From Cluster to Cloud to Appliance to Container by David Ma...
 
Cloud Native Cost Optimization UCC
Cloud Native Cost Optimization UCCCloud Native Cost Optimization UCC
Cloud Native Cost Optimization UCC
 
Rugged DevOps Will help you build ur cloudz
Rugged DevOps Will help you build ur cloudzRugged DevOps Will help you build ur cloudz
Rugged DevOps Will help you build ur cloudz
 
Staying Secure When Moving to the Cloud - Dave Millier
Staying Secure When Moving to the Cloud - Dave MillierStaying Secure When Moving to the Cloud - Dave Millier
Staying Secure When Moving to the Cloud - Dave Millier
 
The hardest part of microservices: your data
The hardest part of microservices: your dataThe hardest part of microservices: your data
The hardest part of microservices: your data
 
Real-world Microservices: Lessons from the Front Line - Zhamak Delghani, Thou...
Real-world Microservices: Lessons from the Front Line - Zhamak Delghani, Thou...Real-world Microservices: Lessons from the Front Line - Zhamak Delghani, Thou...
Real-world Microservices: Lessons from the Front Line - Zhamak Delghani, Thou...
 
Successfully Implementing DEV-SEC-OPS in the Cloud
Successfully Implementing DEV-SEC-OPS in the CloudSuccessfully Implementing DEV-SEC-OPS in the Cloud
Successfully Implementing DEV-SEC-OPS in the Cloud
 
Microservices architecture overview v3
Microservices architecture overview v3Microservices architecture overview v3
Microservices architecture overview v3
 
Building a Bank out of Microservices (NDC Sydney, August 2016)
Building a Bank out of Microservices (NDC Sydney, August 2016)Building a Bank out of Microservices (NDC Sydney, August 2016)
Building a Bank out of Microservices (NDC Sydney, August 2016)
 
Software as a Service workshop / Unlocked: the Hybrid Cloud 12th May 2014
Software as a Service workshop / Unlocked: the Hybrid Cloud 12th May 2014Software as a Service workshop / Unlocked: the Hybrid Cloud 12th May 2014
Software as a Service workshop / Unlocked: the Hybrid Cloud 12th May 2014
 
Mini-Training: Netflix Simian Army
Mini-Training: Netflix Simian ArmyMini-Training: Netflix Simian Army
Mini-Training: Netflix Simian Army
 

Viewers also liked

Eye Catching Photos
Eye Catching PhotosEye Catching Photos
Eye Catching Photos
Yee Seng Gan
 
Nair jure123456
Nair jure123456Nair jure123456
Nair jure123456
nairjure
 
Optical flares installation
Optical flares installationOptical flares installation
Optical flares installation
galamo11
 
Talent Gradient Construction Based on Competency
Talent Gradient Construction Based on CompetencyTalent Gradient Construction Based on Competency
Talent Gradient Construction Based on Competency
Vincent T. ZHAO
 

Viewers also liked (20)

Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016
 
Cloud Trends Nov2015 Structure
Cloud Trends Nov2015 StructureCloud Trends Nov2015 Structure
Cloud Trends Nov2015 Structure
 
Innovation and Architecture
Innovation and ArchitectureInnovation and Architecture
Innovation and Architecture
 
Speeding Up Innovation
Speeding Up InnovationSpeeding Up Innovation
Speeding Up Innovation
 
Gluecon Monitoring Microservices and Containers: A Challenge
Gluecon Monitoring Microservices and Containers: A ChallengeGluecon Monitoring Microservices and Containers: A Challenge
Gluecon Monitoring Microservices and Containers: A Challenge
 
Most people cannot say - even to themselves - what their "Business Model" is
Most people cannot say - even to themselves - what their "Business Model" is Most people cannot say - even to themselves - what their "Business Model" is
Most people cannot say - even to themselves - what their "Business Model" is
 
Systems analysis and design (abe)
Systems analysis and design (abe)Systems analysis and design (abe)
Systems analysis and design (abe)
 
Rom - Ruby Object Mapper
Rom - Ruby Object MapperRom - Ruby Object Mapper
Rom - Ruby Object Mapper
 
Zenoss Monitroing – zendmd Scripting Guide
Zenoss Monitroing – zendmd Scripting GuideZenoss Monitroing – zendmd Scripting Guide
Zenoss Monitroing – zendmd Scripting Guide
 
More Than Human - Personal Update Magazine - August 2011
More Than Human - Personal Update Magazine - August 2011More Than Human - Personal Update Magazine - August 2011
More Than Human - Personal Update Magazine - August 2011
 
Eye Catching Photos
Eye Catching PhotosEye Catching Photos
Eye Catching Photos
 
Technology In Schools What Is Changing
Technology  In  Schools  What  Is  ChangingTechnology  In  Schools  What  Is  Changing
Technology In Schools What Is Changing
 
Nair jure123456
Nair jure123456Nair jure123456
Nair jure123456
 
4g5
4g54g5
4g5
 
Optical flares installation
Optical flares installationOptical flares installation
Optical flares installation
 
Dynamic ticket pricing. Squeezing more juice from half time oranges
Dynamic ticket pricing. Squeezing more juice from half time oranges  Dynamic ticket pricing. Squeezing more juice from half time oranges
Dynamic ticket pricing. Squeezing more juice from half time oranges
 
Web syahid 1210651273
Web syahid 1210651273Web syahid 1210651273
Web syahid 1210651273
 
Talent Gradient Construction Based on Competency
Talent Gradient Construction Based on CompetencyTalent Gradient Construction Based on Competency
Talent Gradient Construction Based on Competency
 
Mig gig first draft
Mig gig first draftMig gig first draft
Mig gig first draft
 
Quantum Entanglement - Cryptography and Communication
Quantum Entanglement - Cryptography and CommunicationQuantum Entanglement - Cryptography and Communication
Quantum Entanglement - Cryptography and Communication
 

Similar to Epidemic Failures

Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabits
Yves Goeleven
 
SQL Server Clustering for Dummies
SQL Server Clustering for DummiesSQL Server Clustering for Dummies
SQL Server Clustering for Dummies
Mark Broadbent
 
Performance of Microservice Frameworks on different JVMs
Performance of Microservice Frameworks on different JVMsPerformance of Microservice Frameworks on different JVMs
Performance of Microservice Frameworks on different JVMs
Maarten Smeets
 

Similar to Epidemic Failures (20)

Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabits
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
 
Cloud Native Camel Riding
Cloud Native Camel RidingCloud Native Camel Riding
Cloud Native Camel Riding
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
 
24 Hours of PASS, Summit Preview Session: Virtual SQL Server CPUs
24 Hours of PASS, Summit Preview Session: Virtual SQL Server CPUs24 Hours of PASS, Summit Preview Session: Virtual SQL Server CPUs
24 Hours of PASS, Summit Preview Session: Virtual SQL Server CPUs
 
D. Andreadis, Red Hat: Concepts and technical overview of Quarkus
D. Andreadis, Red Hat: Concepts and technical overview of QuarkusD. Andreadis, Red Hat: Concepts and technical overview of Quarkus
D. Andreadis, Red Hat: Concepts and technical overview of Quarkus
 
Dystopia as a Service
Dystopia as a ServiceDystopia as a Service
Dystopia as a Service
 
Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?
 
Fuse integration-services
Fuse integration-servicesFuse integration-services
Fuse integration-services
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internals
 
Private Cloud with Open Stack, Docker
Private Cloud with Open Stack, DockerPrivate Cloud with Open Stack, Docker
Private Cloud with Open Stack, Docker
 
33rd degree
33rd degree33rd degree
33rd degree
 
EVE Microservices Platform
EVE Microservices PlatformEVE Microservices Platform
EVE Microservices Platform
 
SQL Server Clustering for Dummies
SQL Server Clustering for DummiesSQL Server Clustering for Dummies
SQL Server Clustering for Dummies
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
 
PowerPoint Presentation
PowerPoint PresentationPowerPoint Presentation
PowerPoint Presentation
 
Performance of Microservice Frameworks on different JVMs
Performance of Microservice Frameworks on different JVMsPerformance of Microservice Frameworks on different JVMs
Performance of Microservice Frameworks on different JVMs
 
Share on LinkedIn Share on Twitter Share on Facebook Share on Google+ Share b...
Share on LinkedIn Share on Twitter Share on Facebook Share on Google+ Share b...Share on LinkedIn Share on Twitter Share on Facebook Share on Google+ Share b...
Share on LinkedIn Share on Twitter Share on Facebook Share on Google+ Share b...
 
Migrating to Public Cloud
Migrating to Public CloudMigrating to Public Cloud
Migrating to Public Cloud
 

More from Adrian Cockcroft

More from Adrian Cockcroft (10)

Gophercon 2016 Communicating Sequential Goroutines
Gophercon 2016 Communicating Sequential GoroutinesGophercon 2016 Communicating Sequential Goroutines
Gophercon 2016 Communicating Sequential Goroutines
 
Microservices Application Tracing Standards and Simulators - Adrians at OSCON
Microservices Application Tracing Standards and Simulators - Adrians at OSCONMicroservices Application Tracing Standards and Simulators - Adrians at OSCON
Microservices Application Tracing Standards and Simulators - Adrians at OSCON
 
Evolution of Microservices - Craft Conference
Evolution of Microservices - Craft ConferenceEvolution of Microservices - Craft Conference
Evolution of Microservices - Craft Conference
 
What's Missing? Microservices Meetup at Cisco
What's Missing? Microservices Meetup at CiscoWhat's Missing? Microservices Meetup at Cisco
What's Missing? Microservices Meetup at Cisco
 
In Search of Segmentation
In Search of SegmentationIn Search of Segmentation
In Search of Segmentation
 
Dockercon 2015 - Faster Cheaper Safer
Dockercon 2015 - Faster Cheaper SaferDockercon 2015 - Faster Cheaper Safer
Dockercon 2015 - Faster Cheaper Safer
 
Dockercon State of the Art in Microservices
Dockercon State of the Art in MicroservicesDockercon State of the Art in Microservices
Dockercon State of the Art in Microservices
 
Cloud Native Cost Optimization
Cloud Native Cost OptimizationCloud Native Cost Optimization
Cloud Native Cost Optimization
 
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
 
Hack Kid Con - Learn to be a Data Scientist for $1
Hack Kid Con - Learn to be a Data Scientist for $1Hack Kid Con - Learn to be a Data Scientist for $1
Hack Kid Con - Learn to be a Data Scientist for $1
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Epidemic Failures

  • 1. Cloud Native and Epidemic Failures April 2014 Adrian Cockcroft @adrianco @BatteryVentures http://www.linkedin.com/in/adriancockcroft
  • 3. Cloud Native Construct a highly agile and highly available service from ephemeral and often broken components
  • 5. Numquam ponenda est pluralitas sine necessitate Plurality must never be posited without necessity Occam’s Razor
  • 6. Monoculture Replicate “the best” as patterns Reduce interaction complexity Epidemic single point of failure
  • 7. Pattern Failures Infrastructure Pattern Failures Software Stack Pattern Failures Application Pattern Failures
  • 8. Infrastructure Pattern Failures • Device failures – bad batch of disks, PSUs, etc. • CPU failures – cache corruption, math errors • Datacenter failures – power, network, disaster • Routing failures – DNS, Internet/ISP path
  • 9. Software Stack Pattern Failures • Time bombs – Counter wrap, memory leak • Date bombs - Leap year, leap second, epoch • Expiration – Certs timing out • Trust revocation – Certificate Authority fails • Security exploit – e.g. heartbleed • Language bugs – compile time • Runtime bugs – JVM, Linux, Hypervisor • Network bugs – routers, firewalls, protocols
  • 10. Application Pattern Failures • Time bombs – Counter wrap, memory leak • Date bombs - Leap year, leap second, epoch • Content bombs – Data dependent failure • Configuration – wrong/bad syntax • Versioning – incompatible mixes • Cascading failures – error handling bugs etc. • Cascading overload – excessive logging etc.
  • 11. What to do? Automated diversity management Diversified automation Efficient vs. Antifragile
  • 12. Specific Ideas • Automate running a mixture – Diversity as default for any service stack – No developer overhead, stay agile, low cost • Support oldest and newest versions together – Automate running 50/50 mix CentOS/Ubuntu – Mix versions of JDK, Tomcat, etc. • Vendor diversity – Multiple DNS vendors, cloud regions, costs more – Multiple cloud vendors? Much higher cost.
  • 13. Generate Permutations > epi <- data.frame(java=gl(2,1,8,c("java6","java7")), linux=gl(2,2,8,c("centos","ubuntu")) , codeversion=gl(2,4,8,c("v34","v35"))) > epi java linux codeversion 1 java6 centos v34 2 java7 centos v34 3 java6 ubuntu v34 4 java7 ubuntu v34 5 java6 centos v35 6 java7 centos v35 7 java6 ubuntu v35 8 java7 ubuntu v35
  • 14. Deployment • Builds – Manual to test, automate if it works – Modify build to generate permutation AMIs – Modify Asgard to auto-deploy permutations • Data collection – Tag each instance with its permutation – Gather metrics by permutation per instance – Do R-based Design of Experiments analysis
  • 15. Analysis • As a function of permutations – Error rate – Response time – CPU Utilization • Interactions – E.g. interaction between linux and java – Contrasts identify components with issues – Small changes with high statistical significance
  • 16. GCS Total API Outage for ~1hr
  • 17. Takeaway Watch out for monocultures A|B Testing – it’s not just for personalization http://perfcap.blogspot.com http://slideshare.net/adrianco – Netflix http://slideshare.net/adriancockcroft - Battery http://www.linkedin.com/in/adriancockcroft @adrianco @BatteryVentures