Principles of Chaos Engineering


Slide deck from my talk "Principles of Chaos Engineering" at the first-ever Chaos Engineering Hamburg meetup. Come join us at http://www.meetup.com/Chaos-Engineering-Hamburg/ and stay up to date with new events and other news.



1. Chaos Engineering Hamburg
   Marvin Hoffmann | Computer Scientist
   15.12.2015

2. Agenda
   1. AWS Basics and Intro
   2. Evolution of Chaos Testing
   3. Tooling
   4. Chaos Engineering

3. AWS Basics
   Regions: Europe West (Ireland), US East (N. Virginia)
   AZs
   Instances

4. Chaos? - What do we mean?

5. “A way to improve availability is to install proven hardware and software, and then leave it alone”
   Jim Gray, “Why Do Computers Stop and What Can Be Done About It?”

6. Be reliable!
   • Systems need to be reliable
   • Nuclear weapon arsenal, heart rate monitoring, World of Warcraft servers, streaming business
   • Third-party dependencies (software and hardware)

7. DynamoDB Outage US-East
   • “… there was a brief network disruption that impacted a portion of DynamoDB’s storage servers.”
   • 2:19am until 7:10am PDT
   • “There are several other AWS services that use DynamoDB that experienced problems during the event.”
   • SQS, EC2 auto scaling, CloudWatch
   Source: https://aws.amazon.com/message/5467D2/

8. It’s not always the infrastructure
   • Deployments themselves may cause issues
   • Unpredicted behaviour after a change has been rolled out
   • Issues during rollback
   • Change in client / user behaviour

9. Evolution of Chaos Testing

10. Do the simplest thing first
    • Prepare for your machines to die
    • “Cattle, not pets” (Adrian Cockcroft)
    • Resilience through redundancy
    • Stateless machines

11. Deal with infrastructure issues
    • Latency between instances
    • Packet loss
    • Ports blocked
    • or even outages of an entire AZ
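The failure modes on this slide can be rehearsed in-process before touching a real network. The following is a minimal sketch only: `flaky`, `with_retries` and `PacketLost` are hypothetical names invented for illustration and are not part of Simian Army or any AWS tooling. It wraps a call so that it is delayed and dropped with a given probability, then shows a naive retry as one possible client-side mitigation.

```python
import random
import time


class PacketLost(Exception):
    """Raised when the wrapper decides to drop a call (simulated loss)."""


def flaky(call, *, latency_s=0.0, loss_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap `call` so it is delayed by `latency_s` and dropped with probability `loss_rate`."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        sleep(latency_s)                 # injected latency
        if rng.random() < loss_rate:     # injected loss
            raise PacketLost("simulated packet loss")
        return call(*args, **kwargs)

    return wrapped


def with_retries(call, attempts=3):
    """Naive mitigation: retry a dropped call up to `attempts` times."""
    def wrapped(*args, **kwargs):
        last_error = None
        for _ in range(attempts):
            try:
                return call(*args, **kwargs)
            except PacketLost as err:
                last_error = err
        raise last_error

    return wrapped
```

Passing `rng` and `sleep` as parameters keeps the sketch deterministic in tests; a real experiment would inject the fault at the network layer instead.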

12. Think big!
    • Remember that DynamoDB failure?
    • Outage of an entire AWS region!
    • You’ll need more than one region in the first place
    • Re-routing of all traffic from one region to another
    • Any region needs to be able to scale to take the load of two regions
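The last bullet is a capacity calculation. As a rough sketch, assuming traffic is spread evenly across regions and rerouted evenly to the survivors (an idealisation, and `required_headroom` is a hypothetical helper name), the growth factor per surviving region is:

```python
def required_headroom(regions: int, failed: int = 1) -> float:
    """Factor by which each surviving region's load grows when `failed` regions drop out.

    Assumes an even traffic split before and after the failure.
    """
    survivors = regions - failed
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return regions / survivors


# Two regions, one fails: the survivor carries the load of two regions.
print(required_headroom(2))   # 2.0
# Three regions, one fails: each survivor carries 1.5x its normal load.
print(required_headroom(3))   # 1.5
```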

13. Tooling (meet the Monkeys)

14. Chaos Monkey
    Kills random instances in your account
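The selection step behind "kills random instances" can be sketched without touching AWS at all. This is an illustration under stated assumptions, not the Simian Army implementation: `Instance`, its `opted_in` flag and `pick_victims` are hypothetical, and the function only chooses candidates; it terminates nothing.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    group: str        # e.g. the name of an auto scaling group
    opted_in: bool    # assumption: only opted-in groups are eligible


def pick_victims(instances, rng=None):
    """Pick one random instance per opted-in group; returns {group: Instance}."""
    rng = rng or random.Random()
    by_group = {}
    for inst in instances:
        if inst.opted_in:
            by_group.setdefault(inst.group, []).append(inst)
    return {group: rng.choice(members) for group, members in by_group.items()}
```

Limiting the choice to one instance per group keeps the blast radius small, which fits the "do the simplest thing first" stage of the talk.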

15. Chaos Gorilla
    Kills a random AZ in your account

16. Chaos Kong
    Kills an entire AWS region in your account
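The three tools differ mainly in blast radius: one instance, one AZ, or one region. A small in-memory model makes that concrete. The fleet below is hypothetical (the instance IDs are made up), and `survivors` and `capacity_by_region` are illustrative helpers rather than any real tool's API.

```python
from collections import Counter

# Hypothetical fleet: (region, availability zone, instance id)
FLEET = [
    ("us-east-1", "us-east-1a", "i-01"),
    ("us-east-1", "us-east-1a", "i-02"),
    ("us-east-1", "us-east-1b", "i-03"),
    ("eu-west-1", "eu-west-1a", "i-04"),
    ("eu-west-1", "eu-west-1b", "i-05"),
    ("eu-west-1", "eu-west-1b", "i-06"),
]


def survivors(fleet, *, region=None, az=None, instance=None):
    """Return the instances left after removing one instance, one AZ, or one region."""
    return [
        (r, z, i) for (r, z, i) in fleet
        if not (r == region or z == az or i == instance)
    ]


def capacity_by_region(fleet):
    """Count remaining instances per region."""
    return Counter(r for (r, _, _) in fleet)


print(len(survivors(FLEET, instance="i-01")))    # instance-level failure leaves 5
print(len(survivors(FLEET, az="us-east-1a")))    # AZ-level failure leaves 4
print(capacity_by_region(survivors(FLEET, region="us-east-1")))  # only eu-west-1 remains
```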

17. What’s in it?
    • A compilation of scripts
    • Scripts mess with your AWS account
    • Thus, they are very AWS specific
    • If not on AWS, get inspired and build your toolset around these ideas
    • Not a comprehensive toolset

18. Simian Army
    • Latency Monkey
    • Conformity Monkey
    • Security Monkey
    • Doctor Monkey
    • 10-18 Monkey

19. Chaos Engineering

20. Chaos Engineering
    • Systematic approach to Chaos Testing
    • Started by Netflix
    • Netflix talks about it a lot to attract talent
    • Many other companies are doing similar things in that field
    • Netflix wants to grow a community around it

21. “Experiment on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
    Netflix

22. Four Principles of Chaos Engineering

23. Know your system
    • Operational insight
    • What is “normal”? What does a failure look like?

24. Four Principles of Chaos Engineering
    1. Build a hypothesis around steady-state behaviour
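A steady-state hypothesis becomes testable once "normal" is written down as a metric and a tolerance. The sketch below is one simple way to do that, assuming a hypothetical metric (successful requests per second) and an arbitrary 5% relative tolerance; neither value comes from the talk.

```python
from statistics import mean


def steady_state_holds(baseline, observed, tolerance=0.05):
    """True if the mean of `observed` is within `tolerance` (relative) of the baseline mean."""
    expected = mean(baseline)
    return abs(mean(observed) - expected) <= tolerance * expected


# Hypothetical samples of successful requests per second.
baseline = [1000, 1020, 980]
print(steady_state_holds(baseline, [990, 1005, 995]))   # True: within 5% of normal
print(steady_state_holds(baseline, [700, 720, 710]))    # False: hypothesis refuted
```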

25. The “Happy Path”
    • Trace through code where nothing bad happens
    • Testing usually happens first on the happy path
    • Bad things usually happen off the happy path
    Source: https://bethtrissel.files.wordpress.com/2014/06/176869567.jpg

26. Four Principles of Chaos Engineering
    1. Build a hypothesis around steady-state behaviour
    2. Vary real-world events

27. Laboratory
    • “Works on my machine” (or “works in stage env.”)
    Source: http://www.memegasms.com/media/created/vhyfxm.jpg

28. Four Principles of Chaos Engineering
    1. Build a hypothesis around steady-state behaviour
    2. Vary real-world events
    3. Run experiments in production

29. Four Principles of Chaos Engineering
    1. Build a hypothesis around steady-state behaviour
    2. Vary real-world events
    3. Run experiments in production
    4. Automate experiments to run continuously
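Taken together, the four principles describe a loop: measure the steady state, inject a real-world event, measure again, and repeat automatically. The following is a hedged sketch of that loop only. `run_experiment`, `run_continuously` and the `probe`/`inject`/`rollback` callables are hypothetical names; in production a scheduler and real fault injection would take the place of the plain `for` loop and the callables.

```python
def run_experiment(probe, inject, rollback, tolerance=0.05):
    """Run one chaos experiment and report whether the steady state survived.

    probe()    -> float, the steady-state metric
    inject()   -> starts the failure (the varied real-world event)
    rollback() -> stops the failure; always called, even if probing fails
    """
    before = probe()
    inject()
    try:
        during = probe()
    finally:
        rollback()
    return abs(during - before) <= tolerance * abs(before)


def run_continuously(experiments, rounds):
    """Repeat every experiment `rounds` times and collect the outcomes per name."""
    results = {name: [] for name in experiments}
    for _ in range(rounds):
        for name, (probe, inject, rollback) in experiments.items():
            results[name].append(run_experiment(probe, inject, rollback))
    return results
```

Calling `rollback` in a `finally` block is a deliberate choice: an experiment that cannot be stopped is an outage, not an experiment.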

30. Chaos Engineering Culture
    • http://principlesofchaos.com
    • More resources:
    • https://github.com/Netflix/SimianArmy
    • https://github.com/Netflix/atlas
    • https://www.youtube.com/watch?v=vq4QZ4_YDok
