Chaos Engineering with Kubernetes - Berlin / Hamburg Chaos Engineering Meetups - September 2018

@ana_m_medina
Chaos Engineering with
Kubernetes
Ana Medina
Chaos Engineer at Gremlin
Chaos Engineering Meetups in Berlin & Hamburg
September 12/13, 2018

Ana Medina
I work at Gremlin as a Chaos Engineer.
I was a Software Engineer at Uber for two years in Site
Reliability Engineering doing Chaos Engineering and Cloud
Computing. Previously worked/interned at SFEFCU,
Google, Quicken Loans, Stanford University and Miami
Dade College.
I’m from Costa Rica and Nicaragua and currently living in
San Francisco, CA.
College dropout.
Self taught engineer.

Gremlin
Chaos Engineer /ˈkāˌäs ˌenjəˈnir/
noun
1. a person helping companies avoid
outages by running proactive chaos
engineering experiments.
gremlin.com
@gremlininc

Gremlin
Failure as a Service.
Finds weaknesses in your system
before they cause problems.
Run Gremlin Agents on Hosts or
Containers. Schedule attacks using
UI, API and CLI.
Provides 11 attacks out of the box
gremlin.com
@gremlininc

How many of you have heard
of Chaos Engineering?

Introduction to Chaos Engineering

Chaos Engineering?
Thoughtful, planned
experiments designed to reveal
the weaknesses in our systems.

Like a vaccine, we inject harm
to build immunity.

What Chaos Engineering is
● Thoughtful chaos engineering experiments
● Controlled and planned chaos engineering experiments
● Preparing for unpredictable failure
● Preparing engineers for failure
● Preparing for GameDay
● A way to improve SLA
○ fortify systems
○ build and move fast
○ build confidence in systems
○ reveal weak points in your systems
○ build assurance that you can still serve your customers

What Chaos Engineering is not
● Random chaos engineering experiments
● Unsupervised chaos engineering experiments
● Unmonitored chaos engineering experiments
● Unexpected chaos engineering experiments
● Breaking production by accident
● Creating outages

Why do Chaos Engineering
● Microservice Architecture is tricky
● Our systems are scaling fast
● Services will fail
● Dependencies on other companies will fail
● Prepares for real world scenarios
● Reduce the number of outages, reduce downtime, lose less money

What do you need before doing Chaos
Engineering?
○ Monitoring / Observability
○ On Call / Incident Management
○ Alerting and paging
○ Clear instructions on how to roll back an experiment
○ The cost of downtime per hour

Injecting chaos in a controlled way will
lead to engineers building resilient
systems.

Inject Failure at any level
● Application
● API
● Caching
● Database
● Hardware
● Cloud Infrastructure / Bare metal

Top places to inject chaos

Getting Started:
1. Identify top 5 critical systems
2. Choose system
3. Whiteboard the system
4. Determine what experiment you want to run: (resource,
state, network)
5. Determine Blast Radius
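
The steps above can be written down as a small experiment record before anything is attacked. This is a minimal sketch, not a Gremlin API: the class name, the field names, and the extra hypothesis and rollback fields are assumptions added for illustration, and the service name is hypothetical.

```python
from dataclasses import dataclass
from typing import List

# The three experiment types named in step 4.
ATTACK_LAYERS = {"resource", "state", "network"}


@dataclass
class ExperimentPlan:
    """Hypothetical record for one chaos experiment (illustrative only)."""
    system: str        # step 2: the system chosen from the critical list
    attack_layer: str  # step 4: resource, state, or network
    blast_radius: str  # step 5: what the experiment is allowed to touch
    hypothesis: str    # assumption: what we expect to observe
    rollback: str      # assumption: how to stop the experiment

    def problems(self) -> List[str]:
        """Return reasons the plan is not ready to run (empty if ready)."""
        issues = []
        if self.attack_layer not in ATTACK_LAYERS:
            issues.append(f"unknown attack layer: {self.attack_layer}")
        for name in ("system", "blast_radius", "hypothesis", "rollback"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        return issues


plan = ExperimentPlan(
    system="checkout-service",  # hypothetical service name
    attack_layer="resource",
    blast_radius="one pod in staging",
    hypothesis="CPU alert fires and the service keeps serving requests",
    rollback="pkill -u chaos",
)
print(plan.problems())
```

An empty list means every field is filled in and the attack layer is one of the three from step 4.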

Companies doing Chaos Engineering

Chaos Engineering on K8s

Create a K8s cluster + install Weave Net & demo app
Visit link here:
bit.ly/2OegimQ

What could go wrong?

In groups of 3-5, discuss which Chaos
Engineering experiments you would
perform

Experiments
Run one or more attacks to form an experiment.
This can include attacks on the resource, state or network layer.

Chaos Engineering helps you test
monitoring tools, metrics, dashboards,
alerts, and thresholds.

The “Hello World” of Chaos Engineering: CPU Attack
Run a CPU attack as the chaos user, then watch ‘top’ to see the attack in action.
Run ‘pkill -u chaos’ to stop the attack.
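
A CPU attack of this kind can be approximated with a few lines of Python. This is a minimal sketch under stated assumptions, not Gremlin's implementation: the names BURN_SCRIPT and cpu_attack are hypothetical, and the attack is time-boxed so that it stops on its own even if nobody runs pkill.

```python
import subprocess
import sys
from typing import List

# Child script: spin in a busy loop until the deadline passed as argv[1].
BURN_SCRIPT = (
    "import sys, time\n"
    "deadline = time.monotonic() + float(sys.argv[1])\n"
    "while time.monotonic() < deadline:\n"
    "    pass\n"
)


def cpu_attack(cores: int, seconds: float) -> List[int]:
    """Start one busy-loop process per core and return their exit codes."""
    workers = [
        subprocess.Popen([sys.executable, "-c", BURN_SCRIPT, str(seconds)])
        for _ in range(cores)
    ]
    exit_codes = []
    for worker in workers:
        try:
            # Safety net: never wait much longer than the planned duration.
            exit_codes.append(worker.wait(timeout=seconds + 10))
        except subprocess.TimeoutExpired:
            worker.kill()
            exit_codes.append(worker.wait())
    return exit_codes
```

Calling cpu_attack(cores=1, seconds=30) while logged in as the chaos user keeps one core busy for 30 seconds. The worker processes inherit that user, so ‘pkill -u chaos’ still stops them early.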

Resource Chaos Engineering:
Maximizing Utilization of Resources such as CPU, Disk, IO and
Memory
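
The same idea applies to memory. Below is a rough sketch, assuming a hypothetical memory_attack helper and 4 KiB pages; it allocates up to a fixed cap, holds the memory for a while, and always releases it afterwards.

```python
import time

PAGE_SIZE = 4096  # assumption: 4 KiB pages
MIB = 1024 * 1024


def memory_attack(megabytes: int, hold_seconds: float, chunk_mb: int = 16) -> int:
    """Allocate `megabytes` MiB in chunks, hold them, then release them.

    Returns the number of MiB that were allocated.
    """
    chunks = []
    allocated = 0
    try:
        while allocated < megabytes:
            size = min(chunk_mb, megabytes - allocated)
            chunk = bytearray(size * MIB)
            # Write one byte per page so the memory is actually resident.
            for offset in range(0, len(chunk), PAGE_SIZE):
                chunk[offset] = 1
            chunks.append(chunk)
            allocated += size
        time.sleep(hold_seconds)
        return allocated
    finally:
        # Rollback: drop every reference so the memory can be freed.
        chunks.clear()
```

The cap is the blast radius here: start with a small value such as memory_attack(256, 30) and raise it only once monitoring shows the attack as expected.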

State Chaos Engineering:
killing a process
loop kill a process
spawn new processes
shutting down self
killing pods
shut down a container from a host
shut down various containers from a container
shut down several containers from several containers
time travel: a node's clock getting out of sync
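
Killing pods is the easiest state attack to try by hand on Kubernetes. The sketch below only builds the kubectl command for one randomly chosen pod; the helper names, pod names, and namespace are hypothetical, and running the command is left to the operator.

```python
import random
from typing import List, Optional


def pick_victim(pods: List[str], rng: Optional[random.Random] = None) -> str:
    """Choose one pod name at random from the candidates."""
    if not pods:
        raise ValueError("no candidate pods: nothing to attack")
    return (rng or random).choice(pods)


def delete_pod_command(pod: str, namespace: str = "default") -> List[str]:
    """Build the kubectl invocation that deletes a single pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


# Hypothetical pod names; in practice they would come from `kubectl get pods`.
candidates = ["front-end-1", "front-end-2", "front-end-3"]
victim = pick_victim(candidates, random.Random(0))
print(" ".join(delete_pod_command(victim, namespace="demo")))
```

Keeping the blast radius at one pod makes the hypothesis easy to check: if the pod belongs to a Deployment, a replacement should be scheduled and the service should keep answering.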

Network Chaos Engineering:
Blackholing all traffic for Kubernetes
Blackholing all traffic for certain pods
DNS chaos
Latency
Packet Loss
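
On a Linux node, latency and packet loss can be injected with the tc netem queueing discipline. This sketch only builds the command lines, one to apply the attack and one to roll it back. The function name and the interface name are assumptions, and running the commands requires root and the iproute2 tools.

```python
from typing import List, Tuple


def netem_commands(
    interface: str, delay_ms: int = 0, loss_percent: float = 0.0
) -> Tuple[List[str], List[str]]:
    """Return (apply, rollback) tc commands for a latency/packet-loss attack."""
    if delay_ms <= 0 and loss_percent <= 0:
        raise ValueError("set delay_ms, loss_percent, or both")
    apply = ["tc", "qdisc", "add", "dev", interface, "root", "netem"]
    if delay_ms > 0:
        apply += ["delay", f"{delay_ms}ms"]
    if loss_percent > 0:
        apply += ["loss", f"{loss_percent:g}%"]
    # Rollback removes the netem qdisc and restores normal traffic.
    rollback = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    return apply, rollback


apply_cmd, rollback_cmd = netem_commands("eth0", delay_ms=100, loss_percent=5)
print(" ".join(apply_cmd))
print(" ".join(rollback_cmd))
```

Building the rollback command together with the attack keeps the instructions for stopping the experiment at hand before it starts.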

Tools you can use
Kubernetes Edition

Kube-Monkey
An implementation of Netflix's Chaos Monkey for Kubernetes
clusters
github.com/asobti/kube-monkey

PowerfulSeal
A powerful testing tool for Kubernetes clusters.
github.com/bloomberg/powerfulseal

Litmus
Litmus is chaos engineering for stateful workloads on
Kubernetes.
github.com/openebs/litmus

Gremlin
Downtime is expensive and damages customer trust.
Gremlin’s Failure as a Service finds weaknesses in your
system before they cause problems.
gremlin.com/demo

Where to learn more?
Principles of Chaos
Chaos Engineering Book
Awesome Chaos Engineering GitHub Repo
Chaos Engineering Slack gremlin.com/slack
Gremlin Community
Netflix - Chaos Kong
Chaos Engineering Meetups
Learn More about SEV
Learn More about GameDays

Join Slack:
www.gremlin.com/slack
Check out the Gremlin Community:
www.gremlin.com/community

Q&A
Any Questions?

some questions + give feedback:
bit.ly/ce-meetup-sept-qs
THANKS!
i have stickers!
@ana_m_medina
