This is the short story of a bug in one of our Go services that cut off some of the traffic to one of our production Kubernetes clusters running on AWS. More than that, it is about how we made conceptually similar mistakes before, and why thinking about failures and the famous "fallacies of distributed computing" is key to developing infrastructure components.
In this talk, we will walk you through some of those problems, illustrate some interesting details of Kubernetes and AWS, and hopefully help you avoid making the same mistakes.
3. ZALANDO
15 markets
6 fulfillment centers
21 million active customers
3.6 billion € net sales 2016
200 million visits per month
13,000 employees in Europe
15. WHAT REALLY HAPPENED
• All routes removed:
• No routes to the applications deployed inside the cluster
• Healthcheck reported “unhealthy” because there was no connection to the API server
• => All nodes were marked unhealthy in the ELBv2
16. WHAT HAPPENS WHEN ALL THE TARGETS IN AN ELBv2 ARE UNHEALTHY?
Photo by Sandro Katalina on Unsplash
22. WE ARE NOT ALONE
• … a test that simulated the failure of a single apiserver node disrupted the cluster in a way that negatively impacted the availability of running workloads
• … helped us identify that the disruption was likely related to an interaction between the various clients that connect to the Kubernetes apiserver (like calico-agent, kubelet, kube-proxy, and kube-controller-manager) and our internal load balancer’s behavior during an apiserver node failure.
• Source: Kubernetes at GitHub
24. HOW WE FIXED IT
• Do not change the healthcheck result when the API server fails
• Do not drop the routes when the API server fails
• => Delete only when you are really sure you want to delete!
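The two rules above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the code of the actual service: `Route`, `RouteSource`, `RouteTable` and `errAPIUnavailable` are hypothetical names. The idea is that a failed fetch from the API server never replaces the last known good routes, and that the healthcheck depends on having routes to serve rather than on the API server being reachable right now.

```go
package main

import (
	"errors"
	"sync"
)

// errAPIUnavailable is a hypothetical error standing in for
// "could not reach the API server".
var errAPIUnavailable = errors.New("api server unavailable")

// Route is a hypothetical, simplified routing entry.
type Route struct {
	Host    string
	Backend string
}

// RouteSource is a hypothetical abstraction over
// "list the current routes from the API server".
type RouteSource func() ([]Route, error)

// RouteTable holds the last successfully fetched routes.
type RouteTable struct {
	mu     sync.RWMutex
	routes []Route
	synced bool // true after the first successful fetch
}

// Sync replaces the routes only when the fetch succeeded.
// On error the previous routes are kept untouched.
func (t *RouteTable) Sync(fetch RouteSource) error {
	routes, err := fetch()
	if err != nil {
		return err // keep serving the last known good routes
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	t.routes = routes
	t.synced = true
	return nil
}

// Routes returns a copy of the routes currently being served.
func (t *RouteTable) Routes() []Route {
	t.mu.RLock()
	defer t.mu.RUnlock()
	out := make([]Route, len(t.routes))
	copy(out, t.routes)
	return out
}

// Healthy reports whether routes have been loaded at least once.
// It deliberately ignores later API server failures.
func (t *RouteTable) Healthy() bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.synced
}
```

In this sketch a successful but empty response still replaces the routes. A stricter variant could require several consecutive empty responses before deleting anything.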
26. 8 FALLACIES OF DISTRIBUTED COMPUTING
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn't change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.
28. THE FALLACIES OF CLOUD COMPUTING
• The API call you will make will succeed.
• The next API call you will make will succeed.
• Deleting resources is the same as adding new ones.
• Your cloud provider will have no outages.
• The dependencies between your services are clear.
30. WHEN MAKING API CALLS
• Every API call can fail
• Retry (with backoff)
• Circuit breakers
• Fallbacks
• Don’t scale down / delete resources fast!
• Deal with rate limiting
• Deal with “weird” values due to a broken cloud provider feature
31. TEST ALL THE THINGS
• Continuous integration tests
• Continuous deployment of cluster updates
• Load tests
• Chaos tests
32. CONTINUOUS INTEGRATION TESTS
• Test the interactions between components
• For every configuration change we run extensive e2e tests
35. LOAD TESTING
• Lots of requests to the API server
• Lots of pods running
• Writes/reads to the data storage (etcd)
• => What matters: observe the impact on running applications
36. CHAOS TESTING
• Random shutdown of Kubernetes components
• https://github.com/linki/chaoskube
• http://chaostoolkit.org/
• https://github.com/asobti/kube-monkey
• http://principlesofchaos.org
• Random shutdown of nodes (EC2 Instances)
• https://github.com/Netflix/chaosmonkey
37. MORE ON CHAOS TESTING
• Netflix’s principles of Chaos Engineering
• http://principlesofchaos.org
• Chaos Engineering free ebook ->
38. THAT’S NOT ALL
• You think Kubernetes the hard way is hard
• The hard part was never only the setup
• Sometimes you will have to break things to learn
• … set up a healthy post-mortem culture and learn from mistakes!