SlideShare a Scribd company logo
1 of 49
Download to read offline
DEVOPS NRW
2018-02-21
HENNING JACOBS
@try_except_
Kubernetes on AWS
@ZalandoTech
2
ZALANDO IN NUMBERS
> 4.5billion EUR
2017
> 200
million
visits
per
month
> 14,000
employees in
Europe
> 70%
of visits via
mobile devices
> 22
million
active customers
> 250,000
product choices
~ 2,000
brands
15
countries
3
OUR FOOTPRINT AROUND EUROPE
as of November 2017
1
8
10
11
12
13
BERLIN HEADQUARTERS AND OUTLET
BRIESELANG FULFILLMENT CENTER
ERFURT FULFILLMENT CENTER AND TECH OFFICE
MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE
LAHR FULFILLMENT CENTER
DORTMUND TECH HUB
FRANKFURT OUTLET
DUBLIN TECH HUB
HELSINKI TECH HUB
MILAN (STRADELLA) FULFILLMENT CENTER
KÖLN OUTLET
PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER
SZCZECIN (GRYFINO) FULFILLMENT CENTER
HAMBURG ADTECH LAB
STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017)
10
9
7
6
5
3
2
1
11
12
13
4
14
15
15
14
9
8
7
6
5
4
3
2
1
4
OUR FOOTPRINT AROUND EUROPE
TECH
as of November 2017
1
8
10
11
12
13
BERLIN HEADQUARTERS AND OUTLET
BRIESELANG FULFILLMENT CENTER
ERFURT FULFILLMENT CENTER AND TECH OFFICE
MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE
LAHR FULFILLMENT CENTER
DORTMUND TECH HUB
FRANKFURT OUTLET
DUBLIN TECH HUB
HELSINKI TECH HUB
MILAN (STRADELLA) FULFILLMENT CENTER
KÖLN OUTLET
PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER
SZCZECIN (GRYFINO) FULFILLMENT CENTER
HAMBURG ADTECH LAB
STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017)
10
9
7
6
5
3
2
1
11
12
13
4
14
15
15
14
9
8
7
6
5
4
3
2
1
INCIDENTS ARE FINE
ON CALL
7
INCIDENT #1: CUSTOMER IMPACT
8
INCIDENT #1: IAM RETURNING 404
9
INCIDENT #1: NUMBER OF PODS
10
LIFE OF A REQUEST (INGRESS)
DNS
my-app.example.org
ALB
aws-1234-lb.eu-central-1.elb.amazonaws.com
SERVICE
10.3.0.216
DEPLOYMENT
POD
10.2.0.1
POD
10.2.1.1
POD
10.2.2.1
POD
10.2.3.1
SKIPPER
172.31.1.1:9999
SKIPPER
172.31.2.1:9999
SKIPPER
172.31.3.1:9999
SKIPPER
172.31.4.1:9999
ALIAS Record
11
INCIDENT #1: UNASSUMING MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
name: "foobar"
labels:
application: "foobar"
spec:
schedule: "*/15 9-19 * * Mon-Fri"
jobTemplate:
spec:
template:
metadata:
labels:
application: "foobar"
spec:
restartPolicy: Never
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
containers:
...
12
INCIDENT #1: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
name: "foobar"
labels:
application: "foobar"
spec:
schedule: "7 8-18 * * Mon-Fri"
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
jobTemplate:
spec:
activeDeadlineSeconds: 120
template:
metadata:
labels:
application: "foobar"
spec:
restartPolicy: Never
containers:
...
13
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Skipper Ingress to stay “healthy” during API server problems
• Fix Skipper Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
spec:
hard:
pods: "1500"
14
STABILITY: LIMIT RANGE
$ kubectl describe limitrange
Name: limits
Namespace: default
Type Resource Min Max Default Req Default Limit Max Limit/Request Ratio
---- -------- --- --- ----------- ------------- -----------------------
Container memory - 64Gi 100Mi 1Gi -
Container cpu - 16 100m 3 -
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources
⇒ Mitigate errors on OSI layer 8 ;-)
15
INCIDENT #2: CLUSTER DOWN
16
INCIDENT #2: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
17
INCIDENT #2: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
https://www.outcome-eng.com/human-error-never-root-cause/
19
INCIDENT #2: LESSONS LEARNED
• Disaster Recovery Plan?
• Backup etcd to S3
• Monitor the snapshots
20
INCIDENT #3: LATENCY SPIKES
21
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
SLEEPTIME=60
while true; do
echo "sleep for $SLEEPTIME seconds"
sleep $SLEEPTIME
timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
if [ $? -eq 0 ]; then
echo "all fine, no need to restart etcd member"
continue
else
echo "restarting etcd-member"
systemctl restart etcd-member
fi
done
22
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
23
INCIDENT #3: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
24
INCIDENT #4: IMPACT
25
INCIDENT #4: CLUSTER DOWN?
26
INCIDENT #4: THE TRIGGER
https://www.outcome-eng.com/human-error-never-root-cause/
28
CLUSTER
UPDATES
29
CLUSTER LIFECYCLE MANAGER
30
INCIDENT #4: LESSONS LEARNED
• Automated end-to-end tests are pretty good, but not enough
• Test the diff/migration automatically
• Bootstrap new cluster with the previous configuration
• Apply new configuration
• Run end-to-end & conformance tests
31
TRAFFIC SWITCHING
Default deployment: rolling update
32
TRAFFIC SWITCHING
33
MANUAL TRAFFIC SWITCHING
$ zkubectl traffic <ingress-name>
SERVICE WEIGHT
<service-backend-1> 30%
<service-backend-2> 70%
$ zkubectl traffic <ingress-name> <service-backend-2> 100
SERVICE WEIGHT
<service-backend-1> 0%
<service-backend-2> 100%
34
DOCUMENTATION
"Documentation is hard to find"
"Documentation is not comprehensive enough"
"Remove unnecessary complexity and obstacles."
"Get the documentation up to date and prepare
use cases"
"More and more clear documentation"
"More detailed docs, example repos with more
complicated deployments."
35
DOCUMENTATION
• Restructure following
https://www.divio.com/en/blog/documentation/
• Concepts
• How Tos
• Tutorials
• Reference
• Global Search
• Weekly Health Check: Support → Documentation
36
ONBOARDING
• Many new concepts to grasp vs. 200+ teams
• Kubernetes Training (2h)
• Documentation
• Recorded Friday Demos
• Support Channels (chat, mail)
37
CLUSTER SCOPE
38
DOES ANYTHING EVEN WORK?
• Kubernetes API as the primary interface
• Ingress Controller + External DNS
• kube2iam
• Cluster lifecycle management
• Zalando IAM/OAuth integration via CRD
• PostgreSQL Operator
39
DEPLOYMENT
40
DEPLOYMENT CONFIGURATION
.
├── deploy/apply
│ ├── deployment.yaml # K8s Deployment
│ ├── credentials.yaml # K8s TPR
│ ├── ingress.yaml # K8s Ingress
│ └── service.yaml # K8s Service
└── delivery.yaml # pipeline config
41
INGRESS.YAML
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: "..."
spec:
rules:
# DNS name your application should be exposed on
- host: "myapp.foo.example.org"
http:
paths:
- backend:
serviceName: "myapp"
servicePort: 80
42
CONTINUOUS DELIVERY PLATFORM
43
CDP: APPLY
44
CDP: OPTIONAL APPROVAL
45
POSTGRES OPERATOR
Application to manage PostgreSQL clusters
Observes “postgres” manifests (CRDs)
Spawns and modifies new clusters
Syncs and provisions roles
Handles volume resize, incl. Resize2fs
Also responsible for updating Docker images
https://github.com/hjacobs/kube-ops-view
47
LINKS
Running Kubernetes in Production on AWS
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html
Kube AWS Ingress Controller
https://github.com/zalando-incubator/kube-ingress-aws-controller
External DNS
https://github.com/kubernetes-incubator/external-dns
PostgreSQL Operator
https://github.com/zalando-incubator/postgres-operator
Zalando Cluster Configuration
https://github.com/zalando-incubator/kubernetes-on-aws
List of Organizations using Kubernetes on AWS
https://github.com/hjacobs/kubernetes-on-aws-users
https://goo.gl/t2zNc8
QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k

More Related Content

What's hot

From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupFrom AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupHenning Jacobs
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Henning Jacobs
 
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...Henning Jacobs
 
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDocker, Inc.
 
Kernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSKernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSDocker, Inc.
 
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09Henning Jacobs
 
Kubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee PresentationKubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee PresentationHenning Jacobs
 
Docker and Windows: The State of the Union
Docker and Windows: The State of the UnionDocker and Windows: The State of the Union
Docker and Windows: The State of the UnionElton Stoneman
 
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipelineKubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipelineKubeAcademy
 
[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson LinHanLing Shen
 
KubeCon EU 2016: Transforming the Government
KubeCon EU 2016: Transforming the Government KubeCon EU 2016: Transforming the Government
KubeCon EU 2016: Transforming the Government KubeAcademy
 
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees) Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees) Publicis Sapient Engineering
 
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Henning Jacobs
 
KubeCon EU 2016 Keynote: Pushing Kubernetes Forward
KubeCon EU 2016 Keynote: Pushing Kubernetes ForwardKubeCon EU 2016 Keynote: Pushing Kubernetes Forward
KubeCon EU 2016 Keynote: Pushing Kubernetes ForwardKubeAcademy
 
KubeCon EU 2016: Heroku to Kubernetes
KubeCon EU 2016: Heroku to KubernetesKubeCon EU 2016: Heroku to Kubernetes
KubeCon EU 2016: Heroku to KubernetesKubeAcademy
 
Docker 進階實務班
Docker 進階實務班Docker 進階實務班
Docker 進階實務班Philip Zheng
 
Docker for PHP Developers - Jetbrains
Docker for PHP Developers - JetbrainsDocker for PHP Developers - Jetbrains
Docker for PHP Developers - JetbrainsChris Tankersley
 
Java Day Kharkiv - Next-gen engineering with Docker and Kubernetes
Java Day Kharkiv - Next-gen engineering with Docker and KubernetesJava Day Kharkiv - Next-gen engineering with Docker and Kubernetes
Java Day Kharkiv - Next-gen engineering with Docker and KubernetesAntons Kranga
 
Pod Sandbox workflow creation from Dockershim
Pod Sandbox workflow creation from DockershimPod Sandbox workflow creation from Dockershim
Pod Sandbox workflow creation from DockershimVictor Morales
 
Docker All The Things - ASP.NET 4.x and Windows Server Containers
Docker All The Things - ASP.NET 4.x and Windows Server ContainersDocker All The Things - ASP.NET 4.x and Windows Server Containers
Docker All The Things - ASP.NET 4.x and Windows Server ContainersAnthony Chu
 

What's hot (20)

From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupFrom AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
 
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
 
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaSDockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
DockerCon EU 2015: The Glue is the Hard Part: Making a Production-Ready PaaS
 
Kernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVSKernel load-balancing for Docker containers using IPVS
Kernel load-balancing for Docker containers using IPVS
 
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
Kubernetes on AWS @Zalando - Berlin AWS User Group 2017-05-09
 
Kubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee PresentationKubernetes at Zalando - CNCF End User Committee Presentation
Kubernetes at Zalando - CNCF End User Committee Presentation
 
Docker and Windows: The State of the Union
Docker and Windows: The State of the UnionDocker and Windows: The State of the Union
Docker and Windows: The State of the Union
 
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipelineKubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
KubeCon EU 2016: Leveraging ephemeral namespaces in a CI/CD pipeline
 
[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin
 
KubeCon EU 2016: Transforming the Government
KubeCon EU 2016: Transforming the Government KubeCon EU 2016: Transforming the Government
KubeCon EU 2016: Transforming the Government
 
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees) Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
Paris Container Day 2016 : Orchestrating Continuous Delivery (CloudBees)
 
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
 
KubeCon EU 2016 Keynote: Pushing Kubernetes Forward
KubeCon EU 2016 Keynote: Pushing Kubernetes ForwardKubeCon EU 2016 Keynote: Pushing Kubernetes Forward
KubeCon EU 2016 Keynote: Pushing Kubernetes Forward
 
KubeCon EU 2016: Heroku to Kubernetes
KubeCon EU 2016: Heroku to KubernetesKubeCon EU 2016: Heroku to Kubernetes
KubeCon EU 2016: Heroku to Kubernetes
 
Docker 進階實務班
Docker 進階實務班Docker 進階實務班
Docker 進階實務班
 
Docker for PHP Developers - Jetbrains
Docker for PHP Developers - JetbrainsDocker for PHP Developers - Jetbrains
Docker for PHP Developers - Jetbrains
 
Java Day Kharkiv - Next-gen engineering with Docker and Kubernetes
Java Day Kharkiv - Next-gen engineering with Docker and KubernetesJava Day Kharkiv - Next-gen engineering with Docker and Kubernetes
Java Day Kharkiv - Next-gen engineering with Docker and Kubernetes
 
Pod Sandbox workflow creation from Dockershim
Pod Sandbox workflow creation from DockershimPod Sandbox workflow creation from Dockershim
Pod Sandbox workflow creation from Dockershim
 
Docker All The Things - ASP.NET 4.x and Windows Server Containers
Docker All The Things - ASP.NET 4.x and Windows Server ContainersDocker All The Things - ASP.NET 4.x and Windows Server Containers
Docker All The Things - ASP.NET 4.x and Windows Server Containers
 

Similar to Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW

Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Henning Jacobs
 
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECHZalando adtech lab
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB
 
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Henning Jacobs
 
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Henning Jacobs
 
Automatic Ingress in Kubernetes
Automatic Ingress in KubernetesAutomatic Ingress in Kubernetes
Automatic Ingress in KubernetesRodrigo Reis
 
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinWhy I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinHenning Jacobs
 
Kubernetes Deployments: A "Hands-off" Approach
Kubernetes Deployments: A "Hands-off" ApproachKubernetes Deployments: A "Hands-off" Approach
Kubernetes Deployments: A "Hands-off" ApproachRodrigo Reis
 
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHow Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHenning Jacobs
 
Kubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueKubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueHenning Jacobs
 
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...Agile Testing Alliance
 
Minikube – get Connections in the smalles possible setup
Minikube – get Connections in the smalles possible setupMinikube – get Connections in the smalles possible setup
Minikube – get Connections in the smalles possible setupMartin Schmidt
 
A hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackA hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackQAware GmbH
 
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17Mario-Leander Reimer
 
Consul administration at scale
Consul administration at scaleConsul administration at scale
Consul administration at scalePierre Souchay
 
Kubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechKubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechMichael Dürgner
 
'DOCKER' & CLOUD: ENABLERS For DEVOPS
'DOCKER' & CLOUD:  ENABLERS For DEVOPS'DOCKER' & CLOUD:  ENABLERS For DEVOPS
'DOCKER' & CLOUD: ENABLERS For DEVOPSACA IT-Solutions
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITStijn Wijndaele
 
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Threestackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick ThreeNETWAYS
 

Similar to Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW (20)

Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
 
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
12.07.2017 Docker Meetup - KUBERNETES ON AWS @ ZALANDO TECH
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
 
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
 
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
 
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
 
Automatic Ingress in Kubernetes
Automatic Ingress in KubernetesAutomatic Ingress in Kubernetes
Automatic Ingress in Kubernetes
 
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinWhy I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin
 
Kubernetes Deployments: A "Hands-off" Approach
Kubernetes Deployments: A "Hands-off" ApproachKubernetes Deployments: A "Hands-off" Approach
Kubernetes Deployments: A "Hands-off" Approach
 
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHow Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
 
Kubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueKubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native Prague
 
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
#Interactive Session by Kirti Ranjan Satapathy and Nandini K, "Elements of Qu...
 
Minikube – get Connections in the smalles possible setup
Minikube – get Connections in the smalles possible setupMinikube – get Connections in the smalles possible setup
Minikube – get Connections in the smalles possible setup
 
A hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stackA hitchhiker‘s guide to the cloud native stack
A hitchhiker‘s guide to the cloud native stack
 
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
A Hitchhiker’s Guide to the Cloud Native Stack. #CDS17
 
Consul administration at scale
Consul administration at scaleConsul administration at scale
Consul administration at scale
 
Kubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando TechKubernetes on AWS @ Zalando Tech
Kubernetes on AWS @ Zalando Tech
 
'DOCKER' & CLOUD: ENABLERS For DEVOPS
'DOCKER' & CLOUD:  ENABLERS For DEVOPS'DOCKER' & CLOUD:  ENABLERS For DEVOPS
'DOCKER' & CLOUD: ENABLERS For DEVOPS
 
Docker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-ITDocker and Cloud - Enables for DevOps - by ACA-IT
Docker and Cloud - Enables for DevOps - by ACA-IT
 
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Threestackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
 

More from Henning Jacobs

Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Henning Jacobs
 
Kubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe BarcelonaKubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe BarcelonaHenning Jacobs
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
 
Kubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion PlatformKubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion PlatformHenning Jacobs
 
Plan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuthPlan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuthHenning Jacobs
 
Docker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando IntroDocker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando IntroHenning Jacobs
 
STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015Henning Jacobs
 
Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015Henning Jacobs
 
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...Henning Jacobs
 

More from Henning Jacobs (10)

Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019
 
Kubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe BarcelonaKubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe Barcelona
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
Kubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion PlatformKubernetes on AWS at Europe's Leading Online Fashion Platform
Kubernetes on AWS at Europe's Leading Online Fashion Platform
 
Plan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuthPlan B: Service to Service Authentication with OAuth
Plan B: Service to Service Authentication with OAuth
 
Docker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando IntroDocker Berlin Meetup Nov 2015: Zalando Intro
Docker Berlin Meetup Nov 2015: Zalando Intro
 
STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015STUPS @ AWS Enterprise Web Day Oktober 2015
STUPS @ AWS Enterprise Web Day Oktober 2015
 
Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015Python at Zalando Technology @ Python Users Berlin Meetup September 2015
Python at Zalando Technology @ Python Users Berlin Meetup September 2015
 
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
STUPS by Zalando @WHD.local Frankfurt: STUPS.io - an Open Source Cloud Framew...
 

Recently uploaded

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW

  • 2. 2 ZALANDO IN NUMBERS > 4.5billion EUR 2017 > 200 million visits per month > 14,000 employees in Europe > 70% of visits via mobile devices > 22 million active customers > 250,000 product choices ~ 2,000 brands 15 countries
  • 3. 3 OUR FOOTPRINT AROUND EUROPE as of November 2017 1 8 10 11 12 13 BERLIN HEADQUARTERS AND OUTLET BRIESELANG FULFILLMENT CENTER ERFURT FULFILLMENT CENTER AND TECH OFFICE MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE LAHR FULFILLMENT CENTER DORTMUND TECH HUB FRANKFURT OUTLET DUBLIN TECH HUB HELSINKI TECH HUB MILAN (STRADELLA) FULFILLMENT CENTER KÖLN OUTLET PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER SZCZECIN (GRYFINO) FULFILLMENT CENTER HAMBURG ADTECH LAB STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017) 10 9 7 6 5 3 2 1 11 12 13 4 14 15 15 14 9 8 7 6 5 4 3 2 1
  • 4. 4 OUR FOOTPRINT AROUND EUROPE TECH as of November 2017 1 8 10 11 12 13 BERLIN HEADQUARTERS AND OUTLET BRIESELANG FULFILLMENT CENTER ERFURT FULFILLMENT CENTER AND TECH OFFICE MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE LAHR FULFILLMENT CENTER DORTMUND TECH HUB FRANKFURT OUTLET DUBLIN TECH HUB HELSINKI TECH HUB MILAN (STRADELLA) FULFILLMENT CENTER KÖLN OUTLET PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER SZCZECIN (GRYFINO) FULFILLMENT CENTER HAMBURG ADTECH LAB STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017) 10 9 7 6 5 3 2 1 11 12 13 4 14 15 15 14 9 8 7 6 5 4 3 2 1
  • 8. 8 INCIDENT #1: IAM RETURNING 404
  • 10. 10 LIFE OF A REQUEST (INGRESS) DNS my-app.example.org ALB aws-1234-lb.eu-central-1.elb.amazonaws.com SERVICE 10.3.0.216 DEPLOYMENT POD 10.2.0.1 POD 10.2.1.1 POD 10.2.2.1 POD 10.2.3.1 SKIPPER 172.31.1.1:9999 SKIPPER 172.31.2.1:9999 SKIPPER 172.31.3.1:9999 SKIPPER 172.31.4.1:9999 ALIAS Record
  • 11. 11 INCIDENT #1: UNASSUMING MANIFEST apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" labels: application: "foobar" spec: schedule: "*/15 9-19 * * Mon-Fri" jobTemplate: spec: template: metadata: labels: application: "foobar" spec: restartPolicy: Never concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 containers: ...
  • 12. 12 INCIDENT #1: FIXED CRON JOB apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" labels: application: "foobar" spec: schedule: "7 8-18 * * Mon-Fri" concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 jobTemplate: spec: activeDeadlineSeconds: 120 template: metadata: labels: application: "foobar" spec: restartPolicy: Never containers: ...
  • 13. 13 INCIDENT #1: LESSONS LEARNED • ALB routes traffic to ALL hosts if all hosts report “unhealthy” • Fix Skipper Ingress to stay “healthy” during API server problems • Fix Skipper Ingress to retain last known set of routes • Use quota for number of pods apiVersion: v1 kind: ResourceQuota metadata: name: compute-resources spec: hard: pods: "1500"
  • 14. 14 STABILITY: LIMIT RANGE $ kubectl describe limitrange Name: limits Namespace: default Type Resource Min Max Default Req Default Limit Max Limit/Request Ratio ---- -------- --- --- ----------- ------------- ----------------------- Container memory - 64Gi 100Mi 1Gi - Container cpu - 16 100m 3 - http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources ⇒ Mitigate errors on OSI layer 8 ;-)
  • 16. 16 INCIDENT #2: MANUAL OPERATION % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
  • 17. 17 INCIDENT #2: RTFM % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix help: etcdctl del [options] <key> [range_end]
  • 19. 19 INCIDENT #2: LESSONS LEARNED • Disaster Recovery Plan? • Backup etcd to S3 • Monitor the snapshots
  • 21. 21 INCIDENT #3: STOP THE BLEEDING #!/bin/bash SLEEPTIME=60 while true; do echo "sleep for $SLEEPTIME seconds" sleep $SLEEPTIME timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null if [ $? -eq 0 ]; then echo "all fine, no need to restart etcd member" continue else echo "restarting etcd-member" systemctl restart etcd-member fi done
  • 22. 22 INCIDENT #3: CONFIRMATION FROM AWS [...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]
  • 23. 23 INCIDENT #3: LESSONS LEARNED • It's never the AWS infrastructure until it is • Treat t2 instances with care • Kubernetes components are not necessarily "cloud native" Cloud Native? Declarative, dynamic, resilient, and scalable
  • 30. 30 INCIDENT #4: LESSONS LEARNED • Automated end-to-end tests are pretty good, but not enough • Test the diff/migration automatically • Bootstrap new cluster with the previous configuration • Apply new configuration • Run end-to-end & conformance tests
  • 33. 33 MANUAL TRAFFIC SWITCHING $ zkubectl traffic <ingress-name> SERVICE WEIGHT <service-backend-1> 30% <service-backend-2> 70% $ zkubectl traffic <ingress-name> <service-backend-2> 100 SERVICE WEIGHT <service-backend-1> 0% <service-backend-2> 100%
  • 34. 34 DOCUMENTATION "Documentation is hard to find" "Documentation is not comprehensive enough" "Remove unnecessary complexity and obstacles." "Get the documentation up to date and prepare use cases" "More and more clear documentation" "More detailed docs, example repos with more complicated deployments."
  • 35. 35 DOCUMENTATION • Restructure following https://www.divio.com/en/blog/documentation/ • Concepts • How Tos • Tutorials • Reference • Global Search • Weekly Health Check: Support → Documentation
  • 36. 36 ONBOARDING • Many new concepts to grasp vs. 200+ teams • Kubernetes Training (2h) • Documentation • Recorded Friday Demos • Support Channels (chat, mail)
  • 38. 38 DOES ANYTHING EVEN WORK? • Kubernetes API as the primary interface • Ingress Controller + External DNS • kube2iam • Cluster lifecycle management • Zalando IAM/OAuth integration via CRD • PostgreSQL Operator
  • 40. 40 DEPLOYMENT CONFIGURATION . ├── deploy/apply │ ├── deployment.yaml # K8s Deployment │ ├── credentials.yaml # K8s TPR │ ├── ingress.yaml # K8s Ingress │ └── service.yaml # K8s Service └── delivery.yaml # pipeline config
  • 41. 41 INGRESS.YAML apiVersion: extensions/v1beta1 kind: Ingress metadata: name: "..." spec: rules: # DNS name your application should be exposed on - host: "myapp.foo.example.org" http: paths: - backend: serviceName: "myapp" servicePort: 80
  • 45. 45 POSTGRES OPERATOR Application to manage PostgreSQL clusters Observes “postgres” manifests (CRDs) Spawns and modifies new clusters Syncs and provisions roles Handles volume resize, incl. Resize2fs Also responsible for updating Docker images
  • 47. 47 LINKS Running Kubernetes in Production on AWS http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html Kube AWS Ingress Controller https://github.com/zalando-incubator/kube-ingress-aws-controller External DNS https://github.com/kubernetes-incubator/external-dns PostgreSQL Operator https://github.com/zalando-incubator/postgres-operator Zalando Cluster Configuration https://github.com/zalando-incubator/kubernetes-on-aws List of Organizations using Kubernetes on AWS https://github.com/hjacobs/kubernetes-on-aws-users
  • 49. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k