Chris Homer - Moving the entire stack to k8s within a year – lessons learned
1. Moving Our Entire Stack to K8S Within a Year - 7 Lessons Learned
October 12, 2018
2. Chris Homer
Co-Founder & CTO at thredUP
● Largest Consignment Store
● $130M+ invested
● 1000+ employees
● 4 distribution centers
● Kiev & SF Engineering Offices
● We’re Hiring!
Co-Founder & CTO at thredUP
Solution Specialist at Microsoft
Princeton University & Harvard Business School
Chris Homer - @chrishomer
4.
The thredUP Marketplace
● Convenient Pre-Paid Bag
● Earn Cash or Donate
● Do Good
● Amazing prices
● Wide assortment
● Fresh selection every day
8.
Infrastructure Timeline
A little history of our journey towards the promised land
2009, 2010, 2014, 2015, 2016, 2017, 2018, 2019, ...
● Slicehost
● Manual Config
● Capistrano Deploy
● Manual Tests
● AWS Hosted
● Manually Saved AMIs
● Staging & Dev - cleansed prod copy
● “Outsourcing DevOps”
● Back to Chef
● “Microservices”
● Hand-crafted Staging
● Chef
● Ansible all the things
● “Insourcing DevOps”
● Back to Ansible - One Source of Truth
● Infrastructure Team
● DevOps is about Culture
● Security Assessment
● Docker & ECS “Attempt”
● K8S Migration Begins
● Terraform
● Ansible Hardening
● Dynamic Staging
● Service Mesh
● DevSecOps
9.
The Current Infrastructure Stack
After the migration, the picture is getting clearer and more rational
[Diagram: the current infrastructure stack across the prod, staging and dev environments]
10.
Why Docker & Kubernetes?
● Obviously because it’s cool & hype :)
● Popularity - widely supported
● Scalable & fault-tolerant out of the box
● Flexibility & deep control
● Standardization & ownership
● Speed up development lifecycle
● Encourage more & smaller services
● Linux Foundation & CNCF
11.
Learning #1 - Fear, Uncertainty & Doubt => Excitement & Ownership
● Not everyone will be on board
● Share the vision, explain the advantages, pains and shortcomings
● A simple demo application helps “make it real”
● Emphasize that success requires app team and infra team ownership
● Cultivate champions and use their help
● Momentum is your friend
● Milestones are important for larger services
● Treat the migration as an opportunity to pay down technical debt
● Knowledge sharing & workshops along the way and after
12.
Learning #2 - Pay close attention to performance
➢ Set up the k8s VPC peered with the prod VPC:
○ Redis
○ Memcached
○ Aurora
➢ Scale haproxy instances
➢ Update Kubernetes nodes to c5.2xlarge
➢ Disable ingress controller
➢ Disable kubeDNS
13.
Learning #2 - Pay close attention to performance
[Charts: p90 response time on EC2 vs. p90 response time on k8s]
14.
Learning #2 cont’d - Internal communication is way faster
[Charts: response time when accessing a service by cluster IP vs. by public DNS name]
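The takeaway: keep service-to-service traffic inside the cluster instead of hairpinning out through a public DNS name and load balancer. A minimal sketch of a ClusterIP service, with an illustrative service name and labels (all assumptions, not our actual config):

# Illustrative ClusterIP service; in-cluster clients should call it by its
# internal DNS name (orders.default.svc.cluster.local), not a public DNS name
apiVersion: v1
kind: Service
metadata:
  name: orders          # assumed service name
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: orders         # assumed pod label
  ports:
    - port: 80
      targetPort: 8080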
15.
Learning #3 - Liveness probe is not always your friend
[Chart: response time over time against the k8s healthcheck timeout, for external requests hitting our code]
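The risk: a liveness probe that shares the request path fails exactly when the app is slow under load, and the restarts that follow make the overload worse. A minimal sketch with deliberately forgiving settings, where paths, port and thresholds are illustrative assumptions:

# Illustrative probes: let readiness shed traffic fast, make liveness slow to kill
livenessProbe:
  httpGet:
    path: /healthz        # assumed endpoint
    port: 8080            # assumed port
  initialDelaySeconds: 30
  timeoutSeconds: 5       # a slow response is not necessarily a dead pod
  periodSeconds: 10
  failureThreshold: 6     # tolerate sustained slowness before restarting
readinessProbe:
  httpGet:
    path: /ready          # assumed endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 2     # stop routing traffic quickly instead of killing the pod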
18.
Learning #4 – DNS
Many DNS errors and ~5 second delays
● It’s a well-known issue with UDP & dynamic NAT
● It has a bug report - https://github.com/kubernetes/kubernetes/issues/56903
● And a good explanation of the problem - https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
Solution – use TCP as the protocol
dnsConfig:
  options:
    - name: use-vc
dnsPolicy: ClusterFirst
Another Solution
dnsConfig:
  options:
    - name: single-request-reopen
dnsPolicy: ClusterFirst
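For context, both fragments belong in the pod spec. A minimal illustrative placement inside a Deployment, where the names and image are assumptions:

# Illustrative placement of dnsConfig in a pod template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # assumed name
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      dnsPolicy: ClusterFirst
      dnsConfig:
        options:
          - name: single-request-reopen
      containers:
        - name: app
          image: my-service:latest   # assumed image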
19.
Learning #5 - Too many open files
Ok, Google =)
fs.inotify.max_user_watches=8192 → this looks too low, let’s bump it a little!
That did seem to help… for some time.
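A minimal sketch of the bump, written as an Ansible task since Ansible was our source of truth; the target value is an assumption:

# Illustrative Ansible task: raise the inotify watch limit on worker nodes
- name: Raise fs.inotify.max_user_watches
  ansible.posix.sysctl:
    name: fs.inotify.max_user_watches
    value: "524288"       # assumed new limit; the default is 8192
    state: present
    reload: true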
21.
Learning #5 - Too many open files
[Diagram: a log aggregator client holds an open fd for each log file that Docker writes per container]
22.
Learning #5 - Too many open files
[Diagram: as containers are replaced, the log aggregator client keeps fds for the old containers’ log files - “These are still opened” - so the open-file count keeps growing]
23.
Learning #6 - Pod Distribution after Cluster Maintenance
[Diagram: worker nodes #1, #2 and #3, one Service A pod on each]
24.
Learning #6 - Pod Distribution after Cluster Maintenance
[Diagram: worker node #1 goes under maintenance; its Service A pod is evicted and rescheduled onto nodes #2 and #3]
25.
Learning #6 - Pod Distribution after Cluster Maintenance
[Diagram: node #1 is alive and functioning again, but all the Service A pods remain on nodes #2 and #3]
26.
Learning #6 - Pod Distribution after Cluster Maintenance
[Diagram: the imbalanced layout persists - nothing moves back to node #1 on its own]
27.
Learning #6 - Pod Distribution after Cluster Maintenance
[Diagram: after further maintenance cycles, Service A pods pile up on a single node - “All traffic goes here”]
28.
Learning #6 - Pod Distribution after Cluster Maintenance
[Diagram: the same skewed layout - one node runs most of the Service A pods and takes all the traffic]
Solution: Redeploy to redistribute pods
29.
Learning #6 - Pod Distribution after Cluster Maintenance
[Diagram: after a redeploy, the pods are spread evenly again - one Service A pod per worker node]
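A redeploy fixes the symptom; to make the scheduler prefer spreading replicas in the first place, a soft pod anti-affinity rule is one option. A minimal sketch for the pod template, where the app label is an assumption:

# Illustrative soft anti-affinity: prefer scheduling replicas on different nodes
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: service-a          # assumed pod label
          topologyKey: kubernetes.io/hostname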
30.
Learning #7 - Building Docker Images within the K8s Cluster
[Diagram: a Jenkins slave pod on a Kubernetes worker node mounts docker.sock, and its Docker CLI runs the Jenkinsfile’s `docker build ...` steps against the node’s Docker daemon, creating build containers A and B next to the cluster’s own containers]
31.
Learning #7 - Building Docker Images within the K8s Cluster
[Diagram: the same Jenkinsfile later runs `docker rm ...` through docker.sock - build containers share the node’s daemon with the cluster’s workload containers]
32.
Learning #7 - Building Docker Images within the K8s Cluster
[Diagram: the Jenkins slave’s Docker CLI now talks to a Docker daemon on a separate EC2 instance, so image builds no longer touch the worker node’s daemon]
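One way to wire this up is to point the Docker CLI at the remote daemon instead of mounting /var/run/docker.sock. A minimal sketch of the Jenkins slave container’s env, where the host name and TLS setup are assumptions:

# Illustrative env for the Jenkins slave container: talk to a remote Docker daemon
env:
  - name: DOCKER_HOST
    value: tcp://docker-builder.internal:2376   # assumed dedicated build instance
  - name: DOCKER_TLS_VERIFY
    value: "1"                                  # assumes the daemon serves TLS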
33.
Was it worth it? YES!
● Deployment time more than halved (main service: from 12 min to 5 min)
● Rollback is very easy and fast (nearly instant)
● Provisioned hardware decreased by a factor of 3
● Pod autoscaling eliminated the manual work of handling traffic spikes
● System-level upgrades are now non-blocking and easy to execute
● Time to provision and deploy a new service in production went from days/weeks to minutes/hours
● Each project has its own simple Helm chart in the project repo; ~3,200 Ansible config files deprecated
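For a sense of how small “simple” can be, here is an illustrative Helm 2 era Chart.yaml; the name, version and description are assumptions:

# Illustrative minimal Chart.yaml for a per-project Helm chart
apiVersion: v1              # Helm 2 chart format
name: my-service            # assumed project name
version: 0.1.0
description: Deploys my-service to Kubernetes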
34.
What’s next?
● Dynamic Staging Environments
○ Encourage better development workflow
○ Easily enable cross-team review with design, marketing and others
● Telepresence for Complex Local Development
○ Easier onboarding & dev env refresh
○ More consistent behavior with production
● End-to-end integration suite
● Iterate for Improvements
○ Faster builds
○ Cluster Performance
○ Observability
○ Cost Improvements
● Service mesh with Istio