What are you going to do if 60,000 jobs arrive in the blink of an eye? In the machine learning world, it is normal to have to process a huge load of jobs that come in all at once. We are going to walk you through our journey of scaling out a Kubernetes cluster to handle them: the tools we used, the load testing, how to measure it, and our solution.
2. Journey of Kubernetes Scaling
#whoami
● Setthasarun Prasanpun (Beer)
● Former PHP developer
● DevOps Engineer @ Opsta
3. #whoami
● Jirayut Nimsaeng (Dear)
● Interested in Cloud and Open Source
● Agile practitioner with DevOps driven
● CEO and Founder of Opsta
4. Agenda
● What are Docker and Kubernetes?
● Batch Processing
● Solution to scale Batch Processing
● Optimization
● Benchmark
● Future
8. Kubernetes Automatic Bin Packing
[Diagram: kube-scheduler packs Service A and Service B containers across Node 1, Node 2, and Node 3]
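The bin packing shown above is driven by resource requests: kube-scheduler places each pod on a node that still has enough unreserved CPU and memory. A minimal sketch of such a request (the name, image, and values here are illustrative, not from the talk):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
      - name: service-a
        image: example/service-a:latest   # placeholder image
        resources:
          requests:          # what the scheduler bin-packs on
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
```

Nodes are filled based on these requests, not on actual usage, which is why setting them accurately matters for packing density.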
9. Journey of Kubernetes Scaling
● Self-healing
● Service discovery & load balancing
● Automated rollouts and rollbacks
● Secret and configuration management
● Storage orchestration
● Batch execution
● Horizontal manual/auto-scaling
Some more features on Kubernetes
10. Batch Processing
[Diagram: multiple users submit jobs to a queue; workers consume jobs from the queue and produce results]
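The queue/worker pattern above can be sketched as a simple consume loop. This is a generic illustration with an in-memory queue standing in for the real message queue; the function names are hypothetical, not from the speakers' code:

```python
import queue
import threading

def process(job):
    # Placeholder for the real ML job; here we just square a number.
    return job * job

def worker(jobs: "queue.Queue", results: list) -> None:
    """Consume jobs until the queue is drained."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        results.append(process(job))
        jobs.task_done()

# Users enqueue jobs; several workers consume them concurrently.
jobs = queue.Queue()
for i in range(10):
    jobs.put(i)

results = []
threads = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # every job processed exactly once
```

Each job is consumed by exactly one worker, so adding workers is what lets the backlog drain faster.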
11. Challenge
[Diagram: users submit jobs through an API, which pushes them to a queue; workers consume jobs and write results to a DB]
12. First Design on AWS
[Diagram: users call the API, which pushes jobs to SQS; workers consume from SQS and write results to a DB]
13. Problem
[Diagram: the same design flooded by a burst of users all at once]
60,000 QUEUES!!!
14. Solution with Elastic Beanstalk
[Diagram: the API pushes jobs to SQS; an Elastic Beanstalk container runs an Auto Scaling instance group where each EC2 instance runs an sqsd daemon and one worker]
Scaling condition is set by CPU utilization
15. Problems
- CPU utilization is not a good metric for the autoscaling condition
- 1 EC2 instance contains only 1 worker container
- EC2 instance specs do not fit the worker requirements, wasting resources
- Very slow to scale up; Auto Scaling isn't really intended for bursting
16. Kubernetes Solution
[Diagram: users call the API, which pushes jobs to SQS; workers running on Kubernetes consume from SQS and write results to a DB]
17. Solution with Kubernetes
[Diagram: a Kubernetes cluster of three nodes (Node1, Node2, Node3) runs worker pods that consume from SQS]
18. Scale Pod with Kubernetes
[Diagram: the same three-node cluster with the worker pods scaled out to six, consuming from SQS]
20. What needs to be done
● Change the code not to depend on sqsd
● Build a Kubernetes cluster on AWS
● Find a solution to automatically scale pods and nodes
21. Scale Pods with kube-sqs-autoscaler
● https://github.com/Wattpad/kube-sqs-autoscaler
● Pod autoscaler based on queue size in AWS SQS
● Periodically retrieves the number of messages in SQS and scales pods accordingly, with configuration such as:
○ --scale-down-cool-down=30s
  --scale-up-cool-down=5m
  --scale-up-messages=100
  --scale-down-messages=10
  --max-pods=5
  --min-pods=1
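The decision logic these flags configure can be sketched roughly as follows. This is a simplified illustration of threshold-based scaling, not kube-sqs-autoscaler's actual code:

```python
def desired_replicas(queue_len: int, current: int,
                     scale_up_messages: int = 100,
                     scale_down_messages: int = 10,
                     min_pods: int = 1, max_pods: int = 5) -> int:
    """Threshold-based scaling: one pod up or down per evaluation,
    clamped between min_pods and max_pods."""
    if queue_len >= scale_up_messages:
        return min(current + 1, max_pods)
    if queue_len <= scale_down_messages:
        return max(current - 1, min_pods)
    return current

print(desired_replicas(250, current=2))  # backlog high -> 3
print(desired_replicas(5, current=3))    # backlog low  -> 2
print(desired_replicas(50, current=2))   # in between   -> 2
```

Note that each evaluation cycle moves the replica count by at most one, no matter how large the backlog is.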
25. Scale Nodes with OpenAI kubernetes-ec2-autoscaler
● https://github.com/openai/kubernetes-ec2-autoscaler
● Works with an AWS Auto Scaling group to scale instances up and down
● Scales nodes up by checking whether pods are in Pending status and no node with free capacity is left
● Scales nodes down by checking for idle CPU
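The scale-up rule described above (pending pods plus no spare capacity) can be sketched as a small predicate. This is an illustration of the idea, not the OpenAI autoscaler's code:

```python
def should_add_node(pending_pods: int, free_cpu_per_node: list) -> bool:
    """Add a node only when pods are stuck in Pending AND
    no existing node has free capacity left."""
    no_free_capacity = all(cpu <= 0 for cpu in free_cpu_per_node)
    return pending_pods > 0 and no_free_capacity

print(should_add_node(3, [0, 0, 0]))   # True: pods stuck, nodes full
print(should_add_node(0, [0, 0, 0]))   # False: nothing pending
print(should_add_node(3, [2, 0, 0]))   # False: a node still has room
```

The second condition matters: a pod can be Pending for reasons the scheduler will resolve on its own, so a new node is only worth paying for when the cluster is genuinely full.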
30. Enhance kube-sqs-autoscaler
● Scaling 1 pod at a time is too slow!
● So we improved the kube-sqs-autoscaler code to scale pods by the ratio between the SQS queue size and the pod count
○ --scale-by-ratio
  --queue-per-pod-ratio=100
  --scale-down-cool-down=30s
  --scale-up-cool-down=5m
  --max-pods=5
  --min-pods=1
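Ratio-based scaling computes the target pod count directly from the backlog instead of stepping one pod per cycle. A sketch of the arithmetic, matching the flags above (the function itself is hypothetical):

```python
import math

def desired_replicas(queue_len: int,
                     queue_per_pod_ratio: int = 100,
                     min_pods: int = 1, max_pods: int = 5) -> int:
    """One pod per queue_per_pod_ratio queued messages,
    clamped between min_pods and max_pods."""
    wanted = math.ceil(queue_len / queue_per_pod_ratio)
    return max(min_pods, min(wanted, max_pods))

print(desired_replicas(0))     # 1: min-pods floor
print(desired_replicas(250))   # 3 in a single step, not three cycles
print(desired_replicas(2000))  # 5: max-pods cap
```

A burst of messages now jumps straight to the pod count the backlog calls for, which is what makes this variant fast enough for spiky load.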
31. Move from OpenAI to autoscaler
● https://github.com/kubernetes/autoscaler
● The OpenAI autoscaler lacks development since its developers moved from AWS to Azure
● The OpenAI autoscaler does not support multiple instance groups
● Cluster Autoscaler is more mature since it is one of the Kubernetes components
32. Worker parallel optimization
- The worker consumed only 1 job at a time
- CPU usage was under 15% but memory went to ~35% per worker on a node; not good for us
- We improved our worker to consume and process multiple jobs simultaneously (a configurable setting)
- After some trials, a worker can do 5 concurrent jobs in the same processing time, using more CPU and only a bit more memory
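Processing several jobs per worker can be sketched with a thread pool; the concurrency of 5 mirrors the configurable setting mentioned above, but the code itself is illustrative rather than the speakers' implementation:

```python
from concurrent.futures import ThreadPoolExecutor

CONCURRENT_JOBS = 5  # configurable, as in the talk

def process(job):
    # Placeholder for the real ML inference work.
    return job * 2

def consume_batch(jobs):
    """Process up to CONCURRENT_JOBS jobs simultaneously."""
    with ThreadPoolExecutor(max_workers=CONCURRENT_JOBS) as pool:
        return list(pool.map(process, jobs))

print(consume_batch([1, 2, 3, 4, 5]))  # [2, 4, 6, 8, 10]
```

Running several jobs per worker raises CPU utilization per pod, which lines up with the observation that the single-job worker left most of its CPU idle while still holding memory.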
33. Worker CPU optimization
- Our worker uses TensorFlow installed via pip
- TensorFlow warned that the library wasn't compiled to use AVX and SSE4.1 instructions, even though these are available on the machine; the pip build is not compiled for any optional CPU instructions
- So we built TensorFlow with all CPU instructions available on the EC2 (t2.medium) machine
- Result: jobs are processed about 35% faster!!!
35. Benchmark questions
● How to do the load test?
○ Python script, 5,000 requests (200 concurrent users x 25 requests/user) within 1 minute
● What is the most cost-effective instance size?
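The load profile above (200 concurrent users x 25 requests each = 5,000 requests) can be sketched as follows. The HTTP call is stubbed out so only the shape of the test is shown; this is a hypothetical reconstruction, not the speakers' script:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

USERS = 200
REQS_PER_USER = 25

def send_request():
    # Stub standing in for a real HTTP POST to the API under test.
    return 200  # pretend status code

def simulate_user(n_reqs: int) -> int:
    """One user firing its share of requests sequentially."""
    return sum(1 for _ in range(n_reqs) if send_request() == 200)

# 200 user threads, each sending 25 requests.
with ThreadPoolExecutor(max_workers=USERS) as pool:
    ok = sum(pool.map(simulate_user, repeat(REQS_PER_USER, USERS)))

print(ok)  # 5000 successful requests
```

In a real run, send_request would call the API and the elapsed time and per-request latency would be recorded alongside the success count.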
36. Benchmark Result Graph
[Graph: t2.medium wins at 1,570 queues/minute]
37. Benchmark result
● Worker scaling speed:
○ Elastic Beanstalk: 5-10 mins per worker instance
○ K8s: <2 mins when a node is available (uses a free node),
  <5 mins when no node is available (spins up a new one)
38. Conclusions
● K8s is flexible for batch processing jobs
● K8s has many components for autoscaling
● K8s helps us optimize resources cost-effectively
● K8s can finish 60,000 queued jobs in 10 mins
39. Future
● Use Kubernetes with AWS GPU instances
● Change the queue
○ RabbitMQ
○ Kafka
● Optimize cost with AWS Spot Instances