Managing resources (CPU, memory, network I/O) in compute clusters is difficult. Whether running Hadoop, Spark, or customized workloads, we face the challenge of scheduling a mixture of long-running and short-running workloads with different resource requirements and deadlines in a compute cluster. The difficulty comes in when we try to maximize cluster utilization and, at the same time, share resources properly among workloads.
This talk presents a solution to this problem using two cutting-edge open source technologies — Cook (https://github.com/twosigma/cook) and Apache Mesos (http://mesos.apache.org). At Two Sigma, we use Cook and Mesos to manage our compute clusters and run tens of thousands of compute workloads every day. By using Cook and Mesos, we are able to efficiently utilize the compute cluster and achieve high user satisfaction.
In this talk, we will discuss the idea behind our algorithm and the design of the system, and show how Cook and Mesos can be used to solve the cluster resource sharing problem for others.
4. What is Mesos
• Open Source Apache Project
• 2010: AMPLab, University of California Berkeley
• 2012: Twitter, Airbnb
• 2015: Twitter, Airbnb, Apple, Bloomberg, Cisco,
eBay, Yelp…
5. What is Mesos
• Tool to build distributed applications
– Hadoop, Spark…
– Cassandra, Kafka, Riak…
6. What is Mesos
• Distributed applications commonality:
– Manages resources (cpu, memory, disk…) on
worker hosts
– Manages life cycle of remote processes
– Manages communication between masters
and workers
13. What is Cook
• Two Sigma’s Simulation Platform
• Manages tens of thousands of simulations
• Shares compute resources among users
14. What is Simulation
• Idempotent, distributed, resource intensive
computations
• Simulation set
• A handful ~ thousands of simulations
• Simulation
• Multiple Mesos tasks
15. What is Simulation
• Simulation task footprint
• 10 ~ 100 GB RAM
• 1 ~ 20 CPUs
• 15 minutes ~ a few hours
• Simulation use cases
• Interactive
• Batch processing
16. Problem
• High resource demand
• 5 x capacity during peak hours
• Optimize
• Utilization
• Process workloads as fast as possible
• Fairness
• Allocate resources fairly to users
20. What is Fairness, Really
• Fairness is not about ‘fair’
• Fairness is about user experience
• Users should get their share of the cluster
whenever they need it
22. Static Quota
• Quota = Max percentage of the cluster allowed
for a single user
• Static
• 100 % / # Max concurrent users
• Pros:
• Fairness
• Cons:
• Poor Utilization
48. Outline
• Introduction: Mesos and Cook
• Problem: Utilization and Fairness
• Fairness: How do we do it
• Preemption: How do we do it
• Intuition
• Formalization
49. Cumulative Resource Share (CRS)
• Assuming there is a total order of tasks for
each user, where > means ‘more important
than’.
– CRS of task t is the sum of the resources of all
tasks of the same user that are greater than or
equal to t, divided by the total cluster resource.
• CRS(t) = (1 / R_Total) · Σ_{t′ ≥ t} R(t′)
58. Outline
• Introduction: Mesos and Cook
• Problem: Utilization and Fairness
• Fairness: How do we do it
• Preemption: How do we do it
• Intuition
• Formalization
• Put things together: Mesos and Cook
60. Are we doing better?
            | Static Quota | Dynamic Quota | Preemption?
Fairness    | Good         | Poor          | ?
Utilization | Poor         | Good          | ?
61. Outline
• Introduction: Mesos and Cook
• Problem: Utilization and Fairness
• Fairness: How do we do it
• Preemption: How do we do it
• Intuition
• Formalization
• Put things together: Mesos and Cook
• Benchmark
Hello Everyone. It’s an honor to be here today. My name is Li Jin. I am from New York. Today I am going to talk about …
First, a little bit background about me. I am a software engineer @ Two Sigma. I have been working on Mesos and Cook for a little bit over a year now.
Two Sigma is a quantitative hedge fund based in New York City. It is a technology company that applies computer science, engineering and math in finance and investment.
Ok, let’s jump right into it.
Let’s talk about what’s Mesos and Cook
First of all, Mesos is an open source Apache project, created at UC Berkeley in 2010.
In 2012, Mesos was used by Twitter and Airbnb in their production environments.
And now, Mesos is powering many more companies, such as Apple, Bloomberg, and Cisco.
Mesos is a powerful tool to build distributed applications.
Here, by distributed application, I mean an application that launches and manages remote processes on a set of worker hosts.
For instance, it can be a distributed computing framework like Hadoop or Spark, or a distributed storage system like Cassandra…
To explain why Mesos is a great tool to build distributed applications, let us think about commonality among those:
Distributed applications need to account for resources on worker hosts in order not to overload them. They also need to implement resource isolation to make sure different processes don’t affect each other. And these two things become even harder when multiple applications are running on the same set of worker hosts, because each needs to be aware of how the worker hosts are being used by the other applications.
Distributed applications need to monitor the life cycle of remote processes. They need to know when a remote process starts, succeeds, and fails. This might sound easy, but think about all the failure cases: hosts can go down or, worse, be overloaded; network partitions can happen; the application can lose track of remote processes; and so on.
How about communication? All distributed applications need some communication mechanism. HTTP, messaging, RPC… you name it. Worse, they all need to deal with message loss and resending.
Finally, applications need to optimize for execution. This includes prioritizing workloads, handling workload dependencies, and so on. Hadoop, for instance, does straggler detection.
Now let’s take a look at how Mesos helps. Mesos provides an abstraction layer over the physical machines and presents those machines as, essentially, “resources” to the applications. Applications, then, can use those resources for their workloads.
…
Now the applications no longer need to worry about resource management. Whenever there are resources available, Mesos will send resource offers to the application. Resource isolation is taken care of as well: Mesos launches remote processes in containers and monitors the resource usage of those containers.
Two Sigma is powered by Mesos. We have multiple data centers that run Mesos, and we run multiple frameworks on top of that. Some of them are open source frameworks like Marathon and Spark, and some of them are built by us to meet specific use cases. The framework I am going to talk about today is one we developed at Two Sigma called Cook.
So what is Cook?
Cook is Two Sigma’s simulation platform.
At a very high level, Cook manages tens of thousands of simulations. And since the platform is shared by all researchers, Cook is also responsible for sharing compute resources among users.
Simulation is a tool that quantitative researchers at Two Sigma use to back test their investment strategies.
From an abstract point of view, simulations are just idempotent, distributed, resource intensive computations.
One simulation is implemented as multiple Mesos tasks.
So here is what a simulation task looks like. It takes 10–100 GB of memory and 1–20 CPUs, and it runs from 15 minutes up to a few hours.
But mostly, there are two major use cases. The first is interactive research: this type of workload usually finishes in 30 minutes to an hour, and the user actively waits for the result. The second type is batch computation: these workloads usually consume more resources, and users don’t care too much about latency as long as they finish overnight or over the weekend.
So in Cook, we face very high resource demand. We can easily receive workloads that are 5x the capacity of the cluster during peak hours, and we are often at or near full utilization during business hours.
Under such workloads, it’s very important for Cook to optimize for two things. The first is utilization, because we want to process workloads as fast as possible. The second is fairness: since Cook is a shared platform, we need to make sure it allocates resources to all users fairly, for some definition of ‘fair’. We all know what utilization means, but fairness is a little unclear at the moment.
So what is fairness? Well, fairness has a lot of definitions and there are a lot ways to achieve fairness.
Let’s see some examples.
First come, first served is one way to achieve fairness. Most of the services we use in real life every day are first come, first served: stores, post offices, you name it. Maybe we can do the same.
Time-sharing is another way to achieve fairness. We can split one day into one-hour chunks and fairly share the cluster among 24 researchers.
Or we can roll a die every day and decide who is going to use the cluster for that day.
*Explain more why they don’t work*
*Explain how user experience maps to fairness*
So, these approaches are all ‘fair’, but is that what we want?
Let me use a story to answer that question.
Imagine yourself as a researcher at Two Sigma. You have this great idea that you think is going to make a lot of money, and you want to run some simulations to test it. You submit a batch of simulations; normally they should complete in an hour, so you decide to go get some lunch. You have this great lunch, you are fully energized and ready to go, you sit down and start to look at the results. However, you find that your simulations are still sitting in the queue. You are quite upset, because this is blocking you from doing your job.
What makes you more upset is when you open the utilization dashboard, you see this.
You see you only have a tiny bit of the cluster and other users are using much more.
A few words pop in your mind “This is not fair!”
I can only assume this is what the researcher becomes.
So what is fairness, really
Well, I think fairness is not about ‘fair’. If we think about the story again, the researcher wouldn’t have looked at the dashboard in the first place if he had gotten his results back.
So I think fairness is about user experience. Fairness is a way to make sure users can get resources to do their job.
So Fairness to us means users should get their share of the cluster whenever they need it.
Now we have a better idea of what fairness is, let’s talk about how to achieve it
Well, the easiest thing we can do is to use quota
Quota is basically a max percentage of the cluster allowed for a single user.
A static quota can be total resources divided by the number of max concurrent users
Quota can guarantee fairness, any user can get his quota any time.
However, an obvious problem with static quota is that it can lead to low utilization. During peak hours, we can still have 80–90% utilization, but during the night, since the number of users is usually lower, utilization can drop to 30–40% while workloads sit in the queue because of quota.
To solve the utilization problem, we introduce this notion of dynamic quota.
The basic idea is that instead of using a static quota, we adjust the quota based on current utilization. The lower the utilization, the higher the quota can be.
This approach brings us much higher utilization. During the night, utilization jumps from 30–40% to 60–70%.
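The talk doesn’t give Cook’s exact adjustment formula, but the idea can be sketched as a simple linear interpolation between a generous quota on an empty cluster and the static quota on a full one (the function name and parameters here are hypothetical, not Cook’s API):

```python
def dynamic_quota(utilization, base_quota, max_quota):
    """Hypothetical dynamic quota: the lower the current cluster
    utilization, the higher the quota. Interpolates linearly between
    max_quota (at 0% utilization) and base_quota (at 100%)."""
    utilization = min(max(utilization, 0.0), 1.0)  # clamp to [0, 1]
    return base_quota + (max_quota - base_quota) * (1.0 - utilization)

# At night (low utilization) a user may use half the cluster;
# during peak hours the quota shrinks back toward the static value.
night_quota = dynamic_quota(0.3, base_quota=0.1, max_quota=0.5)
peak_quota = dynamic_quota(1.0, base_quota=0.1, max_quota=0.5)
```

Any monotonically decreasing function of utilization would work here; the key property is only that quota rises when the cluster empties out.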
However, dynamic quota brings us a new problem of unfairness
Let’s take a look at this.
Since some users enter the system when it’s relatively empty, they can have a higher quota and run a lot of jobs.
As the utilization increases, quota decreases and we reach the allocation on the left side.
The problem is that even though the quota can change quickly based on utilization, the change in allocation is much slower, because we don’t have a way to reclaim resources other than waiting for simulations to complete, and as I mentioned earlier, that can take hours.
These long delays can be very problematic for us because again they can lead to bad user experience.
So far we have static quota which is great in fairness but poor in utilization. And dynamic quota is quite the opposite.
Can we do something better? Can we find an approach that has both high utilization and high fairness?
Well, not surprisingly, our answer is to use preemption.
Preemption here simply means to kill a simulation task and reschedule it later.
The most important idea behind preemption is that we can reclaim resources much faster.
By using preemption, instead of hours, we only need minutes to go from the left side to the right side.
So how do we do preemption? Or more specifically, what is the criterion for choosing which tasks to preempt, and under what conditions?
Let’s first walk through an example to get some intuition behind preemption
Let’s say we have a cluster of 6 cpus. Each box here represents a task taking 1 cpu.
Here we have two users, Jerry and Kevin, each of them is using half the cluster
Well, we know eventually, we want to reach a fair allocation like this.
But we don’t know how to get them yet. We don’t know which one of the six tasks we should preempt.
Well, we know that both Jerry and Kevin are above their fair share. So intuitively we can preempt either Jerry’s or Kevin’s tasks, but we don’t know much more beyond that. So we consider all their tasks for preemption, which are marked in orange here, in order to schedule Dave’s task, which is marked in yellow.
And we decide to preempt one of Jerry’s tasks.
And end up like this
Now we do it again, and this time, since Jerry is no longer above his fair share, we only consider Kevin’s tasks for preemption.
And similarly, we decide to preempt one of Kevin’s tasks.
And end up like this. So did we do a good job?
Well, it turns out we did not.
The problem is not all tasks are equal. Different tasks are of different importance to the users and we’ve just preempted some important tasks.
This, again, leads to bad user experience.
Now we know we cannot treat all tasks as equal so we need a score function to reflect a task’s value.
We use value here to represent the two things we’ve mentioned so far. The first is fairness: we want to use the score function to achieve fairness easily. The second is importance: we want the score function to also reflect how important a task is.
And Cook will use the score as the preemption criterion, preempting low-score tasks for high-score tasks.
Let’s see how that works.
First, we don’t quite know how to express the relative importance among all tasks. It is hard for us to say one researcher’s task is more important than another’s.
But we do know how to express the relative importance among tasks of the same user. The user has an easy way to tell us which of his tasks are more important.
Here the importance is shown in currencies, and we have an ordering for each user’s tasks.
But since they are in different currencies, we still cannot compare them across users.
Now it’s important for us to unify the currency.
Here we apply our principle of fairness and say that all users’ most important tasks are of the same value, and so on down the ordering. By doing this, the dollar amount on each task now reflects both fairness and importance.
Now things become easier: when we choose tasks to preempt for the yellow one, we consider all tasks that have a lower value.
The reason we need to consider multiple tasks instead of just the lowest one is that preemption is subject to a bin-packing constraint: the yellow task needs to be able to fit on the host after the preemption. In this example we don’t have that problem because all tasks are the same size, but in reality that is no longer true.
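The candidate-selection step described above might be sketched like this. This is a simplified, single-resource, single-host version with hypothetical names, not Cook’s actual code: among tasks with a lower score than the incoming task, greedily preempt the lowest-scoring ones until the incoming task fits.

```python
def preemption_candidates(host_tasks, free_cpus, need_cpus, incoming_score):
    """Pick tasks to preempt on one host so that an incoming task fits.
    Only tasks scored below the incoming task are eligible; the least
    valuable go first. Returns None if the task cannot fit at all."""
    eligible = sorted(
        (t for t in host_tasks if t["score"] < incoming_score),
        key=lambda t: t["score"],  # lowest score = least valuable first
    )
    chosen, freed = [], free_cpus
    for task in eligible:
        if freed >= need_cpus:
            break
        chosen.append(task)
        freed += task["cpus"]
    return chosen if freed >= need_cpus else None

# Jerry's least valuable task has the lowest score, so it goes first:
tasks = [{"name": "k1", "score": -0.2, "cpus": 1},
         {"name": "j1", "score": -0.5, "cpus": 1}]
picked = preemption_candidates(tasks, free_cpus=0, need_cpus=1,
                               incoming_score=-0.1)
```

A real scheduler would run this across hosts and resource dimensions, but the core rule is the same: never preempt a task more valuable than the one you are placing.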
*Add arrows*
Here, we preempt Jerry’s task
Similarly we do this again
This time, there is only one task considered for preemption.
And finally, we reach fair allocation and we are running most important tasks for each user.
Now we have developed some intuition through this example, especially about what the score function should look like. Let’s take a look at how we formalize it.
Assume there is a total order of jobs for each user, where > means ‘has higher value than’.
We introduce the notion of cumulative resource share or CRS.
The CRS of a job j is the sum of the resources of all jobs of the same user that are greater than or equal to j, divided by the total resource.
Or in mathematical form: CRS(j) = (1 / R_Total) · Σ_{j′ ≥ j} R(j′).
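As a minimal sketch (function and job names hypothetical), CRS amounts to a prefix sum over one user’s jobs ordered most valuable first, divided by the total cluster resource:

```python
def crs(user_jobs_desc, total_resource):
    """Cumulative resource share. user_jobs_desc is a list of
    (job, resource) pairs for ONE user, most valuable first, so the
    running sum at job j covers exactly the jobs j' >= j."""
    shares, cumulative = {}, 0.0
    for job, resource in user_jobs_desc:
        cumulative += resource
        shares[job] = cumulative / total_resource
    return shares

# One user's three jobs on a 6-CPU cluster, most valuable first:
shares = crs([("a", 1), ("b", 2), ("c", 1)], total_resource=6)
```

Note that a user’s most valuable job always has the smallest CRS, which is what lets us compare jobs across users later.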
Note that unlike the currency notion we used before, here a more valuable task has a lower CRS.
So far we have only been considering a single type of resource, but in reality we have multiple; for instance, memory and CPU.
Luckily, there is already some interesting research to help us with that.
Dominant Resource Fairness is a way to achieve fair allocation of multiple resource types. The paper was published by UC Berkeley in 2011, and it is implemented in Mesos itself.
It introduces the notion of dominant resource share, or DRS, which is the maximum of a user’s resource shares across resource types.
It’s simple yet has a lot of good properties. I won’t dig too much into it here; I strongly suggest reading the paper.
Here, we extend the same idea to cumulative resource share.
To recap, here is the definition of CRS.
Dominant cumulative resource share, or DCRS, is defined as the max CRS among all resources.
Finally, we define the score to be the negation of DCRS, because the higher the score, the more valuable the job, and DCRS is the opposite.
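Putting the two definitions together, here is a sketch of the score computation for one user (hypothetical names; this is an illustration of the definition, not Cook’s implementation). Each job’s DCRS is the maximum of its per-resource cumulative shares, and its score is the negation of that:

```python
def dcrs_scores(user_jobs_desc, totals):
    """Score(j) = -DCRS(j). user_jobs_desc is a list of
    (job, usage-dict) pairs for one user, most valuable first;
    totals maps each resource type to the cluster total."""
    cumulative = {r: 0.0 for r in totals}
    scores = {}
    for job, usage in user_jobs_desc:
        for r in totals:
            cumulative[r] += usage[r]
        # DCRS = max cumulative share across resource types
        scores[job] = -max(cumulative[r] / totals[r] for r in totals)
    return scores

totals = {"cpus": 10.0, "mem": 100.0}
s = dcrs_scores([("a", {"cpus": 1, "mem": 40}),
                 ("b", {"cpus": 5, "mem": 10})], totals)
```

Job "a" is memory-dominant (40% of memory) while "b" pushes the cumulative CPU share to 60%, so "a" ends up with the higher score and would be preempted last.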
So far we’ve talked about the problem, fairness, preemption, and the score function. Finally, let’s see how these fit together in Cook.
This is a high level architecture of Cook.
On the left side, we have cook, which consists of three components.
The first component, on the left side, is the Ranker. Its job is to take all running and waiting jobs, sort them for each user, compute the score for those jobs, and return a list of jobs sorted by score.
The list of jobs is then passed to the other two components. On the top is the Matcher. This component takes resource offers from Mesos and matches them against the list of jobs to see if the offers are big enough to fit those jobs; if so, it sends the tasks to Mesos.
The third component, Rebalancer, does preemption. Let’s zoom in to see what it does.
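The Ranker’s flow can be sketched as follows (all names hypothetical, with a toy per-user score function standing in for the DCRS-based one; Cook’s real Ranker is not shown here): score each user’s jobs independently, then merge everything into one global list sorted by score.

```python
from collections import defaultdict

def rank(jobs, score_fn):
    """Ranker sketch: group jobs by user, order each user's jobs by
    that user's own priority (most important first), score them with
    score_fn, then return all jobs sorted by score, highest first."""
    by_user = defaultdict(list)
    for job in jobs:
        by_user[job["user"]].append(job)
    scored = []
    for user_jobs in by_user.values():
        user_jobs.sort(key=lambda j: j["priority"], reverse=True)
        scored.extend(score_fn(user_jobs))  # (job, score) pairs
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [job for job, _ in scored]

def toy_score(user_jobs, total_cpus=6.0):
    """Toy per-user score: negated cumulative CPU share."""
    out, cumulative = [], 0.0
    for j in user_jobs:
        cumulative += j["cpus"]
        out.append((j, -cumulative / total_cpus))
    return out

jobs = [{"user": "jerry", "priority": 2, "cpus": 1},
        {"user": "jerry", "priority": 1, "cpus": 1},
        {"user": "kevin", "priority": 1, "cpus": 1}]
ranked = rank(jobs, toy_score)
```

Because scoring is per-user, each user’s top job lands near the head of the global list regardless of how many jobs other users submitted, which is exactly the fairness property described earlier.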
We asked the question of can we do better. Now is the time to answer it.
Here are the results from the benchmark we ran against
We took a trace from our production workload and ran it with