Managing resources (CPU, memory, network I/O) in compute clusters is difficult. Whether running Hadoop, Spark, or customized workloads, we face the challenge of scheduling a mixture of long-running and short-running workloads with different resource requirements and deadlines in a compute cluster. The difficulty comes in when we try to maximize cluster utilization and, at the same time, share resources properly among workloads.
This talk presents a solution to this problem using two cutting-edge open source technologies — Cook (https://github.com/twosigma/cook) and Apache Mesos (http://mesos.apache.org). At Two Sigma, we use Cook and Mesos to manage our compute clusters and run tens of thousands of compute workloads every day. By using Cook and Mesos, we are able to efficiently utilize the compute cluster and achieve high user satisfaction.
In this talk, we will discuss the idea behind our algorithm and the design of the system, and show how Cook and Mesos can be used to solve the cluster resource sharing problem for others.
4. What is Mesos
• Open Source Apache Project
• 2010: AMPLab, University of California Berkeley
• 2012: Twitter, Airbnb
• 2015: Twitter, Airbnb, Apple, Bloomberg, Cisco,
eBay, Yelp…
5. What is Mesos
• Tool to build distributed applications
– Hadoop, Spark…
– Cassandra, Kafka, Riak…
6. What is Mesos
• Distributed applications commonality:
– Manages resources (cpu, memory, disk…) on
worker hosts
– Manages life cycle of remote processes
– Manages communication between masters
and workers
13. What is Cook
• Two Sigma’s Simulation Platform
• Manages tens of thousands of simulations
• Shares compute resources among users
14. What is Simulation
• Idempotent, distributed, resource intensive
computations
• Simulation set
• A handful ~ thousands of simulations
• Simulation
• Multiple Mesos tasks
15. What is Simulation
• Simulation task footprint
• 10 ~ 100 GB RAM
• 1 ~ 20 CPUs
• 15 minutes ~ a few hours
• Simulation use cases
• Interactive
• Batch processing
16. Problem
• High resource demand
• 5 x capacity during peak hours
• Optimize
• Utilization
• Process workloads as fast as possible
• Fairness
• Allocate resources fairly to users
20. What is Fairness, Really
• Fairness is not about ‘fair’
• Fairness is about user experience
• Users should get their share of the cluster
whenever they need it
22. Static Quota
• Quota = Max percentage of the cluster allowed
for a single user
• Static
• 100 % / # Max concurrent users
• Pros:
• Fairness
• Cons:
• Poor Utilization
48. Outline
• Introduction: Mesos and Cook
• Problem: Utilization and Fairness
• Fairness: How do we do it
• Preemption: How do we do it
• Intuition
• Formalization
49. Cumulative Resource Share (CRS)
• Assuming there is a total order of tasks for
each user, where > means ‘more important
than’.
– CRS of task t is the sum of the resources of all
tasks of the same user that are greater than or
equal to t, divided by the total cluster resource.
• CRS(t) = (1 / R_Total) · Σ_{t′ ≥ t} R(t′)
58. Outline
• Introduction: Mesos and Cook
• Problem: Utilization and Fairness
• Fairness: How do we do it
• Preemption: How do we do it
• Intuition
• Formalization
• Put things together: Mesos and Cook
60. Are we doing better?
            | Static Quota | Dynamic Quota | Preemption?
Fairness    | Good         | Poor          | ?
Utilization | Poor         | Good          | ?
61. Outline
• Introduction: Mesos and Cook
• Problem: Utilization and Fairness
• Fairness: How do we do it
• Preemption: How do we do it
• Intuition
• Formalization
• Put things together: Mesos and Cook
• Benchmark
Hello Everyone. It’s an honor to be here today. My name is Li Jin. I am from New York. Today I am going to talk about …
First, a little bit background about me. I am a software engineer @ Two Sigma. I have been working on Mesos and Cook for a little bit over a year now.
Two Sigma is a quantitative hedge fund based in New York City. It is a technology company that applies computer science, engineering and math in finance and investment.
Ok, let’s jump right into it.
Let’s talk about what’s Mesos and Cook
First of all, Mesos is an open source Apache project, created at UC Berkeley in 2010.
In 2012, Mesos was used by Twitter and Airbnb in their production environments.
And now, Mesos is powering many more companies, such as Apple, Bloomberg, and Cisco.
Mesos is a powerful tool to build distributed applications.
Here, by distributed application, I mean an application that launches and manages remote processes on a set of worker hosts.
For instance, it can be a distributed computing framework like Hadoop or Spark, or a distributed storage system like Cassandra…
To explain why Mesos is a great tool to build distributed applications, let us think about commonality among those:
Distributed applications need to account for resources on worker hosts in order not to overload them. They also need to implement resource isolation to make sure different processes don’t affect each other. And these two things become even harder when multiple applications are running on the same set of worker hosts, because each needs to be aware of how the worker hosts are being used by the other applications.
Distributed applications need to monitor the life cycle of remote processes. They need to know when a remote process starts, succeeds, and fails. This might sound easy, but think about all the failure cases: hosts can go down or, worse, be overloaded; network partitions can happen; the application can lose track of remote processes; and so on.
How about communication? All distributed applications need some communication mechanism. HTTP, messaging, RPC… you name it. Worse, they all need to deal with message loss and resending.
Finally, applications need to optimize for execution. This includes prioritizing workloads, handling workload dependencies, and so on. Hadoop, for instance, does straggler detection.
Now let’s take a look at how Mesos helps. Mesos provides an abstraction layer over the physical machines and presents those machines as, essentially, “resources” to the applications. Applications, then, can use those resources for their workloads.
…
Now the applications no longer need to worry about resource management. Whenever there are resources available, Mesos will send resource offers to the application. Resource isolation is taken care of as well: Mesos launches remote processes in containers and monitors the resource usage of those containers.
Two Sigma is powered by Mesos. We have multiple data centers that run Mesos, and we run multiple frameworks on top of that. Some of them are open source frameworks like Marathon and Spark, and some of them are built by us to meet specific use cases. The framework I am going to talk about today is one we developed at Two Sigma called Cook.
So what is Cook?
Cook is Two Sigma’s simulation platform.
At a very high level, Cook manages tens of thousands of simulations. And since the platform is shared by all researchers, Cook is also responsible for sharing compute resources among users.
Simulation is a tool that quantitative researchers at Two Sigma use to back test their investment strategies.
From an abstract point of view, simulations are just idempotent, distributed, resource intensive computations.
One simulation is implemented as multiple Mesos tasks.
So here is what a simulation task looks like. It takes 10–100 GB of memory and 1–20 CPUs, and it runs from 15 minutes up to a few hours.
But mostly, there are two major use cases. The first is interactive research: this type of workload usually finishes in 30 minutes to an hour, and the user actively waits for the result. The second type is batch computation: these workloads usually consume more resources, and users don’t care too much about latency as long as they finish overnight or over the weekend.
So in Cook, we face very high resource demand. We can easily receive workloads that are 5x the capacity of the cluster during peak hours, and we are often at or near full utilization during business hours.
Under such workloads, it’s very important for Cook to optimize for two things. The first is utilization, because we want to process workloads as fast as possible. The second is fairness: since Cook is a shared platform, we need to make sure it allocates resources to all users fairly, for some definition of ‘fair’. We all know what utilization means, but fairness is a little unclear at the moment.
So what is fairness? Well, fairness has a lot of definitions and there are a lot ways to achieve fairness.
Let’s see some examples.
First come, first served is one way to achieve fairness. Most of the services we use in real life every day are first come, first served: stores, post offices, you name it. Maybe we can do the same.
Time-sharing is another way to achieve fairness. We can split one day into one-hour chunks and fairly share the cluster among 24 researchers.
Or we can roll a die every day and decide who is going to use the cluster for that day.
*Explain more why they don’t work*
*Explain how user experience maps to fairness*
So, these approaches are all ‘fair’, but is that what we want?
Let me use a story to answer that question.
Imagine yourself as a researcher at Two Sigma. You have this great idea that you think is going to make a lot of money, and you want to run some simulations to test it. You submit a batch of simulations; normally they should complete in an hour, so you decide to go get some lunch. You have this great lunch, you are fully energized and ready to go, you sit down and start to look at the results. However, you find that your simulations are still sitting in the queue. You are quite upset, because this is blocking you from doing your job.
What makes you more upset is when you open the utilization dashboard, you see this.
You see you only have a tiny bit of the cluster and other users are using much more.
A few words pop in your mind “This is not fair!”
I can only assume this is what the researcher becomes.
So what is fairness, really
Well, I think fairness is not about ‘fair’. If we think about the story again, the researcher wouldn’t have looked at the dashboard in the first place if he had gotten his results back.
So I think fairness is about user experience. Fairness is a way to make sure users can get resources to do their job.
So Fairness to us means users should get their share of the cluster whenever they need it.
Now we have a better idea of what fairness is, let’s talk about how to achieve it
Well, the easiest thing we can do is to use quota
Quota is basically a max percentage of the cluster allowed for a single user.
A static quota can be total resources divided by the number of max concurrent users
Quota can guarantee fairness, any user can get his quota any time.
However, an obvious problem with static quota is that it can lead to low utilization. During peak hours, we can still have 80–90% utilization, but during the night, since the number of users is usually lower, utilization can drop to 30–40% while workloads sit in the queue because of quota.
To solve the utilization problem, we introduce this notion of dynamic quota.
The basic idea is that instead of using a static quota, we adjust the quota based on current utilization. The lower the utilization, the higher the quota can be.
This approach brings us much higher utilization. During the night, utilization jumps from 30–40% to 60–70%.
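The talk doesn’t give Cook’s exact adjustment formula, but the idea can be sketched as a simple linear interpolation between a generous quota on an empty cluster and the static quota on a full one (the function name and parameters here are hypothetical, not Cook’s API):

```python
def dynamic_quota(utilization, base_quota, max_quota):
    """Hypothetical dynamic quota: the lower the current cluster
    utilization, the higher the quota. Interpolates linearly between
    max_quota (at 0% utilization) and base_quota (at 100%)."""
    utilization = min(max(utilization, 0.0), 1.0)  # clamp to [0, 1]
    return base_quota + (max_quota - base_quota) * (1.0 - utilization)

# At night (low utilization) a user may use half the cluster;
# during peak hours the quota shrinks back toward the static value.
night_quota = dynamic_quota(0.3, base_quota=0.1, max_quota=0.5)
peak_quota = dynamic_quota(1.0, base_quota=0.1, max_quota=0.5)
```

Any monotonically decreasing function of utilization would work here; the key property is only that quota rises when the cluster empties out.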
However, dynamic quota brings us a new problem of unfairness
Let’s take a look at this.
Since some users enter the system when it’s relatively empty, they can have a higher quota and run a lot of jobs.
As the utilization increases, quota decreases and we reach the allocation on the left side.
The problem is that even though the quota can change quickly based on utilization, the change in allocation is much slower, because we don’t have a way to reclaim resources other than waiting for simulations to complete, and as I mentioned earlier, that can take hours.
These long delays can be very problematic for us because again they can lead to bad user experience.
So far we have static quota which is great in fairness but poor in utilization. And dynamic quota is quite the opposite.
Can we do something better? Can we find an approach that has both high utilization and high fairness?
Well, not surprisingly, our answer is to use preemption.
Preemption here simply means to kill a simulation task and reschedule it later.
The most important idea behind preemption is that we can reclaim resources much faster.
By using preemption, instead of hours, we only need minutes to go from the left side to the right side.
So how do we do preemption? Or more specifically, what is the criterion for choosing which tasks to preempt, and under what conditions?
Let’s first walk through an example to get some intuition behind preemption
Let’s say we have a cluster of 6 cpus. Each box here represents a task taking 1 cpu.
Here we have two users, Jerry and Kevin, each of them is using half the cluster
Well, we know eventually, we want to reach a fair allocation like this.
But we don’t know how to get them yet. We don’t know which one of the six tasks we should preempt.
Well, we know that both Jerry and Kevin are above their fair share. So intuitively we can preempt either Jerry’s or Kevin’s tasks, but we don’t know much more beyond that. So we consider all their tasks for preemption, which are marked in orange here, in order to schedule Dave’s task, which is marked in yellow.
And we decide to preempt one of Jerry’s tasks.
And end up like this
Now we do it again, and this time, since Jerry is no longer above his fair share, we only consider Kevin’s tasks for preemption.
And similarly, we decide to preempt one of Kevin’s tasks.
And end up like this. So did we do a good job?
Well, it turns out we did not.
The problem is not all tasks are equal. Different tasks are of different importance to the users and we’ve just preempted some important tasks.
This, again, leads to bad user experience.
Now we know we cannot treat all tasks as equal so we need a score function to reflect a task’s value.
We use value here to represent the two things we’ve mentioned so far. The first is fairness: we want to use the score function to achieve fairness easily. The second is importance: we want the score function to also reflect how important a task is.
And Cook will use the score as the preemption criterion, preempting low-score tasks for high-score tasks.
Let’s see how that works.
First, we don’t quite know how to express the relative importance among all tasks. It is hard for us to say one researcher’s task is more important than another’s.
But we do know how to express the relative importance among tasks of the same user. The user has an easy way to tell us which of his tasks are more important.
Here the importance is shown in currencies, and we have an ordering for each user’s tasks.
But since they are in different currencies, we still cannot compare them across users.
Now it’s important for us to unify the currency.
Here we apply our principle of fairness and say that all users’ most important tasks are of the same value, and so on down the ordering. By doing this, the dollar amount on each task now reflects both fairness and importance.
Now things become easier: when we choose tasks to preempt for the yellow one, we consider all tasks that have a lower value.
The reason we need to consider multiple tasks instead of just the lowest one is that preemption is subject to a bin-packing constraint: the yellow task needs to be able to fit on the host after the preemption. In this example we don’t have that problem because all tasks are the same size, but in reality that is no longer true.
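The candidate-selection step described above might be sketched like this. This is a simplified, single-resource, single-host version with hypothetical names, not Cook’s actual code: among tasks with a lower score than the incoming task, greedily preempt the lowest-scoring ones until the incoming task fits.

```python
def preemption_candidates(host_tasks, free_cpus, need_cpus, incoming_score):
    """Pick tasks to preempt on one host so that an incoming task fits.
    Only tasks scored below the incoming task are eligible; the least
    valuable go first. Returns None if the task cannot fit at all."""
    eligible = sorted(
        (t for t in host_tasks if t["score"] < incoming_score),
        key=lambda t: t["score"],  # lowest score = least valuable first
    )
    chosen, freed = [], free_cpus
    for task in eligible:
        if freed >= need_cpus:
            break
        chosen.append(task)
        freed += task["cpus"]
    return chosen if freed >= need_cpus else None

# Jerry's least valuable task has the lowest score, so it goes first:
tasks = [{"name": "k1", "score": -0.2, "cpus": 1},
         {"name": "j1", "score": -0.5, "cpus": 1}]
picked = preemption_candidates(tasks, free_cpus=0, need_cpus=1,
                               incoming_score=-0.1)
```

A real scheduler would run this across hosts and resource dimensions, but the core rule is the same: never preempt a task more valuable than the one you are placing.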
*Add arrows*
Here, we preempt Jerry’s task
Similarly we do this again
This time, there is only one task considered for preemption.
And finally, we reach fair allocation and we are running most important tasks for each user.
Now we have developed some intuition through this example, especially about what the score function should look like. Let’s take a look at how we formalize it.
Assume there is a total order of jobs for each user, where > means ‘has higher value than’.
We introduce the notion of cumulative resource share or CRS.
The CRS of a job j is the sum of the resources of all jobs of the same user that are greater than or equal to j, divided by the total resource.
Or in mathematical form: CRS(j) = (1 / R_Total) · Σ_{j′ ≥ j} R(j′).
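As a minimal sketch (function and job names hypothetical), CRS amounts to a prefix sum over one user’s jobs ordered most valuable first, divided by the total cluster resource:

```python
def crs(user_jobs_desc, total_resource):
    """Cumulative resource share. user_jobs_desc is a list of
    (job, resource) pairs for ONE user, most valuable first, so the
    running sum at job j covers exactly the jobs j' >= j."""
    shares, cumulative = {}, 0.0
    for job, resource in user_jobs_desc:
        cumulative += resource
        shares[job] = cumulative / total_resource
    return shares

# One user's three jobs on a 6-CPU cluster, most valuable first:
shares = crs([("a", 1), ("b", 2), ("c", 1)], total_resource=6)
```

Note that a user’s most valuable job always has the smallest CRS, which is what lets us compare jobs across users later.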
Note that unlike the currency notion we used before, here a more valuable task has a lower CRS.
So far we have only been considering a single type of resource, but in reality we have multiple; for instance, memory and CPU.
Luckily, there is already some interesting research to help us with that.
Dominant Resource Fairness is a way to achieve fair allocation of multiple resource types. The paper was published by UC Berkeley in 2011, and it is implemented in Mesos itself.
It introduces the notion of dominant resource share, or DRS, which is the maximum of a user’s resource shares across resource types.
It’s simple yet has a lot of good properties. I won’t dig too much into it here; I strongly suggest reading the paper.
Here, we extend the same idea to cumulative resource share.
To recap, here is the definition of CRS.
Dominant cumulative resource share, or DCRS, is defined as the max CRS among all resources.
Finally, we define the score to be the negation of DCRS, because the higher the score, the more valuable the job, and DCRS is the opposite.
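Putting the two definitions together, here is a sketch of the score computation for one user (hypothetical names; this is an illustration of the definition, not Cook’s implementation). Each job’s DCRS is the maximum of its per-resource cumulative shares, and its score is the negation of that:

```python
def dcrs_scores(user_jobs_desc, totals):
    """Score(j) = -DCRS(j). user_jobs_desc is a list of
    (job, usage-dict) pairs for one user, most valuable first;
    totals maps each resource type to the cluster total."""
    cumulative = {r: 0.0 for r in totals}
    scores = {}
    for job, usage in user_jobs_desc:
        for r in totals:
            cumulative[r] += usage[r]
        # DCRS = max cumulative share across resource types
        scores[job] = -max(cumulative[r] / totals[r] for r in totals)
    return scores

totals = {"cpus": 10.0, "mem": 100.0}
s = dcrs_scores([("a", {"cpus": 1, "mem": 40}),
                 ("b", {"cpus": 5, "mem": 10})], totals)
```

Job "a" is memory-dominant (40% of memory) while "b" pushes the cumulative CPU share to 60%, so "a" ends up with the higher score and would be preempted last.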
So far we’ve talked about the problem, fairness, preemption, and the score function. Finally, let’s see how these fit together in Cook.
This is a high level architecture of Cook.
On the left side, we have cook, which consists of three components.
The first component, on the left side, is the Ranker. Its job is to take all running and waiting jobs, sort them for each user, compute the score for those jobs, and return a list of jobs sorted by score.
The list of jobs is then passed to the other two components. On the top is the Matcher. This component takes resource offers from Mesos and matches them against the list of jobs to see if the offers are big enough to fit those jobs; if so, it sends the tasks to Mesos.
The third component, Rebalancer, does preemption. Let’s zoom in to see what it does.
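The Ranker’s flow can be sketched as follows (all names hypothetical, with a toy per-user score function standing in for the DCRS-based one; Cook’s real Ranker is not shown here): score each user’s jobs independently, then merge everything into one global list sorted by score.

```python
from collections import defaultdict

def rank(jobs, score_fn):
    """Ranker sketch: group jobs by user, order each user's jobs by
    that user's own priority (most important first), score them with
    score_fn, then return all jobs sorted by score, highest first."""
    by_user = defaultdict(list)
    for job in jobs:
        by_user[job["user"]].append(job)
    scored = []
    for user_jobs in by_user.values():
        user_jobs.sort(key=lambda j: j["priority"], reverse=True)
        scored.extend(score_fn(user_jobs))  # (job, score) pairs
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [job for job, _ in scored]

def toy_score(user_jobs, total_cpus=6.0):
    """Toy per-user score: negated cumulative CPU share."""
    out, cumulative = [], 0.0
    for j in user_jobs:
        cumulative += j["cpus"]
        out.append((j, -cumulative / total_cpus))
    return out

jobs = [{"user": "jerry", "priority": 2, "cpus": 1},
        {"user": "jerry", "priority": 1, "cpus": 1},
        {"user": "kevin", "priority": 1, "cpus": 1}]
ranked = rank(jobs, toy_score)
```

Because scoring is per-user, each user’s top job lands near the head of the global list regardless of how many jobs other users submitted, which is exactly the fairness property described earlier.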
We asked the question of can we do better. Now is the time to answer it.
Here are the results from the benchmark we ran against
We took a trace from our production workload and ran it with