2. Agenda
• What is a cluster scheduler and why would one need it?
• Cluster scheduler architectures
• Specifics of YARN, Kubernetes, Mesos and Nomad:
• Architecture
• Specific features / positioning
• Pros and cons
3. What is a cluster scheduler?
Do I really need one?
• A software component (monolithic or distributed) with two major functions:
• Allocate resources on node(s) for incoming workloads
• Maintain the task lifecycle on the allocated resources (distribute, run, keep up, shut down)
• A cluster scheduler is different from an application scheduler
• You need one (and are probably using one) if you run a distributed application
• You need a real one if you run more than one application and need some elasticity
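The two functions above can be sketched in a few lines. This is an illustrative toy (all names — `Node`, `Task`, `allocate`, `run_lifecycle` — are hypothetical, not any real scheduler's API): function 1 places a workload on a node with enough free resources, function 2 drives the task through its lifecycle and releases the resources.

```python
# Minimal sketch of the two scheduler responsibilities: resource
# allocation and task lifecycle. Illustrative only.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem: int            # MB

@dataclass
class Task:
    name: str
    cpu: float
    mem: int                 # MB
    state: str = "PENDING"   # PENDING -> RUNNING -> FINISHED

def allocate(task, nodes):
    """Function 1: pick a node with enough free resources (first fit)."""
    for node in nodes:
        if node.free_cpu >= task.cpu and node.free_mem >= task.mem:
            node.free_cpu -= task.cpu
            node.free_mem -= task.mem
            return node
    return None  # no capacity: the task stays pending

def run_lifecycle(task, node):
    """Function 2: distribute, run, keep up, shut down (simplified)."""
    task.state = "RUNNING"
    # ...launch on the node, monitor, restart on failure...
    task.state = "FINISHED"
    node.free_cpu += task.cpu  # release resources on shutdown
    node.free_mem += task.mem

nodes = [Node("n1", free_cpu=2.0, free_mem=4096)]
t = Task("web", cpu=1.0, mem=1024)
placed = allocate(t, nodes)
```

The architectures on the next slides differ mainly in *who* runs these two functions and in how many copies.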
4. Monolithic architecture
• The scheduler is a single process that controls everything about workloads
• Examples: Hadoop JobTracker, Kubernetes (kube-scheduler)
• Simple initial implementation
• Hard to implement different requirements for different workloads
* Picture source: http://www.firmament.io/blog/scheduler-architectures.html
5. Two-level architecture
• Task lifecycle is separated from resource allocation
• Examples: YARN (you have to see it), Mesos
• Easy to add different types of applications
• Hard to implement anti-interference measures and cross-application priority preemption
* Picture source: http://www.firmament.io/blog/scheduler-architectures.html
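The split can be sketched as an offer flow in the Mesos style: the first level (resource manager) owns allocation and hands resources out, the second level (a framework scheduler) decides which of its tasks to launch on each offer. Hypothetical names, not the real Mesos API:

```python
# Two-level scheduling sketch: ResourceManager = level 1 (allocation),
# Framework = level 2 (task lifecycle decisions). Illustrative only.

class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)      # node -> free cpus

    def make_offer(self, node):
        """Offer a node's currently free resources to a framework."""
        return {"node": node, "cpus": self.free[node]}

    def commit(self, node, cpus):
        """Record what the framework accepted."""
        self.free[node] -= cpus

class Framework:
    """Second level: decides which of its tasks fit into an offer."""
    def __init__(self, tasks):
        self.pending = list(tasks)   # (name, cpus) pairs

    def on_offer(self, offer):
        launched, cpus_left = [], offer["cpus"]
        for name, cpus in list(self.pending):
            if cpus <= cpus_left:
                launched.append((name, cpus))
                self.pending.remove((name, cpus))
                cpus_left -= cpus
        return launched              # the rest of the offer is declined

rm = ResourceManager({"n1": 4.0})
fw = Framework([("task-a", 2.0), ("task-b", 3.0)])
accepted = fw.on_offer(rm.make_offer("n1"))
for _, cpus in accepted:
    rm.commit("n1", cpus)
```

Note the weakness the slide mentions: the resource manager never sees *why* a framework wants resources, which is what makes cross-application preemption and interference control hard.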
6. Shared-state architecture
• Each scheduler (i.e. application type) maintains its own view of cluster state and commits changes as transactions (which may succeed or fail)
• Example: Nomad
• State synchronisation is required
* Picture source: http://www.firmament.io/blog/scheduler-architectures.html
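The commit-as-transaction idea is optimistic concurrency: each scheduler plans against a snapshot, and a commit fails if another scheduler changed the state first. A minimal sketch, assuming a version-counter check (illustrative, not Nomad's actual implementation):

```python
# Shared-state scheduling sketch: schedulers plan on snapshots and
# commit optimistically; a stale snapshot makes the transaction fail.

class ClusterState:
    def __init__(self, free_cpu):
        self.free_cpu = free_cpu
        self.version = 0

    def snapshot(self):
        """A scheduler's private copy of the cluster state."""
        return {"free_cpu": self.free_cpu, "version": self.version}

    def commit(self, snap, cpu_needed):
        """Transaction: fails if the state changed since the snapshot
        or there is no longer enough capacity."""
        if snap["version"] != self.version or cpu_needed > self.free_cpu:
            return False             # loser must re-snapshot and retry
        self.free_cpu -= cpu_needed
        self.version += 1
        return True

state = ClusterState(free_cpu=4.0)
snap_a = state.snapshot()            # scheduler A plans...
snap_b = state.snapshot()            # ...scheduler B plans concurrently
ok_a = state.commit(snap_a, 3.0)     # A wins the race
ok_b = state.commit(snap_b, 3.0)     # B's transaction fails
```

The retry loop after a failed commit is exactly the "state synchronisation" cost the slide points at.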
7. Distributed architecture
• No centralised resource allocation; simplified model
• Example: Sparrow
• Great advantages for fine-grained tasks randomly distributed over a large cluster
• Any synchronisation (e.g. to avoid interference) is hard
* Picture source: http://www.firmament.io/blog/scheduler-architectures.html
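The Sparrow-style model works by sampling rather than by global state: probe a couple of random nodes and enqueue the task on the least loaded one ("power of two choices"). A sketch of that placement rule, with illustrative names and no real Sparrow API:

```python
# Distributed, sample-based placement sketch: no central allocator,
# each placement probes a few random nodes and picks the least loaded.
import random

def place(task, node_loads, probes=2, rng=random):
    """Probe `probes` random nodes; enqueue on the shortest queue."""
    sampled = rng.sample(list(node_loads), probes)
    best = min(sampled, key=lambda n: node_loads[n])
    node_loads[best] += 1          # the task joins that node's queue
    return best

rng = random.Random(42)            # seeded for reproducibility
loads = {f"n{i}": 0 for i in range(100)}
for t in range(500):
    place(t, loads, probes=2, rng=rng)
```

With many fine-grained tasks this keeps queues well balanced without any coordination, which is the advantage the slide names; the flip side is that no node-local decision can enforce cluster-wide properties like interference avoidance.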
9. History
• A generalisation of the MapReduce JobTracker (decoupled Resource Manager and Application Master); one of the two parts of “Hadoop”
• Resource allocation is based on requests
• Works fine with large containers and batch processes, not so much with fine-grained tasks / services
• All Hadoop frameworks have first-class support for YARN (MRv2, Pig, Hive, Spark)
• Supports pluggable schedulers (cluster-level) and containerisation
11. Specific features / issues
• Pluggable “queue management” schedulers:
• FairScheduler: memory-fair by default, DRF policy possible for specific queues
• CapacityScheduler: pluggable resource calculator; DominantResourceCalculator supports CPU and memory
• Data locality support is possible (e.g. MRv2)
• Preemption: across queues and within queues (2.8.0/3.0.0)
• Kerberos authentication, ACLs at queue and cluster level
• Awful metrics system, no support for metric collection from “frameworks”
• No volume management
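The DRF policy mentioned above (Dominant Resource Fairness) picks the user whose *dominant share* — their largest usage fraction across resource types — is smallest. A minimal sketch of that computation with made-up numbers (illustrative, not YARN's code):

```python
# DRF sketch: a user's dominant share is max over resource types of
# (usage / cluster capacity); the next allocation goes to the user
# with the smallest dominant share. Numbers are illustrative.

def dominant_share(usage, capacity):
    """usage/capacity are dicts like {"cpu": ..., "mem": ...}."""
    return max(usage[r] / capacity[r] for r in capacity)

capacity = {"cpu": 100, "mem": 1000}       # cluster totals
users = {
    "alice": {"cpu": 30, "mem": 100},      # dominant resource: cpu, 0.30
    "bob":   {"cpu": 10, "mem": 400},      # dominant resource: mem, 0.40
}

# alice's dominant share (0.30) < bob's (0.40), so DRF gives
# the next container to alice.
next_user = min(users, key=lambda u: dominant_share(users[u], capacity))
```

This is why DRF matters for mixed workloads: a memory-fair-only policy would see bob as the lighter user here, while DRF accounts for whichever resource each user stresses most.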
13. History
• Kubernetes came after the internal “Borg” project at Google
• Initially a greenfield implementation of container orchestration targeted at services
• kube-scheduler is a small part of what K8s does
• Best for microservices in the cloud
• Huge momentum
• Very ops-friendly; Google dogfoods it (Google Container Engine runs upstream K8s)
16. Issues
• Many concepts, hard to master and reason about (e.g. controllers are like schedulers, but not really)
• The monolithic kube-scheduler can be slow
• No IO isolation; not suitable for analytical workloads on large on-premise clusters
• No real enterprise support (that I know of)
18. History
• UC Berkeley 2009, Apache top-level project 2013
• Clean two-level architecture implementation
• Resource allocation is based on offers
• Initially part of the BDAS stack, targeted at Big Data first (Apache Spark started as a proof of concept for Mesos)
• Popularised by Mesosphere in the DC/OS product
20. Specific features
• Flexible in the resources that can be allocated: cpus, memory, disks / volumes, gpus
• Pluggable: schedulers (called frameworks), containerizers, loggers, networking (CNI/libnetwork)
• Oversubscription, revocable resources, quotas
• Some volume management
• Very rough around the edges
21. Framework support
• Although running X on Y is very common, Mesos is the leader in hosting other stuff
• It’s really easy to develop a Mesos framework
• Some examples:
• Marathon/Aurora for container orchestration (some people have even tried K8s, but that is too much)
• HDFS/Kafka/NoSQL DBs, if you like to live on the edge
• Jenkins/Artifactory/Gitlab
• Spark/TF/Flink/Storm
24. History
• 2015, developed by HashiCorp
• Shared-state architecture (service/batch/system schedulers); Docker scheduler
• Depends on other HashiCorp tools: Consul, Vault
26. Specific features & issues
• Multi-DC and multi-region support based on Gossip
• Service/batch/system schedulers
• No authorisation, only basic TLS on communication
• No volume management
• No IO isolation
• Preemption?