Enabling Diverse Workload Scheduling in YARN

© Hortonworks Inc. 2011 – 2014. All Rights Reserved
June, 2015
Wangda Tan, Hortonworks, (wangda@apache.com)
Craig Welch, Hortonworks, (cwelch@hortonworks.com)

About us
Wangda Tan
• Last 5+ years in big data field,
Hadoop, Open-MPI, etc.
• Past
– Pivotal (PHD team, brings
OpenMPI/GraphLab to YARN)
– Alibaba (ODPS team, platform for
distributed data-mining)
• Now
– Apache Hadoop Committer
@Hortonworks, all in YARN.
– Now spending most of my time on
resource scheduling enhancements.
Craig Welch
• YARN contributor

Hadoop+YARN is the home of
big data processing.

Our workloads vary,
Service | Batch | Interactive / real-time

They have different CRAZY requirements
I wanna be fast!
When cluster is busy
Don’t take away
MY RESOURCES
A huge job
needs to be scheduled
at a special time

We want to make them
AS HAPPY AS POSSIBLE
to run together in YARN.

Let’s start…

Agenda today
• Overview
• Node Label
• Resource Preemption
• Reservation system
• Pluggable behavior for Scheduler
• Docker support
• Resource scheduling beyond memory

Overview

Background
• Resources are
managed by a
hierarchy of queues.
• One queue can have
multiple applications
• A container is the result of
resource scheduling:
a bundle of
resources that can run
one or more processes

How to manage your workload by queues
• By organization:
–Marketing/Finance
queue
• By workload
–Interactive/Batch queue
• Hybrid
–Finance-
batch/Marketing-
realtime queue

Node Label

Node Label – Overview
• Types of node labels
– Node partition (Since 2.6)
– Node constraints (WIP)
• Node partition (Today’s focus)
– One node belongs to only one
partition
– Related to resource planning
• Node constraints
– One node can be assigned multiple
constraints
– Not related to resource planning

Node partition – Resource planning
• Nodes belong to “default partition” if not specified
• It’s possible to specify different capacities of queues on different partitions
–For example, the sales queue can be given different capacities on the GPU partition and the default partition.
• It’s possible to restrict a partition so that only certain queues can use it
(ACL for partition)
–For example, only the sales queue can access the “Large memory partition”

Node partition – Exclusive vs. Non-exclusive
[Figure: a resource request against the Snake partition, Bear partition and Default partition. Labels: exclusive partition; non-exclusive partition (“Use it when they're not at home”).]

Node Partition – Use cases & best practice
• Dedicate nodes to run important services:
–E.g. Running HBase region server using Apache Slider
• Nodes with special hardware in the cluster are used by organizations.
–E.g. You may want a queue dedicated to the marketing department to use 80% of
these memory-heavy nodes.
• Use non-exclusive node partitions to improve resource utilization.
• Be careful about user-limits, capacity, etc. to make sure jobs can be
launched
I will cover more details about implementation & usage in Thursday morning’s
session “YARN Node Labels” with Mayank Bansal from eBay.

Resource Preemption

Resource Preemption – Overview
• Each queue has a configured minimum resource.
• Because of this minimum, the preemption policy (which
performs the actual preemption of resources) is used to ensure that:
–When a queue is under its “minimum resource” and the cluster has no
available resources, the preemption policy can reclaim resources from other queues that use
more than their minimum resource.
[Figure: three queues A, B and C, labeled 20%, 30% and 50%]
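The preemption rule can be sketched as follows. This is a simplified model, not the actual YARN preemption policy: the `plan_preemption` function is hypothetical, resources are plain integers, and the sketch assumes that the 20% / 30% / 50% figures are the minimum resources of queues A, B and C in a 100-unit cluster. Queues below their minimum with pending demand create a need; the need is reclaimed from queues above their minimum, in proportion to how far over they are.

```python
def plan_preemption(guaranteed, used, pending, free=0):
    """Return {queue: amount to reclaim} for one scheduling snapshot."""
    # How much do under-served queues still need to reach their minimum?
    needed = 0
    for q in guaranteed:
        if used[q] < guaranteed[q]:
            needed += min(pending[q], guaranteed[q] - used[q])
    # Free cluster resources are handed out first; no preemption for those.
    needed = max(0, needed - free)

    # Only queues using more than their minimum can be preempted.
    over = {q: used[q] - guaranteed[q]
            for q in guaranteed if used[q] > guaranteed[q]}
    total_over = sum(over.values())
    if needed == 0 or total_over == 0:
        return {}

    reclaim = min(needed, total_over)
    return {q: reclaim * amount / total_over for q, amount in over.items()}


guaranteed = {"A": 20, "B": 30, "C": 50}
used = {"A": 0, "B": 50, "C": 50}      # cluster is full, A is starved
pending = {"A": 20, "B": 0, "C": 0}    # A asks for its minimum

print(plan_preemption(guaranteed, used, pending))
```

In this snapshot only B is above its minimum, so all 20 units come from B; C sits exactly at its minimum and is left alone.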

Resource Preemption – Example
• When preemption is not enabled
• When preemption is enabled

Resource Preemption – best practice
•Configurations to control the pace of preemption:
–yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill
–yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round
–yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor
•Configurations to control when or if preemption happens
–yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity
(deadzone)
–yarn.scheduler.capacity.<queue-path>.disable_preemption

Reservation System

Reservation System – Overview
• Reserving resources ahead of time
– Just like booking a table in a restaurant
– “I need a table for X people at Y time”
– “Wait a moment … Reservation
confirmed, sir”
– (After some time) “Your table is ready”
• What the Reservation System does:
– Send a reservation request
– RM checks the timetable
– Send back a reservation confirmation ID
– Notify when ready
•Enables more predictable start and
run time for time-critical / resource
intensive applications
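The request / check / confirm flow can be sketched as a toy timetable. This is an illustration only: the `ReservationTable` class, its integer time slots and the ID format are hypothetical and do not reflect the Reservation System's API. A request is admitted only if every slot it covers still has room, and a confirmation ID is returned on success.

```python
import itertools


class ReservationTable:
    def __init__(self, capacity):
        self.capacity = capacity          # containers available per slot
        self.committed = {}               # time slot -> containers promised
        self._ids = itertools.count(1)

    def reserve(self, start, duration, containers):
        """Return a confirmation ID, or None if the slots are too full."""
        slots = range(start, start + duration)
        # Check the timetable: every slot must have room for the request.
        if any(self.committed.get(t, 0) + containers > self.capacity
               for t in slots):
            return None
        for t in slots:
            self.committed[t] = self.committed.get(t, 0) + containers
        return f"reservation-{next(self._ids)}"


table = ReservationTable(capacity=100)
print(table.reserve(start=1, duration=2, containers=80))  # admitted
print(table.reserve(start=2, duration=1, containers=30))  # slot 2 is too full
print(table.reserve(start=3, duration=1, containers=30))  # admitted
```

Rejecting a request up front, instead of letting the job start and wait, is what makes start times predictable for the jobs that are admitted.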

Reservation System – Use cases
•Gang scheduling
– Currently, YARN can do gang
scheduling from the application side (holding
resources until it meets requirements)
– Resources could be wasted and there is a
risk of deadlock.
–RS lays the foundation for gang scheduling
•Workflow support
– I want to run jobs in stages
– Stage-1 at 1 AM tomorrow, needs 10k
containers
– Stage-2 after stage-1, needs 5k
containers
– Stage-3 after stage-2, needs 2k
containers
– You can submit such requests to RS!

Reservation System – Result & References
•Before & After Reservation System
(reports from MSR)
– It increased cluster utilization a lot!
•References
– Design / Discussion / Report : YARN-1051
– More detail about example : YARN-2609

Pluggable scheduler behavior

Why
• Problem
• It’s difficult to share functionality
between schedulers
• Users cannot achieve the same
behavior with all schedulers
• Fixes and enhancements tend to end up
in one scheduler, not all, leading to
fragmentation
• No simple mechanism exists to mix
behaviors for a given feature in a single
cluster
• Solution
• Move to sharable, pluggable scheduler
behavior

How
• The Goal
–Recast scheduler behavior as
policies – candidates include
–Resource limits for apps, users...
–Ordering for allocation and
preemption
• With this, we can:
–Maximize feature availability and
reduce fragmentation
–Configure different queues for
different workloads in a single
cluster
Flexible Scheduler configuration,
as simple
as building with Legos!

Ordering Policy of Capacity Scheduler
• Pluggable ordering policies for
LeafQueues in Capacity Scheduler
–Enables the implementation of different
policies for ordering assignment and
preemption of containers for applications
–Initial implementations include FIFO
(Capacity Scheduler original behavior)
and Fair
–User Limits and Queue Capacity limits
are still respected
• Fair scheduling inside Capacity
Scheduler
–Based on the Fair Sharing logic in
FairScheduler
–Assigns containers to applications in
order of least to greatest resource usage
–Allows many applications to make
progress concurrently
–Lets short jobs finish in reasonable time
while not starving long running jobs
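The difference between the two ordering policies can be sketched as two sort keys. The `App` class and both functions are hypothetical stand-ins rather than Capacity Scheduler code, and memory usage in MB is assumed as the only measure of resource usage: FIFO serves applications in submission order, while fair ordering serves the application that currently uses the least.

```python
from dataclasses import dataclass


@dataclass
class App:
    app_id: str
    submit_order: int   # lower means submitted earlier
    used_mb: int        # resources the app currently holds


def fifo_order(apps):
    """Earliest submitted application is offered containers first."""
    return sorted(apps, key=lambda a: a.submit_order)


def fair_order(apps):
    """Application with the least usage is offered containers first;
    submission order breaks ties."""
    return sorted(apps, key=lambda a: (a.used_mb, a.submit_order))


apps = [App("long-batch", 1, 8192), App("short-query", 2, 1024)]
print([a.app_id for a in fifo_order(apps)])
print([a.app_id for a in fair_order(apps)])
```

Under fair ordering the short query is served ahead of the long batch job even though it arrived later, which is why short jobs can finish in reasonable time without the long job being starved.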

Configuration and tuning
• Rough guidelines for when to use Fair
and FIFO ordering policies
• Configuration
–yarn.scheduler.capacity.<queue>.ordering-policy
(“fifo” or “fair”, default “fifo”)
–yarn.scheduler.capacity.<queue>.ordering-policy.fair.enable-size-based-weight
(true or false)
• Tuning
–Use maximum-am-resource-percent to
avoid “peanut buttering” (spreading resources too thinly) from having
too many apps running at once
–Sometimes it’s necessary to separate
large and small apps in different
queues, or use size-based-weight, to
avoid large app starvation
Workloads | Policy
On-demand / interactive / exploratory | Fair
Predictable / recurring batch | FIFO
Mix of the above two | Fair

Docker container support

Docker container support – Overview
• Containers for the Cluster
–Brings the sandboxing and
dependency isolation of container
technology to Hadoop
–Containers make it simple to use
Hadoop resources for a wider range of
applications

Docker container support – Status
• Done
–(V1) An initial implementation that
translated Kubernetes to an
Application Master launching
Docker containers from the cluster
met with success.
–(V2) A custom container launcher
for Docker containers. This brought
the capability more fully under the
management of YARN,
but a single cluster could not
support both traditional YARN
applications (MapReduce, etc.)
and Docker concurrently.
• Next phase
–(V3) WIP: adding support for
running Docker and traditional
YARN applications side-by-side in
a single cluster

It’s not all about memory

It’s not all about Memory - CPU
• What’s in a CPU
–Some workloads are CPU
intensive. Without accounting for
this, nodes may end up CPU bound
or CPU may be underutilized
cluster-wide
–CPU awareness at the scheduler
level is enabled by selecting the
DominantResourceCalculator.
–Dominant? “Dominant” stands for
the “dominant factor”, or the
“bottleneck”. In simplified terms,
the resource type which is the
most constrained becomes the
dominant factor for any given
comparison or calculation
–For example, if there is enough
memory but not enough CPU for a
resource request, the CPU
component is dominant (and the
answer is “No”)
–See
https://www.cs.berkeley.edu/~alig/papers/drf.pdf
for more detail
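The “dominant factor” idea can be sketched in a few lines. This is a simplified illustration and not the DominantResourceCalculator itself: the function names, the dictionary layout and the numbers are all assumptions made for the example. The dominant resource is the one with the highest used-to-total ratio, and a request fits only if every resource type fits.

```python
def dominant_resource(used, total):
    """Return (resource name, share) for the most constrained resource."""
    shares = {r: used[r] / total[r] for r in total}
    name = max(shares, key=shares.get)
    return name, shares[name]


def fits(request, available):
    """A request is satisfiable only if no single resource falls short."""
    return all(request[r] <= available[r] for r in request)


total = {"memory_mb": 100000, "vcores": 100}
used = {"memory_mb": 20000, "vcores": 50}

# 20% of memory but 50% of vcores are in use: CPU is dominant.
print(dominant_resource(used, total))

# Enough memory, not enough CPU: the answer is "No".
print(fits({"memory_mb": 2048, "vcores": 4},
           {"memory_mb": 8192, "vcores": 2}))
```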

It’s not all about Memory – CPU - Vcores
• What’s in a CPU
–The unit used to abstract CPU
capability in YARN is the vcore
–Vcore counts are configured per
node in yarn-site.xml, typically
one vcore per physical CPU
–If some nodes’ CPUs outclass
other nodes’, the number of vcores
per physical CPU can be adjusted
upward to compensate
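The adjustment described above amounts to simple arithmetic, sketched here with a hypothetical `node_vcores` helper and made-up multipliers. The result is the kind of value an administrator would then place in the node's vcore setting in yarn-site.xml (`yarn.nodemanager.resource.cpu-vcores`).

```python
def node_vcores(physical_cpus, vcores_per_cpu=1.0):
    """Vcores to advertise for a node, rounded down to a whole number."""
    return int(physical_cpus * vcores_per_cpu)


# Baseline node: one vcore per physical CPU.
print(node_vcores(16))

# Faster node (assumed 1.5x as capable): advertise more vcores
# for the same number of physical CPUs.
print(node_vcores(16, vcores_per_cpu=1.5))
```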

Q & A
?
