2. Agenda
● Netflix intro. Context for the talk
● Spark on Mesos
● Mesos cluster configuration
● Autoscaling Mesos clusters
3. Netflix Scale
● Started streaming 10 years ago
● > 130M members
● > 190 countries
● > 1000 device types
● 1/3 of peak US downstream traffic
● 15% of global downstream traffic
4. The value of recommendations
● A few seconds to find something great to watch…
● Can only show a few titles
● Enjoyment directly impacts customer satisfaction
● How? Personalize everything, for 130M members across 190+ countries
12. Running Experiments
• Try an idea offline using historical data to see if it would have improved recommendations
• If it would, deploy a live A/B test to see if it performs well in production
13. Machine Learning workflows
● Workflow: a directed graph of steps, global parameters, triggers...
● Step: describes a job and its configuration
● Interfaces: Python DSL / Scala DSL / REST API / UI
23. Available resources in a Mesos node
A cluster node (e.g. an r4.2xlarge with 8 cpus and 61 GB) reserves part of its resources for the agent daemons:
● 2 cpus and 12 GB reserved for agent daemons
● 6 cpus and 48 GB available for Spark executors
Resulting offer: (6 cpu, 48 GB)
24. Mesos Offers
When a task (e.g. a Spark executor) is launched on an agent, an offer is created with an updated view of the available resources.
Example: node with (6 cpu, 48 GB) → Launch Task: (4 cpu, 16 GB) → Offer: (2 cpu, 32 GB)
25. Mesos Offers
Example: node with (6 cpu, 48 GB) → Launch Task: (2 cpu, 36 GB) → Offer: (4 cpu, 12 GB)
26. Mesos Offers
When a task uses all of the resources of one type (e.g. all the cpus), the resulting offer is unusable: no other task can execute with only one of the resources present.
Example: node with (6 cpu, 48 GB) → Launch Task: (6 cpu, 16 GB) → Offer: (0 cpu, 32 GB)
27. Mesos Offers
Likewise, if a task consumes all the available memory, the resulting offer cannot be used by other tasks.
Example: node with (6 cpu, 48 GB) → Launch Task: (2 cpu, 48 GB) → Offer: (4 cpu, 0 GB)
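The offer arithmetic on slides 24–27 can be sketched in a few lines (illustrative Python, not Mesos code; the function names are made up for this example):

```python
# Sketch: how an agent's free resources shrink as tasks launch,
# and when the resulting offer becomes unusable.

def launch(offer, task):
    """Subtract a task's resources from an offer; both are (cpus, mem_gb)."""
    cpus, mem = offer
    t_cpus, t_mem = task
    if t_cpus > cpus or t_mem > mem:
        raise ValueError("task does not fit in offer")
    return (cpus - t_cpus, mem - t_mem)

def usable(offer):
    """An offer with zero of either resource cannot host any further task."""
    cpus, mem = offer
    return cpus > 0 and mem > 0

node = (6, 48)                 # 6 cpus, 48 GB available for executors
after = launch(node, (4, 16))  # slide 24: launch a (4 cpu, 16 GB) task
print(after, usable(after))    # (2, 32) True

after = launch(node, (6, 16))  # slide 26: all cpus consumed
print(after, usable(after))    # (0, 32) False -- 32 GB stranded
```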
29. Spark Memory Model (Dynamic Assignment)
The share of memory of each task depends on the number of actively running tasks (N): each task is assigned 1/N of the available memory.
Executor with 4 cores available
https://www.slideshare.net/databricks/deep-dive-memory-management-in-apache-spark
30. Spark Memory Model (Dynamic Assignment)
With N=2, each task gets ½ of the memory.
Executor with 4 cores available
https://www.slideshare.net/databricks/deep-dive-memory-management-in-apache-spark
31. Spark Memory Model (Dynamic Assignment)
With N=4, each task gets ¼ of the memory.
Executor with 4 cores available
https://www.slideshare.net/databricks/deep-dive-memory-management-in-apache-spark
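The 1/N dynamic assignment can be sketched as follows (illustrative only; real Spark's unified memory manager additionally guarantees each task a floor of 1/(2N) of the pool):

```python
# Sketch of Spark's dynamic memory assignment inside one executor:
# each of the N actively running tasks gets 1/N of the memory pool.

def share_per_task(pool_gb, active_tasks):
    return pool_gb / active_tasks

pool = 16  # hypothetical executor memory pool, in GB
print(share_per_task(pool, 2))  # 8.0 -> with N=2, each task gets 1/2
print(share_per_task(pool, 4))  # 4.0 -> with N=4, each task gets 1/4
```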
32. Equivalent executors
A (6 cpu, 48 GB) node can be split into executors with the same cpu:memory ratio, e.g. (3 cpu, 24 GB), (1 cpu, 8 GB), and (2 cpu, 16 GB).
33. Proposed Fixed Size Executors
spark.executor.cores = 2
spark.executor.memory = 16g
[diagram: executors of (2 cpu, 16 GB), (2 cpu, 24 GB), (2 cpu, 32 GB)]
34. Proposed Fixed Size Executors
Ideal case: a (6 cpu, 48 GB) node hosts exactly three (2 cpu, 16 GB) executors.
[diagram also shows a mixed split that still fills the node: (2 cpu, 24 GB), (2 cpu, 8 GB), (2 cpu, 16 GB)]
35. Proposed Fixed Size Executors
Bad case (wasted cpu): if executors request more memory, the node's memory is exhausted before its cpus, e.g. (2 cpu, 32 GB) + (2 cpu, 16 GB) leaves 2 unused cpus; likewise (2 cpu, 24 GB) + (2 cpu, 24 GB) leaves 2 unused cpus.
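The ideal and bad packing cases above can be checked with a small first-fit sketch (illustrative Python; `pack` is a made-up helper, not a Spark or Mesos API):

```python
# Sketch: packing executors onto a (6 cpu, 48 GB) node. With uniform
# (2 cpu, 16 GB) executors the node fills exactly; with heavier memory
# requests, memory runs out first and cpus are stranded.

def pack(node, executors):
    cpus, mem = node
    placed = []
    for e_cpus, e_mem in executors:
        if e_cpus <= cpus and e_mem <= mem:
            cpus -= e_cpus
            mem -= e_mem
            placed.append((e_cpus, e_mem))
    return placed, (cpus, mem)  # placed executors, leftover (cpu, mem)

node = (6, 48)
print(pack(node, [(2, 16)] * 3))  # ideal: all three fit, leftover (0, 0)
print(pack(node, [(2, 24)] * 3))  # bad: only two fit, leftover (2, 0)
```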
37. What is the effect of having more resources on your Spark job?
● A Spark job runs in 1000 cpu-hours that cost $1000:
○ Run the job with 1 cpu over 1000 hours.
○ Run the job with 1000 cpus over 1 hour.
[plot: cores vs. minutes]
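The cost equivalence behind that question is simple arithmetic: cloud pricing is (roughly) linear in cpu-hours, so the two extremes cost the same. A minimal sketch, assuming a made-up price of $1 per cpu-hour:

```python
# Sketch of the cpu-hours cost model: total cost depends only on
# cpus * hours, not on how the work is spread across machines.
PRICE_PER_CPU_HOUR = 1.0  # assumed: $1 per cpu-hour

def cost(cpus, hours):
    return cpus * hours * PRICE_PER_CPU_HOUR

print(cost(1, 1000))  # 1000.0
print(cost(1000, 1))  # 1000.0 -- same cost, results 1000x sooner
```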
40. Reporting Free CPU/Memory
Four (6 cpu, 48 GB) nodes with available resources:
● Available: (0 cpu, 32 GB)
● Available: (4 cpu, 0 GB)
● Available: (4 cpu, 12 GB)
● Available: (2 cpu, 32 GB)
Total reported: (10 cpu, 76 GB)
But ONLY usable: (2 cpu, 32 GB) + (4 cpu, 12 GB)
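Slide 40's accounting can be reproduced directly (illustrative Python; the `usable` helper is made up for this example):

```python
# Sketch: summing free (cpu, memory) across agents overstates capacity,
# because offers missing one resource entirely can never host a task.

def usable(free):
    cpus, mem = free
    return cpus > 0 and mem > 0

# free (cpus, GB) per agent, as on the slide
agents = [(0, 32), (4, 0), (4, 12), (2, 32)]

reported = tuple(map(sum, zip(*agents)))
print(reported)      # (10, 76) -- naive "free" total

usable_total = tuple(map(sum, zip(*[a for a in agents if usable(a)])))
print(usable_total)  # (6, 44)  -- only (4, 12) and (2, 32) count
```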
41. Reporting Usage as number of executors in use
Four (6 cpu, 48 GB) agents, each with room for three (2 cpu, 16 GB) executors; per-agent usage: 100%, 66%, 33%, 0%. Average: 50%.
CAPACITY — Used: 6 executors, Available: 6 executors, Total: 12 executors
42. Reporting Usage as binary used / unused
The same four agents, but each agent counts as either used or unused; per-agent usage: 100%, 100%, 100%, 0%. Average: 75%.
CAPACITY — Used: 3 agents, Available: 1 agent, Total: 4 agents
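Slides 41–42 measure the same cluster two ways; the difference is easy to see in a sketch (illustrative Python; the variable names are made up):

```python
# Sketch: 4 agents, 3 executor slots each; executors running per agent
# as on the slides.
running = [3, 2, 1, 0]
slots_per_agent = 3

# Slide 41: usage as fraction of executor slots in use
executor_util = sum(running) / (slots_per_agent * len(running))
print(executor_util)  # 0.5 -> suggests room for 6 more executors

# Slide 42: usage as binary used/unused agents
agent_util = sum(1 for r in running if r > 0) / len(running)
print(agent_util)     # 0.75 -> only 1 of 4 agents is fully free to reclaim
```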
44. Scale up is sorted. How about scaling down?
● Scaling down means terminating some instances to reduce the size of your ASG
● AWS Auto Scaling has a default termination policy:
a. Balance instances across multiple Availability Zones (zones a/b/c within one region)
b. Pick unprotected instances with the oldest launch configuration
c. If there are multiple unprotected instances, pick the closest to the next billing hour
d. Out of those, select instances at random
● If an instance has running executors, those executors will be rescheduled somewhere else in the cluster and the lost portion of computation will be reprocessed. [SLOW]
● If an instance has running drivers, the entire Spark job needs to be restarted. [BAD]
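The SLOW/BAD distinction suggests an ordering when picking scale-down victims. A minimal sketch of that idea (a hypothetical helper, not an AWS or Mesos API):

```python
# Sketch: prefer terminating agents that run no drivers (restarting a
# driver kills the whole job), and among those, the least-loaded first
# (fewer executors to reschedule).

def termination_candidates(agents):
    """agents: list of dicts with 'id', 'drivers', 'executors' counts."""
    safe = [a for a in agents if a["drivers"] == 0]
    return sorted(safe, key=lambda a: a["executors"])

agents = [
    {"id": "i-1", "drivers": 1, "executors": 2},  # never terminate: driver
    {"id": "i-2", "drivers": 0, "executors": 3},  # slow: executors reschedule
    {"id": "i-3", "drivers": 0, "executors": 0},  # ideal victim
]
print([a["id"] for a in termination_candidates(agents)])  # ['i-3', 'i-2']
```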
52. Mesos Clusters
Adding extra capacity takes less than 5 minutes
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=600s
53. Mesos receiving Offers
Mesos agents periodically send a message to the Mesos master exposing how many resources they have available to execute tasks. Resources normally consist of cpus, gpus, memory, disk, and network bandwidth. We will focus on cpus and memory for now.
[diagram: Mesos master receiving available resources (called offers) from four agents, e.g. "5 cpus, 2GB free", "2 cpus, 8GB free", "1 cpu, 1GB free"]
54. Mesos docs: “Decline resources using a large timeout”
[diagram: Mesos master with offers "5 cpus, 2GB free", "2 cpus, 8GB free", "1 cpu, 1GB free", sorted by max share ascending]
--conf spark.mesos.rejectOfferDuration=120s
--conf spark.mesos.rejectOfferDurationForReachedExecutorLimit=120s
http://mesos.apache.org/documentation/latest/app-framework-development-guide/
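Combining the offer-handling settings from this and slide 52, a hypothetical submission might look like the following (the flag values are the slides'; the master URL and job file are made up):

```shell
# Hypothetical spark-submit against a Mesos master, declining offers
# with a large timeout and waiting longer for resources to register.
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=600s \
  --conf spark.mesos.rejectOfferDuration=120s \
  --conf spark.mesos.rejectOfferDurationForReachedExecutorLimit=120s \
  my_job.py
```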
55. Mesos docs: “Do not revive frequently”
Get rid of this
https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L640-L663
http://mesos.apache.org/documentation/latest/app-framework-development-guide/