2. Agenda
● Netflix intro. Context for the talk
● Spark on Mesos
● Mesos cluster configuration
● Autoscaling Mesos clusters
3. Netflix Scale
● Started streaming 10 years ago
● > 130M members
● > 190 countries
● > 1000 device types
● 1/3 of peak US downstream traffic
● 15% of global downstream traffic
4. The value of recommendations
● A few seconds to find something great to watch…
● Can only show a few titles
● Enjoyment directly impacts customer satisfaction
● How? Personalize everything, for 130M members across 190+ countries
12. Running Experiments
• Try an idea offline using historical data to see if it would have improved recommendations
• If it would, deploy a live A/B test to see if it performs well in production
13. Machine Learning workflows
● Workflow: a directed graph of steps, global parameters, triggers...
● Step: describes a job and its configuration
● Interfaces: Python DSL / Scala DSL / REST API / UI
23. Available resources in a Mesos node
A cluster node (e.g. an r4.2xlarge with 8 cpus and 61 GB) reserves part of its resources for the agent daemons:
● 2 cpus and 12 GB reserved for agent daemons
● 6 cpus and 48 GB available for Spark executors
Resulting offer: (6 cpu, 48 GB)
24. Mesos Offers
When a task (e.g. a Spark executor) is launched on an agent, an offer is created with an updated view of the available resources.
Example: node with (6 cpu, 48 GB) → Launch Task: (4 cpu, 16 GB) → Offer: (2 cpu, 32 GB)
25. Mesos Offers
Example: node with (6 cpu, 48 GB) → Launch Task: (2 cpu, 36 GB) → Offer: (4 cpu, 12 GB)
26. Mesos Offers
When a task uses all of the resources of one type (e.g. all the cpus), the resulting offer is unusable: no other task can execute with only one of the resources present.
Example: node with (6 cpu, 48 GB) → Launch Task: (6 cpu, 16 GB) → Offer: (0 cpu, 32 GB)
27. Mesos Offers
Likewise, if a task consumes all the available memory, the resulting offer cannot be used by other tasks.
Example: node with (6 cpu, 48 GB) → Launch Task: (2 cpu, 48 GB) → Offer: (4 cpu, 0 GB)
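The offer arithmetic on slides 24–27 can be sketched in a few lines (illustrative Python, not Mesos code; the function names are made up for this example):

```python
# Sketch: how an agent's free resources shrink as tasks launch,
# and when the resulting offer becomes unusable.

def launch(offer, task):
    """Subtract a task's resources from an offer; both are (cpus, mem_gb)."""
    cpus, mem = offer
    t_cpus, t_mem = task
    if t_cpus > cpus or t_mem > mem:
        raise ValueError("task does not fit in offer")
    return (cpus - t_cpus, mem - t_mem)

def usable(offer):
    """An offer with zero of either resource cannot host any further task."""
    cpus, mem = offer
    return cpus > 0 and mem > 0

node = (6, 48)                 # 6 cpus, 48 GB available for executors
after = launch(node, (4, 16))  # slide 24: launch a (4 cpu, 16 GB) task
print(after, usable(after))    # (2, 32) True

after = launch(node, (6, 16))  # slide 26: all cpus consumed
print(after, usable(after))    # (0, 32) False -- 32 GB stranded
```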
29. Spark Memory Model (Dynamic Assignment)
The share of memory of each task depends on the number of actively running tasks (N): each task is assigned 1/N of the available memory.
Executor with 4 cores available
https://www.slideshare.net/databricks/deep-dive-memory-management-in-apache-spark
30. Spark Memory Model (Dynamic Assignment)
With N=2, each task gets ½ of the memory.
Executor with 4 cores available
https://www.slideshare.net/databricks/deep-dive-memory-management-in-apache-spark
31. Spark Memory Model (Dynamic Assignment)
With N=4, each task gets ¼ of the memory.
Executor with 4 cores available
https://www.slideshare.net/databricks/deep-dive-memory-management-in-apache-spark
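The 1/N dynamic assignment can be sketched as follows (illustrative only; real Spark's unified memory manager additionally guarantees each task a floor of 1/(2N) of the pool):

```python
# Sketch of Spark's dynamic memory assignment inside one executor:
# each of the N actively running tasks gets 1/N of the memory pool.

def share_per_task(pool_gb, active_tasks):
    return pool_gb / active_tasks

pool = 16  # hypothetical executor memory pool, in GB
print(share_per_task(pool, 2))  # 8.0 -> with N=2, each task gets 1/2
print(share_per_task(pool, 4))  # 4.0 -> with N=4, each task gets 1/4
```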
32. Equivalent executors
A (6 cpu, 48 GB) node can be split into executors with the same cpu:memory ratio, e.g. (3 cpu, 24 GB), (1 cpu, 8 GB), and (2 cpu, 16 GB).
33. Proposed Fixed Size Executors
spark.executor.cores = 2
spark.executor.memory = 16g
[diagram: executors of (2 cpu, 16 GB), (2 cpu, 24 GB), (2 cpu, 32 GB)]
34. Proposed Fixed Size Executors
Ideal case: a (6 cpu, 48 GB) node hosts exactly three (2 cpu, 16 GB) executors.
[diagram also shows a mixed split that still fills the node: (2 cpu, 24 GB), (2 cpu, 8 GB), (2 cpu, 16 GB)]
35. Proposed Fixed Size Executors
Bad case (wasted cpu): if executors request more memory, the node's memory is exhausted before its cpus, e.g. (2 cpu, 32 GB) + (2 cpu, 16 GB) leaves 2 unused cpus; likewise (2 cpu, 24 GB) + (2 cpu, 24 GB) leaves 2 unused cpus.
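The ideal and bad packing cases above can be checked with a small first-fit sketch (illustrative Python; `pack` is a made-up helper, not a Spark or Mesos API):

```python
# Sketch: packing executors onto a (6 cpu, 48 GB) node. With uniform
# (2 cpu, 16 GB) executors the node fills exactly; with heavier memory
# requests, memory runs out first and cpus are stranded.

def pack(node, executors):
    cpus, mem = node
    placed = []
    for e_cpus, e_mem in executors:
        if e_cpus <= cpus and e_mem <= mem:
            cpus -= e_cpus
            mem -= e_mem
            placed.append((e_cpus, e_mem))
    return placed, (cpus, mem)  # placed executors, leftover (cpu, mem)

node = (6, 48)
print(pack(node, [(2, 16)] * 3))  # ideal: all three fit, leftover (0, 0)
print(pack(node, [(2, 24)] * 3))  # bad: only two fit, leftover (2, 0)
```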
37. What is the effect of having more resources on your Spark job?
● A Spark job runs in 1000 cpu-hours that cost $1000:
○ Run the job with 1 cpu over 1000 hours.
○ Run the job with 1000 cpus over 1 hour.
[plot: cores vs. minutes]
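The cost equivalence behind that question is simple arithmetic: cloud pricing is (roughly) linear in cpu-hours, so the two extremes cost the same. A minimal sketch, assuming a made-up price of $1 per cpu-hour:

```python
# Sketch of the cpu-hours cost model: total cost depends only on
# cpus * hours, not on how the work is spread across machines.
PRICE_PER_CPU_HOUR = 1.0  # assumed: $1 per cpu-hour

def cost(cpus, hours):
    return cpus * hours * PRICE_PER_CPU_HOUR

print(cost(1, 1000))  # 1000.0
print(cost(1000, 1))  # 1000.0 -- same cost, results 1000x sooner
```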
40. Reporting Free CPU/Memory
Four (6 cpu, 48 GB) nodes with available resources:
● Available: (0 cpu, 32 GB)
● Available: (4 cpu, 0 GB)
● Available: (4 cpu, 12 GB)
● Available: (2 cpu, 32 GB)
Total reported: (10 cpu, 76 GB)
But ONLY usable: (2 cpu, 32 GB) + (4 cpu, 12 GB)
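Slide 40's accounting can be reproduced directly (illustrative Python; the `usable` helper is made up for this example):

```python
# Sketch: summing free (cpu, memory) across agents overstates capacity,
# because offers missing one resource entirely can never host a task.

def usable(free):
    cpus, mem = free
    return cpus > 0 and mem > 0

# free (cpus, GB) per agent, as on the slide
agents = [(0, 32), (4, 0), (4, 12), (2, 32)]

reported = tuple(map(sum, zip(*agents)))
print(reported)      # (10, 76) -- naive "free" total

usable_total = tuple(map(sum, zip(*[a for a in agents if usable(a)])))
print(usable_total)  # (6, 44)  -- only (4, 12) and (2, 32) count
```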
41. Reporting Usage as number of executors in use
Four (6 cpu, 48 GB) agents, each with room for three (2 cpu, 16 GB) executors; per-agent usage: 100%, 66%, 33%, 0%. Average: 50%.
CAPACITY — Used: 6 executors, Available: 6 executors, Total: 12 executors
42. Reporting Usage as binary used / unused
The same four agents, but each agent counts as either used or unused; per-agent usage: 100%, 100%, 100%, 0%. Average: 75%.
CAPACITY — Used: 3 agents, Available: 1 agent, Total: 4 agents
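Slides 41–42 measure the same cluster two ways; the difference is easy to see in a sketch (illustrative Python; the variable names are made up):

```python
# Sketch: 4 agents, 3 executor slots each; executors running per agent
# as on the slides.
running = [3, 2, 1, 0]
slots_per_agent = 3

# Slide 41: usage as fraction of executor slots in use
executor_util = sum(running) / (slots_per_agent * len(running))
print(executor_util)  # 0.5 -> suggests room for 6 more executors

# Slide 42: usage as binary used/unused agents
agent_util = sum(1 for r in running if r > 0) / len(running)
print(agent_util)     # 0.75 -> only 1 of 4 agents is fully free to reclaim
```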
44. Scale up is sorted. How about scaling down?
● Scaling down means terminating some instances to reduce the size of your ASG
● AWS Auto Scaling has a default termination policy:
a. Balance instances across multiple Availability Zones (zones a/b/c within one region)
b. Pick unprotected instances with the oldest launch configuration
c. If there are multiple unprotected instances, pick the closest to the next billing hour
d. Out of those, select instances at random
● If an instance has running executors, those executors will be rescheduled somewhere else in the cluster and the lost portion of computation will be reprocessed. [SLOW]
● If an instance has running drivers, the entire Spark job needs to be restarted. [BAD]
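The SLOW/BAD distinction suggests an ordering when picking scale-down victims. A minimal sketch of that idea (a hypothetical helper, not an AWS or Mesos API):

```python
# Sketch: prefer terminating agents that run no drivers (restarting a
# driver kills the whole job), and among those, the least-loaded first
# (fewer executors to reschedule).

def termination_candidates(agents):
    """agents: list of dicts with 'id', 'drivers', 'executors' counts."""
    safe = [a for a in agents if a["drivers"] == 0]
    return sorted(safe, key=lambda a: a["executors"])

agents = [
    {"id": "i-1", "drivers": 1, "executors": 2},  # never terminate: driver
    {"id": "i-2", "drivers": 0, "executors": 3},  # slow: executors reschedule
    {"id": "i-3", "drivers": 0, "executors": 0},  # ideal victim
]
print([a["id"] for a in termination_candidates(agents)])  # ['i-3', 'i-2']
```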
52. Mesos Clusters
Adding extra capacity takes less than 5 minutes
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=600s
53. Mesos receiving Offers
Mesos agents periodically send a message to the Mesos master exposing how many resources they have available to execute tasks. Resources normally consist of cpus, gpus, memory, disk, and network bandwidth. We will focus on cpus and memory for now.
[diagram: Mesos master receiving available resources (called offers) from four agents, e.g. "5 cpus, 2GB free", "2 cpus, 8GB free", "1 cpu, 1GB free"]
54. Mesos docs: “Decline resources using a large timeout”
[diagram: Mesos master with offers "5 cpus, 2GB free", "2 cpus, 8GB free", "1 cpu, 1GB free", sorted by max share ascending]
--conf spark.mesos.rejectOfferDuration=120s
--conf spark.mesos.rejectOfferDurationForReachedExecutorLimit=120s
http://mesos.apache.org/documentation/latest/app-framework-development-guide/
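Combining the offer-handling settings from this and slide 52, a hypothetical submission might look like the following (the flag values are the slides'; the master URL and job file are made up):

```shell
# Hypothetical spark-submit against a Mesos master, declining offers
# with a large timeout and waiting longer for resources to register.
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=600s \
  --conf spark.mesos.rejectOfferDuration=120s \
  --conf spark.mesos.rejectOfferDurationForReachedExecutorLimit=120s \
  my_job.py
```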
55. Mesos docs: “Do not revive frequently”
Get rid of this
https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L640-L663
http://mesos.apache.org/documentation/latest/app-framework-development-guide/