The benefits of running Spark on your own Docker

Shir Bromberg (Big Data Team Leader) @ Yotpo:
Nowadays, many of an organization’s main applications rely on Spark pipelines. As these applications become more significant to businesses, so does the need to quickly deploy, test and monitor them.

The standard way of running Spark jobs is to deploy them on a dedicated managed cluster. However, this solution is relatively expensive, with potentially long setup times. We therefore developed a way to run Spark on any container orchestration platform, which lets us run Spark in a simple, custom, and testable way.

In this talk, we will present our open-source Docker images for running Spark on Nomad servers. We will cover:
* The issues we had running Spark on managed clusters and the solution we developed.
* How to build a Spark Docker image.
* And finally, what you can achieve by using Spark on Nomad.

  1. The Benefits of Running Spark on Docker (Shir Bromberg, Big Data Team Leader @ Yotpo)
  2. Agenda
     ● Motivation
     ● Spark on Docker
       ○ Solution overview
       ○ Building the Spark Docker image
       ○ Deploying your application using “Spark on Nomad”
     ● Why it is better
     ● Next steps
  3. Motivation
     ● Many applications rely on Spark
     ● These applications also require:
       ○ unit testing
       ○ quality tests
       ○ analytics
       ○ …
     ● A year ago: we ran these jobs on an on-demand managed cluster
       ○ Managed cluster = EMR (AWS) or Dataproc (GCP)
  4. Pain Points of Managed Clusters
     ● Setup time: long cluster startup
       ○ Tests should be quick
     ● Environment: installed packages must be managed manually on the cluster
       ○ Using an AMI ⇒ coupling with AWS :(
     ● Pricing: pay per instance
     ● Slow releases
  5. Existing Solutions
     Option 1: Start a cluster for every test
     Problem:
     ● Time-consuming
  6. Existing Solutions
     Option 1: Start a cluster for every test
     Problem:
     ● Time-consuming
     Option 2: Keep a cluster that is always up
     Problems:
     ● The environment must be able to run all test cases
     ● Can be expensive
  7. In an Ideal World...
     Our solution for working with Spark should follow these guidelines:
     ● Reduce Cost
     ● Simple To Use
     ● Better Testability
     ● Monitoring
     ● Cloud Agnostic
     ● Scalability
  8. Basic Terms
  9. Spark in a Nutshell
     An open-source framework for distributed data processing. Provides high-level APIs in Scala, Java, Python, and R.
     [Diagram: the Spark stack: Streaming, SQL, GraphX, and MLlib on top of Spark Core, running over a resource manager (YARN/Mesos/Standalone) and a data storage layer (S3/HDFS)]
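For readers who have not used Spark, here is a minimal application in Scala. This is a sketch for illustration only: the object name, dataset path, and column are placeholders, not taken from the talk.

```scala
import org.apache.spark.sql.SparkSession

// Minimal illustrative Spark job; path and column names are placeholders.
object ExampleJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("example-job")
      .getOrCreate()

    // Read a dataset from the storage layer (S3 in this sketch).
    val events = spark.read.json("s3a://my-bucket/events/")

    // A high-level transformation, executed in parallel by the executors.
    events.groupBy("event_type").count().show()

    spark.stop()
  }
}
```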
  10. Docker in a Nutshell
      Enables you to package your code and its dependencies into a deployable unit called a container.
      [Diagram: a Dockerfile is built into an image on a server, pushed to Docker Hub, and run as containers on top of the OS and the Docker engine]
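As a hedged sketch of what building a Spark image involves, a minimal Dockerfile could look like the following. This is not the actual metorikku Dockerfile (that lives in the repository linked on slide 21); the base image and Spark version are assumptions.

```dockerfile
# Illustrative sketch: package a plain Apache Spark distribution
# on top of a JRE base image.
FROM openjdk:8-jre-slim

ENV SPARK_VERSION=2.4.4
RUN apt-get update && apt-get install -y curl && \
    curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz" \
      | tar -xzf - -C /opt && \
    ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop2.7 /opt/spark

ENV SPARK_HOME=/opt/spark
ENV PATH="${PATH}:${SPARK_HOME}/bin"

# Default to a shell; the submit/master/worker images override the
# command to start the relevant Spark service.
CMD ["bash"]
```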
  11. Nomad in a Nutshell
      A container orchestration platform by HashiCorp.
      [Diagram: Nomad servers handle resource scheduling, task scheduling, and leader election; a job submitted through Nomad’s API is allocated (image, instances, memory, CPU, ...) to client nodes in the Nomad cluster, which execute the tasks]
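For orientation, a minimal Nomad job specification that runs a single Docker task might look roughly like this. The job and group names are illustrative; only the image name comes from the talk's Docker Hub link.

```hcl
# Minimal illustrative Nomad job: one Docker task, run as a batch job.
job "hello-spark" {
  datacenters = ["dc1"]
  type        = "batch"

  group "example" {
    task "submit" {
      driver = "docker"

      config {
        image = "metorikku/spark" # image from the talk's Docker Hub link
      }

      resources {
        cpu    = 500  # MHz
        memory = 1024 # MB
      }
    }
  }
}
```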
  12. High-Level Solution
  13.-18. [Diagram, built up across these slides: a user submits a job to Nomad; Nomad schedules it onto EC2 instances; a scaler launches additional Nomad instances on demand; each instance runs Docker, and the job itself runs in a Spark Docker container]
  19. Dockerizing Spark
  20. Building our own Spark Docker image sounds difficult… Apparently it’s not!
  21. And We Made It Open Source
      Open source: https://github.com/YotpoLtd/metorikku/tree/master/docker/spark
      Docker Hub: https://hub.docker.com/r/metorikku/spark
  22. Spark Components
      ● Driver
        ○ Runs the main() function of the application
        ○ Distributes tasks among the nodes
      ● Executor
        ○ A process that executes multiple tasks
      [Diagram: a driver coordinating two executors]
  23. Spark Components
      ● Master
        ○ Manages the cluster’s resources
      ● Worker
        ○ Runs the executor processes that execute tasks on a worker node
      ● Standalone deploy mode
      ● Client mode
      [Diagram: a master and a driver, with an executor on each of two workers]
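In this setup (standalone deploy mode with client mode), the driver runs inside the submitting process while the executors run on the workers. An illustrative submission might look like the following; the master address, class, and jar are hypothetical placeholders.

```bash
# Submit an application to a standalone master in client mode.
# The driver runs in the submitting process; executors run on workers.
# com.example.MyJob and my-job.jar are hypothetical placeholders.
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --class com.example.MyJob \
  my-job.jar
```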
  24. Spark Docker
      ● Submit Docker: submits the Spark command
      ● Worker Docker: runs the spark-worker service
      ● Master Docker: runs the spark-master service
  25. Run Locally
      Docker Compose with three services (a sketch follows below):
      ● Submit
      ● Master
      ● Worker
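A hedged sketch of what such a compose file might look like; the real files live in the repository linked on slide 21, and the service commands and job jar below are illustrative assumptions based on Spark's standard standalone scripts.

```yaml
# Illustrative docker-compose sketch: a local standalone Spark cluster.
version: "3"
services:
  spark-master:
    image: metorikku/spark
    command: spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"   # Spark master UI

  spark-worker:
    image: metorikku/spark
    command: spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - spark-master

  spark-submit:
    image: metorikku/spark
    # Hypothetical job jar and class; replace with your application.
    command: spark-submit --master spark://spark-master:7077 --class com.example.MyJob /jobs/my-job.jar
    depends_on:
      - spark-master
      - spark-worker
```

With a file like this, a single `docker-compose up` brings up a complete standalone Spark cluster on a laptop or a CI machine.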
  26. Submit Docker
      Configurable with:
      ● The minimum number of workers
      ● The submit command
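The slides do not show the container's exact interface, so the example below uses made-up environment variable names (MIN_WORKERS, SUBMIT_COMMAND) purely to illustrate the idea; consult the repository for the real ones.

```bash
# Hypothetical example: configure the submit container via environment
# variables. Variable names are illustrative, not the actual interface.
docker run \
  -e MIN_WORKERS=2 \
  -e SUBMIT_COMMAND="spark-submit --class com.example.MyJob /jobs/my-job.jar" \
  metorikku/spark
```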
  27. Run Your Application
      Your own job
  28. We have achieved: full data tests in CI
  29. Simple Deployment
      The Spark Docker image can run on your favorite container orchestration platform (e.g. Docker Swarm)
  30. Orchestration Layer
  31. Nomad: Configuration with Ease
      ● Batch and service (streaming) job types
      ● Dynamic port allocation
      ● Spark UI address derived from the job name
      ● Automatic restart/reschedule
      ● Unified environments: no more bootstrap actions
      (a job sketch follows below)
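As a hedged sketch of how these features map onto a Nomad job specification (syntax as in recent Nomad versions; all names are illustrative):

```hcl
# Illustrative: a streaming Spark job as a long-lived Nomad service.
job "spark-streaming-example" {
  datacenters = ["dc1"]
  type        = "service" # "batch" for finite jobs, "service" for streaming

  group "driver" {
    network {
      port "ui" {} # dynamically allocated port, e.g. for the Spark UI
    }

    restart {
      attempts = 3
      mode     = "delay" # automatically restart failed tasks
    }

    task "submit" {
      driver = "docker"

      config {
        image = "metorikku/spark"
        ports = ["ui"]
      }
    }
  }
}
```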
  32. What About Kubernetes?
      Spark can run on clusters managed by Kubernetes.
      ● Pros:
        ○ Built-in support
        ○ K8s is aware that it is running Spark
      ● Cons:
        ○ We are not using K8s (we work with the HashiCorp stack)
        ○ Nomad is simpler
  33.-39. Have We Met Your Expectations?
      ● Reduce Cost: resource sharing is better optimized (EMR pricing: $3,500/month, $42K/year)
      ● Simple To Use: writing and deploying Spark jobs is easier than ever
      ● Better Testability: setup time for a cluster of 100 nodes is 1-2 minutes; runs locally or in a CI environment for tests
      ● Monitoring: observability “for free” on the existing orchestration platform infrastructure
      ● Cloud Agnostic: can be used in any environment (on-prem or any cloud provider)
      ● Scalability: auto-scaling using orchestration tools (Libra)
  40. Spark - DIY: Write, Test, Run
  41. How We Deploy at Yotpo
  42. Deploy: Jenkins
  43. Allocation: Nomad UI
  44. Services: Consul
  45. Monitoring: Grafana
  46. Logs: ELK
  47. Summary
      ● Our own framework to run Spark
      ● Simple to create, like all other microservices
      ● Enables other teams to deploy Spark jobs with ease
  48. Questions? sbromberg@yotpo.com
