SlideShare a Scribd company logo
1 of 34
Download to read offline
Lessons Learned From
Dockerizing Spark Workloads
Thomas Phelan Nanda Vijaydev
Chief Architect, BlueData Data Scientist, BlueData
@tapbluedata @nandavijaydev
February 8, 2017
Outline
• Docker Containers and Big Data
• Spark on Docker: Challenges
• How We Did It: Lessons Learned
• Key Takeaways
• Q & A
Distributed Spark Environments
• Data scientists want flexibility:
– New tools, latest versions of Spark, Kafka, H2O, et.al.
– Multiple options – e.g. Zeppelin, RStudio, JupyterHub
– Fast, iterative prototyping
• IT wants control:
– Multi-tenancy
– Data security
– Network isolation
Why “Dockerize”?
Infrastructure
• Agility and elasticity
• Standardized environments
(dev, test, prod)
• Portability (on-premises and
public cloud)
• Efficient (higher resource
utilization)
Applications
• Fool-proof packaging (configs,
libraries, driver versions,etc.)
• Repeatable builds and
orchestration
• Faster app dev cycles
• Lightweight(virtually no
performance or startup penalty)
The Journey to Spark on Docker
Start with a clear
goal in sight
Begin with your Docker toolbox
of a single container and basic
networking and storage
So you want to run Spark on Docker in a
multi-tenant enterprise deployment?
Warning: there are some pitfalls & challenges
Spark on Docker: Pitfalls
Traverse the tightrope of
network configurations
Navigate the river of
container managers
• Swarm ?
• Kubernetes ?
• AWS ECS ?
• Mesos ?
• Overlay files ?
• Flocker ?
• Convoy ?
• Docker Networking ?
Calico
• Kubernetes Networking ?
Flannel, Weave Net
Cross the desert of storage
configurations
Spark on Docker: Challenges
Pass thru the jungle of
software configurations
Tame the lion of
performance
Finally you get to the top!
Trip down the staircase of
deployment mistakes
• R ?
• Python ?
• HDFS ?
• NoSQL ?
• On-premises ?
• Public cloud ?
Spark on Docker: Next Steps?
But for an enterprise-ready deployment,
you are still not even close to being done …
You still have to climb past:
high availability,
backup/recovery, security,
multi-host, multi-container,
upgrades and patches …
Spark on Docker: Quagmire
You realize it’s time
to get some help!
How We Did It: Spark on Docker
• Docker containers provide a powerful option for greater agility
and flexibility – whether on-premises or in the public cloud
• Running a complex, multi-service platform such as Spark in
containers in a distributed enterprise-grade environment can
be daunting
• Here is how we did it for the BlueData platform ... while
maintaining performance comparable to bare-metal
How We Did It: Design Decisions
• Run Spark (any version) with related models,
tools / applications, and notebooks unmodified
• Deploy all relevant services that typically run on
a single Spark host in a single Docker container
• Multi-tenancy support is key
– Network and storage security
– User management and access control
How We Did It: Design Decisions
• Docker images built to “auto-configure”
themselves at time of instantiation
– Not all instances of a single image run the same set
of services when instantiated
• Master vs. worker cluster nodes
How We Did It: Design Decisions
• Maintain the promise of Docker containers
– Keep them as stateless as possible
– Container storage is always ephemeral
– Persistent storage is external to the container
How We Did It: CPU & Network
• Resource Utilization
– CPU cores vs. CPU shares
– Over-provisioning of CPU recommended
– No over-provisioning of memory
• Network
– Connect containers across hosts
– Persistence of IP address across
container restart
– Deploy VLANs and VxLAN tunnels for
tenant-level traffic isolation
How We Did It: Network
OVS
Container
Orchestrator
DHCP/DNS
VxLAN	tunnel
NIC
Tenant	Networks
OVS
NIC
Resource	
Manager
Node	
Manager
Node	Manager SparkMaster
SparkWorker
Zeppelin
SparkWorker
How We Did It: Storage
• Storage
– Tweak the default size of a container’s /root
• Resizing of storage inside an existing container is tricky
– Mount logical volume on /data
• No use of overlay file systems
– BlueData DataTap (version-independent, HDFS-compliant)
• Connectivity to external storage
• Image Management
– Utilize Docker’s image repository
TIP: Mounting block
devices into a container
does not support
symbolic links (IOW:
/dev/sdb will not work,
/dm/… PCI device can
change across host
reboot).
TIP: Docker images can
get large. Use “docker
squash” to save on size.
How We Did It: Security
• Security is essential since containers
and host share one kernel
– Non-privileged containers
• Achieved through layered set of
capabilities
• Different capabilities provide different
levels of isolation and protection
How We Did It: Security
• Add “capabilities” to a container based on what
operations are permitted
How We Did It: Sample Dockerfile
# Spark-1. docker image for RHEL/CentOS 6.x
FROM centos:centos6
# Download and extract spark
RUN mkdir /usr/lib/spark; curl -s http://archive.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-
bin-hadoop2.6.tgz | tar xz -C /usr/lib/spark/
## Install zeppelin
RUN mkdir /usr/lib/zeppelin; curl -s https://s3.amazonaws.com/bluedata-
catalog/thirdparty/zeppelin/zeppelin-0.7.0-SNAPSHOT.tar.gz | tar xz -C /usr/lib/zeppelin
ADD configure_spark_services.sh/root/configure_spark_services.sh
RUN chmod -x /root/configure_spark_services.sh&& /root/configure_spark_services.sh
BlueData App Image (.bin file)
Application	bin	file
Docker	
image
CentOS
Dockerfile
RHEL
Dockerfile
appconfig
conf Init.d startscript
<app>	
logo	file
<app>.wb
bdwb	
command
clusterconfig,			image,		role,	
appconfig,	catalog,	service,	..	
Sources
Docker	file,	
logo	.PNG,	
Init.d
RuntimeSoftware Bits
OR
Development
(e.g. extract .bin and modify
to create new bin)
Multi-Host
4 containers
on 2 different hosts
using 1 VLAN and 4
persistent IPs
Services per Container
Master Services
Worker Services
Container Storage from Host
Container Storage Host Storage
Performance Testing: Spark
• Spark 1.x on YARN
• HiBench - Terasort
• Data sizes: 100Gb, 500GB, 1TB
• 10 node physical/virtual cluster
• 36 cores and112GBmemory per node
• 2TB HDFS storage per node (SSDs)
• 800GB ephemeral storage
Spark on Docker: Performance
MB/s
“Dockerized” Spark on BlueData
Spark with Zeppelin
Notebook
5 fully
managed
Docker
containers with
persistent IP
addresses
Pre-Configured Docker Images
Choice of data
science tools and
notebooks
(e.g. JupyterHub,
RStudio, Zeppelin)
and ability to
“bring-your-own”
tools and versions
On-Demand Elastic Infrastructure
Create “pristine” Docker
containers with appropriate
resources
Log in to container
and install your
tools & libraries
Spark on Docker: Key Takeaways
• All apps can be “Dockerized”, including Spark
– Docker containers enable a more flexible and agile
deployment model
– Faster app dev cycles for Spark app developers,
data scientists, & engineers
– Enables DevOps for data science teams
Spark on Docker: Key Takeaways
• Deployment requirements:
– Docker base images include all needed Spark
libraries and jar files
– Container orchestration, including networking and
storage
– Resource-aware runtime environment, including
CPU and RAM
Spark on Docker: Key Takeaways
• Data scientist considerations:
– Access to data with full fidelity
– Access to data processing and modeling tools
– Ability to run, rerun, and scale analysis
– Ability to compare and contrast various techniques
– Ability to deploy & integrate enterprise-ready solution
Spark on Docker: Key Takeaways
• Enterprise deployment challenges:
– Access to container secured with ssh keypair or
PAM module (LDAP/AD)
– Fast access to external storage
– Management agents in Docker images
– Runtime injection of resource and configuration
information
Spark on Docker: Key Takeaways
• “Do It Yourself” will be costly & time-consuming
– Be prepared to tackle the infrastructure challenges and
pitfalls to Dockerize Spark
– Your business value will come from data science and
applications – not the plumbing
• There are other options:
– BlueData = a turnkey solution,
for on-premises or on AWS
Thank You
Contact Info:
@tapbluedata
@nandavijaydev
tap@bluedata.com
nanda@bluedata.com
www.bluedata.com
Visit BlueData’s booth in the expo hall

More Related Content

What's hot

Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
Apache ignite Datagrid
Apache ignite DatagridApache ignite Datagrid
Apache ignite DatagridSurinder Mehra
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empowerDurga Gadiraju
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)Durga Gadiraju
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...Spark Summit
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...Yahoo Developer Network
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
Big Data Certifications Workshop - 201711 - Introduction and Database Essentials
Big Data Certifications Workshop - 201711 - Introduction and Database EssentialsBig Data Certifications Workshop - 201711 - Introduction and Database Essentials
Big Data Certifications Workshop - 201711 - Introduction and Database EssentialsDurga Gadiraju
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACIDOwen O'Malley
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsBig Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsDurga Gadiraju
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Alluxio, Inc.
 

What's hot (20)

Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Apache ignite Datagrid
Apache ignite DatagridApache ignite Datagrid
Apache ignite Datagrid
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)HDPCD Spark using Python (pyspark)
HDPCD Spark using Python (pyspark)
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Big Data Certifications Workshop - 201711 - Introduction and Database Essentials
Big Data Certifications Workshop - 201711 - Introduction and Database EssentialsBig Data Certifications Workshop - 201711 - Introduction and Database Essentials
Big Data Certifications Workshop - 201711 - Introduction and Database Essentials
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACID
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsBig Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 

Viewers also liked

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Spark Summit
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Spark Summit
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 

Viewers also liked (9)

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 

Similar to Lessons Learned From Dockerizing Spark Workloads

Lessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersLessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersBlueData, Inc.
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerSpark Summit
 
Continuous Integration with Docker on AWS
Continuous Integration with Docker on AWSContinuous Integration with Docker on AWS
Continuous Integration with Docker on AWSAndrew Heifetz
 
Docker in the Enterprise
Docker in the EnterpriseDocker in the Enterprise
Docker in the EnterpriseSaul Caganoff
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
 
Using Docker in production: Get started today!
Using Docker in production: Get started today!Using Docker in production: Get started today!
Using Docker in production: Get started today!Clarence Bakirtzidis
 
Containers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aciContainers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aciRajesh Kolla
 
Dockers and kubernetes
Dockers and kubernetesDockers and kubernetes
Dockers and kubernetesDr Ganesh Iyer
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBlueData, Inc.
 
Intro Docker october 2013
Intro Docker october 2013Intro Docker october 2013
Intro Docker october 2013dotCloud
 
Demystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDemystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDr Ganesh Iyer
 
Oracle database on Docker Container
Oracle database on Docker ContainerOracle database on Docker Container
Oracle database on Docker ContainerJesus Guzman
 
Undine: Turnkey Drupal Development Environments
Undine: Turnkey Drupal Development EnvironmentsUndine: Turnkey Drupal Development Environments
Undine: Turnkey Drupal Development EnvironmentsDavid Watson
 

Similar to Lessons Learned From Dockerizing Spark Workloads (20)

Lessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersLessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker Containers
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Deploying Big-Data-as-a-Service (BDaaS) in the EnterpriseDeploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
 
Continuous Integration with Docker on AWS
Continuous Integration with Docker on AWSContinuous Integration with Docker on AWS
Continuous Integration with Docker on AWS
 
Docker in the Enterprise
Docker in the EnterpriseDocker in the Enterprise
Docker in the Enterprise
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
Using Docker in production: Get started today!
Using Docker in production: Get started today!Using Docker in production: Get started today!
Using Docker in production: Get started today!
 
Webinar Docker Tri Series
Webinar Docker Tri SeriesWebinar Docker Tri Series
Webinar Docker Tri Series
 
Containers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aciContainers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aci
 
Why to docker
Why to dockerWhy to docker
Why to docker
 
Dockers and kubernetes
Dockers and kubernetesDockers and kubernetes
Dockers and kubernetes
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker Containers
 
Webinar : Docker in Production
Webinar : Docker in ProductionWebinar : Docker in Production
Webinar : Docker in Production
 
Docker
DockerDocker
Docker
 
Docker handons-workshop-for-charity
Docker handons-workshop-for-charityDocker handons-workshop-for-charity
Docker handons-workshop-for-charity
 
Intro Docker october 2013
Intro Docker october 2013Intro Docker october 2013
Intro Docker october 2013
 
Demystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDemystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data Scientists
 
Oracle database on Docker Container
Oracle database on Docker ContainerOracle database on Docker Container
Oracle database on Docker Container
 
Undine: Turnkey Drupal Development Environments
Undine: Turnkey Drupal Development EnvironmentsUndine: Turnkey Drupal Development Environments
Undine: Turnkey Drupal Development Environments
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 

Recently uploaded (20)

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 

Lessons Learned From Dockerizing Spark Workloads

  • 1. Lessons Learned From Dockerizing Spark Workloads Thomas Phelan Nanda Vijaydev Chief Architect, BlueData Data Scientist, BlueData @tapbluedata @nandavijaydev February 8, 2017
  • 2. Outline • Docker Containers and Big Data • Spark on Docker: Challenges • How We Did It: Lessons Learned • Key Takeaways • Q & A
  • 3. Distributed Spark Environments • Data scientists want flexibility: – New tools, latest versions of Spark, Kafka, H2O, et.al. – Multiple options – e.g. Zeppelin, RStudio, JupyterHub – Fast, iterative prototyping • IT wants control: – Multi-tenancy – Data security – Network isolation
  • 4. Why “Dockerize”? Infrastructure • Agility and elasticity • Standardized environments (dev, test, prod) • Portability (on-premises and public cloud) • Efficient (higher resource utilization) Applications • Fool-proof packaging (configs, libraries, driver versions,etc.) • Repeatable builds and orchestration • Faster app dev cycles • Lightweight(virtually no performance or startup penalty)
  • 5. The Journey to Spark on Docker Start with a clear goal in sight Begin with your Docker toolbox of a single container and basic networking and storage So you want to run Spark on Docker in a multi-tenant enterprise deployment? Warning: there are some pitfalls & challenges
  • 6. Spark on Docker: Pitfalls Traverse the tightrope of network configurations Navigate the river of container managers • Swarm ? • Kubernetes ? • AWS ECS ? • Mesos ? • Overlay files ? • Flocker ? • Convoy ? • Docker Networking ? Calico • Kubernetes Networking ? Flannel, Weave Net Cross the desert of storage configurations
  • 7. Spark on Docker: Challenges Pass thru the jungle of software configurations Tame the lion of performance Finally you get to the top! Trip down the staircase of deployment mistakes • R ? • Python ? • HDFS ? • NoSQL ? • On-premises ? • Public cloud ?
  • 8. Spark on Docker: Next Steps? But for an enterprise-ready deployment, you are still not even close to being done … You still have to climb past: high availability, backup/recovery, security, multi-host, multi-container, upgrades and patches …
  • 9. Spark on Docker: Quagmire You realize it’s time to get some help!
  • 10. How We Did It: Spark on Docker • Docker containers provide a powerful option for greater agility and flexibility – whether on-premises or in the public cloud • Running a complex, multi-service platform such as Spark in containers in a distributed enterprise-grade environment can be daunting • Here is how we did it for the BlueData platform ... while maintaining performance comparable to bare-metal
  • 11. How We Did It: Design Decisions • Run Spark (any version) with related models, tools / applications, and notebooks unmodified • Deploy all relevant services that typically run on a single Spark host in a single Docker container • Multi-tenancy support is key – Network and storage security – User management and access control
  • 12. How We Did It: Design Decisions • Docker images built to “auto-configure” themselves at time of instantiation – Not all instances of a single image run the same set of services when instantiated • Master vs. worker cluster nodes
  • 13. How We Did It: Design Decisions • Maintain the promise of Docker containers – Keep them as stateless as possible – Container storage is always ephemeral – Persistent storage is external to the container
  • 14. How We Did It: CPU & Network • Resource Utilization – CPU cores vs. CPU shares – Over-provisioning of CPU recommended – No over-provisioning of memory • Network – Connect containers across hosts – Persistence of IP address across container restart – Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation
  • 15. How We Did It: Network OVS Container Orchestrator DHCP/DNS VxLAN tunnel NIC Tenant Networks OVS NIC Resource Manager Node Manager Node Manager SparkMaster SparkWorker Zeppelin SparkWorker
  • 16. How We Did It: Storage • Storage – Tweak the default size of a container’s /root • Resizing of storage inside an existing container is tricky – Mount logical volume on /data • No use of overlay file systems – BlueData DataTap (version-independent, HDFS-compliant) • Connectivity to external storage • Image Management – Utilize Docker’s image repository TIP: Mounting block devices into a container does not support symbolic links (IOW: /dev/sdb will not work, /dm/… PCI device can change across host reboot). TIP: Docker images can get large. Use “docker squash” to save on size.
  • 17. How We Did It: Security • Security is essential since containers and host share one kernel – Non-privileged containers • Achieved through layered set of capabilities • Different capabilities provide different levels of isolation and protection
  • 18. How We Did It: Security • Add “capabilities” to a container based on what operations are permitted
  • 19. How We Did It: Sample Dockerfile # Spark-1. docker image for RHEL/CentOS 6.x FROM centos:centos6 # Download and extract spark RUN mkdir /usr/lib/spark; curl -s http://archive.apache.org/dist/spark/spark-2.0.1/spark-2.0.1- bin-hadoop2.6.tgz | tar xz -C /usr/lib/spark/ ## Install zeppelin RUN mkdir /usr/lib/zeppelin; curl -s https://s3.amazonaws.com/bluedata- catalog/thirdparty/zeppelin/zeppelin-0.7.0-SNAPSHOT.tar.gz | tar xz -C /usr/lib/zeppelin ADD configure_spark_services.sh/root/configure_spark_services.sh RUN chmod -x /root/configure_spark_services.sh&& /root/configure_spark_services.sh
  • 20. BlueData App Image (.bin file) Application bin file Docker image CentOS Dockerfile RHEL Dockerfile appconfig conf Init.d startscript <app> logo file <app>.wb bdwb command clusterconfig, image, role, appconfig, catalog, service, .. Sources Docker file, logo .PNG, Init.d RuntimeSoftware Bits OR Development (e.g. extract .bin and modify to create new bin)
  • 21. Multi-Host 4 containers on 2 different hosts using 1 VLAN and 4 persistent IPs
  • 22. Services per Container Master Services Worker Services
  • 23. Container Storage from Host Container Storage Host Storage
  • 24. Performance Testing: Spark • Spark 1.x on YARN • HiBench - Terasort • Data sizes: 100Gb, 500GB, 1TB • 10 node physical/virtual cluster • 36 cores and112GBmemory per node • 2TB HDFS storage per node (SSDs) • 800GB ephemeral storage
  • 25. Spark on Docker: Performance MB/s
  • 26. “Dockerized” Spark on BlueData Spark with Zeppelin Notebook 5 fully managed Docker containers with persistent IP addresses
  • 27. Pre-Configured Docker Images Choice of data science tools and notebooks (e.g. JupyterHub, RStudio, Zeppelin) and ability to “bring-your-own” tools and versions
  • 28. On-Demand Elastic Infrastructure Create “pristine” Docker containers with appropriate resources Log in to container and install your tools & libraries
  • 29. Spark on Docker: Key Takeaways • All apps can be “Dockerized”, including Spark – Docker containers enable a more flexible and agile deployment model – Faster app dev cycles for Spark app developers, data scientists, & engineers – Enables DevOps for data science teams
  • 30. Spark on Docker: Key Takeaways • Deployment requirements: – Docker base images include all needed Spark libraries and jar files – Container orchestration, including networking and storage – Resource-aware runtime environment, including CPU and RAM
  • 31. Spark on Docker: Key Takeaways • Data scientist considerations: – Access to data with full fidelity – Access to data processing and modeling tools – Ability to run, rerun, and scale analysis – Ability to compare and contrast various techniques – Ability to deploy & integrate enterprise-ready solution
  • 32. Spark on Docker: Key Takeaways • Enterprise deployment challenges: – Access to container secured with ssh keypair or PAM module (LDAP/AD) – Fast access to external storage – Management agents in Docker images – Runtime injection of resource and configuration information
  • 33. Spark on Docker: Key Takeaways • “Do It Yourself” will be costly & time-consuming – Be prepared to tackle the infrastructure challenges and pitfalls to Dockerize Spark – Your business value will come from data science and applications – not the plumbing • There are other options: – BlueData = a turnkey solution, for on-premises or on AWS