SlideShare a Scribd company logo
1 of 34
Lessons Learned
Running Hadoop and
Spark in Docker
Thomas Phelan
Chief Architect, BlueData
@tapbluedata
September 29, 2016
Outline
• Docker Containers and Big Data
• Hadoop and Spark on Docker: Challenges
• How We Did It: Lessons Learned
• Key Takeaways
• Q & A
A Better Way to Deploy Big Data
Traditional Approach
IT
ManufacturingSalesR&DServices
< 30%
utilization
Weeks to
build
each
cluster
Duplication
of dataManagement
complexity
Painful,
complex
upgrades
Hadoop and Spark on Docker
ManufacturingSalesR&DServices
BI/Analytics
Tools
> 90%
utilization
A New Approach
No duplication
of data
Simplified
management
Multi-tenant
Self-service,
on-demand
clusters
Simple,
instant
upgrades
Deploying Multiple Big Data Clusters
Data scientists want flexibility:
• Different versions of Hadoop, Spark, et.al.
• Different sets of tools
IT wants control:
• Multi-tenancy
- Data security
- Network isolation
Containers = the Future of Big Data
Infrastructure
• Agility and elasticity
• Standardized environments
(dev, test, prod)
• Portability
(on-premises and cloud)
• Higher resource utilization
Applications
• Fool-proof packaging
(configs, libraries, driver
versions, etc.)
• Repeatable builds and
orchestration
• Faster app dev cycles
The Journey to Big Data on Docker
Start with a clear goal
in sight
Begin with your Docker
toolbox of a single
container and basic
networking and storage
So you want to run Hadoop and Spark on Docker
in a multi-tenant enterprise deployment?
Beware … there is trouble ahead
Traverse the tightrope of
network configurations
Navigate the river of
container managers
• Swarm ?
• Kubernetes ?
• AWS ECS ?
• Mesos ?
• Overlay files ?
• Flocker ?
• Convoy ?
• Docker Networking ?
Calico
• Kubernetes Networking ?
Flannel, Weave Net
Big Data on Docker: Pitfalls
Cross the desert of storage
configurations
Big Data on Docker: Challenges
Pass thru the jungle of
software compatibility
Tame the lion of
performance
Finally you get to the top!
Trip down the staircase of
deployment mistakes
But for deployment in the enterprise, you are
not even close to being done …
Big Data on Docker: Next Steps?
You still have to climb past:
high availability,
backup/recovery, security,
multi-host, multi-container,
upgrades and patches
You realize it’s time
to get some help!
Big Data on Docker: Quagmire
How We Did It: Design Decisions
• Run Hadoop/Spark distros and applications
unmodified
- Deploy all services that typically run on a single BM
host in a single container
• Multi-tenancy support is key
- Network and storage security
How We Did It: Design Decisions
• Images built to “auto-configure” themselves at
time of instantiation
- Not all instances of a single image run the same set of
services when instantiated
• Master vs. worker cluster nodes
How We Did It: Design Decisions
• Maintain the promise of containers
- Keep them as stateless as possible
- Container storage is always ephemeral
- Persistent storage is external to the container
How We Did It: Implementation
Resource Utilization
• CPU cores vs. CPU shares
• Over-provisioning of CPU recommended
• No over-provisioning of memory
Network
• Connect containers across hosts
• Persistence of IP address across container restart
• Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation
Noisy neighbors
How We Did It: Network Architecture
OVS
Container
Orchestrator
DHCP/DNS
VxLAN tunnel
NIC
Tenant Networks
OVS
NIC
Resource
Manager
Node
Manager
Node Manager SparkMaster
SparkWorker
Zeppelin
How We Did It: Implementation
Storage
• Tweak the default size of a container’s /root
- Resizing of storage inside an existing container is tricky
• Mount logical volume on /data
- No use of overlay file systems
• DataTap (version-independent, HDFS-compliant)
connectivity to external storage
Image Management
• Utilize Docker’s image repository
TIP: Mounting block
devices into a container
does not support
symbolic links (IOW:
/dev/sdb will not work,
/dm/… PCI device can
change across host
reboot).
TIP: Docker images can
get large. Use “docker
squash” to save on size.
How We Did It: Security Considerations
• Security is essential since containers and host share one kernel
- Non-privileged containers
• Achieved through layered set of capabilities
• Different capabilities provide different levels of isolation and protection
• Add “capabilities” to a container based on what operations are permitted
How We Did It: Sample Dockerfile
# Spark-1.5.2 docker image for RHEL/CentOS 6.x
FROM centos:centos6
# Download and extract spark
RUN mkdir /usr/lib/spark; curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.4.tgz | tar -xz -C /usr/lib/spark/
# Download and extract scala
RUN mkdir /usr/lib/scala; curl -s http://www.scala-lang.org/files/archive/scala-2.10.3.tgz | tar xz -C /usr/lib/scala/
# Install zeppelin
RUN mkdir /usr/lib/zeppelin; curl -s http://10.10.10.10:8080/build/thirdparty/zeppelin/zeppelin-0.6.0-incubating-SNAPSHOT-v2.tar.gz|tar xz -C
/usr/lib/zeppelin
RUN yum clean all && rm -rf /tmp/* /var/tmp/* /var/cache/yum/*
ADD configure_spark_services.sh /root/configure_spark_services.sh
RUN chmod -x /root/configure_spark_services.sh && /root/configure_spark_services.sh
BlueData Application Image (.bin file)
Application bin file
Docker
image
CentOS
Dockerfile
RHEL
Dockerfile
appconfig
conf Init.d startscript
<app>
logo file
<app>.wb
bdwb
command
clusterconfig, image, role,
appconfig, catalog, service, ..
Sources
Docker file,
logo .PNG,
Init.d
RuntimeSoftware Bits
OR
Development
(e.g. extract .bin and modify to
create new bin)
Multi-Host
4 containers
on 2 different hosts
using 1 VLAN and 4
persistent IPs
Different Services in Each Container
Master Services
Worker Services
Container Storage from Host
Container Storage Host Storage
Performance Testing: Spark
• Spark 1.x on YARN
• HiBench - Terasort
- Data sizes: 100Gb, 500GB, 1TB
• 10 node physical/virtual cluster
• 36 cores and112GB memory per node
• 2TB HDFS storage per node (SSDs)
• 800GB ephemeral storage
Spark on Docker: Performance
MB/s
Performance Testing: Hadoop
Hadoop on Docker: Performance
Containers (BlueData EPIC with 1 virtual node per host)
Bare-Metal
4 Docker
containers
Multiple
Hadoop
versions
Different Hadoop
versions on
same set of
physical hosts
“Dockerized” Hadoop Clusters
5 fully managed Docker
containers with persistent
IP addresses
“Dockerized” Spark Standalone
Spark with Zeppelin
Notebook
Big Data on Docker: Key Takeaways
• All apps can be “Dockerized”,
including Hadoop & Spark
- Traditional bare-metal approach to Big
Data is rigid and inflexible
- Containers (e.g. Docker) provide a more
flexible & agile model
- Faster app dev cycles for Big Data app
developers, data scientists, & engineers
Big Data on Docker: Key Takeaways
• There are unique Big Data pitfalls & challenges with Docker
- For enterprise deployments, you will need to overcome these and more:
 Docker base images include Big Data libraries and jar files
 Container orchestration, including networking and storage
 Resource-aware runtime environment, including CPU and RAM
Big Data on Docker: Key Takeaways
• There are unique Big Data pitfalls & challenges with Docker
- More:
 Access to Container secured with ssh keypair or PAM module
(LDAP/AD)
 Fast access to external storage
 Management agents in Docker images
 Runtime injection of resource and configuration
information
Big Data on Docker: Key Takeaways
• “Do It Yourself” will be costly and time-consuming
- Be prepared to tackle the infrastructure & plumbing challenges
- In the enterprise, the business value is in the applications / data science
• There are other options …
- Public Cloud – AWS
- BlueData - turnkey solution
Thank You
Contact info:
@tapbluedata
tap@bluedata.com
www.bluedata.com
www.bluedata.com/free
FREE LITE VERSION

More Related Content

What's hot

Airflow를 이용한 데이터 Workflow 관리
Airflow를 이용한  데이터 Workflow 관리Airflow를 이용한  데이터 Workflow 관리
Airflow를 이용한 데이터 Workflow 관리YoungHeon (Roy) Kim
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
Windows IOCP vs Linux EPOLL Performance Comparison
Windows IOCP vs Linux EPOLL Performance ComparisonWindows IOCP vs Linux EPOLL Performance Comparison
Windows IOCP vs Linux EPOLL Performance ComparisonSeungmo Koo
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...Altinity Ltd
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsBryan Bende
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfAnya Bida
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
[아이펀팩토리] 2018 데브데이 서버위더스 _04 리눅스 게임 서버 성능 분석
[아이펀팩토리] 2018 데브데이 서버위더스 _04 리눅스 게임 서버 성능 분석[아이펀팩토리] 2018 데브데이 서버위더스 _04 리눅스 게임 서버 성능 분석
[아이펀팩토리] 2018 데브데이 서버위더스 _04 리눅스 게임 서버 성능 분석iFunFactory Inc.
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...CloudxLab
 
Apache Helix presentation at ApacheCon 2013
Apache Helix presentation at ApacheCon 2013Apache Helix presentation at ApacheCon 2013
Apache Helix presentation at ApacheCon 2013Kishore Gopalakrishna
 
Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -Treasure Data, Inc.
 
Windows Registered I/O (RIO) vs IOCP
Windows Registered I/O (RIO) vs IOCPWindows Registered I/O (RIO) vs IOCP
Windows Registered I/O (RIO) vs IOCPSeungmo Koo
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks
 
隠れたデータベースの遅延原因を特定し、そのレスポンスの改善手法紹介 @ dbtech showcase Tokyo 2019
隠れたデータベースの遅延原因を特定し、そのレスポンスの改善手法紹介 @ dbtech showcase Tokyo 2019隠れたデータベースの遅延原因を特定し、そのレスポンスの改善手法紹介 @ dbtech showcase Tokyo 2019
隠れたデータベースの遅延原因を特定し、そのレスポンスの改善手法紹介 @ dbtech showcase Tokyo 2019株式会社クライム
 

What's hot (20)

Airflow를 이용한 데이터 Workflow 관리
Airflow를 이용한  데이터 Workflow 관리Airflow를 이용한  데이터 Workflow 관리
Airflow를 이용한 데이터 Workflow 관리
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
Windows IOCP vs Linux EPOLL Performance Comparison
Windows IOCP vs Linux EPOLL Performance ComparisonWindows IOCP vs Linux EPOLL Performance Comparison
Windows IOCP vs Linux EPOLL Performance Comparison
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
[아이펀팩토리] 2018 데브데이 서버위더스 _04 리눅스 게임 서버 성능 분석
[아이펀팩토리] 2018 데브데이 서버위더스 _04 리눅스 게임 서버 성능 분석[아이펀팩토리] 2018 데브데이 서버위더스 _04 리눅스 게임 서버 성능 분석
[아이펀팩토리] 2018 데브데이 서버위더스 _04 리눅스 게임 서버 성능 분석
 
Power of the AWR Warehouse
Power of the AWR WarehousePower of the AWR Warehouse
Power of the AWR Warehouse
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Apache Helix presentation at ApacheCon 2013
Apache Helix presentation at ApacheCon 2013Apache Helix presentation at ApacheCon 2013
Apache Helix presentation at ApacheCon 2013
 
Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -
 
Windows Registered I/O (RIO) vs IOCP
Windows Registered I/O (RIO) vs IOCPWindows Registered I/O (RIO) vs IOCP
Windows Registered I/O (RIO) vs IOCP
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
隠れたデータベースの遅延原因を特定し、そのレスポンスの改善手法紹介 @ dbtech showcase Tokyo 2019
隠れたデータベースの遅延原因を特定し、そのレスポンスの改善手法紹介 @ dbtech showcase Tokyo 2019隠れたデータベースの遅延原因を特定し、そのレスポンスの改善手法紹介 @ dbtech showcase Tokyo 2019
隠れたデータベースの遅延原因を特定し、そのレスポンスの改善手法紹介 @ dbtech showcase Tokyo 2019
 

Viewers also liked

Hadoop on Docker
Hadoop on DockerHadoop on Docker
Hadoop on DockerRakesh Saha
 
Managing Docker Containers In A Cluster - Introducing Kubernetes
Managing Docker Containers In A Cluster - Introducing KubernetesManaging Docker Containers In A Cluster - Introducing Kubernetes
Managing Docker Containers In A Cluster - Introducing KubernetesMarc Sluiter
 
Hadoop Cluster on Docker Containers
Hadoop Cluster on Docker ContainersHadoop Cluster on Docker Containers
Hadoop Cluster on Docker Containerspranav_joshi
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Jeffrey Breen
 
Docker Swarm Cluster
Docker Swarm ClusterDocker Swarm Cluster
Docker Swarm ClusterFernando Ike
 
Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2benjaminwootton
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 
Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Janos Matyas
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...DataWorks Summit/Hadoop Summit
 

Viewers also liked (11)

Hadoop on Docker
Hadoop on DockerHadoop on Docker
Hadoop on Docker
 
Managing Docker Containers In A Cluster - Introducing Kubernetes
Managing Docker Containers In A Cluster - Introducing KubernetesManaging Docker Containers In A Cluster - Introducing Kubernetes
Managing Docker Containers In A Cluster - Introducing Kubernetes
 
Hadoop Cluster on Docker Containers
Hadoop Cluster on Docker ContainersHadoop Cluster on Docker Containers
Hadoop Cluster on Docker Containers
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
 
Docker Swarm Cluster
Docker Swarm ClusterDocker Swarm Cluster
Docker Swarm Cluster
 
Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2Configuring Your First Hadoop Cluster On EC2
Configuring Your First Hadoop Cluster On EC2
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014 Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014
 
Simplified Cluster Operation & Troubleshooting
Simplified Cluster Operation & TroubleshootingSimplified Cluster Operation & Troubleshooting
Simplified Cluster Operation & Troubleshooting
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
 

Similar to Lessons Learned Running Hadoop and Spark in Docker Containers

Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Spark Summit
 
Lessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark WorkloadsLessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark WorkloadsBlueData, Inc.
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 
Using Docker in production: Get started today!
Using Docker in production: Get started today!Using Docker in production: Get started today!
Using Docker in production: Get started today!Clarence Bakirtzidis
 
Continuous Integration with Docker on AWS
Continuous Integration with Docker on AWSContinuous Integration with Docker on AWS
Continuous Integration with Docker on AWSAndrew Heifetz
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerSpark Summit
 
Intro Docker october 2013
Intro Docker october 2013Intro Docker october 2013
Intro Docker october 2013dotCloud
 
Dockers and kubernetes
Dockers and kubernetesDockers and kubernetes
Dockers and kubernetesDr Ganesh Iyer
 
Docker in the Enterprise
Docker in the EnterpriseDocker in the Enterprise
Docker in the EnterpriseSaul Caganoff
 
Lightweight Virtualization Docker in Practice
Lightweight Virtualization Docker in PracticeLightweight Virtualization Docker in Practice
Lightweight Virtualization Docker in PracticeDocker, Inc.
 
SQL Server in DevOps Town Hall Webinar
SQL Server in DevOps Town Hall WebinarSQL Server in DevOps Town Hall Webinar
SQL Server in DevOps Town Hall WebinarTravis Wright
 
Intro to Docker October 2013
Intro to Docker October 2013Intro to Docker October 2013
Intro to Docker October 2013Docker, Inc.
 
Containers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aciContainers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aciRajesh Kolla
 
Docker Containers Deep Dive
Docker Containers Deep DiveDocker Containers Deep Dive
Docker Containers Deep DiveWill Kinard
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBlueData, Inc.
 
Demystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDemystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDr Ganesh Iyer
 

Similar to Lessons Learned Running Hadoop and Spark in Docker Containers (20)

Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
 
Lessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark WorkloadsLessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark Workloads
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Deploying Big-Data-as-a-Service (BDaaS) in the EnterpriseDeploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
 
Using Docker in production: Get started today!
Using Docker in production: Get started today!Using Docker in production: Get started today!
Using Docker in production: Get started today!
 
Continuous Integration with Docker on AWS
Continuous Integration with Docker on AWSContinuous Integration with Docker on AWS
Continuous Integration with Docker on AWS
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
 
Webinar Docker Tri Series
Webinar Docker Tri SeriesWebinar Docker Tri Series
Webinar Docker Tri Series
 
Webinar : Docker in Production
Webinar : Docker in ProductionWebinar : Docker in Production
Webinar : Docker in Production
 
Intro Docker october 2013
Intro Docker october 2013Intro Docker october 2013
Intro Docker october 2013
 
Dockers and kubernetes
Dockers and kubernetesDockers and kubernetes
Dockers and kubernetes
 
Docker in the Enterprise
Docker in the EnterpriseDocker in the Enterprise
Docker in the Enterprise
 
Lightweight Virtualization Docker in Practice
Lightweight Virtualization Docker in PracticeLightweight Virtualization Docker in Practice
Lightweight Virtualization Docker in Practice
 
SQL Server in DevOps Town Hall Webinar
SQL Server in DevOps Town Hall WebinarSQL Server in DevOps Town Hall Webinar
SQL Server in DevOps Town Hall Webinar
 
Intro to Docker October 2013
Intro to Docker October 2013Intro to Docker October 2013
Intro to Docker October 2013
 
Containers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aciContainers docker-docker hub-azureacr-azure aci
Containers docker-docker hub-azureacr-azure aci
 
Docker Containers Deep Dive
Docker Containers Deep DiveDocker Containers Deep Dive
Docker Containers Deep Dive
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker Containers
 
Docker handons-workshop-for-charity
Docker handons-workshop-for-charityDocker handons-workshop-for-charity
Docker handons-workshop-for-charity
 
Demystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDemystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data Scientists
 

More from BlueData, Inc.

Introduction to KubeDirector - SF Kubernetes Meetup
Introduction to KubeDirector - SF Kubernetes MeetupIntroduction to KubeDirector - SF Kubernetes Meetup
Introduction to KubeDirector - SF Kubernetes MeetupBlueData, Inc.
 
Dell EMC Ready Solutions for Big Data
Dell EMC Ready Solutions for Big DataDell EMC Ready Solutions for Big Data
Dell EMC Ready Solutions for Big DataBlueData, Inc.
 
BlueData and Hortonworks Data Platform (HDP)
BlueData and Hortonworks Data Platform (HDP)BlueData and Hortonworks Data Platform (HDP)
BlueData and Hortonworks Data Platform (HDP)BlueData, Inc.
 
How to Protect Big Data in a Containerized Environment
How to Protect Big Data in a Containerized EnvironmentHow to Protect Big Data in a Containerized Environment
How to Protect Big Data in a Containerized EnvironmentBlueData, Inc.
 
BlueData EPIC datasheet (en Français)
BlueData EPIC datasheet (en Français)BlueData EPIC datasheet (en Français)
BlueData EPIC datasheet (en Français)BlueData, Inc.
 
Bare-metal performance for Big Data workloads on Docker containers
Bare-metal performance for Big Data workloads on Docker containersBare-metal performance for Big Data workloads on Docker containers
Bare-metal performance for Big Data workloads on Docker containersBlueData, Inc.
 
BlueData EPIC on AWS - Spec Sheet
BlueData EPIC on AWS - Spec SheetBlueData EPIC on AWS - Spec Sheet
BlueData EPIC on AWS - Spec SheetBlueData, Inc.
 
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceThe Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceBlueData, Inc.
 
Solution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorSolution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorBlueData, Inc.
 
Hadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperHadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperBlueData, Inc.
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorBlueData, Inc.
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentBlueData, Inc.
 
BlueData EPIC 2.0 Overview
BlueData EPIC 2.0 OverviewBlueData EPIC 2.0 Overview
BlueData EPIC 2.0 OverviewBlueData, Inc.
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBlueData, Inc.
 
BlueData Hunk Integration: Splunk Analytics for Hadoop
BlueData Hunk Integration: Splunk Analytics for HadoopBlueData Hunk Integration: Splunk Analytics for Hadoop
BlueData Hunk Integration: Splunk Analytics for HadoopBlueData, Inc.
 
Spark Infrastructure Made Easy
Spark Infrastructure Made EasySpark Infrastructure Made Easy
Spark Infrastructure Made EasyBlueData, Inc.
 
BlueData Integration with Cloudera Manager
BlueData Integration with Cloudera ManagerBlueData Integration with Cloudera Manager
BlueData Integration with Cloudera ManagerBlueData, Inc.
 

More from BlueData, Inc. (17)

Introduction to KubeDirector - SF Kubernetes Meetup
Introduction to KubeDirector - SF Kubernetes MeetupIntroduction to KubeDirector - SF Kubernetes Meetup
Introduction to KubeDirector - SF Kubernetes Meetup
 
Dell EMC Ready Solutions for Big Data
Dell EMC Ready Solutions for Big DataDell EMC Ready Solutions for Big Data
Dell EMC Ready Solutions for Big Data
 
BlueData and Hortonworks Data Platform (HDP)
BlueData and Hortonworks Data Platform (HDP)BlueData and Hortonworks Data Platform (HDP)
BlueData and Hortonworks Data Platform (HDP)
 
How to Protect Big Data in a Containerized Environment
How to Protect Big Data in a Containerized EnvironmentHow to Protect Big Data in a Containerized Environment
How to Protect Big Data in a Containerized Environment
 
BlueData EPIC datasheet (en Français)
BlueData EPIC datasheet (en Français)BlueData EPIC datasheet (en Français)
BlueData EPIC datasheet (en Français)
 
Bare-metal performance for Big Data workloads on Docker containers
Bare-metal performance for Big Data workloads on Docker containersBare-metal performance for Big Data workloads on Docker containers
Bare-metal performance for Big Data workloads on Docker containers
 
BlueData EPIC on AWS - Spec Sheet
BlueData EPIC on AWS - Spec SheetBlueData EPIC on AWS - Spec Sheet
BlueData EPIC on AWS - Spec Sheet
 
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceThe Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-Service
 
Solution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorSolution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline Accelerator
 
Hadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperHadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White Paper
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
BlueData EPIC 2.0 Overview
BlueData EPIC 2.0 OverviewBlueData EPIC 2.0 Overview
BlueData EPIC 2.0 Overview
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 Telco
 
BlueData Hunk Integration: Splunk Analytics for Hadoop
BlueData Hunk Integration: Splunk Analytics for HadoopBlueData Hunk Integration: Splunk Analytics for Hadoop
BlueData Hunk Integration: Splunk Analytics for Hadoop
 
Spark Infrastructure Made Easy
Spark Infrastructure Made EasySpark Infrastructure Made Easy
Spark Infrastructure Made Easy
 
BlueData Integration with Cloudera Manager
BlueData Integration with Cloudera ManagerBlueData Integration with Cloudera Manager
BlueData Integration with Cloudera Manager
 

Recently uploaded

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 

Lessons Learned Running Hadoop and Spark in Docker Containers

  • 1. Lessons Learned Running Hadoop and Spark in Docker Thomas Phelan Chief Architect, BlueData @tapbluedata September 29, 2016
  • 2. Outline • Docker Containers and Big Data • Hadoop and Spark on Docker: Challenges • How We Did It: Lessons Learned • Key Takeaways • Q & A
  • 3. A Better Way to Deploy Big Data Traditional Approach IT ManufacturingSalesR&DServices < 30% utilization Weeks to build each cluster Duplication of dataManagement complexity Painful, complex upgrades Hadoop and Spark on Docker ManufacturingSalesR&DServices BI/Analytics Tools > 90% utilization A New Approach No duplication of data Simplified management Multi-tenant Self-service, on-demand clusters Simple, instant upgrades
  • 4. Deploying Multiple Big Data Clusters Data scientists want flexibility: • Different versions of Hadoop, Spark, et.al. • Different sets of tools IT wants control: • Multi-tenancy - Data security - Network isolation
  • 5. Containers = the Future of Big Data Infrastructure • Agility and elasticity • Standardized environments (dev, test, prod) • Portability (on-premises and cloud) • Higher resource utilization Applications • Fool-proof packaging (configs, libraries, driver versions, etc.) • Repeatable builds and orchestration • Faster app dev cycles
  • 6. The Journey to Big Data on Docker Start with a clear goal in sight Begin with your Docker toolbox of a single container and basic networking and storage So you want to run Hadoop and Spark on Docker in a multi-tenant enterprise deployment? Beware … there is trouble ahead
  • 7. Traverse the tightrope of network configurations Navigate the river of container managers • Swarm ? • Kubernetes ? • AWS ECS ? • Mesos ? • Overlay files ? • Flocker ? • Convoy ? • Docker Networking ? Calico • Kubernetes Networking ? Flannel, Weave Net Big Data on Docker: Pitfalls Cross the desert of storage configurations
  • 8. Big Data on Docker: Challenges Pass thru the jungle of software compatibility Tame the lion of performance Finally you get to the top! Trip down the staircase of deployment mistakes
  • 9. But for deployment in the enterprise, you are not even close to being done … Big Data on Docker: Next Steps? You still have to climb past: high availability, backup/recovery, security, multi-host, multi-container, upgrades and patches
  • 10. You realize it’s time to get some help! Big Data on Docker: Quagmire
  • 11. How We Did It: Design Decisions • Run Hadoop/Spark distros and applications unmodified - Deploy all services that typically run on a single BM host in a single container • Multi-tenancy support is key - Network and storage security
  • 12. How We Did It: Design Decisions • Images built to “auto-configure” themselves at time of instantiation - Not all instances of a single image run the same set of services when instantiated • Master vs. worker cluster nodes
  • 13. How We Did It: Design Decisions • Maintain the promise of containers - Keep them as stateless as possible - Container storage is always ephemeral - Persistent storage is external to the container
  • 14. How We Did It: Implementation Resource Utilization • CPU cores vs. CPU shares • Over-provisioning of CPU recommended • No over-provisioning of memory Network • Connect containers across hosts • Persistence of IP address across container restart • Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation Noisy neighbors
  • 15. How We Did It: Network Architecture OVS Container Orchestrator DHCP/DNS VxLAN tunnel NIC Tenant Networks OVS NIC Resource Manager Node Manager Node Manager SparkMaster SparkWorker Zeppelin
  • 16. How We Did It: Implementation Storage • Tweak the default size of a container’s /root - Resizing of storage inside an existing container is tricky • Mount logical volume on /data - No use of overlay file systems • DataTap (version-independent, HDFS-compliant) connectivity to external storage Image Management • Utilize Docker’s image repository TIP: Mounting block devices into a container does not support symbolic links (IOW: /dev/sdb will not work, /dm/… PCI device can change across host reboot). TIP: Docker images can get large. Use “docker squash” to save on size.
  • 17. How We Did It: Security Considerations • Security is essential since containers and host share one kernel - Non-privileged containers • Achieved through layered set of capabilities • Different capabilities provide different levels of isolation and protection • Add “capabilities” to a container based on what operations are permitted
  • 18. How We Did It: Sample Dockerfile # Spark-1.5.2 docker image for RHEL/CentOS 6.x FROM centos:centos6 # Download and extract spark RUN mkdir /usr/lib/spark; curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.4.tgz | tar -xz -C /usr/lib/spark/ # Download and extract scala RUN mkdir /usr/lib/scala; curl -s http://www.scala-lang.org/files/archive/scala-2.10.3.tgz | tar xz -C /usr/lib/scala/ # Install zeppelin RUN mkdir /usr/lib/zeppelin; curl -s http://10.10.10.10:8080/build/thirdparty/zeppelin/zeppelin-0.6.0-incubating-SNAPSHOT-v2.tar.gz|tar xz -C /usr/lib/zeppelin RUN yum clean all && rm -rf /tmp/* /var/tmp/* /var/cache/yum/* ADD configure_spark_services.sh /root/configure_spark_services.sh RUN chmod -x /root/configure_spark_services.sh && /root/configure_spark_services.sh
  • 19. BlueData Application Image (.bin file) Application bin file Docker image CentOS Dockerfile RHEL Dockerfile appconfig conf Init.d startscript <app> logo file <app>.wb bdwb command clusterconfig, image, role, appconfig, catalog, service, .. Sources Docker file, logo .PNG, Init.d RuntimeSoftware Bits OR Development (e.g. extract .bin and modify to create new bin)
  • 20. Multi-Host 4 containers on 2 different hosts using 1 VLAN and 4 persistent IPs
  • 21. Different Services in Each Container Master Services Worker Services
  • 22. Container Storage from Host Container Storage Host Storage
  • 23. Performance Testing: Spark • Spark 1.x on YARN • HiBench - Terasort - Data sizes: 100Gb, 500GB, 1TB • 10 node physical/virtual cluster • 36 cores and112GB memory per node • 2TB HDFS storage per node (SSDs) • 800GB ephemeral storage
  • 24. Spark on Docker: Performance MB/s
  • 26. Hadoop on Docker: Performance Containers (BlueData EPIC with 1 virtual node per host) Bare-Metal
  • 27. 4 Docker containers Multiple Hadoop versions Different Hadoop versions on same set of physical hosts “Dockerized” Hadoop Clusters
  • 28. 5 fully managed Docker containers with persistent IP addresses “Dockerized” Spark Standalone Spark with Zeppelin Notebook
  • 29. Big Data on Docker: Key Takeaways • All apps can be “Dockerized”, including Hadoop & Spark - Traditional bare-metal approach to Big Data is rigid and inflexible - Containers (e.g. Docker) provide a more flexible & agile model - Faster app dev cycles for Big Data app developers, data scientists, & engineers
  • 30. Big Data on Docker: Key Takeaways • There are unique Big Data pitfalls & challenges with Docker - For enterprise deployments, you will need to overcome these and more:  Docker base images include Big Data libraries and jar files  Container orchestration, including networking and storage  Resource-aware runtime environment, including CPU and RAM
  • 31. Big Data on Docker: Key Takeaways • There are unique Big Data pitfalls & challenges with Docker - More:  Access to Container secured with ssh keypair or PAM module (LDAP/AD)  Fast access to external storage  Management agents in Docker images  Runtime injection of resource and configuration information
  • 32. Big Data on Docker: Key Takeaways • “Do It Yourself” will be costly and time-consuming - Be prepared to tackle the infrastructure & plumbing challenges - In the enterprise, the business value is in the applications / data science • There are other options … - Public Cloud – AWS - BlueData - turnkey solution
  • 33.

Editor's Notes

  1. Jason to briefly cover agenda
  2. Overview of Results Testing revealed that, across the HiBench micro-workloads investigated, the BlueData EPIC software platform enables performance that is comparable or superior to that on bare-metal. Results are summarized in the above chart. The elapsed or execution times for physical Hadoop were used as the baseline (i.e. 100%) and the corresponding elapsed or execution time for the same test on BlueData EPIC compared to the bare metal execution time. The elapsed time results show that while BlueData EPIC is twice as fast for write dominant I/O workloads as shown with DFSIO Write as well as Teragen. The elapsed time results show the BlueData EPIC is comparable to bare metal for read dominant I/O workloads. However, even with lower throughput for DFSIO read, the performance of real world workloads that consist of series of read and write operations over a period of time, would make BlueData performance equivalent or better than bare metal. This balanced read/write performance is demonstrated by the TeraSort read/write results (shown above). Even with just a single virtual node per host, the performance of virtualized EPIC platform is within 2% of bare metal performance. The elapsed time results show that BlueData EPIC is 10-15% faster for compute intensive workloads as shown with Wordcount. The superior performance of the BlueData EPIC software platform for write-dominant workloads is due to the application aware caching enabled by IOBoost technology. The non-persistent memory cache improves the efficiency of access to physical-storage devices by enabling a write-behind cache, optimizing the performance of sequential writes. In general, any write operation, including those outside the scope of these benchmarks will benefit from write-behind cache since Hadoop uses a separate thread for write operations. BlueData IOBoost acknowledges this write request immediately thereby enabling the application to continue data processing without any latency. BlueData performance on read-dominant workloads is comparable to bare metal in part due to the read-ahead cache implemented in IOBoost. Unlike write operations where a separate thread is used, Hadoop applications have rigid semantics for read operations where in they wait for the read to complete before processing is continued. As a result, the read ahead cache has a lesser contribution to the I/O throughput for typical Hadoop applications. In summary, the use of a single virtual node per physical host shows that there is minimal to no overhead using virtualized EPIC platform compared to bare metal/physical Hadoop.
  3. Jason to moderate and introduce questions – then field questions with Anant & Tom to answer
  4. Jason: Thank you to everyone for attending this webinar – and a special thanks to our presenters …. All attendees of this webcast can click on the “Attachments” tab and download a copy of the slides – you’ll also have access to the on-demand replay. And if you’d like more information about BlueData software, you can visit our website at bluedata.com, contact us directly, or try the free version of our software at bluedata.com/free Again, thanks and have a great day.