8. Containers
• Repeatable Builds and
Workflows
• Application Portability
• High Degree of Control over
Software
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure
Utilization
libs
app
kernel
libs
app
libs
app
libs
app
9. • Based on Google's experience running containers in
production for over 15 years
• Large OSS Community - 1200+ contributors and 45k+
commits
• Ecosystem and Partners - 100+ organizations involved
• One of the top 100 projects overall on GitHub - 23k+
stars
Statistics
14. Controllers
● Drive current state -> desired state
● Act independently
● Recurring pattern in the system
Examples:
● Deployment
● DaemonSet
● StatefulSet
observe
diff
act
16. • Resource sharing between batch, serving and stateful
workloads
– Streamlined developer experience
– Reduced operational costs
– Improved infrastructure utilization
• Kubernetes and the Container Ecosystem
– Lots of addon services: third-party logging, monitoring,
and security tools
– For example, the Istio project, announced May 24, by IBM,
Google and Lyft
Why Kubernetes?
19. • Beta recently announced at Spark Summit 2017
• Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata,
Red Hat, and growing.
Spark on Kubernetes
https://github.com/apache-spark-on-k8s/spar
k
Spark Core
Kubernetes Standalone YARN Mesos
GraphX SparkSQL MLlib Streaming
20. Spark on Kubernetes
Kubernetes
Integration
Container images with dependencies baked
in
Files from GCS/S3/HDFS/HTTP
File Staging Server
Staged files and
JARs
Several ways of running Spark Jobs along with their dependencies
on Kubernetes
21. Spark on Kubernetes
Spark Core Kubernetes Scheduler
Backend
Kubernetes Clusternew executors
remove executors
configuration
• Resource Requests
• Authnz
• Communication with K8s
22. State of Spark
Spark Streaming
Spark Shell
Client Mode
Python/R support
Cluster Mode
Java/Scala
Support
Dynamic
Allocation
Local File Staging High Availability
Spark SQL
GraphX MLlib
Dec 2016
Development
Began
Mar 2017
Alpha
Release
June 2017
Beta
Release
Nov 2016
Design
= supported but
untested
= not yet
supported
23. • Community driven effort to get HDFS running well on
Kubernetes
• Uses a helm chart to install onto a cluster
• Identified and solved several problems around data
locality when running Spark Jobs
HDFS on Kubernetes
https://github.com/apache-spark-on-k8s/kubernetes-HDFS
24. HDFS on Kubernetes
node A node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Namenode Pod Datanode Pod 1 Datanode Pod 2
HDFS on Kubernetes -- Lessons Learned [Public]
Kimoon Kim (PepperData)
25. State of HDFS
• HDFS with basic data locality works!
• Future Work
– Remaining data locality issues -- rack locality, node
preference, etc
– Performance benchmarks and testing
– Kerberos support
– Namenode HA
27. • Pipelines feature many other components.
• All of the below must run well on K8s
– Cassandra
– Kafka
– Zookeeper
– Elasticsearch, Kibana, etc
Data Pipelines are complicated!