SlideShare a Scribd company logo
1 of 31
Download to read offline
Data Processing and
Kubernetes
Anirudh Ramanathan (Google Inc.)
Agenda
• Basics of Kubernetes & Containers
• Motivation
• Apache Spark and HDFS on Kubernetes
• Data Processing Ecosystem
• Future Work
What is Kubernetes?
Kubernetes
Kubernetes is an open-source system
Kubernetes is an open-source system for
automating deployment, scaling, and
management
Kubernetes
Kubernetes is an open-source system for
automating deployment, scaling, and
management of containerized applications.
Kubernetes
‘Containerized’
Containers
• Repeatable Builds and
Workflows
• Application Portability
• High Degree of Control over
Software
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure
Utilization
libs
app
kernel
libs
app
libs
app
libs
app
• Based on Google's experience running containers in
production for over 15 years
• Large OSS Community - 1200+ contributors and 45k+
commits
• Ecosystem and Partners - 100+ organizations involved
• One of the top 100 projects overall on GitHub - 23k+
stars
Statistics
Overview
At a Glance
kubelet
kubeletCLI
API
users master nodes
etcd
kubelet
scheduler
controllers
apiserver
UI
Nodes and Pods
Pod
Volume
Containers
Pod
Containers
8080 8080
• Pod is set of co-located
containers
• Created by declarative
specification
• Each pod has distinct IP
address
• Volumes local or
network-attached
8080
Volume
Controllers
● Drive current state -> desired state
● Act independently
● Recurring pattern in the system
Examples:
● Deployment
● DaemonSet
● StatefulSet
observe
diff
act
Motivation
• Resource sharing between batch, serving and stateful
workloads
– Streamlined developer experience
– Reduced operational costs
– Improved infrastructure utilization
• Kubernetes and the Container Ecosystem
– Lots of addon services: third-party logging, monitoring,
and security tools
– For example, the Istio project, announced May 24, by IBM,
Google and Lyft
Why Kubernetes?
Cluster Administration
Namespaces
Resource
Accounting
Logging
Monitoring
Resource
Quota
Pluggable
Authorization
Admission
Control
RBAC
• Launch Jobs as a particular
user into a specific
namespace
• RBAC and Namespace-level
resource quotas
• Audit logging for clusters
• Several monitoring solutions
to see node, cluster and
pod-level statistics
Data Processing
• Beta recently announced at Spark Summit 2017
• Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata,
Red Hat, and growing.
Spark on Kubernetes
https://github.com/apache-spark-on-k8s/spar
k
Spark Core
Kubernetes Standalone YARN Mesos
GraphX SparkSQL MLlib Streaming
Spark on Kubernetes
Kubernetes
Integration
Container images with dependencies baked
in
Files from GCS/S3/HDFS/HTTP
File Staging Server
Staged files and
JARs
Several ways of running Spark Jobs along with their dependencies
on Kubernetes
Spark on Kubernetes
Spark Core Kubernetes Scheduler
Backend
Kubernetes Clusternew executors
remove executors
configuration
• Resource Requests
• Authnz
• Communication with K8s
State of Spark
Spark Streaming
Spark Shell
Client Mode
Python/R support
Cluster Mode
Java/Scala
Support
Dynamic
Allocation
Local File Staging High Availability
Spark SQL
GraphX MLlib
Dec 2016
Development
Began
Mar 2017
Alpha
Release
June 2017
Beta
Release
Nov 2016
Design
= supported but
untested
= not yet
supported
• Community driven effort to get HDFS running well on
Kubernetes
• Uses a helm chart to install onto a cluster
• Identified and solved several problems around data
locality when running Spark Jobs
HDFS on Kubernetes
https://github.com/apache-spark-on-k8s/kubernetes-HDFS
HDFS on Kubernetes
node A node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Namenode Pod Datanode Pod 1 Datanode Pod 2
HDFS on Kubernetes -- Lessons Learned [Public]
Kimoon Kim (PepperData)
State of HDFS
• HDFS with basic data locality works!
• Future Work
– Remaining data locality issues -- rack locality, node
preference, etc
– Performance benchmarks and testing
– Kerberos support
– Namenode HA
Ecosystem
• Pipelines feature many other components.
• All of the below must run well on K8s
– Cassandra
– Kafka
– Zookeeper
– Elasticsearch, Kibana, etc
Data Pipelines are complicated!
• Cassandra:
https://github.com/kubernetes/examples/tree/master/cassandra
• Kafka:
https://github.com/kubernetes/contrib/tree/master/statefulsets/ka
fka
• Zookeeper:
https://github.com/kubernetes/charts/tree/master/incubator/zook
eeper
• zetcd: https://github.com/coreos/zetcd
• Elasticsearch Operator:
https://github.com/upmc-enterprises/elasticsearch-operator
Cassandra, Kafka and Zookeeper
Future Work
• Batch Scheduling and Resource Sharing
– Priorities and Preemption
• Storage
– Local Storage Provisioning
• Extensibility
– Kubernetes CustomResources (formerly
ThirdPartyResources)
– UI and Dashboard Improvements
• Cluster Federation and Multi-cloud deployments
• Get involved!
https://github.com/kubernetes/community/t
ree/master/sig-big-data
• SIG BigData weekly meeting open to all
(10am PT on Wednesdays) via Zoom:
http://zoom.us/my/sig.big.data
Future Work
Questions/Discussion

More Related Content

What's hot

GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...Oleg Shalygin
 
Cloud spanner architecture and use cases
Cloud spanner architecture and use casesCloud spanner architecture and use cases
Cloud spanner architecture and use casesGDG Cloud Bengaluru
 
OpenShift Enterprise 3.1 vs kubernetes
OpenShift Enterprise 3.1 vs kubernetesOpenShift Enterprise 3.1 vs kubernetes
OpenShift Enterprise 3.1 vs kubernetesSamuel Terburg
 
Why kubernetes matters
Why kubernetes mattersWhy kubernetes matters
Why kubernetes mattersPlatform9
 
Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018CloudOps2005
 
Kubernetes and Istio
Kubernetes and IstioKubernetes and Istio
Kubernetes and IstioKetan Gote
 
DevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container EngineDevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container EngineKit Merker
 
Episode 2: Deploying Kubernetes at Scale
Episode 2: Deploying Kubernetes at ScaleEpisode 2: Deploying Kubernetes at Scale
Episode 2: Deploying Kubernetes at ScaleMesosphere Inc.
 
Managing kubernetes deployment with operators
Managing kubernetes deployment with operatorsManaging kubernetes deployment with operators
Managing kubernetes deployment with operatorsCloud Technology Experts
 
Top 3 reasons why you should run your Enterprise workloads on GKE
Top 3 reasons why you should run your Enterprise workloads on GKETop 3 reasons why you should run your Enterprise workloads on GKE
Top 3 reasons why you should run your Enterprise workloads on GKESreenivas Makam
 
The Operator Pattern - Managing Stateful Services in Kubernetes
The Operator Pattern - Managing Stateful Services in KubernetesThe Operator Pattern - Managing Stateful Services in Kubernetes
The Operator Pattern - Managing Stateful Services in KubernetesQAware GmbH
 
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...Edureka!
 
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1Etsuji Nakai
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveKubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveSanjeev Rampal
 
GlueCon kubernetes & container engine
GlueCon kubernetes & container engineGlueCon kubernetes & container engine
GlueCon kubernetes & container enginebrendandburns
 
Enabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via KubernetesEnabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via Kubernetesmountpoint.io
 
Ports, pods and proxies
Ports, pods and proxiesPorts, pods and proxies
Ports, pods and proxiesLibbySchulze
 
Kubecon seattle 2018 recap - Application Deployment aspects
Kubecon seattle 2018 recap - Application Deployment aspectsKubecon seattle 2018 recap - Application Deployment aspects
Kubecon seattle 2018 recap - Application Deployment aspectsKrishna-Kumar
 

What's hot (20)

GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
 
Cloud spanner architecture and use cases
Cloud spanner architecture and use casesCloud spanner architecture and use cases
Cloud spanner architecture and use cases
 
OpenShift Enterprise 3.1 vs kubernetes
OpenShift Enterprise 3.1 vs kubernetesOpenShift Enterprise 3.1 vs kubernetes
OpenShift Enterprise 3.1 vs kubernetes
 
Why kubernetes matters
Why kubernetes mattersWhy kubernetes matters
Why kubernetes matters
 
Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018
 
Kubernetes and Istio
Kubernetes and IstioKubernetes and Istio
Kubernetes and Istio
 
Zephyr: Creating a Best-of-Breed, Secure RTOS for IoT
Zephyr: Creating a Best-of-Breed, Secure RTOS for IoTZephyr: Creating a Best-of-Breed, Secure RTOS for IoT
Zephyr: Creating a Best-of-Breed, Secure RTOS for IoT
 
DevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container EngineDevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container Engine
 
Episode 2: Deploying Kubernetes at Scale
Episode 2: Deploying Kubernetes at ScaleEpisode 2: Deploying Kubernetes at Scale
Episode 2: Deploying Kubernetes at Scale
 
Managing kubernetes deployment with operators
Managing kubernetes deployment with operatorsManaging kubernetes deployment with operators
Managing kubernetes deployment with operators
 
Top 3 reasons why you should run your Enterprise workloads on GKE
Top 3 reasons why you should run your Enterprise workloads on GKETop 3 reasons why you should run your Enterprise workloads on GKE
Top 3 reasons why you should run your Enterprise workloads on GKE
 
The Operator Pattern - Managing Stateful Services in Kubernetes
The Operator Pattern - Managing Stateful Services in KubernetesThe Operator Pattern - Managing Stateful Services in Kubernetes
The Operator Pattern - Managing Stateful Services in Kubernetes
 
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
 
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveKubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
 
GlueCon kubernetes & container engine
GlueCon kubernetes & container engineGlueCon kubernetes & container engine
GlueCon kubernetes & container engine
 
Enabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via KubernetesEnabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via Kubernetes
 
Ports, pods and proxies
Ports, pods and proxiesPorts, pods and proxies
Ports, pods and proxies
 
Kubecon seattle 2018 recap - Application Deployment aspects
Kubecon seattle 2018 recap - Application Deployment aspectsKubecon seattle 2018 recap - Application Deployment aspects
Kubecon seattle 2018 recap - Application Deployment aspects
 

Similar to Big data and Kubernetes

[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on KubernetesTimothy Chen
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasad
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasadGlobal azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasad
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasadashishpd
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?DataWorks Summit
 
Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1Hao H. Zhang
 
Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...DataWorks Summit
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...Radhika Puthiyetath
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018Rachit Arora
 
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...DevDay.org
 
Container Conf 2017: Rancher Kubernetes
Container Conf 2017: Rancher KubernetesContainer Conf 2017: Rancher Kubernetes
Container Conf 2017: Rancher KubernetesVishal Biyani
 
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMwareVMUG IT
 
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Vietnam Open Infrastructure User Group
 
Deploying Anything as a Service (XaaS) Using Operators on Kubernetes
Deploying Anything as a Service (XaaS) Using Operators on KubernetesDeploying Anything as a Service (XaaS) Using Operators on Kubernetes
Deploying Anything as a Service (XaaS) Using Operators on KubernetesAll Things Open
 
Centralizing Kubernetes and Container Operations
Centralizing Kubernetes and Container OperationsCentralizing Kubernetes and Container Operations
Centralizing Kubernetes and Container OperationsKublr
 
Building Cloud-Native Applications with Kubernetes, Helm and Kubeless
Building Cloud-Native Applications with Kubernetes, Helm and KubelessBuilding Cloud-Native Applications with Kubernetes, Helm and Kubeless
Building Cloud-Native Applications with Kubernetes, Helm and KubelessBitnami
 
Meetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-NeckerMeetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-Neckerinovex GmbH
 
Docker kubernetes fundamental(pod_service)_190307
Docker kubernetes fundamental(pod_service)_190307Docker kubernetes fundamental(pod_service)_190307
Docker kubernetes fundamental(pod_service)_190307Inhye Park
 
A Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container EngineA Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container EngineRightScale
 

Similar to Big data and Kubernetes (20)

[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasad
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasadGlobal azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasad
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasad
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
 
Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1
 
Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018
 
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...
 
Container Conf 2017: Rancher Kubernetes
Container Conf 2017: Rancher KubernetesContainer Conf 2017: Rancher Kubernetes
Container Conf 2017: Rancher Kubernetes
 
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware
 
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
 
Deploying Anything as a Service (XaaS) Using Operators on Kubernetes
Deploying Anything as a Service (XaaS) Using Operators on KubernetesDeploying Anything as a Service (XaaS) Using Operators on Kubernetes
Deploying Anything as a Service (XaaS) Using Operators on Kubernetes
 
Centralizing Kubernetes and Container Operations
Centralizing Kubernetes and Container OperationsCentralizing Kubernetes and Container Operations
Centralizing Kubernetes and Container Operations
 
Building Cloud-Native Applications with Kubernetes, Helm and Kubeless
Building Cloud-Native Applications with Kubernetes, Helm and KubelessBuilding Cloud-Native Applications with Kubernetes, Helm and Kubeless
Building Cloud-Native Applications with Kubernetes, Helm and Kubeless
 
Meetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-NeckerMeetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-Necker
 
Docker kubernetes fundamental(pod_service)_190307
Docker kubernetes fundamental(pod_service)_190307Docker kubernetes fundamental(pod_service)_190307
Docker kubernetes fundamental(pod_service)_190307
 
A Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container EngineA Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container Engine
 
Moby KubeCon 2017
Moby KubeCon 2017Moby KubeCon 2017
Moby KubeCon 2017
 

Recently uploaded

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Recently uploaded (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Big data and Kubernetes

  • 1. Data Processing and Kubernetes Anirudh Ramanathan (Google Inc.)
  • 2. Agenda • Basics of Kubernetes & Containers • Motivation • Apache Spark and HDFS on Kubernetes • Data Processing Ecosystem • Future Work
  • 4. Kubernetes Kubernetes is an open-source system
  • 5. Kubernetes is an open-source system for automating deployment, scaling, and management Kubernetes
  • 6. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes
  • 8. Containers • Repeatable Builds and Workflows • Application Portability • High Degree of Control over Software • Faster Development Cycle • Reduced dev-ops load • Improved Infrastructure Utilization libs app kernel libs app libs app libs app
  • 9. • Based on Google's experience running containers in production for over 15 years • Large OSS Community - 1200+ contributors and 45k+ commits • Ecosystem and Partners - 100+ organizations involved • One of the top 100 projects overall on GitHub - 23k+ stars Statistics
  • 10.
  • 12. At a Glance kubelet kubeletCLI API users master nodes etcd kubelet scheduler controllers apiserver UI
  • 13. Nodes and Pods Pod Volume Containers Pod Containers 8080 8080 • Pod is set of co-located containers • Created by declarative specification • Each pod has distinct IP address • Volumes local or network-attached 8080 Volume
  • 14. Controllers ● Drive current state -> desired state ● Act independently ● Recurring pattern in the system Examples: ● Deployment ● DaemonSet ● StatefulSet observe diff act
  • 16. • Resource sharing between batch, serving and stateful workloads – Streamlined developer experience – Reduced operational costs – Improved infrastructure utilization • Kubernetes and the Container Ecosystem – Lots of addon services: third-party logging, monitoring, and security tools – For example, the Istio project, announced May 24, by IBM, Google and Lyft Why Kubernetes?
  • 17. Cluster Administration Namespaces Resource Accounting Logging Monitoring Resource Quota Pluggable Authorization Admission Control RBAC • Launch Jobs as a particular user into a specific namespace • RBAC and Namespace-level resource quotas • Audit logging for clusters • Several monitoring solutions to see node, cluster and pod-level statistics
  • 19. • Beta recently announced at Spark Summit 2017 • Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata, Red Hat, and growing. Spark on Kubernetes https://github.com/apache-spark-on-k8s/spar k Spark Core Kubernetes Standalone YARN Mesos GraphX SparkSQL MLlib Streaming
  • 20. Spark on Kubernetes Kubernetes Integration Container images with dependencies baked in Files from GCS/S3/HDFS/HTTP File Staging Server Staged files and JARs Several ways of running Spark Jobs along with their dependencies on Kubernetes
  • 21. Spark on Kubernetes Spark Core Kubernetes Scheduler Backend Kubernetes Clusternew executors remove executors configuration • Resource Requests • Authnz • Communication with K8s
  • 22. State of Spark Spark Streaming Spark Shell Client Mode Python/R support Cluster Mode Java/Scala Support Dynamic Allocation Local File Staging High Availability Spark SQL GraphX MLlib Dec 2016 Development Began Mar 2017 Alpha Release June 2017 Beta Release Nov 2016 Design = supported but untested = not yet supported
  • 23. • Community driven effort to get HDFS running well on Kubernetes • Uses a helm chart to install onto a cluster • Identified and solved several problems around data locality when running Spark Jobs HDFS on Kubernetes https://github.com/apache-spark-on-k8s/kubernetes-HDFS
  • 24. HDFS on Kubernetes node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Namenode Pod Datanode Pod 1 Datanode Pod 2 HDFS on Kubernetes -- Lessons Learned [Public] Kimoon Kim (PepperData)
  • 25. State of HDFS • HDFS with basic data locality works! • Future Work – Remaining data locality issues -- rack locality, node preference, etc – Performance benchmarks and testing – Kerberos support – Namenode HA
  • 27. • Pipelines feature many other components. • All of the below must run well on K8s – Cassandra – Kafka – Zookeeper – Elasticsearch, Kibana, etc Data Pipelines are complicated!
  • 28. • Cassandra: https://github.com/kubernetes/examples/tree/master/cassandra • Kafka: https://github.com/kubernetes/contrib/tree/master/statefulsets/ka fka • Zookeeper: https://github.com/kubernetes/charts/tree/master/incubator/zook eeper • zetcd: https://github.com/coreos/zetcd • Elasticsearch Operator: https://github.com/upmc-enterprises/elasticsearch-operator Cassandra, Kafka and Zookeeper
  • 29. Future Work • Batch Scheduling and Resource Sharing – Priorities and Preemption • Storage – Local Storage Provisioning • Extensibility – Kubernetes CustomResources (formerly ThirdPartyResources) – UI and Dashboard Improvements • Cluster Federation and Multi-cloud deployments
  • 30. • Get involved! https://github.com/kubernetes/community/t ree/master/sig-big-data • SIG BigData weekly meeting open to all (10am PT on Wednesdays) via Zoom: http://zoom.us/my/sig.big.data Future Work