Spark on Kubernetes
Containerization of Spark
https://github.com/phatak-dev/kubernetes-spark
● Madhukara Phatak
● Director of Engineering, Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
Agenda
1. Introduction to Containers
2. Spark and Containers
3. Introduction to Kubernetes
4. Kubernetes Abstractions
5. Static Spark Cluster on Kubernetes
6. Shortcomings of Spark Cluster on Kubernetes
7. Kubernetes as YARN
8. Spark Native Integration on Kubernetes
9. Future Work
Introduction to Containers
MicroServices
● A way of developing and deploying an application as a
collection of services that communicate with each other
through lightweight mechanisms, often an HTTP
resource API
● These services are built around business capabilities
and are independently deployable by fully automated
deployment machinery
● These services can be written in different languages
and can have different deployment strategies
Containers
● Containerisation is OS-level virtualization
● In the VM world, each VM has its own copy of the operating
system
● Containers share a common kernel on a given machine
● Very lightweight
● Supports resource isolation
● Most of the time, each microservice is deployed as an
independent container
● This gives the ability to scale each service independently
Introduction to Docker
● Containers have been available in some operating systems,
like Solaris, for over a decade
● Docker popularised containers on Linux
● Docker is a container runtime for running containers on
multiple operating systems
● Started in 2013 and now synonymous with containers
● rkt (Rocket) from CoreOS and LXD from Canonical are
alternatives
Challenges with Containers
● Containers let individual services of an application
scale independently, but make discovering and
consuming these services challenging
● Monitoring these services across multiple hosts is
also challenging
● Clustering multiple containers for big data workloads
is hard with the default Docker tools
● So there needs to be a way to orchestrate these containers
when you run a lot of services on top of them
Container Orchestrators
● Container orchestrators are tools for orchestrating
containers at scale
● They mainly provide
○ Declarative configurations
○ Rules and Constraints
○ Provisioning on multiple hosts
○ Service Discovery
○ Health Monitoring
● Support multiple container runtimes
Different Container Orchestrators
● Docker Compose - Not an orchestrator, but has basic
service discovery
● Docker Swarm by Docker, Inc.
● Kubernetes by Google
● Apache Mesos with Docker integrations
Spark and Containers
Need for Spark on Containers
● Most Spark clusters today run on their own
hardware and VMs
● Cloud providers like AWS provide their own managed
resource handlers like EMR
● But more and more non-Spark workloads are getting
deployed in container environments
● Managing multiple different environments to run Spark
and non-Spark workloads is tedious for operations and
management
Challenges with Separate Spark Env
● Cannot fully utilise the infrastructure when Spark is not
using all the hardware that’s dedicated to it
● Integrating with non-Spark services is tedious as
different network infrastructure needs to be deployed
● No automatic scalability in on-prem deployments
● Resource sharing and restrictions cannot be uniformly
applied across multiple applications
● Setting up clustering is challenging across different
deployments like clouds and on-prem
Spark on Containers
● More and more organisations want to unify their data
pipelines on a single container infrastructure
● So they want Spark to be a good citizen of the
container world, where Kubernetes is becoming the de facto
standard
● When Spark runs on the same infrastructure as other
systems, it becomes much easier to share and consume
resources
● These are the motivations to deploy Spark on
Kubernetes
Introduction to Kubernetes
Kubernetes
● Open source system for
○ Automating deployment
○ Scaling
○ Management
of containerized applications.
● Production Grade Container Orchestrator
● Based on Borg and Omega, the internal container
orchestrators used by Google for 15 years
● https://kubernetes.io/
Why Kubernetes
● Production Grade Container Orchestration
● Support for Cloud and On-Prem deployments
● Agnostic to Container Runtime
● Support for easy clustering and load balancing
● Support for service upgrades and rollback
● Effective Resource Isolation and Management
● Well defined storage management
Minikube
● Minikube is a tool used to run Kubernetes locally
● It runs a single-node Kubernetes cluster using
virtualization layers like VirtualBox, Hyper-V etc.
● In our example, we run Minikube using VirtualBox
● Very useful for trying out Kubernetes for development and
testing purposes
● For installation steps, refer to
http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-2/
Kubectl
● Kubectl is a command line utility to interact with the
Kubernetes REST API
● It allows us to create, manage and delete different
resources in Kubernetes
● Kubectl can connect to any Kubernetes cluster
irrespective of where it’s running
● We need to install kubectl along with Minikube for
interacting with Kubernetes
Minikube Operations
● Starting minikube
minikube start
● Observe the running VM in VirtualBox
● See the Kubernetes dashboard
minikube dashboard
● Run kubectl
kubectl get po
Kubernetes Abstractions
Different Types of Abstraction
● Compute Abstractions (CPU)
Abstractions related to creating and managing compute
entities. Ex: Pod, Deployment
● Service/Network Abstractions (Network)
Abstractions related to exposing services on the network
● Storage Abstractions (Disk)
Disk related abstractions
Compute Abstractions
Pod Abstraction
● A pod is a collection of one or more containers
● Smallest compute unit you can deploy on
Kubernetes
● Host abstraction for Kubernetes
● All containers of a pod run on a single node
● Provides the ability for containers to communicate with
each other using localhost
Defining Pod
● Kubernetes uses YAML/JSON for defining resources in
its framework
● YAML is a human-readable serialization format mainly
used for configuration
● All our examples use YAML
● We are going to define a pod in which we create
a container of nginx
● kube_examples/nginxpod.yaml
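The contents of kube_examples/nginxpod.yaml are not reproduced in this transcript. As a rough sketch only, a single-container nginx pod definition could look like the following; the pod name, the label and the scratch directory are assumptions made for illustration and may differ from the file in the repository.

```shell
# Write a minimal pod manifest to a scratch directory (names are illustrative).
workdir="$(mktemp -d)"
cat > "$workdir/nginxpod.yaml" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod          # hypothetical pod name
  labels:
    name: nginx-pod
spec:
  containers:
    - name: nginx
      image: nginx         # public nginx image
      ports:
        - containerPort: 80
EOF

# Show what was written so it can be inspected before use.
cat "$workdir/nginxpod.yaml"
```

A file like this can be passed to kubectl create -f in the same way as the repository file.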
Creating and Running Pod
● Once we define the pod, we need to create and run the
pod
kubectl create -f kube_examples/nginxpod.yaml
● See running pod
kubectl get po
● Observe same on dashboard
● Stop Pod
kubectl delete -f kube_examples/nginxpod.yaml
Spark Static Cluster on Kubernetes
Spark Cluster on Kubernetes
● A single pod is created for the Spark master
● For the workers, there will be one pod per worker
● All the pods run a custom-built Spark image
● These pods are connected using Kubernetes networking
abstractions
● This creates a static Spark cluster on Kubernetes
● A whole talk on this topic was given earlier [1]
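The layout described above can be sketched as a set of resource definitions: a master pod, a service that gives the master a stable name, and one pod per worker. This is only an illustration under stated assumptions; the image name example/spark:custom, the /opt/spark install path and the resource names are hypothetical and are not taken from the talk or its repository.

```shell
# Sketch of a static Spark standalone cluster: one master pod, a service in
# front of it, and one worker pod. Image name and paths are assumptions.
workdir="$(mktemp -d)"
cat > "$workdir/spark-static.yaml" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: spark-master
  labels:
    name: spark-master
spec:
  containers:
    - name: spark-master
      image: example/spark:custom      # hypothetical custom-built image
      command: ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master"]
      ports:
        - containerPort: 7077          # port the workers connect to
        - containerPort: 8080          # master web UI
---
apiVersion: v1
kind: Service
metadata:
  name: spark-master                   # gives workers a stable DNS name
spec:
  selector:
    name: spark-master
  ports:
    - name: cluster
      port: 7077
    - name: webui
      port: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: spark-worker-1
spec:
  containers:
    - name: spark-worker
      image: example/spark:custom      # same hypothetical image
      command: ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.worker.Worker", "spark://spark-master:7077"]
EOF

# Count the resources defined in the file.
grep -c '^kind:' "$workdir/spark-static.yaml"
```

Adding a worker in this setup means adding another worker pod definition by hand, which is what makes the cluster static.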
Resource Definition
● As Spark is not aware that it is running on Kubernetes,
it doesn’t recognise the limits put on Kubernetes pods
● For example, in Kubernetes we can define a pod to have 1 GB
RAM, but we may end up configuring the Spark worker to
have 10 GB memory
● This mismatch in resource definitions makes it tedious to
keep both in sync
● The same applies to CPU and disk bounds
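To make the two definitions concrete, the sketch below puts them side by side in one worker pod definition: the limits Kubernetes enforces on the pod, and the memory and cores the standalone worker is told to offer through the SPARK_WORKER_MEMORY and SPARK_WORKER_CORES environment variables. The image name, paths and the specific values are illustrative assumptions.

```shell
# Worker pod with the Kubernetes limits and the Spark worker settings written
# next to each other. Nothing checks that they agree; they are synced by hand.
workdir="$(mktemp -d)"
cat > "$workdir/spark-worker.yaml" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: spark-worker-1
spec:
  containers:
    - name: spark-worker
      image: example/spark:custom      # hypothetical custom-built image
      command: ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.worker.Worker", "spark://spark-master:7077"]
      resources:
        limits:
          memory: "1Gi"                # what Kubernetes enforces on the pod
          cpu: "1"
      env:
        - name: SPARK_WORKER_MEMORY    # what the worker offers to executors
          value: "768m"                # kept below the pod limit on purpose
        - name: SPARK_WORKER_CORES
          value: "1"
EOF

cat "$workdir/spark-worker.yaml"
```

Changing the pod limit without also changing the two environment variables reintroduces the mismatch described above.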
Static Nature
● As the Spark cluster is created statically, it cannot scale
automatically like it can on YARN or other standalone
clusters
● This makes Spark keep consuming Kubernetes
resources even when nothing is running
● This makes Spark a bad neighbour to have in the
cluster
● The static nature also means it cannot request more
resources when needed. Manual intervention is needed.
Kubernetes as YARN
Kubernetes as YARN
● YARN is one of the first general purpose container
creation systems built for big data
● In YARN, even though containers run as Java processes,
they can run any application using JNI
● This makes YARN a generic container management tool
which can run any application
● It’s very rarely used outside of big data even though it
has generic container underpinnings
Spark on YARN
● When Spark is deployed on YARN, Spark treats YARN
as a container management system
● Spark requests containers from YARN with defined
resources
● Once it acquires the containers, it builds RPC-based
communication between containers to run the driver and
executors
● Spark can scale automatically by releasing and acquiring
containers
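The automatic scaling on YARN mentioned above is usually driven by Spark's dynamic allocation settings. A minimal sketch of such a configuration follows, written as a properties file that could be passed to spark-submit with --properties-file; the executor bounds and the file location are illustrative assumptions, not values from the talk.

```shell
# Write a properties file that enables dynamic allocation on YARN.
workdir="$(mktemp -d)"
cat > "$workdir/yarn-dynamic.conf" <<'EOF'
spark.master                            yarn
spark.dynamicAllocation.enabled         true
# The external shuffle service lets executors be released without losing shuffle files
spark.shuffle.service.enabled           true
spark.dynamicAllocation.minExecutors    1
spark.dynamicAllocation.maxExecutors    10
EOF

cat "$workdir/yarn-dynamic.conf"
```

With settings like these, Spark asks YARN for more containers when tasks are waiting and releases idle executors, staying between the two bounds.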
Spark Native Integration with K8s
Spark and Kubernetes
● From Spark 2.3, Spark supports Kubernetes as a new
cluster backend
● It adds to the existing list of YARN, Mesos and standalone
backends
● This is a native integration, where no static
cluster needs to be built beforehand
● Works very similarly to how Spark works on YARN
● The next section shows the different capabilities
Running Spark on Kubernetes
Building Image
● Every Kubernetes abstraction needs an image to run
● Spark 2.3 ships a script to build an image of the latest Spark
with all the dependencies it needs
● So as the first step, we are going to run the script to
build the image
● Once the image is ready, we can run a simple Spark
example to check that the integration is working
● ./bin/docker-image-tool.sh -t spark_2.3 build [2]
Run Pi Example On Kubernetes
bin/spark-submit \
  --master k8s://https://192.168.99.100:8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=madhu/spark:spark_2.3 \
  local:///opt/examples/jars/examples.jar
Accessing UI and Logs
● kubectl port-forward <driver-pod-name> 4040:4040
● kubectl -n=<namespace> logs -f <driver-pod-name>
Architecture
Kubernetes Custom Controller
● A Kubernetes custom controller is an extension to the
Kubernetes API to define and create custom resources
in Kubernetes
● Spark uses a custom controller to create the Spark driver,
which in turn is responsible for creating worker pods
● This functionality was added in version 1.6 of Kubernetes
● This allows Spark-like frameworks to natively integrate
with Kubernetes
Architecture
References
● https://www.youtube.com/watch?v=Q0miRvKA4yk&t=13s
● https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html#docker-images
● https://databricks.com/session/apache-spark-on-kubernetes
● https://martinfowler.com/articles/microservices.html
● https://thenewstack.io/containers-container-orchestration/
● http://blog.madhukaraphatak.com/categories/kubernetes-series/
● https://kubernetes.io/docs/home/

BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 

Spark on Kubernetes

  • 1. Spark on Kubernetes Containerization of Spark https://github.com/phatak-dev/kubernetes-spark
  • 2. ● Madhukara Phatak ● Director of Engineering, Tellius ● Works on Hadoop, Spark, ML and Scala ● www.madhukaraphatak.com
  • 3. Agenda 1. Introduction to Containers 2. Spark and Containers 3. Introduction to Kubernetes 4. Kubernetes Abstractions 5. Static Spark Cluster on Kubernetes 6. Shortcomings of Spark Cluster on Kubernetes 7. Kubernetes as YARN 8. Spark Native Integration on Kubernetes 9. Future Work
  • 5. Microservices ● A way of developing and deploying an application as a collection of multiple services that communicate with each other through lightweight mechanisms, often an HTTP resource API ● These services are built around business capabilities and are independently deployable by fully automated deployment machinery ● These services can be written in different languages and can have different deployment strategies
  • 6. Containers ● Containerisation is OS-level virtualization ● In the VM world, each VM has its own copy of the operating system ● Containers share a common kernel on a given machine ● Very lightweight ● Supports resource isolation ● Most of the time, each microservice is deployed as an independent container ● This gives the ability to scale each service independently
  • 7. Introduction to Docker ● Containers have been available in some operating systems, like Solaris, for over a decade ● Docker popularised containers on Linux ● Docker is a container runtime for running containers on multiple operating systems ● Started in 2013 and now synonymous with containers ● rkt (Rocket) from CoreOS and LXD from Canonical are the alternatives
  • 8. Challenges with Containers ● Containers let the individual services of an application scale independently, but make discovering and consuming these services challenging ● Monitoring these services across multiple hosts is also challenging ● Clustering multiple containers for big data workloads is a challenge with the default Docker tools ● So there needs to be a way to orchestrate these containers when you run a lot of services on top of them
  • 9. Container Orchestrators ● Container orchestrators are tools for orchestrating containers at scale ● They mainly provide ○ Declarative configurations ○ Rules and constraints ○ Provisioning on multiple hosts ○ Service discovery ○ Health monitoring ● They support multiple container runtimes
  • 10. Different Container Orchestrators ● Docker Compose - Not an orchestrator, but has basic service discovery ● Docker Swarm by Docker, Inc. ● Kubernetes by Google ● Apache Mesos with Docker integrations
  • 12. Need for Spark on Containers ● Most Spark clusters today run on their own hardware and VMs ● Cloud providers like AWS provide their own managed resource handlers like EMR ● But more and more non-Spark workloads are getting deployed in container environments ● Managing multiple different environments to run Spark and non-Spark workloads is tedious for operations and management
  • 13. Challenges with Separate Spark Env ● Cannot fully utilise the infrastructure when Spark is not using all the hardware that is dedicated to it ● Integrating with non-Spark services is tedious, as different network infrastructure needs to be deployed ● No automatic scalability in on-prem deployments ● Resource sharing and restrictions cannot be uniformly applied across multiple applications ● Setting up clustering is challenging across different deployments like clouds and on-prem
  • 14. Spark on Containers ● More and more organisations want to unify their data pipelines on a single container infrastructure ● So they want Spark to be a good citizen of the container world, where Kubernetes is becoming the de facto standard ● When Spark runs on the same infrastructure as other systems, it becomes much easier to share and consume resources ● These are the motivations to deploy Spark on Kubernetes
  • 16. Kubernetes ● Open source system for ○ Automating deployment ○ Scaling ○ Management of containerized applications ● Production-grade container orchestrator ● Based on Borg and Omega, the internal container orchestrators used by Google for 15 years ● https://kubernetes.io/
  • 17. Why Kubernetes ● Production-grade container orchestration ● Support for cloud and on-prem deployments ● Agnostic to container runtime ● Support for easy clustering and load balancing ● Support for service upgrades and rollbacks ● Effective resource isolation and management ● Well-defined storage management
  • 18. Minikube ● Minikube is a tool for running Kubernetes locally ● It runs a single-node Kubernetes cluster using a virtualization layer such as VirtualBox or Hyper-V ● In our example, we run minikube using VirtualBox ● Very useful for trying out Kubernetes for development and testing ● For installation steps, refer to http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-2/
  • 19. Kubectl ● Kubectl is a command line utility for interacting with the Kubernetes REST API ● It allows us to create, manage and delete different resources in Kubernetes ● Kubectl can connect to any Kubernetes cluster, irrespective of where it is running ● We need to install kubectl alongside minikube to interact with Kubernetes
  • 20. Minikube Operations ● Start minikube: minikube start ● Observe the running VM in VirtualBox ● Open the Kubernetes dashboard: minikube dashboard ● Run kubectl: kubectl get po
  • 22. Different Types of Abstraction ● Compute Abstractions (CPU): abstractions for creating and managing compute entities. Ex: Pod, Deployment ● Service/Network Abstractions (Network): abstractions for exposing a service on the network ● Storage Abstractions (Disk): disk-related abstractions
  • 24. Pod Abstraction ● A pod is a collection of one or more containers ● Smallest compute unit you can deploy on Kubernetes ● Host abstraction for Kubernetes ● All containers of a pod run on a single node ● Lets the containers in a pod communicate with each other over localhost
  • 25. Defining Pod ● Kubernetes uses YAML/JSON for defining resources in its framework ● YAML is a human-readable serialization format mainly used for configuration ● All our examples use YAML ● We are going to define a pod in which we create an nginx container ● kube_examples/nginxpod.yaml
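
The contents of kube_examples/nginxpod.yaml are not reproduced in this transcript. As a minimal sketch of what such a pod definition might contain, the Python snippet below builds an equivalent definition as a dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. The pod name `nginx-pod` and the `app: nginx` label are illustrative assumptions, not values taken from the original file.

```python
import json

# Hypothetical pod definition with a single nginx container.
# "nginx-pod" and the "app" label are illustrative assumptions.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "nginx-pod", "labels": {"app": "nginx"}},
    "spec": {
        "containers": [
            {
                "name": "nginx",
                "image": "nginx",  # public nginx image
                "ports": [{"containerPort": 80}],
            }
        ]
    },
}

# Serialize to JSON; the result can be saved to a file for kubectl.
manifest = json.dumps(pod, indent=2)
print(manifest)
```

Saving the printed output to a file and passing it to `kubectl create -f` should create the same kind of pod as a YAML definition would.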
  • 26. Creating and Running Pod ● Once we define the pod, we need to create and run it: kubectl create -f kube_examples/nginxpod.yaml ● See the running pod: kubectl get po ● Observe the same on the dashboard ● Stop the pod: kubectl delete -f kube_examples/nginxpod.yaml
  • 27. Spark Static Cluster on Kubernetes
  • 28. Spark Cluster on Kubernetes ● A single pod is created for the Spark master ● There is one pod for each worker ● All the pods run a custom-built Spark image ● These pods are connected using Kubernetes networking abstractions ● This creates a static Spark cluster on Kubernetes ● A whole talk on this topic was given earlier [1]
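
The layout described above (one master pod, one pod per worker, a service connecting them) can be sketched as a small set of resource definitions. This is only an illustration under stated assumptions: the image name `example/spark:latest`, the resource names, and the `/opt/spark` install path are hypothetical, and the speaker's actual manifests may differ. The master and worker class names are Spark's standalone-mode entry points.

```python
import json

SPARK_IMAGE = "example/spark:latest"          # hypothetical custom-built image
MASTER_NAME = "spark-master"                  # hypothetical pod/service name
SPARK_CLASS = "/opt/spark/bin/spark-class"    # assumed install path in the image


def spark_pod(name, role, command):
    """Return a Pod definition that runs `command` in the Spark image."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "spark", "role": role}},
        "spec": {
            "containers": [
                {"name": role, "image": SPARK_IMAGE, "command": command}
            ]
        },
    }


def static_cluster(workers):
    """Return definitions for a master pod, its service, and worker pods."""
    master = spark_pod(
        MASTER_NAME, "master",
        [SPARK_CLASS, "org.apache.spark.deploy.master.Master"],
    )
    # The service gives workers a stable DNS name for reaching the master.
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": MASTER_NAME},
        "spec": {
            "selector": {"app": "spark", "role": "master"},
            "ports": [
                {"name": "cluster", "port": 7077},
                {"name": "ui", "port": 8080},
            ],
        },
    }
    worker_pods = [
        spark_pod(
            f"spark-worker-{i}", "worker",
            [SPARK_CLASS, "org.apache.spark.deploy.worker.Worker",
             f"spark://{MASTER_NAME}:7077"],
        )
        for i in range(workers)
    ]
    return [master, service] + worker_pods


resources = static_cluster(2)
print(json.dumps(resources, indent=2))
```

The number of workers is fixed when the definitions are generated, which is exactly the static property of this kind of cluster.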
  • 29. Resource Definition ● As Spark is not aware that it is running on Kubernetes, it doesn't recognise the limits put on Kubernetes pods ● For example, in Kubernetes we can define a pod to have 1 GB of RAM, but we may end up configuring the Spark worker to have 10 GB of memory ● This mismatch in resource definitions makes it tedious to keep both in sync ● The same applies to CPU and disk bounds
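
The 1 GB versus 10 GB mismatch above can be caught with a small consistency check before deploying. The sketch below is an assumption-laden illustration, not part of Spark or Kubernetes: the helper names are hypothetical, only whole-number quantities are handled, and JVM overhead is ignored. It relies on Kubernetes treating `Gi` as binary and `G` as decimal, and on Spark treating `g` as gibibytes.

```python
import re

# Kubernetes quantity suffixes: binary (Ki, Mi, ...) and decimal (k, M, ...).
K8S_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
             "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
# Spark size suffixes are binary: "10g" means 10 GiB.
SPARK_UNITS = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}


def k8s_bytes(quantity):
    """Parse a whole-number Kubernetes memory quantity such as '1Gi'."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity)
    if match is None:
        raise ValueError(f"unsupported Kubernetes quantity: {quantity!r}")
    return int(match.group(1)) * K8S_UNITS.get(match.group(2), 1)


def spark_bytes(size):
    """Parse a Spark memory string such as '10g' or '512m'."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", size.lower())
    if match is None:
        raise ValueError(f"unsupported Spark size: {size!r}")
    return int(match.group(1)) * SPARK_UNITS[match.group(2)]


def worker_fits_pod(worker_memory, pod_limit):
    """True if the Spark worker memory does not exceed the pod limit."""
    return spark_bytes(worker_memory) <= k8s_bytes(pod_limit)


print(worker_fits_pod("10g", "1Gi"))   # worker asks for more than the pod has
print(worker_fits_pod("512m", "1Gi"))  # worker fits inside the pod limit
```

A check like this has to be run by hand or in a deployment script, which is the tedium the slide refers to.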
  • 30. Static Nature ● As the Spark cluster is created statically, it cannot scale automatically as it can on YARN or other clusters ● This makes Spark keep consuming Kubernetes resources even when nothing is running ● This makes Spark not a good neighbour to have in the cluster ● The static nature also means it cannot request more resources when needed. Manual intervention is required.
  • 32. Kubernetes as YARN ● YARN is one of the first general-purpose container creation systems built for big data ● In YARN, even though containers run as Java processes, they can run any application using JNI ● This makes YARN a generic container management tool that can run any application ● It is very rarely used outside of big data even though it has generic container underpinnings
  • 33. Spark on YARN ● When Spark is deployed on YARN, Spark treats YARN as a container management system ● Spark requests containers from YARN with defined resources ● Once it acquires the containers, it builds RPC-based communication between them to run the driver and executors ● Spark can scale automatically by releasing and acquiring containers
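
The idea of scaling by acquiring and releasing containers can be illustrated with a toy calculation. The function below is a deliberately simplified sketch, not Spark's actual allocation logic: it assumes the number of executors wanted is the pending task count divided by the task slots per executor, clamped to configured bounds. The function and parameter names are hypothetical.

```python
import math


def target_executors(pending_tasks, tasks_per_executor,
                     min_executors, max_executors):
    """Return how many executors a simple policy would ask for."""
    # Enough executors to give every pending task a slot.
    needed = math.ceil(pending_tasks / tasks_per_executor)
    # Never go below the minimum or above the maximum.
    return max(min_executors, min(max_executors, needed))


# Idle application: shrink to the minimum and release the rest.
print(target_executors(0, 4, 1, 10))
# Moderate backlog: request just enough containers.
print(target_executors(17, 4, 1, 10))
# Large backlog: capped at the configured maximum.
print(target_executors(100, 4, 1, 10))
```

A static cluster behaves as if the minimum and maximum were pinned to the same value, so it holds its resources even when the backlog is empty.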
  • 35. Spark and Kubernetes ● From Spark 2.3, Spark supports Kubernetes as a new cluster backend ● It adds to the existing list of YARN, Mesos and standalone backends ● This is a native integration, where no static cluster needs to be built beforehand ● Works very similarly to how Spark works on YARN ● The next section shows the different capabilities
  • 36. Running Spark on Kubernetes
  • 37. Building Image ● Every Kubernetes abstraction needs an image to run ● Spark 2.3 ships a script to build an image of the latest Spark with all the dependencies it needs ● So as the first step, we are going to run the script to build the image ● Once the image is ready, we can run a simple Spark example to see that the integration is working ● ./bin/docker-image-tool.sh -t spark_2.3 build [2]
  • 38. Run Pi Example On Kubernetes bin/spark-submit --master k8s://https://192.168.99.100:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=madhu/spark:spark_2.3 local:///opt/examples/jars/examples.jar
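
The command above is easier to read when split into its arguments. The sketch below assembles the same argument list in Python; the helper name `spark_submit_args` is hypothetical, while the flags and values are copied from the slide. The master URL is the Kubernetes API server address prefixed with `k8s://`, and the `local://` scheme refers to a jar already present inside the container image.

```python
import shlex


def spark_submit_args(api_server, image, main_class, app_jar,
                      executors=2, name="spark-pi"):
    """Build the spark-submit argument list for cluster mode on Kubernetes."""
    return [
        "bin/spark-submit",
        "--master", f"k8s://{api_server}",   # Kubernetes API server URL
        "--deploy-mode", "cluster",          # driver runs in a pod as well
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


args = spark_submit_args(
    api_server="https://192.168.99.100:8443",
    image="madhu/spark:spark_2.3",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/examples/jars/examples.jar",
)

# Print the command as a single shell-safe line.
print(shlex.join(args))
```

From the Spark installation directory, the list could be passed to `subprocess.run(args, check=True)` to submit the job, assuming the API server address and image match your own cluster.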
  • 39. Accessing UI and Logs ● kubectl port-forward <driver-pod-name> 4040:4040 ● kubectl -n=<namespace> logs -f <driver-pod-name>
  • 41. Kubernetes Custom Controller ● A Kubernetes custom controller is an extension to the Kubernetes API used to define and create custom resources in Kubernetes ● Spark uses a custom controller to create the Spark driver, which in turn is responsible for creating worker pods ● This functionality was added in version 1.6 of Kubernetes ● This allows frameworks like Spark to integrate natively with Kubernetes
  • 43. References ● https://www.youtube.com/watch?v=Q0miRvKA4yk&t=13s ● https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html#docker-images ● https://databricks.com/session/apache-spark-on-kubernetes ● https://martinfowler.com/articles/microservices.html ● https://thenewstack.io/containers-container-orchestration/