There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with the scalable data processing of Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– Demonstrating analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling your Data Pipelines with Apache Spark on Kubernetes
1. Scaling Data Pipelines with Apache Spark on Kubernetes on Google Cloud
Rajesh Thallam
Machine Learning Specialist
Google
Sougata Biswas
Data Analytics Specialist
Google
May 2021
2. Outline
1 Why Spark on Kubernetes?
2 Spark on Kubernetes on Google Cloud
3 Things to Know
4 Use Cases / Implementation Patterns
5 Wrap up
4. Why Spark on Kubernetes?
Unique benefits of orchestrating Spark jobs on Kubernetes compared to other cluster managers (YARN and Mesos):
● Optimize Costs: utilize existing Kubernetes infrastructure to run data engineering or ML workloads along with other applications, without maintaining separate big data infrastructure
● Portability: containerizing Spark applications gives the ability to run them on-prem and on cloud
● Isolation: packaging job dependencies in containers provides a great way to isolate workloads, allowing teams to scale independently
● Faster Scaling: scaling containers is much faster than scaling VMs (Virtual Machines)
5. Comparing Cluster Managers
Apache Hadoop YARN vs Kubernetes for Apache Spark
Apache Hadoop YARN
● First cluster manager, available since the inception of Apache Spark
● Battle tested
● General purpose scheduler for big data applications
● Runs on a cluster of VMs or physical machines (e.g. on-prem Hadoop clusters)
● Option to run: spark-submit to YARN
Kubernetes (k8s)
● Supported as a resource manager starting with Spark 2.3 (experimental); GA with Spark 3.1.1
● Not yet at feature parity with YARN
● General purpose scheduler for any containerized apps
● Spark runs as containers on a k8s cluster; faster scaling in and out
● Options to run: spark-submit, Spark k8s operator
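For reference, a minimal sketch of the plain spark-submit path on Kubernetes, as covered in the Spark documentation (the API server address, namespace, and container image below are placeholders):
# submit the SparkPi example directly to a Kubernetes cluster
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar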
7. Cloud Dataproc
Combining the best of open source and cloud, and simplifying Hadoop & Spark workloads on Cloud
Features of Dataproc:
● Managed Clusters: 90s cluster spin-up, autoscaling, autozone placement
● Managed Jobs: Spark on GKE, Workflow Templates, Airflow Operators
● Secure: enterprise security, encryption, access control
● Cost Effective: only pay for what you use
● Built-in support for Hadoop & Spark
● Managed hardware and configuration
● Simplified version management
● Flexible job configuration
8. Kubernetes
OS for your compute fleet
● Manage applications, not machines
  ○ Manages container clusters
  ○ Inspired and informed by Google’s experiences
  ○ Supports multiple cloud and bare-metal environments
  ○ Supports multiple container runtimes
● Features similar to an OS for a host
  ○ Scheduling workloads
  ○ Finding the right host to fit your workload
  ○ Monitoring health of the workload
  ○ Scaling it up and down as needed
  ○ Moving it around as needed
9. Google Kubernetes Engine (GKE)
Secured and fully managed Kubernetes service
[Diagram: GKE as Kubernetes-as-a-service — control plane and nodes managed via kubectl and gcloud]
● Turn-key solution to Kubernetes
  ○ Provision a cluster in minutes
  ○ Industry-leading automation
  ○ Scales to an industry-leading 15k worker nodes
  ○ Reliable and available
  ○ Deep GCP integration
● Generally Available since August 2015
  ○ 99.5% or 99.95% SLA on Kubernetes APIs
  ○ $0.10 per cluster/hour + infrastructure cost
  ○ Supports GCE sole-tenant nodes and reservations
10. Dataproc on GKE (BETA)
Run Spark jobs on GKE clusters with the Dataproc Jobs API
● Simple way of executing Spark jobs on GKE clusters
● Single API to run Spark jobs on Dataproc as well as GKE
● Extensible with custom Docker images for Spark jobs
● Enterprise security controls out of the box
● Ease of logging and monitoring with Cloud Logging and Monitoring
[Diagram: create cluster (Dataproc + GKE) → submit job → allocate resources → run Spark job]
11. Dataproc on GKE - How it Works
Submit Spark jobs to a running GKE cluster from the Dataproc Jobs API
[Diagram: Spark submit via the Dataproc API → Dataproc agent on a GKE node → Kubernetes master (API server, scheduler) handling job scheduling & monitoring → driver pod and executor pods spread across nodes]
● The Dataproc agent runs as a container inside GKE, communicating with the GKE scheduler using the spark-kubernetes operator
● Users submit jobs using the Dataproc Jobs API while job execution happens inside the GKE cluster
● The Spark driver and executors run on different Pods inside separate namespaces within the GKE cluster
● Driver and executor logs are sent to the Google Cloud Logging service
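You can see this in action by inspecting the Pods directly with kubectl; a quick sketch (namespace taken from the demo setup, the pod name is a placeholder):
# list Spark driver and executor Pods in the job namespace
kubectl get pods --namespace spark-on-gke

# tail driver logs directly from Kubernetes
kubectl logs -f <driver-pod-name> --namespace spark-on-gke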
12. How is Dataproc on GKE different from alternatives?
Comparing against Spark Submit and the Spark Operator for Kubernetes
[Diagram: create cluster (Dataproc + GKE) → submit job → allocate resources → run Spark job]
● Easy to get started with the familiar Dataproc API
● Easy to set up and manage; no need to install the Spark Kubernetes operator or set up monitoring and logging separately
● Built-in security features with the Dataproc API: access control, auditing, encryption and more
● Inherent benefits of managed services - Dataproc and GKE
14. Step 1: Set Up a GKE Cluster
# setup environment variables
GCE_REGION=us-west2              # GCP region
GCE_ZONE=us-west2-a              # GCP zone
GKE_CLUSTER=spark-on-gke         # GKE cluster name
DATAPROC_CLUSTER=dataproc-gke    # Dataproc cluster name
VERSION=1.4.27-beta              # Dataproc image version
BUCKET=my-project-spark-on-k8s   # GCS bucket

# create GKE cluster with autoscaling enabled
gcloud container clusters create "${GKE_CLUSTER}" \
  --scopes=cloud-platform \
  --workload-metadata=GCE_METADATA \
  --machine-type=n1-standard-4 \
  --zone="${GCE_ZONE}" \
  --enable-autoscaling --min-nodes 1 --max-nodes 10

# add Kubernetes Engine Admin role to
# service-projectid@dataproc-accounts.iam.gserviceaccount.com
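That role grant might look like the following sketch; the PROJECT_NUMBER placeholder is an assumption about the service agent's address, and roles/container.admin is the Kubernetes Engine Admin role:
# grant the Kubernetes Engine Admin role to the Dataproc service agent
# (PROJECT_NUMBER is a placeholder for your numeric project ID)
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:service-${PROJECT_NUMBER}@dataproc-accounts.iam.gserviceaccount.com" \
  --role="roles/container.admin"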
15. Step 2: Create and Register Dataproc to GKE
# create Dataproc cluster and register it with GKE
# under a K8s namespace
gcloud dataproc clusters create "${DATAPROC_CLUSTER}" \
  --gke-cluster="${GKE_CLUSTER}" \
  --region="${GCE_REGION}" \
  --zone="${GCE_ZONE}" \
  --image-version="${VERSION}" \
  --bucket="${BUCKET}" \
  --gke-cluster-namespace="spark-on-gke"
16. Step 3: Spark Job Execution
# run a sample PySpark job using the Dataproc API
# to read a table in BigQuery and generate word counts
gcloud dataproc jobs submit pyspark bq-word-count.py \
  --cluster=${DATAPROC_CLUSTER} \
  --region=${GCE_REGION} \
  --properties="spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.cores=4" \
  --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar
17. Step 4a: Monitoring - GKE & Cloud Logging
# Spark driver logs from Google Cloud Logging
resource.type="k8s_container"
resource.labels.cluster_name="spark-on-gke"
resource.labels.namespace_name="spark-on-gke"
resource.labels.container_name="spark-kubernetes-driver"

# Spark executor logs from Google Cloud Logging
resource.type="k8s_container"
resource.labels.cluster_name="spark-on-gke"
resource.labels.namespace_name="spark-on-gke"
resource.labels.container_name="executor"
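The same filters work from the command line; a minimal sketch using gcloud logging read (the --limit value is arbitrary):
# fetch recent Spark driver log entries matching the filter above
gcloud logging read '
  resource.type="k8s_container"
  resource.labels.namespace_name="spark-on-gke"
  resource.labels.container_name="spark-kubernetes-driver"' \
  --limit=50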
18. Step 4b: Monitoring with Spark Web UI
# TCP port forwarding to the driver pod to view the Spark UI
gcloud container clusters get-credentials "${GKE_CLUSTER}" \
  --zone "${GCE_ZONE}" \
  --project "${PROJECT_ID}" && \
kubectl port-forward \
  --namespace "${GKE_NAMESPACE}" \
  $(kubectl get pod --namespace "${GKE_NAMESPACE}" \
    --selector="spark-role=driver,sparkoperator.k8s.io/app-name=dataproc-app_name" \
    --output jsonpath='{.items[0].metadata.name}') \
  8080:4040

# the Spark UI is then available at http://localhost:8080
20. Autoscaling Spark Jobs
Automatically resize node pools of the GKE cluster based on workload demands
# create GKE cluster with autoscaling enabled
gcloud container clusters create "${GKE_CLUSTER}" \
  --scopes=cloud-platform \
  --workload-metadata=GCE_METADATA \
  --machine-type n1-standard-2 \
  --zone="${GCE_ZONE}" \
  --num-nodes 2 \
  --enable-autoscaling --min-nodes 1 --max-nodes 10

# create Dataproc cluster on GKE
gcloud dataproc clusters create "${DATAPROC_CLUSTER}" \
  --gke-cluster="${GKE_CLUSTER}" \
  --region="${GCE_REGION}" \
  --zone="${GCE_ZONE}" \
  --image-version="${VERSION}" \
  --bucket="${BUCKET}"

● The Dataproc Autoscaler is not supported with Dataproc on GKE
● Instead, enable autoscaling on the GKE cluster node pool
● Specify a minimum and maximum size for the GKE cluster’s node pool, and the rest is automatic
● You can combine the GKE Cluster Autoscaler with Horizontal/Vertical Pod Autoscaling
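Autoscaling can also be turned on for an existing cluster's node pool; a minimal sketch (the node pool name below assumes the default pool):
# enable autoscaling on an existing GKE node pool
gcloud container clusters update "${GKE_CLUSTER}" \
  --zone="${GCE_ZONE}" \
  --node-pool=default-pool \
  --enable-autoscaling --min-nodes 1 --max-nodes 10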
21. Shuffle in Spark on Kubernetes
Write shuffle data to scratch space, local volumes, or Persistent Volume Claims
# create GKE cluster or a node pool with local SSDs
gcloud container clusters create "${GKE_CLUSTER}" \
  ... \
  --local-ssd-count ${NUMBER_OF_DISKS}

# config YAML to use local SSD as scratch space
spec:
  volumes:
    - name: "spark-local-dir-1"
      hostPath:
        path: "/tmp/spark-local-dir"
  executor:
    volumeMounts:
      - name: "spark-local-dir-1"
        mountPath: "/tmp/spark-local-dir"

# spark job conf to override scratch space
spark.local.dir=/tmp/spark-local-dir/

● Shuffle is the data exchange between different stages in a Spark job
● Shuffle is expensive, and its performance depends on disk IOPS and network throughput between the nodes
● Spark supports writing shuffle data to Persistent Volume Claims, local volumes, or scratch space
● Local SSDs are more performant than Persistent Disks, but they are transient; disk IOPS and throughput improve as disk size increases
● An external shuffle service is not available today
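For the Persistent Volume Claim route, Spark 3.1+ can request and mount PVCs per executor through job conf alone; a hedged sketch (the storage class and size are illustrative assumptions, not values from the talk):
# spark job conf mounting an on-demand PVC as the shuffle/scratch dir
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=standard
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=100Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/tmp/spark-local-dir
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false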
22. Dynamic Resource Allocation *
Dynamically adjust the resources a Spark application occupies based on the workload
# spark job conf to enable dynamic allocation
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true

● When enabled, Spark dynamically adjusts resources based on workload demand
● An external shuffle service is not available in Spark on Kubernetes (work in progress)
● Instead, soft dynamic resource allocation is available in Spark 3.0, where the driver tracks shuffle files and evicts only executors not storing active shuffle files
● Dynamic allocation is a cost optimization technique - a cost vs. latency trade-off
● To improve latency, consider over-provisioning the GKE cluster - fine-tune Horizontal Pod Autoscaling or configure pause Pods

* Dataproc on GKE supports only Spark 2.4 at the time of this talk; support for Spark 3.x is coming soon. Spark 3.x is supported on Dataproc on GCE.
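A few companion settings are commonly tuned alongside shuffle tracking; the values below are illustrative assumptions, not recommendations from the talk:
# spark job conf bounding dynamic allocation and executor idle time
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=20
spark.dynamicAllocation.executorIdleTimeout=60s
spark.dynamicAllocation.shuffleTracking.timeout=300s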
23. Running Spark Jobs on Preemptible VMs (PVMs) on GKE
Reduce the cost of running Spark jobs without sacrificing predictability
# create GKE cluster with preemptible VMs
gcloud container clusters create "${GKE_CLUSTER}" \
  --preemptible

# or create a GKE node pool with preemptible VMs
gcloud container node-pools create "${GKE_NODE_POOL}" \
  --preemptible \
  --cluster "${GKE_CLUSTER}"

# submit Dataproc job to the node pool with preemptible VMs
gcloud dataproc jobs submit pyspark foo.py \
  --cluster="${DATAPROC_CLUSTER}" \
  --region="${GCE_REGION}" \
  --properties="spark.kubernetes.node.selector.cloud.google.com/gke-nodepool=${GKE_NODE_POOL}"

● PVMs are excess Compute Engine capacity that lasts for a max of 24 hours, with no availability guarantees
● Best suited for running batch or fault-tolerant jobs
● Much cheaper than standard VMs; running Spark on GKE with PVMs reduces the cost of deployment. But:
  ○ PVMs can shut down inadvertently, and rescheduling Pods to a new node may add latency
  ○ Work from Spark executors with active shuffle files that were shut down will be recomputed, adding latency
24. Create Dataproc Cluster on GKE with a Custom Image
Bring your own image or extend the default Dataproc image
● At the time of creating a Dataproc cluster on GKE, the default Dataproc Docker image is used, based on the image version specified
● You can bring your own image, or extend the default image, as the container image to use for the Spark application
● Create a Dataproc cluster with a custom image when you need to include your own packages or applications

# submit Dataproc job with custom container image
gcloud dataproc jobs submit pyspark foo.py \
  --cluster="${DATAPROC_CLUSTER}" \
  --region="${GCE_REGION}" \
  --properties=spark.kubernetes.container.image="gcr.io/${PROJECT_ID}/my-spark-image"
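Building and pushing such a custom image might look like the sketch below; the base image tag and the added package are hypothetical, so check the Dataproc documentation for the image matching your version:
# extend a base Spark image with extra dependencies and push it
cat > Dockerfile <<'EOF'
# base image is a placeholder - use the Dataproc-provided Spark image for your version
FROM gcr.io/my-base/spark:v2.4
RUN pip install pandas
EOF
docker build -t "gcr.io/${PROJECT_ID}/my-spark-image" .
docker push "gcr.io/${PROJECT_ID}/my-spark-image"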
25. Integrating with Google Cloud Storage (GCS) and BigQuery (BQ)
Use the Spark BigQuery connector and Google Cloud Storage connector for better performance
# submit Dataproc job using BigQuery as source/sink
gcloud dataproc jobs submit pyspark bq-word-count.py \
  --cluster=${DATAPROC_CLUSTER} \
  --region=${GCE_REGION} \
  --properties="spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.cores=4" \
  --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar

● Built-in Cloud Storage connector in the default Dataproc image
● Add the Spark BigQuery connector as a dependency; it uses the BQ Storage API to stream data directly from BQ via gRPC, without using GCS as an intermediary
26. Dataproc with Apache Spark on GKE - Things to Know at a Glance
● Autoscaling: automatically resize GKE cluster node pools based on workload demand
● Shuffle: writes to scratch space, local volumes, or Persistent Volume Claims
● Dynamic Allocation: dynamically adjust job resources based on the workload
● Preemptible VMs: reduce the cost of running Spark jobs without sacrificing predictability
● Custom Image: bring your own image or extend the default Dataproc image
● Integration with Google Cloud Services: built-in Cloud Storage connector; add the Spark BigQuery connector
28. Unified Infrastructure
[Diagram: a single Google Kubernetes Engine (GKE) cluster hosting Dataproc clusters on GKE (Apache Spark 2.4, Apache Spark 3.x) alongside Airflow, Kubeflow, and other workloads]
● Unify all of your processing - a data processing pipeline, a machine learning pipeline, a web application, or anything else
● By migrating Spark jobs to a single cluster manager, you can focus on modern cloud management in Kubernetes
● Leads to more efficient use of resources and provides a unified logging and management framework

Dataproc on GKE supports only Spark 2.4 at the time of this talk; support for Spark 3.x is coming soon. Spark 3.x is supported on Dataproc on GCE.
29. Cloud Composer
Managed Apache Airflow service to create, schedule, monitor and manage workflows
What is Cloud Composer?
● Author end-to-end workflows on GCP via triggers and integrations
● Enterprise security for your workflows through Google-managed credentials
● No need to think about managing the infrastructure after the initial config is done with a click
● Makes troubleshooting simple, with observability through Cloud Logging and Monitoring

[Diagram: Cloud Composer workflow orchestration and its integrations]
● GCP integrations: BigQuery, Cloud Dataproc, Cloud Dataflow, Cloud Pub/Sub, Cloud AI Platform, Cloud Storage, Cloud Datastore
● Public cloud integrations: Azure Blob Storage, AWS EMR, AWS S3, AWS EC2, AWS Redshift, Databricks SubmitRunOperator
● On-prem integrations
30. Orchestrating Apache Spark Jobs from Cloud Composer
[Diagram: Cloud Composer on GKE triggering data processing on Dataproc on GKE, reading from and writing to Cloud Storage, BigQuery, and any other data sources or targets]
● Trigger a DAG from Composer to submit jobs to the Dataproc cluster running on GKE
● Save time by not creating and tearing down ephemeral Dataproc clusters
● One cluster manager to orchestrate and process jobs - better utilization of resources
● Optimize costs, plus better visibility and reliability
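Triggering such a DAG from the CLI might look like this sketch; the environment name, location, and DAG ID are placeholders, and the Airflow subcommand (trigger_dag vs. dags trigger) depends on your Airflow version:
# trigger an Airflow DAG in a Cloud Composer environment (Airflow 1.10 syntax)
gcloud composer environments run my-composer-env \
  --location us-west2 \
  trigger_dag -- spark_on_gke_dag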
31. Machine Learning Lifecycle
[Diagram: data flows from ingestion through exploration, production training, and model scoring]
● Data (DATA ENGINEER): ingestion, cleaning, storage
● Exploration & Model Prototyping (DATA SCIENTIST): explore data, test features + algorithms, build model prototypes, prototype on a SMALL or SAMPLED dataset → ML model code
● Production Training & Evaluation (DATA SCIENTIST / ML ENGINEER): apply ML model code on large datasets, test performance and validate, train on a LARGE or FULL dataset → ML model + model accuracy information
● Model Scoring & Inference (DATA / ML ENGINEER): operationalize data processing, deploy models to production
32. MLflow
Open source platform to manage the ML lifecycle
Components of MLflow:
● Tracking: record and query experiments - code, data, config and results
● Projects: package data science code to reproduce runs on any platform
● Models: deploy machine learning models in diverse serving environments
● Registry: store, annotate, discover, and manage models in a central repository
33. Unifying Machine Learning & Data Pipeline Deployments
[Architecture diagram: Cloud Scheduler triggers Cloud Composer running on Google Kubernetes Engine (GKE); data arrives from API connectors & data imports into Cloud Storage and BigQuery data sources; Dataproc on GKE performs data processing; Kubeflow hosts data science / ML notebooks for training and experimentation with ML tracking, alongside AI Platform; outputs land in a target bucket, Cloud Bigtable, and BigQuery, with artifacts stored in Cloud Storage; security & integrations via Key Management Service, Secret Manager, and Cloud IAM]
35. Apache Spark on Kubernetes
Why Spark on Kubernetes?
● Do you have apps running on Kubernetes clusters? Are they underutilized?
● Is managing multiple cluster managers (YARN, Kubernetes) a pain?
● Do you have difficulties managing Spark job dependencies and different Spark versions?
● Do you want the same benefits as apps running on Kubernetes - multitenancy, autoscaling, fine-grained access control?
Why Dataproc on GKE?
● Faster scaling with reliability
● Inherent benefits of managed infrastructure
● Enterprise security controls
● Unified logging and monitoring
● Optimized costs due to effective resource sharing
36. Resources
Open Source Documentation
● Running Spark on Kubernetes - Spark documentation
● Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes
● Code example used in the demo
Blog Posts & Solutions
● Make the most out of your Data Lake with Google Cloud
● Cloud Dataproc Spark Jobs on GKE: How to get started
Google Cloud Documentation
● Google Cloud Dataproc
● Google Kubernetes Engine (GKE)
● Google Cloud Composer
● Dataproc on Google Kubernetes Engine