Over the last year, we have been moving from a batch-processing setup running Airflow on EC2 instances to a powerful and scalable setup running Airflow & Spark on K8s.
The need to keep up with technology changes, new community advances, and multidisciplinary teams pushed us to design a solution that can run multiple Spark versions at the same time, avoiding duplicated infrastructure and simplifying deployment, maintenance, and development.
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
1. Spark Streaming in K8s with
ArgoCD & Spark Operator
Albert Franzi - Data Engineer Lead @ Typeform
2. Agenda
val sc: SparkContext
Where are we nowadays
Spark(implicit mode:K8s)
When Spark met K8s
type Deploy=SparkOperator
How we deploy into K8s
Some[Learnings]
Why it matters
4. About me
Data Engineer Lead @ Typeform
○ Leading the Data Platform team
Previously
○ Data Engineer @ Alpha Health
○ Data Engineer @ Schibsted Classified Media
○ Data Engineer @ Trovit Search
albert-franzi FranziCros
11. ● Delayed EMR releases:
EMR 6.1.0 shipped Spark 3.0.0 ~3 months after its release.
● Spark version fixed per cluster.
● Unused resources.
● Same IAM role shared across the entire cluster.
Spark(implicit mode:K8s)
When Spark met K8s - EMR : The Past
12. ● Multiple Spark versions running in parallel in the
same cluster.
● Use what you need, share what you don’t.
● IAM role per Service Account.
● Different node types based on your needs.
● You define the Docker images.
Spark(implicit mode:K8s)
When Spark met K8s - The future
13. Spark(implicit mode:K8s)
When Spark met K8s - Requirements
Kubernetes Cluster
v : 1.13+
AWS SDK
v : 1.11.788+
🔗 WebIdentityTokenCredentialsProvider
IAM Roles
Fine-grained IAM roles for service accounts
🔗 IRSA
Spark docker image
hadoop : v3.2.1
aws_sdk: v1.11.788
scala: v2.12
spark: v3.0.0
java: 8
🔗 hadoop.Dockerfile & spark.Dockerfile
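The IRSA requirement above amounts to annotating a Kubernetes ServiceAccount with an IAM role, so each Spark job gets fine-grained AWS permissions instead of a cluster-wide role. A minimal sketch, assuming a hypothetical account id, role name, and service-account name:

```yaml
# ServiceAccount used by the Spark driver/executor pods.
# The annotation lets an AWS SDK >= 1.11.788 resolve credentials via
# WebIdentityTokenCredentialsProvider (IRSA) instead of node-level roles.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa                # placeholder name
  namespace: spark
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/spark-jobs-role  # placeholder ARN
```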
15. type Deploy=SparkOperator
How we deploy into K8s
ref: github.com - spark-on-k8s-operator
Kubernetes operator for managing the
lifecycle of Apache Spark applications on
Kubernetes.
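The operator watches a SparkApplication custom resource and translates it into driver and executor pods. A minimal sketch of such a manifest, where the image, main class, and resource values are illustrative assumptions (the jar path matches the S3 layout shown later in the deck):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: data-spark-jobs
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:3.0.0        # the Docker image you define
  mainClass: com.example.StreamingJob   # hypothetical entry point
  mainApplicationFile: s3a://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/data-spark-jobs-assembly-0.8.23.jar
  sparkVersion: "3.0.0"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark-sa            # service account carrying the IRSA role
  executor:
    instances: 2
    cores: 1
    memory: 2g
```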
20. type Deploy=SparkOperator
How we deploy into K8s - Deploying it manually (Simple & easy)
$ sbt assembly
$ aws s3 cp \
    target/scala-2.12/data-spark-jobs-assembly-0.8.23.jar \
    s3://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/
$ kubectl apply -f spark-job.yaml
Build the jar, upload it to S3, and deploy the Spark application
$ kubectl delete -f spark-job.yaml
Delete our Spark Application
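The S3 destination above follows a Maven-style layout (artifact_scalaBinaryVersion/version/). A small shell sketch of how the path is composed from the build coordinates; the variable names are hypothetical, the bucket and versions come from the slide:

```shell
ARTIFACT=data-spark-jobs
SCALA_BIN=2.12
VERSION=0.8.23

# sbt-assembly output path and the Maven-style S3 destination.
JAR="target/scala-${SCALA_BIN}/${ARTIFACT}-assembly-${VERSION}.jar"
DEST="s3://my_spark_bucket/spark_jars/${ARTIFACT}_${SCALA_BIN}/${VERSION}/"

echo "aws s3 cp ${JAR} ${DEST}"
```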
21. type Deploy=SparkOperator
How we deploy into K8s - Deploying it automatically (Simple & easy)
Argo CD is a declarative, GitOps continuous
delivery tool for Kubernetes.
ref: argoproj.github.io/argo-cd
22. type Deploy=SparkOperator
How we deploy into K8s - Deploying it automatically (Simple & easy)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: data-spark-jobs
  namespace: argocd
spec:
  destination:
    namespace: spark
    server: 'https://kubernetes.default.svc'
  project: data-platform-projects
  source:
    helm:
      valueFiles:
        - values.yaml
        - values.prod.yaml
    path: k8s/data-spark-jobs
    repoURL: 'https://github.com/thereponame'
    targetRevision: HEAD
  syncPolicy: {}
Argo CD Application Spec
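The Application points at a Helm chart under k8s/data-spark-jobs, with per-environment value files layered on top. A hypothetical values.prod.yaml fragment, purely to illustrate the pattern (the actual chart keys are not shown in the deck):

```yaml
# Illustrative values consumed by the chart's SparkApplication template.
sparkVersion: "3.0.0"
image: my-registry/spark:3.0.0
jar: s3a://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/data-spark-jobs-assembly-0.8.23.jar
executor:
  instances: 4
  memory: 2g
```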
26. Some[Learnings]
● It was really easy to set up with the right team and the right infrastructure.
● Different teams & projects adopt new Spark versions at their own pace.
● A Spark testing cluster is always ready to accept new jobs without "paying for it",
-- a K8s cluster is already available in dev environments.
● Monitor pod consumption to tune memory and CPU properly.
Why it matters
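Monitoring feeds directly back into configuration: actual pod usage (e.g. via kubectl top pod -n spark) tells you what to put in the SparkApplication resource fields. A sketch of the fields involved, with illustrative values:

```yaml
# Driver/executor resource tuning in a SparkApplication spec;
# the numbers below are placeholders to be set from observed usage.
driver:
  cores: 1
  coreLimit: "1200m"    # k8s CPU limit for the driver pod
  memory: 2g
  memoryOverhead: 512m  # off-heap headroom
executor:
  instances: 4
  cores: 2
  memory: 4g
```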
27. Some[Learnings]
Why it matters : Data DevOps makes a difference
Embed a DevOps engineer in your team and turn them into a Data DevOps.
29. Links of Interest
Spark structured streaming in K8s with ArgoCD by Albert Franzi
Spark on K8s operator
ArgoCD - App of apps pattern
Spark History Server in K8s by Carlos Escura
Spark Operator - Specs