Over the last year, we have been moving from a batch-processing setup running Airflow on EC2 instances to a powerful and scalable setup running Airflow & Spark on K8s.
The need to keep up with technology changes, new community advances, and multidisciplinary teams pushed us to design a solution that can run multiple Spark versions at the same time, avoiding duplicated infrastructure and simplifying deployment, maintenance, and development.
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
1. Spark Streaming in K8s with
ArgoCD & Spark Operator
Albert Franzi - Data Engineer Lead @ Typeform
2. Agenda
val sc: SparkContext
Where are we nowadays
Spark(implicit mode:K8s)
When Spark met K8s
type Deploy=SparkOperator
How we deploy into K8s
Some[Learnings]
Why it matters
4. About me
Data Engineer Lead @ Typeform
○ Leading the Data Platform team
Previously
○ Data Engineer @ Alpha Health
○ Data Engineer @ Schibsted Classified Media
○ Data Engineer @ Trovit Search
albert-franzi FranziCros
11. ● Delayed EMR releases:
EMR 6.1.0 shipped Spark 3.0.0 ~3 months after its release.
● Spark version fixed per cluster.
● Unused resources.
● Same IAM role shared across the entire cluster.
Spark(implicit mode:K8s)
When Spark met K8s - EMR : The Past
12. ● Multiple Spark versions running in parallel in the
same cluster.
● Use what you need, share what you don’t.
● IAM role per Service Account.
● Different node types based on your needs.
● You define the Docker images.
Spark(implicit mode:K8s)
When Spark met K8s - The future
13. Spark(implicit mode:K8s)
When Spark met K8s - Requirements
Kubernetes Cluster
v : 1.13+
AWS SDK
v : 1.11.788+
🔗 WebIdentityTokenCredentialsProvider
IAM Roles
Fine-grained IAM roles for service accounts
🔗 IRSA
Spark docker image
hadoop : v3.2.1
aws_sdk: v1.11.788
scala: v2.12
spark: v3.0.0
java: 8
🔗 hadoop.Dockerfile & spark.Dockerfile
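The IRSA requirement above amounts to annotating a Kubernetes ServiceAccount with an IAM role, so each Spark job gets fine-grained AWS permissions instead of a cluster-wide role. A minimal sketch, assuming a hypothetical account id, role name, and service-account name:

```yaml
# ServiceAccount used by the Spark driver/executor pods.
# The annotation lets an AWS SDK >= 1.11.788 resolve credentials via
# WebIdentityTokenCredentialsProvider (IRSA) instead of node-level roles.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa                # placeholder name
  namespace: spark
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/spark-jobs-role  # placeholder ARN
```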
15. type Deploy=SparkOperator
How we deploy into K8s
ref: github.com - spark-on-k8s-operator
Kubernetes operator for managing the
lifecycle of Apache Spark applications on
Kubernetes.
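The operator watches a SparkApplication custom resource and translates it into driver and executor pods. A minimal sketch of such a manifest, where the image, main class, and resource values are illustrative assumptions (the jar path matches the S3 layout shown later in the deck):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: data-spark-jobs
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:3.0.0        # the Docker image you define
  mainClass: com.example.StreamingJob   # hypothetical entry point
  mainApplicationFile: s3a://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/data-spark-jobs-assembly-0.8.23.jar
  sparkVersion: "3.0.0"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark-sa            # service account carrying the IRSA role
  executor:
    instances: 2
    cores: 1
    memory: 2g
```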
20. type Deploy=SparkOperator
How we deploy into K8s - Deploying it manually (Simple & easy)
$ sbt assembly
$ aws s3 cp \
    target/scala-2.12/data-spark-jobs-assembly-0.8.23.jar \
    s3://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/
$ kubectl apply -f spark-job.yaml
Build the jar, upload it to S3, and deploy the Spark application
$ kubectl delete -f spark-job.yaml
Delete our Spark Application
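The S3 destination above follows a Maven-style layout (artifact_scalaBinaryVersion/version/). A small shell sketch of how the path is composed from the build coordinates; the variable names are hypothetical, the bucket and versions come from the slide:

```shell
ARTIFACT=data-spark-jobs
SCALA_BIN=2.12
VERSION=0.8.23

# sbt-assembly output path and the Maven-style S3 destination.
JAR="target/scala-${SCALA_BIN}/${ARTIFACT}-assembly-${VERSION}.jar"
DEST="s3://my_spark_bucket/spark_jars/${ARTIFACT}_${SCALA_BIN}/${VERSION}/"

echo "aws s3 cp ${JAR} ${DEST}"
```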
21. type Deploy=SparkOperator
How we deploy into K8s - Deploying it automatically (Simple & easy)
Argo CD is a declarative, GitOps continuous
delivery tool for Kubernetes.
ref: argoproj.github.io/argo-cd
22. type Deploy=SparkOperator
How we deploy into K8s - Deploying it automatically (Simple & easy)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: data-spark-jobs
  namespace: argocd
spec:
  destination:
    namespace: spark
    server: 'https://kubernetes.default.svc'
  project: data-platform-projects
  source:
    helm:
      valueFiles:
        - values.yaml
        - values.prod.yaml
    path: k8s/data-spark-jobs
    repoURL: 'https://github.com/thereponame'
    targetRevision: HEAD
  syncPolicy: {}
Argo CD Application Spec
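The Application points at a Helm chart under k8s/data-spark-jobs, with per-environment value files layered on top. A hypothetical values.prod.yaml fragment, purely to illustrate the pattern (the actual chart keys are not shown in the deck):

```yaml
# Illustrative values consumed by the chart's SparkApplication template.
sparkVersion: "3.0.0"
image: my-registry/spark:3.0.0
jar: s3a://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/data-spark-jobs-assembly-0.8.23.jar
executor:
  instances: 4
  memory: 2g
```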
26. Some[Learnings]
● It was really easy to set up with the right team and the right infrastructure.
● Different teams & projects adopt new Spark versions at their own pace.
● A Spark testing cluster is always ready to accept new jobs without "paying for it",
-- a K8s cluster is already available in dev environments.
● Monitor pod consumption to tune memory and CPU properly.
Why it matters
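Monitoring feeds directly back into configuration: actual pod usage (e.g. via kubectl top pod -n spark) tells you what to put in the SparkApplication resource fields. A sketch of the fields involved, with illustrative values:

```yaml
# Driver/executor resource tuning in a SparkApplication spec;
# the numbers below are placeholders to be set from observed usage.
driver:
  cores: 1
  coreLimit: "1200m"    # k8s CPU limit for the driver pod
  memory: 2g
  memoryOverhead: 512m  # off-heap headroom
executor:
  instances: 4
  cores: 2
  memory: 4g
```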
27. Some[Learnings]
Why it matters : Data DevOps makes a difference
Embed a DevOps engineer in your team and turn them into a Data DevOps.
29. Links of Interest
Spark structured streaming in K8s with ArgoCD by Albert Franzi
Spark on K8s operator
ArgoCD - App of apps pattern
Spark History Server in K8s by Carlos Escura
Spark Operator - Specs