Apache Spark Streaming in K8s with ArgoCD & Spark Operator

Over the last year, we have moved from a batch-processing setup running Airflow jobs on EC2 instances to a powerful and scalable setup running Airflow & Spark on K8s.

The need to keep pace with technology changes, new community advances, and multidisciplinary teams pushed us to design a solution that can run multiple Spark versions at the same time, avoiding duplicated infrastructure while simplifying deployment, maintenance, and development.


  1. Spark Streaming in K8s with ArgoCD & Spark Operator - Albert Franzi, Data Engineer Lead @ Typeform
  2. Agenda: val sc: SparkContext - Where are we nowadays | Spark(implicit mode:K8s) - When Spark met K8s | type Deploy=SparkOperator - How we deploy into K8s | Some[Learnings] - Why it matters
  3. About me: Data Engineer Lead @ Typeform
  4. About me: Data Engineer Lead @ Typeform, leading the Data Platform team. Previously: Data Engineer @ Alpha Health, Data Engineer @ Schibsted Classified Media, Data Engineer @ Trovit Search. albert-franzi | FranziCros
  5. About Typeform
  6. val sc: SparkContext - Where are we nowadays
  7. val sc: SparkContext - Where are we nowadays - Environments
  8. val sc: SparkContext - Where are we nowadays - Executions: great for batch processing / good orchestrators / old school / Area 51 / next slides
  9. Spark(implicit mode:K8s) - When Spark met K8s
  10. Spark(implicit mode:K8s) - When Spark met K8s - EMR: The Past ● Delayed EMR releases (EMR 6.1.0 shipped Spark 3.0.0 after ~3 months). ● Spark version fixed per cluster. ● Unused resources. ● Same IAM role shared across the entire cluster.
  11. Spark(implicit mode:K8s) - When Spark met K8s - The Future ● Multiple Spark versions running in parallel in the same cluster. ● Use what you need, share what you don't. ● IAM role per Service Account. ● Different node types based on your needs. ● You define the Docker images.
  12. Spark(implicit mode:K8s) - When Spark met K8s - Requirements: Kubernetes cluster v1.13+ | AWS SDK v1.11.788+ (🔗 WebIdentityTokenCredentialsProvider) | Fine-grained IAM roles for service accounts (🔗 IRSA; see the ServiceAccount sketch after the slide list) | Spark Docker image: hadoop v3.2.1, aws_sdk v1.11.788, scala v2.12, spark v3.0.0, java 8 (🔗 hadoop.Dockerfile & spark.Dockerfile)
  13. type Deploy=SparkOperator - How we deploy into K8s
  14. type Deploy=SparkOperator - How we deploy into K8s (ref: github.com - spark-on-k8s-operator): a Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
  15. type Deploy=SparkOperator - How we deploy into K8s - Application Specs:

      apiVersion: "sparkoperator.k8s.io/v1beta2"
      kind: SparkApplication
      metadata:
        name: our-spark-job-name
        namespace: spark
      spec:
        type: Scala
        mode: cluster
        image: "xxx/typeform/spark:3.0.0"
        imagePullPolicy: Always
        imagePullSecrets: [xxx]
        sparkVersion: "3.0.0"
        restartPolicy:
          type: Never
        volumes:
          - name: temp-volume
            emptyDir: {}
        hadoopConf:
          fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
        mainClass: com.typeform.data.spark.our.class.package
        mainApplicationFile: "s3a://my_spark_bucket/spark_jars/0.8.23/data-spark-jobs-assembly-0.8.23.jar"
        arguments:
          - --argument_name_1
          - argument_value_1
        driver:
          cores: 1
          coreLimit: "1000m"
          memory: "512m"
          labels:
            version: 3.0.0
          serviceAccount: "spark"
          deleteOnTermination: true
          secrets:
            - name: my-secret
              secretType: generic
              path: /mnt/secrets
          volumeMounts:
            - name: "temp-volume"
              mountPath: "/tmp"
        executor:
          cores: 1
          instances: 4
          memory: "512m"
          labels:
            version: 3.0.0
          serviceAccount: "spark"
          deleteOnTermination: true
          volumeMounts:
            - name: "temp-volume"
              mountPath: "/tmp"
  16. type Deploy=SparkOperator - How we deploy into K8s: schedule: "@every 5m" | concurrencyPolicy: Replace / Allow / Forbid (🔗 crontab.guru; see the ScheduledSparkApplication sketch after the slide list)
  17. type Deploy=SparkOperator - How we deploy into K8s: restartPolicy: Never / OnFailure / Always (see the restartPolicy sketch after the slide list)
  18. type Deploy=SparkOperator - How we deploy into K8s - Deployment Flow (diagram)
  19. type Deploy=SparkOperator - How we deploy into K8s - Deploying it manually (simple & easy). Build the jar, upload it to S3, and deploy the Spark application:

      $ sbt assembly
      $ aws s3 cp target/scala-2.12/data-spark-jobs-assembly-0.8.23.jar s3://my_spark_bucket/spark_jars/data-spark-jobs_2.12/0.8.23/
      $ kubectl apply -f spark-job.yaml

      Delete our Spark application:

      $ kubectl delete -f spark-job.yaml
  20. type Deploy=SparkOperator - How we deploy into K8s - Deploying it automatically (simple & easy): Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes (ref: argoproj.github.io/argo-cd).
  21. type Deploy=SparkOperator - How we deploy into K8s - Deploying it automatically (simple & easy). Argo CD Application spec (an automated-sync variant is sketched after the slide list):

      apiVersion: argoproj.io/v1alpha1
      kind: Application
      metadata:
        name: data-spark-jobs
        namespace: argocd
      spec:
        destination:
          namespace: spark
          server: 'https://kubernetes.default.svc'
        project: data-platform-projects
        source:
          helm:
            valueFiles:
              - values.yaml
              - values.prod.yaml
          path: k8s/data-spark-jobs
          repoURL: 'https://github.com/thereponame'
          targetRevision: HEAD
        syncPolicy: {}
  22. ArgoCD manual Sync (screenshot)
  23. type Deploy=SparkOperator - How we deploy into K8s - Deployment Flow
  24. Some[Learnings] - Why it matters
  25. Some[Learnings] - Why it matters: ● It was really easy to set up with the right team and the right infrastructure. ● Different teams & projects adopt new Spark versions at their own pace. ● The Spark testing cluster is always ready to accept new jobs without extra cost, since a K8s cluster is already available in dev environments. ● Monitor pod consumption to tune memory and CPU properly (see the resource-tuning sketch after the slide list).
  26. Some[Learnings] - Why it matters: Data DevOps makes a difference. Bring a DevOps engineer into your team and turn them into a Data DevOps.
  27. The[team] - Data Platform, a multidisciplinary team: Digital Analytics Specialists (x2), BI / DWH Architects (x2), Data DevOps (x1), Data Engineers (x4)
  28. Links of Interest: Spark structured streaming in K8s with ArgoCD (Albert Franzi) | Spark on K8s operator | ArgoCD - App of apps pattern | Spark History Server in K8s (Carlos Escura) | Spark Operator - Specs
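The IRSA requirement from slide 12 boils down to annotating the Kubernetes service account that the Spark pods run under with the IAM role it may assume; the AWS SDK then obtains credentials through WebIdentityTokenCredentialsProvider. A minimal sketch, assuming an EKS cluster; the account ID and role name are placeholders:

      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: spark        # matches serviceAccount: "spark" in the SparkApplication spec
        namespace: spark
        annotations:
          # IRSA: the IAM role this service account is allowed to assume (placeholder ARN)
          eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/data-spark-jobs-role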
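The schedule and concurrencyPolicy fields from slide 16 belong to the operator's ScheduledSparkApplication resource, which wraps a regular SparkApplication spec in a template and runs it periodically. A trimmed sketch reusing the job from slide 15:

      apiVersion: "sparkoperator.k8s.io/v1beta2"
      kind: ScheduledSparkApplication
      metadata:
        name: our-spark-job-name
        namespace: spark
      spec:
        schedule: "@every 5m"        # cron expressions also work; see crontab.guru
        concurrencyPolicy: Replace   # Allow overlapping runs, Forbid new ones, or Replace the running one
        template:                    # a regular SparkApplication spec, as on slide 15 (trimmed here)
          type: Scala
          mode: cluster
          image: "xxx/typeform/spark:3.0.0"
          sparkVersion: "3.0.0"
          mainClass: com.typeform.data.spark.our.class.package
          mainApplicationFile: "s3a://my_spark_bucket/spark_jars/0.8.23/data-spark-jobs-assembly-0.8.23.jar"
          restartPolicy:
            type: Never
          driver:
            cores: 1
            memory: "512m"
            serviceAccount: "spark"
          executor:
            cores: 1
            instances: 4
            memory: "512m"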
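For the restartPolicy values on slide 17, the operator also exposes retry knobs when the type is OnFailure or Always. A sketch of the OnFailure variant; the retry counts and intervals are illustrative, not recommendations:

      restartPolicy:
        type: OnFailure                       # Never, OnFailure, or Always
        onFailureRetries: 3                   # retries after the application fails
        onFailureRetryInterval: 10            # seconds to wait between retries
        onSubmissionFailureRetries: 5         # retries when the submission itself fails
        onSubmissionFailureRetryInterval: 20  # seconds to wait between submission retries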
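Slide 21's Application spec leaves syncPolicy: {} empty, which is why slide 22 shows a manual sync in the Argo CD UI. Enabling Argo CD's automated sync policy makes merged changes deploy themselves; prune and selfHeal are optional flags:

      syncPolicy:
        automated:
          prune: true     # delete cluster resources that were removed from Git
          selfHeal: true  # revert changes made directly in the cluster back to the Git state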
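Slide 25's advice to monitor pod consumption maps onto the resource fields of the driver and executor blocks in the SparkApplication spec. A hedged tuning sketch; all values are placeholders to adjust against observed usage:

      driver:
        cores: 1
        coreLimit: "1200m"        # pod CPU limit; may exceed cores to absorb spikes
        memory: "1g"
        memoryOverhead: "512m"    # off-heap headroom on top of the JVM heap
      executor:
        instances: 4
        cores: 2
        memory: "2g"
        memoryOverhead: "1g"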
