
Airflow Summit 2020 - Migrating Airflow-based Spark jobs to Kubernetes - the native way

Roi Teveth (Data Engineer) and Itai Yaffe (Tech Lead, Big Data group) @ Nielsen:
At Nielsen Identity Engine, we use Spark to process tens of TBs of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day.

In this talk, we’ll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.

Published in: Data & Analytics

  1. @ItaiYaffe, @RTeveth
  2. $30,000/month savings; Open your eyes; Robust
  4. 01. ETLs with Airflow & [image not captured]; 02. [image] = [image]; 03. [image]-on-[image]; 04. Airflow Integration & Contribution
  7. ~30 TB ingested/day (S3); ~3000 nodes/day; tens of TB ingested/day (Druid); $100K's/month
  8. Scalability; Cost Efficiency; Fault-tolerance
  10. ~20 automatic DAG deployments/day; ~1000 DAG Runs/day; ~2 years in production; met all requirements & more; ~40 users across 4 groups; 6 contributions to open-source
  17. EMR is an AWS managed service for running Hadoop and Spark clusters. It allows you to reduce costs by using Spot instances. It charges a management cost for each instance in a cluster.
  18. emr_create_job_flow_operator: creates a new EMR cluster; emr_add_steps_operator: adds a Spark step to the cluster; emr_step_sensor: checks whether the step succeeded
  20. $$$; Visibility; Robustness
  23. Cluster: control plane; worker nodes (EC2 in our case). Pods: a group of one or more containers (such as Docker containers)
  29. Cluster; Control plane
  39. ./bin/spark-submit --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.container.image=<spark-image> local:///path/to/examples.jar
      Diagram labels: Kubernetes control plane; Kubernetes cluster; SparkPi driver; Executor 1; Executor 2; Executor 3
  40. https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
  41. Spark-pi.yaml:
        apiVersion: "sparkoperator.k8s.io/v1beta2"
        kind: SparkApplication
        metadata:
          name: "spark-pi"
          namespace: default
        spec:
          ...
          driver:
            ...
          executor:
            ...
      Diagram labels: kubectl; Spark-on-K8s operator; Kubernetes control plane; Kubernetes cluster; SparkPi driver; Executor 1; Executor 2; Executor 3
  42. VS
  45. … https://github.com/apache/airflow/pull/7163
  47. KubernetesHook; SparkKubernetes operator; SparkKubernetes sensor
  49. SparkKubernetes operator:
        apiVersion: "sparkoperator.k8s.io/v1beta2"
        kind: SparkApplication
        metadata:
          name: "spark-pi-{{ ds }}-{{ task_instance.try_number }}"
          namespace: default
        spec:
          ...
          driver:
            ...
          executor:
            ...
  59. on_success_callback; on_failure_callback; default_args
  61. Reduce costs; Open your eyes; Robust
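
Slide 18 describes the EMR pattern as three Airflow tasks: create a cluster, add a Spark step, and wait for the step to finish. The sketch below shows, in plain Python, what the step definition handed to the add-steps task might look like. The helper name `build_spark_step`, the S3 path, and the application argument are hypothetical and not taken from the talk; `command-runner.jar` is the standard EMR way to invoke `spark-submit` from a step.

```python
# Hypothetical helper (not from the talk): builds one EMR step definition
# that runs spark-submit through EMR's command-runner.jar.
def build_spark_step(name, main_class, jar_uri, app_args=()):
    return {
        "Name": name,
        # Keep the cluster running if the step fails, so failure is
        # reported by the step sensor rather than by cluster termination.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--class", main_class,
                jar_uri,
                *app_args,
            ],
        },
    }


# Illustrative values only; the bucket and jar path are assumptions.
steps = [
    build_spark_step(
        name="spark-pi",
        main_class="org.apache.spark.examples.SparkPi",
        jar_uri="s3://example-bucket/jars/examples.jar",
        app_args=["10"],
    )
]
```

In an Airflow DAG, a list like `steps` would typically be passed to the add-steps operator, with the step sensor polling the resulting step ID.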

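Slide 39 shows a `spark-submit` invocation that targets a Kubernetes API server directly. As a rough illustration, the same command can be assembled as an argument list in Python, which makes the `key=value` form of each `--conf` setting explicit. The function name `build_spark_submit`, the API server address, and the image name are placeholders introduced here, not values from the talk.

```python
import shlex


# Hypothetical helper (not from the talk): assembles the spark-submit
# command from slide 39 as an argv list.
def build_spark_submit(api_server, image, app_jar,
                       name="spark-pi",
                       main_class="org.apache.spark.examples.SparkPi",
                       executors=3):
    return [
        "./bin/spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        # Each --conf value is a single key=value token with no spaces.
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# Placeholder endpoint and image; substitute real values before running.
cmd = build_spark_submit(
    api_server="https://k8s.example.internal:6443",
    image="example/spark:latest",
    app_jar="local:///path/to/examples.jar",
)
print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting mistakes if it is later passed to `subprocess.run`.
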
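
Slide 49 templates the SparkApplication name with `{{ ds }}` and `{{ task_instance.try_number }}`. Kubernetes object names must be unique within a namespace, so one plausible reading is that the template gives each run date and each retry its own resource name. The sketch below imitates that rendering with a tiny regex-based substitute; it is a stand-in for Airflow's Jinja templating, not the real thing, and `render` and `build_spark_application` are hypothetical names. The `spec` is left empty because the slide elides it.

```python
import json
import re


def render(template, context):
    """Replace {{ dotted.name }} placeholders using a nested dict.

    A simplified stand-in for Jinja rendering, for illustration only.
    """
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


def build_spark_application(name_template, context, namespace="default"):
    # Mirrors the manifest skeleton on slide 49; spec contents are
    # elided on the slide, so they are left empty here.
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {
            "name": render(name_template, context),
            "namespace": namespace,
        },
        "spec": {},
    }


template = "spark-pi-{{ ds }}-{{ task_instance.try_number }}"
# Illustrative context values; in Airflow these come from the task run.
first = build_spark_application(
    template, {"ds": "2020-07-01", "task_instance": {"try_number": 1}})
retry = build_spark_application(
    template, {"ds": "2020-07-01", "task_instance": {"try_number": 2}})

print(json.dumps(first["metadata"], indent=2))
print(json.dumps(retry["metadata"], indent=2))
```

With these inputs the first attempt and the retry get different names, so the retry would not collide with an object left over from the first attempt.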