Getting Started with Apache Spark on Kubernetes

Community adoption of Kubernetes (instead of YARN) as a scheduler for Apache Spark has been accelerating since the major improvements in the Spark 3.0 release. Companies choose to run Spark on Kubernetes to use a single cloud-agnostic technology across their entire stack, and to benefit from improved isolation and resource sharing for concurrent workloads. In this talk, the founders of Data Mechanics, a serverless Spark platform powered by Kubernetes, will show how to easily get started with Spark on Kubernetes.

  1. Getting Started with Apache Spark on Kubernetes. Jean-Yves Stephan, Co-Founder & CEO @ Data Mechanics. Julien Dumazert, Co-Founder & CTO @ Data Mechanics. www.datamechanics.co
  2. Who We Are. Jean-Yves “JY” Stephan, Co-Founder & CEO @ Data Mechanics (jy@datamechanics.co). Previously: Software Engineer and Spark Infrastructure Lead @ Databricks. Julien Dumazert, Co-Founder & CTO @ Data Mechanics (julien@datamechanics.co). Previously: Lead Data Scientist @ ContentSquare; Data Scientist @ BlaBlaCar.
  3. Who Are You? (Live Poll) What is your experience with running Spark on Kubernetes? ● I’ve never used it, but I’m curious to learn more about it. ● I’ve prototyped with it, but I’m not using it in production. ● I’m using it in production.
  4. Agenda: What is Data Mechanics? Why run Spark on Kubernetes? How to get started? End-to-end dev workflow (demo). Future of Spark-on-Kubernetes.
  5. Agenda: What is Data Mechanics? Why run Spark on Kubernetes? How to get started? End-to-end dev workflow (demo). Future of Spark-on-Kubernetes.
  6. Data Mechanics is a serverless Spark platform... ● Autopilot features: fast autoscaling, automated pod and disk sizing, autotuned Spark configuration. ● Fully Dockerized. ● Priced on Spark task time (instead of wasted server uptime).
  7. ... deployed on a k8s cluster in our customers’ cloud account. ● Sensitive data does not leave this cloud account; private clusters are supported. ● Data Mechanics manages the Kubernetes cluster (using EKS, GKE, or AKS). (Diagram: data scientists in notebooks and data engineers using scripts, Airflow, or another scheduler connect through the Data Mechanics gateway and API to a Kubernetes cluster with autoscaling node groups in the customer’s AWS, GCP, or Azure account.)
  8. How is Data Mechanics different from Spark-on-k8s open-source? An intuitive UI: ● Monitor your application logs, configs, and metrics. ● Jupyter and Airflow integrations. ● Track your costs and performance over time. Dynamic optimizations: ● Automated tuning of VMs, disks, and Spark configs. ● Fast autoscaling. ● I/O optimizations. ● Spot node support. A managed service: ● SSO & private cluster support. ● Optimized Spark images for your use case. ● No setup, no maintenance, Slack support. Check our blog post How Data Mechanics Improves On Spark on Kubernetes for more details.
  9. Agenda: What is Data Mechanics? Why run Spark on Kubernetes? How to get started? End-to-end dev workflow (demo). Future of Spark-on-Kubernetes.
  10. Architecture of Spark-on-Kubernetes. (Architecture diagram: the Spark driver runs as a pod and requests executor pods from the Kubernetes API server.)
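
For orientation, a minimal cluster-mode submission against the Kubernetes API server looks like this. A sketch from the standard Spark-on-k8s docs; the API server address, registry, and example jar version are placeholders to replace:

    spark-submit \
      --master k8s://https://<k8s-apiserver-host>:443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.container.image=<registry>/spark:3.0.1 \
      local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar
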
  11. Motivations for running Spark on Kubernetes. Full isolation in a shared, cost-efficient cluster: ● High resource sharing: k8s reallocates resources across concurrent apps in under 10 seconds. ● Each Spark app has its own Spark version, Python version, and dependencies. A cloud-agnostic infra layer for your entire stack: ● A rich ecosystem of tools for your entire stack (logging & monitoring, CI/CD, security). ● Reduce lock-in and deploy everywhere (cloud, on-prem). ● Run non-Spark workloads on the same cluster (Python ETL, ML model serving, etc.). Docker development workflow: ● A reliable and fast way to package dependencies. ● The same environment in local, dev, testing, and prod. ● A simple workflow for data scientists and engineers.
  12. Agenda: What is Data Mechanics? Why choose Spark on Kubernetes? How to get started? End-to-end dev workflow (demo). Future of Spark-on-Kubernetes.
  13. Checklist to get started with Spark-on-Kubernetes. Basic setup: ● Create the cluster, with proper networking, data access, and node pools. ● Install the spark-operator and cluster-autoscaler (see the sketch below). ● Integrate your tools (Airflow, Jupyter, CI/CD, …). Monitoring: ● Save Spark logs to persistent storage. ● Collect system metrics (memory, CPU, I/O, …). ● Host the Spark History Server. Optimizations: ● A 5-10x shuffle performance boost using local SSDs. ● Configure spot nodes and handle spot interruptions. ● Optimize Spark app configs (pod sizing, bin-packing). Check our blog post Setting up, Managing & Monitoring Spark on Kubernetes for more details.
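
A hedged sketch of the two installs from the checklist, using Helm. Release names, namespaces, and chart repo URLs below are assumptions to verify against each project's docs:

    # spark-operator (GoogleCloudPlatform/spark-on-k8s-operator)
    helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
    helm install spark-operator spark-operator/spark-operator \
      --namespace spark-operator --create-namespace

    # cluster-autoscaler (official Kubernetes autoscaler chart)
    helm repo add autoscaler https://kubernetes.github.io/autoscaler
    helm install cluster-autoscaler autoscaler/cluster-autoscaler \
      --namespace kube-system --set autoDiscovery.clusterName=<your-cluster-name>
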
  14. Set up the Spark History Server (Spark UI). Do it yourself (the hard way): ● Write Spark event logs to persistent storage (using spark.eventLog.dir). ● Follow these instructions to install the Spark History Server as a Helm chart. Use our free hosted Spark History Server (the easy way): ● Install our open-source Spark agent: http://github.com/datamechanics/delight ● View the Spark UI at https://datamechanics.co/delight
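
For the DIY route, the event-log settings look like this. A sketch: the bucket is a placeholder, and writing to s3a:// also assumes the hadoop-aws jars and credentials are available in your image:

    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=s3a://<your-bucket>/spark-events
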
  15. Data Mechanics Delight: a free & cross-platform Spark UI. ● With new system metrics (memory & CPU) and a better UX. ● The first milestone is available: a free hosted Spark History Server. ● Second milestone (early 2021): new metrics and data visualizations. ● Get started at https://datamechanics.co/delight
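
Wiring the agent into an app is a few submit-time options. A hedged sketch: the package coordinate, listener class, and token conf are assumptions to verify against the README in the repository above:

    --packages co.datamechanics:delight_2.12:latest-SNAPSHOT \
    --repositories https://oss.sonatype.org/content/repositories/snapshots \
    --conf spark.delight.accessToken.secret=<your-token> \
    --conf spark.extraListeners=co.datamechanics.delight.DelightListener
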
  16. Multiple node pools that scale down to zero. For reliability & cost reduction, you should have different node pools: ● System pods (spark-operator, ingress controller) on small on-demand nodes (m5.large). ● Spark driver pods on on-demand nodes (m5.xlarge). ● Spark executor pods on larger spot nodes (r5d.2xlarge). A pod-template sketch of this split follows below.
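
One way to express this driver/executor split with plain spark-submit is Spark 3.0 pod templates. A sketch, reusing the acme-lifecycle label from the next slides; the on-demand label value is an assumption about your own labelling scheme:

    cat > driver-template.yaml <<'EOF'
    apiVersion: v1
    kind: Pod
    spec:
      nodeSelector:
        acme-lifecycle: on-demand
    EOF

    cat > executor-template.yaml <<'EOF'
    apiVersion: v1
    kind: Pod
    spec:
      nodeSelector:
        acme-lifecycle: spot
    EOF

    spark-submit ... \
      --conf spark.kubernetes.driver.podTemplateFile=driver-template.yaml \
      --conf spark.kubernetes.executor.podTemplateFile=executor-template.yaml
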
  17. Example setup on AWS EKS. ● Install the cluster-autoscaler. ● Define a labelling scheme for your nodes so you can select them. ● Create auto-scaling groups (ASGs) manually (e.g. with the Terraform AWS EKS module). ● Add those labels as ASG tags to inform the cluster-autoscaler. Node label acme-lifecycle: spot → ASG tag k8s.io/cluster-autoscaler/node-template/label/acme-lifecycle: spot. Node label acme-instance: r5d.2xlarge → ASG tag k8s.io/cluster-autoscaler/node-template/label/acme-instance: r5d.2xlarge.
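
Applying those tags to an existing ASG can be done with the AWS CLI. A sketch: the ASG and cluster names are placeholders, and the first two entries are the cluster-autoscaler's standard auto-discovery tags:

    aws autoscaling create-or-update-tags --tags \
      "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
      "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/<cluster-name>,Value=owned,PropagateAtLaunch=true" \
      "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/acme-lifecycle,Value=spot,PropagateAtLaunch=true"
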
  18. Using preemptible (spot) nodes. We’re all set to schedule pods on spot nodes! Using vanilla spark-submit (another option is pod templates):

    --conf spark.kubernetes.node.selector.acme-lifecycle=spot

Using the spark-operator (nodeSelector is a map, set per role):

    spec:
      driver:
        nodeSelector:
          acme-lifecycle: spot
      executor:
        nodeSelector:
          acme-lifecycle: spot
  19. Agenda: What is Data Mechanics? Why choose Spark on Kubernetes? How to get started? End-to-end dev workflow (demo). Future of Spark-on-Kubernetes.
  20. Advantages of the Docker dev workflow for Spark: build & run locally for dev/testing; build, push & run with prod data on k8s. Control your environment: ● Pick your Spark and Python versions independently. ● Package your complex dependencies in the image. Make Spark more reliable: ● The same environment between dev, test, and prod. ● No flaky runtime downloads or bootstrap actions. Speed up your iteration cycle: ● Docker caches previous layers. ● A sub-30-second iteration cycle on prod data!
  21. Spark & Docker dev workflow: demo time. What we’ll show (sketched below): ● Package your code and dependencies in a Docker image. ● Iterate locally on the image. ● Run the same image on Kubernetes. ● Optimize performance at scale. The example: ● Using the million song dataset (500GB) from the Echo Nest. ● Create harmonious playlists by comparing soundtracks. Credits to Kerem Turgutlu.
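
The demo's loop, sketched with the docker-image-tool.sh script that ships in the Spark distribution. The registry, tags, and application paths are placeholders, and the Dockerfile extending the base image is hypothetical:

    # 1. From your Spark distribution, build the base PySpark image
    ./bin/docker-image-tool.sh -r <registry> -t 3.0.1 \
      -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

    # 2. Bake your code and dependencies on top (hypothetical Dockerfile):
    #      FROM <registry>/spark-py:3.0.1
    #      COPY main.py /opt/app/main.py
    docker build -t <registry>/my-app:dev . && docker push <registry>/my-app:dev

    # 3. Iterate locally against a data sample
    docker run --rm <registry>/my-app:dev \
      /opt/spark/bin/spark-submit --master 'local[*]' /opt/app/main.py

    # 4. Run the exact same image at scale on Kubernetes
    spark-submit \
      --master k8s://https://<k8s-apiserver-host>:443 --deploy-mode cluster \
      --conf spark.kubernetes.container.image=<registry>/my-app:dev \
      local:///opt/app/main.py
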
  22. Agenda: What is Data Mechanics? Why choose Spark on Kubernetes? How to get started? End-to-end dev workflow (demo). Future of Spark-on-Kubernetes.
  23. Spark-on-Kubernetes improvements. February 2018, Spark 2.3: initial release.
  24. Spark-on-Kubernetes improvements. February 2018, Spark 2.3: initial release. November 2018, Spark 2.4: client mode, volume mounts, simpler dependency management.
  25. Spark-on-Kubernetes improvements. February 2018, Spark 2.3: initial release. November 2018, Spark 2.4: client mode, volume mounts, simpler dependency management. June 2020, Spark 3.0: dynamic allocation, local code upload.
  26. Dynamic allocation on Kubernetes. ● Plays well with k8s autoscaling: executors spin up in about 5 seconds when there is capacity, and in 1-2 minutes when a new node must be provisioned. ● Available since Spark 3.0 through shuffle tracking: spark.dynamicAllocation.enabled=true, spark.dynamicAllocation.shuffleTracking.enabled=true.
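
In practice you would also bound the executor count. A sketch of the full set of submit-time options; the min/max values here are illustrative:

    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=40
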
  27. Spark-on-Kubernetes improvements. February 2018, Spark 2.3: initial release. November 2018, Spark 2.4: client mode, volume mounts, simpler dependency management. June 2020, Spark 3.0: dynamic allocation, local code upload. December 2020, Spark 3.1: Spark-on-k8s is GA (“experimental” label removed), better handling of node shutdown.
  28. Better handling for node shutdown: copy shuffle and cache data during graceful decommissioning of a node. 1) k8s warns the node of shutdown. This will occur during dynamic allocation (downscale) or when a node goes down (e.g. a spot interruption). To handle spot interruptions, you need a node termination handler (a daemonset); AWS, GCP, and Azure each provide one.
  29. Better handling for node shutdown (continued). 1) k8s warns the node of shutdown. 2) The driver stops scheduling tasks on it; failed tasks do not count against stage failure. 3) Shuffle & cached data is copied to other executors.
  30. Better handling for node shutdown (continued). 4) The Spark application continues unimpacted.
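
Enabling this path in Spark 3.1 comes down to a handful of configuration flags. A sketch using the decommissioning confs from the Spark 3.1 config reference; verify the names against the docs for your exact version:

    --conf spark.decommission.enabled=true \
    --conf spark.storage.decommission.enabled=true \
    --conf spark.storage.decommission.shuffleBlocks.enabled=true \
    --conf spark.storage.decommission.rddBlocks.enabled=true
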
  31. Spark-on-Kubernetes improvements. February 2018, Spark 2.3: initial release. November 2018, Spark 2.4: client mode, volume mounts, simpler dependency management. June 2020, Spark 3.0: dynamic allocation, local code upload. December 2020, Spark 3.1: Spark-on-k8s is GA (“experimental” label removed), better handling of node shutdown. TBD: use remote storage for persisting shuffle data.
  32. Thank you! Your feedback is important to us; don’t forget to rate and review the sessions. Get in touch: @JYStephan (Jean-Yves Stephan), @DumazertJulien (Julien Dumazert), @DataMechanics_, www.datamechanics.co
