The benefits of running Spark on your own Docker

Shir Bromberg (Big Data Team Leader) @ Yotpo:
Nowadays, many of an organization’s main applications rely on Spark pipelines. As these applications become more significant to businesses, so does the need to quickly deploy, test and monitor them.

The standard way of running Spark jobs is to deploy them on a dedicated managed cluster. However, this solution is relatively expensive, with potentially long setup times. We therefore developed a way to run Spark on any container orchestration platform, which lets us run Spark in a simple, custom, and testable way.

In this talk, we will present our open-source Docker images for running Spark on Nomad servers. We will cover:
* The issues we had running Spark on managed clusters and the solution we developed.
* How to build a Spark Docker image.
* And finally, what you can achieve by using Spark on Nomad.

  1. The Benefits of Running Spark on Docker (Shir Bromberg, Big Data Team Leader @ Yotpo)
  2. Agenda
     ● Motivation
     ● Spark on Docker
       ○ Solution overview
       ○ Building the Spark Docker image
       ○ Deploying your application using “Spark on Nomad”
     ● Why it is better
     ● Next steps
  3. Motivation
     ● Many applications rely on Spark
     ● These applications also require:
       ○ unit testing
       ○ quality tests
       ○ analytics
       ○ …
     ● A year ago: we ran these jobs on an on-demand managed cluster
       ○ Managed cluster = EMR (AWS) or Dataproc (GCP)
  4. Pain Points of Managed Clusters
     ● Setup time: long cluster startup
       ○ Tests should be quick
     ● Environment: installed packages must be managed manually on the cluster
       ○ Using an AMI ⇒ coupling with AWS :(
     ● Pricing: pay per instance
     ● Slow releases
  5. Existing Solutions
     Option 1: Start a cluster for every test
     Problem:
     ● Time-consuming
  6. Existing Solutions
     Option 1: Start a cluster for every test
     Problem:
     ● Time-consuming
     Option 2: Keep a cluster that is always up
     Problems:
     ● The environment must be able to run all test cases
     ● Can be expensive
  7. In an Ideal World...
     Our solution for working with Spark should follow these guidelines:
     ● Reduce Cost
     ● Simple To Use
     ● Better Testability
     ● Monitoring
     ● Cloud Agnostic
     ● Scalability
  8. Basic Terms
  9. Spark in a Nutshell
     An open-source framework for distributed data processing. Provides high-level APIs in Scala, Java, Python, and R.
     [Diagram: the Spark stack: Streaming, SQL, GraphX, and MLlib on top of Spark Core, running over a resource manager (YARN/Mesos/Standalone) and a data storage layer (S3/HDFS)]
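For readers who have not used Spark, here is a minimal application in Scala. This is a sketch for illustration only: the object name, dataset path, and column are placeholders, not taken from the talk.

```scala
import org.apache.spark.sql.SparkSession

// Minimal illustrative Spark job; path and column names are placeholders.
object ExampleJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("example-job")
      .getOrCreate()

    // Read a dataset from the storage layer (S3 in this sketch).
    val events = spark.read.json("s3a://my-bucket/events/")

    // A high-level transformation, executed in parallel by the executors.
    events.groupBy("event_type").count().show()

    spark.stop()
  }
}
```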
  10. Docker in a Nutshell
      Enables you to package your code and its dependencies into a deployable unit called a container.
      [Diagram: a Dockerfile is built into an image on a server, pushed to Docker Hub, and run as containers on top of the OS and the Docker engine]
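As a hedged sketch of what building a Spark image involves, a minimal Dockerfile could look like the following. This is not the actual metorikku Dockerfile (that lives in the repository linked on slide 21); the base image and Spark version are assumptions.

```dockerfile
# Illustrative sketch: package a plain Apache Spark distribution
# on top of a JRE base image.
FROM openjdk:8-jre-slim

ENV SPARK_VERSION=2.4.4
RUN apt-get update && apt-get install -y curl && \
    curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz" \
      | tar -xzf - -C /opt && \
    ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop2.7 /opt/spark

ENV SPARK_HOME=/opt/spark
ENV PATH="${PATH}:${SPARK_HOME}/bin"

# Default to a shell; the submit/master/worker images override the
# command to start the relevant Spark service.
CMD ["bash"]
```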
  11. Nomad in a Nutshell
      A container orchestration platform by HashiCorp.
      [Diagram: Nomad servers handle resource scheduling, task scheduling, and leader election; a job submitted through Nomad’s API is allocated (image, instances, memory, CPU, ...) to client nodes in the Nomad cluster, which execute the tasks]
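For orientation, a minimal Nomad job specification that runs a single Docker task might look roughly like this. The job and group names are illustrative; only the image name comes from the talk's Docker Hub link.

```hcl
# Minimal illustrative Nomad job: one Docker task, run as a batch job.
job "hello-spark" {
  datacenters = ["dc1"]
  type        = "batch"

  group "example" {
    task "submit" {
      driver = "docker"

      config {
        image = "metorikku/spark" # image from the talk's Docker Hub link
      }

      resources {
        cpu    = 500  # MHz
        memory = 1024 # MB
      }
    }
  }
}
```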
  12. High-Level Solution
  13.-18. [Diagram, built up across these slides: a user submits a job to Nomad; Nomad schedules it onto EC2 instances; a scaler launches additional Nomad instances on demand; each instance runs Docker, and the job itself runs in a Spark Docker container]
  19. Dockerizing Spark
  20. Building our own Spark Docker image sounds difficult… Apparently it’s not!
  21. And We Made It Open Source
      Open source: https://github.com/YotpoLtd/metorikku/tree/master/docker/spark
      Docker Hub: https://hub.docker.com/r/metorikku/spark
  22. Spark Components
      ● Driver
        ○ Runs the main() function of the application
        ○ Distributes tasks among the nodes
      ● Executor
        ○ A process that executes multiple tasks
      [Diagram: a driver coordinating two executors]
  23. Spark Components
      ● Master
        ○ Manages the cluster’s resources
      ● Worker
        ○ Runs the executor processes that execute tasks on a worker node
      ● Standalone deploy mode
      ● Client mode
      [Diagram: a master and a driver, with an executor on each of two workers]
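In this setup (standalone deploy mode with client mode), the driver runs inside the submitting process while the executors run on the workers. An illustrative submission might look like the following; the master address, class, and jar are hypothetical placeholders.

```bash
# Submit an application to a standalone master in client mode.
# The driver runs in the submitting process; executors run on workers.
# com.example.MyJob and my-job.jar are hypothetical placeholders.
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --class com.example.MyJob \
  my-job.jar
```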
  24. Spark Docker
      ● Submit Docker: submits the Spark command
      ● Worker Docker: runs the spark-worker service
      ● Master Docker: runs the spark-master service
  25. Run Locally
      Docker Compose with three services (a sketch follows below):
      ● Submit
      ● Master
      ● Worker
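A hedged sketch of what such a compose file might look like; the real files live in the repository linked on slide 21, and the service commands and job jar below are illustrative assumptions based on Spark's standard standalone scripts.

```yaml
# Illustrative docker-compose sketch: a local standalone Spark cluster.
version: "3"
services:
  spark-master:
    image: metorikku/spark
    command: spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"   # Spark master UI

  spark-worker:
    image: metorikku/spark
    command: spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - spark-master

  spark-submit:
    image: metorikku/spark
    # Hypothetical job jar and class; replace with your application.
    command: spark-submit --master spark://spark-master:7077 --class com.example.MyJob /jobs/my-job.jar
    depends_on:
      - spark-master
      - spark-worker
```

With a file like this, a single `docker-compose up` brings up a complete standalone Spark cluster on a laptop or a CI machine.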
  26. Submit Docker
      Configurable with:
      ● The minimum number of workers
      ● The submit command
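The slides do not show the container's exact interface, so the example below uses made-up environment variable names (MIN_WORKERS, SUBMIT_COMMAND) purely to illustrate the idea; consult the repository for the real ones.

```bash
# Hypothetical example: configure the submit container via environment
# variables. Variable names are illustrative, not the actual interface.
docker run \
  -e MIN_WORKERS=2 \
  -e SUBMIT_COMMAND="spark-submit --class com.example.MyJob /jobs/my-job.jar" \
  metorikku/spark
```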
  27. Run Your Application
      Your own job
  28. We have achieved: full data tests in CI
  29. Simple Deployment
      The Spark Docker image can run on your favorite container orchestration platform (e.g. Docker Swarm)
  30. Orchestration Layer
  31. Nomad: Configuration with Ease
      ● Batch and service (streaming) job types
      ● Dynamic port allocation
      ● Spark UI address derived from the job name
      ● Automatic restart/reschedule
      ● Unified environments: no more bootstrap actions
      (a job sketch follows below)
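As a hedged sketch of how these features map onto a Nomad job specification (syntax as in recent Nomad versions; all names are illustrative):

```hcl
# Illustrative: a streaming Spark job as a long-lived Nomad service.
job "spark-streaming-example" {
  datacenters = ["dc1"]
  type        = "service" # "batch" for finite jobs, "service" for streaming

  group "driver" {
    network {
      port "ui" {} # dynamically allocated port, e.g. for the Spark UI
    }

    restart {
      attempts = 3
      mode     = "delay" # automatically restart failed tasks
    }

    task "submit" {
      driver = "docker"

      config {
        image = "metorikku/spark"
        ports = ["ui"]
      }
    }
  }
}
```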
  32. What About Kubernetes?
      Spark can run on clusters managed by Kubernetes.
      ● Pros:
        ○ Built-in support
        ○ K8s is aware that it is running Spark
      ● Cons:
        ○ We are not using K8s (we work with the HashiCorp stack)
        ○ Nomad is simpler
  33.-39. Have We Met Your Expectations?
      ● Reduce Cost: resource sharing is better optimized (EMR pricing: $3,500/month, $42K/year)
      ● Simple To Use: writing and deploying Spark jobs is easier than ever
      ● Better Testability: setup time for a cluster of 100 nodes is 1-2 minutes; runs locally or in a CI environment for tests
      ● Monitoring: observability “for free” on the existing orchestration platform infrastructure
      ● Cloud Agnostic: can be used in any environment (on-prem or any cloud provider)
      ● Scalability: auto-scaling using orchestration tools (Libra)
  40. Spark - DIY: Write, Test, Run
  41. How We Deploy at Yotpo
  42. Deploy: Jenkins
  43. Allocation: Nomad UI
  44. Services: Consul
  45. Monitoring: Grafana
  46. Logs: ELK
  47. Summary
      ● Our own framework to run Spark
      ● Simple to create, like all other microservices
      ● Enables other teams to deploy Spark jobs with ease
  48. Questions? sbromberg@yotpo.com
