
Optimizing Spark-based data pipelines - are you up for it?

Etti Gur (Senior Big Data developer) and Itai Yaffe (Tech Lead, Big Data group) @ Nielsen:
At Nielsen Marketing Cloud, we provide our customers (marketers and publishers) with real-time analytics tools to measure the efficiency of their ongoing campaigns.

To achieve that, we ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this talk, we will discuss how we significantly optimized our Spark-based daily in-flight analytics pipeline, reducing its total execution time from over 20 hours down to 2 hours and cutting its cost by roughly 90%.

Topics include:
* Ways to identify optimization opportunities
* Optimizing Spark resource allocation
* Parallelizing Spark output phase with dynamic partition inserts
* Running multiple Spark "jobs" in parallel within a single Spark application



  1. Optimizing Spark-based data pipelines - Are you up for it? Etti Gur & Itai Yaffe, Nielsen
  2. Introduction (@ItaiYaffe, @ettigur). Etti Gur: Senior Big Data developer; building data pipelines using Spark, Kafka, Druid, Airflow and more. Itai Yaffe: Tech Lead, Big Data group; dealing with Big Data challenges since 2012; Women in Big Data Israeli chapter co-founder.
  3. Introduction - part 2 (or: “your turn…”) ● Data engineers? Data architects? Something else? ● Working with Spark? ● First time at a Women in Big Data meetup?
  4. Agenda ● Nielsen Marketing Cloud (NMC) ○ About ○ The challenges ● The business use-case and our data pipeline ● Optimizing Spark resource allocation & utilization ○ Tools and examples ● Parallelizing Spark output phase with dynamic partition inserts ● Running multiple Spark "jobs" within a single Spark application
  5. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen in March 2015 ● A data company ● Machine learning models for insights ● Targeting ● Business decisions
  6. Nielsen Marketing Cloud in numbers: >10B events/day; >20TB/day written to S3; 1,000s of nodes/day; 10s of TB/day ingested into Druid; $100Ks/month
  7-8. The challenges: Scalability, Cost Efficiency, Fault-tolerance
  9. The business use-case - measure campaigns in-flight: what are the logical phases of a campaign?
  10. The business use-case - measure campaigns in-flight: what does a funnel look like?
      AD EXPOSURE 100M -> (drop-off 85M) -> HOMEPAGE 15M -> (drop-off 5M) -> PRODUCT PAGE 10M -> (drop-off 7M) -> CHECKOUT 3M
  11. In-flight analytics pipeline - high-level architecture: the Mart Generator (1) reads the last day's files from the Data Lake (partitioned by date, e.g. date=2019-12-16 ... date=2019-12-18) and (2) writes files by campaign,date to the campaigns' marts; the Enricher (3) reads files per campaign and (4) writes files by date,campaign to the enriched data store, from which (5) data is loaded by campaign.
  12. In-flight analytics pipeline - problems:
      The Problem             | Metric
      Growing execution time  | >24 hours/day
      Stability               | Sporadic failures
      High costs              | $33,000/month
      Exhausting recovery     | Many hours/incident ("babysitting")
  13. In-flight analytics pipeline - Mart Generator (the same architecture diagram, with the Mart Generator stage highlighted: read last day's files from the Data Lake, write files by campaign,date to the campaigns' marts)
  14. Mart Generator problems ● Execution time: ran for over 7 hours ● Stability: experienced sporadic OOM failures
  15. Digging deeper into resource allocation & utilization. There are various ways to examine Spark resource allocation and utilization: ● Spark UI (e.g. the Executors tab) ● Spark metrics system, e.g.: ○ JMX ○ Graphite ● YARN UI (if applicable) ● Cluster-wide monitoring tools, e.g. Ganglia (a sample Graphite sink configuration is sketched below)
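For example, Spark's metrics system can ship executor and JVM metrics to Graphite via the built-in GraphiteSink. A minimal sketch, assuming the metrics properties are passed as Spark config keys (the spark.metrics.conf.* prefix is a standard alternative to a metrics.properties file; the host name is hypothetical):

    import org.apache.spark.sql.SparkSession

    // Route all metric sources ("*") to a Graphite sink, reporting every 10 seconds
    val spark = SparkSession.builder()
      .config("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
      .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com") // hypothetical host
      .config("spark.metrics.conf.*.sink.graphite.port", "2003")
      .config("spark.metrics.conf.*.sink.graphite.period", "10")
      .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
      .getOrCreate()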
  16-17. Resource allocation - YARN UI (screenshots)
  18-19. Resource utilization - Ganglia (screenshots)
  20. Mart Generator - initial resource allocation ● EMR cluster with 32 x i3.8xlarge worker nodes (each with 32 cores, 244GB RAM, NVMe SSD) ● spark.executor.cores=6 ● spark.executor.memory=40g ● spark.executor.memoryOverhead=4g (0.10 * executorMemory) ● Executors per node = 32/6 = 5 (with 2 cores left over) ● Unused resources per node = 244GB - 5 x (40+4)GB = 24GB memory, plus 2 cores ● Unused resources across the cluster = 768GB memory, 64 cores ○ Remember our OOM failures?
  21. How to better allocate resources?
      EC2 instance type                                       | Best for                   | Cores per executor | Memory per executor | Overhead per executor | Executors per node
      i3.8xlarge (32 vCore, 244 GiB mem, 4 x 1,900 NVMe SSD)  | Memory & storage optimized | 8                  | 50g                 | 8g                    | 32/8 = 4
      r4.8xlarge (32 vCore, 244 GiB mem)                      | Memory optimized           | 8                  | 50g                 | 8g                    | 32/8 = 4
      c4.8xlarge (36 vCore, 60 GiB mem)                       | Compute optimized          | 6                  | 7g                  | 2g                    | 36/6 = 6
      Number of available executors = total cores / num-cores-per-executor
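To make the arithmetic concrete, here is a minimal sketch of applying the i3.8xlarge row above when building the session (a sketch under the table's assumptions, not necessarily the exact production setup; the app name is hypothetical):

    import org.apache.spark.sql.SparkSession

    // 4 executors per 32-core node: 4 x 8 cores = 32 cores,
    // 4 x (50g + 8g) = 232GB of the 244GB available
    val spark = SparkSession.builder()
      .appName("mart-generator") // hypothetical name
      .config("spark.executor.cores", "8")
      .config("spark.executor.memory", "50g")
      .config("spark.executor.memoryOverhead", "8g")
      .getOrCreate()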
  22. Mart Generator - better resource allocation (screenshot)
  23. Mart Generator - better resource utilization, but... (screenshot)
  24. Mart Generator requirement - overwrite latest date only: read last day's files from the Data Lake (date=2019-11-22 ... date=2019-11-24) and write files by campaign,date to the campaigns' marts, overwriting only the latest date partition
  25. Overwrite partitions - the "trivial" Spark implementation:

      dataframe.write
        .partitionBy("campaign", "date")
        .mode(SaveMode.Overwrite)
        .parquet(folderPath)

      The result: ● Output written in parallel ● Overwriting the entire root folder
  26. Overwrite specific partitions - our "naive" implementation (dataframesMap is of type <campaignCode, campaignDataframe>):

      dataframesMap.foreach(campaign => {
        val outputPath = rootPath + "campaign=" + campaign.code + "/date=" + date
        campaign.dataframe.write.mode(SaveMode.Overwrite).parquet(outputPath)
      })

      The result: ● Overwriting only relevant folders ● An extremely long tail (w.r.t. execution time), since campaigns are written one after another
  27. Overwrite specific partitions - Spark 2.3 implementation:

      sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
      dataframe.write
        .partitionBy("campaign", "date")
        .mode(SaveMode.Overwrite)
        .parquet(folderPath)

      The result: ● Output written in parallel ● Overwriting only relevant folders ● Possible side-effect: the driver moves files into the final partitions via sequential S3 MV commands
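(Note: the session-level setting applies to every subsequent write; from Spark 2.4 onwards the same behavior can reportedly also be enabled per write via the DataFrameWriter option "partitionOverwriteMode" - worth verifying against your Spark version.)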
  28. Mart Generator - optimal resource utilization (screenshot)
  29. Mart Generator - summary ● Better resource allocation & utilization ● Execution time decreased from 7+ hours to ~40 minutes ● No sporadic OOM failures ● Overwriting only relevant folders (i.e. partitions)
  30. In-flight analytics pipeline - Enricher (the same architecture diagram, with the Enricher stage highlighted: read files per campaign from the campaigns' marts, write files by date,campaign to the enriched data store, load data by campaign)
  31. Enricher problem - execution time ● Grew from 9 hours to 18 hours ● Sometimes took more than 20 hours
  32. Enricher - initial resource utilization (screenshot)
  33. Running multiple Spark "jobs" within a single Spark application ● Create one Spark application with one SparkContext ● Create a thread pool ○ Thread pool size is configurable ● Each thread executes a separate Spark "job" (i.e. action) ● "Jobs" wait in a queue and are executed based on available resources ○ This is managed by Spark's scheduler
  34. Running multiple Spark "jobs" within a single Spark application:

      import java.util.concurrent.{Callable, Executors}

      // One thread per concurrent Spark "job"; each thread triggers a separate action
      val executorService = Executors.newFixedThreadPool(numOfThreads)
      val futures = campaigns map (campaign => {
        executorService.submit(new Callable[Result]() {
          override def call: Result = {
            val ans = processCampaign(campaign, appConf, log)
            Result(campaign.code, ans)
          }
        })
      })
      // Wait for all "jobs" to finish, turning failed threads into empty results
      val completedCampaigns = futures map (future => {
        try {
          future.get()
        } catch {
          case e: Exception =>
            log.info("Some thread caused exception: " + e.getMessage)
            Result("", "", false, false) // Result's fields are defined elsewhere in our codebase
        }
      })
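One related knob, not stated in the deck and therefore only a hedged sketch: Spark schedules concurrent jobs FIFO by default, so if long jobs starve short ones, the fair scheduler may help:

    import org.apache.spark.sql.SparkSession

    // spark.scheduler.mode must be set before the SparkContext starts; default is FIFO
    val spark = SparkSession.builder()
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()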
  35. Spark UI - multiple Spark "jobs" within a single Spark application (screenshot)
  36. Enricher - optimal resource utilization (screenshot)
  37. Enricher - summary ● Running multiple Spark "jobs" within a single Spark app ● Better resource utilization ● Execution time decreased from 20+ hours to ~1 hour and 20 minutes
  38. In-flight analytics pipeline - before & after:
      The Problem             | Before                                | After
      Growing execution time  | >24 hours/day                         | 2 hours/day
      Stability               | Sporadic failures                     | Improved
      High costs              | $33,000/month                         | $3,000/month
      Exhausting recovery     | Many hours/incident ("babysitting")   | 2 hours/incident
  39. In-flight analytics pipeline - before & after: the same table, highlighting a >90% improvement across the board
  40. What have we learned? ● You too can optimize Spark resource allocation & utilization ○ Leverage the tools at hand to deep-dive into your cluster ● Spark output phase can be parallelized even when overwriting specific partitions ○ Use dynamic partition inserts ● Running multiple Spark "jobs" within a single Spark application can be useful ● Optimizing data pipelines is an ongoing effort (not a one-off)
  41. Want to know more? ● Women in Big Data Israel YouTube channel - https://tinyurl.com/y5jozqpg ● Marketing Performance Analytics Using Druid - https://tinyurl.com/t3dyo5b ● NMC Tech Blog - https://medium.com/nmc-techblog
  42. QUESTIONS
  43. THANK YOU Itai Yaffe & Etti Gur (@ItaiYaffe, @ettigur)
