Etti Gur (Senior Big Data developer) and Itai Yaffe (Tech Lead, Big Data group) @ Nielsen:
At Nielsen Marketing Cloud, we provide our customers (marketers and publishers) with real-time analytics tools to measure the efficiency of their ongoing campaigns.
To achieve that, we need to ingest billions of events per day into our big data stores and we need to do it in a scalable yet cost-efficient manner.
In this talk, we will discuss how we significantly optimized our Spark-based in-flight analytics daily pipeline, reducing its total execution time from over 20 hours down to 2 hours, resulting in a huge cost reduction.
Topics include:
* Ways to identify optimization opportunities
* Optimizing Spark resource allocation
* Parallelizing Spark output phase with dynamic partition inserts
* Running multiple Spark "jobs" in parallel within a single Spark application
2. @ItaiYaffe, @ettigur
Introduction
Etti Gur
● Senior Big Data developer
● Building data pipelines using Spark, Kafka, Druid, Airflow and more

Itai Yaffe
● Tech Lead, Big Data group
● Dealing with Big Data challenges since 2012
● Women in Big Data Israeli chapter co-founder
3. @ItaiYaffe, @ettigur
Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Working with Spark?
● First time at a Women in Big Data meetup?
4. @ItaiYaffe, @ettigur
Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ The challenges
● The business use-case and our data pipeline
● Optimizing Spark resource allocation & utilization
○ Tools and examples
● Parallelizing Spark output phase with dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application
5. @ItaiYaffe, @ettigur
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
10. @ItaiYaffe, @ettigur
The business use-case - measure campaigns in-flight
What does a funnel look like?
AD EXPOSURE: 100M
  ↓ 85M drop-off
HOMEPAGE: 15M
  ↓ 5M drop-off
PRODUCT PAGE: 10M
  ↓ 7M drop-off
CHECKOUT: 3M
11. @ItaiYaffe, @ettigur
In-flight analytics pipeline - high-level architecture
Data Lake (daily partitions: date=2019-12-16, date=2019-12-17, date=2019-12-18) → Mart Generator → Campaigns' marts → Enricher → Enriched data
1. Read files of last day from the Data Lake
2. Write files by campaign,date to the Campaigns' marts
3. Read files per campaign
4. Write files by date,campaign to the Enriched data
5. Load data by campaign
(Mart Generator: steps 1-2; Enricher: steps 3-5)
12. @ItaiYaffe, @ettigur
In-flight analytics pipeline - problems

The Problem | Metric
Growing execution time | >24 hours/day
Stability | Sporadic failures
High costs | $33,000/month
Exhausting recovery (“babysitting”) | Many hours/incident
13. @ItaiYaffe, @ettigur
In-flight analytics pipeline - Mart Generator
(Same pipeline as above - the Mart Generator performs steps 1-2: read files of last day from the Data Lake, then write files by campaign,date to the Campaigns' marts)
20. @ItaiYaffe, @ettigur
Mart Generator - initial resource allocation
● EMR cluster with 32 x i3.8xlarge worker nodes
○ Each with 32 cores, 244GB RAM and NVMe SSD
● spark.executor.cores=6
● spark.executor.memory=40g
● spark.executor.memoryOverhead=4g (0.10 * executorMemory)
● Executors per node = 32/6 = 5 (with 2 cores left over)
● Unused resources per node: 24GB mem, 2 cores
● Unused resources across the cluster: 768GB mem, 64 cores
○ Remember our OOM failures?
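The arithmetic behind these numbers can be checked with a few lines of plain Scala (a back-of-the-envelope sketch; all figures come from the bullets above):

```scala
// i3.8xlarge worker node, per the slide: 32 cores, 244GB RAM, 32 nodes in the cluster
val coresPerNode = 32
val memPerNodeGb = 244
val nodes = 32

// Initial executor settings: 6 cores, 40g heap + 4g memoryOverhead
val coresPerExecutor = 6
val memPerExecutorGb = 40 + 4

// Executors that fit on one node (integer division), and what is left unused
val executorsPerNode = coresPerNode / coresPerExecutor
val unusedCoresPerNode = coresPerNode - executorsPerNode * coresPerExecutor
val unusedMemPerNodeGb = memPerNodeGb - executorsPerNode * memPerExecutorGb

println(s"$executorsPerNode executors/node; " +
  s"unused per node: ${unusedMemPerNodeGb}GB mem, $unusedCoresPerNode cores; " +
  s"unused across cluster: ${nodes * unusedMemPerNodeGb}GB mem, ${nodes * unusedCoresPerNode} cores")
```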
21. @ItaiYaffe, @ettigur
How to better allocate resources?
EC2 instance type | Best for | Cores per executor | Memory per executor | Overhead per executor | Executors per node
i3.8xlarge (32 vCore, 244 GiB mem, 4 x 1,900 NVMe SSD) | Memory & storage optimized | 8 | 50g | 8g | 32/8 = 4
r4.8xlarge (32 vCore, 244 GiB mem) | Memory optimized | 8 | 50g | 8g | 32/8 = 4
c4.8xlarge (36 vCore, 60 GiB mem) | Compute optimized | 6 | 7g | 2g | 36/6 = 6

Number of available executors = total cores / num-cores-per-executor
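The rule of thumb from the last line can be expressed as a tiny helper (a sketch; `layout` is an illustrative name, and the raw per-executor memory budget still has to be split into heap + overhead, leaving some headroom for the OS):

```scala
// Given a node's vCores and memory, and a target cores-per-executor,
// return (executors per node, raw memory budget per executor in GB).
def layout(vCores: Int, memGb: Int, coresPerExecutor: Int): (Int, Int) = {
  val executors = vCores / coresPerExecutor // floor division
  (executors, memGb / executors)
}

val i3 = layout(32, 244, 8) // i3.8xlarge row: 4 executors, 61GB each (50g heap + 8g overhead fits)
val r4 = layout(32, 244, 8) // r4.8xlarge row: same shape as i3.8xlarge
val c4 = layout(36, 60, 6)  // c4.8xlarge row: 6 executors, 10GB each (7g heap + 2g overhead fits)
println(s"i3=$i3, r4=$r4, c4=$c4")
```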
24. @ItaiYaffe, @ettigur
Mart Generator requirement - overwrite latest date only
Daily partitions in the Campaigns' marts: date=2019-11-22, date=2019-11-23, date=2019-11-24
1. Read files of last day from the Data Lake
2. Write files by campaign,date to the Campaigns' marts - overwriting only the latest date partition (date=2019-11-24)
(Mart Generator)
25. @ItaiYaffe, @ettigur
Overwrite partitions - the “trivial” Spark implementation
dataframe.write
.partitionBy("campaign", "date")
.mode(SaveMode.Overwrite)
.parquet(folderPath)
The result:
● Output written in parallel
● Overwriting the entire root folder
26. @ItaiYaffe, @ettigur
Overwrite specific partitions - our “naive” implementation
dataframesMap is of type Map[campaignCode, campaignDataframe]
dataframesMap.foreach { case (campaignCode, campaignDf) =>
  val outputPath = rootPath + "campaign=" + campaignCode + "/date=" + date
  campaignDf.write.mode(SaveMode.Overwrite).parquet(outputPath)
}
The result:
● Overwriting only relevant folders
● An extremely long tail (w.r.t execution time)
27. @ItaiYaffe, @ettigur
Overwrite specific partitions - Spark 2.3 implementation
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dataframe.write
.partitionBy("campaign", "date")
.mode(SaveMode.Overwrite)
.parquet(folderPath)
The result:
● Output written in parallel
● Overwriting only relevant folders
● Possible side-effect: the driver moves (renames) the output files on S3 sequentially, which can add a long serial tail
29. @ItaiYaffe, @ettigur
Mart Generator - summary
● Better resource allocation & utilization
● Execution time decreased from 7+ hours to ~40 minutes
● No sporadic OOM failures
● Overwriting only relevant folders (i.e. partitions)
30. @ItaiYaffe, @ettigur
In-flight analytics pipeline - Enricher
(Same pipeline as above - the Enricher performs steps 3-5: read files per campaign from the Campaigns' marts, write files by date,campaign to the Enriched data, then load data by campaign)
33. @ItaiYaffe, @ettigur
Running multiple Spark “jobs” within a single Spark application
● Create one Spark application with one SparkContext
● Create a thread pool
○ Thread pool size is configurable
● Each thread should execute a separate Spark “job” (i.e. action)
● “Jobs” wait in a queue and are executed based on available resources
○ This is managed by Spark’s scheduler
34. @ItaiYaffe, @ettigur
Running multiple Spark “jobs” within a single Spark application
// A fixed-size pool bounds how many Spark "jobs" run concurrently
val executorService = Executors.newFixedThreadPool(numOfThreads)

// Submit one Callable per campaign; each triggers a separate Spark action
val futures = campaigns.map(campaign => {
  executorService.submit(new Callable[Result]() {
    override def call(): Result = {
      val ans = processCampaign(campaign, appConf, log)
      Result(campaign.code, ans)
    }
  })
})

// Wait for all "jobs" to finish; a failed campaign yields an empty Result
val completedCampaigns = futures.map(future => {
  try {
    future.get()
  } catch {
    case e: Exception =>
      log.info("Some thread caused exception: " + e.getMessage)
      Result("", "", false, false)
  }
})
37. @ItaiYaffe, @ettigur
Enricher - summary
● Running multiple Spark “jobs” within a single Spark app
● Better resource utilization
● Execution time decreased from 20+ hours to ~1:20 hours
38. @ItaiYaffe, @ettigur
In-flight analytics pipeline - before & after

The Problem | Before | After
Growing execution time | >24 hours/day | 2 hours/day
Stability | Sporadic failures | Improved
High costs | $33,000/month | $3,000/month
Exhausting recovery (“babysitting”) | Many hours/incident | 2 hours/incident
39. @ItaiYaffe, @ettigur
In-flight analytics pipeline - before & after: > 90% improvement
40. @ItaiYaffe, @ettigur
What have we learned?
● You too can optimize Spark resource allocation & utilization
○ Leverage the tools at hand to deep-dive into your cluster
● Spark output phase can be parallelized even when overwriting specific partitions
○ Use dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application can be useful
● Optimizing data pipelines is an ongoing effort (not a one-off)
41. @ItaiYaffe, @ettigur
Want to know more?
● Women in Big Data Israel YouTube channel - https://tinyurl.com/y5jozqpg
● Marketing Performance Analytics Using Druid - https://tinyurl.com/t3dyo5b
● NMC Tech Blog - https://medium.com/nmc-techblog