Optimizing Spark-based data pipelines -
Are you up for it?
Etti Gur & Itai Yaffe
Nielsen
@ItaiYaffe, @ettigur
Introduction
Etti Gur
● Senior Big Data developer
● Building data pipelines using Spark, Kafka, Druid, Airflow and more
Itai Yaffe
● Tech Lead, Big Data group
● Dealing with Big Data challenges since 2012
● Women in Big Data Israeli chapter co-founder
Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Working with Spark?
● First time at a Women in Big Data meetup?
Agenda
● Nielsen Marketing Cloud (NMC)
○ About
○ The challenges
● The business use-case and our data pipeline
● Optimizing Spark resource allocation & utilization
○ Tools and examples
● Parallelizing Spark output phase with dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
Nielsen Marketing Cloud in numbers
● >10B events/day
● >20TB/day (S3)
● 1000s nodes/day
● 10s of TB ingested/day (Druid)
● $100Ks/month
The challenges
Scalability
Cost Efficiency
Fault-tolerance
The business use-case - measure campaigns in-flight
What are the logical phases of a campaign?
The business use-case - measure campaigns in-flight
What does a funnel look like?
● AD EXPOSURE: 100M (85M drop-off)
● HOMEPAGE: 15M (5M drop-off)
● PRODUCT PAGE: 10M (7M drop-off)
● CHECKOUT: 3M
In-flight analytics pipeline - high-level architecture
Data Lake, partitioned by date (date=2019-12-16, date=2019-12-17, date=2019-12-18)
1. Mart Generator: read files of last day from the Data Lake
2. Mart Generator: write files by campaign,date to the Campaigns' marts
3. Enricher: read files per campaign from the Campaigns' marts
4. Enricher: write files by date,campaign to Enriched data
5. Load data by campaign
In-flight analytics pipeline - problems
The Problem | Metric
Growing execution time | >24 hours/day
Stability | Sporadic failures
High costs | $33,000/month
Exhausting recovery | Many hours/incident (“babysitting”)
In-flight analytics pipeline - Mart Generator
● Step 1: read files of last day from the Data Lake
● Step 2: write files by campaign,date to the Campaigns' marts
Mart Generator problems
● Execution time: ran for over 7 hours
● Stability: experienced sporadic OOM failures
Digging deeper into resource allocation & utilization
There are various ways to examine Spark resource allocation and utilization:
● Spark UI (e.g. the Executors tab)
● Spark metrics system, e.g.:
○ JMX
○ Graphite
● YARN UI (if applicable)
● Cluster-wide monitoring tools, e.g. Ganglia
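As a rough sketch of the Spark metrics system option above, the settings below route metrics to a Graphite sink. The host name and port are placeholders (assumptions, not from the talk), and the object and method names are illustrative; the `spark.metrics.conf.` prefix is how Spark accepts metrics properties as regular configuration entries.

```scala
object MetricsConfSketch {
  // Hypothetical Graphite endpoint; replace with your own host and port.
  val graphiteHost = "graphite.example.com"
  val graphitePort = 2003

  // Metrics properties passed as regular Spark configuration entries,
  // using the "spark.metrics.conf." prefix.
  val metricsConf: Seq[(String, String)] = Seq(
    "spark.metrics.conf.*.sink.graphite.class" -> "org.apache.spark.metrics.sink.GraphiteSink",
    "spark.metrics.conf.*.sink.graphite.host" -> graphiteHost,
    "spark.metrics.conf.*.sink.graphite.port" -> graphitePort.toString,
    "spark.metrics.conf.*.sink.graphite.period" -> "10",
    "spark.metrics.conf.*.sink.graphite.unit" -> "seconds"
  )

  // Render the settings as spark-submit arguments.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit =
    println(toSubmitArgs(metricsConf).mkString(" "))
}
```

When passing these on a command line, quote each key=value pair so the shell does not expand the `*`.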
Resource allocation - YARN UI
Resource utilization - Ganglia
Mart Generator - initial resource allocation
● EMR cluster with 32 x i3.8xlarge worker nodes
○ Each with 32 cores, 244GB RAM and NVMe SSD
● spark.executor.cores=6
● spark.executor.memory=40g
● spark.executor.memoryOverhead=4g (0.10 * executorMemory)
● Executors per node = 32/6 = 5 (with 2 cores left over)
● Unused resources per node = 24GB mem (244 - 5 * 44), 2 cores
● Unused resources across the cluster = 768GB mem, 64 cores (32 nodes)
○ Remember our OOM failures?
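The arithmetic behind these numbers can be sketched as follows. This is a simplified model that only divides node cores and memory between executors; it ignores memory reserved for the OS and YARN, and the names (ResourceMath, unusedPerNode) are illustrative rather than taken from the talk.

```scala
object ResourceMath {
  // Per-node leftovers for a given executor shape.
  final case class Unused(executorsPerNode: Int, cores: Int, memGb: Int)

  def unusedPerNode(nodeCores: Int, nodeMemGb: Int,
                    executorCores: Int, executorMemGb: Int, overheadGb: Int): Unused = {
    // The executor count is bounded by cores here (integer division).
    val executors = nodeCores / executorCores
    val usedCores = executors * executorCores
    // Each executor consumes its heap plus its memory overhead.
    val usedMem = executors * (executorMemGb + overheadGb)
    Unused(executors, nodeCores - usedCores, nodeMemGb - usedMem)
  }

  def main(args: Array[String]): Unit = {
    val nodes = 32
    val perNode = unusedPerNode(nodeCores = 32, nodeMemGb = 244,
      executorCores = 6, executorMemGb = 40, overheadGb = 4)
    println(s"Executors per node: ${perNode.executorsPerNode}")
    println(s"Unused per node: ${perNode.memGb}GB mem, ${perNode.cores} cores")
    println(s"Unused across cluster: ${perNode.memGb * nodes}GB mem, ${perNode.cores * nodes} cores")
  }
}
```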
How to better allocate resources?
EC2 instance type | Best for | Cores per executor | Memory per executor | Overhead per executor | Executors per node
i3.8xlarge (32 vCore, 244 GiB mem, 4 x 1,900 GB NVMe SSD) | Memory & storage optimized | 8 | 50g | 8g | 32/8 = 4
r4.8xlarge (32 vCore, 244 GiB mem) | Memory optimized | 8 | 50g | 8g | 32/8 = 4
c4.8xlarge (36 vCore, 60 GiB mem) | Compute optimized | 6 | 7g | 2g | 36/6 = 6
Number of available executors = (total cores/num-cores-per-executor)
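The table can also be expressed as data, which makes the rule of thumb above easy to check. A minimal sketch, assuming the three executor settings are passed as spark-submit `--conf` arguments; the ExecutorSizing and Sizing names are made up for illustration.

```scala
object ExecutorSizing {
  // One row of the table: an instance type and the executor shape chosen for it.
  final case class Sizing(instanceType: String, nodeCores: Int, executorCores: Int,
                          executorMemory: String, memoryOverhead: String) {
    // Executors per node = total cores / cores per executor (integer division).
    def executorsPerNode: Int = nodeCores / executorCores

    // The three executor settings, rendered as spark-submit arguments.
    def submitArgs: Seq[String] = Seq(
      "--conf", s"spark.executor.cores=$executorCores",
      "--conf", s"spark.executor.memory=$executorMemory",
      "--conf", s"spark.executor.memoryOverhead=$memoryOverhead"
    )
  }

  val table: Seq[Sizing] = Seq(
    Sizing("i3.8xlarge", 32, 8, "50g", "8g"),
    Sizing("r4.8xlarge", 32, 8, "50g", "8g"),
    Sizing("c4.8xlarge", 36, 6, "7g", "2g")
  )

  def main(args: Array[String]): Unit =
    table.foreach { s =>
      println(s"${s.instanceType}: ${s.executorsPerNode} executors per node; ${s.submitArgs.mkString(" ")}")
    }
}
```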
Mart Generator - better resource allocation
Mart Generator - better resource utilization, but...
Mart Generator requirement - overwrite latest date only
Data Lake, partitioned by date (date=2019-11-22, date=2019-11-23, date=2019-11-24)
1. Mart Generator: read files of last day from the Data Lake
2. Mart Generator: write files by campaign,date to the Campaigns’ marts
Overwrite partitions - the “trivial” Spark implementation
dataframe.write
.partitionBy("campaign", "date")
.mode(SaveMode.Overwrite)
.parquet(folderPath)
The result:
● Output written in parallel
● Overwriting the entire root folder
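Overwrite specific partitions - our “naive” implementation
dataframesMap is of type <campaignCode, campaignDataframe>
dataframesMap.foreach { case (campaignCode, campaignDataframe) =>
  // One blocking write per campaign, issued one after the other,
  // each targeting that campaign's own partition folder
  val outputPath = rootPath + "campaign=" + campaignCode + "/date=" + date
  campaignDataframe.write.mode(SaveMode.Overwrite).parquet(outputPath)
}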
The result:
● Overwriting only relevant folders
● An extremely long tail (w.r.t. execution time)
Overwrite specific partitions - Spark 2.3 implementation
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dataframe.write
.partitionBy("campaign", "date")
.mode(SaveMode.Overwrite)
.parquet(folderPath)
The result:
● Output written in parallel
● Overwriting only relevant folders
● Possible side effect: the driver issues the S3 MV (move) commands sequentially
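The difference between the two overwrite behaviours can be illustrated with a toy model. This is not Spark's implementation, only a sketch of the semantics: the output folder is modeled as a map from (campaign, date) to file names, and all names below are hypothetical.

```scala
object OverwriteModes {
  // A partition is identified by (campaign, date); the value is its list of file names.
  type Partition = (String, String)
  type Table = Map[Partition, Seq[String]]

  // Default (static) overwrite without a partition spec:
  // everything under the root folder is replaced by the new output.
  def overwriteStatic(existing: Table, incoming: Table): Table = incoming

  // Dynamic overwrite: only partitions present in the new output are replaced,
  // all other existing partitions are left untouched.
  def overwriteDynamic(existing: Table, incoming: Table): Table = existing ++ incoming

  def main(args: Array[String]): Unit = {
    val existing: Table = Map(
      ("c1", "2019-11-23") -> Seq("old-a.parquet"),
      ("c1", "2019-11-24") -> Seq("old-b.parquet")
    )
    val incoming: Table = Map(("c1", "2019-11-24") -> Seq("new-b.parquet"))

    println(s"static:  ${overwriteStatic(existing, incoming)}")
    println(s"dynamic: ${overwriteDynamic(existing, incoming)}")
  }
}
```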
Mart Generator - optimal resource utilization
Mart Generator - summary
● Better resource allocation & utilization
● Execution time decreased from 7+ hours to ~40 minutes
● No sporadic OOM failures
● Overwriting only relevant folders (i.e. partitions)
In-flight analytics pipeline - Enricher
● Step 3: read files per campaign from the Campaigns' marts
● Step 4: write files by date,campaign to Enriched data
Enricher problem - execution time
● Grew from 9 hours to 18 hours
● Sometimes took more than 20 hours
Enricher - initial resource utilization
Running multiple Spark “jobs” within a single Spark application
● Create one Spark application with one SparkContext
● Create a thread pool
○ Thread pool size is configurable
● Each thread should execute a separate Spark “job” (i.e. an action)
● “Jobs” wait in a queue and are executed based on available resources
○ This is managed by Spark’s scheduler
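import java.util.concurrent.{Callable, Executors}

// Fixed-size pool: at most numOfThreads campaigns are processed concurrently
val executorService = Executors.newFixedThreadPool(numOfThreads)
// Submit one task per campaign; each task triggers its own Spark "job"
val futures = campaigns map (campaign => {
  executorService.submit(new Callable[Result]() {
    override def call(): Result = {
      val ans = processCampaign(campaign, appConf, log)
      Result(campaign.code, ans)
    }
  })
})
// Block until every campaign finishes; a failed campaign yields a placeholder Result
val completedCampaigns = futures map (future => {
  try {
    future.get()
  } catch {
    case e: Exception => {
      log.info("Some thread caused exception : " + e.getMessage)
      Result("", "", false, false)
    }
  }
})
// Release the pool's threads once all results have been collected
executorService.shutdown()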
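For experimenting without a cluster, the same pattern can be sketched in a self-contained form. Here processCampaign is a hypothetical stub and Result is a simplified case class (both are assumptions, not the talk's code); in a real application each call would trigger Spark actions on the shared SparkContext.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

object MultiJobSketch {
  final case class Result(code: String, succeeded: Boolean)

  // Hypothetical stand-in for the real per-campaign work (which would run Spark actions).
  def processCampaign(code: String): Boolean =
    if (code.isEmpty) throw new IllegalArgumentException("empty campaign code") else true

  def runAll(campaignCodes: Seq[String], numOfThreads: Int): Seq[Result] = {
    val pool = Executors.newFixedThreadPool(numOfThreads)
    try {
      // Submit everything first so tasks can run concurrently,
      // then collect the results in submission order.
      val futures = campaignCodes.map { code =>
        code -> pool.submit(new Callable[Result] {
          override def call(): Result = Result(code, processCampaign(code))
        })
      }
      futures.map { case (code, future) =>
        // A failure inside a task surfaces here, wrapped in an ExecutionException.
        try future.get()
        catch { case _: Exception => Result(code, succeeded = false) }
      }
    } finally {
      pool.shutdown()
      pool.awaitTermination(1, TimeUnit.MINUTES)
    }
  }

  def main(args: Array[String]): Unit =
    runAll(Seq("c1", "", "c2"), numOfThreads = 2).foreach(println)
}
```

Spark schedules the jobs of one application in FIFO order by default; setting spark.scheduler.mode to FAIR is an option when concurrent jobs should share resources more evenly.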
Spark UI - multiple Spark “jobs” within a single Spark application
Enricher - optimal resource utilization
Enricher - summary
● Running multiple Spark “jobs” within a single Spark app
● Better resource utilization
● Execution time decreased from 20+ hours to ~1 hour 20 minutes
In-flight analytics pipeline - before & after
The Problem | Before | After
Growing execution time | >24 hours/day | 2 hours/day
Stability | Sporadic failures | Improved
High costs | $33,000/month | $3,000/month
Exhausting recovery | Many hours/incident (“babysitting”) | 2 hours/incident
> 90% improvement
What have we learned?
● You too can optimize Spark resource allocation & utilization
○ Leverage the tools at hand to deep-dive into your cluster
● Spark output phase can be parallelized even when overwriting specific partitions
○ Use dynamic partition inserts
● Running multiple Spark "jobs" within a single Spark application can be useful
● Optimizing data pipelines is an ongoing effort (not a one-off)
Want to know more?
● Women in Big Data Israel YouTube channel - https://tinyurl.com/y5jozqpg
● Marketing Performance Analytics Using Druid - https://tinyurl.com/t3dyo5b
● NMC Tech Blog - https://medium.com/nmc-techblog
QUESTIONS
THANK YOU
Itai Yaffe
Etti Gur

More Related Content

What's hot

Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Continuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert XueContinuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert XueDatabricks
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightDatabricks
 
A Practical Enterprise Feature Store on Delta Lake
A Practical Enterprise Feature Store on Delta LakeA Practical Enterprise Feature Store on Delta Lake
A Practical Enterprise Feature Store on Delta LakeDatabricks
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLDatabricks
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Databricks
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Making Nested Columns as First Citizen in Apache Spark SQL
Making Nested Columns as First Citizen in Apache Spark SQLMaking Nested Columns as First Citizen in Apache Spark SQL
Making Nested Columns as First Citizen in Apache Spark SQLDatabricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 

What's hot (20)

Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Continuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert XueContinuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert Xue
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
 
A Practical Enterprise Feature Store on Delta Lake
A Practical Enterprise Feature Store on Delta LakeA Practical Enterprise Feature Store on Delta Lake
A Practical Enterprise Feature Store on Delta Lake
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQL
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Making Nested Columns as First Citizen in Apache Spark SQL
Making Nested Columns as First Citizen in Apache Spark SQLMaking Nested Columns as First Citizen in Apache Spark SQL
Making Nested Columns as First Citizen in Apache Spark SQL
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 

Similar to Optimizing Spark-based data pipelines - are you up for it?

Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Etti Gur
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryMárton Kodok
 
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...DataBench
 
Exploratory Analysis of Spark Structured Streaming
Exploratory Analysis of Spark Structured StreamingExploratory Analysis of Spark Structured Streaming
Exploratory Analysis of Spark Structured Streamingt_ivanov
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupMárton Kodok
 
Funnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and DruidFunnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and DruidDatabricks
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overviewjimliddle
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow ObstructionsTatiana Al-Chueyr
 
Spring Data and In-Memory Data Management in Action
Spring Data and In-Memory Data Management in ActionSpring Data and In-Memory Data Management in Action
Spring Data and In-Memory Data Management in ActionJohn Blum
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
Graph Gurus 15: Introducing TigerGraph 2.4
Graph Gurus 15: Introducing TigerGraph 2.4 Graph Gurus 15: Introducing TigerGraph 2.4
Graph Gurus 15: Introducing TigerGraph 2.4 TigerGraph
 
Big Data Driven At Eway
Big Data Driven At Eway Big Data Driven At Eway
Big Data Driven At Eway Tu Pham
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"Pavel Hardak
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldDatabricks
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDatabricks
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleJim Dowling
 

Similar to Optimizing Spark-based data pipelines - are you up for it? (20)

Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
 
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
Exploratory Analysis of Spark Structured Streaming, Todor Ivanov, Jason Taafe...
 
Exploratory Analysis of Spark Structured Streaming
Exploratory Analysis of Spark Structured StreamingExploratory Analysis of Spark Structured Streaming
Exploratory Analysis of Spark Structured Streaming
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 
Funnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and DruidFunnel Analysis with Apache Spark and Druid
Funnel Analysis with Apache Spark and Druid
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overview
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
 
Spring Data and In-Memory Data Management in Action
Spring Data and In-Memory Data Management in ActionSpring Data and In-Memory Data Management in Action
Spring Data and In-Memory Data Management in Action
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Graph Gurus 15: Introducing TigerGraph 2.4
Graph Gurus 15: Introducing TigerGraph 2.4 Graph Gurus 15: Introducing TigerGraph 2.4
Graph Gurus 15: Introducing TigerGraph 2.4
 
Big Data Driven At Eway
Big Data Driven At Eway Big Data Driven At Eway
Big Data Driven At Eway
 
Spark Meetup
Spark MeetupSpark Meetup
Spark Meetup
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 
Lipstick On Pig
Lipstick On Pig Lipstick On Pig
Lipstick On Pig
 

More from Itai Yaffe

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingItai Yaffe
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationItai Yaffe
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsItai Yaffe
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Itai Yaffe
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Itai Yaffe
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesItai Yaffe
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsItai Yaffe
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your DataItai Yaffe
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesItai Yaffe
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Optimizing Spark-based data pipelines - are you up for it?

  • 1. Optimizing Spark-based data pipelines - Are you up for it? Etti Gur & Itai Yaffe Nielsen
  • 2. @ItaiYaffe, @ettigur Introduction Etti Gur Itai Yaffe ● Senior Big Data developer ● Building data pipelines using Spark, Kafka, Druid, Airflow and more ● Tech Lead, Big Data group ● Dealing with Big Data challenges since 2012 ● Women in Big Data Israeli chapter co-founder
  • 3. Introduction - part 2 (or: “your turn…”) ● Data engineers? Data architects? Something else? ● Working with Spark? ● First time at a Women in Big Data meetup?
  • 4. Agenda ● Nielsen Marketing Cloud (NMC) ○ About ○ The challenges ● The business use-case and our data pipeline ● Optimizing Spark resource allocation & utilization ○ Tools and examples ● Parallelizing Spark output phase with dynamic partition inserts ● Running multiple Spark "jobs" within a single Spark application
  • 5. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen in March 2015 ● A Data company ● Machine learning models for insights ● Targeting ● Business decisions
  • 6. Nielsen Marketing Cloud in numbers >10B events/day >20TB/day S3 1000s nodes/day 10s of TB ingested/day druid $100Ks/month
  • 9. The business use-case - measure campaigns in-flight: What are the logical phases of a campaign?
  • 10. The business use-case - measure campaigns in-flight: What does a funnel look like? AD EXPOSURE 100M → (85M drop-off) → HOMEPAGE 15M → (5M drop-off) → PRODUCT PAGE 10M → (7M drop-off) → CHECKOUT 3M
  • 11. In-flight analytics pipeline - high-level architecture: Data Lake (date=2019-12-16, date=2019-12-17, date=2019-12-18) → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns' marts → 3. Read files per campaign → Enricher → 4. Write files by date,campaign → Enriched data → 5. Load data by campaign
  • 12. In-flight analytics pipeline - problems (The Problem: Metric) ● Growing execution time: >24 hours/day ● Stability: Sporadic failures ● High costs: $33,000/month ● Exhausting recovery: Many hours/incident (“babysitting”)
  • 13. In-flight analytics pipeline - Mart Generator: Data Lake (date=2019-12-16, date=2019-12-17, date=2019-12-18) → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns' marts → 3. Read files per campaign → Enricher → 4. Write files by date,campaign → Enriched data → 5. Load data by campaign
  • 14. Mart Generator problems ● Execution time: ran for over 7 hours ● Stability: experienced sporadic OOM failures
  • 15. Digging deeper into resource allocation & utilization. There are various ways to examine Spark resource allocation and utilization: ● Spark UI (e.g. Executors tab) ● Spark metrics system, e.g.: ○ JMX ○ Graphite ● YARN UI (if applicable) ● Cluster-wide monitoring tools, e.g. Ganglia
  • 20. Mart Generator - initial resource allocation ● EMR cluster with 32 x i3.8xlarge worker nodes ○ Each with 32 cores, 244GB RAM, NVMe SSD ● spark.executor.cores=6 ● spark.executor.memory=40g ● spark.executor.memoryOverhead=4g (0.10 * executorMemory) ● Executors per node = 32/6 = 5 (2 cores left over) ● Unused resources per node = 24GB mem (244 - 5 x 44), 2 cores ● Unused resources across the cluster = 768GB mem, 64 cores ○ Remember our OOM failures?
  • 21. How to better allocate resources?
        EC2 instance type | Best for | Cores per executor | Memory per executor | Overhead per executor | Executors per node
        i3.8xlarge (32 vCore, 244 GiB mem, 4 x 1,900 NVMe SSD) | Memory & storage optimized | 8 | 50g | 8g | 32/8 = 4
        r4.8xlarge (32 vCore, 244 GiB mem) | Memory optimized | 8 | 50g | 8g | 32/8 = 4
        c4.8xlarge (36 vCore, 60 GiB mem) | Compute optimized | 6 | 7g | 2g | 36/6 = 6
    Number of available executors = (total cores / num-cores-per-executor)
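The per-node arithmetic behind slides 20 and 21 can be sketched in a few lines of plain Scala. This is only an illustration: the names `NodeSpec`, `ExecutorSpec`, `Allocation` and `ExecutorSizing` are hypothetical and not from the talk, and the sketch assumes the whole instance memory is available to executors (in practice YARN and the OS reserve part of it).

```scala
// Hypothetical helper types, used only for this illustration.
final case class NodeSpec(cores: Int, memoryGb: Int)
final case class ExecutorSpec(cores: Int, memoryGb: Int, overheadGb: Int)
final case class Allocation(executorsPerNode: Int, unusedCores: Int, unusedMemoryGb: Int)

object ExecutorSizing {
  // How many executors fit on one node, and what is left idle.
  // Simplifying assumption: all of the node's cores and memory can be given to executors.
  def perNode(node: NodeSpec, exec: ExecutorSpec): Allocation = {
    val memoryPerExecutor = exec.memoryGb + exec.overheadGb
    val byCores = node.cores / exec.cores
    val byMemory = node.memoryGb / memoryPerExecutor
    val executors = math.min(byCores, byMemory)
    Allocation(
      executorsPerNode = executors,
      unusedCores = node.cores - executors * exec.cores,
      unusedMemoryGb = node.memoryGb - executors * memoryPerExecutor
    )
  }
}
```

With the initial settings (6 cores, 40g + 4g per executor on a 32-core, 244GB node) this gives 5 executors with 2 cores and 24GB idle per node, matching slide 20. With 8 cores and 50g + 8g it gives 4 executors with no idle cores and 12GB idle.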
  • 22. Mart Generator - better resource allocation
  • 23. Mart Generator - better resource utilization, but...
  • 24. Mart Generator requirement - overwrite latest date only: Data Lake (date=2019-11-22, date=2019-11-23, date=2019-11-24) → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns’ marts
  • 25. Overwrite partitions - the “trivial” Spark implementation
        import org.apache.spark.sql.SaveMode

        dataframe.write
          .partitionBy("campaign", "date")
          .mode(SaveMode.Overwrite)
          .parquet(folderPath)
    The result: ● Output written in parallel ● Overwriting the entire root folder
  • 26. Overwrite specific partitions - our “naive” implementation
        // dataframesMap is of type <campaignCode, campaignDataframe>
        dataframesMap.foreach(campaign => {
          val outputPath = rootPath + "campaign=" + campaign.code + "/date=" + date
          campaign.dataframe.write.mode(SaveMode.Overwrite).parquet(outputPath)
        })
    The result: ● Overwriting only relevant folders ● An extremely long tail (w.r.t. execution time)
  • 27. Overwrite specific partitions - Spark 2.3 implementation
        sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
        dataframe.write
          .partitionBy("campaign", "date")
          .mode(SaveMode.Overwrite)
          .parquet(folderPath)
    The result: ● Output written in parallel ● Overwriting only relevant folders ● Possible side-effect due to sequential S3 move (mv) commands issued by the driver
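The difference between the two overwrite behaviours on slides 25 and 27 can be modelled with a toy example that does not use Spark at all. In this sketch a "table" is just a map from a (campaign, date) partition to a string standing in for its files; `OverwriteModes` and its methods are hypothetical names used only to illustrate the semantics.

```scala
object OverwriteModes {
  // A partition is identified by (campaign, date); the value stands in for its files.
  type Partition = (String, String)
  type Table = Map[Partition, String]

  // Default ("static") behaviour: everything under the root folder is dropped,
  // so only the partitions present in the new output survive.
  def staticOverwrite(existing: Table, output: Table): Table = output

  // "dynamic" behaviour: only partitions present in the new output are replaced,
  // all other existing partitions are left untouched.
  def dynamicOverwrite(existing: Table, output: Table): Table = existing ++ output
}
```

For example, if the existing table holds date=2019-11-23 and date=2019-11-24 for a campaign and the new output contains only date=2019-11-24, the static model loses 2019-11-23 while the dynamic model keeps it and replaces only 2019-11-24.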
  • 28. Mart Generator - optimal resource utilization
  • 29. Mart Generator - summary ● Better resource allocation & utilization ● Execution time decreased from 7+ hours to ~40 minutes ● No sporadic OOM failures ● Overwriting only relevant folders (i.e. partitions)
  • 30. In-flight analytics pipeline - Enricher: Data Lake (date=2019-12-16, date=2019-12-17, date=2019-12-18) → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns' marts → 3. Read files per campaign → Enricher → 4. Write files by date,campaign → Enriched data → 5. Load data by campaign
  • 31. Enricher problem - execution time ● Grew from 9 hours to 18 hours ● Sometimes took more than 20 hours
  • 32. Enricher - initial resource utilization
  • 33. Running multiple Spark “jobs” within a single Spark application ● Create one Spark application with one SparkContext ● Create a thread pool ○ Thread pool size is configurable ● Each thread should execute a separate Spark “job” (i.e. action) ● “Jobs” wait in a queue and are executed based on available resources ○ This is managed by Spark’s scheduler
  • 34. Running multiple Spark “jobs” within a single Spark application
        import java.util.concurrent.{Callable, Executors}

        val executorService = Executors.newFixedThreadPool(numOfThreads)
        // Submit one task per campaign; each task triggers its own Spark "job"
        val futures = campaigns map (campaign => {
          executorService.submit(new Callable[Result]() {
            override def call: Result = {
              val ans = processCampaign(campaign, appConf, log)
              Result(campaign.code, ans)
            }
          })
        })
        // Wait for every task; a failed task is logged and replaced by a placeholder Result
        val completedCampaigns = futures map (future => {
          try {
            future.get()
          } catch {
            case e: Exception => {
              log.info("Some thread caused exception : " + e.getMessage)
              Result("", "", false, false)
            }
          }
        })
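The same thread-pool pattern can be tried without a cluster. The sketch below is a self-contained variation that uses Scala futures on a fixed-size pool; `ConcurrentJobs`, `Outcome` and `runAll` are hypothetical names, and the `process` function is a stand-in for whatever triggers the Spark action per campaign (in a real application it would use the shared SparkSession).

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ConcurrentJobs {
  final case class Outcome(campaign: String, succeeded: Boolean)

  // Runs `process` for every campaign on a fixed-size thread pool and waits for all of them.
  // A campaign whose processing throws is reported as failed instead of failing the whole run.
  def runAll(campaigns: Seq[String], numOfThreads: Int)(process: String => Unit): Seq[Outcome] = {
    val pool = Executors.newFixedThreadPool(numOfThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = campaigns.map { campaign =>
        Future(process(campaign))
          .map(_ => Outcome(campaign, succeeded = true))
          .recover { case _: Exception => Outcome(campaign, succeeded = false) }
      }
      // Future.sequence keeps the results in the same order as the input campaigns.
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }
}
```

The pool size plays the role of numOfThreads on the slide: it caps how many campaigns are submitted at once, while the Spark scheduler decides how the resulting jobs share the executors.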
  • 35. Spark UI - multiple Spark “jobs” within a single Spark application
  • 36. Enricher - optimal resource utilization
  • 37. Enricher - summary ● Running multiple Spark “jobs” within a single Spark app ● Better resource utilization ● Execution time decreased from 20+ hours to ~1:20 hours
  • 39. In-flight analytics pipeline - before & after (The Problem: Before → After) ● Growing execution time: >24 hours/day → 2 hours/day ● Stability: Sporadic failures → Improved ● High costs: $33,000/month → $3,000/month ● Exhausting recovery: Many hours/incident (“babysitting”) → 2 hours/incident ● > 90% improvement
  • 40. What have we learned? ● You too can optimize Spark resource allocation & utilization ○ Leverage the tools at hand to deep-dive into your cluster ● Spark output phase can be parallelized even when overwriting specific partitions ○ Use dynamic partition inserts ● Running multiple Spark "jobs" within a single Spark application can be useful ● Optimizing data pipelines is an ongoing effort (not a one-off)
  • 41. Want to know more? ● Women in Big Data Israel YouTube channel - https://tinyurl.com/y5jozqpg ● Marketing Performance Analytics Using Druid - https://tinyurl.com/t3dyo5b ● NMC Tech Blog - https://medium.com/nmc-techblog
  • 43. THANK YOU Itai Yaffe, Etti Gur