DESIGNING ETL PIPELINES WITH
How to architect things right
Spark Summit Europe
16 October 2019
Tathagata “TD” Das
@tathadas
STRUCTURED
STREAMING
About Me
Started Streaming project in
AMPLab, UC Berkeley
Currently focused on Structured Streaming
and Delta Lake
Staff Engineer on the StreamTeam @
Team Motto: "We make all your streams come true"
Structured Streaming
Distributed stream processing built on SQL engine
High throughput, second-scale latencies
Fault-tolerant, exactly-once
Great set of connectors
Philosophy: Treat data streams like unbounded tables
Users write batch-like queries on tables
Spark will continuously execute the queries incrementally on streams
3
Structured Streaming
4
Example
Read JSON data from Kafka
Parse nested JSON
Store in structured Parquet table
Get end-to-end failure guarantees
ETL
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/"
.trigger("1 minute")
.option("checkpointLocation", "…")
.start()
Anatomy of a Streaming Query
5
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/"
.trigger("1 minute")
.option("checkpointLocation", "…")
.start()
Specify where to read data from
Specify data transformations
Specify where to write data to
Specify how to process data
DataFrames,
Datasets, SQL
Logical
Plan
Read from
Kafka
Project
cast(value) as json
Project
from_json(json)
Write to
Parquet
Spark automatically streamifies!
Spark SQL converts batch-like query to a series of incremental
execution plans operating on new batches of data
Kafka
Source
Optimized
Operator
codegen, off-heap, etc.
Parquet
Sink
Optimized
Plan
Series of Incremental
Execution Plans
t = 1, t = 2, t = 3: process new data at each trigger
spark.readStream.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/"
.trigger("1 minute")
.option("checkpointLocation", "…")
.start()
• ACID transactions
• Schema Enforcement and Evolution
• Data versioning and Audit History
• Time travel to old versions
• Open formats
• Scalable Metadata Handling
• Great with batch + streaming
• Upserts and Deletes
Open-source storage layer that brings ACID transactions to
Apache Spark™ and big data workloads.
https://delta.io/
THE GOOD OF DATA LAKES
• Massive scale out
• Open Formats
• Mixed workloads
THE GOOD OF DATA WAREHOUSES
• Pristine Data
• Transactional Reliability
• Fast SQL Queries
Open-source storage layer that brings ACID transactions to
Apache Spark™ and big data workloads.
https://delta.io/
STRUCTURED
STREAMING
How to build streaming data
pipelines with them?
STRUCTURED
STREAMING
What are the design patterns
to correctly architect
streaming data pipelines?
Another streaming design pattern talk????
Most talks
Focus on a pure
streaming engine
Explain one way of
achieving the end goal
11
This talk
Spark is more than a
streaming engine
Spark has multiple ways of
achieving the end goal with
tunable perf, cost and quality
This talk
How to think about design
Common design patterns
How we are making this easier
12
Streaming Pipeline Design
13
Data streams Insights
????
14
????
What?
Why?
How?
What?
15
What is your input?
What is your data?
What format and system is
your data in?
What is your output?
What results do you need?
What throughput and
latency do you need?
Why?
16
Why do you want this output in this way?
Who is going to take actions based on it?
When and how are they going to consume it?
humans?
computers?
Why? Common mistakes!
#1
"I want my dashboard with counts to
be updated every second"
#2
"I want to generate automatic alerts
with up-to-the-last second counts"
(but my input data is often delayed)
17
No point updating every second if humans
are going to take actions in minutes or hours
No point taking fast actions on
low quality data and results
Why? Common mistakes!
#3
"I want to train machine learning
models on the results"
(but my results are in a key-value store)
18
Key-value stores are not great
for large, repeated data scans
which machine learning
workloads perform
How?
19
????
How to process
the data?
????
How to store
the results?
Streaming Design Patterns
20
What?
Why?
How?
Complex ETL
Pattern 1: ETL
21
Input: unstructured input stream
from files, Kafka, etc.
What?
Why? Query latest structured data interactively or with periodic jobs
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
Output: structured
tabular data
P1: ETL
22
Convert unstructured input to
structured tabular data
Latency: few minutes
What? How?
Why?
Query latest structured data
interactively or with periodic jobs
Process: Use Structured Streaming query to transform
unstructured, dirty data
Run 24/7 on a cluster with default trigger
Store: Save to structured scalable storage that supports
data skipping, etc.
E.g.: Parquet, ORC, or even better, Delta Lake
STRUCTURED
STREAMING
ETL QUERY
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
P1: ETL with Delta Lake
23
How?
Store: Save to
STRUCTURED
STREAMING
Read with snapshot guarantees while writes are in progress
Concurrently reprocess data with full ACID guarantees
Coalesce small files into larger files
Update table to fix mistakes in data
Delete data for GDPR
REPROCESS
ETL QUERY
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
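As a rough sketch, the update and GDPR-delete steps look like this with the Delta Lake Scala API (the path, predicates, and column names here are hypothetical):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

val table = DeltaTable.forPath(spark, "/deltaTable/")

// Update table to fix mistakes in data (hypothetical predicate and column)
table.update(expr("status = 'postpo'"), Map("status" -> lit("postponed")))

// Delete data for GDPR (hypothetical user column)
table.delete(expr("user_id = 'user-to-forget'"))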
P1.1: Cheaper ETL
24
Convert unstructured input to
structured tabular data
Latency: few minutes → hours
Avoid keeping clusters up 24/7
What? How?
Why?
Query latest data interactively
or with periodic jobs
Cheaper solution
Process: Still use Structured Streaming query!
Run streaming query with "trigger.once" for
processing all available data since last batch
Set up external schedule (every few hours?) to
periodically start a cluster and run one batch
STRUCTURED
STREAMING
RESTART ON
SCHEDULE
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
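A minimal sketch of this mode, assuming the same Kafka source as before (server list elided, paths are placeholders); the only real change from the 24/7 query is the trigger:

import org.apache.spark.sql.streaming.Trigger

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as json")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/etl/") // placeholder
  .trigger(Trigger.Once()) // process everything new since the last run, then stop
  .start("/deltaTable/")   // placeholder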
P1.2: Query faster than ETL!
25
Latency: hours → seconds
What? How?
Why?
Query the latest up-to-the-last-second data interactively
Query data in Kafka directly using Spark SQL
Can process records up to the last one received by
Kafka when the query was started
SQL
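As a sketch, the same Kafka source can be read with the batch DataFrame API and queried with SQL (topic and server list are placeholders):

// Batch read: processes the offsets available when the query starts
val kafkaDF = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .selectExpr("cast (value as string) as json")

kafkaDF.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()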
Pattern 2: Key-value output
26
Input: new data
for each key
Lookup latest value for key (dashboards, websites, etc.)
OR
Summary tables for querying interactively or with periodic jobs
KEY LATEST VALUE
key1 value2
key2 value3
{ "key1": "value1" }
{ "key1": "value2" }
{ "key2": "value3" }
Output: updated
values for each key
Aggregations (sum, count, …)
Sessionizations
What?
Why?
P2.1: Key-value output for lookup
27
Generate updated values for keys
Latency: seconds/minutes
What? How?
Why?
Lookup latest value for key
STRUCTURED
STREAMING
Process: Use Structured Streaming with
stateful operations for aggregation
Store: Save in key-value stores optimized
for single key lookups
STATEFUL AGGREGATION LOOKUP
{ "key1": "value1" }
{ "key1": "value2" }
{ "key2": "value3" }
P2.2: Key-value output for analytics
28
Generate updated values for keys
Latency: seconds/minutes
What? How?
Why?
Lookup latest value for key
Summary tables for analytics
Process: Use Structured Streaming with stateful operations for aggregation
Store: Save in Delta Lake!
STRUCTURED
STREAMING
STATEFUL AGGREGATION SUMMARY
Delta Lake supports upserts using MERGE
{ "key1": "value1" }
{ "key1": "value2" }
{ "key2": "value3" }
P2.2: Key-value output for analytics
29
How?
STRUCTURED
STREAMING
STATEFUL AGGREGATION
streamingDataFrame.writeStream.foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
  DeltaTable.forPath(spark, "/aggs/").as("t")
    .merge(batchOutputDF.as("s"), "t.key = s.key")
    .whenMatched().update(...)
    .whenNotMatched().insert(...)
    .execute()
}.start()
SUMMARY
Stateful operations for aggregation
{ "key1": "value1" }
{ "key1": "value2" }
{ "key2": "value3" }
Delta Lake supports upserts using
Merge SQL operation
Scala/Java/Python APIs with
same semantics as SQL Merge
SQL Merge supported in
Databricks Delta, not in OSS yet
P2.2: Key-value output for analytics
30
How?
Stateful aggregation requires setting a
watermark to drop very late data
Dropping some data leads to some
inaccuracies
Stateful operations for aggregation
Delta Lake supports upserts
using Merge
STRUCTURED
STREAMING
STATEFUL AGGREGATION SUMMARY
{ "key1": "value1" }
{ "key1": "value2" }
{ "key2": "value3" }
P2.3: Key-value aggregations for analytics
31
Generate aggregated values
for keys
Latency: hours/days
Do not drop any late data
What? How?
Why?
Summary tables for analytics
Correct
Process: ETL to structured table (no stateful aggregation)
Store: Save in Delta Lake
Post-process: Aggregate after all delayed data has been received
STRUCTURED
STREAMING
STATEFUL AGGREGATION
ETL AGGREGATION SUMMARY
{ "key1": "value1" }
{ "key1": "value2" }
{ "key2": "value3" }
Pattern 3: Joining multiple inputs
32
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
id update value
Input: Multiple data streams
based on common key
Output:
Combined
information
What?
P3.1: Joining fast and slow streams
33
Input: One fast stream of facts
and one slow stream of
dimension changes
Output: Fast stream enriched by
data from slow stream
Example:
product sales info ("facts") enriched
by more product info ("dimensions")
What + Why?
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
id update value
How?
ETL slow stream to a dimension table
Join fast stream with snapshots of the dimension table
FAST ETL
DIMENSION
TABLE
COMBINED
TABLE
JOIN
SLOW ETL
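A sketch of this pattern as a stream-static join (paths and the join key are placeholders; the static side is a snapshot of the dimension table taken when the query starts):

val facts = spark.readStream.format("delta").load("/facts/") // fast stream
val dims = spark.read.format("delta").load("/dimensions/")   // dimension table snapshot

val enriched = facts.join(dims, "id") // stream-static join on the common key

enriched.writeStream.format("delta")
  .option("checkpointLocation", "/checkpoints/join/") // placeholder
  .start("/combined/")                                // placeholder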
P3.1: Joining fast and slow streams
34
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
id update value
SLOW ETL
How? - Caveats
FAST ETL
JOIN COMBINED
TABLE
DIMENSION
TABLE
Structured Streaming does not reload the
dimension table snapshot
Changes by the slow ETL won't be seen until restart
Better Solution
Store the dimension table in Delta Lake
Delta Lake's versioning allows changes
to be detected and the snapshot
automatically reloaded without restart**
** available only in Databricks Delta Lake
P3.1: Joining fast and slow streams
35
How? - Caveats
Delays in updates to the dimension table can
cause joining with stale dimension data
E.g. the sale of a product is received even before the
product table has any info on the product
Better Solution
Treat it as "joining fast and fast data"
P3.2: Joining fast and fast data
36
Input: Two fast streams where
either stream may be delayed
Output: Combined info even if
one is delayed over another
Example:
ad impressions and ad clicks
Use stream-stream joins in Structured Streaming
Data will be buffered as state
Watermarks define how long to buffer before
giving up on matching
What + Why?
id update value
STREAM-STREAM
JOIN
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
See my past deep dive talk for more info
How?
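A sketch of the impressions/clicks example with watermarks and a time-range join condition; the stream and column names follow the Spark documentation's example and are assumptions here:

import org.apache.spark.sql.functions.expr

val impressions = impressionsStream.withWatermark("impressionTime", "2 hours")
val clicks = clicksStream.withWatermark("clickTime", "3 hours")

// State is buffered only as long as the watermarks + join condition require
val joined = impressions.join(
  clicks,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))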
Pattern 4: Change data capture
37
INSERT a, 1
INSERT b, 2
UPDATE a, 3
DELETE b
INSERT b, 4
key value
a 3
b 4
Input: Change data based
on a primary key
Output: Final table
after changes
End-to-end replication of transactional tables into analytical tables
What?
Why?
P4: Change data capture
38
How?
Use foreachBatch and Merge
In each batch, apply changes to the
Delta table using Merge
Delta Lake's Merge supports extended
syntax that includes multiple match
clauses, clause conditions, and
deletes
INSERT a, 1
INSERT b, 2
UPDATE a, 3
DELETE b
INSERT b, 4
STRUCTURED
STREAMING
streamingDataFrame.writeStream.foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
  DeltaTable.forPath(spark, "/deltaTable/").as("t")
    .merge(batchOutputDF.as("s"), "t.key = s.key")
    .whenMatched("s.delete = false").update(...)
    .whenMatched("s.delete = true").delete()
    .whenNotMatched().insert(...)
    .execute()
}.start()
See merge examples in docs for more info
Pattern 5: Writing to multiple outputs
39
id val1 val2
id sum count
RAW LOGS
+
SUMMARIES
TABLE 2
TABLE 1
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
SUMMARY FOR
ANALYTICS
+
SUMMARY FOR
LOOKUP
TABLE CHANGE LOG
+
UPDATED TABLE
What?
P5: Serial or Parallel?
40
id val1 val2
id sum count
TABLE 2
TABLE 1
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
How?
id val1 val2
TABLE 1
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
id val1 val2
TABLE 2
Parallel: Reads the input twice, and may have to parse the data twice. Cheap or expensive depending on the size of the raw input + parsing cost.
Serial: Writes table 1 and reads it again. Cheap or expensive depending on the size and format of table 1. Higher latency.
P5: Combination!
41
Combo 1: Multiple streaming queries
id val1 val2
TABLE 1
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
id val1 val2
TABLE 2
Do expensive parsing once, write to table 1
Do cheaper follow-up processing from table 1
Good for
ETL + multiple levels of summaries
Change log + updated table
Still writing + reading table 1
id val1 val2
TABLE 3
Combo 2: Single query + foreachBatch
Compute once, write multiple times
Cheaper, but loses exactly-once guarantee
EXPENSIVE
CHEAP
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
id val1 val2
LOCATION 1
id val1 val2
LOCATION 2
streamingDataFrame.writeStream.foreachBatch { (batchSummaryData: DataFrame, batchId: Long) =>
  batchSummaryData.persist()   // compute once, reuse for both writes
  // write summary to Delta Lake
  // write summary to key value store
  batchSummaryData.unpersist()
}.start()
How?
Production Pipelines @
Bronze Tables Silver Tables Gold Tables
REPORTING
STREAMING ANALYTICS
[Diagram: Structured Streaming queries connect raw data streams through Bronze, Silver, and Gold tables]
In Progress: Declarative Pipelines
43
Allow users to declaratively express the
entire DAG of pipelines as a dataflow graph
REPORTING
STREAMING ANALYTICS
[Diagram: the same Bronze/Silver/Gold pipeline DAG, expressed declaratively]
In Progress: Declarative Pipelines
Enforce metadata, storage, and quality declaratively
dataset("warehouse")
.query(input("kafka").select(…).join(…)) // Query to materialize
.location(…) // Storage Location
.schema(…) // Optional strict schema checking
.metastoreName(…) // Hive Metastore
.description(…) // Human readable description
.expect("validTimestamp", // Expectations on data quality
"timestamp > 2012-01-01 AND …",
"fail / alert / quarantine")
*Coming Soon
In Progress: Declarative Pipelines
47
Unit test part or whole DAG with
sample input data
Integration test DAG with
production data
Deploy/upgrade new DAG code
Monitor whole DAG in one place
Rollback code / data when needed
REPORTING
STREAMING ANALYTICS
[Diagram: the whole pipeline DAG of Structured Streaming queries, managed as a single unit]
*Coming Soon
Build your own Delta Lake
at https://delta.io

More Related Content

What's hot

The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookDatabricks
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 

What's hot (20)

The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 

Similar to Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Architect Things Right

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsDataWorks Summit
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...Gianmario Spacagna
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkEren Avşaroğulları
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs FasterBob Ward
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in SparkReynold Xin
 

Similar to Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Architect Things Right (20)

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup Talk
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs Faster
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 

Recently uploaded (20)

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Architect Things Right

  • 12. This talk: How to think about design. Common design patterns. How we are making this easier.
  • 13. Streaming Pipeline Design: Data streams → ???? → Insights
  • 15. What? 15 What is your input? What is your data? What format and system is your data in? What is your output? What results do you need? What throughput and latency do you need?
  • 16. Why? 16 Why do you want this output in this way? Who is going to take actions based on it? When and how are they going to consume it? humans? computers?
  • 17. Why? Common mistakes! #1 "I want my dashboard with counts to be updated every second" #2 "I want to generate automatic alerts with up-to-the-last-second counts" (but my input data is often delayed). There is no point in updating every second if humans are going to take action in minutes or hours, and no point in taking fast actions on low-quality data and results.
  • 18. Why? Common mistakes! #3 "I want to train machine learning models on the results" (but my results are in a key-value store). Key-value stores are not great for the large, repeated data scans that machine learning workloads perform.
  • 19. How? How to process the data? How to store the results?
  • 21. Pattern 1: ETL. What? Input: unstructured input stream from files, Kafka, etc. (e.g., raw log lines like "01:06:45 WARN id=1, update failed"). Output: structured tabular data. Why? Query the latest structured data interactively or with periodic jobs.
  • 22. P1: ETL. What? Convert unstructured input to structured tabular data. Latency: a few minutes. Why? Query the latest structured data interactively or with periodic jobs. How? Process: use a Structured Streaming query to transform the unstructured, dirty data, running 24/7 on a cluster with the default trigger. Store: save to structured, scalable storage that supports data skipping etc., e.g. Parquet, ORC, or, even better, Delta Lake. A sketch follows.
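A minimal sketch of this pattern, assuming a Kafka topic named "topic", a broker address, output/checkpoint paths, and an illustrative log schema (all hypothetical):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Hypothetical schema for the parsed log records
val schema = new StructType()
  .add("id", "integer").add("status", "string").add("ts", "timestamp")

val parsedLogs = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")   // assumed broker address
  .option("subscribe", "topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")       // raw bytes -> JSON string
  .select(from_json($"json", schema).as("data"))     // parse nested JSON
  .select("data.*")                                  // flatten to tabular columns

parsedLogs.writeStream
  .format("delta")                                   // needs the Delta Lake library; "parquet" also works
  .option("checkpointLocation", "/checkpoints/etl")  // assumed path
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("/deltaTable/")                             // assumed table path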
  • 23. P1: ETL with Delta Lake. How? Store: save to Delta Lake. You can then read with snapshot guarantees while writes are in progress, concurrently reprocess data with full ACID guarantees, coalesce small files into larger files, update the table to fix mistakes in the data, and delete data for GDPR.
  • 24. P1.1: Cheaper ETL. What? Convert unstructured input to structured tabular data. Latency: few minutes → hours, without keeping clusters up 24/7. Why? Query the latest data interactively or with periodic jobs, with a cheaper solution. How? Process: still use a Structured Streaming query! Run the streaming query with the "trigger once" mode to process all the data available since the last batch, and set up an external schedule (every few hours?) that periodically starts a cluster and runs one batch. A sketch follows.
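A sketch of the same query in run-once mode, reusing `parsedLogs` and the paths assumed in the P1 sketch above:

import org.apache.spark.sql.streaming.Trigger

// Each scheduled run processes everything that arrived since the last run, then stops.
parsedLogs.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/etl")  // same checkpoint, so progress is remembered across runs
  .trigger(Trigger.Once())
  .start("/deltaTable/")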
  • 25. P1.2: Query faster than ETL! Latency: hours → seconds. Why? Query the latest, up-to-the-last-second data interactively. How? Query the data in Kafka directly using Spark SQL; a batch query can read up to the last records Kafka had received when the query was started. A sketch follows.
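A sketch of an ad-hoc batch query straight against Kafka, with the topic and broker address assumed as before:

// A *batch* read over the topic's current contents — no streaming query needed.
val kafkaDF = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")   // up to the last offsets at query start
  .load()

kafkaDF.selectExpr("CAST(value AS STRING) AS json")
  .createOrReplaceTempView("raw_events")
spark.sql("SELECT count(*) FROM raw_events").show()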
  • 26. Pattern 2: Key-value output. What? Input: new data for each key, e.g. { "key1": "value1" }, { "key1": "value2" }, { "key2": "value3" }. Output: updated values for each key — aggregations (sum, count, …) or sessionizations — so the latest state is key1 → value2, key2 → value3. Why? Look up the latest value for a key (dashboards, websites, etc.), or build summary tables for querying interactively or with periodic jobs.
  • 27. P2.1: Key-value output for lookup. What? Generate updated values for keys. Latency: seconds/minutes. Why? Look up the latest value for a key. How? Process: use Structured Streaming with stateful operations for aggregation. Store: save to key-value stores optimized for single-key lookups. A sketch follows.
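A sketch of a stateful aggregation whose per-key results are pushed to a key-value store. It reuses `parsedLogs` from the P1 sketch; the watermark delay, window size, and the `kvStore` client are all assumptions for illustration:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

val counts = parsedLogs
  .withWatermark("ts", "10 minutes")           // bound the state; data later than this is dropped
  .groupBy(window($"ts", "5 minutes"), $"id")
  .count()

counts.writeStream
  .outputMode("update")                        // emit only the keys updated in each micro-batch
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // `kvStore.put` stands in for your store's client (Redis, Cassandra, ...) — hypothetical.
    // collect() is fine for small dashboard-sized batches; use foreachPartition at scale.
    batch.collect().foreach { row =>
      kvStore.put(row.getAs[Int]("id"), row.getAs[Long]("count"))
    }
  }
  .option("checkpointLocation", "/checkpoints/kv")  // assumed path
  .start()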
  • 28. P2.2: Key-value output for analytics. What? Generate updated values for keys. Latency: seconds/minutes. Why? Look up the latest value for a key; summary tables for analytics. How? Process: use Structured Streaming with stateful operations for aggregation. Store: save in Delta Lake! Delta Lake supports upserts using MERGE.
  • 29. P2.2: Key-value output for analytics. How? Use stateful operations for aggregation, and upsert each batch of results into the summary table with the Merge SQL operation — Scala/Java/Python APIs with the same semantics as SQL Merge (at the time of this talk, Merge is supported in Databricks Delta, not in OSS yet):
streamingDataFrame.writeStream.foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
  DeltaTable.forPath(spark, "/aggs/").as("t")
    .merge(batchOutputDF.as("s"), "t.key = s.key")
    .whenMatched().update(...)
    .whenNotMatched().insert(...)
    .execute()
}.start()
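Filled in with hypothetical columns (a `key` and an aggregated `count`), the batch upsert could look like this; `aggregates`, the paths, and the column names are assumptions:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// `aggregates` is a streaming DataFrame of (key, count) rows — hypothetical.
aggregates.writeStream
  .outputMode("update")
  .foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
    DeltaTable.forPath(spark, "/aggs/").as("t")
      .merge(batchOutputDF.as("s"), "t.key = s.key")
      .whenMatched().updateExpr(Map("count" -> "s.count"))
      .whenNotMatched().insertExpr(Map("key" -> "s.key", "count" -> "s.count"))
      .execute()
  }
  .option("checkpointLocation", "/checkpoints/aggs")  // assumed path
  .start()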
  • 30. P2.2: Key-value output for analytics. How? Caveat: stateful aggregation requires setting a watermark to drop very late data, and dropping some data leads to some inaccuracy in the results.
  • 31. P2.3: Key-value aggregations for analytics. What? Generate aggregated values for keys. Latency: hours/days; do not drop any late data. Why? Summary tables for analytics that are fully correct. How? Process: ETL to a structured table (no stateful aggregation). Store: save in Delta Lake. Post-process: aggregate after all the delayed data has been received. A sketch follows.
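A sketch of the periodic post-processing job, assuming the ETL'd Delta table from the P1 sketch (with `id` and `ts` columns) and assuming that data more than a day late never arrives — both assumptions:

import org.apache.spark.sql.functions._

// Scheduled batch job: re-aggregate only days that can no longer receive late data.
spark.read.format("delta").load("/deltaTable/")              // table produced by the ETL query
  .where(col("ts") < date_sub(current_date(), 1))            // only fully-arrived days
  .groupBy(col("id"))
  .count()
  .write.format("delta").mode("overwrite").save("/summary/") // assumed output path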
  • 32. Pattern 3: Joining multiple inputs. What? Input: multiple data streams sharing a common key (e.g. JSON records and raw log lines, both keyed by id). Output: combined information.
  • 33. P3.1: Joining fast and slow streams. What + Why? Input: one fast stream of facts and one slow stream of dimension changes. Output: the fast stream enriched by data from the slow stream. Example: product sales info ("facts") enriched by more product info ("dimensions"). How? ETL the slow stream into a dimension table, then join the fast stream with snapshots of the dimension table.
  • 34. P3.1: Joining fast and slow streams. How? Caveat: Structured Streaming does not reload the dimension-table snapshot, so changes made by the slow ETL won't be seen until the query restarts. Better solution: store the dimension table in Delta Lake — Delta Lake's versioning allows changes to be detected and the snapshot automatically reloaded without a restart (at the time of this talk, available only in Databricks Delta Lake). A sketch of the stream–static join follows.
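A sketch of the stream–static join, with the table paths, the `factsStream` streaming DataFrame, and the join key `id` all assumed:

// Static DataFrame over the dimension table produced by the slow ETL.
// With a Delta table as the source (on Databricks, per the slide), each
// micro-batch can detect changes and reload the latest snapshot.
val dimensions = spark.read.format("delta").load("/dimensions/")  // assumed path

val enriched = factsStream.join(dimensions, "id")  // enrich facts with dimension columns

enriched.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/enrich")  // assumed path
  .start("/combined/")                                  // assumed path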
  • 35. P3.1: Joining fast and slow streams. How? Caveat: delays in updates to the dimension table can cause joins against stale dimension data — e.g., the sale of a product may be received even before the product table has any info on that product. Better solution: treat it as joining fast and fast data (next pattern).
  • 36. P3.2: Joining fast and fast data. What + Why? Input: two fast streams, where either stream may be delayed. Output: combined info even if one stream is delayed relative to the other. Example: ad impressions and ad clicks. How? Use stream–stream joins in Structured Streaming: data is buffered as state, and watermarks define how long to buffer before giving up on finding a match (see my past deep-dive talk for more info). A sketch follows.
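A sketch of the ad-impressions/ad-clicks join, following the pattern in the Spark Structured Streaming documentation; the `impressions` and `clicks` streaming DataFrames, their column names, and the watermark delays are chosen for illustration:

import org.apache.spark.sql.functions.expr

// Watermarks bound how long each side is buffered as state.
val impressionsWm = impressions.withWatermark("impressionTime", "2 hours")
val clicksWm = clicks.withWatermark("clickTime", "3 hours")

// Join on the ad id, constrained to clicks within an hour of the impression.
val matched = impressionsWm.join(
  clicksWm,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour"""))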
  • 37. Pattern 4: Change data capture. What? Input: change data keyed by a primary key, e.g. INSERT a,1; INSERT b,2; UPDATE a,3; DELETE b; INSERT b,4. Output: the final table after the changes are applied (a → 3, b → 4). Why? End-to-end replication of transactional tables into analytical tables.
  • 38. P4: Change data capture. How? Use foreachBatch and Merge: in each batch, apply the changes to the Delta table using Merge. Delta Lake's Merge supports extended syntax that includes multiple match clauses, clause conditions, and deletes (see the merge examples in the docs for more info):
streamingDataFrame.writeStream.foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
  DeltaTable.forPath(spark, "/deltaTable/").as("t")
    .merge(batchOutputDF.as("s"), "t.key = s.key")
    .whenMatched("s.delete = false").update(...)
    .whenMatched("s.delete = true").delete()
    .whenNotMatched().insert(...)
    .execute()
}.start()
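Filled in with hypothetical columns (`key`, `value`, and a boolean `delete` flag carried by the change feed), the merge could look like this; `changesStream` and the paths are also assumptions:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

changesStream.writeStream
  .foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
    DeltaTable.forPath(spark, "/deltaTable/").as("t")
      .merge(batchOutputDF.as("s"), "t.key = s.key")
      .whenMatched("s.delete = true").delete()
      .whenMatched("s.delete = false").updateExpr(Map("value" -> "s.value"))
      .whenNotMatched("s.delete = false").insertExpr(Map("key" -> "s.key", "value" -> "s.value"))
      .execute()
  }
  .option("checkpointLocation", "/checkpoints/cdc")  // assumed path
  .start()

If a batch can contain several changes for the same key, deduplicate to the latest change per key first — Merge requires a single source row per target row.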
  • 39. Pattern 5: Writing to multiple outputs. What? One input stream feeding multiple output tables — e.g., raw logs + summaries, a summary for analytics + a summary for lookup, or a change log + the updated table.
  • 40. P5: Serial or parallel? How? Serial: write table 1, then read it again to produce table 2 — cheap or expensive depending on the size and format of table 1, and higher latency. Parallel: two independent queries read the input, so the raw data is read (and possibly parsed) twice — cheap or expensive depending on the size of the raw input plus the parsing cost.
  • 41. P5: Combination! How? Combo 1: multiple streaming queries — do the expensive parsing once and write to table 1, then do cheaper follow-up processing from table 1. Good for ETL + multiple levels of summaries, or a change log + the updated table; the cost is still writing and re-reading table 1. Combo 2: a single query + foreachBatch — compute once, write to multiple locations. Cheaper, but loses the exactly-once guarantee (a sketch follows):
streamingDataFrame.writeStream.foreachBatch { (batchSummaryData: DataFrame, batchId: Long) =>
  // write summary to Delta Lake
  // write summary to key-value store
}.start()
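A sketch of Combo 2's batch function, following the multi-sink foreachBatch pattern from the Spark documentation; `summaryStream`, the sinks, and the paths are illustrative assumptions:

import org.apache.spark.sql.DataFrame

summaryStream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.persist()   // compute the batch once, not once per sink
    batch.write.format("delta").mode("append").save("/summaries/analytics/")
    batch.write.format("parquet").mode("append").save("/summaries/lookup-export/")
    batch.unpersist()
  }
  .option("checkpointLocation", "/checkpoints/multi")  // assumed path
  .start()

Since each write inside the function is a separate action, a failure between them can leave the sinks inconsistent — this is why the combo is at-least-once rather than exactly-once.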
  • 42. Production pipelines: Structured Streaming queries move data through Bronze → Silver → Gold tables, which feed reporting and streaming analytics.
  • 43. In Progress: Declarative Pipelines. Allow users to declaratively express the entire DAG of pipelines as a dataflow graph.
  • 44. In Progress: Declarative Pipelines. Enforce metadata, storage, and quality declaratively (*coming soon):
dataset("warehouse")
  .query(input("kafka").withColumn(…).join(…)) // Query to materialize
  .location(…)       // Storage location
  .schema(…)         // Optional strict schema checking
  .metastoreName(…)  // Hive Metastore name
  .description(…)    // Human-readable description
  .expect("validTimestamp",            // Expectations on data quality
          "timestamp > 2012-01-01 AND …",
          "fail / alert / quarantine")
  • 47. In Progress: Declarative Pipelines (*coming soon). Unit test part or all of the DAG with sample input data; integration test the DAG with production data; deploy/upgrade new DAG code; monitor the whole DAG in one place; roll back code/data when needed.
  • 48. Build your own Delta Lake at https://delta.io