Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, they make it very easy to build pipelines for many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you architect a pipeline that solves your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem automatically brings much clarity on ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Architect Things Right
1. DESIGNING ETL PIPELINES WITH STRUCTURED STREAMING AND DELTA LAKE
How to architect things right
Spark Summit Europe
16 October 2019
Tathagata “TD” Das
@tathadas
2. About Me
Started the Spark Streaming project in AMPLab, UC Berkeley
Currently focused on Structured Streaming and Delta Lake
Staff Engineer on the StreamTeam @ Databricks
Team Motto: "We make all your streams come true"
3. Structured Streaming
Distributed stream processing built on SQL engine
High throughput, second-scale latencies
Fault-tolerant, exactly-once
Great set of connectors
Philosophy: Treat data streams like unbounded tables
Users write batch-like queries on tables
Spark will continuously execute the queries incrementally on streams
4. Structured Streaming Example: ETL
Read JSON data from Kafka
Parse nested JSON
Store in structured Parquet table
Get end-to-end failure guarantees

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()
5. Anatomy of a Streaming Query

spark.readStream.format("kafka")                  // Specify where to read data from
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")    // Specify data transformations
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")                              // Specify where to write data to
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))    // Specify how to process data
  .option("checkpointLocation", "…")
  .start()
6. Spark automatically streamifies!
Spark SQL converts the batch-like query (DataFrames, Datasets, SQL) into a logical plan (read from Kafka → project cast(value) as json → project from_json(json) → write to Parquet), produces an optimized plan (codegen, off-heap, etc.), and executes it as a series of incremental execution plans operating on new batches of data: at each trigger (t = 1, t = 2, t = 3, …), the Kafka source provides the new data, the optimized operators process it, and the Parquet sink writes it out.
[Diagram: the streaming query from the previous slide → logical plan → optimized plan → series of incremental execution plans]
7. Delta Lake
Open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. https://delta.io/
• ACID transactions
• Schema Enforcement and Evolution
• Data versioning and Audit History
• Time travel to old versions
• Open formats
• Scalable Metadata Handling
• Great with batch + streaming
• Upserts and Deletes
8. THE GOOD OF DATA LAKES
• Massive scale out
• Open Formats
• Mixed workloads
THE GOOD OF DATA WAREHOUSES
• Pristine Data
• Transactional Reliability
• Fast SQL Queries
11. Another streaming design pattern talk????
Most talks: focus on a pure streaming engine; explain one way of achieving the end goal.
This talk: Spark is more than a streaming engine; Spark has multiple ways of achieving the end goal with tunable performance, cost, and quality.
12. This talk
How to think about design
Common design patterns
How we are making this easier
15. What?
What is your input? What is your data? What format and system is your data in?
What is your output? What results do you need? What throughput and latency do you need?
16. Why?
Why do you want this output in this way?
Who is going to take actions based on it (humans? computers?)
When and how are they going to consume it?
17. Why? Common mistakes!
#1 "I want my dashboard with counts to be updated every second"
There is no point in updating every second if humans are going to take actions in minutes or hours.
#2 "I want to generate automatic alerts with up-to-the-last-second counts" (but my input data is often delayed)
There is no point in taking fast actions on low-quality data and results.
18. Why? Common mistakes!
#3 "I want to train machine learning models on the results" (but my results are in a key-value store)
Key-value stores are not great for the large, repeated data scans that machine learning workloads perform.
21. Pattern 1: ETL
What? Input: unstructured input stream from files, Kafka, etc., e.g. log lines:
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
Output: structured tabular data
Why? Query the latest structured data interactively or with periodic jobs
22. P1: ETL
What? Convert unstructured input to structured tabular data. Latency: a few minutes.
Why? Query the latest structured data interactively or with periodic jobs.
How?
Process: Use a Structured Streaming query to transform the unstructured, dirty data; run it 24/7 on a cluster with the default trigger.
Store: Save to structured, scalable storage that supports data skipping, etc. E.g.: Parquet, ORC, or even better, Delta Lake.
[Diagram: raw log lines → STRUCTURED STREAMING ETL QUERY → structured table]
23. P1: ETL with Delta Lake
How?
Store: Save to Delta Lake instead, to get:
Read with snapshot guarantees while writes are in progress
Concurrently reprocess data with full ACID guarantees
Coalesce small files into larger files
Update table to fix mistakes in data
Delete data for GDPR
[Diagram: raw log lines → STRUCTURED STREAMING ETL QUERY → Delta Lake table, with a concurrent REPROCESS job writing to the same table]
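A minimal sketch of this pattern in code, assuming the same hypothetical Kafka source as before and a hypothetical /deltaTable/ output path; the only change from the Parquet version is the sink format:

import org.apache.spark.sql.functions.from_json
import spark.implicits._   // for the $"…" column syntax

spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:port")   // hypothetical broker address
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))    // schema is defined elsewhere
  .writeStream
  .format("delta")                                  // Delta Lake sink instead of Parquet
  .option("checkpointLocation", "/deltaTable/_checkpoints/etl")
  .start("/deltaTable/")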
24. P1.1: Cheaper ETL
What? Convert unstructured input to structured tabular data. Latency: a few minutes → hours. Don't keep clusters up 24/7.
Why? Query the latest data interactively or with periodic jobs, with a cheaper solution.
How?
Process: Still use a Structured Streaming query! Run the streaming query with "trigger.once" to process all the data available since the last batch. Set up an external schedule (every few hours?) to periodically start a cluster and run one batch.
[Diagram: raw log lines → STRUCTURED STREAMING, restarted on a schedule → structured table]
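A sketch of the trigger-once variant, reusing the hypothetical parsed stream from the ETL example above; Trigger.Once() processes everything that arrived since the last checkpointed batch and then stops, so an external scheduler only needs to rerun the job periodically:

import org.apache.spark.sql.streaming.Trigger

parsedStream                     // the parsed streaming DataFrame from the ETL query
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/deltaTable/_checkpoints/etl")
  .trigger(Trigger.Once())       // one batch covering all new data, then stop
  .start("/deltaTable/")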
25. P1.2: Query faster than ETL!
What? Latency: hours → seconds.
Why? Query the latest, up-to-the-last-second data interactively.
How? Query the data in Kafka directly using Spark SQL. This can process records up to the last ones received by Kafka when the query was started.
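A sketch of querying Kafka directly, assuming the same hypothetical topic and schema; using spark.read instead of spark.readStream runs a batch query over the topic's current contents:

val kafkaDF = spark.read.format("kafka")            // batch read, not readStream
  .option("kafka.bootstrap.servers", "host:port")   // hypothetical broker address
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))

kafkaDF.createOrReplaceTempView("kafka_latest")
spark.sql("SELECT data.id, count(*) FROM kafka_latest GROUP BY data.id").show()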
26. Pattern 2: Key-value output
What? Input: new data for each key, e.g.
{ "key1": "value1" }
{ "key1": "value2" }
{ "key2": "value3" }
Output: updated values for each key, via aggregations (sum, count, …) or sessionizations, e.g.
KEY     LATEST VALUE
key1    value2
key2    value3
Why? Lookup the latest value for a key (dashboards, websites, etc.)
OR
Summary tables for querying interactively or with periodic jobs
27. P2.1: Key-value output for lookup
What? Generate updated values for keys. Latency: seconds/minutes.
Why? Lookup the latest value for a key.
How?
Process: Use Structured Streaming with stateful operations for aggregation.
Store: Save in key-value stores optimized for single-key lookups.
[Diagram: JSON records → STRUCTURED STREAMING STATEFUL AGGREGATION → key-value store → LOOKUP]
28. P2.2: Key-value output for analytics
What? Generate updated values for keys. Latency: seconds/minutes.
Why? Not just lookup of the latest value for a key, but summary tables for analytics.
How?
Process: Use Structured Streaming with stateful operations for aggregation.
Store: Save in Delta Lake! Delta Lake supports upserts using MERGE.
[Diagram: JSON records → STRUCTURED STREAMING STATEFUL AGGREGATION → Delta Lake SUMMARY table]
29. P2.2: Key-value output for analytics
How? Stateful operations for aggregation; Delta Lake supports upserts using the MERGE SQL operation.

streamingDataFrame.writeStream.foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
  DeltaTable.forPath(spark, "/aggs/").as("t")
    .merge(batchOutputDF.as("s"), "t.key = s.key")
    .whenMatched().update(...)
    .whenNotMatched().insert(...)
    .execute()
}.start()

Scala/Java/Python APIs with the same semantics as SQL MERGE
SQL MERGE supported in Databricks Delta, not in OSS yet
[Diagram: JSON records → STRUCTURED STREAMING STATEFUL AGGREGATION → Delta Lake SUMMARY table]
30. P2.2: Key-value output for analytics
How? Stateful operations for aggregation; Delta Lake supports upserts using MERGE.
Caveat: stateful aggregation requires setting a watermark to drop very late data, and dropping some data leads to some inaccuracies.
[Diagram: JSON records → STRUCTURED STREAMING STATEFUL AGGREGATION → SUMMARY table]
31. P2.3: Key-value aggregations for analytics
What? Generate aggregated values for keys. Latency: hours/days. Do not drop any late data.
Why? Correct summary tables for analytics.
How?
Process: ETL to a structured table (no stateful aggregation).
Store: Save in Delta Lake.
Post-process: Aggregate after all the delayed data has been received.
[Diagram: JSON records → STRUCTURED STREAMING ETL → Delta table → periodic AGGREGATION → SUMMARY table]
32. Pattern 3: Joining multiple inputs
What? Input: multiple data streams with a common key, e.g. JSON records
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
and log lines
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
Output: combined information (id, update, value)
33. P3.1: Joining fast and slow streams
What + Why? Input: one fast stream of facts and one slow stream of dimension changes. Output: the fast stream enriched by data from the slow stream.
Example: product sales info ("facts") enriched by more product info ("dimensions").
How?
ETL the slow stream to a dimension table.
Join the fast stream with snapshots of the dimension table.
[Diagram: FAST ETL stream joined with DIMENSION TABLE (built by SLOW ETL) → COMBINED TABLE]
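A minimal sketch of the join, assuming the slow ETL maintains a hypothetical /dimensions/ Delta table and that factsStream is the fast streaming DataFrame with a matching id column:

val dimensions = spark.read.format("delta").load("/dimensions/")   // snapshot of the dimension table

factsStream
  .join(dimensions, "id")                  // stream-static join on the common key
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/combined/_checkpoints")
  .start("/combined/")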
34. P3.1: Joining fast and slow streams
How? Caveats: Structured Streaming does not reload the dimension table snapshot, so changes made by the slow ETL won't be seen until the query restarts.
Better solution: Store the dimension table in Delta Lake. Delta Lake's versioning allows changes to be detected and the snapshot automatically reloaded without a restart.**
** available only in Databricks Delta Lake
[Diagram: FAST ETL stream joined with Delta Lake DIMENSION TABLE (built by SLOW ETL) → COMBINED TABLE]
35. P3.1: Joining fast and slow streams
How? Caveats: Delays in updates to the dimension table can cause the join to use stale dimension data. E.g., the sale of a product may be received even before the product table has any info on that product.
Better solution: Treat it as "joining fast and fast data".
36. P3.2: Joining fast and fast data
What + Why? Input: two fast streams where either stream may be delayed. Output: combined info even if one stream is delayed relative to the other.
Example: ad impressions and ad clicks.
How? Use stream-stream joins in Structured Streaming. Data will be buffered as state; watermarks define how long to buffer before giving up on matching.
See my past deep-dive talk for more info.
[Diagram: two streams → STREAM-STREAM JOIN → combined table (id, update, value)]
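A sketch of a stream-stream join for the ad example, assuming hypothetical impressions and clicks streaming DataFrames with impressionAdId/clickAdId keys and event-time columns; the watermarks and the time-range condition bound how long each side is buffered:

import org.apache.spark.sql.functions.expr

val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))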
37. Pattern 4: Change data capture
What? Input: change data based on a primary key, e.g.
INSERT a, 1
INSERT b, 2
UPDATE a, 3
DELETE b
INSERT b, 4
Output: the final table after the changes are applied:
key  value
a    3
b    4
Why? End-to-end replication of transactional tables into analytical tables
38. P4: Change data capture
How? Use foreachBatch and MERGE: in each batch, apply the changes to the Delta table using MERGE. Delta Lake's MERGE supports extended syntax that includes multiple match clauses, clause conditions, and deletes.

streamingDataFrame.writeStream.foreachBatch { (batchOutputDF: DataFrame, batchId: Long) =>
  DeltaTable.forPath(spark, "/deltaTable/").as("t")
    .merge(batchOutputDF.as("s"), "t.key = s.key")
    .whenMatched("s.delete = false").update(...)
    .whenMatched("s.delete = true").delete()
    .whenNotMatched().insert(...)
    .execute()
}.start()

See the merge examples in the docs for more info.
39. Pattern 5: Writing to multiple outputs
What? Write one input stream to multiple outputs, e.g.:
Raw logs + summaries (TABLE 1: id, val1, val2 and TABLE 2: id, sum, count)
Summary for analytics + summary for lookup
Table change log + updated table
40. P5: Serial or Parallel?
How?
Serial: one query writes table 1, a second query reads table 1 and writes table 2.
Writes table 1 and reads it again
Cheap or expensive depending on the size and format of table 1
Higher latency
Parallel: two independent queries each read the raw input.
Reads the input twice, may have to parse the data twice
Cheap or expensive depending on the size of the raw input + parsing cost
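A sketch of the two arrangements, assuming a hypothetical parseRawInput() helper that builds the parsed streaming DataFrame from the raw source:

// Parallel: two independent queries, each reading and parsing the raw input
parseRawInput().writeStream
  .format("delta").option("checkpointLocation", "/chk/t1").start("/table1/")
parseRawInput().groupBy($"id").count().writeStream
  .outputMode("complete")
  .format("delta").option("checkpointLocation", "/chk/t2").start("/table2/")

// Serial: write table 1 once, then stream from table 1 to build table 2
parseRawInput().writeStream
  .format("delta").option("checkpointLocation", "/chk/t1").start("/table1/")
spark.readStream.format("delta").load("/table1/")
  .groupBy($"id").count().writeStream
  .outputMode("complete")
  .format("delta").option("checkpointLocation", "/chk/t2").start("/table2/")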
41. P5: Combination!
How?
Combo 1: Multiple streaming queries
Do the expensive parsing once, write to table 1; do the cheaper follow-up processing from table 1.
Good for: ETL + multiple levels of summaries; change log + updated table.
Still writing + reading table 1.
[Diagram: raw input → EXPENSIVE query → TABLE 1 → CHEAP queries → TABLE 2, TABLE 3]
Combo 2: Single query + foreachBatch
Compute once, write multiple times. Cheaper, but loses the exactly-once guarantee.

streamingDataFrame.writeStream.foreachBatch { (batchSummaryData: DataFrame, batchId: Long) =>
  // write summary to Delta Lake
  // write summary to key value store
}.start()

[Diagram: raw input → single query → foreachBatch → LOCATION 1, LOCATION 2]
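A sketch of Combo 2's body, assuming a hypothetical /summary/ Delta path and a JSON export standing in for the key-value store write; persisting the batch avoids recomputing it for the second write:

streamingSummaryDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()                                               // compute once, reuse for both writes
  batchDF.write.format("delta").mode("append").save("/summary/")  // write 1: Delta Lake
  batchDF.write.mode("overwrite").json("/kv-export/")             // write 2: stand-in for a key-value store
  batchDF.unpersist()
}.start()
// Each write is retried independently on failure, so the outputs are
// at-least-once rather than exactly-once.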
43. In Progress: Declarative Pipelines
Allow users to declaratively express the entire DAG of pipelines as a dataflow graph.
[Diagram: a DAG of Structured Streaming queries feeding REPORTING and STREAMING ANALYTICS]
44. In Progress: Declarative Pipelines
Enforce metadata, storage, and quality declaratively:

dataset("warehouse")
  .query(input("kafka").select(…).join(…))  // Query to materialize
  .location(…)                              // Storage location
  .schema(…)                                // Optional strict schema checking
  .metastoreName(…)                         // Hive Metastore
  .description(…)                           // Human-readable description
  .expect("validTimestamp",                 // Expectations on data quality
          "timestamp > 2012-01-01 AND …",
          "fail / alert / quarantine")

*Coming Soon
47. In Progress: Declarative Pipelines
Unit test part or the whole DAG with sample input data
Integration test the DAG with production data
Deploy/upgrade new DAG code
Monitor the whole DAG in one place
Rollback code / data when needed
[Diagram: the same DAG of Structured Streaming queries feeding REPORTING and STREAMING ANALYTICS]
*Coming Soon