Recent efforts by academia and open-source communities have established stream processing as a principal data analysis technology across industry. All major cloud vendors offer streaming dataflow pipelines and online analytics as managed services. Notable use cases include real-time fault detection in space networks, city traffic management, dynamic pricing for car-sharing, and anomaly detection in financial transactions. At the same time, streaming dataflow systems are increasingly being used for event-driven applications beyond analytics, such as orchestrating microservices and model serving. Over the past decades, streaming technology has evolved significantly; however, emerging applications are once more challenging the design decisions of modern streaming systems. In this talk, I will discuss the evolution of stream processing and bring current trends and open problems to the attention of our community.
8. Traffic light adjustment in real time
Alibaba City Brain analyzes vehicle locations to:
• clear paths for emergency response vehicles
• provide scheduling information for public transport
• recommend alternative routes
Read more: https://edition.cnn.com/2019/01/15/tech/alibaba-city-brain-hangzhou/index.html
9. Fault-detection for NASA’s Deep Space Network
NASA’s DSN Complex Event Processing analyzes real-time network data, predicted antenna pointing parameters, and physical hardware logs to:
• ingest, filter, store, and visualize all of the DSN's monitor and control data
• ensure the successful DSN tracking, ranging, and communication integrity of dozens of concurrent deep-space missions
Read more: https://www.confluent.io/kafka-summit-san-francisco-2019/mission-critical-real-time-fault-detection-for-nasas-deep-space-network-using-apache-kafka/
10. • How did we get here?
• Are we there yet?
• What lies ahead?
13. SIGMOD ’92
[… A new class of queries, continuous queries, are similar to conventional database queries, except that they are issued once and henceforth run “continually” over the database …]
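To make the quoted definition concrete, here is a minimal Python sketch of a query that is issued once and then keeps running as rows are appended. The names `ContinuousQuery` and `AppendOnlyTable`, their methods, and the sample rows are all hypothetical, invented for illustration; this is not the interface of the system described in the SIGMOD ’92 paper.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ContinuousQuery:
    """A standing query: registered once, evaluated on every appended row."""

    def __init__(self, predicate: Callable[[Row], bool]) -> None:
        self.predicate = predicate
        self.results: List[Row] = []

    def on_append(self, row: Row) -> None:
        # Incremental evaluation: only the new row is examined.
        if self.predicate(row):
            self.results.append(row)


class AppendOnlyTable:
    """Hypothetical append-only table that notifies registered queries."""

    def __init__(self) -> None:
        self.rows: List[Row] = []
        self.queries: List[ContinuousQuery] = []

    def register(self, query: ContinuousQuery) -> None:
        # Replay existing rows so the query covers the whole table.
        for row in self.rows:
            query.on_append(row)
        self.queries.append(query)

    def append(self, row: Row) -> None:
        self.rows.append(row)
        for query in self.queries:
            query.on_append(row)


table = AppendOnlyTable()
table.append({"sender": "alice", "subject": "hello"})

# The query is issued once...
urgent = ContinuousQuery(lambda r: "urgent" in str(r["subject"]))
table.register(urgent)

# ...and keeps producing results as new rows arrive.
table.append({"sender": "bob", "subject": "urgent: disk full"})
table.append({"sender": "carol", "subject": "lunch?"})
print(urgent.results)
```

The caller never re-submits the query; results accumulate as a side effect of appends, which is the main difference from a conventional one-shot query.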
48. SIGMOD Record ’05
1. Process events online without storing them → persistently store events and state
2. Support a high-level language (SQL-like) → Java, Scala, Python, and SQL-like
3. Handle missing, out-of-order, delayed data → with tunable latency trade-offs
4. Guarantee deterministic (on replay) and correct results (on recovery) → and exactly-once
5. Combine batch and stream processing → batch is a special case of streaming
6. Ensure availability despite failures → and exactly-once state updates
7. Support distribution and automatic elasticity
8. Offer low-latency → high throughput and “acceptable” latency
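Requirement 3 and its "tunable latency trade-offs" note can be illustrated with a small event-time windowing sketch. The function `tumbling_counts`, its parameters, and the sample events are assumptions made for this example and are not taken from any particular system. The watermark trails the largest timestamp seen by `max_delay`: a larger delay tolerates more disorder but holds results back longer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]],
    window_size: int,
    max_delay: int,
) -> Tuple[List[Tuple[int, int]], int]:
    """Count events per event-time window.

    Returns (emitted windows as (start, count), number of dropped late events).
    A window [start, start + window_size) is emitted once the watermark
    (largest timestamp seen minus max_delay) reaches its end.
    """
    open_windows: Dict[int, int] = defaultdict(int)
    emitted: List[Tuple[int, int]] = []
    dropped = 0
    max_ts: Optional[int] = None
    watermark: float = float("-inf")

    for ts, _payload in events:
        start = (ts // window_size) * window_size
        if start + window_size <= watermark:
            dropped += 1  # the window was already emitted: too late
        else:
            open_windows[start] += 1
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - max_delay
        closable = sorted(w for w in open_windows if w + window_size <= watermark)
        for w in closable:
            emitted.append((w, open_windows.pop(w)))

    # End of input: flush whatever is still open.
    for w in sorted(open_windows):
        emitted.append((w, open_windows[w]))
    return emitted, dropped


events = [(1, "a"), (4, "b"), (12, "c"), (3, "d"), (15, "e"), (9, "f"), (25, "g")]
eager = tumbling_counts(events, window_size=10, max_delay=0)
patient = tumbling_counts(events, window_size=10, max_delay=10)
print(eager)
print(patient)
```

With `max_delay=0` the first window is emitted as soon as timestamp 12 arrives, so the late events at 3 and 9 are dropped. With `max_delay=10` they are counted, but the first window is emitted only after timestamp 25 has been seen.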
50. Re-configurable Stream Processing
Some of my recent and ongoing work
[Figure: components labeled Instrumented stream processor, performance metrics, Profiler, Analyzer, decision, invoke, re-configure job]
Automatic scaling
Adaptive scheduling
Straggler mitigation
Query optimization
51. Automatic elasticity and reconfiguration
heuristic policies: if CPU > 80% => scale
stop-and-restart migration and reconfiguration
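The heuristic rule on this slide can be sketched in a few lines of Python. Only the 80% scale-out threshold comes from the slide; the 50% scale-in threshold, the doubling and halving steps, and the toy load model are assumptions chosen to show how such a policy can oscillate.

```python
from typing import List


def threshold_policy(cpu: float, parallelism: int,
                     high: float = 0.80, low: float = 0.50) -> int:
    """Return the new parallelism chosen by a simple CPU-threshold rule."""
    if cpu > high:
        return parallelism * 2            # scale out
    if cpu < low and parallelism > 1:
        return max(1, parallelism // 2)   # scale in
    return parallelism


def simulate(load: float, parallelism: int, steps: int) -> List[int]:
    """Apply the policy repeatedly under a constant load.

    `load` is the total work, expressed in units of one worker's full capacity.
    """
    history = [parallelism]
    for _ in range(steps):
        cpu = min(1.0, load / parallelism)  # per-worker utilization
        parallelism = threshold_policy(cpu, parallelism)
        history.append(parallelism)
    return history


print(simulate(load=1.7, parallelism=2, steps=6))
```

In this toy run the load never changes, yet the policy keeps flipping between 2 and 4 workers: 85% utilization triggers a scale-out, and the resulting 42.5% triggers a scale-in.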
52. Automatic elasticity and reconfiguration
Accuracy: no over/under-provisioning
Stability: no oscillations
Performance: fast convergence
Safe migration: correct results
Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows (OSDI ’18).
Megaphone: Latency-conscious state migration for distributed streaming dataflows (VLDB ’19).
github.com/strymon-system/ds2
github.com/strymon-system/megaphone
[Figure: dataflow with src, o1, o2. Labels: o1 cannot keep up; waiting for output; waiting for input]
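The figure labels suggest why CPU utilization alone is a poor signal: an operator that is waiting for input or output is not doing useful work. The sketch below estimates parallelism from processing rates measured over useful time only. It is loosely inspired by the rate-based idea in the cited OSDI ’18 paper, but it is not the authors' algorithm: `OperatorMetrics`, `plan_parallelism`, and all numbers are hypothetical, and the dataflow is assumed to be a linear chain.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OperatorMetrics:
    name: str
    records_in: int        # records processed during the observation window
    records_out: int       # records produced during the observation window
    useful_seconds: float  # processing time summed over all instances,
                           # excluding time spent waiting on input or output


def plan_parallelism(source_rate: float,
                     chain: List[OperatorMetrics]) -> Dict[str, int]:
    """Estimate instances per operator so each can sustain its input rate."""
    plan: Dict[str, int] = {}
    target_in = source_rate  # records/second the operator must absorb
    for op in chain:
        # Rate one instance could sustain if it never waited.
        true_rate = op.records_in / op.useful_seconds
        plan[op.name] = max(1, math.ceil(target_in / true_rate))
        # Propagate the expected output rate to the next operator.
        selectivity = op.records_out / op.records_in
        target_in *= selectivity
    return plan


# o1 is busy the whole time; o2 mostly waits for input, so its useful time is small.
metrics = [
    OperatorMetrics("o1", records_in=4000, records_out=4000, useful_seconds=10.0),
    OperatorMetrics("o2", records_in=4000, records_out=400, useful_seconds=2.0),
]
print(plan_parallelism(source_rate=1000.0, chain=metrics))
```

Because the estimate depends on measured rates rather than a utilization threshold, the same measurements lead to the same plan, which is one way to approach the accuracy and stability goals listed on the slide.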
55. Performance analysis of streaming dataflows is itself a challenging streaming computation with strict latency requirements
Re-configurable Stream Processing
[Figure: components labeled Instrumented stream processor, performance metrics, Profiler, Analyzer, decision, invoke, re-configure job]
Automatic scaling
Adaptive scheduling
Straggler mitigation
Query optimization
Snailtrail: Generalizing critical paths for online analysis of distributed dataflows (NSDI ’18).
github.com/li1/snailtrail
56. SIGMOD Record ’05
1. Process events online without storing them → persistently store events and state
2. Support a high-level language (SQL-like) → Java, Scala, Python, and SQL-like
3. Handle missing, out-of-order, delayed data → with tunable latency trade-offs
4. Guarantee deterministic (on replay) and correct results (on recovery) → and exactly-once
5. Combine batch and stream processing → batch is a special case of streaming
6. Ensure availability despite failures → and exactly-once state updates
7. Support distribution and automatic elasticity → accurate, stable, latency-aware
8. Offer low-latency → high throughput and “acceptable” latency
58. In open-source software, reliability, production readiness and community can be more important than raw performance
[Figure: Apache Flink, Nexmark Q4. CDF of latency (ms), x-axis from 0 to 10000, comparing in-memory state with RocksDB state. Annotation: serde at every access]
60. write-heavy, large state
read-modify-write (RMW) of a single value
globally configured store
Type-aware, flexible state management provides up to an order of magnitude latency improvement
We need configurable streaming backends
New streaming state benchmarks
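A small Python sketch can show why a byte-oriented store pays for serialization on every read-modify-write while an object store does not. Both store classes are hypothetical, and `pickle` merely stands in for whatever serializer a real backend uses; the sketch counts serde calls and makes no claim about actual RocksDB or Flink behavior or latency.

```python
import pickle
from typing import Any, Dict, List


class SerializingStore:
    """Keeps values as bytes, so every access crosses a serde boundary."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self.serde_calls = 0

    def get(self, key: str, default: Any) -> Any:
        raw = self._data.get(key)
        if raw is None:
            return default
        self.serde_calls += 1          # deserialize on read
        return pickle.loads(raw)

    def put(self, key: str, value: Any) -> None:
        self.serde_calls += 1          # serialize on write
        self._data[key] = pickle.dumps(value)


class ObjectStore:
    """Keeps live objects in memory; no serde on access."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self.serde_calls = 0

    def get(self, key: str, default: Any) -> Any:
        return self._data.get(key, default)

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value


def count_per_key(store: Any, keys: List[str]) -> None:
    # Read-modify-write of a single value per event.
    for key in keys:
        store.put(key, store.get(key, 0) + 1)


keys = ["a", "b", "a", "a", "b"]
bytes_store, object_store = SerializingStore(), ObjectStore()
count_per_key(bytes_store, keys)
count_per_key(object_store, keys)
print(bytes_store.serde_calls, object_store.serde_calls)
```

Both stores end with the same counts, but the byte-oriented one performs a serde call on nearly every access. That per-access cost is the kind of overhead a type-aware or configurable backend could avoid for workloads like this one.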
62. Model serving
1. Model stored externally
[Figure: input stream → Stream Processor ⇄ Model Server over RPC → predictions]
2. Model stored in managed state
[Figure: input stream → Stream Processor with operator (op) → predictions]
Model management and versioning
Exactly-once guarantees?
Latency trade-offs unclear
What kind of state store to use?
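Option 2 can be sketched as a single operator that keeps the current model in its own state and receives model updates on the same stream as the inputs. The event types, the linear model, and the rule for ignoring stale versions are assumptions made for illustration; a real system would keep the model in fault-tolerant managed state.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple, Union


@dataclass
class ModelUpdate:
    version: int
    weight: float
    bias: float


@dataclass
class Features:
    x: float


Event = Union[ModelUpdate, Features]


def serve(events: Iterable[Event]) -> Iterator[Tuple[int, float]]:
    """Yield (model version, prediction) for each input scored."""
    model: Optional[ModelUpdate] = None  # operator state
    for event in events:
        if isinstance(event, ModelUpdate):
            # Versioning: ignore updates older than the current model.
            if model is None or event.version > model.version:
                model = event
        elif model is not None:
            yield model.version, model.weight * event.x + model.bias
        # Inputs that arrive before any model are dropped in this sketch.


stream = [
    Features(1.0),                  # no model yet: dropped
    ModelUpdate(1, weight=2.0, bias=0.0),
    Features(3.0),
    ModelUpdate(2, weight=1.0, bias=1.0),
    ModelUpdate(1, weight=9.0, bias=9.0),   # stale version: ignored
    Features(3.0),
]
predictions = list(serve(stream))
print(predictions)
```

Tagging each prediction with the model version is one simple way to make the versioning question visible downstream. It does not answer the exactly-once or state store questions raised on the slide.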
63. Stateful serverless (FaaS)
[Figure labels: External requests; Events and function triggers; functions (f, λ); output]
Automatic scaling
Function orchestration
Support for transactions
Apache Flink Stateful Functions: https://statefun.io
Stateful Functions as a Service in Action (VLDB ’19)
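The programming model on this slide, functions that keep per-key state and trigger each other through messages, can be sketched with a toy single-process runtime. The `Runtime` class and the `counter` and `alert` functions are hypothetical and are not the Apache Flink Stateful Functions API; scaling, fault tolerance, and transactions are out of scope here.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Tuple

Address = Tuple[str, str]  # (function name, key)


class Runtime:
    """Toy runtime: routes messages to functions and keeps per-address state."""

    def __init__(self) -> None:
        self.functions: Dict[str, Callable[..., None]] = {}
        self.state: Dict[Address, Dict[str, Any]] = {}
        self.mailbox: Deque[Tuple[Address, Any]] = deque()
        self.output: List[Any] = []

    def register(self, name: str, fn: Callable[..., None]) -> None:
        self.functions[name] = fn

    def send(self, address: Address, payload: Any) -> None:
        self.mailbox.append((address, payload))

    def run(self) -> None:
        while self.mailbox:
            address, payload = self.mailbox.popleft()
            fn = self.functions[address[0]]
            state = self.state.setdefault(address, {})
            fn(self, address, state, payload)


def counter(rt: Runtime, address: Address, state: Dict[str, Any], payload: Any) -> None:
    # State is scoped to this function and key only.
    state["count"] = state.get("count", 0) + 1
    if state["count"] == 3:
        # Function-to-function trigger.
        rt.send(("alert", "all"), f"{address[1]} reached 3 events")


def alert(rt: Runtime, address: Address, state: Dict[str, Any], payload: Any) -> None:
    rt.output.append(payload)


rt = Runtime()
rt.register("counter", counter)
rt.register("alert", alert)
for user in ["u1", "u2", "u1", "u1"]:
    rt.send(("counter", user), "click")   # external requests
rt.run()
print(rt.output)
```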
64. Graph streaming & online training
[Figure: analytics complexity (low to high) versus data rate (low to high). Regions: Machine Learning, Data Mining, Streaming CEP, Relational analytics, Graph processing, Complex streaming data analytics]
Graph state management
Data-parallel graph synopses
Languages & operator semantics
Adaptive graph partitioning
Streaming Graph Partitioning: An Experimental Study (VLDB ’18).
Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems, and Parallelism (arxiv.org/abs/1912.12740).
65. [Figure: timeline with marks at 1992, 2000, 2002, 2004, 2013, 2015, 2020]
Data Stream Management Systems: Tapestry, NiagaraCQ, Aurora, TelegraphCQ, STREAM, Gigascope
Distributed Dataflow Systems: MapReduce, S4, Storm, Spark Streaming, Naiad, Samza, Flink, Millwheel, Google Dataflow, Timely Dataflow
ML
operator semantics
event time & progress
representations
synopses & sketches
load management
high availability
scheduling
data parallelism
exactly-once fault-tolerance
state management
general-purpose languages
iterations UDFs
Graphs
FaaS
Edge
Modern hardware
66. From stream data management
To distributed dataflows
And beyond...
Vasiliki (vasia) Kalavri
(vkalavri@bu.edu)