Virtual Flink Forward 2020: Everything is connected: How watermarking, scaling, and exactly once impact one another in Pravega - Flavio Paiva Junqueira, Tom Kaitchuck
The talk discusses how Pravega, an open source stream storage system, supports watermarking, scaling, and exactly-once processing in stream processing systems. Pravega stores streams as sequences of events across distributed segments, which allows for watermarking of event timestamps, dynamic scaling of streams, and tracking of event ingestion to enable exactly-once processing. Checkpointing and replaying events from checkpoints also allow stream processors using Pravega to recover from failures while maintaining exactly-once semantics.
1. Everything is connected: How watermarking, scaling, and
exactly once impact one another in Pravega
Flavio Junqueira
Senior Director, Senior Distinguished Engineer
Tom Kaitchuck
Distinguished Engineer
2. Flink Forward, April 2020 – Pravega http://pravega.io
Pravega: Storage for data streams
• Pravega is
– A stream store: stream is the storage primitive
– The foundation is segments
– Segments enable a flexible composition of streams
– Segments enable watermarking, scaling, transactions, and state replication
• Pravega is open source http://pravega.io
https://github.com/pravega/pravega
3. Flink Forward, April 2020 – Pravega http://pravega.io
Data streams
Data Pipelines
Storage (Pravega)
Stream Processor (Apache Flink)
• Data Sources:
• User status
• Online transactions
• Server telemetry data
• Sensor samples
• Connected cars
• Drone videos
• Core elements
• Storage (Pravega)
• Stream processor (Apache Flink)
• Arbitrary Directed Acyclic Processing Graphs
• Visualization
• Alerts
• Near real-time insights
• Recommendations
• Actionable instructions
Landscape:
5. Flink Forward, April 2020 – Pravega http://pravega.io
Data streams
Social networks
Online shopping
Server monitoring
IoT, Edge
Stream of user events
• Status updates
• Online transactions
Telemetry streams
• CPU, memory, disk utilization
Stream of sensor events
• Temperature samples
• Radar and image
Tools and services
Jira, Git, Jenkins
• Logs
• Results
6. Flink Forward, April 2020 – Pravega http://pravega.io
Drone video streams + telemetry
Video stream
Telemetry
Drone fleet
UNIFIED
INGEST | PROCESS | ANALYTICS
• From cattle health
• To airplane inspection between flights
Telemetry
Video feed
8. Flink Forward, April 2020 – Pravega http://pravega.io
DATA STREAMS – SEQUENTIAL
Events from a server, sensor, etc.
e1 e2 e3 e4 e5 e6 e7 e8 ek ek+1 ek+2 ek+3
Head, Tail
Data streams
14. Flink Forward, April 2020 – Pravega http://pravega.io
Stream in
Streaming
Storage
• Unbounded data
• Elastic
• Consistent
• Tailing and historical data analytics
• Cloud native
Stream out
Data stream: Cloud native, storage primitive
15. Flink Forward, April 2020 – Pravega http://pravega.io
The recipe for effective stream processing
• Exactly once processing
– Don’t miss, don’t duplicate
• Checkpoints
– Enable rewinding
• Durability
– Enable replaying
• Scaling
– Workloads change dynamically and provisioning follows the changes
• Watermarking
– Advance event, ingest, processing time
16. Flink Forward, April 2020 – Pravega http://pravega.io
Everything is connected …
Exactly-once semantics
Durability
Checkpoints
Scaling
Watermarking
18. Flink Forward, April 2020 – Pravega http://pravega.io
Time windows
Source: Emits samples, records, messages
Examples:
• Sensors in IoT
• End users in social networks
• Server metrics
Writer needs to supply time information
[Figure: the source emits e1, e2, e3, …, en, which arrive as … e8 e6 e7 e5 e4 e3 e2 e1. Window aggregation (count) over event time: Window 3 (closed) holds e1 e2 e3, Window 6 (open) holds e4 e5, Window 9 (open) holds e7.]
19. Flink Forward, April 2020 – Pravega http://pravega.io
Low watermark
Source: Emits events
[Figure: the source emits e1, e2, e3, …, en; the stream reads … e8 | e6 e7 e5 e4 | e3 e2 e1, where the "|" marks are the watermarks W(3) and W(6). Window aggregation (count) over event time: Window 3 (closed) holds e1 e2 e3, Window 6 (open) holds e4 e5, Window 9 (open) holds e7.]
Watermark W(t)
• Has an associated timestamp t
• Contract: all events with a timestamp smaller than or equal to t have been received
• Closes windows whose ending timestamp is at most t
• Late events violate the contract
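The watermark contract above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not Pravega or Flink code: the class name `WindowCounter`, the tumbling-window layout (window "3" covers timestamps 1 to 3, window "6" covers 4 to 6, and so on), and the choice to merely count late events are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: counts events per tumbling event-time window and
// closes a window once a watermark W(t) reaches the window's end timestamp.
final class WindowCounter {
    private final long windowSize;
    // Open windows, keyed by the window's (inclusive) end timestamp.
    private final TreeMap<Long, Integer> open = new TreeMap<>();
    private long lastWatermark = Long.MIN_VALUE;
    private int lateEvents = 0;

    WindowCounter(long windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    // Adds one event. Assumes timestamps start at 1.
    void onEvent(long timestamp) {
        if (timestamp <= lastWatermark) {
            lateEvents++; // the event violates the watermark contract
            return;
        }
        long windowEnd = ((timestamp + windowSize - 1) / windowSize) * windowSize;
        open.merge(windowEnd, 1, Integer::sum);
    }

    // Processes W(t): closes every window whose end is <= t and returns
    // the closed windows as {windowEnd, count} pairs, oldest first.
    List<long[]> onWatermark(long t) {
        lastWatermark = Math.max(lastWatermark, t);
        List<long[]> closed = new ArrayList<>();
        while (!open.isEmpty() && open.firstKey() <= lastWatermark) {
            Map.Entry<Long, Integer> e = open.pollFirstEntry();
            closed.add(new long[] {e.getKey(), e.getValue()});
        }
        return closed;
    }

    int lateEvents() {
        return lateEvents;
    }
}
```

With a window size of 3, feeding events with timestamps 1, 2, 3, 4, 5, 7 and then W(3) closes only window 3; an event with timestamp 2 that arrives afterwards is counted as late.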
20. Flink Forward, April 2020 – Pravega http://pravega.io
Order
Source
en …. e3 e2 e1 … e8 | e6 e7 e5 e4 | e3 e2 e1
W(3)W(6)
Source
en …. e3 e2 e1 … e8 e6 | e7 e5 e4 | e3 e2 e1
W(3)W(6)
• Out-of-order events are expected
• Low watermarks must be able to accommodate out-of-orderness
OK
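One common way to accommodate out-of-orderness, shown here as a sketch rather than as something the slides prescribe, is to let the watermark trail the highest timestamp seen so far by a fixed bound. The class name `BoundedOutOfOrdernessWatermark` and the bound value are assumptions for illustration; the formula (maximum timestamp minus bound minus one) mirrors the bounded-out-of-orderness idea used in Flink.

```java
// Hypothetical sketch: the watermark trails the highest timestamp seen so far,
// so an event may arrive up to maxOutOfOrderness behind it without being late.
final class BoundedOutOfOrdernessWatermark {
    private final long maxOutOfOrderness;
    private long maxTimestamp = Long.MIN_VALUE;

    BoundedOutOfOrdernessWatermark(long maxOutOfOrderness) {
        if (maxOutOfOrderness < 0) {
            throw new IllegalArgumentException("bound must not be negative");
        }
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Records an event timestamp; out-of-order timestamps never lower the maximum.
    void onEvent(long timestamp) {
        maxTimestamp = Math.max(maxTimestamp, timestamp);
    }

    // Returns the current watermark, or Long.MIN_VALUE before the first event.
    long currentWatermark() {
        if (maxTimestamp == Long.MIN_VALUE) {
            return Long.MIN_VALUE;
        }
        return maxTimestamp - maxOutOfOrderness - 1;
    }
}
```

With a bound of 1, seeing timestamps 4, 5, 7 yields a watermark of 5, so an event with timestamp 6 that arrives next still satisfies the contract.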
21. Flink Forward, April 2020 – Pravega http://pravega.io
Global order
en …. e5 e3 e1
en …. e6 e4 e2
en …. e10 e7 e4
… e15 | e13 e11 | e9 | e7 e5 | e3 e1
… e16 e14 | e12 | e10 e8 | e6 | e4 e2
… e25 | e22 | e19 | e16 | e13 | e10 | e7 | e4
W(3)W(6)W(9)W(12)
W(3)W(9)W(12) W(6)
W(3)W(9) W(6)W(12)
Multiple sources
Watermarks must reflect time progress across all sources
• Requires global coordination
• Aggregate watermarks
• Compare timestamps from many sources
• Sources need to report:
  • Event time
  • Position
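Aggregating time across sources can be sketched as taking the minimum of the latest event time each source has reported. The class `GlobalWatermark` and its source identifiers are hypothetical, and the sketch assumes every source reports its own event times in non-decreasing order; it leaves out the position reporting the slide also calls for.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the global watermark is the minimum of the latest
// event time reported by each known source.
final class GlobalWatermark {
    private final Map<String, Long> latestBySource = new HashMap<>();

    // All sources must be known up front; a source that has not reported
    // yet holds the global watermark at Long.MIN_VALUE.
    GlobalWatermark(Iterable<String> sourceIds) {
        for (String id : sourceIds) {
            latestBySource.put(id, Long.MIN_VALUE);
        }
    }

    // Records the event time reported by one source (never moves backwards).
    void report(String sourceId, long eventTime) {
        if (!latestBySource.containsKey(sourceId)) {
            throw new IllegalArgumentException("unknown source: " + sourceId);
        }
        latestBySource.merge(sourceId, eventTime, Long::max);
    }

    // Time has only progressed as far as the slowest source.
    long current() {
        long min = Long.MAX_VALUE;
        for (long t : latestBySource.values()) {
            min = Math.min(min, t);
        }
        return min;
    }
}
```

The slowest source determines the result: if sources "a", "b", and "c" have reported 5, 6, and 4, the global watermark is 4, and it moves to 5 only once "c" reports something larger.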
26. Flink Forward, April 2020 – Pravega http://pravega.io
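Tracking source position
[Figure: a source with events e1 e2 e3 e4 appends to the stream store. The store's counter for the source goes from 0 to 1, 2, 3 as e1, e2, e3 are appended; on reconnect the store answers "Last is 3"; appending e4 moves the counter to 4.]
• Source connects to stream store to append
• Appends e1 e2 e3 successfully
• Disconnects
• Reconnects and determines that the last event written is e3
• Appends e4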
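The reconnect sequence above can be sketched as a store that remembers, per writer, the number of the last event it appended. Everything here is hypothetical (the class `DedupingStreamStore`, the writer id, the event numbering) and is not the Pravega wire protocol; it only shows how a per-writer counter lets a writer resume without duplicating events.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a store that tracks each writer's last event number.
final class DedupingStreamStore {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Long> lastEventNumber = new HashMap<>();

    // Called on connect and reconnect: tells the writer where it left off.
    long connect(String writerId) {
        return lastEventNumber.getOrDefault(writerId, 0L);
    }

    // Appends the event unless its number shows it was already written.
    // Returns true if the event was appended, false if it was a duplicate.
    boolean append(String writerId, long eventNumber, String event) {
        long last = lastEventNumber.getOrDefault(writerId, 0L);
        if (eventNumber <= last) {
            return false;
        }
        log.add(event);
        lastEventNumber.put(writerId, eventNumber);
        return true;
    }

    // Returns a copy of the stored events in append order.
    List<String> events() {
        return new ArrayList<>(log);
    }
}
```

A writer that appended e1 to e3, disconnected, and reconnected learns from `connect` that the last event is 3; a retried e3 is rejected and e4 is appended once.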
27. Flink Forward, April 2020 – Pravega http://pravega.io
Time windows: Accuracy
Source: Emits events
Exactly once is needed for accurate windows
[Figure: stream … e8 | e6 e7 e5 e4 | e3 e2 e1 with watermarks W(3) and W(6); window counts over event time for Window 3 (closed), Window 6 (open), and Window 9 (open).]
29. Flink Forward, April 2020 – Pravega http://pravega.io
Upon crashes
Source: Emits events
[Figure: stream … e8 | e6 e7 e5 e4 | e3 e2 e1 with watermarks W(3) and W(6); Window 3 (closed), Window 6 (open), Window 9 (open); a crash occurs during window aggregation.]
Watermark W(t)
• Has an associated timestamp t
• Contract: all events with a timestamp smaller than or equal to t have been received
• Closes window with smaller ending timestamp
• Late events violate contract
30. Flink Forward, April 2020 – Pravega http://pravega.io
Upon crashes
Source: Emits events
[Figure: stream … e8 | e6 e7 e5 e4 | e3 e2 e1 with watermarks W(3) and W(6); Window 3 (closed), Window 6 (open), Window 9 (open); a crash occurs during window aggregation.]
• To guarantee exactly-once semantics
  • Connector enables Flink to track progress
  • Flink determines where to resume from upon recovery
31. Flink Forward, April 2020 – Pravega http://pravega.io
Upon crashes
Source: Emits events
[Figure: stream … e8 | e6 e7 e5 e4 | e3 e2 e1 with watermarks W(3) and W(6); after the crash only Window 3 (closed) with e1 e2 e3 is shown on the event-time axis.]
• Ability to rewind to a safe position
• Replay events
• Checkpoints
32. Flink Forward, April 2020 – Pravega http://pravega.io
Checkpointing
• Crash failures
– Hosts across the system can crash at any time
– Ability to back up and re-read data.
• Checkpoints
– A point in the application execution
– All intermediate state is persisted
– Computation can be resumed from checkpoint
• Implications to sources
– Ability to replay (durability)
– Ability to control the position and persist it
… e8 | e6 e7 e5 e4 | e3 e2 e1
W(3)W(6)
• Ability to rewind to a safe position
• Replay events
• Checkpoints
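Rewinding to a checkpoint and replaying can be sketched as follows. The sketch assumes the stream is durable (modelled as an in-memory list) and that the only operator state is a running count; the class `CheckpointedCounter` and its methods are hypothetical and stand in for the combination of a persisted source position and persisted intermediate state.

```java
import java.util.List;

// Hypothetical sketch: a checkpoint stores the read position together with
// the operator state, so recovery can rewind and replay without double counting.
final class CheckpointedCounter {
    private final List<String> durableStream; // events survive crashes
    private int position = 0;                 // index of the next event to read
    private long count = 0;                   // operator state
    private int checkpointedPosition = 0;
    private long checkpointedCount = 0;

    CheckpointedCounter(List<String> durableStream) {
        this.durableStream = durableStream;
    }

    // Processes one event; returns false once the stream is exhausted.
    boolean processNext() {
        if (position >= durableStream.size()) {
            return false;
        }
        position++;
        count++;
        return true;
    }

    // Persists position and state as one consistent snapshot.
    void checkpoint() {
        checkpointedPosition = position;
        checkpointedCount = count;
    }

    // After a crash: rewind to the snapshot; later events will be replayed.
    void restoreFromCheckpoint() {
        position = checkpointedPosition;
        count = checkpointedCount;
    }

    long count() {
        return count;
    }

    int position() {
        return position;
    }
}
```

Because position and count are saved and restored together, events processed after the checkpoint but before the crash are replayed exactly once into the restored state.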
33. Flink Forward, April 2020 – Pravega http://pravega.io
Checkpoint complicates sinks
• Sinks
– Output data of a job
• When rolling back to a checkpoint
– Preserve exactly-once semantics
– Data needs to be “unwritten” or “deduped”
• Also applies to time
– Output time can only advance on checkpoints
34. Flink Forward, April 2020 – Pravega http://pravega.io
Scaling and rebalancing
[Figure: the key space partitioned among workers, shown before and after Scaling and Rebalancing.]
• Consistent assignment
• Time advances consistently for watermarking
• Exactly-once guarantee
35. Flink Forward, April 2020 – Pravega http://pravega.io
The bottom line:
Computing watermarks is complicated…
36. Flink Forward, April 2020 – Pravega http://pravega.io
Support for watermarks
in Pravega
37. Flink Forward, April 2020 – Pravega http://pravega.io
Stream S
[Figure: Segments 0 to 7 of Stream S laid out over the key space (0 to 1.0, with boundaries at .25, .5, and .75) against time (t0 to t4).]
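The segment layout in the figure can be read as a partition of the key space [0, 1) that changes over time. The sketch below is an assumption-heavy illustration, not Pravega's implementation: the class `KeySpaceRouter`, the hash-to-position mapping, and the segment numbering are invented for the example. It models a split by replacing one segment with two successors that get new ids.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: active segments partition the key space [0, 1).
final class KeySpaceRouter {
    // Active segments keyed by the lower bound of their key range; a range
    // ends at the next lower bound (or at 1.0 for the last segment).
    private final TreeMap<Double, Integer> segmentsByLowerBound = new TreeMap<>();
    private int nextSegmentId = 0;

    KeySpaceRouter() {
        segmentsByLowerBound.put(0.0, nextSegmentId++); // one segment covers [0, 1)
    }

    // Scale-up: the segment containing 'at' is replaced by two successors.
    void split(double at) {
        if (at <= 0.0 || at >= 1.0 || segmentsByLowerBound.containsKey(at)) {
            throw new IllegalArgumentException("invalid split point: " + at);
        }
        Map.Entry<Double, Integer> parent = segmentsByLowerBound.floorEntry(at);
        segmentsByLowerBound.put(parent.getKey(), nextSegmentId++);
        segmentsByLowerBound.put(at, nextSegmentId++);
    }

    // Returns the id of the active segment whose range contains the position.
    int segmentFor(double keyPosition) {
        if (keyPosition < 0.0 || keyPosition >= 1.0) {
            throw new IllegalArgumentException("position must be in [0, 1)");
        }
        return segmentsByLowerBound.floorEntry(keyPosition).getValue();
    }

    // Maps a routing key to a position in [0, 1) (illustrative hash only).
    static double position(String routingKey) {
        return (routingKey.hashCode() & 0x7fffffff) / (double) (1L << 31);
    }
}
```

After `split(0.5)` on a fresh router, positions below 0.5 route to segment 1 and positions from 0.5 upward route to segment 2, so events with the same routing key keep landing in the same segment between scaling events.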
49. Flink Forward, April 2020 – Pravega http://pravega.io
Pravega time windows
Pravega Reader
TimeWindow getCurrentTimeWindow();
Time window
• Two attributes:
  1. Lower bound
  2. Upper bound
• Lower bound
  • Timestamp less than or equal to the most recent time noted (writer API)
• Upper bound
  • Timestamp greater than or equal to the most recent time noted (writer API)
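The slide does not say how the two bounds are derived, so the following is only a simplified model: it assumes the reader side knows the most recent time noted by each writer and takes the minimum as the lower bound and the maximum as the upper bound. The class `TimeBounds` and the method `canClose` are hypothetical and are not part of the Pravega API.

```java
import java.util.Collection;

// Hypothetical sketch of a time window with a lower and an upper bound,
// derived from the most recent time noted by each writer.
final class TimeBounds {
    final long lower;
    final long upper;

    private TimeBounds(long lower, long upper) {
        this.lower = lower;
        this.upper = upper;
    }

    // Assumed model: lower bound = slowest writer, upper bound = fastest writer.
    static TimeBounds of(Collection<Long> latestNotedTimes) {
        if (latestNotedTimes.isEmpty()) {
            throw new IllegalArgumentException("no writer has noted a time yet");
        }
        long lo = Long.MAX_VALUE;
        long hi = Long.MIN_VALUE;
        for (long t : latestNotedTimes) {
            lo = Math.min(lo, t);
            hi = Math.max(hi, t);
        }
        return new TimeBounds(lo, hi);
    }

    // A window ending at windowEnd is safe to close once the lower bound reaches it.
    boolean canClose(long windowEnd) {
        return windowEnd <= lower;
    }
}
```

Using the lower bound as the watermark is the conservative choice: a window is closed only after every writer in the model has moved past its end.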
50. Flink Forward, April 2020 – Pravega http://pravega.io
Watermarks
public abstract class LowerBoundAssigner<T>
        implements AssignerWithTimeWindows<T> {
    …
    @Override
    public abstract long extractTimestamp(T element,
                                          long previousElementTimestamp);

    // built-in watermark implementation which emits the lower bound
    @Override
    public Watermark getWatermark(TimeWindow timeWindow) {
        if (timeWindow == null || timeWindow.isNearHeadOfStream()) {
            return null;
        }
        return timeWindow.getLowerTimeBound() == Long.MIN_VALUE ?
                new Watermark(Long.MIN_VALUE) :
                new Watermark(timeWindow.getLowerTimeBound());
    }
}
• Timestamp assigner
• LowerBoundAssigner
• Gets the time window
• Watermark is the lower bound
51. Flink Forward, April 2020 – Pravega http://pravega.io
With a Pravega reader source
FlinkPravegaReader<Sample> source = FlinkPravegaReader.<Sample>builder()
        .withPravegaConfig(pravegaConfig)
        .forStream(Stream.of("scope", "stream"))
        .withDeserializationSchema(…)
        .withTimestampAssigner(new LowerBoundAssigner<Sample>() {
            @Override
            public long extractTimestamp(Sample sample,
                                         long previousElementTimestamp) {
                // event time comes from the sample itself
                return sample.getTimestamp();
            }
        })
        .build();
52. Flink Forward, April 2020 – Pravega http://pravega.io
Upon a checkpoint
[Figure: a Flink job whose source tasks (Pravega readers) run on four task managers and read Segments 1 to 4 of a Pravega stream.]
Step 1: Master
• Initiates checkpoint
• Invokes call on ReaderGroup API
• Implementation of MasterTriggerRestoreHook
53. Flink Forward, April 2020 – Pravega http://pravega.io
Upon a Flink checkpoint
[Figure: the same Flink job and Pravega stream (Segments 1 to 4), now with a Revisioned Stream and checkpoint markers (C).]
Step 1: Master
• Initiates checkpoint
• Invokes call on ReaderGroup API
• Implementation of MasterTriggerRestoreHook
Step 2:
• Coordinate via state synchronizer
• Readers emit checkpoint event
54. Flink Forward, April 2020 – Pravega http://pravega.io
Upon a Flink checkpoint
[Figure: the same Flink job and Pravega stream (Segments 1 to 4), with a Revisioned Stream and checkpoint markers (C).]
Step 2:
• Coordinate via state synchronizer
• Readers emit checkpoint event
Step 3:
• Receives checkpoint
55. Flink Forward, April 2020 – Pravega http://pravega.io
Exactly-once with Transactions
[Figure: sink tasks of the Flink job issue transactional (Txn) writes to Pravega.]
• Transactional writes for job output
• Executes a 2PC to commit results
• Option to not use transactions
  • At-least-once semantics
56. Flink Forward, April 2020 – Pravega http://pravega.io
Exactly-once with Transactions
2-Phase commit protocol
[Figure: the Flink Master coordinates checkpointing between the sink tasks (S) and Pravega. Message sequence: Start checkpoint (Prepare), Ack Prepare, Complete checkpoint, Commit txn.]
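The prepare and commit steps can be sketched with a sink that buffers output in a transaction and makes it visible only when the checkpoint completes. This is a minimal illustration under simplifying assumptions: the class `TransactionalSink` is hypothetical, it allows a single prepared transaction at a time, and it keeps everything in memory instead of talking to Pravega.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a sink that takes part in a two-phase commit
// driven by checkpoints.
final class TransactionalSink {
    private final List<String> committed = new ArrayList<>();
    private List<String> openTxn = new ArrayList<>();
    private List<String> preparedTxn = null;

    // Output is buffered in the open transaction and is not yet visible.
    void write(String event) {
        openTxn.add(event);
    }

    // Phase 1 (start of checkpoint): freeze the open transaction.
    void prepare() {
        if (preparedTxn != null) {
            throw new IllegalStateException("a transaction is already prepared");
        }
        preparedTxn = openTxn;
        openTxn = new ArrayList<>();
    }

    // Phase 2 (checkpoint complete): make the prepared output visible.
    void commit() {
        if (preparedTxn != null) {
            committed.addAll(preparedTxn);
            preparedTxn = null;
        }
    }

    // Rollback to the last checkpoint: drop everything not yet committed.
    void abort() {
        preparedTxn = null;
        openTxn = new ArrayList<>();
    }

    // Returns a copy of the output that readers of the sink can see.
    List<String> visible() {
        return new ArrayList<>(committed);
    }
}
```

Output written after the last completed checkpoint is discarded on `abort`, so replaying the input after recovery does not produce duplicates in the visible results.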