12. Design Principles
human fault-tolerance
– the system must not be susceptible to data loss or data corruption
caused by human error, because at scale such damage can be irreparable.
data immutability
– store data in its rawest form, immutable and in perpetuity (INSERT /
SELECT / DELETE, but no UPDATE!).
recomputation
– with the two principles above, it is always possible to (re)compute
results by running a function on the raw data.
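The immutability and recomputation principles can be illustrated with a minimal sketch. The names `PurchaseEvent`, `RawLog`, and `total_per_user` are hypothetical and chosen only for this example; the log is an in-memory list standing in for whatever raw store a real system would use.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)  # frozen: a stored record cannot be modified afterwards
class PurchaseEvent:
    user: str
    amount: int


class RawLog:
    """Hypothetical append-only store: records are added and read, never updated."""

    def __init__(self) -> None:
        self._events: List[PurchaseEvent] = []

    def append(self, event: PurchaseEvent) -> None:
        self._events.append(event)

    def scan(self) -> Tuple[PurchaseEvent, ...]:
        # Return an immutable copy so callers cannot alter the log.
        return tuple(self._events)


def total_per_user(events: Iterable[PurchaseEvent]) -> Dict[str, int]:
    """A derived result is a pure function of the raw data."""
    totals: Dict[str, int] = {}
    for e in events:
        totals[e.user] = totals.get(e.user, 0) + e.amount
    return totals


log = RawLog()
log.append(PurchaseEvent("alice", 10))
log.append(PurchaseEvent("bob", 5))
log.append(PurchaseEvent("alice", 7))

# If the view is lost or the function had a bug, fix it and run it again.
view = total_per_user(log.scan())
print(view)
```

Because the raw events are never changed, a buggy `total_per_user` can be corrected and rerun over the same log without losing anything.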
30. Stream Processing Model
                          One at a time   Micro batch
Low latency               Y               N
High throughput           N               Y
At-least-once             Y               Y
Exactly-once              Sometimes       Y
Simple programming model  Y               N
31. Limitations of Stream Computing
• Queries must be written before the data arrives
• There should be another way to query past data
• Queries cannot be run twice
• All results are lost when an error occurs
• The data is already gone by the time a bug is found
• Out-of-order events break results
• Should queries be based on recorded (event) time or on arrival time?
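The last two points can be made concrete with a small sketch. It counts the same events in 10-second tumbling windows, once keyed by recorded time and once by arrival time. `TimedEvent`, `count_per_window`, and the sample timestamps are hypothetical and invented for this illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class TimedEvent(NamedTuple):
    event_time: int    # seconds: when the event was recorded at the source
    arrival_time: int  # seconds: when it reached the stream processor


def count_per_window(timestamps: Iterable[int], size: int) -> Dict[int, int]:
    """Count timestamps per tumbling window, keyed by the window start."""
    return dict(Counter((t // size) * size for t in timestamps))


# The third event was recorded at t=9 but only arrived at t=12.
events = [TimedEvent(1, 2), TimedEvent(8, 9), TimedEvent(9, 12), TimedEvent(11, 13)]

by_event_time = count_per_window((e.event_time for e in events), size=10)
by_arrival_time = count_per_window((e.arrival_time for e in events), size=10)

print(by_event_time)    # {0: 3, 10: 1}
print(by_arrival_time)  # {0: 2, 10: 2}
```

The delayed event falls into a different window depending on which clock the query uses, so the two queries disagree even though they saw identical input.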
36. Fault Tolerance in Stream Processing
• At least once: ensure all operators see all events
  – Stream -> replay on failure
• Exactly once:
  – Flink: distributed snapshots
  – Spark: micro-batches
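The difference between the two guarantees can be sketched with a toy, single-process simulation. It is not the Flink or Spark API and does not implement Flink's distributed snapshot algorithm; `run`, `checkpoint_every`, `fail_at`, and `restore_state` are hypothetical names for this example. The operator sums a stream, crashes once, and replays from the last checkpointed offset. Replaying without restoring state gives at-least-once results; restoring the state saved together with the offset gives an exactly-once result.

```python
from typing import List, Optional, Tuple


def run(events: List[int], checkpoint_every: int,
        fail_at: Optional[int], restore_state: bool) -> int:
    """Sum `events`, simulating one crash just before index `fail_at`."""
    total = 0
    offset = 0
    snapshot: Tuple[int, int] = (0, 0)  # (next offset to read, total so far)
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            offset = snapshot[0]          # replay from the last checkpoint
            if restore_state:
                total = snapshot[1]       # roll state back with the offset
            continue
        total += events[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            snapshot = (offset, total)    # offset and state saved together
    return total


stream = [1, 2, 3, 4, 5]

# Replay only: every event is seen, but event 3 is counted twice.
at_least_once = run(stream, checkpoint_every=2, fail_at=3, restore_state=False)

# Replay plus state restore: the result matches a failure-free run.
exactly_once = run(stream, checkpoint_every=2, fail_at=3, restore_state=True)

print(at_least_once, exactly_once, sum(stream))  # 18 15 15
```

The key design point is that the offset and the operator state are captured in the same snapshot, so rolling back one without the other is what produces duplicates.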