Flink meetup
1. Get your hands on implementing a Flink app: A tutorial
Christos Hadjinikolis & Satyasheel | DataReply.uk
2. Tutorial Overview:
What is Apache Flink?
Why Flink?
Processing both bounded and unbounded data!
Anatomy of a Flink App
Windowing in Flink
Event time & processing time in Flink
2/22/17 C. Hadjinikolis & Satyasheel | DataReply
3. What is Apache Flink?
“A distributed data processing platform…”
4. Flink is a distributed stream- & batch-data processing platform
Stream processing
…the real-time processing of data continuously, concurrently, and in a record-by-record
fashion, where data is not static.
Batch processing
…the execution of a series of programs each on a set or "batch" of static inputs, rather
than a single input (which would instead be a custom job).
5. …distributed processing dataset types
Unbounded
Infinite datasets that are appended to continuously:
End users interacting with mobile or web applications
Physical sensors providing measurements
Financial markets
Machine log data
Surveillance camera frames
7. Why Flink?
“The world is turning more and more towards stream processing…”
8. Opt for Flink because it:
Provides results that are accurate
Is stateful and fault-tolerant and can seamlessly
recover from failures
Performs at large scale
9. …exactly-once semantics
Stateful
… apps can maintain summaries of
processed data.
Checkpointing
… a mechanism that ensures that, in the event of failure, no duplicate re-computation of an event will take place.
10. …event time semantics
…event-time-based windowing
Event time makes it easy to compute accurate results over streams where events arrive out
of order and where events may arrive delayed.
11. …flexible windowing
Windows can be customized with flexible triggering conditions to
support sophisticated streaming patterns based on:
Time;
Count; and
Sessions.
12. …lightweight fault tolerance
Recovers from failures with zero
data loss while the tradeoff
between reliability and latency is
negligible.
13. …lightweight fault tolerance
Savepoints
Provide a state versioning mechanism.
Applications can update and reprocess historic
data with no lost state.
14. …Scalable
Designed to run on large-scale clusters with many thousands of nodes.
15. So, in summary…
Flink is an open-source stream processing framework, which:
Eliminates the “performance vs. reliability” tradeoff, and;
Performs consistently in both categories.
16. Processing both bounded & unbounded data!
“Unbounding the boundaries…”
17. …the streaming model & bounded datasets
DataStream API: unbounded data
DataSet API: bounded data
A bounded dataset is handled inside of Flink as a “finite stream”, with only a few minor differences in how Flink manages bounded vs. unbounded datasets.
18. Anatomy of a Flink App
“Let’s get this started…”
19. …Flink programs transform collections of data
Each program consists of the same basic parts:
Obtain an execution environment,
Load/create the initial data,
Specify transformations on this data,
Specify where to put the results of your computations
Trigger the program execution
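These five parts can be mimicked with a toy, plain-Python pipeline. All names below (MiniEnv, from_elements, print_sink) are invented for illustration; a real Flink app would obtain a StreamExecutionEnvironment and use the DataStream or DataSet API.

```python
class MiniEnv:
    """Stands in for Flink's execution environment (illustration only)."""

    def __init__(self):
        self.plan = []  # transformations are only *recorded* here

    def from_elements(self, *items):   # 2. load/create the initial data
        self.plan.append(("source", list(items)))
        return self

    def map(self, fn):                 # 3. specify transformations
        self.plan.append(("map", fn))
        return self

    def print_sink(self, out):         # 4. specify where to put the results
        self.plan.append(("sink", out))
        return self

    def execute(self):                 # 5. trigger the program execution
        data, out = [], None
        for kind, arg in self.plan:
            if kind == "source":
                data = arg
            elif kind == "map":
                data = [arg(x) for x in data]
            elif kind == "sink":
                out = arg
        out.extend(data)


env = MiniEnv()                        # 1. obtain an execution environment
results = []
env.from_elements(1, 2, 3).map(lambda x: x * 10).print_sink(results)
env.execute()                          # nothing runs until this call
print(results)  # [10, 20, 30]
```

Note that the toy also reflects Flink's lazy model: every call before execute() merely appends to a plan.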
21. …Lazy evaluation
When the program’s main method is executed:
Each operation is created and added to the program’s plan.
Execution is explicitly triggered by an execute() call.
This helps with constructing an optimised data-flow as a holistically planned unit.
22. Let’s take 15 mins
…
23. Windowing in Flink
“…a simple word count app.”
24. …so what is a window?
A window is a way to get a {snapshot} of the streaming data.
A {snapshot} can be based on time or other variables.
One can define the window based on the number of records or other stream-specific variables.
25. …enough with theory! Give us some code!
A streaming word count example with no windowing
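The example shown live was Flink code; as a hedged, plain-Python rendition of the same idea, here is a running word count whose state is updated on every incoming record, with no window or batch boundary. (A real Flink version would use keyBy plus a stateful operator on a DataStream; everything below is invented for illustration.)

```python
from collections import Counter


def streaming_word_count(lines):
    """Consume lines one by one; emit the updated count after each word."""
    counts = Counter()   # the running "state"
    emitted = []
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1                  # state updated per record
            emitted.append((word, counts[word]))
    return emitted


stream = ["to be", "or not to be"]
print(streaming_word_count(stream))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 2), ('be', 2)]
```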
26. …updating states
Flink automatically updates its state without the user explicitly doing so.
To better appreciate this, it is worth contrasting Flink with Spark.
Spark relies on micro-batches:
This means one has to define the batch size, either in terms of time or size.
Flink does not require defining a batch size.
It can process each and every new event individually
(it is true stream processing!)
27. Let’s see an example
…
28. Windowing in Flink
“Don't waste a minute not being happy. If one window closes, run to the next window - or break down a door. …”
29. …so why use windowing at all?
Aggregation on a DataStream is different from aggregation on a DataSet.
One cannot count all records on an infinite stream.
DataStream aggregation makes sense on a windowed stream.
30. …what types of windowing can you use?
Tumbling Windows:
Aligned, fixed-length, non-overlapping windows.
Sliding Windows:
Aligned, fixed-length, overlapping windows.
Session Windows:
Non-aligned, variable-length windows.
Count Windows:
A fixed number of records/events, non-overlapping windows.
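The time-based assigners above boil down to simple window-start arithmetic, sketched here in plain Python (function names are made up; in Flink this logic lives inside assigners such as TumblingEventTimeWindows and SlidingEventTimeWindows):

```python
def tumbling_window(ts, size):
    """The single [start, start + size) window an element at timestamp ts falls in."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """All overlapping [start, start + size) windows that contain ts."""
    last_start = ts - (ts % slide)
    # walk backwards by `slide` while the window still covers ts
    return [(s, s + size) for s in range(last_start, ts - size, -slide)]


print(tumbling_window(12, 5))      # (10, 15)
print(sliding_windows(12, 10, 5))  # [(10, 20), (5, 15)]
```

With size 10 and slide 5, each element belongs to two windows at once, which is exactly why sliding windows overlap.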
31. …anatomy of the window API
3 window functions:
Window Assigner:
Responsible for assigning a given element to a window.
Depending on the definition of the window, one element can belong to one or more windows at a time.
Trigger:
Defines the condition for triggering window evaluation.
This function controls when a given window created by the window assigner is evaluated.
Evictor:
An optional function which defines the preprocessing before firing window operations.
32. …understanding count windows
Window Assigner (user-defined, for count-based windows)
No start or end to the window; therefore the window is non-time-based.
For these windows we use the GlobalWindows window assigner.
For a given key, all key-values are filled into the same window.
keyValue.window(GlobalWindows.create())
The window API allows us to add the window assigner to the window.
Every window assigner has a default trigger:
for global windows that trigger is NeverTrigger, which never fires;
so, this window assigner has to be used with a custom trigger.
33. …understanding count windows
Count trigger
Once we have the window assigner, we have to define when the window needs to be triggered, for example:
trigger(CountTrigger.of(2))
This results in the window being evaluated every two records.
Evictor
In addition to these, an evictor can be used for further preprocessing tasks before firing a window operation, e.g. to remove every 3rd element of a window.
Some default evictors:
CountEvictor, DeltaEvictor, TimeEvictor
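Putting the two together, here is a plain-Python sketch of the behaviour GlobalWindows + CountTrigger.of(n) produces (not Flink code; the function name and data shapes are invented for illustration): per key, elements accumulate in a never-ending window, and an evaluation fires every n elements.

```python
from collections import defaultdict


def count_windows(stream, n):
    """stream: iterable of (key, value); yields (key, values) every n values per key."""
    buffers = defaultdict(list)  # the per-key "global window" contents
    fired = []
    for key, value in stream:
        buffers[key].append(value)
        if len(buffers[key]) == n:        # the count trigger fires here
            fired.append((key, buffers[key]))
            buffers[key] = []             # window content purged after firing
    return fired


events = [("a", 1), ("b", 10), ("a", 2), ("a", 3), ("b", 20)]
print(count_windows(events, 2))  # [('a', [1, 2]), ('b', [10, 20])]
```

Note that ("a", 3) stays buffered: without a second "a" record, the trigger never fires for it, which mirrors why GlobalWindows needs a custom trigger at all.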
34. The anatomy of a window API
…
37. Let’s take 15 mins
…
38. Timing in Flink
“The two most powerful warriors are patience and time.”
39. …the time concept in streaming
A streaming application is an always-running application.
…we need to take snapshots of the stream at various points.
…these points can be defined using a time component.
…we can group and correlate different events happening in the stream.
Some constructs, like windows, heavily use the time component.
Most streaming frameworks support a single meaning of time, which is mostly tied to processing time.
40. …time in Flink
When we say the last “t” seconds, what do we mean exactly? Well, in Flink it’s one of three things:
Processing Time
“…the records that arrived in the last ‘t’ seconds for processing.”
Event Time
“…all the records generated in those last ‘t’ seconds at the source.”
Ingestion Time
The time when events are ingested into the system.
This time sits between event time and processing time.
Before we go into detail about Flink, let’s review at a higher level the types of datasets you’re likely to encounter when processing data, as well as the types of execution models you can choose for processing. These two ideas are often conflated, and it’s useful to clearly separate them.
First, 2 types of datasets
Unbounded: Infinite datasets that are appended to continuously
Bounded: Finite, unchanging datasets
Many real-world datasets that are traditionally thought of as bounded or “batch” data are in reality unbounded datasets. This is true whether the data is stored in a sequence of directories on HDFS or in a log-based system like Apache Kafka.
Examples of unbounded datasets include but are not limited to:
End users interacting with mobile or web applications
Physical sensors providing measurements
Financial markets
Machine log data
We have all interacted with bounded datasets on our machines: pictures, documents of any kind, database tables, etc.
Second, 2 types of execution models
Streaming: Processing that executes continuously as long as data is being produced
Batch: Processing that is executed and runs to completeness in a finite amount of time, releasing computing resources when finished
It’s possible, though not necessarily optimal, to process either type of dataset with either type of execution model. For instance, batch execution has long been applied to unbounded datasets despite potential problems with windowing, state management, and out-of-order data.
Flink relies on a streaming execution model, which is an intuitive fit for processing unbounded datasets: streaming execution is continuous processing on data that is continuously produced. And alignment between the type of dataset and the type of execution model offers many advantages with regard to accuracy and performance.
Earlier, we discussed aligning the type of dataset (bounded vs. unbounded) with the type of execution model (batch vs. streaming). Many of the Flink features listed below–state management, handling of out-of-order data, flexible windowing–are essential for computing accurate results on unbounded datasets and are enabled by Flink’s streaming execution model.
Flink guarantees exactly-once semantics for stateful computations. ‘Stateful’ means that applications can maintain an aggregation or summary of data that has been processed over time, and Flink’s checkpointing mechanism ensures exactly-once semantics for an application’s state in the event of a failure.
Flink supports stream processing and windowing with event time semantics. Event time makes it easy to compute accurate results over streams where events arrive out of order and where events may arrive delayed.
Flink supports flexible windowing based on time, count, or sessions in addition to data-driven windows. Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns. Flink’s windowing makes it possible to model the reality of the environment in which data is created.
… allows the system to maintain high throughput rates and provide exactly-once consistency guarantees at the same time. Flink recovers from failures with zero data loss while the tradeoff between reliability and latency is negligible.
Flink’s savepoints provide a state versioning mechanism, making it possible to update applications or reprocess historic data with no lost state and minimal downtime.
Flink is designed to run on large-scale clusters with many thousands of nodes, and in addition to a standalone cluster mode, Flink provides support for YARN and Mesos.
In summary, Apache Flink is an open-source stream processing framework that eliminates the “performance vs. reliability” tradeoff often associated with open-source streaming engines and performs consistently in both categories.
Earlier in this write-up, we introduced the streaming execution model (“processing that executes continuously, an event-at-a-time”) as an intuitive fit for unbounded datasets. So how do bounded datasets relate to the stream processing paradigm?
In Flink’s case, the relationship is quite natural. A bounded dataset can simply be treated as a special case of an unbounded one, so it’s possible to apply all of the same streaming concepts that we’ve laid out above to finite data.
This is exactly how Flink’s DataSet API behaves. A bounded dataset is handled inside of Flink as a “finite stream”, with only a few minor differences in how Flink manages bounded vs. unbounded datasets.
And so it’s possible to use Flink to process both bounded and unbounded data, with both APIs running on the same distributed streaming execution engine–a simple yet powerful architecture.
Lazy Evaluation
All Flink programs are executed lazily: When the program’s main method is executed, the data loading and transformations do not happen directly. Rather, each operation is created and added to the program’s plan. The operations are actually executed when the execution is explicitly triggered by an execute() call on the execution environment. Whether the program is executed locally or on a cluster depends on the type of execution environment.
The lazy evaluation lets you construct sophisticated programs that Flink executes as one holistically planned unit.
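A minimal plain-Python illustration of the same idea (not Flink's API; all names invented): building a plan only records operations, and the work happens at execute().

```python
def build_plan():
    """Record two transformations without running either of them."""
    plan = []
    plan.append(lambda data: [x + 1 for x in data])  # recorded, not run
    plan.append(lambda data: [x * 2 for x in data])  # recorded, not run
    return plan


def execute(plan, data):
    """Only now do the planned operations actually run, in order."""
    for op in plan:
        data = op(data)
    return data


plan = build_plan()
print(len(plan))                 # 2 operations planned, nothing computed yet
print(execute(plan, [1, 2, 3]))  # [4, 6, 8]
```

Because the whole plan is known before execution starts, an engine like Flink can optimise it as one unit, exactly as the notes above describe.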
For example, if we create a window of 5 seconds, then it will contain all the records which arrived in that time frame.
Why do we need windowing?
Aggregation on a DataStream is different from aggregation on a DataSet; one cannot count all records on an infinite stream.
DataStream aggregation makes sense on a windowed stream.
In Spark, after each batch, the state has to be updated explicitly if you want to keep track of word counts across batches. But in Flink, the state is updated implicitly as and when new records arrive.
Most of the window operations are encouraged to be used on a KeyedDataStream. A KeyedDataStream is a datastream which is partitioned by key. This partitioning by key allows windows to be distributed across machines, resulting in good performance.
Evictor: e.g. removing the third element in a count window of 10 elements…
**CountEvictor:** keeps up to a user-specified number of elements from the window and discards the remaining ones from the beginning of the window buffer.
**DeltaEvictor:** takes a DeltaFunction and a threshold, computes the delta between the last element in the window buffer and each of the remaining ones, and removes the ones with a delta greater or equal to the threshold.
**TimeEvictor:** takes as argument an interval in milliseconds and for a given window, it finds the maximum timestamp max_ts among its elements and removes all the elements with timestamps smaller than max_ts - interval.
**Note:** All evictors apply their logic before the window function.
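As a sketch of the TimeEvictor rule quoted above (pure Python, not Flink's Evictor interface; the function name is invented): keep only elements whose timestamp is at least max_ts - interval, applied before the window function runs.

```python
def time_evict(elements, interval_ms):
    """elements: list of (timestamp_ms, value); returns the surviving elements."""
    if not elements:
        return []
    max_ts = max(ts for ts, _ in elements)       # newest timestamp in the window
    # drop everything older than max_ts - interval_ms
    return [(ts, v) for ts, v in elements if ts >= max_ts - interval_ms]


window = [(1000, "a"), (1500, "b"), (2600, "c"), (3000, "d")]
print(time_evict(window, 1000))  # [(2600, 'c'), (3000, 'd')]
```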
In Flink it depends, and it could be one of the following three.
Processing Time: Most streaming applications use this concept, and it is the one most familiar to users. This time is tracked using a clock run by the processing engine. So, last “t” seconds means the records that arrived in the last “t” seconds for processing.
Processing time is a very good way of keeping track of time, but not always helpful. Let’s say we want to measure the state of a sensor at a given point in time, so we want to collect the events at that time. But if the events arrive late to the processing system, for various reasons, we may miss some of them, as the processing clock does not care about the actual time of events. To address this, Flink supports another kind of time, called event time.
Event Time: This time is embedded in the data; it comes with the data. So here, last “t” seconds means all the records generated in those last “t” seconds at the source. These may arrive out of order for processing. This time is independent of the clock kept by the processing engine. Event time is extremely useful for handling late-arriving events.
Ingestion Time: Ingestion time is the time when events are ingested into the system. This time sits between event time and processing time. Normally with processing time, each machine in the cluster uses its own clock to assign timestamps to track events. This may result in a slightly inconsistent view of the data, as there may be time differences across the cluster. With ingestion time, the timestamp is assigned at ingestion, so that all machines in the cluster have exactly the same view. It is useful for calculating results on data that arrive in order at the level of ingestion.
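To see concretely why event time matters, here is a plain-Python sketch (invented names, not Flink code) that groups the same out-of-order records into 10-second tumbling windows keyed by event time, so late arrivals still land in the right window:

```python
def by_event_time(records, size):
    """records: (event_ts, value) pairs in arrival order; group by event-time window start."""
    windows = {}
    for ts, value in records:
        start = ts - (ts % size)               # event-time window the record belongs to
        windows.setdefault(start, []).append(value)
    return windows


# arrival order != event order: the event stamped 3 arrives last (late)
records = [(12, "c"), (15, "d"), (9, "b"), (3, "a")]
print(by_event_time(records, 10))  # {10: ['c', 'd'], 0: ['b', 'a']}
```

Grouping by arrival (processing) time would instead have placed "b" and "a" with whatever happened to be processing at that moment, giving a different, less accurate result.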