Flink meetup
1. Get your hands on implementing a Flink app: A tutorial
Christos Hadjinikolis & Satyasheel | DataReply.uk
2. Tutorial Overview:
What is Apache Flink?
Why Flink?
Processing both bounded and unbounded data!
Anatomy of a Flink App
Windowing in Flink
Event time & processing time in Flink
2/22/17 C. Hadjinikolis & Satyasheel | DataReply
3. What is Apache Flink?
“A distributed data processing platform…”
4. Flink is a distributed stream- & batch-data processing platform
Stream processing
…the real-time processing of data continuously, concurrently, and in a record-by-record
fashion, where data is not static.
Batch processing
…the execution of a series of programs each on a set or "batch" of static inputs, rather
than a single input (which would instead be a custom job).
5. …distributed processing dataset types
Unbounded
Infinite datasets that are appended to continuously:
End users interacting with mobile or web applications
Physical sensors providing measurements
Financial markets
Machine log data
Surveillance camera frames
7. Why Flink?
“The world is turning more and more towards stream processing…”
8. Opt for Flink because it:
Provides results that are accurate
Is stateful and fault-tolerant and can seamlessly
recover from failures
Performs at large scale
9. …exactly-once semantics
Stateful
… apps can maintain summaries of
processed data.
Checkpointing
… a mechanism that ensures that, in the event of failure, no duplicate re-computation of an event will take place.
10. …event time semantics
…event-time-based windowing
Event time makes it easy to compute accurate results over streams where events arrive out
of order and where events may arrive delayed.
11. …flexible windowing
Windows can be customized with flexible triggering conditions to
support sophisticated streaming patterns based on:
Time;
Count; and
Sessions.
12. …lightweight fault tolerance
Recovers from failures with zero
data loss while the tradeoff
between reliability and latency is
negligible.
13. …lightweight fault tolerance
Savepoints
Provide a state versioning mechanism.
Applications can update and reprocess historic
data with no lost state.
14. …Scalable
Designed to run on large-scale clusters with many thousands of nodes.
15. So, in summary…
Flink is an open-source stream processing framework, which:
Eliminates the “performance vs. reliability” tradeoff, and;
Performs consistently in both categories.
16. Processing both bounded & unbounded data!
“Unbounding the boundaries…”
17. …the streaming model & bounded datasets
DataStream API: unbounded data
DataSet API: bounded data
A bounded dataset is handled inside of Flink as a “finite stream”, with only a few minor differences in how Flink manages bounded vs. unbounded datasets.
18. Anatomy of a Flink App
“Let’s get this started…”
19. …Flink programs transform collections of data
Each program consists of the same basic parts:
Obtain an execution environment,
Load/create the initial data,
Specify transformations on this data,
Specify where to put the results of your computations
Trigger the program execution
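These five parts can be mimicked with a toy, plain-Python pipeline. All names below (MiniEnv, from_elements, print_sink) are invented for illustration; a real Flink app would obtain a StreamExecutionEnvironment and use the DataStream or DataSet API.

```python
class MiniEnv:
    """Stands in for Flink's execution environment (illustration only)."""

    def __init__(self):
        self.plan = []  # transformations are only *recorded* here

    def from_elements(self, *items):   # 2. load/create the initial data
        self.plan.append(("source", list(items)))
        return self

    def map(self, fn):                 # 3. specify transformations
        self.plan.append(("map", fn))
        return self

    def print_sink(self, out):         # 4. specify where to put the results
        self.plan.append(("sink", out))
        return self

    def execute(self):                 # 5. trigger the program execution
        data, out = [], None
        for kind, arg in self.plan:
            if kind == "source":
                data = arg
            elif kind == "map":
                data = [arg(x) for x in data]
            elif kind == "sink":
                out = arg
        out.extend(data)


env = MiniEnv()                        # 1. obtain an execution environment
results = []
env.from_elements(1, 2, 3).map(lambda x: x * 10).print_sink(results)
env.execute()                          # nothing runs until this call
print(results)  # [10, 20, 30]
```

Note that the toy also reflects Flink's lazy model: every call before execute() merely appends to a plan.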
21. …Lazy evaluation
When the program’s main method is executed:
Each operation is created and added to the program’s plan.
Execution is explicitly triggered by an execute() call.
This helps with constructing an optimised data-flow as a holistically planned unit.
22. Let’s take 15 mins
…
23. Windowing in Flink
“…a simple word count app.”
24. …so what is a window?
A window is a way to get a {snapshot} of the streaming data.
A {snapshot} can be based on time or other variables.
One can define the window based on the number of records or other stream-specific variables.
25. …enough with theory! Give us some code!
A streaming word count example with no windowing
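The example shown live was Flink code; as a hedged, plain-Python rendition of the same idea, here is a running word count whose state is updated on every incoming record, with no window or batch boundary. (A real Flink version would use keyBy plus a stateful operator on a DataStream; everything below is invented for illustration.)

```python
from collections import Counter


def streaming_word_count(lines):
    """Consume lines one by one; emit the updated count after each word."""
    counts = Counter()   # the running "state"
    emitted = []
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1                  # state updated per record
            emitted.append((word, counts[word]))
    return emitted


stream = ["to be", "or not to be"]
print(streaming_word_count(stream))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 2), ('be', 2)]
```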
26. …updating states
Flink automatically updates its state without the user explicitly doing so.
To better appreciate this, it is worth contrasting Flink with Spark.
Spark relies on micro-batches:
This means one has to define the batch size, either in terms of time or size.
Flink does not require defining a batch size.
It can process each and every new event individually
(it is true stream processing!)
27. Let’s see an example
…
28. Windowing in Flink
“Don't waste a minute not being happy. If one window closes, run to the next window - or break down a door. …”
29. …so why use windowing at all?
Aggregation on a DataStream is different from aggregation on a DataSet.
One cannot count all records on an infinite stream.
DataStream aggregation makes sense on a windowed stream.
30. …what types of windowing can you use?
Tumbling Windows:
Aligned, fixed-length, non-overlapping windows.
Sliding Windows:
Aligned, fixed-length, overlapping windows.
Session Windows:
Non-aligned, variable-length windows.
Count Windows:
A fixed number of records/events, non-overlapping windows.
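The time-based assigners above boil down to simple window-start arithmetic, sketched here in plain Python (function names are made up; in Flink this logic lives inside assigners such as TumblingEventTimeWindows and SlidingEventTimeWindows):

```python
def tumbling_window(ts, size):
    """The single [start, start + size) window an element at timestamp ts falls in."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """All overlapping [start, start + size) windows that contain ts."""
    last_start = ts - (ts % slide)
    # walk backwards by `slide` while the window still covers ts
    return [(s, s + size) for s in range(last_start, ts - size, -slide)]


print(tumbling_window(12, 5))      # (10, 15)
print(sliding_windows(12, 10, 5))  # [(10, 20), (5, 15)]
```

With size 10 and slide 5, each element belongs to two windows at once, which is exactly why sliding windows overlap.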
31. …anatomy of the window API
3 window functions:
Window Assigner:
Responsible for assigning a given element to a window.
Depending on the definition of the window, one element can belong to one or more windows at a time.
Trigger:
Defines the condition for triggering window evaluation.
This function controls when a given window created by the window assigner is evaluated.
Evictor:
An optional function which defines the preprocessing before firing window operations.
32. …understanding count windows
Window Assigner (user-defined, for count-based windows)
No start or end to the window; therefore the window is non-time-based.
For these windows we use the GlobalWindows window assigner.
For a given key, all key-values are filled into the same window.
keyValue.window(GlobalWindows.create())
The window API allows us to add the window assigner to the window.
Every window assigner has a default trigger:
for global windows that trigger is NeverTrigger, which never fires;
so, this window assigner has to be used with a custom trigger.
33. …understanding count windows
Count trigger
Once we have the window assigner, we have to define when the window needs to be triggered, for example:
trigger(CountTrigger.of(2))
This results in the window being evaluated every two records.
Evictor
In addition to these, an evictor can be used for further preprocessing tasks before firing a window operation, e.g. to remove every 3rd element of a window.
Some default evictors:
CountEvictor, DeltaEvictor, TimeEvictor
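Putting the two together, here is a plain-Python sketch of the behaviour GlobalWindows + CountTrigger.of(n) produces (not Flink code; the function name and data shapes are invented for illustration): per key, elements accumulate in a never-ending window, and an evaluation fires every n elements.

```python
from collections import defaultdict


def count_windows(stream, n):
    """stream: iterable of (key, value); yields (key, values) every n values per key."""
    buffers = defaultdict(list)  # the per-key "global window" contents
    fired = []
    for key, value in stream:
        buffers[key].append(value)
        if len(buffers[key]) == n:        # the count trigger fires here
            fired.append((key, buffers[key]))
            buffers[key] = []             # window content purged after firing
    return fired


events = [("a", 1), ("b", 10), ("a", 2), ("a", 3), ("b", 20)]
print(count_windows(events, 2))  # [('a', [1, 2]), ('b', [10, 20])]
```

Note that ("a", 3) stays buffered: without a second "a" record, the trigger never fires for it, which mirrors why GlobalWindows needs a custom trigger at all.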
34. The anatomy of a window API
…
37. Let’s take 15 mins
…
38. Timing in Flink
“The two most powerful warriors are patience and time.”
39. …the time concept in streaming
A streaming application is an always-running application.
…we need to take snapshots of the stream at various points.
…these points can be defined using a time component.
…we can group and correlate different events happening in the stream.
Some constructs, like windows, heavily use the time component.
Most streaming frameworks support a single meaning of time, which is mostly tied to processing time.
40. …time in Flink
When we say the last “t” seconds, what do we mean exactly? Well, in Flink it’s one of three things:
Processing Time
“…the records that arrived in the last ‘t’ seconds for processing.”
Event Time
“…all the records generated in those last ‘t’ seconds at the source.”
Ingestion Time
The time when events are ingested into the system.
This time sits between event time and processing time.
Before we go into detail about Flink, let’s review at a higher level the types of datasets you’re likely to encounter when processing data, as well as the types of execution models you can choose for processing. These two ideas are often conflated, and it’s useful to clearly separate them.
First, 2 types of datasets
Unbounded: Infinite datasets that are appended to continuously
Bounded: Finite, unchanging datasets
Many real-world datasets that are traditionally thought of as bounded or “batch” data are in reality unbounded datasets. This is true whether the data is stored in a sequence of directories on HDFS or in a log-based system like Apache Kafka.
Examples of unbounded datasets include but are not limited to:
End users interacting with mobile or web applications
Physical sensors providing measurements
Financial markets
Machine log data
We have all interacted with bounded datasets on our machines: pictures, documents of any kind, database tables, etc.
Second, 2 types of execution models
Streaming: Processing that executes continuously as long as data is being produced
Batch: Processing that is executed and runs to completeness in a finite amount of time, releasing computing resources when finished
It’s possible, though not necessarily optimal, to process either type of dataset with either type of execution model. For instance, batch execution has long been applied to unbounded datasets despite potential problems with windowing, state management, and out-of-order data.
Flink relies on a streaming execution model, which is an intuitive fit for processing unbounded datasets: streaming execution is continuous processing on data that is continuously produced. And alignment between the type of dataset and the type of execution model offers many advantages with regard to accuracy and performance.
Earlier, we discussed aligning the type of dataset (bounded vs. unbounded) with the type of execution model (batch vs. streaming). Many of the Flink features listed below–state management, handling of out-of-order data, flexible windowing–are essential for computing accurate results on unbounded datasets and are enabled by Flink’s streaming execution model.
Flink guarantees exactly-once semantics for stateful computations. ‘Stateful’ means that applications can maintain an aggregation or summary of data that has been processed over time, and Flink’s checkpointing mechanism ensures exactly-once semantics for an application’s state in the event of a failure.
Flink supports stream processing and windowing with event time semantics. Event time makes it easy to compute accurate results over streams where events arrive out of order and where events may arrive delayed.
Flink supports flexible windowing based on time, count, or sessions in addition to data-driven windows. Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns. Flink’s windowing makes it possible to model the reality of the environment in which data is created.
… allows the system to maintain high throughput rates and provide exactly-once consistency guarantees at the same time. Flink recovers from failures with zero data loss while the tradeoff between reliability and latency is negligible.
Flink’s savepoints provide a state versioning mechanism, making it possible to update applications or reprocess historic data with no lost state and minimal downtime.
Flink is designed to run on large-scale clusters with many thousands of nodes, and in addition to a standalone cluster mode, Flink provides support for YARN and Mesos.
In summary, Apache Flink is an open-source stream processing framework that eliminates the “performance vs. reliability” tradeoff often associated with open-source streaming engines and performs consistently in both categories.
Earlier in this write-up, we introduced the streaming execution model (“processing that executes continuously, an event-at-a-time”) as an intuitive fit for unbounded datasets. So how do bounded datasets relate to the stream processing paradigm?
In Flink’s case, the relationship is quite natural. A bounded dataset can simply be treated as a special case of an unbounded one, so it’s possible to apply all of the same streaming concepts that we’ve laid out above to finite data.
This is exactly how Flink’s DataSet API behaves. A bounded dataset is handled inside of Flink as a “finite stream”, with only a few minor differences in how Flink manages bounded vs. unbounded datasets.
And so it’s possible to use Flink to process both bounded and unbounded data, with both APIs running on the same distributed streaming execution engine–a simple yet powerful architecture.
Lazy Evaluation
All Flink programs are executed lazily: When the program’s main method is executed, the data loading and transformations do not happen directly. Rather, each operation is created and added to the program’s plan. The operations are actually executed when the execution is explicitly triggered by an execute() call on the execution environment. Whether the program is executed locally or on a cluster depends on the type of execution environment.
The lazy evaluation lets you construct sophisticated programs that Flink executes as one holistically planned unit.
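A minimal plain-Python illustration of the same idea (not Flink's API; all names invented): building a plan only records operations, and the work happens at execute().

```python
def build_plan():
    """Record two transformations without running either of them."""
    plan = []
    plan.append(lambda data: [x + 1 for x in data])  # recorded, not run
    plan.append(lambda data: [x * 2 for x in data])  # recorded, not run
    return plan


def execute(plan, data):
    """Only now do the planned operations actually run, in order."""
    for op in plan:
        data = op(data)
    return data


plan = build_plan()
print(len(plan))                 # 2 operations planned, nothing computed yet
print(execute(plan, [1, 2, 3]))  # [4, 6, 8]
```

Because the whole plan is known before execution starts, an engine like Flink can optimise it as one unit, exactly as the notes above describe.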
For example, if we create a window of 5 seconds, then it will contain all the records which arrived in that time frame.
Why do we need windowing?
Aggregation on a DataStream is different from aggregation on a DataSet; one cannot count all records on an infinite stream.
DataStream aggregation makes sense on a windowed stream.
In Spark, after each batch, the state has to be updated explicitly if you want to keep track of word counts across batches. But in Flink, the state is updated implicitly as and when new records arrive.
Most of the window operations are encouraged to be used on a KeyedDataStream. A KeyedDataStream is a datastream which is partitioned by key. This partitioning by key allows windows to be distributed across machines, resulting in good performance.
Evictor: e.g. removing the third element in a count window of 10 elements…
**CountEvictor:** keeps up to a user-specified number of elements from the window and discards the remaining ones from the beginning of the window buffer.
**DeltaEvictor:** takes a DeltaFunction and a threshold, computes the delta between the last element in the window buffer and each of the remaining ones, and removes the ones with a delta greater or equal to the threshold.
**TimeEvictor:** takes as argument an interval in milliseconds and for a given window, it finds the maximum timestamp max_ts among its elements and removes all the elements with timestamps smaller than max_ts - interval.
**Note:** All evictors apply their logic before the window function.
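As a sketch of the TimeEvictor rule quoted above (pure Python, not Flink's Evictor interface; the function name is invented): keep only elements whose timestamp is at least max_ts - interval, applied before the window function runs.

```python
def time_evict(elements, interval_ms):
    """elements: list of (timestamp_ms, value); returns the surviving elements."""
    if not elements:
        return []
    max_ts = max(ts for ts, _ in elements)       # newest timestamp in the window
    # drop everything older than max_ts - interval_ms
    return [(ts, v) for ts, v in elements if ts >= max_ts - interval_ms]


window = [(1000, "a"), (1500, "b"), (2600, "c"), (3000, "d")]
print(time_evict(window, 1000))  # [(2600, 'c'), (3000, 'd')]
```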
In Flink it depends, and it could be one of the following three.
Processing Time: Most streaming applications use this concept, and it is the one most familiar to users. This time is tracked using a clock run by the processing engine. So, last “t” seconds means the records that arrived in the last “t” seconds for processing.
Processing time is a very good way of keeping track of time, but not always helpful. Let’s say we want to measure the state of a sensor at a given point in time, so we want to collect the events at that time. But if the events arrive late to the processing system, for various reasons, we may miss some of them, as the processing clock does not care about the actual time of events. To address this, Flink supports another kind of time, called event time.
Event Time: This time is embedded in the data; it comes with the data. So here, last “t” seconds means all the records generated in those last “t” seconds at the source. These may arrive out of order for processing. This time is independent of the clock kept by the processing engine. Event time is extremely useful for handling late-arriving events.
Ingestion Time: Ingestion time is the time when events are ingested into the system. This time sits between event time and processing time. Normally with processing time, each machine in the cluster uses its own clock to assign timestamps to track events. This may result in a slightly inconsistent view of the data, as there may be time differences across the cluster. With ingestion time, the timestamp is assigned at ingestion, so that all machines in the cluster have exactly the same view. It is useful for calculating results on data that arrive in order at the level of ingestion.
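To see concretely why event time matters, here is a plain-Python sketch (invented names, not Flink code) that groups the same out-of-order records into 10-second tumbling windows keyed by event time, so late arrivals still land in the right window:

```python
def by_event_time(records, size):
    """records: (event_ts, value) pairs in arrival order; group by event-time window start."""
    windows = {}
    for ts, value in records:
        start = ts - (ts % size)               # event-time window the record belongs to
        windows.setdefault(start, []).append(value)
    return windows


# arrival order != event order: the event stamped 3 arrives last (late)
records = [(12, "c"), (15, "d"), (9, "b"), (3, "a")]
print(by_event_time(records, 10))  # {10: ['c', 'd'], 0: ['b', 'a']}
```

Grouping by arrival (processing) time would instead have placed "b" and "a" with whatever happened to be processing at that moment, giving a different, less accurate result.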