SlideShare a Scribd company logo
1 of 33
Download to read offline
Introduction to Apache
Flink
A stream based big data processing engine
https://github.com/phatak-dev/flink-examples
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Evolution of big data frameworks to platform
● Shortcomings of the RDD abstraction
● Streaming dataflow as the abstraction layer
● Flink introduction
● Flink history
● Flink vs Spark
● Flink examples
Big data frameworks
● Big data started as the multiple frameworks focused on
specific problems
● Every framework defined its own abstraction
● Though combining multiple frameworks in powerful in
theory, putting together is no fun
● Combining right set of frameworks and making them
work on same platform became non trivial task
● Distributions like Cloudera, Hortonworks solved these
problems to some level
Dawn of big data platforms
● Time of big data processing frameworks is over, it’s time
for big platforms
● More and more advances in big data shows people
want to use single platform for their big data needs
● Single platform normally evolves faster compared to a
distribution of multiple frameworks
● Spark pioneering the platform and others are following
their lead
Framework vs Platform
● Each framework is free to define it’s own abstraction but
platform should define single abstraction
● Frameworks often embrace multiple runtime whereas
platform embraces single runtime
● Platforms normally supports libraries rather than
frameworks
● Frameworks can be in different versions, normally all
libraries on a platform will be on a single version
Abstractions in frameworks
● Map/Reduce provides file as abstraction as it’s a
framework focused on batch processing
● Storm provided realtime stream as abstraction as it’s
focused on real time processing
● Hive provides table as abstraction as it’s focused on
structured data analysis
● Every framework define their own abstraction which
defines their capabilities and limitations
Abstraction in platform
● As we moved into platform, abstraction provided by
platform becomes crucial
● Abstraction provided by platform should be generic
enough for all frameworks/ libraries built on top of it.
● Requirements of platform abstraction
○ Ability to support different kind of libraries/
framework on platform
○ Ability to evolve with new requirements
○ Ability to harness advances in the hardware
Abstraction in Spark platform
● RDD is the single abstraction on which all other API’s
have built
● RDD allowed Spark to evolve to be platform rather than
just another framework
● RDD currently supports
○ Batch processing
○ Structure data analysis using DF abstraction
○ Streaming
○ Iterative processing - ML
RDD is the assembly of big data?
● RDD abstraction as platform abstraction was much
better than any other framework abstractions that came
before that.
● But as platform idea becomes common place, there is
been question is RDD is the assembly of big data?
● Databricks has convinced it is, but other disagree
● RDD seems to start showing some cracks in it’s perfect
image, particularly in
○ Streaming
○ Iterative processing
RDD in streaming
● RDD is essentially a map/reduce file based abstraction
that sits in memory
● RDD file like behaviour makes it difficult to amend to the
streaming latencies
● This is the reason why spark streaming in a near
realtime not realtime system
● So this is holding platform back in API’s like custom
windows like window based on record count, event
based time
Spark streaming
● RDD is an abstraction based on partitions
● To satisfy the RDD abstraction, spark streaming inserts blocking
operations like block boundary, partition boundary to streaming data
● Normally it’s good practice to avoid any blocking operation on stream but
spark streaming we can’t avoid them
● Due to this nature, dashboard our example has to wait till all the records in
the partition to be processed before it can see even first result.
R1 R2 R3 M1 M2 M3Input
Stream
Block boundary
Partition
Partition boundary
Dashboard
Map operation
Spark streaming in spikes
Input
Stream
Block boundary
Partition
Partition boundary
Dashboard
Map operation
R
1
R
2
R
3
R
4
R
5
M
1
M
2
M
3
M
4
M
5
● In case of spikes, as spark streaming is purely on time, data in single block
increases.
● As block increases the partition size increases which makes blocking time
bigger
● So again it puts more pressure on the processing system and increases
latency in downstream systems
Spark iterative processing
RDD in iterative processing
● Spark runtime does not support iteration in the runtime
level
● RDD caching with essential looping gets the effect of
iterative processing
● To support iterative process, the platform should
support some kind of cycle in its dataflow which is
currently DAG
● So to implement advanced matrix factorization algorithm
it’s very hard to do in spark
Stream as the abstraction
● A stream is a sequence of data elements made
available over time.
● A stream can be thought of as items on a conveyor belt
being processed one at a time rather than in large
batches.
● Streams can be unbounded ( streaming application) and
bounded ( batch processing)
● Streams are becoming new abstractions to build data
pipelines.
Streams in last two years
● There's been many projects started to make stream as
the new abstraction layer for data processing
● Some of them are
○ Reactive streams like akka-streams, akka-http
● Java 8 streams
● RxJava etc
Streams are picking up steam from last few years now.
Why stream abstraction?
● Stream is one of most generic interface out there
● Flexible in representing different data flow scenarios
● Stream are stateless by nature
● Streams are normally lazy and easily parallelizable
● Streams work well with functional programming
● Stream can model different data sources/sinks
○ File can be treated as stream of bytes
● Stream normally good at exploiting the system
resources.
Why not streams?
● Long running pipelined streams are hard to reason
about.
● As stream is by default stateless, doing fault tolerance
becomes harder
● Stream change the way we look at data in normal
program
● As there may be multiple things running parallely debug
becomes tricky
Stream abstraction in big data
● Stream is the new abstraction layer people are
exploring in the big data
● With right implementation, stream can support both
streaming and batch applications much more effectively
than existing abstractions.
● Batch on streaming is new way of looking at processing
rather than treating streaming as the special case of
batch
● Batch can be faster on streaming than dedicated batch
processing
Apache flink
● Flink’s core is a streaming dataflow engine that
provides data distribution, communication, and fault
tolerance for distributed computations over data
streams.
● Flink provides
○ Dataset API - for bounded streams
○ Datastream API - for unbounded streams
● Flink embraces the stream as abstraction to implement
it’s dataflow.
Flink stack
Flink history
● Stratosphere project started in Technical university,
Berlin in 2009
● Incubated to Apache in March, 2014
● Top level project in Dec 2014
● Started as stream engine for batch processing
● Started to support streaming few versions before
● DataArtisians is company founded by core flink team
● Complete history is here [1]
Spark vs Flink
● Spark was the first open source big data platform. Flink
is following it’s lead
● They share many ideas and frameworks
○ Same collection like API
○ Same frameworks like AKKA, YARN underneath
○ Full fledges DSL in Scala
● It is very easy to port code from Spark to Flink as they
have very similar API. Ex : WordCount.scala
● But there are many differences in the abstractions and
runtime.
Abstraction
Spark Flink
RDD Dataflow (JobGraph)
RDD API for batch and DStream API for
streaming
Dataset for batch API and DataStream API
for streaming
RDD does not go through optimizer.
Dataframe is API for that.
All API’s go through optimizer. Dataset is
dataframe equivalent.
Can combine RDD and DStream Cannot combine Dataset and DataStream
Streaming over batch Batch over streaming
Memory management
Spark Flink
Till 1.5, Java memory layout to manage
cached data
Custom memory management from day
one.
Many OOM errors due to inaccurate
approximation of java heap memory
Fewer OOM as memory is pre allocated
and controlled manually
RDD does not uses optimizer so it can’t use
benefits of Tungsten
All API’s go through optimizer. So all can
depend upon the controlled memory
management
No support for off heap memory natively Support for off heap memory natively from
0.10 version
Supports explicit data caching for
interactive applications
No explicit data caching yet
Streaming
Spark Flink
Near real time / microbatch Realtime / Event based
Only supports single concept of time i.e
processing time
Support multiple concepts of time like event
time, process time etc
Supports windows based on process time. Supports windows based on time, record
count, triggers etc
Great support to combine historical data
and stream
Limited support to combine historical data
and stream
Supports most of the datastreams used in
real world
Limited supported for streams which are
limited to kafka as of now
Batch processing
Spark Flink
Spark runtime is dedicated batch processor Flink uses different optimizer to run batch
on streaming system
All operators are blocking aka partitioned to
run in flow
Even some of the batch operators are
streamed as it’s transparent to program
Uses RDD recomputation for fault tolerance Chandy-Lamport algorithm for barrier
checks for fault tolerance
● As spark has dedicated batch processing, it’s expected
to do better compare to flink
● Not really. Stream pipeline can result in better
performance in even in batch[2].
Structured Data analysis
Spark Flink
Excellent supports for structured data using
Datasource API
Only supports Hadoop InputFormat based
API’s both for structured and unstructured
data
Support Dataframe DSL and SQL interface
for data analysis
Limited support for DSL using Table API
and no SQL support yet
Integrates nicely with Hive No integration with Hive yet
Support for statically types dataframe
transformation is getting added in 1.6
Supports statically typed structured data
analysis using Dataset API
No support for Dataframe for streaming yet Table API is supported for streaming
Maturity
● Spark is already 5 years old in Apache community
where as flink is around 2 year old
● Spark is already in version 1.6 whereas flink is yet to hit
1.0
● Spark has great ecosystem support and mature
compared to flink at this point of time
● Materials to learn, understand are far more better for
spark compared to Flink
● Very few companies are using flink in production as of
now
FAQ
● Is this all real? Are you saying me Flink is taking over
spark?
● Can spark adopt to new ideas proposed by Flink?
● How easy to move the project from Spark to Flink?
● Do you have any customer who are evaluating Flink as
of now?
● Is there any resources to learn more about it?
References
1. http://www.slideshare.net/stephanewen1/flink-history-
roadmap-and-vision
2. http://www.slideshare.net/FlinkForward/dongwon-kim-a-
comparative-performance-evaluation-of-flink
3. http://www.slideshare.net/stephanewen1/flink-history-
roadmap-and-vision
References
● http://flink.apache.org/
● http://flink.apache.org/faq.html
● http://flink-forward.org/
● http://flink.apache.org/news/2015/05/11/Juggling-with-
Bits-and-Bytes.html
● https://cwiki.apache.
org/confluence/display/FLINK/Data+exchange+between
+tasks

More Related Content

What's hot

Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesFlink Forward
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream ProcessingSuneel Marthi
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkTimo Walther
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaKai Wähner
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 

What's hot (20)

Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
kafka
kafkakafka
kafka
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 

Viewers also liked

Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 

Viewers also liked (6)

Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 

Similar to Stream-based big data processing with Apache Flink

Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streamingdatamantra
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache sparkdatamantra
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0datamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streamingdatamantra
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko GlobalLogic Ukraine
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web ServiceSpark Summit
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupPhaneendra Chiruvella
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIdatamantra
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaItai Yaffe
 
DBCC 2021 - FLiP Stack for Cloud Data Lakes
DBCC 2021 - FLiP Stack for Cloud Data LakesDBCC 2021 - FLiP Stack for Cloud Data Lakes
DBCC 2021 - FLiP Stack for Cloud Data LakesTimothy Spann
 

Similar to Stream-based big data processing with Apache Flink (20)

Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark Group
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
DBCC 2021 - FLiP Stack for Cloud Data Lakes
DBCC 2021 - FLiP Stack for Cloud Data LakesDBCC 2021 - FLiP Stack for Cloud Data Lakes
DBCC 2021 - FLiP Stack for Cloud Data Lakes
 

More from datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Telliusdatamantra
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streamingdatamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetesdatamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 APIdatamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsdatamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle managementdatamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scaladatamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scaladatamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetesdatamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsdatamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scaladatamantra
 

More from datamantra (20)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 

Recently uploaded

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

Stream-based big data processing with Apache Flink

  • 1. Introduction to Apache Flink A stream based big data processing engine https://github.com/phatak-dev/flink-examples
  • 2. ● Madhukara Phatak ● Big data consultant and trainer at datamantra.io ● Consult in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Evolution of big data frameworks to platform ● Shortcomings of the RDD abstraction ● Streaming dataflow as the abstraction layer ● Flink introduction ● Flink history ● Flink vs Spark ● Flink examples
  • 4. Big data frameworks ● Big data started as the multiple frameworks focused on specific problems ● Every framework defined its own abstraction ● Though combining multiple frameworks in powerful in theory, putting together is no fun ● Combining right set of frameworks and making them work on same platform became non trivial task ● Distributions like Cloudera, Hortonworks solved these problems to some level
  • 5. Dawn of big data platforms ● Time of big data processing frameworks is over, it’s time for big platforms ● More and more advances in big data shows people want to use single platform for their big data needs ● Single platform normally evolves faster compared to a distribution of multiple frameworks ● Spark pioneering the platform and others are following their lead
  • 6. Framework vs Platform ● Each framework is free to define it’s own abstraction but platform should define single abstraction ● Frameworks often embrace multiple runtime whereas platform embraces single runtime ● Platforms normally supports libraries rather than frameworks ● Frameworks can be in different versions, normally all libraries on a platform will be on a single version
  • 7. Abstractions in frameworks ● Map/Reduce provides file as abstraction as it’s a framework focused on batch processing ● Storm provided realtime stream as abstraction as it’s focused on real time processing ● Hive provides table as abstraction as it’s focused on structured data analysis ● Every framework define their own abstraction which defines their capabilities and limitations
  • 8. Abstraction in platform ● As we moved into platform, abstraction provided by platform becomes crucial ● Abstraction provided by platform should be generic enough for all frameworks/ libraries built on top of it. ● Requirements of platform abstraction ○ Ability to support different kind of libraries/ framework on platform ○ Ability to evolve with new requirements ○ Ability to harness advances in the hardware
  • 9. Abstraction in Spark platform ● RDD is the single abstraction on which all other API’s have built ● RDD allowed Spark to evolve to be platform rather than just another framework ● RDD currently supports ○ Batch processing ○ Structure data analysis using DF abstraction ○ Streaming ○ Iterative processing - ML
  • 10. RDD is the assembly of big data? ● RDD abstraction as platform abstraction was much better than any other framework abstractions that came before that. ● But as platform idea becomes common place, there is been question is RDD is the assembly of big data? ● Databricks has convinced it is, but other disagree ● RDD seems to start showing some cracks in it’s perfect image, particularly in ○ Streaming ○ Iterative processing
  • 11. RDD in streaming ● RDD is essentially a map/reduce file based abstraction that sits in memory ● RDD file like behaviour makes it difficult to amend to the streaming latencies ● This is the reason why spark streaming in a near realtime not realtime system ● So this is holding platform back in API’s like custom windows like window based on record count, event based time
  • 12. Spark streaming ● RDD is an abstraction based on partitions ● To satisfy the RDD abstraction, spark streaming inserts blocking operations like block boundary, partition boundary to streaming data ● Normally it’s good practice to avoid any blocking operation on stream but spark streaming we can’t avoid them ● Due to this nature, dashboard our example has to wait till all the records in the partition to be processed before it can see even first result. R1 R2 R3 M1 M2 M3Input Stream Block boundary Partition Partition boundary Dashboard Map operation
  • 13. Spark streaming in spikes Input Stream Block boundary Partition Partition boundary Dashboard Map operation R 1 R 2 R 3 R 4 R 5 M 1 M 2 M 3 M 4 M 5 ● In case of spikes, as spark streaming is purely on time, data in single block increases. ● As block increases the partition size increases which makes blocking time bigger ● So again it puts more pressure on the processing system and increases latency in downstream systems
  • 15. RDD in iterative processing ● Spark runtime does not support iteration in the runtime level ● RDD caching with essential looping gets the effect of iterative processing ● To support iterative process, the platform should support some kind of cycle in its dataflow which is currently DAG ● So to implement advanced matrix factorization algorithm it’s very hard to do in spark
  • 16. Stream as the abstraction ● A stream is a sequence of data elements made available over time. ● A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches. ● Streams can be unbounded ( streaming application) and bounded ( batch processing) ● Streams are becoming new abstractions to build data pipelines.
  • 17. Streams in last two years ● There's been many projects started to make stream as the new abstraction layer for data processing ● Some of them are ○ Reactive streams like akka-streams, akka-http ● Java 8 streams ● RxJava etc Streams are picking up steam from last few years now.
  • 18. Why stream abstraction? ● Stream is one of most generic interface out there ● Flexible in representing different data flow scenarios ● Stream are stateless by nature ● Streams are normally lazy and easily parallelizable ● Streams work well with functional programming ● Stream can model different data sources/sinks ○ File can be treated as stream of bytes ● Stream normally good at exploiting the system resources.
  • 19. Why not streams? ● Long running pipelined streams are hard to reason about. ● As stream is by default stateless, doing fault tolerance becomes harder ● Stream change the way we look at data in normal program ● As there may be multiple things running parallely debug becomes tricky
  • 20. Stream abstraction in big data ● Stream is the new abstraction layer people are exploring in the big data ● With right implementation, stream can support both streaming and batch applications much more effectively than existing abstractions. ● Batch on streaming is new way of looking at processing rather than treating streaming as the special case of batch ● Batch can be faster on streaming than dedicated batch processing
  • 21. Apache flink ● Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. ● Flink provides ○ Dataset API - for bounded streams ○ Datastream API - for unbounded streams ● Flink embraces the stream as abstraction to implement it’s dataflow.
  • 23. Flink history ● Stratosphere project started in Technical university, Berlin in 2009 ● Incubated to Apache in March, 2014 ● Top level project in Dec 2014 ● Started as stream engine for batch processing ● Started to support streaming few versions before ● DataArtisians is company founded by core flink team ● Complete history is here [1]
  • 24. Spark vs Flink ● Spark was the first open source big data platform. Flink is following it’s lead ● They share many ideas and frameworks ○ Same collection like API ○ Same frameworks like AKKA, YARN underneath ○ Full fledges DSL in Scala ● It is very easy to port code from Spark to Flink as they have very similar API. Ex : WordCount.scala ● But there are many differences in the abstractions and runtime.
  • 25. Abstraction Spark Flink RDD Dataflow (JobGraph) RDD API for batch and DStream API for streaming Dataset for batch API and DataStream API for streaming RDD does not go through optimizer. Dataframe is API for that. All API’s go through optimizer. Dataset is dataframe equivalent. Can combine RDD and DStream Cannot combine Dataset and DataStream Streaming over batch Batch over streaming
  • 26. Memory management Spark Flink Till 1.5, Java memory layout to manage cached data Custom memory management from day one. Many OOM errors due to inaccurate approximation of java heap memory Fewer OOM as memory is pre allocated and controlled manually RDD does not uses optimizer so it can’t use benefits of Tungsten All API’s go through optimizer. So all can depend upon the controlled memory management No support for off heap memory natively Support for off heap memory natively from 0.10 version Supports explicit data caching for interactive applications No explicit data caching yet
  • 27. Streaming Spark Flink Near real time / microbatch Realtime / Event based Only supports single concept of time i.e processing time Support multiple concepts of time like event time, process time etc Supports windows based on process time. Supports windows based on time, record count, triggers etc Great support to combine historical data and stream Limited support to combine historical data and stream Supports most of the datastreams used in real world Limited supported for streams which are limited to kafka as of now
  • 28. Batch processing Spark Flink Spark runtime is dedicated batch processor Flink uses different optimizer to run batch on streaming system All operators are blocking aka partitioned to run in flow Even some of the batch operators are streamed as it’s transparent to program Uses RDD recomputation for fault tolerance Chandy-Lamport algorithm for barrier checks for fault tolerance ● As spark has dedicated batch processing, it’s expected to do better compare to flink ● Not really. Stream pipeline can result in better performance in even in batch[2].
  • 29. Structured Data analysis Spark Flink Excellent supports for structured data using Datasource API Only supports Hadoop InputFormat based API’s both for structured and unstructured data Support Dataframe DSL and SQL interface for data analysis Limited support for DSL using Table API and no SQL support yet Integrates nicely with Hive No integration with Hive yet Support for statically types dataframe transformation is getting added in 1.6 Supports statically typed structured data analysis using Dataset API No support for Dataframe for streaming yet Table API is supported for streaming
  • 30. Maturity ● Spark is already 5 years old in Apache community where as flink is around 2 year old ● Spark is already in version 1.6 whereas flink is yet to hit 1.0 ● Spark has great ecosystem support and mature compared to flink at this point of time ● Materials to learn, understand are far more better for spark compared to Flink ● Very few companies are using flink in production as of now
  • 31. FAQ ● Is this all real? Are you saying me Flink is taking over spark? ● Can spark adopt to new ideas proposed by Flink? ● How easy to move the project from Spark to Flink? ● Do you have any customer who are evaluating Flink as of now? ● Is there any resources to learn more about it?
  • 33. References ● http://flink.apache.org/ ● http://flink.apache.org/faq.html ● http://flink-forward.org/ ● http://flink.apache.org/news/2015/05/11/Juggling-with- Bits-and-Bytes.html ● https://cwiki.apache. org/confluence/display/FLINK/Data+exchange+between +tasks