Apache Flink 
Fast and reliable big data processing 
Till Rohrmann 
trohrmann@apache.org
What is Apache Flink? 
• Project undergoing incubation in the Apache 
Software Foundation 
• Originating from the Stratosphere research project 
started at TU Berlin in 2009 
• http://flink.incubator.apache.org 
• 59 contributors (doubled in ~4 months) 
• Has awesome squirrel logo
What is Apache Flink? 
[Diagram: Flink Client, Master, and two Workers]
Current state 
• Fast - much faster than Hadoop, faster than Spark 
in many cases 
• Reliable - does not suffer from memory problems
Outline of this talk 
• Introduction to Flink 
• Distributed PageRank with Flink 
• Other Flink features 
• Flink roadmap and closing
Where to locate Flink in the 
Open Source landscape? 
• Applications: Hive, Cascading, Mahout, Pig, Crunch
• Data processing engines: MapReduce, Flink, Spark, Storm, Tez
• App and resource management: Yarn, Mesos
• Storage, streams: HDFS, HBase, Kafka, …
Distributed data sets 
Program:
DataSet A → Operator X → DataSet B → Operator Y → DataSet C
Parallel Execution:
A (1) → X → B (1) → Y → C (1)
A (2) → X → B (2) → Y → C (2)
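The parallel execution pictured above can be imitated in plain Java, without Flink. This is only a sketch: the class `PartitionedDataSet` and the method `applyOperator` are hypothetical names, nested lists stand in for partitions, and a parallel stream stands in for the workers.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustration only: a "data set" is a list of partitions, and an operator
// is applied to each partition independently of the others.
class PartitionedDataSet {

    /** Applies one operator to every element of every partition, partition by partition. */
    static <I, O> List<List<O>> applyOperator(List<List<I>> partitions, Function<I, O> operator) {
        return partitions.parallelStream()                 // partitions may be processed concurrently
                .map(partition -> partition.stream()
                        .map(operator)
                        .collect(Collectors.toList()))
                .collect(Collectors.toList());             // partition order is preserved
    }

    public static void main(String[] args) {
        List<List<Integer>> a = List.of(List.of(1, 2), List.of(3, 4));   // A (1), A (2)
        List<List<Integer>> b = applyOperator(a, x -> x * 10);           // Operator X
        List<List<Integer>> c = applyOperator(b, x -> x + 1);            // Operator Y
        System.out.println(c);
    }
}
```

Each partition is transformed without looking at the others, which is what lets the same program run on one machine or on many workers.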
Log Analysis 
[Data flow: LogFile → Filter → Join (with Users) → Result]
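import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// logInput and userInput are file paths defined elsewhere (not shown on the slide).
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<Integer, String>> log = env.readCsvFile(logInput)
    .types(Integer.class, String.class);
DataSet<Tuple2<String, Integer>> users = env.readCsvFile(userInput)
    .types(String.class, Integer.class);

DataSet<String> usersWhoDownloadedFlink = log
    // keep only log entries that mention flink.jar
    .filter(msg -> msg.f1.contains("flink.jar"))
    // join log field 0 with users field 1 (both Integer)
    .join(users).where(0).equalTo(1)
    // emit field 0 of the matching user record
    .with((Tuple2<Integer, String> msg, Tuple2<String, Integer> user,
           Collector<String> out) -> out.collect(user.f0));

usersWhoDownloadedFlink.print();

env.execute("Log filter example");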
PageRank 
• Algorithm that made Google
a multi-billion-dollar business
• Ranking of search results 
• Model: Random surfer 
• Follows links 
• Randomly selects arbitrary 
website
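The random-surfer model can be stated as a formula. The notation below is an assumption chosen for illustration and does not appear on the slides: $d$ is the damping factor (the probability of following a link), $N$ is the number of pages, $\mathrm{In}(v)$ is the set of pages linking to $v$, and $\mathrm{Out}(u)$ is the set of pages that $u$ links to.

```latex
PR(v) \;=\; \frac{1 - d}{N} \;+\; d \sum_{u \in \mathrm{In}(v)} \frac{PR(u)}{\lvert \mathrm{Out}(u) \rvert}
```

The first term is the random jump to an arbitrary website; the sum is the rank that arrives over incoming links, with each page splitting its rank evenly among its outgoing links.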
How can we solve the 
problem? 
PageRankDS = { 
(1, 0.3) 
(2, 0.5) 
(3, 0.2) 
} 
AdjacencyDS = { 
(1, [2, 3]) 
(2, [1]) 
(3, [1, 2]) 
}
PageRank{ 
node: Int 
rank: Double 
} 
Adjacency{ 
node: Int 
neighbours: List[Int] 
}
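One update step on the example data sets above can be worked through in plain Java, without Flink. This is a minimal sketch: the class `PageRankStep`, the method `step`, and the damping factor of 0.85 are assumptions made for illustration, since the slides do not give a value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a single PageRank update (no Flink dependency).
class PageRankStep {

    // Assumed value; the slides do not state the damping factor.
    static final double DAMPING_FACTOR = 0.85;

    /** Computes the next ranks from the current ranks and the adjacency lists. */
    static Map<Integer, Double> step(Map<Integer, Double> ranks,
                                     Map<Integer, List<Integer>> adjacency) {
        int numVertices = ranks.size();
        Map<Integer, Double> next = new HashMap<>();
        for (Map.Entry<Integer, Double> entry : ranks.entrySet()) {
            int node = entry.getKey();
            double rank = entry.getValue();
            List<Integer> neighbours = adjacency.getOrDefault(node, List.of());
            // Random-jump share: every node receives (1 - d) / N.
            next.merge(node, (1 - DAMPING_FACTOR) / numVertices, Double::sum);
            // Link-following share: d * rank is split evenly among the neighbours.
            for (int neighbour : neighbours) {
                next.merge(neighbour, DAMPING_FACTOR * rank / neighbours.size(), Double::sum);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        Map<Integer, Double> ranks = Map.of(1, 0.3, 2, 0.5, 3, 0.2);
        Map<Integer, List<Integer>> adjacency =
                Map.of(1, List.of(2, 3), 2, List.of(1), 3, List.of(1, 2));
        System.out.println(step(ranks, adjacency));
    }
}
```

With these inputs the new ranks are approximately 0.56 for node 1, 0.2625 for node 2, and 0.1775 for node 3, which still sum to 1. Emitting partial ranks per neighbour and then summing them per node corresponds to a join followed by a grouped reduce in a dataflow program.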
import org.apache.flink.api.scala._

case class Pagerank(node: Int, rank: Double)
case class Adjacency(node: Int, neighbours: Array[Int])

// numVertices, sparsity, dampingFactor and the two create* helpers
// are defined elsewhere (not shown on the slide).
def main(args: Array[String]): Unit = {
  val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

  val initialPagerank: DataSet[Pagerank] = createInitialPagerank(numVertices, env)
  val adjacency: DataSet[Adjacency] = createRandomAdjacency(numVertices, sparsity, env)

  val solution = initialPagerank.iterate(100) {
    pagerank =>
      val partialPagerank = pagerank.join(adjacency).
        where("node").
        equalTo("node").
        flatMap {
          // generating the partial pageranks
          pair => {
            val (Pagerank(node, rank), Adjacency(_, neighbours)) = pair
            val length = neighbours.length
            neighbours.map {
              neighbour =>
                Pagerank(neighbour, dampingFactor * rank / length)
            } :+ Pagerank(node, (1 - dampingFactor) / numVertices)
          }
        }

      // adding the partial pageranks up
      partialPagerank.
        groupBy("node").
        reduce {
          (left, right) => Pagerank(left.node, left.rank + right.rank)
        }
  }
  solution.print()
  env.execute("Flink pagerank.")
}
Flink stack (diagram):
• APIs: Java API (batch), Scala API (batch), Java API (streaming), Python API, Graph API ("Spargel"), Apache MRQL
• Common API
• Flink Optimizer (batch), Streams Builder (streaming)
• Hybrid Batch/Streaming Runtime
• Execution: Local Execution, Java Collections, Apache Tez, Cluster Manager (Native, YARN, EC2)
• Storage: Files, HDFS, S3, JDBC, Azure, …
• Streams: Kafka, RabbitMQ, Redis, …
Memory management 
• Flink manages its own memory on the heap 
• Caching and data processing happens in managed memory 
• Spills gracefully to disk; never throws out-of-memory exceptions
[Diagram: the JVM heap is split into Unmanaged Heap (User Code), Flink Managed Heap (Flink Runtime), and Network Buffers (Shuffles/Broadcasts)]
Hadoop compatibility 
• Flink supports out of the box 
• Hadoop data types (Writables) 
• Hadoop input/output formats 
• Hadoop functions and object model 
[Diagram: Hadoop Input, Map, Reduce, and Output functions running inside a Flink program, connected through DataSets alongside Flink operators such as Join]
Flink Streaming 
Word count with Java API 
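ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<String, Integer>> result = env
    .readTextFile(input)
    // split each sentence into words; flatMap emits results through a Collector
    .flatMap((String sentence, Collector<String> out) ->
        Arrays.asList(sentence.split(" ")).forEach(out::collect))
    .map(word -> new Tuple2<>(word, 1))
    .groupBy(0)
    .aggregate(Aggregations.SUM, 1);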
Word count with Flink Streaming 
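StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Tuple2<String, Integer>> result = env
    .readTextFile(input)
    // split each sentence into words; flatMap emits results through a Collector
    .flatMap((String sentence, Collector<String> out) ->
        Arrays.asList(sentence.split(" ")).forEach(out::collect))
    .map(word -> new Tuple2<>(word, 1))
    .groupBy(0)
    .sum(1);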
Streaming throughput
Write once, run everywhere! 
• Cluster (Batch)
• Cluster (Streaming)
• Local Debugging
• As Java Collection Programs
• Embedded (e.g., Web Container)
• Flink Runtime or Apache Tez
Write once, run with any 
data! 
• The same program gets a different execution plan (Plan A, Plan B, Plan C) depending on the data:
  • Run on a sample on the laptop
  • Run on large files on the cluster
  • Run a month later after the data evolved
• Plan choices: Hash vs. Sort, Partition vs. Broadcast, Caching, Reusing partition/sort
Little tuning required 
• Requires no memory thresholds to configure 
• Requires no complicated network configs 
• Requires no serializers to be configured 
• Programs adjust to data automatically
Flink roadmap 
• Flink has a major release every 3 months 
• Finer grained fault-tolerance 
• Logical (SQL-like) field addressing 
• Python API 
• Flink Streaming, Lambda architecture 
support 
• Flink on Tez 
• ML on Flink (Mahout DSL) 
• Graph DSL on Flink 
• … and much more
http://flink.incubator.apache.org
github.com/apache/incubator-flink
@ApacheFlink

%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 

Introduction to Apache Flink - Fast and reliable big data processing

  • 1. Apache Flink Fast and reliable big data processing Till Rohrmann trohrmann@apache.org
  • 2. What is Apache Flink? • Project undergoing incubation in the Apache Software Foundation • Originating from the Stratosphere research project started at TU Berlin in 2009 • http://flink.incubator.apache.org • 59 contributors (doubled in ~4 months) • Has awesome squirrel logo
  • 3. What is Apache Flink? [Architecture diagram: Flink Client, Master, Worker, Worker]
  • 4. Current state • Fast - much faster than Hadoop, faster than Spark in many cases • Reliable - does not suffer from memory problems
  • 5. Outline of this talk • Introduction to Flink • Distributed PageRank with Flink • Other Flink features • Flink roadmap and closing
  • 6. Where to locate Flink in the Open Source landscape? [Diagram. Layers: Applications; Data processing engines; App and resource management; Storage, streams. Projects shown: Crunch, Hive, Mahout, MapReduce, Flink, Spark, Storm, Yarn, Mesos, HDFS, Cascading, Tez, Pig, HBase, Kafka, …]
  • 7. Distributed data sets [Diagram. Program: DataSet A, Operator X, DataSet B, Operator Y, DataSet C. Parallel execution: A (1), A (2), X, X, B (1), B (2), Y, Y, C (1), C (2)]
  • 8. Log Analysis [Diagram: LogFile, Filter, Users, Join, Result]
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
      // log entries: (user id, message); users: (user name, user id)
      DataSet<Tuple2<Integer, String>> log = env.readCsvFile(logInput)
          .types(Integer.class, String.class);
      DataSet<Tuple2<String, Integer>> users = env.readCsvFile(userInput)
          .types(String.class, Integer.class);
      DataSet<String> usersWhoDownloadedFlink = log
          // keep only log entries that mention flink.jar
          .filter((msg) -> msg.f1.contains("flink.jar"))
          // join log field 0 (user id) with users field 1 (user id)
          .join(users).where(0).equalTo(1)
          .with((Tuple2<Integer, String> msg, Tuple2<String, Integer> user, Collector<String> out) -> {
              out.collect(user.f0); // emit the user name
          });
      usersWhoDownloadedFlink.print();
      env.execute("Log filter example");
  • 9. PageRank • Algorithm which made Google a multi billion dollar business • Ranking of search results • Model: Random surfer • Follows links • Randomly selects arbitrary website
  • 10. How can we solve the problem? PageRankDS = { (1, 0.3) (2, 0.5) (3, 0.2) } AdjacencyDS = { (1, [2, 3]) (2, [1]) (3, [1, 2]) }
  • 11. PageRank{ node: Int rank: Double } Adjacency{ node: Int neighbours: List[Int] }
  • 12.
  • 13.
      case class Pagerank(node: Int, rank: Double)
      case class Adjacency(node: Int, neighbours: Array[Int])
      def main(args: Array[String]): Unit = {
        val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
        val initialPagerank: DataSet[Pagerank] = createInitialPagerank(numVertices, env)
        val adjacency: DataSet[Adjacency] = createRandomAdjacency(numVertices, sparsity, env)
        val solution = initialPagerank.iterate(100) { pagerank =>
          val partialPagerank = pagerank.join(adjacency).
            where("node").
            equalTo("node").
            flatMap {
              // generating the partial pageranks
              pair => {
                val (Pagerank(node, rank), Adjacency(_, neighbours)) = pair
                val length = neighbours.length
                neighbours.map { neighbour =>
                  Pagerank(neighbour, dampingFactor * rank / length)
                } :+ Pagerank(node, (1 - dampingFactor) / numVertices)
              }
            }
          // adding the partial pageranks up
          partialPagerank.
            groupBy("node").
            reduce { (left, right) =>
              Pagerank(left.node, left.rank + right.rank)
            }
        }
        solution.print()
        env.execute("Flink pagerank.")
      }
  • 14. [Flink stack diagram. Labels: Common API; Flink Optimizer; Hybrid Batch/Streaming Runtime; Java API (batch); Java API (streaming); Scala API (batch); Python API; Graph API („Spargel“); Apache MRQL; Streams Builder; Java Collections; Local Execution; Apache Tez; Batch; Streaming; Storage; Streams; Files; HDFS; S3; JDBC; RabbitMQ; Redis; Azure; Kafka; …; Cluster Manager; Native; YARN; EC2]
  • 15. Memory management • Flink manages its own memory on the heap • Caching and data processing happens in managed memory • Allows graceful spilling, never out of memory exceptions [Diagram: JVM Heap divided into Unmanaged Heap, Flink Managed Heap, and Network Buffers; labels: User Code, Flink Runtime, Shuffles/Broadcasts]
  • 16. Hadoop compatibility • Flink supports out of the box • Hadoop data types (Writables) • Hadoop input/output formats • Hadoop functions and object model [Diagram: Input, Map, Reduce, Output stages shown next to a chain of DataSets with operators labeled Input, Map, Red, Join, Output]
  • 17. Flink Streaming
      Word count with Java API:
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
      DataSet<Tuple2<String, Integer>> result = env
          .readTextFile(input)
          .flatMap(sentence -> asList(sentence.split(" ")))
          .map(word -> new Tuple2<>(word, 1))
          .groupBy(0)
          .aggregate(SUM, 1);
      Word count with Flink Streaming:
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      DataStream<Tuple2<String, Integer>> result = env
          .readTextFile(input)
          .flatMap(sentence -> asList(sentence.split(" ")))
          .map(word -> new Tuple2<>(word, 1))
          .groupBy(0)
          .sum(1);
  • 19. Write once, run everywhere! [Diagram: Cluster (Batch); Cluster (Streaming); Local Debugging; As Java Collection Programs; Embedded (e.g., Web Container); Flink Runtime or Apache Tez]
  • 20. Write once, run with any data! [Diagram. Scenarios: Run on a sample on the laptop; Run a month later after the data evolved; Run on large files on the cluster. Results: Execution Plan A, Execution Plan B, Execution Plan C. Optimizer choices: Hash vs. Sort, Partition vs. Broadcast, Caching, Reusing partition/sort]
  • 21. Little tuning required • Requires no memory thresholds to configure • Requires no complicated network configs • Requires no serializers to be configured • Programs adjust to data automatically
  • 22. Flink roadmap • Flink has a major release every 3 months • Finer grained fault-tolerance • Logical (SQL-like) field addressing • Python API • Flink Streaming, Lambda architecture support • Flink on Tez • ML on Flink (Mahout DSL) • Graph DSL on Flink • … and much more
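
The PageRank iteration on slide 13 can be traced by hand on the small data sets from slide 10. The following is a minimal sketch in plain Java using only the standard library, not the Flink API; the class name `PageRankStepSketch`, the method `step`, and the damping factor 0.85 are assumptions made for illustration, since the slides do not give a value for `dampingFactor`. Each node sends `dampingFactor * rank / length` to every neighbour and keeps `(1 - dampingFactor) / numVertices` for itself, and the partial ranks are then summed per node, which corresponds to the join, flatMap, groupBy, and reduce steps of the slide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class PageRankStepSketch {

    // Performs one PageRank iteration on in-memory maps.
    // ranks: node -> current rank; adjacency: node -> outgoing neighbours.
    // dampingFactor is an assumed parameter; the slides leave its value open.
    public static Map<Integer, Double> step(Map<Integer, Double> ranks,
                                            Map<Integer, int[]> adjacency,
                                            double dampingFactor) {
        int numVertices = ranks.size();
        Map<Integer, Double> next = new TreeMap<>();
        for (Map.Entry<Integer, Double> entry : ranks.entrySet()) {
            int node = entry.getKey();
            double rank = entry.getValue();
            int[] neighbours = adjacency.get(node);
            if (neighbours == null) {
                continue; // like an inner join: nodes without an adjacency entry emit nothing
            }
            // Partial rank the node keeps for itself (random jump share).
            next.merge(node, (1 - dampingFactor) / numVertices, Double::sum);
            // Partial ranks sent to each neighbour, summed per target node.
            for (int neighbour : neighbours) {
                next.merge(neighbour, dampingFactor * rank / neighbours.length, Double::sum);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Data sets from slide 10.
        Map<Integer, Double> ranks = new HashMap<>();
        ranks.put(1, 0.3);
        ranks.put(2, 0.5);
        ranks.put(3, 0.2);
        Map<Integer, int[]> adjacency = new HashMap<>();
        adjacency.put(1, new int[]{2, 3});
        adjacency.put(2, new int[]{1});
        adjacency.put(3, new int[]{1, 2});

        System.out.println(step(ranks, adjacency, 0.85));
    }
}
```

With the assumed damping factor of 0.85, one step moves the ranks to roughly 0.56, 0.2625, and 0.1775 for nodes 1, 2, and 3, and the ranks still sum to 1. Calling `step` repeatedly plays the role of `iterate(100)` on the slide.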