2. Apache Flink
• Apache Flink is an open source stream processing framework
• Low latency
• High throughput
• Stateful
• Distributed
• Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production
2016-06-23 #CSHUG 2
3. Flink in the wild
• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Picked Flink for "Saiki" data integration & distribution platform
See talks by at
4. The Flink Stack
• DataStream API: Stream Processing
• DataSet API: Batch Processing
• Runtime: Distributed Streaming Data Flow
• Libraries
Streaming and batch as first class citizens.
5. The Flink Job
case class Impressions(id: String, impressions: Long)

val events: DataStream[Event] =
  env.addSource(new FlinkKafkaConsumer09(…))

val impressions: DataStream[Impressions] = events
  .filter(evt => evt.isImpression)
  .map(evt => Impressions(evt.id, evt.numImpressions))

// Sum impressions per id over one-hour windows
val counts: DataStream[Impressions] = impressions
  .keyBy("id")
  .timeWindow(Time.hours(1))
  .sum("impressions")
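What keyBy, timeWindow and sum compute together can be sketched in plain Scala, without any Flink dependency. This is only an illustration over an in-memory collection: `TimedImpression`, its `timestampMs` field and `hourlySums` are hypothetical names that do not appear in the job above, timestamps are assumed to be non-negative milliseconds, and out-of-order events and watermarks are ignored.

```scala
object WindowSketch {
  // Hypothetical event type: the slide's Impressions plus an explicit timestamp.
  final case class TimedImpression(id: String, timestampMs: Long, impressions: Long)

  val HourMs: Long = 60L * 60L * 1000L

  // Groups events by (id, start of their one-hour window) and sums the impressions.
  // Assumes non-negative timestamps, so the modulo yields the offset into the window.
  def hourlySums(events: Seq[TimedImpression]): Map[(String, Long), Long] =
    events
      .groupBy(e => (e.id, e.timestampMs - (e.timestampMs % HourMs)))
      .map { case (key, group) => key -> group.map(_.impressions).sum }
}
```

Each key of the result is an (id, window start) pair, which mirrors one record emitted per id per window by the streaming job.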
8. Batch on a streaming (pipelined) engine
[Figure: data flow with two data sources (small and large), map operators, a join operator, and data sinks. The small source is streamed into the join as the build side; the remaining operators run in parallel once the build side is finished.]
11. Smooth out of core performance
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Blue bars are in-memory, orange bars (partially) out-of-core
14. Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
Based on Maslow's hierarchy of needs
20. Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
1.1+
22. Classic ETL
[Figure: classic ETL pipeline. Server logs, mobile, and IoT sources land in HDFS / S3 (the “Data Lake”) as Tier 0: raw data. Periodic jobs produce Tier 1: normalized, cleansed data (Parquet / ORC in HDFS). Further periodic jobs produce Tier 2: aggregated data (the “Data Warehouse”), which users query.]
23. Streaming ETL
[Figure: streaming ETL pipeline. Server logs, mobile, and IoT sources feed a stream processor that handles cleansing, transformation, time-window computations, and alerts. Connectors and sinks shown: Kafka connector, ES connector, rolling file sink, JDBC sink, Cassandra sink. Other labels: Tier 0: raw data (“Data Lake”), Parquet / ORC in HDFS, Tier 2: aggregated data, user, batch processing.]
25. BigPetStore
Blueprints for Big Data applications
Consists of:
• Data Generators
• Examples using tools in Big Data ecosystem to process data
• Build system and tests for integrating tools and multiple JVM languages
Part of the Apache Bigtop project
26. BigPetStore model
• Customers visiting pet stores generating transactions, location based
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
27. Data generation
val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()

val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  .map { t => t.setDateTime(t.getDateTime + startTime); t }

transactions.writeAsText(output)
• Use RJ Nowling’s Java generator classes
• Write transactions to JSON
28. ETL with the DataSet API
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))

// Give every distinct product a unique id
val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId

// Join on the product to get distinct (customer id, product id) pairs
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct

customerAndProductPairs.writeAsCsv(output)
• Read the dirty JSON
• Output (customer, product) pairs for the recommender
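The same ETL steps can be tried on ordinary Scala collections, which may help when checking the expected output on a small sample. This is a sketch under assumptions, not the BigPetStore code: `Tx`, `productIndex` and `customerProductPairs` are hypothetical names, products are reduced to plain strings, and `zipWithIndex` stands in for `zipWithUniqueId` (it yields consecutive indices, whereas Flink only promises unique ones).

```scala
object EtlSketch {
  // Hypothetical, simplified stand-in for a parsed transaction.
  final case class Tx(customerId: Long, products: Seq[String])

  // Assigns each distinct product an index, in order of first appearance.
  def productIndex(txs: Seq[Tx]): Map[String, Long] =
    txs.flatMap(_.products).distinct.zipWithIndex
      .map { case (product, idx) => product -> idx.toLong }
      .toMap

  // Emits distinct (customerId, productIndex) pairs; the map lookup plays
  // the role of the join on the product in the DataSet version.
  def customerProductPairs(txs: Seq[Tx]): Seq[(Long, Long)] = {
    val index = productIndex(txs)
    txs.flatMap(t => t.products.map(p => (t.customerId, index(p)))).distinct
  }
}
```

A repeat purchase of the same product by the same customer collapses into a single pair, which is what the recommender input needs.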
29. ETL with the Table API
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable

val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)

val bestStores = table.groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]
• Read the dirty JSON
• SQL style queries
30. A little recommender theory
[Figure: the user-item matrix R (users U by items I) factorized into user factors P and item factors Q, with optional user side information and item side information.]
• R is potentially huge, approximate it with P∗Q
• Prediction is TopK(user’s row ∗ Q)
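As a small illustration of the prediction step, the sketch below scores every item as the dot product of one user's row of P with that item's column of Q and keeps the k best. It is plain Scala with hypothetical names (`RecommenderSketch`, `dot`, `topK`) and assumes the factors fit in memory as arrays, which is exactly what does not hold at scale.

```scala
object RecommenderSketch {
  // Dot product of two equally sized factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same length")
    a.indices.map(i => a(i) * b(i)).sum
  }

  // userRow: one row of P. itemFactors: item id -> that item's column of Q.
  // Returns the k items with the highest predicted score, best first.
  def topK(userRow: Array[Double],
           itemFactors: Map[Int, Array[Double]],
           k: Int): Seq[(Int, Double)] =
    itemFactors.toSeq
      .map { case (itemId, q) => (itemId, dot(userRow, q)) }
      .sortBy { case (_, score) => -score }
      .take(k)
}
```

Only one row of P and the matrix Q are touched, so R itself never has to be materialized.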
31. Matrix factorization with FlinkML
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.readCsvFile[(Int, Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))

val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)

model.fit(input)

val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)
• Read the (customer, product) pairs
• Write P and Q to file
32. Recommendation with the DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();

env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())
    .broadcast()
    .map(new PartialTopK())
    .keyBy(0)
    .flatMap(new GlobalTopK())
    .print();
• Get the user’s row for a userID
• Compute the distributed TopK of the user’s row ∗ Q
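Why a partial step followed by a global step gives the right answer can be checked with a short plain-Scala sketch. The names below are hypothetical and are not the `PartialTopK` and `GlobalTopK` classes used in the job; the sketch assumes each partition holds the scores for a disjoint slice of the items.

```scala
object DistributedTopKSketch {
  type Scored = (Int, Double) // (item id, predicted score)

  // Runs independently on each partition: keep only the local best k.
  def partialTopK(scores: Seq[Scored], k: Int): Seq[Scored] =
    scores.sortBy { case (_, score) => -score }.take(k)

  // Runs once per user: merge the partial lists and keep the overall best k.
  def globalTopK(partials: Seq[Seq[Scored]], k: Int): Seq[Scored] =
    partialTopK(partials.flatten, k)
}
```

Any item in the overall top k is also in the top k of its own partition, so the merge loses nothing while each partition forwards at most k candidates.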
34. Next steps
• SQL: ongoing work in collaboration with Apache Calcite
• Dynamic scaling: adapt resources to stream volume, historical stream processing
• Queryable state: ability to query the state inside the stream processor
• Mesos support
• More sources and sinks (e.g., Kinesis, Cassandra)
35. Big thanks to
• Ataccama
• Cloudera
• data Artisans
• MTA SZTAKI
36. Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org