Meet the squirrel @ #CSHUG

Meeting the Squirrel
Márton Balassi
mbalassi@apache.org
@MartonBalassi

Apache Flink
• Apache Flink is an open source stream processing framework
• Low latency
• High throughput
• Stateful
• Distributed
• Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production
2016-06-23 #CSHUG 2

Flink in the wild
• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Picked Flink for "Saiki" data integration & distribution platform
See talks by at

The Flink Stack
• DataStream API: Stream Processing
• DataSet API: Batch Processing
• Runtime: Distributed Streaming Data Flow
• Libraries
Streaming and batch as first class citizens.

The Flink Job
case class Impressions(id: String, impressions: Long)
// Read raw events from Kafka
val events: DataStream[Event] =
  env.addSource(new FlinkKafkaConsumer09(…))
// Keep only impression events and project them to (id, impressions)
val impressions: DataStream[Impressions] = events
  .filter(evt => evt.isImpression)
  .map(evt => Impressions(evt.id, evt.numImpressions))
// Sum impressions per id over one-hour windows
val counts: DataStream[Impressions] = impressions
  .keyBy("id")
  .timeWindow(Time.hours(1))
  .sum("impressions")
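What the job above computes can be sketched without a cluster. The following is a minimal illustration on plain Scala collections, not Flink code: the Event fields, the explicit event timestamp and the windowStart helper are assumptions made for this sketch, and the window is assigned from that timestamp rather than from whatever time characteristic the real job uses.

```scala
object HourlyImpressionCounts {
  // Hypothetical event type for this sketch; timestamps are epoch milliseconds (non-negative).
  final case class Event(id: String, timestampMs: Long, isImpression: Boolean, numImpressions: Long)
  final case class Impressions(id: String, impressions: Long)

  val HourMs: Long = 60L * 60L * 1000L

  // Start of the one-hour tumbling window that contains the timestamp.
  def windowStart(timestampMs: Long): Long = timestampMs - (timestampMs % HourMs)

  // Mirrors filter -> map -> keyBy("id") -> timeWindow(1 hour) -> sum("impressions"):
  // the result is keyed by (id, window start).
  def hourlyCounts(events: Seq[Event]): Map[(String, Long), Impressions] =
    events
      .filter(_.isImpression)
      .groupBy(e => (e.id, windowStart(e.timestampMs)))
      .map { case (key @ (id, _), group) =>
        key -> Impressions(id, group.map(_.numImpressions).sum)
      }
}
```

Unlike this batch-style sketch, the streaming job emits each window's sum as soon as that window closes, without ever holding the whole input.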

The Flink Job
[Diagram: the job as a parallel dataflow with two parallel instances of each operator: Kafka Source, filter(), map(), keyBy(), window()/sum(), Sink]

A selection of cool features

Batch on a streaming (pipelined) engine
[Diagram: Data Source (small) and Data Source (large) feed Map operators and a Join operator, which write to Data Sinks. Edge labels: "Stream", "in parallel", "in parallel (once build side finished)"]
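The "once build side finished" label describes a build/probe join. A minimal sketch on plain Scala collections (not Flink's runtime; the hashJoin name and signature are assumptions for illustration): the small input is consumed fully into a hash table, then the large input is streamed past it one record at a time.

```scala
object BuildProbeJoin {
  // Joins a small and a large input on their key, hash-join style.
  def hashJoin[K, A, B](small: Iterable[(K, A)], large: Iterator[(K, B)]): Iterator[(K, A, B)] = {
    // Build phase: the small side is read completely into an in-memory table.
    val table: Map[K, Seq[A]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2).toSeq }

    // Probe phase: the large side is streamed record by record and never materialized.
    large.flatMap { case (k, b) =>
      table.getOrElse(k, Seq.empty).map(a => (k, a, b))
    }
  }
}
```

Because the probe side is an Iterator, results flow downstream while the large input is still being read, which is the pipelining the slide refers to.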

Query Optimizer

Operators on managed memory

Smooth out of core performance
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Blue bars are in-memory, orange bars (partially) out-of-core

Fast iterations
More at: http://data-artisans.com/data-analysis-with-flink.html

Hadoop integration

Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
Based on Maslow's hierarchy of needs

The "... queryable" level is marked 1.1+ on the slide.

Streaming architecture

Classic ETL
[Diagram: Server Logs, Mobile and IoT sources land in HDFS / S3, the "Data Lake" (Tier 0: Raw data). Periodic jobs produce Parquet / ORC in HDFS (Tier 1: Normalized, cleansed data). Further periodic jobs produce the "Data Warehouse" (Tier 2: Aggregated data), which Users query.]

Streaming ETL
[Diagram: Server Logs, Mobile and IoT sources; Kafka Connector; Stream Processor with the stages Cleansing, Transformation, Time-Window and Alerts; outputs via ES Connector, Rolling file sink, JDBC sink and Cassandra sink. Storage labels: "Data Lake" (Tier 0: Raw data), Parquet / ORC in HDFS, Tier 2: Aggregated data. Also shown: Batch Processing, User.]

The BigPetStore example

BigPetStore
Blueprints for Big Data applications
Consists of:
• Data Generators
• Examples using tools in the Big Data ecosystem to process data
• Build system and tests for integrating tools and multiple JVM languages
Part of the Apache Bigtop project

BigPetStore model
• Customers visiting pet stores generating transactions, location based
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014

Data generation
val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()
val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  .map { t => t.setDateTime(t.getDateTime + startTime); t }
transactions.writeAsText(output)
• Use RJ Nowling’s Java generator classes
• Write transactions to JSON

ETL with the DataSet API
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct
customerAndProductPairs.writeAsCsv(output)
• Read the dirty JSON
• Output (customer, product) pairs for the recommender
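The shape of this ETL step can be tried on plain Scala collections. This is a rough sketch, not the DataSet API: the Transaction case class is a hypothetical stand-in for the parsed JSON, and zipWithIndex (consecutive indices) stands in for zipWithUniqueId, which only guarantees unique ids.

```scala
object CustomerProductPairs {
  // Hypothetical stand-in for a parsed transaction: a customer id and the purchased product names.
  final case class Transaction(customerId: Long, products: Seq[String])

  // Give every distinct product a numeric index, in order of first appearance.
  def productIndex(transactions: Seq[Transaction]): Map[String, Long] =
    transactions.flatMap(_.products).distinct.zipWithIndex
      .map { case (product, idx) => product -> idx.toLong }
      .toMap

  // Emit each distinct (customer id, product index) pair; the map lookup plays the role of the join.
  def pairs(transactions: Seq[Transaction]): Seq[(Long, Long)] = {
    val index = productIndex(transactions)
    transactions
      .flatMap(t => t.products.map(p => (t.customerId, index(p))))
      .distinct
  }
}
```

The numeric pairs are the input format the matrix factorization step expects.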

ETL with the Table API
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable
val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)
val bestStores = table.groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]
• Read the dirty JSON
• SQL style queries
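One reading of what the query is after, the stores with the highest transaction count, can be written on plain Scala collections. This is a sketch under that assumption and not Table API code; StoreTransaction is a hypothetical input type, and the StoreCount fields are guessed from the select clause.

```scala
object BestStores {
  // Hypothetical input row: one record per transaction.
  final case class StoreTransaction(storeId: Int, storeName: String)
  final case class StoreCount(storeId: Int, storeName: String, count: Long)

  // GROUP BY store, COUNT(*)
  def countPerStore(ts: Seq[StoreTransaction]): Seq[StoreCount] =
    ts.groupBy(t => (t.storeId, t.storeName)).toSeq
      .map { case ((id, name), group) => StoreCount(id, name, group.size.toLong) }

  // Keep the stores whose count equals the maximum count (ties are all returned).
  def bestStores(ts: Seq[StoreTransaction]): Seq[StoreCount] = {
    val counts = countPerStore(ts)
    if (counts.isEmpty) Seq.empty
    else {
      val max = counts.map(_.count).max
      counts.filter(_.count == max).sortBy(_.storeId)
    }
  }
}
```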

A little Recommender theory
[Diagram: the User-Item matrix R is factorized into User factors P and Item factors Q; User side information and Item side information are also shown (matrix labels: U, I, P, Q, R)]
• R is potentially huge, approximate it with P∗Q
• Prediction is TopK(user’s row ∗ Q)
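The prediction rule can be made concrete with a small worked example. The sketch below assumes Q is held as a map from item id to factor vector and that the user's row is the matching row of P; both representations, and the names dot and topK, are choices made for this illustration.

```scala
object TopKPrediction {
  // Dot product of two equally sized factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Score every item as (user's row * item's factors) and keep the k best.
  // Ties on score are broken by the smaller item id.
  def topK(userRow: Array[Double], q: Map[Int, Array[Double]], k: Int): Seq[(Int, Double)] =
    q.toSeq
      .map { case (item, factors) => (item, dot(userRow, factors)) }
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
}
```

With userRow = (1.0, 2.0) and items 10 -> (1.0, 0.0), 20 -> (0.0, 1.0), 30 -> (1.0, 1.0), the scores are 1.0, 2.0 and 3.0, so the top 2 are items 30 and 20.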

Matrix factorization with FlinkML
val input = env.readCsvFile[(Int,Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))
val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)
model.fit(input)
val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)
• Read the (customer, product) pairs
• Write P and Q to file

Recommendation with the DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();
env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())
    .broadcast()
    .map(new PartialTopK())
    .keyBy(0)
    .flatMap(new GlobalTopK());
    .print();
• Get the user’s row for a userID
• Compute the distributed TopK of the user’s row ∗ Q
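The two-stage TopK can be sketched on plain Scala collections. The function names echo the PartialTopK and GlobalTopK operators on the slide, but the bodies are assumptions about what such operators would do, not their actual implementation; Q is assumed to be split into partitions, each a map from item id to factor vector.

```scala
object DistributedTopK {
  type Scored = (Int, Double) // (item id, predicted score)

  // Highest score first; ties are broken by the smaller item id.
  private def best(scores: Seq[Scored], k: Int): Seq[Scored] =
    scores.sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1)).take(k)

  // Stage 1: each partition of Q scores only its own items against the
  // broadcast user row and keeps its local top k.
  def partialTopK(userRow: Array[Double], partition: Map[Int, Array[Double]], k: Int): Seq[Scored] =
    best(
      partition.toSeq.map { case (item, factors) =>
        (item, userRow.zip(factors).map { case (x, y) => x * y }.sum)
      },
      k
    )

  // Stage 2: merge the local winners; the global top k is always among them.
  def globalTopK(partials: Seq[Seq[Scored]], k: Int): Seq[Scored] =
    best(partials.flatten, k)
}
```

Only k candidates per partition reach the merge step, so the data sent to the global operator does not grow with the size of Q.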

Coming up in Flink

Next steps
• SQL: ongoing work in collaboration with Apache Calcite
• Dynamic scaling: adapt resources to stream volume, historical stream processing
• Queryable state: ability to query the state inside the stream processor
• Mesos support
• More sources and sinks (e.g., Kinesis, Cassandra)

Big thanks to
• Ataccama
• Cloudera
• data Artisans
• MTA SZTAKI

Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org

Check out Flink
@ApacheFlink
Márton Balassi
mbalassi@apache.org / @MartonBalassi
