Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu

A Deep Dive into Flink SQL
Jark Wu
Software Engineer at Alibaba
Apache Flink Committer & PMC member

Content
• Flink SQL Architecture
• How Flink SQL Works?
• Flink SQL Optimizations
• Summary and Futures

This is a joint effort with the entire Apache Flink community!

Architecture before Flink 1.9
Does this look unified?

A Step Closer
• Different APIs for streaming and batch
• Different translation paths
• Different code for streaming and batch
What we want

Future Architecture
[Diagram: Table API & SQL → Planner (stream & batch) → Stream Transformation → Stream Operator → Flink Task Runtime]

Architecture since Flink 1.9+
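[Diagram: Table API & SQL sits on top of two planners, both running on the Flink Task Runtime]
• Blink Planner (stream & batch) → Stream Transformation → Stream Operator
• Flink Planner (batch) → DataSet → Driver
• Flink Planner (stream) → Stream Transformation → Stream Operator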

How Flink SQL Works?
[Diagram: SQL / Table API → Logical Plan → Physical Plan → Transformations → JobGraph → Submit Job → Flink Cluster]
Catalog: Hive Metastore
Optimizer (configurable optimizer phases):
• SubQuery Decorrelation
• Filter/Project PushDown
• Join Reorder
• …
Code Generation
Code Optimizations:
• Generated operators
• JVM intrinsic
• Declarative expressions
State-of-the-art Operators:
• Operate on binary data
• Cache-efficient sorter
• Compact binary hash map
• Hybrid hash join
Resource Optimizations:
• Fully managed memory
• IO Manager
• Off-Heap memory

An Example
SQL Query
SELECT
  t1.id, 1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
  t1.id = t2.id AND
  t2.id < 1000
Logical Plan
Project (t1.id, 1+2+t1.value)
  Filter (t1.id = t2.id AND t2.id < 1000)
    Join
      Scan (t1)
      Scan (t2)

Expression Reduce
Logical Plan
Project (t1.id, 1+2+t1.value)
  Filter (t1.id = t2.id AND t2.id < 1000)
    Join
      Scan (t1)
      Scan (t2)
Expression Tree
• Before (evaluates 1+2 for every row): Plus(Plus(Literal(1), Literal(2)), Field(t1.value))
• After (constant expressions reduced): Plus(Literal(3), Field(t1.value))
1+2+t1.value → 3+t1.value
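The reduction on this slide can be sketched as a bottom-up pass over the expression tree. This is only a minimal illustration in Python: the class names mirror the labels on the slide, and `reduce_constants` is a hypothetical helper, not the planner's actual rule.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Field:
    name: str


@dataclass(frozen=True)
class Plus:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Field, Plus]


def reduce_constants(expr: Expr) -> Expr:
    """Fold every Plus whose two operands are literals, bottom-up."""
    if isinstance(expr, Plus):
        left = reduce_constants(expr.left)
        right = reduce_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Plus(left, right)
    return expr  # literals and field references are already minimal


# 1 + 2 + t1.value parses as Plus(Plus(1, 2), t1.value)
tree = Plus(Plus(Literal(1), Literal(2)), Field("t1.value"))
reduced = reduce_constants(tree)
```

After the pass, `reduced` is `Plus(Literal(3), Field("t1.value"))`, so the addition of the two constants happens once at planning time instead of once per row.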

Filter Push Down
Before
Project (t1.id, 3+t1.value)
  Filter (t1.id = t2.id AND t2.id < 1000)
    Join
      Scan (t1): 1 million rows
      Scan (t2): 1 billion rows
After
Project (t1.id, 3+t1.value)
  Join (t1.id = t2.id)
    Scan (t1): 1 million rows
    Filter (t2.id < 1000): 1 thousand rows
      Scan (t2)
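A rough sketch of the rewrite on this slide: conjuncts of a filter that only touch one join input move below the join, and the rest become the join condition. The plan classes and `push_filter_into_join` are hypothetical names for illustration; the real optimizer rules are considerably more general.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Pred:
    text: str
    tables: FrozenSet[str]  # tables the predicate refers to


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    preds: Tuple[Pred, ...]
    child: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    cond: Tuple[Pred, ...] = ()


def tables_of(node) -> FrozenSet[str]:
    """Collect the tables scanned underneath a plan node."""
    if isinstance(node, Scan):
        return frozenset({node.table})
    if isinstance(node, Filter):
        return tables_of(node.child)
    return tables_of(node.left) | tables_of(node.right)


def push_filter_into_join(node):
    """Rewrite Filter(Join(l, r)) so single-side conjuncts sit below the join."""
    if not (isinstance(node, Filter) and isinstance(node.child, Join)):
        return node
    join = node.child
    left_tables, right_tables = tables_of(join.left), tables_of(join.right)
    left_preds = tuple(p for p in node.preds if p.tables <= left_tables)
    right_preds = tuple(
        p for p in node.preds if p not in left_preds and p.tables <= right_tables
    )
    both = tuple(p for p in node.preds if p not in left_preds + right_preds)
    left = Filter(left_preds, join.left) if left_preds else join.left
    right = Filter(right_preds, join.right) if right_preds else join.right
    return Join(left, right, join.cond + both)


equi = Pred("t1.id = t2.id", frozenset({"t1", "t2"}))
small = Pred("t2.id < 1000", frozenset({"t2"}))
plan = Filter((equi, small), Join(Scan("t1"), Scan("t2")))
optimized = push_filter_into_join(plan)
```

Here `optimized` joins `Scan(t1)` with a filtered `Scan(t2)`, so the join only sees the rows of t2 that survive `t2.id < 1000`.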

Projection Push Down
Before
Project (t1.id, 3+t1.value)
  Join (t1.id = t2.id)
    Scan (t1)
    Filter (t2.id < 1000)
      Scan (t2)
After
Project (t1.id, 3+t1.value)
  Join (t1.id = t2.id)
    Project (t1.id, t1.value)
      Scan (t1)
    Project (t2.id)
      Filter (t2.id < 1000)
        Scan (t2)

Physical Planning (Batch)
Optimized Logical Plan
Project (t1.id, 3+t1.value)
  Join (t1.id = t2.id)
    Project (t1.id, t1.value): 1 million rows
      Scan (t1)
    Project (t2.id): 1 thousand rows
      Filter (t2.id < 1000)
        Scan (t2)
Physical Plan
Calc (t1.id, 3+t1.value)
  BroadcastHashJoin
    Calc (t1.id, t1.value)
      Scan (t1)
    Calc (t2.id, t2.id < 1000)
      Scan (t2)

Translation & Code Generation
Physical Plan
Calc
  BroadcastHashJoin
    Calc
      Scan (t1)
    Calc
      Scan (t2)
Transformation Tree (via code generation)
Calc
  BroadcastHashJoin
    Calc
      Source
    Calc (t2.id, t2.id < 1000)
      Source

Physical Planning (Stream)
Special things for streaming: Changelog Mechanism (aka Retraction Mechanism)
What is a changelog and why do we need it?

Physical Planning (Stream): Changelog Mechanism
SELECT cnt, COUNT(cnt) AS freq
FROM (
  SELECT word, COUNT(*) AS cnt
  FROM words
  GROUP BY word)
GROUP BY cnt
Source (words)
word
Hello
World
Hello
Expected Result
cnt freq
1 1
2 1
(the intermediate row cnt = 1, freq = 2 is replaced by cnt = 1, freq = 1)

Without changelog
Source (words): Hello, World, Hello
Word Count
SELECT
  word,
  COUNT(*) AS cnt
FROM words
GROUP BY word
word_count
word cnt
Hello 1 → 2
World 1
Records sent downstream: (Hello, 1), (World, 1), (Hello, 2)
Count Frequency
SELECT
  cnt,
  COUNT(cnt) AS freq
FROM word_count
GROUP BY cnt
Result
cnt freq
1 2 (should be 1)
2 1

With changelog
Source (words): insert Hello, insert World, insert Hello
Word Count
SELECT
  word,
  COUNT(*) AS cnt
FROM words
GROUP BY word
word_count
word cnt
Hello 1 → 2
World 1
Changelog sent downstream:
• insert (Hello, 1)
• insert (World, 1)
• update_before (Hello, 1)
• update_after (Hello, 2)
Count Frequency
SELECT
  cnt,
  COUNT(cnt) AS freq
FROM word_count
GROUP BY cnt
Result
cnt freq
1 2 → 1
2 1
① Changelog makes the streaming query result correct
② Query optimizer determines whether update_before is needed
③ Users are not aware of it
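The word-count example can be replayed with a small simulation. It is only a sketch: `word_count` and `count_frequency` are hypothetical stand-ins for the two aggregations, and the message kinds follow the insert / update_before / update_after labels on the slide.

```python
from collections import Counter


def word_count(words):
    """Emit (kind, word, cnt) changelog messages for COUNT(*) GROUP BY word."""
    counts = Counter()
    for w in words:
        if counts[w] > 0:
            # Retract the previously emitted row before sending the new one.
            yield ("update_before", w, counts[w])
        counts[w] += 1
        kind = "insert" if counts[w] == 1 else "update_after"
        yield (kind, w, counts[w])


def count_frequency(messages, use_changelog):
    """COUNT(cnt) GROUP BY cnt over the word_count stream."""
    freq = Counter()
    for kind, _word, cnt in messages:
        if kind == "update_before":
            if use_changelog:
                freq[cnt] -= 1  # remove the old row's contribution
            continue  # without changelog the old row is never retracted
        freq[cnt] += 1
    return {c: f for c, f in freq.items() if f > 0}


words = ["Hello", "World", "Hello"]
without_changelog = count_frequency(word_count(words), use_changelog=False)
with_changelog = count_frequency(word_count(words), use_changelog=True)
```

Ignoring the update_before message leaves `{1: 2, 2: 1}`, the wrong result from the "without changelog" slide; applying it gives `{1: 1, 2: 1}`.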

Step 1: determine what changes a node will produce
[I]: Insert, [U]: Update, [D]: Delete
Physical Plan (source to sink)
• Source (words): produces [INSERT], i.e. [I]
• Aggregate (word, count(*) as cnt): produces [INSERT, UPDATE], i.e. [I,U]
• Calc (cnt): produces [INSERT, UPDATE], i.e. [I,U]
• Aggregate (cnt, count(*)): produces [INSERT, UPDATE, DELETE], i.e. [I,U,D]
• UpsertSink

Step 2: determine how to produce updates
[UB]: Update_Before, [UA]: Update_After
Physical Plan (sink to source)
• UpsertSink: requires only UPDATE_AFTER
• Aggregate (cnt): [I,U,D]; requires UPDATE_BEFORE + UPDATE_AFTER; produces [UPDATE_AFTER], i.e. [UA]
• Calc: [I,U]; requires UPDATE_BEFORE + UPDATE_AFTER; produces [UPDATE_BEFORE+UPDATE_AFTER], i.e. [UB+UA]
• Aggregate (word): [I,U]; requires nothing; produces [UPDATE_BEFORE+UPDATE_AFTER], i.e. [UB+UA]
• Source: [I]
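One way to picture Step 2 is a single pass from the sink back to the source in which each node learns whether its consumer needs UPDATE_BEFORE. The sketch below is a deliberately simplified, hypothetical model of a linear pipeline; the tuple layout, the `FORWARD` marker and `infer_update_kinds` are assumptions made for illustration, not the planner's data structures.

```python
# Marker for nodes that pass their consumer's requirement through unchanged
# (a Calc neither needs nor removes UPDATE_BEFORE by itself).
FORWARD = "forward"

# (name, emits_updates, needs_update_before_from_input), source to sink.
PIPELINE = [
    ("Source", False, False),
    ("Aggregate(word)", True, False),  # insert-only input: requires nothing
    ("Calc", True, FORWARD),
    ("Aggregate(cnt)", True, True),    # must retract old cnt values
    ("UpsertSink", False, False),      # only needs UPDATE_AFTER
]


def infer_update_kinds(pipeline):
    """Walk sink to source and decide UB+UA versus UA for updating nodes."""
    kinds = {}
    required = False  # what the consumer of the current node requires
    for name, emits_updates, needs in reversed(pipeline):
        if emits_updates:
            kinds[name] = "UB+UA" if required else "UA"
        if needs != FORWARD:
            required = needs
    return kinds


kinds = infer_update_kinds(PIPELINE)
```

For this pipeline the pass yields UA for the second aggregate and UB+UA for the Calc and the first aggregate, matching the annotations on the slide.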

Final Physical Plan
• Source (words): [I]
• Aggregate (word, count(*) as cnt): [I,U], [UB+UA]. Simple COUNT implementation; generates UPDATE_BEFORE
• Calc (cnt): [I,U], [UB+UA]
• Aggregate (cnt, count(*)): [I,U,D], [UA]. COUNT with retract() implementation; does not generate UPDATE_BEFORE
• UpsertSink

Flink SQL Optimizations
• Internal Data Structure (BinaryRow)
• Mini-Batch Processing
• Aggregation Skew Handling
• Plan Rewrite

Old Planner: Row
• Row(2019, “Flink”, “Forward”)
• Layout: Row → Object[] → Integer(2019), String(“Flink”), String(“Forward”)
• Space inefficiency (object header)
• Boxing and unboxing
• Serialization and deserialization cost, especially when we want to access fields randomly

New Blink Planner: BinaryRow
• Deeply integrated with MemorySegment
• No need to deserialize / Compact layout / Random accessible
• Also have BinaryString, BinaryArray, BinaryMap
Layout in a Memory Segment: Header (Row Kind) | Null info | Fixed-length part | Variable-length part
Example: 2019 | pointer | pointer | 5 Flink | 7 Forward
The Blink planner outperforms the old planner by 54.6% when object reuse is enabled:
https://www.ververica.com/blog/a-journey-to-beating-flinks-sql-performance
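To make the fixed-length / variable-length split concrete, here is a simplified sketch of such a layout using Python's `struct` module. The 8-byte header, the 8-byte slots and the (offset, length) encoding of the pointers are assumptions chosen for illustration; null bits are left out, and this is not the exact byte format of BinaryRow.

```python
import struct

HEADER = 8  # assumed: one 8-byte word for row kind and null info
SLOT = 8    # assumed: one 8-byte slot per field in the fixed-length part


def write_row(values):
    """Ints go inline; strings go to the variable-length part behind a pointer."""
    fixed = bytearray(HEADER + SLOT * len(values))
    var = bytearray()
    for i, v in enumerate(values):
        pos = HEADER + SLOT * i
        if isinstance(v, int):
            struct.pack_into("<q", fixed, pos, v)
        else:
            data = v.encode("utf-8")
            offset = len(fixed) + len(var)  # absolute offset inside the row
            struct.pack_into("<II", fixed, pos, offset, len(data))
            var += data
    return bytes(fixed + var)


def read_int(row, i):
    """Random access to field i without touching any other field."""
    return struct.unpack_from("<q", row, HEADER + SLOT * i)[0]


def read_str(row, i):
    offset, length = struct.unpack_from("<II", row, HEADER + SLOT * i)
    return row[offset:offset + length].decode("utf-8")


row = write_row([2019, "Flink", "Forward"])
```

Reading `read_str(row, 2)` jumps straight to the third slot and follows its pointer, which is the random-access property the slide contrasts with the object-based Row.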

Mini-Batch Processing
Normal aggregation:
SELECT SUM(num) FROM T GROUP BY color
• Each record would cost:
  • One state read and write
  • One deserialization and serialization
  • One output

Mini-Batch Processing
Mini-Batch aggregation:
• Use heap memory to hold the bundle
• In-memory aggregation before accessing state and serde operations
• Also eases the downstream load
table.exec.mini-batch.enabled = true
table.exec.mini-batch.allow-latency = "5000 ms"
table.exec.mini-batch.size = 1000
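The saving can be illustrated by counting state accesses for `SUM(num) GROUP BY color`. This is a toy model: a plain dict stands in for keyed state, bundles are cut by size only (the real operator also flushes when the allowed latency expires), and both function names are hypothetical.

```python
from collections import defaultdict


def normal_agg(records, state):
    """One state read and one state write for every record."""
    accesses = 0
    for color, num in records:
        state[color] = state.get(color, 0) + num
        accesses += 2
    return accesses


def mini_batch_agg(records, state, batch_size):
    """Pre-aggregate each bundle on the heap, then touch state once per key."""
    accesses = 0
    for start in range(0, len(records), batch_size):
        bundle = defaultdict(int)
        for color, num in records[start:start + batch_size]:
            bundle[color] += num
        for color, partial in bundle.items():
            state[color] = state.get(color, 0) + partial
            accesses += 2
    return accesses


records = [("red", 1), ("blue", 2), ("red", 3),
           ("red", 4), ("blue", 5), ("red", 6)]
state_normal, state_batched = {}, {}
normal_accesses = normal_agg(records, state_normal)
batched_accesses = mini_batch_agg(records, state_batched, batch_size=3)
```

Both variants end with the same sums, but the bundled version performs 8 state accesses instead of 12 on this input, and the gap grows with the number of records per key in a bundle.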

Aggregation Skew Handling
table.optimizer.agg-phase-strategy = TWO_PHASE
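A minimal sketch of the idea behind a two-phase (local then global) aggregation, assuming the same `SUM(num) GROUP BY color` query: each upstream task pre-aggregates its own records, so a hot key reaches the global task as one partial result per upstream task rather than one record per input row. The function names and the sample data are made up for illustration.

```python
from collections import Counter


def local_phase(partition):
    """Pre-aggregate SUM(num) per color inside one upstream task."""
    partial = Counter()
    for color, num in partition:
        partial[color] += num
    return list(partial.items())


def global_phase(partials):
    """Merge the partial sums after they have been shuffled by key."""
    total = Counter()
    for color, partial_sum in partials:
        total[color] += partial_sum
    return dict(total)


# Two upstream partitions; "red" is the hot (skewed) key.
partitions = [
    [("red", 1), ("red", 1), ("red", 1), ("blue", 1)],
    [("red", 1), ("red", 1), ("blue", 1), ("red", 1)],
]
shuffled = [p for part in partitions for p in local_phase(part)]
result = global_phase(shuffled)

# Records the task owning "red" has to process in each strategy.
red_one_phase = sum(1 for part in partitions for c, _ in part if c == "red")
red_two_phase = sum(1 for c, _ in shuffled if c == "red")
```

On this input the task that owns "red" handles 2 partial sums instead of 6 raw records, while the final result is unchanged.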

Plan Rewrite (Top-N)
• It’s impractical to do a global streaming sort
• But it becomes possible if the user only cares about the top n elements
• E.g. calculate the top 3 shops for each category
SELECT *
FROM (
  SELECT *, -- you can get shopId or other information from this
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rowNum
  FROM shop_sales)
WHERE rowNum <= 3

Plan Rewrite (Top-N)
Original Plan
…
Calc (rownum <= 3)
  OverAggregate (ROW_NUMBER, partition key = category, sort key = sales)
    …
Optimized Plan
…
Rank (partition key = category, sort key = sales, top = 3)
  …
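The reason a Rank node is cheaper than a full sort can be sketched with a bounded heap per partition key. This is an illustration only: it assumes insert-only input in which each shop appears once, `top_n_per_category` is a hypothetical helper, and the column names follow the query above.

```python
import heapq
from collections import defaultdict


def top_n_per_category(rows, n):
    """Keep only the n largest sales per category instead of sorting everything."""
    heaps = defaultdict(list)  # category -> min-heap of (sales, shop_id)
    for category, shop_id, sales in rows:
        heap = heaps[category]
        if len(heap) < n:
            heapq.heappush(heap, (sales, shop_id))
        elif (sales, shop_id) > heap[0]:
            # The new row beats the current n-th entry, so it replaces it.
            heapq.heapreplace(heap, (sales, shop_id))
    return {c: sorted(h, reverse=True) for c, h in heaps.items()}


rows = [
    ("book", "s1", 10),
    ("book", "s2", 30),
    ("book", "s3", 20),
    ("book", "s4", 40),
    ("toy", "s5", 5),
]
top3 = top_n_per_category(rows, 3)
```

Each partition holds at most n entries, so memory stays bounded no matter how many rows arrive for a category.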

Summary & Futures
• Flink took a big step towards a truly unified architecture
• Introduced how Flink SQL works step by step
• Flink SQL does a lot of optimizations for users automatically
• Future (Flink 1.11+)
  • Blink planner will be the default planner and ready for production
  • New TableSource and TableSink interfaces (FLIP-95)
  • Support to read changelogs (FLIP-105)
  • Unified Batch and Streaming Filesystem connector (FLIP-115)
  • Hive DDL & DML compatible (FLIP-123)
