1. Juggling with Bits and Bytes
How Apache Flink operates on binary data
Fabian Hueske
fhueske@apache.org
@fhueske
2. Big Data frameworks on JVMs
• Many (open source) Big Data frameworks run on JVMs
  – Hadoop, Drill, Spark, Hive, Pig, and ...
  – Flink as well
• Common challenge: How to organize data in-memory?
  – In-memory processing (sorting, joining, aggregating)
  – In-memory caching of intermediate results
• Memory management of a system influences
  – Reliability
  – Resource efficiency, performance & performance predictability
  – Ease of configuration
3. The straightforward approach
Store and process data as objects on the heap
• Put objects in an array and sort it
A few notable drawbacks
• Predicting memory consumption is hard
  – If you fail, an OutOfMemoryError will kill you!
• High garbage collection overhead
  – Easily 50% of time spent on GC
• Objects have considerable space overhead
  – At least 8 bytes for each (nested) object! (Depends on arch)
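The objects-on-heap approach can be sketched in a few lines; the record type and method names below are illustrative, not from the talk. Every record is a full Java object with its own header, and sorting shuffles object references around the heap, which is exactly what drives the GC and space overheads listed above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the straightforward approach: data lives as objects on the heap.
public class ObjectsOnHeap {
    // Hypothetical record type; each instance pays object-header overhead,
    // and the nested String is a second object with its own header.
    public static final class Record {
        public final int key;
        public final String value;
        public Record(int key, String value) { this.key = key; this.value = value; }
    }

    // Sorting rearranges references in the backing array; the objects
    // themselves stay scattered across the heap.
    public static List<Record> sortByKey(List<Record> records) {
        records.sort(Comparator.comparingInt((Record r) -> r.key));
        return records;
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<>();
        records.add(new Record(3, "c"));
        records.add(new Record(1, "a"));
        records.add(new Record(2, "b"));
        sortByKey(records);
        System.out.println(records.get(0).key);
    }
}
```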
5. Flink adopts DBMS technology
• Allocates a fixed number of memory segments upfront
• Data objects are serialized into memory segments
• DBMS-style algorithms work on the binary representation
6. Why is that good?
• Memory-safe execution
  – Used and available memory segments are easy to count
  – No parameter tuning for reliable operations!
• Efficient out-of-core algorithms
  – Memory segments can be efficiently written to disk
• Reduced GC pressure
  – Memory segments are off-heap or never deallocated
  – Data objects are short-lived or reused
• Space-efficient data representation
• Efficient operations on binary data
7. What does it cost?
• Significant implementation investment
  – Using java.util.HashMap vs.
  – Implementing a spillable hash table backed by byte arrays and a custom serialization stack
• Other systems use similar techniques
  – Apache Drill, Apache AsterixDB (incubating)
• Apache Spark is evolving in a similar direction
9. Memory segments
• Unit of memory distribution in Flink
  – Fixed number allocated when a worker starts
• Backed by a regular byte array (default 32KB)
• On-heap or off-heap allocation
• R/W access through Java's efficient unsafe methods
• Multiple memory segments can be logically concatenated into a larger chunk of memory
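The core of a memory segment can be sketched as a byte array with typed accessors. This is a simplified illustration, not Flink's actual MemorySegment class: the real implementation uses sun.misc.Unsafe for speed and also supports off-heap buffers, while this sketch uses plain byte arithmetic to stay portable.

```java
// Simplified sketch of a Flink-style memory segment: a fixed-size byte
// array with typed read/write accessors at explicit offsets.
public class SegmentSketch {
    private final byte[] memory;

    public SegmentSketch(int size) { this.memory = new byte[size]; }

    // Write a 4-byte int in big-endian order at the given offset.
    public void putInt(int index, int value) {
        memory[index]     = (byte) (value >>> 24);
        memory[index + 1] = (byte) (value >>> 16);
        memory[index + 2] = (byte) (value >>> 8);
        memory[index + 3] = (byte) value;
    }

    // Read the 4 bytes back and reassemble the int.
    public int getInt(int index) {
        return ((memory[index]     & 0xFF) << 24)
             | ((memory[index + 1] & 0xFF) << 16)
             | ((memory[index + 2] & 0xFF) << 8)
             |  (memory[index + 3] & 0xFF);
    }

    public int size() { return memory.length; }

    public static void main(String[] args) {
        SegmentSketch seg = new SegmentSketch(32 * 1024); // default segment size
        seg.putInt(0, 42);
        System.out.println(seg.getInt(0));
    }
}
```

Because segments are plain byte ranges, counting used and free ones is trivial, and writing a segment to disk is a single bulk copy, which is what enables the memory-safety and out-of-core properties claimed earlier.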
14. Custom de/serialization stack
• Many alternatives for Java object serialization
  – Dynamic: Kryo
  – Schema-dependent: Apache Avro, Apache Thrift, Protobufs
• But Flink has its own serialization stack
  – Operating on serialized data requires knowledge of the layout
  – Control over the layout can improve the efficiency of operations
  – Data types are known before execution
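The value of a known layout can be illustrated with a toy serializer for an (int, String) pair. The class and method names are made up for this sketch and are not Flink's TypeSerializer API; the point is that with a fixed layout (4-byte int, then a length-prefixed string), a single field can be read straight out of the binary record without deserializing the rest.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of a schema-aware serializer: the layout is fixed and known
// up front, so field offsets are predictable.
public class TupleSerializerSketch {
    public static byte[] serialize(int f0, String f1) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(f0);   // field 0: 4 bytes at offset 0
            out.writeUTF(f1);   // field 1: 2-byte length prefix + UTF-8 bytes
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen on a byte array
        }
    }

    // Thanks to the fixed layout, field 0 is read directly at offset 0,
    // without touching the rest of the record.
    public static int deserializeF0(byte[] record) {
        return ((record[0] & 0xFF) << 24) | ((record[1] & 0xFF) << 16)
             | ((record[2] & 0xFF) << 8)  |  (record[3] & 0xFF);
    }

    public static String deserializeF1(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            in.readInt();       // skip field 0
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] r = serialize(7, "abc");
        System.out.println(deserializeF0(r) + " " + deserializeF1(r));
    }
}
```

A generic framework like Kryo cannot offer this kind of direct field access, because the engine does not control or know the byte layout it produces.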
15. Rich & extensible type system
• Serialization framework requires knowledge of types
• Flink analyzes the return types of functions
  – Java: Reflection-based type analyzer
  – Scala: Compiler information + CodeGen via Macros
• Rich type system
  – Atomics: Primitives, Writables, Generic types, …
  – Composites: Tuples, Pojos, CaseClasses
  – Extensible by custom types
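A minimal sketch of reflection-based type analysis, in the spirit of the Java analyzer mentioned above: inspect a class's public fields to recover a field-name-to-type schema before execution. Flink's real TypeExtractor is far more elaborate (generics, field ordering, getters/setters); the POJO and helper here are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy reflection-based type analyzer: derives a simple schema from a POJO.
public class TypeAnalyzerSketch {
    // Hypothetical user POJO, analyzable because its fields are public.
    public static class WordCount {
        public String word;
        public int count;
    }

    // Map each public field name to its declared type.
    public static Map<String, Class<?>> analyze(Class<?> type) {
        Map<String, Class<?>> schema = new LinkedHashMap<>();
        for (Field f : type.getFields()) {
            schema.put(f.getName(), f.getType());
        }
        return schema;
    }

    public static void main(String[] args) {
        System.out.println(analyze(WordCount.class));
    }
}
```

Knowing the schema before execution is what lets the engine pick a fixed binary layout and specialized serializers instead of falling back to a generic, dynamic framework.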
18. Data processing algorithms
• Flink's algorithms are based on RDBMS technology
  – External Merge Sort, Hybrid Hash Join, Sort Merge Join, …
• Algorithms receive a budget of memory segments
  – Automatic decision about budget size
  – No fine-tuning of operator memory!
• Operate in-memory as long as data fits into the budget
  – And gracefully spill to disk if data exceeds memory
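The budget-bounded pattern behind an external merge sort can be sketched as follows. This is a toy illustration, not Flink's implementation: the budget is counted in records rather than memory segments, and "spilled" runs go to an in-memory list instead of disk to keep the sketch self-contained, but the shape is the same: fill the budget, sort and emit a run, then merge the sorted runs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy external merge sort with a fixed "budget" of in-memory records.
public class BudgetedSortSketch {
    public static List<Integer> externalSort(Iterator<Integer> input, int budget) {
        List<List<Integer>> runs = new ArrayList<>();
        List<Integer> current = new ArrayList<>(budget);
        while (input.hasNext()) {
            current.add(input.next());
            if (current.size() == budget) {   // budget exhausted:
                Collections.sort(current);    // sort the run
                runs.add(current);            // and "spill" it
                current = new ArrayList<>(budget);
            }
        }
        if (!current.isEmpty()) {
            Collections.sort(current);
            runs.add(current);
        }
        // k-way merge of the sorted runs: heap entries are {runIndex, position}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparingInt((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++) heap.add(new int[]{i, 0});
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            List<Integer> run = runs.get(e[0]);
            result.add(run.get(e[1]));
            if (e[1] + 1 < run.size()) heap.add(new int[]{e[0], e[1] + 1});
        }
        return result;
    }
}
```

Because the operator only ever holds one budget's worth of records at a time, memory use stays bounded no matter how large the input grows, which is the "graceful spilling" property from the slide.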
23. Sort benchmark
• Task: Sort 10 million Tuple2<Integer, String> records
  – String length 12 chars → Tuple has 16 bytes of raw data → ~152 MB raw data
  – Integers uniformly, Strings long-tail distributed
  – Sort on the Integer field and on the String field
• Generated input provided as a mutable-object iterator
• Use JVM with 900 MB heap size
  – Minimum size to reliably run the benchmark
24. Sorting methods
1. Objects-on-Heap:
  – Put cloned data objects in an ArrayList and use Java's Collections sort.
  – ArrayList is initialized with the right size.
2. Flink-serialized (on-heap):
  – Using Flink's custom serializers.
  – Integer with full binary sorting key, String with 8-byte prefix key.
3. Kryo-serialized (on-heap):
  – Serialize fields with Kryo.
  – No binary sorting keys, objects are deserialized for comparison.
• All implementations use a single thread
• Average execution time of 10 runs reported
• GC triggered between runs (does not go into reported time)
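The "8-byte prefix key" idea from the Flink-serialized variant can be sketched like this: most string comparisons are decided byte-by-byte on a fixed-length binary prefix, and only prefix ties fall back to comparing the full values. Flink calls such keys normalized keys; the helper names below are made up, and the sketch assumes ASCII strings so that byte order matches string order.

```java
import java.nio.charset.StandardCharsets;

// Sketch of prefix-key (normalized-key) comparison on binary data.
public class PrefixKeySketch {
    // Fixed-length binary key: first `len` bytes of the string, zero-padded.
    // (Assumes ASCII input; zero-padding keeps shorter strings ordered first.)
    public static byte[] prefixKey(String s, int len) {
        byte[] key = new byte[len];
        byte[] raw = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(raw, 0, key, 0, Math.min(raw.length, len));
        return key;
    }

    public static int compare(String a, String b) {
        byte[] ka = prefixKey(a, 8), kb = prefixKey(b, 8);
        for (int i = 0; i < 8; i++) {
            int diff = (ka[i] & 0xFF) - (kb[i] & 0xFF);
            if (diff != 0) return diff;   // decided on binary data alone
        }
        return a.compareTo(b);            // tie: fall back to the full values
    }
}
```

In a sorter, the prefix keys are stored inline next to the record pointers, so the hot comparison loop touches only sequential bytes instead of chasing object references, which is a large part of the speedup over the Kryo variant.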
30. We're not done yet!
• Serialization layouts tailored towards operations
  – More efficient operations on binary data
• Table API provides full semantics for execution
  – Use code generation to operate fully on binary data
• …
31. Summary
• Active memory management avoids OOMErrors
• Highly efficient data serialization stack
  – Facilitates operations on binary data
  – Makes more data fit into memory
• DBMS-style operators operate on binary data
  – High-performance in-memory processing
  – Graceful destaging to disk if necessary
• Read Flink's blog:
  – http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
  – http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
  – http://flink.apache.org/news/2015/09/16/off-heap-memory.html