Juggling with Bits and Bytes
How Apache Flink operates on binary data

Fabian Hueske
fhueske@apache.org    @fhueske
  
Big Data frameworks on JVMs
•  Many (open source) Big Data frameworks run on JVMs
–  Hadoop, Drill, Spark, Hive, Pig, and ...
–  Flink as well
•  Common challenge: How to organize data in-memory?
–  In-memory processing (sorting, joining, aggregating)
–  In-memory caching of intermediate results
•  Memory management of a system influences
–  Reliability
–  Resource efficiency, performance & performance predictability
–  Ease of configuration
  
The straight-forward approach
Store and process data as objects on the heap
•  Put objects in an array and sort it

A few notable drawbacks
•  Predicting memory consumption is hard
–  If you fail, an OutOfMemoryError will kill you!
•  High garbage collection overhead
–  Easily 50% of time spent on GC
•  Objects have considerable space overhead
–  At least 8 bytes for each (nested) object! (Depends on arch)
  
FLINK’S APPROACH

Flink adopts DBMS technology
•  Allocates fixed number of memory segments upfront
•  Data objects are serialized into memory segments
•  DBMS-style algorithms work on binary representation
  
Why is that good?
•  Memory-safe execution
–  Used and available memory segments are easy to count
–  No parameter tuning for reliable operations!
•  Efficient out-of-core algorithms
–  Memory segments can be efficiently written to disk
•  Reduced GC pressure
–  Memory segments are off-heap or never deallocated
–  Data objects are short-lived or reused
•  Space-efficient data representation
•  Efficient operations on binary data
  
What does it cost?
•  Significant implementation investment
–  Using java.util.HashMap
vs.
–  Implementing a spillable hash table backed by byte arrays and a custom serialization stack
•  Other systems use similar techniques
–  Apache Drill, Apache AsterixDB (incubating)
•  Apache Spark is evolving in a similar direction
  
MEMORY ALLOCATION

Memory segments
•  Unit of memory distribution in Flink
–  Fixed number allocated when worker starts
•  Backed by a regular byte array (default 32KB)
•  On-heap or off-heap allocation
•  R/W access through Java’s efficient unsafe methods
•  Multiple memory segments can be logically concatenated to a larger chunk of memory
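
The memory segment idea above can be sketched roughly as follows. This is a minimal illustration, not Flink's implementation: the class name `SegmentPoolSketch` and its methods are hypothetical, and `java.nio.ByteBuffer` stands in for the unsafe-based access the slide mentions. Only the 32 KB segment size and the on-heap/off-heap choice come from the slide.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

/** Hypothetical pool that hands out a fixed number of equally sized memory segments. */
final class SegmentPoolSketch {
    static final int SEGMENT_SIZE = 32 * 1024; // 32 KB, the default size named on the slide

    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

    SegmentPoolSketch(int numSegments, boolean offHeap) {
        // All segments are allocated once, up front, and are only recycled afterwards.
        for (int i = 0; i < numSegments; i++) {
            free.push(offHeap
                    ? ByteBuffer.allocateDirect(SEGMENT_SIZE)  // memory outside the Java heap
                    : ByteBuffer.allocate(SEGMENT_SIZE));      // backed by a byte[] on the heap
        }
    }

    /** Returns a free segment, or null when the budget is exhausted. */
    ByteBuffer request() {
        return free.poll();
    }

    /** Hands a segment back to the pool so it can be reused. */
    void release(ByteBuffer segment) {
        segment.clear();
        free.push(segment);
    }

    /** Counting free segments is all it takes to know how much memory is left. */
    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        SegmentPoolSketch pool = new SegmentPoolSketch(4, false);
        ByteBuffer seg = pool.request();
        seg.putInt(0, 42);                     // absolute write at byte offset 0
        System.out.println(seg.getInt(0));     // prints 42
        System.out.println(pool.available());  // prints 3
        pool.release(seg);
    }
}
```

Because the pool never grows, a caller that receives `null` has to wait or spill instead of allocating more, which is what makes memory use easy to count.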
  
On-heap memory allocation

Off-heap memory allocation

On-heap vs. Off-heap
•  No significant performance difference in micro-benchmarks
•  Garbage Collection
–  Smaller heap -> faster GC
•  Faster start-up time
–  A multi-GB JVM heap takes time to allocate
  
DATA SERIALIZATION

Custom de/serialization stack
•  Many alternatives for Java object serialization
–  Dynamic: Kryo
–  Schema-dependent: Apache Avro, Apache Thrift, Protobufs
•  But Flink has its own serialization stack
–  Operating on serialized data requires knowledge of layout
–  Control over layout can improve efficiency of operations
–  Data types are known before execution
  
Rich & extensible type system
•  Serialization framework requires knowledge of types
•  Flink analyzes return types of functions
–  Java: Reflection based type analyzer
–  Scala: Compiler information + CodeGen via Macros
•  Rich type system
–  Atomics: Primitives, Writables, Generic types, …
–  Composites: Tuples, Pojos, CaseClasses
–  Extensible by custom types

Serializing a Tuple3<Integer, Double, Person>
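
A rough sketch of what serializing such a tuple into a known layout could look like. The slide's figure is not preserved here, so the `Person` fields (`id`, `name`) and the exact byte layout below are assumptions for illustration and are not Flink's actual wire format. The point is only that when the types are known up front, fields can be written without type tags and read back at computed offsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: fixed field order and no per-field type tags, because the types are known in advance. */
final class Tuple3LayoutSketch {
    /** Hypothetical Person type (assumed fields). */
    static final class Person {
        final int id;
        final String name;
        Person(int id, String name) { this.id = id; this.name = name; }
    }

    /** Assumed layout: 4-byte int | 8-byte double | 4-byte id | 4-byte name length | UTF-8 name bytes. */
    static void serialize(int f0, double f1, Person f2, ByteBuffer out) {
        out.putInt(f0);
        out.putDouble(f1);
        out.putInt(f2.id);
        byte[] name = f2.name.getBytes(StandardCharsets.UTF_8);
        out.putInt(name.length);
        out.put(name);
    }

    /** Reads only the Double field; its offset follows directly from the layout. */
    static double readDoubleField(ByteBuffer in, int recordStart) {
        return in.getDouble(recordStart + 4);
    }

    /** Reads the nested Person without touching the first two fields. */
    static Person readPerson(ByteBuffer in, int recordStart) {
        int pos = recordStart + 4 + 8;
        int id = in.getInt(pos);
        int len = in.getInt(pos + 4);
        byte[] name = new byte[len];
        for (int i = 0; i < len; i++) {
            name[i] = in.get(pos + 8 + i);   // absolute reads leave the buffer position alone
        }
        return new Person(id, new String(name, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        serialize(7, 3.5, new Person(1, "Ann"), buf);
        System.out.println(buf.position());            // prints 23 (4 + 8 + 4 + 4 + 3)
        System.out.println(readDoubleField(buf, 0));   // prints 3.5
        System.out.println(readPerson(buf, 0).name);   // prints Ann
    }
}
```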
  
OPERATING ON BINARY DATA

Data processing algorithms
•  Flink’s algorithms are based on RDBMS technology
–  External Merge Sort, Hybrid Hash Join, Sort Merge Join, …
•  Algorithms receive a budget of memory segments
–  Automatic decision about budget size
–  No fine-tuning of operator memory!
•  Operate in-memory as long as data fits into budget
–  And gracefully spill to disk if data exceeds memory
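
The budget-and-spill behaviour described above could be sketched like this. It is a deliberately simplified illustration under assumptions: `SpillingBufferSketch` is a hypothetical class, it spills the whole budget at once to a temporary file, and it makes no claim about how Flink's operators actually choose what to spill.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch of an operator buffer that stays within a fixed segment budget and spills when it is full. */
final class SpillingBufferSketch {
    private final byte[][] segments;   // the operator's memory budget
    private Path spillFile;
    private int current = 0;           // index of the segment being filled
    private int posInSegment = 0;
    private long spilledBytes = 0;

    SpillingBufferSketch(int budgetSegments, int segmentSize) {
        segments = new byte[budgetSegments][segmentSize];
        try {
            spillFile = Files.createTempFile("spill-sketch", ".bin");
            spillFile.toFile().deleteOnExit();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends bytes; data may span segments. Spills once every segment in the budget is full. */
    void write(byte[] data) {
        int off = 0;
        while (off < data.length) {
            if (current == segments.length) {
                spill();
            }
            byte[] seg = segments[current];
            int n = Math.min(seg.length - posInSegment, data.length - off);
            System.arraycopy(data, off, seg, posInSegment, n);
            posInSegment += n;
            off += n;
            if (posInSegment == seg.length) {   // segment full, continue in the next one
                current++;
                posInSegment = 0;
            }
        }
    }

    /** Writes the full segments to disk as whole chunks, then reuses the same memory. */
    private void spill() {
        try (OutputStream out = Files.newOutputStream(spillFile, StandardOpenOption.APPEND)) {
            for (byte[] seg : segments) {
                out.write(seg);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spilledBytes += (long) segments.length * segments[0].length;
        current = 0;
        posInSegment = 0;
    }

    long spilledBytes() { return spilledBytes; }

    long bytesInMemory() { return (long) current * segments[0].length + posInSegment; }

    public static void main(String[] args) {
        SpillingBufferSketch buf = new SpillingBufferSketch(2, 256);  // 512-byte budget
        for (int i = 0; i < 10; i++) {
            buf.write(new byte[100]);                                 // 1000 bytes in total
        }
        System.out.println(buf.spilledBytes());   // prints 512
        System.out.println(buf.bytesInMemory());  // prints 488
    }
}
```

The heap footprint stays at the budget no matter how much data arrives; only the spill file grows.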
  
In-memory sort – Fill the sort buffer

In-memory sort – Sort the buffer

In-memory sort – Read sorted buffer
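
The three steps (fill, sort, read) can be sketched with a buffer that keeps serialized records in one region and small fixed-length entries of sort key plus pointer in another. This is a simplified illustration under assumptions: `SortBufferSketch` is hypothetical, it packs a full integer key and a record offset into one `long`, and it relies on `Arrays.sort` rather than the byte-wise key comparison a real binary sorter would use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a sort buffer: serialized records in one region, (key, pointer) entries in another. */
final class SortBufferSketch {
    private final ByteBuffer records;  // region 1: full serialized records
    private final long[] index;        // region 2: key in the upper 32 bits, record offset in the lower 32
    private int count = 0;

    SortBufferSketch(int capacityBytes, int maxRecords) {
        records = ByteBuffer.allocate(capacityBytes);
        index = new long[maxRecords];
    }

    /** Fill: serialize the record and remember its key and offset. Returns false when the buffer is full. */
    boolean add(int key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (count == index.length || records.remaining() < 8 + bytes.length) {
            return false;
        }
        int offset = records.position();
        records.putInt(key).putInt(bytes.length).put(bytes);
        // Compared as a signed long, this entry orders by key first and by offset second.
        index[count++] = ((long) key << 32) | offset;
        return true;
    }

    /** Sort: only the 8-byte index entries move; the serialized records stay in place. */
    void sort() {
        Arrays.sort(index, 0, count);
    }

    /** Read: follow the pointers in sorted order and deserialize each record. */
    List<String> readSorted() {
        List<String> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int offset = (int) index[i];           // lower 32 bits hold the record offset
            int key = records.getInt(offset);
            int len = records.getInt(offset + 4);
            byte[] bytes = new byte[len];
            for (int j = 0; j < len; j++) {
                bytes[j] = records.get(offset + 8 + j);
            }
            out.add(key + ":" + new String(bytes, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        SortBufferSketch buf = new SortBufferSketch(1024, 16);
        buf.add(42, "flink");
        buf.add(-7, "bits");
        buf.add(3, "bytes");
        buf.sort();
        System.out.println(buf.readSorted());  // prints [-7:bits, 3:bytes, 42:flink]
    }
}
```

No record is deserialized during the sort itself; deserialization happens only when the sorted result is read. For keys that do not fit into a fixed-length entry, such as long strings, a prefix of the key can be stored instead and the full record consulted only on ties.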
  
SHOW ME NUMBERS!

Sort benchmark
•  Task: Sort 10 million Tuple2<Integer, String> records
–  String length 12 chars
•  Tuple has 16 Bytes of raw data
•  ~152 MB raw data
–  Integers uniformly, Strings long-tail distributed
–  Sort on Integer field and on String field
•  Generated input provided as mutable object iterator
•  Use JVM with 900 MB heap size
–  Minimum size to reliably run the benchmark
  
Sorting methods
1.  Objects-on-Heap:
–  Put cloned data objects in ArrayList and use Java’s Collection sort.
–  ArrayList is initialized with right size.
2.  Flink-serialized (on-heap):
–  Using Flink’s custom serializers.
–  Integer with full binary sorting key, String with 8 byte prefix key.
3.  Kryo-serialized (on-heap):
–  Serialize fields with Kryo.
–  No binary sorting keys, objects are deserialized for comparison.
•  All implementations use a single thread
•  Average execution time of 10 runs reported
•  GC triggered between runs (does not go into reported time)
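
For reference, the objects-on-heap baseline (method 1) might look roughly like the sketch below. The record type `Rec` and the reusing iterator are hypothetical stand-ins for `Tuple2<Integer, String>` and the benchmark's mutable object iterator; the benchmark's actual code is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

/** Sketch of the objects-on-heap baseline: clone every record, collect, sort. */
final class ObjectsOnHeapSortSketch {
    /** Hypothetical mutable record standing in for Tuple2<Integer, String>. */
    static final class Rec {
        int key;
        String value;

        Rec copy() {
            Rec r = new Rec();
            r.key = key;
            r.value = value;
            return r;
        }
    }

    /** Iterator that reuses one Rec instance for every element, like a mutable object iterator. */
    static Iterator<Rec> reusingIterator(final int[] keys) {
        final Rec reused = new Rec();
        return new Iterator<Rec>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < keys.length;
            }

            @Override
            public Rec next() {
                reused.key = keys[i];
                reused.value = "v" + keys[i];
                i++;
                return reused;
            }
        };
    }

    static List<Rec> sortOnHeap(Iterator<Rec> input, int expectedSize) {
        List<Rec> list = new ArrayList<>(expectedSize);  // sized up front to avoid regrowth
        while (input.hasNext()) {
            list.add(input.next().copy());               // clone, because the iterator reuses its object
        }
        Collections.sort(list, Comparator.comparingInt((Rec r) -> r.key));
        return list;
    }

    public static void main(String[] args) {
        List<Rec> sorted = sortOnHeap(reusingIterator(new int[] {5, 1, 3}), 3);
        for (Rec r : sorted) {
            System.out.println(r.key + " " + r.value);   // prints "1 v1", "3 v3", "5 v5"
        }
    }
}
```

Every record becomes a separate heap object here, which is where the per-object overhead and GC pressure of this method come from.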
  
Execution time

Garbage collection and heap usage
[Figure: panels "Objects-on-heap" and "Flink-serialized"]

Memory usage
•  Breakdown: Flink serialized - Sort Integer
–  4 bytes Integer
–  12 bytes String
–  4 bytes String length
–  4 bytes pointer
–  4 bytes Integer sorting key
–  28 bytes * 10M records = 267 MB

              Object-on-heap    Flink-serialized    Kryo-serialized
Sort Integer  Approx. 700 MB    277 MB              266 MB
Sort String   Approx. 700 MB    315 MB              266 MB
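
The two size figures quoted in the slides (about 152 MB of raw data, and 267 MB for the breakdown above) can be reproduced with a few lines of arithmetic. The sketch assumes that "MB" in the slides means 2^20 bytes, which is the reading under which both numbers come out as stated; the class and method names are illustrative only.

```java
/** Reproduces the slide's size arithmetic for 10 million records. */
final class SortMemoryArithmetic {
    static final long RECORDS = 10_000_000L;
    static final double MB = 1024.0 * 1024.0;   // assumption: "MB" on the slides means 2^20 bytes

    /** Raw data per record: 4-byte Integer + 12-char String at one byte per char. */
    static long rawBytesPerRecord() {
        return 4 + 12;
    }

    /** Per-record breakdown from the slide: Integer, String, String length, pointer, sorting key. */
    static long flinkBytesPerRecordIntSort() {
        return 4 + 12 + 4 + 4 + 4;
    }

    static double toMb(long bytesPerRecord) {
        return bytesPerRecord * RECORDS / MB;
    }

    public static void main(String[] args) {
        System.out.println((long) toMb(rawBytesPerRecord()));           // prints 152
        System.out.println((long) toMb(flinkBytesPerRecordIntSort()));  // prints 267
    }
}
```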
  
Going out-of-core
•  Single thread HashJoin with 4GB memory budget
•  Build side varies, Probe side 64GB
  
WHAT’S NEXT?

We’re not done yet!
•  Serialization layouts tailored towards operations
–  More efficient operations on binary data
•  Table API provides full semantics for execution
–  Use code generation to operate fully on binary data
•  …
  
Summary
•  Active memory management avoids OOMErrors
•  Highly efficient data serialization stack
–  Facilitates operations on binary data
–  Makes more data fit into memory
•  DBMS-style operators operate on binary data
–  High performance in-memory processing
–  Graceful destaging to disk if necessary
•  Read Flink’s blog:
–  http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
–  http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
–  http://flink.apache.org/news/2015/09/16/off-heap-memory.html

Apache Flink
http://flink.apache.org    @ApacheFlink
  

More Related Content

What's hot

Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Martin Junghanns
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingFabian Hueske
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataFabian Hueske
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkFlink Forward
 
Stream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas WeiseStream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas WeiseBig Data Spain
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingFlink Forward
 
data.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowledata.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt DowleSri Ambati
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkFlink Forward
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Ziemowit Jankowski
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 

What's hot (20)

Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary data
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. Spark
 
Stream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas WeiseStream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas Weise
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
 
data.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowledata.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowle
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 

Viewers also liked

Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemFlink Forward
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteFlink Forward
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flinkFlink Forward
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsFlink Forward
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Till Rohrmann
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...Flink Forward
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Flink Forward
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinFlink Forward
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkAnwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkFlink Forward
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleFlink Forward
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkFlink Forward
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityFabian Hueske
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Flink Forward
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Stephan Ewen
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsFlink Forward
 

Viewers also liked (20)

Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API Basics
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkAnwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and Storms
 

Similar to Fabian Hueske – Juggling with Bits and Bytes

Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Burak TUNGUT
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesIn-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesHazelcast
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Databricks
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in JavaRuben Badaró
 
An Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageAn Efficient Backup and Replication of Storage
An Efficient Backup and Replication of StorageTakashi Hoshino
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
 
Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Ryft
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectureshypertable
 
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsJava one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsSpeedment, Inc.
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Overview of the ehcache
Overview of the ehcacheOverview of the ehcache
Overview of the ehcacheHyeonSeok Choi
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
MongoDB Evenings Boston - An Update on MongoDB's WiredTiger Storage Engine
MongoDB Evenings Boston - An Update on MongoDB's WiredTiger Storage EngineMongoDB Evenings Boston - An Update on MongoDB's WiredTiger Storage Engine
MongoDB Evenings Boston - An Update on MongoDB's WiredTiger Storage EngineMongoDB
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchHakka Labs
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Javamalduarte
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit
 

Similar to Fabian Hueske – Juggling with Bits and Bytes (20)

Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5Elasticsearch Arcihtecture & What's New in Version 5
Elasticsearch Arcihtecture & What's New in Version 5
 
Fabian Hueske – Juggling with Bits and Bytes

  • 1. Juggling with Bits and Bytes
      How Apache Flink operates on binary data
      Fabian Hueske
      fhueske@apache.org    @fhueske
  • 2. Big Data frameworks on JVMs
      • Many (open source) Big Data frameworks run on JVMs
          – Hadoop, Drill, Spark, Hive, Pig, and ...
          – Flink as well
      • Common challenge: How to organize data in-memory?
          – In-memory processing (sorting, joining, aggregating)
          – In-memory caching of intermediate results
      • Memory management of a system influences
          – Reliability
          – Resource efficiency, performance & performance predictability
          – Ease of configuration
  • 3. The straightforward approach
      Store and process data as objects on the heap
      • Put objects in an array and sort it
      A few notable drawbacks
      • Predicting memory consumption is hard
          – If you fail, an OutOfMemoryError will kill you!
      • High garbage collection overhead
          – Easily 50% of time spent on GC
      • Objects have considerable space overhead
          – At least 8 bytes for each (nested) object! (Depends on arch)
  • 5. Flink adopts DBMS technology
      • Allocates fixed number of memory segments upfront
      • Data objects are serialized into memory segments
      • DBMS-style algorithms work on binary representation
  • 6. Why is that good?
      • Memory-safe execution
          – Used and available memory segments are easy to count
          – No parameter tuning for reliable operations!
      • Efficient out-of-core algorithms
          – Memory segments can be efficiently written to disk
      • Reduced GC pressure
          – Memory segments are off-heap or never deallocated
          – Data objects are short-lived or reused
      • Space-efficient data representation
      • Efficient operations on binary data
  • 7. What does it cost?
      • Significant implementation investment
          – Using java.util.HashMap
            vs.
          – Implementing a spillable hash table backed by byte arrays and a custom serialization stack
      • Other systems use similar techniques
          – Apache Drill, Apache AsterixDB (incubating)
      • Apache Spark is evolving in a similar direction
  • 9. Memory segments
      • Unit of memory distribution in Flink
          – Fixed number allocated when worker starts
      • Backed by a regular byte array (default 32 KB)
      • On-heap or off-heap allocation
      • R/W access through Java's efficient unsafe methods
      • Multiple memory segments can be logically concatenated to a larger chunk of memory
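The idea of a fixed pool of equally sized segments can be illustrated with a minimal sketch. The class names `SegmentSketch` and `SegmentPoolSketch` are hypothetical and are not Flink's classes. The sketch uses `java.nio.ByteBuffer` for reads and writes instead of the unsafe methods the slide mentions, purely to stay self-contained.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical, simplified stand-in for a memory segment (not Flink's actual class).
final class SegmentSketch {
    static final int SEGMENT_SIZE = 32 * 1024; // 32 KB, the default size named on the slide

    private final ByteBuffer buffer;

    SegmentSketch(boolean offHeap) {
        // On-heap segments are backed by a byte[]; off-heap segments use direct memory.
        this.buffer = offHeap
                ? ByteBuffer.allocateDirect(SEGMENT_SIZE)
                : ByteBuffer.allocate(SEGMENT_SIZE);
    }

    // Absolute (offset-based) access, so no per-record objects are created.
    void putInt(int offset, int value) { buffer.putInt(offset, value); }

    int getInt(int offset) { return buffer.getInt(offset); }
}

// Hypothetical pool: all segments are allocated once, up front, and then recycled.
final class SegmentPoolSketch {
    private final ArrayDeque<SegmentSketch> free = new ArrayDeque<>();

    SegmentPoolSketch(int numSegments, boolean offHeap) {
        for (int i = 0; i < numSegments; i++) {
            free.add(new SegmentSketch(offHeap));
        }
    }

    // Free segments are easy to count, which is what makes memory use predictable.
    int available() { return free.size(); }

    SegmentSketch request() {
        SegmentSketch segment = free.poll();
        if (segment == null) {
            // A real system would spill to disk here instead of failing.
            throw new IllegalStateException("no free segments left");
        }
        return segment;
    }

    void release(SegmentSketch segment) { free.add(segment); }
}
```

Because the pool never grows, an operator that runs out of segments knows exactly when it has to spill, rather than discovering the limit through an OutOfMemoryError.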
  • 12. On-heap vs. Off-heap
      • No significant performance difference in micro-benchmarks
      • Garbage Collection
          – Smaller heap -> faster GC
      • Faster start-up time
          – A multi-GB JVM heap takes time to allocate
  • 14. Custom de/serialization stack
      • Many alternatives for Java object serialization
          – Dynamic: Kryo
          – Schema-dependent: Apache Avro, Apache Thrift, Protobufs
      • But Flink has its own serialization stack
          – Operating on serialized data requires knowledge of layout
          – Control over layout can improve efficiency of operations
          – Data types are known before execution
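Why knowledge of the layout matters can be shown with a small sketch. It assumes a simple layout for a Tuple2<Integer, String>: a 4-byte integer, a 4-byte string length, then the string bytes. This layout mirrors the byte counts in the memory breakdown later in the talk, but it is an assumption for illustration and not necessarily the exact format Flink writes. `TupleLayoutSketch` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical fixed layout: [4-byte int][4-byte string length][string bytes].
final class TupleLayoutSketch {

    // Writes one record at the given offset and returns the number of bytes written.
    static int write(ByteBuffer target, int offset, int id, String name) {
        byte[] chars = name.getBytes(StandardCharsets.UTF_8);
        target.putInt(offset, id);
        target.putInt(offset + 4, chars.length);
        for (int i = 0; i < chars.length; i++) {
            target.put(offset + 8 + i, chars[i]);
        }
        return 8 + chars.length;
    }

    // Reads only the integer field; the string is never touched or materialized.
    static int readId(ByteBuffer source, int offset) {
        return source.getInt(offset);
    }

    // Reads the string field by skipping over the integer and the length prefix.
    static String readName(ByteBuffer source, int offset) {
        int length = source.getInt(offset + 4);
        byte[] chars = new byte[length];
        for (int i = 0; i < length; i++) {
            chars[i] = source.get(offset + 8 + i);
        }
        return new String(chars, StandardCharsets.UTF_8);
    }
}
```

Since the position of each field is known, an operator that only needs the integer key can read it straight from the bytes without deserializing the whole record.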
  • 15. Rich & extensible type system
      • Serialization framework requires knowledge of types
      • Flink analyzes return types of functions
          – Java: Reflection-based type analyzer
          – Scala: Compiler information + CodeGen via Macros
      • Rich type system
          – Atomics: Primitives, Writables, Generic types, …
          – Composites: Tuples, Pojos, CaseClasses
          – Extensible by custom types
  • 16. Serializing a Tuple3<Integer, Double, Person>
  • 17. OPERATING ON BINARY DATA
  • 18. Data processing algorithms
      • Flink's algorithms are based on RDBMS technology
          – External Merge Sort, Hybrid Hash Join, Sort Merge Join, …
      • Algorithms receive a budget of memory segments
          – Automatic decision about budget size
          – No fine-tuning of operator memory!
      • Operate in-memory as long as data fits into budget
          – And gracefully spill to disk if data exceeds memory
  • 19. In-memory sort – Fill the sort buffer
  • 20. In-memory sort – Sort the buffer
  • 21. In-memory sort – Read sorted buffer
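The three sort steps named above (fill, sort, read) can be sketched as follows. The slides for these steps are figures, so the details here are assumptions: records are taken to start with a 4-byte integer key, and each index entry packs the key and the record's offset into one `long`. `BinarySortSketch` is a hypothetical name, and the packing trick is an illustration rather than Flink's actual sorter.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of sorting serialized records via a compact (key, pointer) index.
final class BinarySortSketch {

    // records: serialized records, each starting with a 4-byte int key.
    // offsets: start offset of every record inside the buffer.
    // Returns the record offsets ordered by ascending key.
    static int[] sortByIntKey(ByteBuffer records, int[] offsets) {
        // Fill: build one fixed-length index entry per record.
        long[] index = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            int key = records.getInt(offsets[i]);
            // Upper 32 bits hold the key, lower 32 bits hold the pointer (offset).
            index[i] = ((long) key << 32) | (offsets[i] & 0xFFFFFFFFL);
        }

        // Sort: only the small index entries move; record bytes stay in place
        // and no record is deserialized for a comparison.
        Arrays.sort(index);

        // Read: follow the pointers in sorted order.
        int[] sorted = new int[offsets.length];
        for (int i = 0; i < index.length; i++) {
            sorted[i] = (int) index[i]; // the low 32 bits are the record offset
        }
        return sorted;
    }
}
```

For keys that do not fit fully into the index entry, such as the strings with an 8-byte prefix key mentioned in the benchmark setup, a tie on the prefix would require a look at the record itself; this sketch does not cover that case.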
  • 23. Sort benchmark
      • Task: Sort 10 million Tuple2<Integer, String> records
          – String length 12 chars
              • Tuple has 16 bytes of raw data
              • ~152 MB raw data
          – Integers uniformly distributed, Strings long-tail distributed
          – Sort on Integer field and on String field
      • Generated input provided as mutable object iterator
      • Use JVM with 900 MB heap size
          – Minimum size to reliably run the benchmark
  • 24. Sorting methods
      1. Objects-on-Heap:
          – Put cloned data objects in ArrayList and use Java's Collection sort.
          – ArrayList is initialized with right size.
      2. Flink-serialized (on-heap):
          – Using Flink's custom serializers.
          – Integer with full binary sorting key, String with 8 byte prefix key.
      3. Kryo-serialized (on-heap):
          – Serialize fields with Kryo.
          – No binary sorting keys, objects are deserialized for comparison.
      • All implementations use a single thread
      • Average execution time of 10 runs reported
      • GC triggered between runs (does not go into reported time)
  • 26. Garbage collection and heap usage
      (Figure panels: Objects-on-heap, Flink-serialized)
  • 27. Memory usage
      • Breakdown: Flink-serialized, Sort Integer
          – 4 bytes Integer
          – 12 bytes String
          – 4 bytes String length
          – 4 bytes pointer
          – 4 bytes Integer sorting key
          – 28 bytes * 10M records = 267 MB

                        Object-on-heap    Flink-serialized    Kryo-serialized
      Sort Integer      Approx. 700 MB    277 MB              266 MB
      Sort String       Approx. 700 MB    315 MB              266 MB
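The 267 MB figure in the breakdown can be re-derived with a few lines of arithmetic. The sketch below assumes that "MB" on the slide means 2^20 bytes, which is the reading under which 28 bytes times 10 million records gives 267. `SortMemoryEstimate` is a hypothetical helper name.

```java
// Hypothetical helper that re-computes the per-record breakdown from the slide.
final class SortMemoryEstimate {

    // 4 (Integer) + 12 (String) + 4 (String length) + 4 (pointer) + 4 (sorting key)
    static long bytesPerRecord() {
        return 4 + 12 + 4 + 4 + 4;
    }

    // Total size in MB, assuming 1 MB = 2^20 bytes, rounded to the nearest integer.
    static long totalMegabytes(long records) {
        return Math.round(bytesPerRecord() * records / (1024.0 * 1024.0));
    }
}
```

Calling `SortMemoryEstimate.totalMegabytes(10_000_000L)` gives 267, matching the slide; the measured 277 MB for the Flink-serialized integer sort is slightly above this estimate.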
  • 28. Going out-of-core
      • Single thread HashJoin with 4 GB memory budget
      • Build side varies, Probe side 64 GB
  • 30. We're not done yet!
      • Serialization layouts tailored towards operations
          – More efficient operations on binary data
      • Table API provides full semantics for execution
          – Use code generation to operate fully on binary data
      • …
  • 31. Summary
      • Active memory management avoids OOMErrors
      • Highly efficient data serialization stack
          – Facilitates operations on binary data
          – Makes more data fit into memory
      • DBMS-style operators operate on binary data
          – High performance in-memory processing
          – Graceful destaging to disk if necessary
      • Read Flink's blog:
          – http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
          – http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
          – http://flink.apache.org/news/2015/09/16/off-heap-memory.html
  • 32. Apache Flink
      http://flink.apache.org    @ApacheFlink