© 2017 Dremio Corporation @DremioHQ
Using Apache Arrow, Calcite and Parquet to build a
Relational Cache
Halloween 2017
@DataEngConf
Jacques Nadeau
Who?
Jacques Nadeau
@intjesus
• CTO & Co-founder of Dremio
• Apache member
• VP Apache Arrow
• PMCs: Arrow, Calcite, Incubator, Heron (incubating)
Agenda
• Tech Backgrounder
• Caching Techniques
• Relational Caching In Depth
• Definition and Matching
• Dealing with Updates
• Closing Words
Tech Backgrounder
What is Apache Arrow
• Columnar In-memory Data processing
library
• Designed to work with any programming
language
• Support for both relational and complex
data as-is
• Used by Pandas, Spark, Dremio
What is Apache Calcite
• SQL parser, Relational Algebra &
Optimizer
• Understands Materialized Views and
Lattices
• Used by many to add SQL functionality
including Apex, Drill, Hive, Flink, Kylin,
Phoenix, Samza, Storm, Cascading &
Dremio
What is Apache Parquet
• OSS implementation of Google Dremel
disk format for complex columnar data
• Supports a high level of data-aware
columnar compression and vectorized
columnar readback
• De facto standard for analytical data on
disk in the Big Data ecosystem
Caching Techniques
What does Caching Mean?
• Caching: Reduce the distance to data (DTD).
• Distance: How much time and resources it takes to
access data?
– How fast is the medium? How near is it?
– Is the data designed for efficient consumption?
– How similar is the data to what you need to answer a
question?
Ways to reduce DTD: Perf & Proximity, Relevance, Consumability
Types of Caching
• In-Memory File Pinning
• Columnar Disk Caching
• In-Memory Block Caching
• Near-CPU Data Caching
• Cube Relational Caching
• Arbitrary Relational Caching
In-Memory File Pinning
• Hold a File in Memory for frequent retrieval
• Pros
– Simple, standard and well-defined interface
– Improves the performance of the medium.
– If your performance is primarily bound by disk IO,
this might be a good option.
• Cons
– File structure not necessarily best in-memory
structure.
– Data manipulation almost always requires a copy of
data to also be held in memory (because the file
format is not directly consumable).
Columnar Disk Caching
• Store the data in an optimized columnar
format.
• Pros
– Better compression reduces IO
– Good structure improves processing
– Benefits selective workloads (those that need
only a subset of columns)
• Cons
– Requires duplicating data
– Typically manual/semi-automated (e.g.
MapReduce/Spark to ETL persist/update)
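As a rough illustration of the pros listed above, the sketch below contrasts a row layout with a column layout in plain Python. It is not Parquet: the sample records and the `to_columns` and `run_length_encode` helpers are hypothetical, chosen only to show why a columnar layout lets a selective query touch one column and why sorted, low-cardinality columns compress well.

```python
from itertools import groupby

# Hypothetical sample records in row-oriented form.
rows = [
    {"region": "east", "status": "ok", "amount": 10},
    {"region": "east", "status": "ok", "amount": 12},
    {"region": "east", "status": "fail", "amount": 7},
    {"region": "west", "status": "ok", "amount": 3},
    {"region": "west", "status": "ok", "amount": 9},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# A selective query (total amount) reads one column, not every field of every row.
total_amount = sum(columns["amount"])

# A sorted, low-cardinality column collapses into a few runs.
encoded_region = run_length_encode(columns["region"])
```

Real columnar formats layer dictionary encoding, bit packing and general-purpose compression on top of ideas like this; the sketch only shows the layout effect.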
In-Memory Block Caching
• Maintain portions of on-disk data in
Memory (e.g. Linux page cache, HBase
block cache)
• Pros
– Very mature and usually available for free
• Cons
– Not easy to control/influence.
– Very disconnected from workloads.
Near-CPU Data Caching (memory or disk)
• Hold the data directly in a representation that can
be processed without restructuring (e.g. Arrow
format)
• Pros
– Processing can be done without interpretation of
format
– Very efficient to consume
– Possible to consume data by multiple consumers
without duplicating memory
• Cons
– Larger than compressed formats
– Requires applications to agree on format
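The "multiple consumers without duplicating memory" point can be sketched with the Python standard library alone. This is an analogy, not Arrow: a typed `array` stands in for one column, and `memoryview` objects stand in for consumers that agree on the buffer's format.

```python
from array import array

# A contiguous buffer of 64-bit integers, standing in for one column.
column = array("q", [5, 8, 13, 21, 34])

# Two consumers get views over the same memory; no bytes are copied.
view_a = memoryview(column)
view_b = memoryview(column)[1:4]  # a zero-copy slice of the same buffer

# Both views observe a write made through the owning buffer.
column[2] = 100
```

Because both views point at the same bytes, the cost of adding a consumer is independent of the data size. The price, as the cons above note, is that every consumer must understand the same in-memory format.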
Cube-Based Relational Caching
• Create several partially aggregated cuboids that can
satisfy a range of aggregation queries
• Pros
– Low-latency performance for common aggregate
query patterns
– Cube storage requirements can be small fraction of
original dataset size
• Cons
– Analysis latency is bi-modal: cube hit is great but a
miss is either unserved or served slowly
– Difficult or impossible to satisfy arbitrary queries
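A minimal sketch of the cuboid idea, assuming a tiny hypothetical fact table with two dimensions (`region`, `product`) and one measure (`amount`). The `build_cuboids` and `query_sum` names are illustrative only. It pre-aggregates every subset of the dimensions and shows the bi-modal behavior described above: a hit is a dictionary lookup, while a miss cannot be served from the cube at all.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, amount).
facts = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
    ("west", "widget", 3),
]
DIMENSIONS = ("region", "product")


def build_cuboids(rows):
    """Pre-aggregate sum(amount) for every subset of the dimensions."""
    cuboids = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = defaultdict(int)
            for region, product, amount in rows:
                record = {"region": region, "product": product}
                key = tuple(record[d] for d in dims)
                totals[key] += amount
            cuboids[dims] = dict(totals)
    return cuboids


def query_sum(cuboids, group_by):
    """Serve a grouped sum from the cube, or return None on a cube miss."""
    return cuboids.get(tuple(group_by))


cube = build_cuboids(facts)
by_region = query_sum(cube, ["region"])      # cube hit
by_customer = query_sum(cube, ["customer"])  # cube miss: not a cube dimension
```

With n dimensions there are 2^n cuboids, which is why practical cube systems usually materialize only a chosen subset.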
Arbitrary Relational Caching
• Create arbitrary data fragments combined
with partitioning and sorting schemes to
speed any query
• Pros
– Base case is easy to understand
– Can improve the performance of any query
• Cons
– Complex to match to arbitrary queries
– Can be large depending on needs
Types of Caching: The combination we found useful
• In-Memory File Pinning
– Too non-specific given memory scarcity
• ✔ Columnar Disk Caching
– Make sure everything is in Parquet (for any non-ephemeral data)
• ✔ In-Memory Block Caching
– Leverage existing page-cache, avoid additional memory cache layers
• ✔ Near-CPU Data Caching
– Used primarily for ephemeral/short-term persistence to avoid overhead
• ✔ Cube Relational Caching
– Useful for aggregation patterns
• ✔ Arbitrary Relational Caching
– Useful for unusual aggregation and non-aggregation needs
Relational Caching In Depth
Relational Algebra Refresher
• Relations: Source of data (a table)
• Operators: Define a set of transformations
– Join, Project, Scan, Filter, Aggregate, Window, etc
• Properties: Defining traits of data at a particular
relation
– Sorted by X, Hash distributed by Y, etc.
• Rules: Defining equality conditions between
collections of operations
– Project > Filter can be changed to Filter > Project; a scan
doesn’t need to project columns that aren’t used later;
etc.
• Graph/Tree: A collection of operators that define a
particular dataset in a DAG
[Figure: two equivalent operator trees over the same Scan, one with Project above Filter and one with Filter above Project]
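The refresher can be made concrete with a toy operator tree. This is a sketch in plain Python, not Calcite's API: the `Scan`, `Filter` and `Project` classes, the `evaluate` function, the `swap_project_filter` rule and the sample table are all hypothetical. It shows one rule from the list above (swapping a Project and a Filter) and checks that both trees produce the same rows.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical in-memory source table.
TABLES = {"t1": [{"a": 1, "b": 5, "c": 9}, {"a": 2, "b": 50, "c": 8}]}


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    column: str      # column the predicate reads
    less_than: int   # keep rows where column < less_than
    input: object


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    input: object


def evaluate(node):
    """Execute an operator tree against the in-memory TABLES."""
    if isinstance(node, Scan):
        return [dict(row) for row in TABLES[node.table]]
    if isinstance(node, Filter):
        return [r for r in evaluate(node.input) if r[node.column] < node.less_than]
    if isinstance(node, Project):
        return [{c: r[c] for c in node.columns} for r in evaluate(node.input)]
    raise TypeError(f"unknown operator: {node!r}")


def swap_project_filter(node):
    """Rule: Project over Filter becomes Filter over Project.

    Only valid when the projection keeps the column the filter reads;
    otherwise the node is returned unchanged.
    """
    if (
        isinstance(node, Project)
        and isinstance(node.input, Filter)
        and node.input.column in node.columns
    ):
        flt = node.input
        return Filter(flt.column, flt.less_than, Project(node.columns, flt.input))
    return node


original = Project(("a", "b"), Filter("b", 10, Scan("t1")))
rewritten = swap_project_filter(original)
```

A planner applies many such rules and uses cost estimates to choose among the equivalent trees they generate.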
Relational Caching: Basic Concept
• Store derived data that is
between what you want
and original dataset
• Shortens Distance to
Data (DTD)
• Reduces resource
requirements & latency
[Figure: Original Data feeds a Persisted Shared Intermediate State, which serves three “What you Want” results. Labels: original DTD, new DTD, cost reduction.]
You Probably Already Do This!
Data Alternatives (Manually Created)
• Sessionized
• Cleansed
• Partitioned by time or region
• Summarized for a particular
purpose
Users Choose Depending on Need
• Analysts trained on using different
tables depending on use case
• Custom datasets built for
reporting
• Summarization and/or extraction
for dashboards
Benefit of Relational Caching over “Copy and Pick”
• “Copy and Pick”
– User picks best optimization
– Admin picks and manages maintenance
• Relational Caching
– Cache picks best optimization
– Cache maintains representations
[Figure: layers shown are Source Table, Logical Model (marked “????” in the diagram), and Physical Optimizations (transform, sort, partition, aggregate)]
Key Components of Relational Caching
• How to Express Transformations/States: SQL
• Hold and Match Relational algebra: Calcite
• Persist alternative datasets: Parquet
• A way to process: Arrow + Sabot
• And a lot of code to put it all together…
Our Approach
[Architecture diagram; components shown:]
• End User Queries
• UI to Define Cached Patterns
• Query Planner
• Relational Pattern Matching System
• Relational Pattern Database
• Data Processing System (Sabot)
• Source Storage Interface (Arrow): HDFS, S3, Elastic
• Change Detection Database
• Cache Persistence: Parquet, Arrow
• Refresh System
Definition and Matching
Coming Back to Calcite
• Calcite is a Planner & Optimizer
• Comes with a prebuilt selection of
operators, rules, properties (called
traits) and ways to express relations
• Also has a basic Materialized View
facility (relevant!)
Perfect Foundation for Relational Caching
How We Built Caching: Reflections
• Reflection: A persisted alternative view of data in Parquet
format
– Raw Reflection: Persist all records of underlying dataset, controlling
partitioning and sortedness
– Aggregate Reflection: Persist a partially aggregated dataset based on a
selection of dimensions and measures, still controlling partitioning and
sortedness
• Reflections can be built on either source tables or arbitrarily
defined Virtual Datasets
Cache Matching: Aggregation Rollup
Given a user query, try to create an alternative version of the
query that matches the cached target.
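Notation: S = Scan, F = Filter, P = Project, A = Aggregate; each plan is listed from the top operator down.
• User Query: F(c’ < 10) > A(a, sum(c) as c’) > P(a,c) > S(t1)
• Reflection Definition: A(a,b, sum(c)) > S(t1), persisted as S(r1) (Target Materialization)
• Alternative Plan: F(c’ < 10) > A(a, sum(c) as c’) > S(r1)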
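The rollup on this slide can be checked on toy data. The sketch below is plain Python, not the planner's matching logic; the sample rows and the `aggregate` helper are hypothetical. It builds a reflection grouped by (a, b), then answers a query grouped by a alone by summing the reflection's partial sums.

```python
from collections import defaultdict

# Hypothetical rows of t1 as (a, b, c).
t1 = [
    ("x", "p", 2),
    ("x", "q", 3),
    ("y", "p", 20),
    ("y", "p", 1),
    ("z", "q", 4),
]


def aggregate(rows, key_of, value_of):
    """Group rows by key_of(row) and sum value_of(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_of(row)] += value_of(row)
    return dict(totals)


# Reflection definition: sum(c) grouped by (a, b) over t1, persisted as r1.
r1 = [
    (a, b, total)
    for (a, b), total in aggregate(t1, lambda r: (r[0], r[1]), lambda r: r[2]).items()
]

# User query: sum(c) grouped by a over t1, keeping groups whose sum is < 10.
direct = {
    a: s
    for a, s in aggregate(t1, lambda r: r[0], lambda r: r[2]).items()
    if s < 10
}

# Alternative plan: the same operators, but scanning r1 and summing partial sums.
via_reflection = {
    a: s
    for a, s in aggregate(r1, lambda r: r[0], lambda r: r[2]).items()
    if s < 10
}
```

The substitution is valid here because a sum of partial sums equals the overall sum, and the query's grouping key (a) is a subset of the reflection's (a, b). An average, by contrast, could not be rolled up from stored averages; it would need the sum and the count.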
Cache Matching: Join/Aggregation Transposition
• User Query: A(a, sum(c) as c’) > Join(t1.id=t2.id) > S(t1), S(t2)
• Reflection Definition: A(id, sum(c)) > S(t1), persisted as S(r1) (Target Materialization)
• Alternative Plan: A(a, sum(c) as c’) > Join(r1.id=t2.id) > S(r1), S(t2)
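The transposition can also be checked on toy data. The sketch assumes, since the slide does not say, that column a comes from t2 and column c from t1; the tables and the `join_and_sum` helper are hypothetical. Pre-aggregating t1 by the join key and then joining gives the same grouped sums as joining first.

```python
from collections import defaultdict

# Hypothetical tables: t1 rows are (id, c); t2 rows are (id, a).
# Assumption: the grouping column a lives in t2, the measure c in t1.
t1 = [(1, 10), (1, 5), (2, 7), (3, 1), (3, 2)]
t2 = [(1, "x"), (2, "x"), (3, "y")]


def join_and_sum(left, right):
    """Join left (id, value) to right (id, a) on id, then sum value per a."""
    totals = defaultdict(int)
    for left_id, value in left:
        for right_id, a in right:
            if left_id == right_id:
                totals[a] += value
    return dict(totals)


# User query: aggregate by a on top of the join of t1 and t2.
direct = join_and_sum(t1, t2)

# Reflection definition: sum(c) grouped by id over t1, persisted as r1.
partial = defaultdict(int)
for row_id, c in t1:
    partial[row_id] += c
r1 = list(partial.items())

# Alternative plan: the same aggregate on top of the join of r1 and t2.
via_reflection = join_and_sum(r1, t2)
```

The join in the alternative plan processes one row per id instead of one row per t1 record, which is where the saving comes from when t1 has many rows per key.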
Cache Matching: Costing and Partitioning Benefits
• User Query: F(a) > S(t1)
• Target Materialization: S(t1) persisted as S(r1), partitioned by a
• Target Materialization: S(t1) persisted as S(r1), partitioned by b
• Resulting plan: S(r1), pruned on a
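A minimal sketch of how costing might choose between two materializations of the same data. The cost model here (rows scanned after partition pruning) and the names `part_by_a`, `part_by_b`, `rows_scanned` and `cheapest` are assumptions made for illustration, not the planner's actual cost function.

```python
# Hypothetical rows of t1 as (a, b, payload).
t1 = [("x", 1, "r0"), ("x", 2, "r1"), ("y", 1, "r2"), ("z", 2, "r3"), ("z", 2, "r4")]


def partition_by(rows, index):
    """Group rows into partitions keyed by the value at position index."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[index], []).append(row)
    return partitions


# Two candidate reflections over the same data, partitioned differently.
reflections = {
    "part_by_a": {"column": 0, "partitions": partition_by(t1, 0)},
    "part_by_b": {"column": 1, "partitions": partition_by(t1, 1)},
}


def rows_scanned(reflection, filter_column, filter_value):
    """Assumed cost model: rows read after pruning partitions where possible."""
    partitions = reflection["partitions"]
    if reflection["column"] == filter_column:
        return len(partitions.get(filter_value, []))  # prune to one partition
    return sum(len(rows) for rows in partitions.values())  # full scan


def cheapest(filter_column, filter_value):
    """Pick the reflection with the lowest estimated scan cost."""
    return min(
        reflections,
        key=lambda name: rows_scanned(reflections[name], filter_column, filter_value),
    )


# A filter on column a (position 0) favors the reflection partitioned by a.
best = cheapest(filter_column=0, filter_value="x")
```

Both reflections can answer the query; only the one partitioned on the filtered column lets the scan skip data, so it wins on cost.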
Relational Matching, Other Examples
• Physical Property Matching
• Predicate Promotion
• Predicate Inference
• Join Decomposition
• Join Promotion
Dealing with Updates
Refresh Management
Importance of Cache Creation Ordering
• Not all updating
orderings are equal
• Want to order
updates based on
“Refresh Graph” and
dependencies
• Multiple orderings are
possible; cost them against
each other to
minimize update cost
Freshness Management
• Underlying data
may change
• User should define
refresh frequency
• Separately define an
absolute TTL
[Figure: a Physical dataset with a Raw Reflection and an Aggregate Reflection, annotated with 1H refresh and 3H expiration]
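Ordering refreshes by dependency can be sketched with the standard library's `graphlib` module (Python 3.9+). The graph below is an assumption: it supposes the aggregate reflection is built from the raw reflection, which the slide does not state. The `status` function and its three states are likewise hypothetical; only the 1H and 3H values come from the slide.

```python
from graphlib import TopologicalSorter

# Hypothetical refresh graph: each reflection maps to what it is built from.
depends_on = {
    "raw_reflection": {"physical_dataset"},
    "aggregate_reflection": {"raw_reflection"},
}

# static_order() yields every node after all of its dependencies.
refresh_order = [
    node
    for node in TopologicalSorter(depends_on).static_order()
    if node in depends_on  # the physical dataset itself is never rebuilt
]

REFRESH_SECONDS = 1 * 3600     # "1H refresh" from the slide
EXPIRATION_SECONDS = 3 * 3600  # "3H expiration" from the slide


def status(age_seconds):
    """Classify a materialization by its age (assumed semantics)."""
    if age_seconds >= EXPIRATION_SECONDS:
        return "expired"        # past the absolute TTL
    if age_seconds >= REFRESH_SECONDS:
        return "needs_refresh"  # still within TTL, but due for a rebuild
    return "fresh"
```

Keeping the refresh interval and the TTL separate means a reflection that misses a refresh can keep serving slightly stale data until the TTL is reached, rather than becoming unusable immediately.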
Multiple Update Modes (Depending on Mutation Pattern)
• Full: Always rebuild reflections from scratch (highly mutating)
• Incremental (files): Incrementally builds reflections based on new
files and folders (append-only)
• Incremental (rowstores): Incrementally builds reflections based on a
monotonically increasing field (append-only)
• Partitioned Refresh: Maintains reflections based on source
partitions (e.g. Filesystem directories, Hive partitions). (partially
mutating)
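The incremental rowstore mode can be sketched as a high-water mark on the monotonically increasing field. The `IncrementalSumReflection` class and the sample rows are hypothetical and hold everything in memory; the point is only that each refresh folds in rows above the mark rather than rebuilding from scratch.

```python
from collections import defaultdict


class IncrementalSumReflection:
    """Aggregate reflection kept current via a high-water mark on an id field."""

    def __init__(self):
        self.totals = defaultdict(int)  # group key -> running sum
        self.high_water_mark = None     # largest id already folded in

    def refresh(self, source_rows):
        """Fold in only rows whose id exceeds the current high-water mark.

        source_rows is an iterable of (id, key, amount). The source is assumed
        to be append-only with ids that only grow.
        Returns the number of new rows processed.
        """
        mark = self.high_water_mark
        new_mark = mark
        processed = 0
        for row_id, key, amount in source_rows:
            if mark is not None and row_id <= mark:
                continue  # already reflected by an earlier refresh
            self.totals[key] += amount
            processed += 1
            if new_mark is None or row_id > new_mark:
                new_mark = row_id
        self.high_water_mark = new_mark
        return processed


source = [(1, "east", 10), (2, "west", 4)]
reflection = IncrementalSumReflection()
first = reflection.refresh(source)   # processes both existing rows

source += [(3, "east", 6)]
second = reflection.refresh(source)  # processes only the appended row
```

This only works for append-only sources: an update or delete to a row below the mark would go unnoticed, which is why highly mutating data falls back to the full rebuild mode.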
Closing Words
What We’ve Seen Using these Techniques
• Frequent 10x-100x+ performance improvements in multiple
workloads
• Vast reduction in resources required to achieve performance
levels
• In many cases, a reduction in disk space
– Due to avoidance of excessive unused or rarely used physical copies
Find out More and Get Involved
• Drop by my office hours (East Room Lounge - now)
• Drop by the Dremio table behind you
• Join us at @ApacheArrow meetup at @enigma_data Midtown
– Tech deep dive with Wes McKinney (creator of pandas) and me
• Join the Dremio community (Relational Caching)
– github.com/dremio/dremio-oss (Apache Licensed)
– dremio.com
– community.dremio.com
• Find out more about the Building Blocks
– dev@[arrow|calcite|parquet].apache.org
– http://github.com/apache/[arrow|calcite|parquet-mr]
– http://[arrow|calcite|parquet].apache.org
• Follow @DremioHQ, @intjesus, @ApacheArrow, @ApacheCalcite,
@ApacheParquet

EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
 
Webinar: Cut Disaster Recovery Expenses – Improve Recovery Times
Webinar: Cut Disaster Recovery Expenses – Improve Recovery TimesWebinar: Cut Disaster Recovery Expenses – Improve Recovery Times
Webinar: Cut Disaster Recovery Expenses – Improve Recovery Times
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
Lower Cost and Complexity with Azure and StorSimple Hybrid Cloud Solutions
Lower Cost and Complexity with Azure and StorSimple Hybrid Cloud SolutionsLower Cost and Complexity with Azure and StorSimple Hybrid Cloud Solutions
Lower Cost and Complexity with Azure and StorSimple Hybrid Cloud Solutions
 

Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

  • 1. © 2017 Dremio Corporation @DremioHQ Using Apache Arrow, Calcite and Parquet to build a Relational Cache Halloween 2017 @DataEngConf Jacques Nadeau
  • 2. Who? Jacques Nadeau (@intjesus) • CTO & Co-founder of Dremio • Apache member • VP Apache Arrow • PMCs: Arrow, Calcite, Incubator, Heron (incubating)
  • 3. Agenda • Tech Backgrounder • Caching Techniques • Relational Caching In Depth • Definition and Matching • Dealing with Updates • Closing Words
  • 4. Tech Backgrounder
  • 5. What is Apache Arrow • Columnar in-memory data processing library • Designed to work with any programming language • Support for both relational and complex data as-is • Used by Pandas, Spark, Dremio
  • 6. What is Apache Calcite • SQL parser, Relational Algebra & Optimizer • Understands Materialized Views and Lattices • Used by many to add SQL functionality including Apex, Drill, Hive, Flink, Kylin, Phoenix, Samza, Storm, Cascading & Dremio
  • 7. What is Apache Parquet • OSS implementation of the Google Dremel disk format for complex columnar data • Supports a high level of data-aware columnar compression and vectorized columnar readback • De facto standard for analytical data on disk in the Big Data ecosystem
  • 8. Caching Techniques
  • 9. What does Caching Mean? • Caching: Reduce the distance to data (DTD). • Distance: How much time and how many resources does it take to access data? – How fast is the medium? How near is it? – Is the data designed for efficient consumption? – How similar is the data to what you need to answer a question? • Ways to reduce DTD (diagram labels): Perf & Proximity, Relevance, Consumability
  • 10. Types of Caching • In-Memory File Pinning • Columnar Disk Caching • In-Memory Block Caching • Near-CPU Data Caching • Cube Relational Caching • Arbitrary Relational Caching
  • 11. In-Memory File Pinning • Hold a file in memory for frequent retrieval • Pros – Simple, standard and well-defined interface – Improves the performance of the medium – If your performance is primarily bound by disk IO, this might be a good option • Cons – File structure is not necessarily the best in-memory structure – Data manipulation almost always requires a copy of the data to also be held in memory (because the file format is not directly consumable)
  • 12. Columnar Disk Caching • Store the data in an optimized columnar format. • Pros – Better compression reduces IO – Good structure improves processing – Benefits selective workloads (those that need only a subset of all columns) • Cons – Requires duplicating data – Typically manual/semi-automated (e.g. MapReduce/Spark ETL jobs to persist/update)
  • 13. In-Memory Block Caching • Maintain portions of on-disk data in memory (e.g. Linux page cache, HBase block cache) • Pros – Very mature and usually available for free • Cons – Not easy to control/influence – Very disconnected from workloads
  • 14. Near-CPU Data Caching (memory or disk) • Hold the data directly in a representation that can be processed without restructuring (e.g. Arrow format) • Pros – Processing can be done without interpretation of format – Very efficient to consume – Possible for multiple consumers to consume data without duplicating memory • Cons – Larger than compressed formats – Requires applications to agree on format
  • 15. Cube-Based Relational Caching • Create several partially aggregated cuboids that can satisfy a range of aggregation queries • Pros – Low-latency performance for common aggregate query patterns – Cube storage requirements can be a small fraction of the original dataset size • Cons – Analysis latency is bi-modal: a cube hit is great but a miss is either unserved or served slowly – Difficult or impossible to satisfy arbitrary queries
  • 16. Arbitrary Relational Caching • Create arbitrary data fragments combined with partitioning and sorting schemes to speed any query • Pros – Base case is easy to understand – Can improve the performance of any query • Cons – Complex to match to arbitrary queries – Can be large depending on needs
  • 17. Types of Caching: The combination we found useful • In-Memory File Pinning – Too non-specific given memory scarcity • ✔ Columnar Disk Caching – Make sure everything is in Parquet (for any non-ephemeral data) • ✔ In-Memory Block Caching – Leverage existing page-cache, avoid additional memory cache layers • ✔ Near-CPU Data Caching – Used primarily for ephemeral/short-term persistence to avoid overhead • ✔ Cube Relational Caching – Useful for aggregation patterns • ✔ Arbitrary Relational Caching – Useful for unusual aggregation and non-aggregation needs
  • 18. Relational Caching In Depth
  • 19. Relational Algebra Refresher • Relations: Source of data (a table) • Operators: Define a set of transformations – Join, Project, Scan, Filter, Aggregate, Window, etc. • Properties: Defining traits of data at a particular relation – Sorted by X, hash distributed by Y, etc. • Rules: Define equality conditions between collections of operations – Project > Filter can be changed to Filter > Project; a scan doesn’t need to project columns that aren’t used later; etc. • Graph/Tree: A collection of operators that define a particular dataset in a DAG • [Diagram: two plans over Scan, Filter and Project operators, with Filter and Project in opposite order]
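The rule on slide 19 (swapping a Project and a Filter) can be illustrated with a toy plan tree. This is a minimal sketch in plain Python, not Calcite's API: the `Scan`, `Filter` and `Project` classes, the `execute` interpreter and the `push_filter_below_project` rule are all hypothetical names invented for this illustration, and the projection is assumed to be a simple column selection.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class Scan:
    rows: List[Row]


@dataclass
class Filter:
    child: Any
    column: str                        # the single column the predicate reads
    predicate: Callable[[Any], bool]


@dataclass
class Project:
    child: Any
    columns: List[str]


def execute(node: Any) -> List[Row]:
    """Naive interpreter for the toy operators."""
    if isinstance(node, Scan):
        return list(node.rows)
    if isinstance(node, Filter):
        return [r for r in execute(node.child) if node.predicate(r[node.column])]
    if isinstance(node, Project):
        return [{c: r[c] for c in node.columns} for r in execute(node.child)]
    raise TypeError(f"unknown operator: {node!r}")


def push_filter_below_project(node: Any) -> Any:
    """Rewrite Filter(Project(x)) into Project(Filter(x)).

    Safe here because the projection only selects columns, so the column
    the filter reads has the same value below the projection.
    """
    if (isinstance(node, Filter) and isinstance(node.child, Project)
            and node.column in node.child.columns):
        project = node.child
        pushed = Filter(project.child, node.column, node.predicate)
        return Project(pushed, project.columns)
    return node


table = Scan([{"a": 1, "b": 10, "c": 5}, {"a": 2, "b": 20, "c": 50}])
original = Filter(Project(table, ["a", "c"]), "c", lambda v: v < 10)
rewritten = push_filter_below_project(original)
```

Both plans return the same rows, which is what makes the rule an equality condition; a planner is then free to pick whichever form is cheaper or matches a cached dataset.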
  • 20. Relational Caching: Basic Concept • Store derived data that sits between the original dataset and what you want • Shortens Distance to Data (DTD) • Reduces resource requirements & latency • [Diagram: Original Data feeds a Persisted Shared Intermediate State, which serves several "What you Want" results; the gap between the original DTD and the new DTD is the cost reduction]
  • 21. You Probably Already Do This! • Data Alternatives (Manually Created) – Sessionized – Cleansed – Partitioned by time or region – Summarized for a particular purpose • Users Choose Depending on Need – Analysts trained to use different tables depending on use case – Custom datasets built for reporting – Summarization and/or extraction for dashboards
  • 22. Benefit of Relational Caching over “Copy and Pick” • [Diagram comparing “Copy and Pick” with Relational Caching; both keep physical optimizations (transform, sort, partition, aggregate) of a source table; labels: “User picks best optimization” and “Admin picks manage maintenance” versus “Cache picks best optimization” and “Cache maintains representations”; a “Logical Model” box and a “????” box also appear]
  • 23. Key Components of Relational Caching • A way to express transformations/states: SQL • A way to hold and match relational algebra: Calcite • A way to persist alternative datasets: Parquet • A way to process: Arrow + Sabot • And a lot of code to put it all together…
  • 24. Our Approach • [Architecture diagram; components: End User Queries, UI to Define Cached Patterns, Query Planner, Relational Pattern Matching System, Relational Pattern Database, Change Detection Database, Refresh System, Data Processing System (Sabot), Source Storage Interface (Arrow) over HDFS, S3 and Elastic, Cache Persistence (Parquet, Arrow)]
  • 25. Definition and Matching
  • 26. Coming Back to Calcite • Calcite is a Planner & Optimizer • Comes with a prebuilt selection of operators, rules, properties (called traits) and ways to express relations • Also has a basic Materialized View facility (relevant!) • Perfect Foundation for Relational Caching
  • 27. How We Built Caching: Reflections • Reflection: A persisted alternative view of data in Parquet format – Raw Reflection: Persist all records of underlying dataset, controlling partitioning and sortedness – Aggregate Reflection: Persist a partially aggregated dataset based on a selection of dimensions and measures, still controlling partitioning and sortedness • Reflections can be built on either source tables or arbitrarily defined Virtual Datasets
  • 28. Cache Matching: Aggregation Rollup • Given a user query, try to create an alternative version of the query that matches the cached target. • User Query: F(c’ < 10) over A(a, sum(c) as c’) over P(a,c) over S(t1) • Reflection Definition: A(a,b, sum(c)) over S(t1) • Target Materialization: S(r1) • Alternative Plan: F(c’ < 10) over A(a, sum(c) as c’) over S(r1)
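The rollup on slide 28 works because a sum grouped by (a, b) can be summed again to get the sum grouped by a. The sketch below checks that on toy data in plain Python; the `aggregate` helper and the sample rows are assumptions made for illustration and are not part of Calcite or Dremio.

```python
from collections import defaultdict


def aggregate(rows, keys, value):
    """Group rows by `keys` and sum the `value` column."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return [dict(zip(keys, group), **{value: total})
            for group, total in totals.items()]


# Hypothetical source table t1.
t1 = [
    {"a": "x", "b": 1, "c": 3},
    {"a": "x", "b": 2, "c": 4},
    {"a": "y", "b": 1, "c": 20},
    {"a": "y", "b": 1, "c": 5},
]

# Reflection definition: A(a, b, sum(c)) over S(t1), persisted as r1.
r1 = aggregate(t1, ["a", "b"], "c")

# User query: F(c' < 10) over A(a, sum(c) as c') over S(t1).
direct = [r for r in aggregate(t1, ["a"], "c") if r["c"] < 10]

# Alternative plan: the same aggregate and filter, but scanning r1.
via_reflection = [r for r in aggregate(r1, ["a"], "c") if r["c"] < 10]
```

Here `r1` holds three rows instead of four, and both plans return the same answer. The same trick does not apply directly to measures that cannot be recombined from partial results, such as a distinct count.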
  • 29. Cache Matching: Join/Aggregation Transposition • User Query: A(a, sum(c) as c’) over Join(t1.id=t2.id) of S(t1) and S(t2) • Reflection Definition: A(id, sum(c)) over S(t1) • Target Materialization: S(r1) • Alternative Plan: A(a, sum(c) as c’) over Join(r1.id=t2.id) of S(r1) and S(t2)
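The transposition on slide 29 can be checked the same way: pre-aggregating t1 by the join key and then joining gives the same sums as joining first. This is a small sketch on invented data, assuming column `c` comes from t1 and column `a` from t2; `sum_by` and `join` are hypothetical helpers, not library calls.

```python
from collections import defaultdict


def sum_by(rows, key, value):
    """Return {key value: sum of `value`} over the rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def join(left, right, key):
    """Naive inner equi-join on a shared column name."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical tables: t1 carries the measure c, t2 carries the dimension a.
t1 = [{"id": 1, "c": 2}, {"id": 1, "c": 3}, {"id": 2, "c": 10}]
t2 = [{"id": 1, "a": "east"}, {"id": 2, "a": "east"}, {"id": 2, "a": "west"}]

# User query: A(a, sum(c)) over Join(t1.id = t2.id).
direct = sum_by(join(t1, t2, "id"), "a", "c")

# Reflection definition: A(id, sum(c)) over S(t1), persisted as r1.
r1 = [{"id": k, "c": v} for k, v in sum_by(t1, "id", "c").items()]

# Alternative plan: A(a, sum(c)) over Join(r1.id = t2.id).
via_reflection = sum_by(join(r1, t2, "id"), "a", "c")
```

The join against `r1` touches one row per id rather than every t1 row, which is where the saving comes from when t1 is large.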
  • 30. Cache Matching: Costing and Partitioning Benefits • [Diagram: user query F(a) over S(t1); two target materializations of S(t1), S(r1) partitioned by a and S(r1) partitioned by b; the chosen plan is S(r1) pruned on a]
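One way to read slide 30 is that each candidate materialization gets a cost, and the one partitioned on the filtered column wins because most partitions can be skipped. The sketch below uses rows scanned as a stand-in cost; the cost model, the helper names and the data are assumptions for illustration, not Dremio's actual costing.

```python
from collections import defaultdict


def partition(rows, column):
    """Split rows into {partition value: rows}."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return dict(parts)


def scan_cost(partitions, partition_column, filter_column, wanted):
    """Rows that must be read to answer `filter_column == wanted`."""
    if partition_column == filter_column:
        # Partition pruning: only the matching partition is read.
        return len(partitions.get(wanted, []))
    # No pruning possible: every partition is read.
    return sum(len(rows) for rows in partitions.values())


# Hypothetical source table with 100 rows.
t1 = [{"a": i % 4, "b": i % 2} for i in range(100)]

# Two candidate materializations of the same data.
candidates = {"partitioned_by_a": "a", "partitioned_by_b": "b"}

# User query: F(a = 3) over S(t1).
costs = {
    name: scan_cost(partition(t1, column), column, "a", 3)
    for name, column in candidates.items()
}
best = min(costs, key=costs.get)
```

With this data the filter on `a` reads 25 rows from the materialization partitioned by `a` and all 100 rows from the one partitioned by `b`.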
  • 31. Relational Matching, Other Examples • Physical Property Matching • Predicate Promotion • Predicate Inference • Join Decomposition • Join Promotion
  • 32. Dealing with Updates
  • 33. Refresh Management • Importance of Cache Creation Ordering – Not all update orderings are equal – Want to order updates based on the “Refresh Graph” and dependencies – Multiple orders are possible; cost them against each other to minimize update cost • Freshness Management – Underlying data may change – User should define refresh frequency – Separately define absolute TTL • [Diagram: Physical dataset, Raw Reflection, Aggregate Reflection; 1H refresh, 3H expiration]
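The two ideas on slide 33 can be sketched with the standard library: a topological sort gives one valid refresh order for a dependency graph, and two thresholds separate "refresh due" from "expired". The graph below assumes the aggregate reflection is built from the raw reflection, which the slide's diagram suggests but does not state; the node names, the `freshness` helper and the state labels are hypothetical. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical refresh graph: each node maps to the nodes it is built from.
refresh_graph = {
    "raw_reflection": {"physical_dataset"},
    "aggregate_reflection": {"raw_reflection"},
}

# One valid refresh order: dependencies always come before dependents.
refresh_order = list(TopologicalSorter(refresh_graph).static_order())

# Values taken from the slide's diagram: 1H refresh, 3H expiration.
REFRESH_SECONDS = 1 * 3600
EXPIRE_SECONDS = 3 * 3600


def freshness(age_seconds):
    """Classify a reflection by the age of its last successful refresh."""
    if age_seconds >= EXPIRE_SECONDS:
        return "expired"        # past the absolute TTL: must not be used
    if age_seconds >= REFRESH_SECONDS:
        return "refresh_due"    # still usable, but a refresh should run
    return "fresh"
```

When a graph admits several valid orders, this sketch simply takes the one the sorter yields; the slide's point is that a real system would cost the alternatives against each other.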
  • 34. Multiple Update Modes (Depending on Mutation Pattern) • Full: Always rebuilds reflections from scratch (highly mutating) • Incremental (files): Incrementally builds reflections based on new files and folders (append-only) • Incremental (rowstores): Incrementally builds reflections based on a monotonically increasing field (append-only) • Partitioned Refresh: Maintains reflections based on source partitions, e.g. filesystem directories or Hive partitions (partially mutating)
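The "Incremental (rowstores)" mode on slide 34 can be pictured as a high-water mark on the monotonically increasing field: each refresh copies only rows above the mark. This is a minimal sketch under that assumption; `incremental_refresh` and the sample data are invented for illustration and say nothing about how Dremio implements the mode.

```python
def incremental_refresh(reflection, source_rows, key, high_water_mark):
    """Append source rows whose `key` exceeds the mark; return the new mark.

    Only valid for append-only sources: updates or deletes of rows at or
    below the mark are never seen.
    """
    new_rows = [row for row in source_rows if row[key] > high_water_mark]
    reflection.extend(new_rows)
    if new_rows:
        high_water_mark = max(row[key] for row in new_rows)
    return high_water_mark


source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
reflection = []

# First refresh copies everything.
mark = incremental_refresh(reflection, source, "id", high_water_mark=0)
first_size = len(reflection)

# The source grows; the next refresh copies only the new rows.
source += [{"id": 4, "v": "d"}, {"id": 5, "v": "e"}]
mark = incremental_refresh(reflection, source, "id", mark)

# A refresh with no new data leaves the reflection unchanged.
final_mark = incremental_refresh(reflection, source, "id", mark)
```

The limitation in the docstring is why the slide ties this mode to append-only data and offers full rebuilds for highly mutating sources.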
  • 35. Closing Words
  • 36. What We’ve Seen Using these Techniques • Frequent 10x-100x+ performance improvements in multiple workloads • Vast reduction in resources required to achieve performance levels • In many cases, a reduction in disk space – Due to avoidance of excessive unused or rarely used physical copies
  • 37. Find out More and Get Involved • Drop by my office hours (East Room Lounge - now) • Drop by the Dremio table behind you • Join us at the @ApacheArrow meetup at @enigma_data Midtown – Wes McKinney (creator of Pandas) and myself, tech deep dive • Join the Dremio community (Relational Caching) – github.com/dremio/dremio-oss (Apache Licensed) – dremio.com – community.dremio.com • Find out more about the Building Blocks – dev@[arrow|calcite|parquet].apache.org – http://github.com/apache/[arrow|calcite|parquet-mr] – http://[arrow|calcite|parquet].apache.org • Follow @DremioHQ, @intjesus, @ApacheArrow, @ApacheCalcite, @ApacheParquet