APACHECON EUROPE, Oct. 22nd - 24th, 201
Fast federated SQL with
Apache Calcite
Chris Baynes
@binaryexp
chris@contiamo.com
● CTO at Contiamo
● Apache Calcite committer
Content
● Intro to Apache Calcite
● Calcite toolbox
● Performance - single data sources
● Federated queries
Intro to Apache Calcite
Secret sauce behind lots of existing tech:
Important piece in the “deconstructed database” *
* https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
Pluggable framework:
● Standard SQL query parsing (lots of different dialects via Babel)
● Query rewriting
● Materializations (even auto-generation with lattices)
● Query optimisation (rule based and cost based optimizers)
● Support for data source connections, either as JDBC drivers or custom adapters
● In memory execution model
● SQL Streams
● ...
Calcite toolbox
Calcite terminology & architecture
SqlNode: SQL tree constructed by parser
RelNode: The query plan (relational algebra AST)
Program: Planner/Optimizer
● Cost based (Volcano)
● Rule based (Hep)
Babel parser (SQL parser)
Parse almost any SQL
● Not dialect specific parsing
● Avoids making assumptions about keywords
● Handles special syntax (e.g. cast using :: in Postgres)
Useful when:
● Impersonating a database server
● Clients are more familiar with a dialect
SqlShuttle / RelShuttle (Rewriters)
Uses the visitor pattern to transform one SqlNode/RelNode tree into another
Useful for:
● View rewrites
● Row level security
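The rewriter idea can be sketched without Calcite on the classpath. The following is a minimal toy, assuming a hypothetical three-operator plan tree: `Node`, `Scan`, `Filter`, `Project`, `Shuttle` and `RowLevelSecurity` are illustrative stand-ins, not Calcite's RelNode or RelShuttle classes. It only shows the shape of the pattern: a default shuttle that copies the tree, and a subclass that overrides one visit method to add a row level security filter.

```java
/** Toy stand-ins for a plan tree and a shuttle; none of these are Calcite classes. */
final class ShuttleSketch {

  /** A node in a tiny relational plan tree. */
  abstract static class Node {
    abstract Node accept(Shuttle shuttle);
  }

  static final class Scan extends Node {
    final String table;
    Scan(String table) { this.table = table; }
    @Override Node accept(Shuttle shuttle) { return shuttle.visit(this); }
    @Override public String toString() { return "Scan(" + table + ")"; }
  }

  static final class Filter extends Node {
    final String condition;
    final Node input;
    Filter(String condition, Node input) {
      this.condition = condition;
      this.input = input;
    }
    @Override Node accept(Shuttle shuttle) { return shuttle.visit(this); }
    @Override public String toString() { return "Filter(" + condition + ", " + input + ")"; }
  }

  static final class Project extends Node {
    final String fields;
    final Node input;
    Project(String fields, Node input) {
      this.fields = fields;
      this.input = input;
    }
    @Override Node accept(Shuttle shuttle) { return shuttle.visit(this); }
    @Override public String toString() { return "Project(" + fields + ", " + input + ")"; }
  }

  /** Default shuttle: rebuilds the tree unchanged, recursing into inputs. */
  static class Shuttle {
    Node visit(Scan scan) { return scan; }
    Node visit(Filter filter) {
      return new Filter(filter.condition, filter.input.accept(this));
    }
    Node visit(Project project) {
      return new Project(project.fields, project.input.accept(this));
    }
  }

  /** Row level security: every scan of the protected table is wrapped in a filter. */
  static final class RowLevelSecurity extends Shuttle {
    private final String table;
    private final String predicate;
    RowLevelSecurity(String table, String predicate) {
      this.table = table;
      this.predicate = predicate;
    }
    @Override Node visit(Scan scan) {
      if (scan.table.equals(table)) {
        return new Filter(predicate, scan);
      }
      return scan;
    }
  }

  public static void main(String[] args) {
    Node plan = new Project("name", new Scan("customers"));
    // The table name and predicate are made up for the example.
    Node secured = plan.accept(new RowLevelSecurity("customers", "tenant_id = 42"));
    System.out.println(secured);
  }
}
```

A view rewrite would follow the same shape: override the scan visit and return the view's defining subtree instead of a filter.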
RelBuilder (Plan builder)
● Programmatically build a RelNode
● Skip SQL parsing
Useful when:
● Query is generated by model / UI
● Avoid writing SQL by hand
Statistics
Provide metadata about physical tables.
● Number of rows (100 default)
● Column uniqueness
● Distribution of values
Useful when:
● Improving planner costs
Materializations
● Define materialization with SQL
● Automatically rewrite queries to use materializations
1. Materialize pageviews by country & city (pviews_by_loc)
2. Make query:
select country, city, count(*)
from pageviews
group by country, city
order by country
3. The query is rewritten to read from the materialization:
select country, city, cnt
from pviews_by_loc
order by country
SqlDialect
● Convert to dialect specific SQL string
● Many SQL dialects supported (Oracle, MySQL, Postgres, Hive, …)
● Easy to plug in your own dialect
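import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.sql.SqlCall;
import org.apache.calcite.sql.SqlDialect;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.SqlWriter;

public class CustomDialect extends SqlDialect {
  public CustomDialect(Context context) { super(context); }
  @Override public boolean supportsNestedAggregations() { return false; }
  @Override public boolean supportsWindowFunctions() { return false; }
  @Override public SqlNode getCastSpec(RelDataType type) {
    // translate type names ..., falling back to the default mapping
    return super.getCastSpec(type);
  }
  @Override public void unparseCall(SqlWriter writer, SqlCall call,
      int leftPrec, int rightPrec) {
    // make changes to the SqlCall (SqlNode) ..., then emit it as usual
    super.unparseCall(writer, call, leftPrec, rightPrec);
  }
}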
Lattices
● Define dimensions on star schemas
● Combinations of dimensions form summary tables
● Materialized summary tables are called tiles
● Trick is to figure out which tiles to pick
https://www.slideshare.net/julianhyde/dont-optimize-my-queries-optimize-my-data-81360892
Conclusion - tools
● Many different tools
● Can be used to build complex applications
● Consider the effect on performance
Performance - Single data source
Measure - benchmark
Use a standard SQL benchmark tool: TPC
TPC-H:
● 22 queries
● data size configurable: 1 GB, 10 GB, 30 GB, ...
● generated for lots of dialects
Measure overhead
Configuration & hardware not so important for us
Measure - trace
Query push down
select name, id from customers order by name limit 10

Sort and limit in memory ✘
EnumerableSort(sort0=[$1], dir0=[ASC], fetch=10)
  JdbcToEnumerableConverter
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])

Full query push down ✔
JdbcToEnumerableConverter
  JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])
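When tracing many benchmark queries, the distinction between the two plans can be checked mechanically. The sketch below is a rough heuristic rather than a Calcite API: it assumes the plan is available as text and that operator names follow the prefix convention seen in these plans (`Enumerable...` runs in memory, `Jdbc...` runs in the data source). `PushDownCheck` and its methods are hypothetical names; code holding the actual plan object would more likely inspect each node's convention via `RelNode.getConvention()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Heuristic check on a textual plan; relies on the operator naming convention only. */
final class PushDownCheck {

  /** Returns the names of operators that execute in memory (Enumerable convention). */
  static List<String> inMemoryOperators(String planText) {
    List<String> result = new ArrayList<>();
    for (String line : planText.split("\n")) {
      String op = line.trim();
      if (op.isEmpty()) {
        continue;
      }
      // Strip the attribute list, e.g. "EnumerableSort(sort0=...)" -> "EnumerableSort".
      int paren = op.indexOf('(');
      String name = paren < 0 ? op : op.substring(0, paren);
      // JdbcToEnumerableConverter starts with "Jdbc", so the boundary node is not counted.
      if (name.startsWith("Enumerable")) {
        result.add(name);
      }
    }
    return result;
  }

  /** Full push down: nothing above the converter runs in memory. */
  static boolean isFullyPushedDown(String planText) {
    return inMemoryOperators(planText).isEmpty();
  }

  public static void main(String[] args) {
    String plan = "EnumerableSort(sort0=[$1], dir0=[ASC], fetch=10)\n"
        + "  JdbcToEnumerableConverter\n"
        + "    JdbcProject(name, id)\n"
        + "      JdbcTableScan(table=[customers])";
    System.out.println("in memory: " + inMemoryOperators(plan));
    System.out.println("fully pushed down: " + isFullyPushedDown(plan));
  }
}
```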
Benchmark runs
4 benchmark runs:
0. Direct against database
1. Query engine - unoptimized (defaults)
2. Query engine - with statistics
3. Query engine - with multi-phase optimiser
Benchmark baseline (run 0)
Benchmark run 1 - unoptimised
total = 158.1s
Results
● Slow!
● Very little push-down
● Planning consumes a lot of time
Try
● Using table statistics
Benchmark run 2 - use table stats
Results
● Some queries noticeably faster
● Most work still being done in memory
● Query planning is very slow
Next, try: Planning phases
Benchmark run 3 - planning phases
Split planning into 2 phases (create separate Calcite Programs):
1. Logical phase:
Use rule-based optimiser (Hep), apply all logical optimisation rules
2. Physical phase:
Use cost-based optimiser (Volcano), apply all physical rules and conversions (JDBC, Druid, …)
Benchmark run 3 - planning phases vs stats
Benchmark run 3 - planning phases vs baseline
Benchmark run 3 - planning phases
Results
● Planning and execution are much faster
● Full query push down
● Almost all query time being spent in data source
● Some queries are faster than baseline
Conclusion - single data source
The goal: full query push down
Fastest way to get there:
● Apply most rules in rule-based optimiser
● Avoid giving cost-based optimiser too many rules
● Rely on underlying data source optimiser
Federated queries
“Single query based on information from multiple data sources”
Obstacles to efficiency
● Joins of large tables
● Large driver fetchSize
● Aggregations
● Other in-memory computations
Can lead to:
● Slow queries
● OOM
Solution
Use Spark as the execution engine
● Scales out to many nodes
● Avoids OOM for large datasets
● Some Calcite support
Bonus
● Eases access to data lakes
Calcite support for Spark
Some options are currently available:
● Calcite Spark adapter
○ Translates Calcite plan => Spark plan
○ Sends Spark plan to cluster
○ - Needs more work to make translation complete
● Calcite SparkSqlDialect
○ Use RelToSqlConverter to generate a query in Spark SQL
○ - Spark completely re-plans/optimises the query
○ - Spark's push-down is less complete than Calcite's
● Send SQL queries directly to Spark
○ - Can’t use Calcite toolbox
○ - Spark's push-down is less complete than Calcite's
A hybrid approach
1. Generate a Calcite plan
2. Identify pieces related to separate data sources
3. Squash those pieces into “Spark tables”
4. Construct a dataframe for every “Spark table”
5. Convert the remaining plan to SparkSQL
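Steps 2 to 4 amount to a tree rewrite, which can be sketched on a toy plan model. Everything here is an assumption made for illustration: `PlanNode`, `squash` and `render` are hypothetical names, not Calcite or Spark classes, and the sketch assumes that the root of each pushed-down piece is tagged with its data source (as the converter nodes are in the example plans). Building the dataframes and generating Spark SQL for the remaining plan are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of steps 2 to 4; PlanNode is a stand-in, not a Calcite RelNode. */
final class HybridPlanSketch {

  static final class PlanNode {
    final String op;
    final String dataSource; // set on the root of a pushed-down piece, otherwise null
    final List<PlanNode> inputs;

    PlanNode(String op, String dataSource, PlanNode... inputs) {
      this.op = op;
      this.dataSource = dataSource;
      this.inputs = Arrays.asList(inputs);
    }
  }

  /**
   * Replaces every single-source subtree with a "SparkTable" leaf and collects
   * the removed subtrees in {@code pieces}, in the order they are found.
   */
  static PlanNode squash(PlanNode node, List<PlanNode> pieces) {
    if (node.dataSource != null) {
      pieces.add(node);
      String leaf = "SparkTable(table=t" + pieces.size() + ", ds=[" + node.dataSource + "])";
      return new PlanNode(leaf, node.dataSource);
    }
    PlanNode[] rewritten = new PlanNode[node.inputs.size()];
    for (int i = 0; i < rewritten.length; i++) {
      rewritten[i] = squash(node.inputs.get(i), pieces);
    }
    return new PlanNode(node.op, null, rewritten);
  }

  /** Renders the plan as indented text, one operator per line. */
  static String render(PlanNode node) {
    StringBuilder out = new StringBuilder();
    render(node, 0, out);
    return out.toString();
  }

  private static void render(PlanNode node, int depth, StringBuilder out) {
    for (int i = 0; i < depth; i++) {
      out.append("  ");
    }
    out.append(node.op).append('\n');
    for (PlanNode input : node.inputs) {
      render(input, depth + 1, out);
    }
  }

  public static void main(String[] args) {
    PlanNode sales = new PlanNode("JdbcToEnumerableConverter", "X",
        new PlanNode("JdbcProject(cust_id, sale_id)", null,
            new PlanNode("JdbcTableScan(table=[sales])", null)));
    PlanNode customer = new PlanNode("JdbcToEnumerableConverter", "Y",
        new PlanNode("JdbcProject(cust_id)", null,
            new PlanNode("JdbcFilter(city=[Berlin])", null,
                new PlanNode("JdbcTableScan(table=[customer])", null))));
    PlanNode plan = new PlanNode("EnumerableLimit(fetch=10)", null,
        new PlanNode("EnumerableHashJoin", null, sales, customer));

    List<PlanNode> pieces = new ArrayList<>();
    System.out.print(render(squash(plan, pieces)));
    System.out.println(pieces.size() + " pieces to load as dataframes");
  }
}
```

Each entry in `pieces` corresponds to one query that is sent to its own data source; the rewritten tree is what remains for the execution engine.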
Hybrid approach - example
EnumerableLimit(fetch=10)
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
Join two tables from different data sources, X & Y.
select cust_id, sale_id
from X.sales join Y.customer
using (cust_id)
where Y.customer.city = 'Berlin'
limit 10
Identify data source pieces
t1: select cust_id, sale_id
from X.sales
t2: select cust_id
from Y.customer
where city = 'Berlin'
Hybrid approach - example
SparkLimit(fetch=10)
  SparkHashJoin
    SparkTable(table=t1, ds=[X], query=[...])
    SparkTable(table=t2, ds=[Y], query=[...])
select *
from t1 join t2
using (cust_id)
limit 10
Use the Spark SqlDialect to transform into a Spark query
Conclusion - Hybrid approach
● Full scalability of Spark
● Push down ability of Calcite
● No bottleneck through in-memory layer
● Can be persisted to external db / file using Spark
● Approach would also work for other engines (Drill, Flink, Presto)
Thanks!
@binaryexp
chris@contiamo.com
calcite.apache.org

Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 

Fast federated SQL queries with Apache Calcite and Spark

  • 1. APACHECON EUROPE, Oct. 22nd - 24th 201. Fast federated SQL with Apache Calcite. Chris Baynes
  • 2. @binaryexp chris@contiamo.com ● CTO at Contiamo ● Apache Calcite committer
  • 4. Content ● Intro to Apache Calcite ● Calcite toolbox ● Performance - single data sources ● Federated queries
  • 5. Intro to Apache Calcite Secret sauce behind lots of existing tech:
  • 6. Intro to Apache Calcite Important piece in the “deconstructed database” * * https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
  • 7. Intro to Apache Calcite Pluggable framework: ● Standard SQL query parsing (lots of different dialects via Babel) ● Query rewriting ● Materializations (even auto-generation with lattices) ● Query optimisation (rule based and cost based optimizers) ● Support datasource connections either as JDBC drivers or custom adapters ● In memory execution model ● SQL Streams ● ...
  • 9. Calcite terminology & architecture SqlNode: SQL tree constructed by parser RelNode: The query plan (relational algebra AST) Program: Planner/Optimizer ● Cost based (Volcano) ● Rule based (Hep)
  • 10. Babel parser (SQL parser) Parse almost any SQL ● Not dialect specific parsing ● Avoids making assumptions about keywords ● Handles special syntax (e.g. cast using :: in Postgres) Useful when: ● Impersonating a database server ● Clients are more familiar with a dialect
  • 11. SqlShuttle / RelShuttle (Rewriters) Uses visitor pattern to transform one SqlNode/RelNode tree into another Useful for: ● View rewrites ● Row level security
  • 12. RelBuilder (Plan builder) ● Programmatically build a RelNode ● Skip SQL parsing Useful when: ● Query is generated by model / UI ● Avoid writing SQL by hand
  • 13. Statistics Provide metadata about physical tables. ● Number of rows (100 default) ● Column uniqueness ● Distribution of values Useful when: ● Improving planner costs
  • 14. Materializations ● Define materialization with SQL ● Automatically rewrite queries to use materializations 1. Materialize pageviews by country & city (pviews_by_loc) 2. Make query:
      select country, city, count(*) from pageviews group by country, city order by country
    which is rewritten to use the materialization:
      select country, city, cnt from pviews_by_loc order by country
  • 15. SqlDialect ● Convert to dialect specific SQL string ● Many SQL dialects supported (Oracle, MySQL, Postgres, Hive, …) ● Easy to plug in your own dialect
      public class CustomDialect extends SqlDialect {
        @Override public boolean supportsNestedAggregations() { return false; }
        @Override public boolean supportsWindowFunctions() { return false; }
        @Override public SqlNode getCastSpec(RelDataType type) {
          // translate type names ...
        }
        @Override public void unparseCall(SqlWriter writer, SqlCall call, int leftPrec, int rightPrec) {
          // make changes to the SqlCall (SqlNode) ...
        }
      }
  • 16. Lattices ● Define dimensions on star schemas ● Combinations of dimensions form summary tables ● Materialized summary tables are called tiles ● Trick is to figure out which tiles to pick
  • 18. Conclusion - tools ● Many different tools ● Can be used to build complex applications ● Consider effect on performance
  • 19. Performance - Single data source
  • 20. Measure - benchmark Use a standard SQL benchmark tool: TPC TPC-H: ● 22 queries ● data size configurable: 1 GB, 10, 30, .... ● generated for lots of dialects Measure overhead Configuration & hardware not so important for us
  • 22. Query push down select name, id from customers order by name limit 10
      ✘ Sort and limit in memory
      EnumerableSort(sort0=[$1], dir0=[ASC], fetch=10)
        JdbcToEnumerableConverter
          JdbcProject(name, id)
            JdbcTableScan(table=[customers])
      ✔ Full query push down
      JdbcToEnumerableConverter
        JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
          JdbcProject(name, id)
            JdbcTableScan(table=[customers])
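The difference between the two plans on the push-down slide can be checked mechanically. The following is a minimal sketch, not part of the talk: it assumes a plan is available as an indented text dump and that operators executed inside the query engine carry the `Enumerable` prefix, as they do in the plans shown. `PushDownCheck` and its members are hypothetical names chosen for illustration; they are not Calcite classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the operators of a plan dump that run in memory. */
final class PushDownCheck {
    /** Plan from the slide where sort and limit stay in the query engine. */
    static final String PARTIAL = String.join("\n",
        "EnumerableSort(sort0=[$1], dir0=[ASC], fetch=10)",
        "  JdbcToEnumerableConverter",
        "    JdbcProject(name, id)",
        "      JdbcTableScan(table=[customers])");

    /** Plan from the slide where the whole query is pushed to the database. */
    static final String FULL = String.join("\n",
        "JdbcToEnumerableConverter",
        "  JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])",
        "    JdbcProject(name, id)",
        "      JdbcTableScan(table=[customers])");

    static List<String> inMemoryOperators(String planText) {
        List<String> result = new ArrayList<>();
        for (String line : planText.split("\n")) {
            String op = line.trim();
            // Assumption: only "Enumerable..." operators do work in memory.
            // JdbcToEnumerableConverter starts with "Jdbc" and is not counted,
            // because it only hands rows from the database to the engine.
            if (op.startsWith("Enumerable")) {
                result.add(op);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("partial push down: " + inMemoryOperators(PARTIAL));
        System.out.println("full push down:    " + inMemoryOperators(FULL));
    }
}
```

An empty list means every operator sits below the converter, which is the "full query push down" case the benchmark runs aim for.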
  • 23. Benchmark runs 4 benchmark runs: 0. Direct against database 1. Query engine - unoptimized (defaults) 2. Query engine - with statistics 3. Query engine - with multi-phase optimiser
  • 25. Benchmark run 1 - unoptimised total = 158.1s
  • 26. Benchmark run 1 - unoptimised Results ● Slow! ● Very little push-down ● Planning consumes a lot of time Try ● Using table statistics
  • 27. Benchmark run 2 - use table stats
  • 28. Benchmark run 2 - use table stats Results ● Some queries noticeably faster ● Most work still being done in memory ● Query planning is very slow Next, try: Planning phases
  • 29. Benchmark run 3 - planning phases Split planning into 2 phases (create separate Calcite Programs): 1. Logical phase: Use rule-based optimiser (Hep), apply all logical optimisation rules 2. Physical phase: Use cost-based optimiser (Volcano), apply all physical rules and conversions (jdbc, druid, …)
  • 30. Benchmark run 3 - planning phases vs stats
  • 31. Benchmark run 3 - planning phases vs baseline
  • 32. Benchmark run 3 - planning phases Results ● Planning and execution are much faster ● Full query push down ● Almost all query time being spent in data source ● Some queries are faster than baseline
  • 33. Conclusion - single data source The goal: full query push down Fastest way to get there: ● Apply most rules in rule-based optimiser ● Avoid giving cost-based optimiser too many rules ● Rely on underlying data source optimiser
  • 34. Federated queries “Single query based on information from multiple data sources”
  • 35. Obstacles to efficiency ● Joins of large tables ● Large driver fetchSize ● Aggregations ● Other in-memory computations Can lead to: ● Slow queries ● OOM
  • 36. Solution: Use Spark as the execution engine ● Scales out to many nodes ● Avoid OOM for large datasets ● Some Calcite support Bonus ● Eases access to data lakes
  • 37. Calcite support for Spark Currently some options available: ● Calcite Spark adapter ○ Translates Calcite plan => Spark plan ○ Sends Spark plan to cluster ○ - Needs more work to make translation complete ● Calcite SparkSqlDialect ○ Use RelToSqlConverter to generate a query in Spark SQL ○ - Spark completely re-plans/optimises the query ○ - Spark push downs are not as full ● Send SQL queries directly to Spark ○ - Can’t use Calcite toolbox ○ - Spark push downs are not as full
  • 38. A hybrid approach 1. Generate a Calcite plan 2. Identify pieces related to separate data sources 3. Squash those pieces into “Spark tables” 4. Construct a dataframe for every “Spark table” 5. Convert the remaining plan to SparkSQL
  • 39. Hybrid approach - example Join two tables from different data sources, X & Y.
      select cust_id, sale_id from X.sales join Y.customer using (cust_id) where Y.customer.city = 'Berlin' limit 10
      EnumerableLimit(fetch=10)
        EnumerableHashJoin
          JdbcToEnumerableConverter(datasource=[X])
            JdbcProject(cust_id, sale_id)
              JdbcTableScan(table=[sales])
          JdbcToEnumerableConverter(datasource=[Y])
            JdbcProject(cust_id)
              JdbcFilter(city=[Berlin])
                JdbcTableScan(table=[customer])
  • 40. Hybrid approach - example Identify data source pieces
      t1: select cust_id, sale_id from X.sales
      EnumerableLimit(fetch=10)
        EnumerableHashJoin
          JdbcToEnumerableConverter(datasource=[X])
            JdbcProject(cust_id, sale_id)
              JdbcTableScan(table=[sales])
          JdbcToEnumerableConverter(datasource=[Y])
            JdbcProject(cust_id)
              JdbcFilter(city=[Berlin])
                JdbcTableScan(table=[customer])
  • 41. Hybrid approach - example Identify data source pieces
      t2: select cust_id from Y.customer where city = 'Berlin'
      EnumerableLimit(fetch=10)
        EnumerableHashJoin
          JdbcToEnumerableConverter(datasource=[X])
            JdbcProject(cust_id, sale_id)
              JdbcTableScan(table=[sales])
          JdbcToEnumerableConverter(datasource=[Y])
            JdbcProject(cust_id)
              JdbcFilter(city=[Berlin])
                JdbcTableScan(table=[customer])
  • 42. Hybrid approach - example Use the Spark SqlDialect to transform into a Spark query
      SparkLimit(fetch=10)
        SparkHashJoin
          SparkTable(table=t1, ds=[X], query=[...])
          SparkTable(table=t2, ds=[Y], query=[...])
      select * from t1 join t2 using (cust_id) limit 10
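Steps 2 and 3 of the hybrid approach (identify the pieces that belong to one data source, then squash each into a "Spark table") can be sketched on a toy plan tree. This is an illustration under stated assumptions, not the speaker's implementation: `PlanNode`, `HybridSplitter` and `HybridDemo` are hypothetical stand-ins rather than Calcite classes, every subtree rooted at a node tagged with a data source is treated as one piece, and the remaining operators are renamed from `Enumerable...` to `Spark...` purely by string replacement. Building dataframes and generating Spark SQL (steps 4 and 5) are not covered.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy plan node: operator text, optional data source tag, child nodes. */
final class PlanNode {
    final String op;
    final String dataSource; // non-null only on a per-source boundary node
    final List<PlanNode> inputs;

    PlanNode(String op, String dataSource, PlanNode... inputs) {
        this.op = op;
        this.dataSource = dataSource;
        this.inputs = List.of(inputs);
    }

    @Override
    public String toString() {
        return inputs.isEmpty() ? op : op + inputs;
    }
}

/** Replaces each single-source subtree with a named table placeholder. */
final class HybridSplitter {
    /** Squashed subtrees keyed by generated name (t1, t2, ...), in visit order. */
    final Map<String, PlanNode> sparkTables = new LinkedHashMap<>();

    PlanNode split(PlanNode node) {
        if (node.dataSource != null) {
            // Everything below this node runs in one data source: squash it.
            String name = "t" + (sparkTables.size() + 1);
            sparkTables.put(name, node);
            return new PlanNode("SparkTable(" + name + ", ds=" + node.dataSource + ")", null);
        }
        PlanNode[] rewritten = new PlanNode[node.inputs.size()];
        for (int i = 0; i < rewritten.length; i++) {
            rewritten[i] = split(node.inputs.get(i));
        }
        // Assumption: the rest of the plan maps one-to-one onto Spark operators.
        return new PlanNode(node.op.replace("Enumerable", "Spark"), null, rewritten);
    }
}

final class HybridDemo {
    /** The federated plan from the example slides. */
    static PlanNode examplePlan() {
        PlanNode sales = new PlanNode("JdbcToEnumerableConverter", "X",
            new PlanNode("JdbcProject(cust_id, sale_id)", null,
                new PlanNode("JdbcTableScan(sales)", null)));
        PlanNode customer = new PlanNode("JdbcToEnumerableConverter", "Y",
            new PlanNode("JdbcProject(cust_id)", null,
                new PlanNode("JdbcFilter(city=Berlin)", null,
                    new PlanNode("JdbcTableScan(customer)", null))));
        return new PlanNode("EnumerableLimit(fetch=10)", null,
            new PlanNode("EnumerableHashJoin", null, sales, customer));
    }

    public static void main(String[] args) {
        HybridSplitter splitter = new HybridSplitter();
        PlanNode remaining = splitter.split(examplePlan());
        System.out.println("remaining plan: " + remaining);
        splitter.sparkTables.forEach((name, plan) -> System.out.println(name + " = " + plan));
    }
}
```

In a real system each squashed subtree would be converted to SQL in the dialect of its data source, which is where the push-down work done by Calcite is preserved.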
  • 43. Conclusion - Hybrid approach ● Full scalability of Spark ● Push down ability of Calcite ● No bottleneck through in-memory layer ● Can be persisted to external DB / file using Spark ● Approach would also work for other engines (Drill, Flink, Presto)