This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
4. Content
● Intro to Apache Calcite
● Calcite toolbox
● Performance - single data sources
● Federated queries
5. Intro to Apache Calcite
The secret sauce behind lots of existing tech (Hive, Flink, Druid, Phoenix, ...)
6. Intro to Apache Calcite
Important piece in the “deconstructed database” *
* https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
7. Intro to Apache Calcite
Pluggable framework:
● Standard SQL query parsing (lots of different dialects via Babel)
● Query rewriting
● Materializations (even auto-generation with lattices)
● Query optimisation (rule-based and cost-based optimisers)
● Data source connections, either as JDBC drivers or custom adapters
● In-memory execution model
● Streaming SQL
● ...
9. Calcite terminology & architecture
SqlNode: the SQL parse tree constructed by the parser
RelNode: the query plan (a relational algebra AST)
Program: the planner/optimiser
● Cost-based (Volcano)
● Rule-based (Hep)
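To make these pieces concrete, here is a minimal sketch of going from SQL text to a RelNode via Calcite's Frameworks API (assuming a SchemaPlus named rootSchema that you have already populated with tables; checked exceptions collapsed into Exception):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public static RelNode plan(SchemaPlus rootSchema, String sql) throws Exception {
  FrameworkConfig config = Frameworks.newConfigBuilder()
      .defaultSchema(rootSchema)
      .build();
  Planner planner = Frameworks.getPlanner(config);
  SqlNode parsed = planner.parse(sql);          // SqlNode: the parse tree
  SqlNode validated = planner.validate(parsed); // resolve names & types
  return planner.rel(validated).project();      // RelNode: the logical plan
}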
10. Babel parser (SQL parser)
Parse almost any SQL:
● Not dialect-specific parsing
● Avoids making assumptions about keywords
● Handles special syntax (e.g. cast using :: in Postgres)
Useful when:
● Impersonating a database server
● Clients are more familiar with a particular dialect
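Switching to Babel is essentially a parser-factory swap; a sketch assuming the calcite-babel module is on the classpath (older Calcite releases spell the builder SqlParser.configBuilder() instead of SqlParser.config()):

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.parser.babel.SqlBabelParserImpl;

// Babel accepts dialect-specific syntax such as the Postgres :: cast
SqlParser.Config config = SqlParser.config()
    .withParserFactory(SqlBabelParserImpl.FACTORY);
SqlNode node = SqlParser
    .create("select ts::varchar from events", config)
    .parseQuery();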
11. SqlShuttle / RelShuttle (Rewriters)
Uses the visitor pattern to transform one SqlNode/RelNode tree into another
Useful for:
● View rewrites
● Row level security
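For instance, a minimal SqlShuttle that redirects a table reference (say, to a row-level-filtered view; the table names here are purely illustrative, and a real rewriter would also check the identifier's context):

import org.apache.calcite.sql.SqlIdentifier;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.util.SqlShuttle;

// Rewrite references to "pageviews" into "pageviews_secure"
SqlShuttle rewriter = new SqlShuttle() {
  @Override public SqlNode visit(SqlIdentifier id) {
    if (id.isSimple() && "pageviews".equalsIgnoreCase(id.getSimple())) {
      return new SqlIdentifier("pageviews_secure", id.getParserPosition());
    }
    return id;
  }
};
SqlNode rewritten = query.accept(rewriter); // query: the parsed SqlNode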
12. RelBuilder (Plan builder)
● Programmatically build a RelNode
● Skip SQL parsing
Useful when:
● Query is generated by model / UI
● Avoid writing SQL by hand
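A small sketch of building the equivalent of "select name, id from customers where city = 'Berlin'" without any SQL text (config is the same FrameworkConfig used with the planner):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.RelBuilder;

RelBuilder builder = RelBuilder.create(config);
RelNode plan = builder
    .scan("customers")
    .filter(builder.call(SqlStdOperatorTable.EQUALS,
        builder.field("city"), builder.literal("Berlin")))
    .project(builder.field("name"), builder.field("id"))
    .build();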
13. Statistics
Provide metadata about physical tables.
● Number of rows (defaults to 100)
● Column uniqueness
● Distribution of values
Useful for:
● Improving planner cost estimates
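For custom tables, statistics are exposed by overriding Table.getStatistic(); a sketch with made-up numbers (~1M rows, column 0 unique):

import com.google.common.collect.ImmutableList;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.Statistic;
import org.apache.calcite.schema.Statistics;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;
import org.apache.calcite.util.ImmutableBitSet;

public class CustomersTable extends AbstractTable {
  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("id", SqlTypeName.BIGINT)
        .add("name", SqlTypeName.VARCHAR)
        .build();
  }

  // Planner sees ~1M rows and a unique key on column 0, not the 100-row default
  @Override public Statistic getStatistic() {
    return Statistics.of(1_000_000d, ImmutableList.of(ImmutableBitSet.of(0)));
  }
}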
14. Materializations
● Define materialization with SQL
● Automatically rewrite queries to use materializations
1. Materialize pageviews by country & city (pviews_by_loc)
2. Run the query:
select country, city, count(*)
from pageviews
group by country, city
order by country
3. Calcite automatically rewrites it to use the materialization:
select country, city, cnt
from pviews_by_loc
order by country
15. SqlDialect
● Convert to dialect specific SQL string
● Many SQL dialects supported (Oracle, MySQL, Postgres, Hive, …)
● Easy to plug in your own dialect
public class CustomDialect extends SqlDialect {
  public CustomDialect(Context context) { super(context); }

  @Override public boolean supportsNestedAggregations() { return false; }
  @Override public boolean supportsWindowFunctions() { return false; }

  @Override public SqlNode getCastSpec(RelDataType type) {
    // translate type names ...
    return super.getCastSpec(type);
  }

  @Override public void unparseCall(SqlWriter writer, SqlCall call,
      int leftPrec, int rightPrec) {
    // make changes to the SqlCall (SqlNode) ...
    super.unparseCall(writer, call, leftPrec, rightPrec);
  }
}
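Rendering is then a one-liner on any SqlNode; for example, with the built-in Postgres dialect:

import org.apache.calcite.sql.dialect.PostgresqlSqlDialect;

String sql = sqlNode.toSqlString(PostgresqlSqlDialect.DEFAULT).getSql();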
16. Lattices
● Define dimensions on star schemas
● Combinations of dimensions form summary tables
● Materialized summary tables are called tiles
● The trick is figuring out which tiles to pick
20. Measure - benchmark
Use a standard SQL benchmark tool: TPC
TPC-H:
● 22 queries
● configurable data size: 1 GB, 10 GB, 30 GB, ...
● query generators for lots of dialects
The aim is to measure Calcite's overhead, so exact configuration & hardware are not so important here.
22. Query push down
select name, id from customers order by name limit 10
✔ Full query push down:
JdbcToEnumerableConverter
  JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])

✘ Sort and limit in memory:
EnumerableSort(sort0=[$1], dir0=[ASC], fetch=[10])
  JdbcToEnumerableConverter
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])
23. Benchmark runs
4 benchmark runs:
0. Direct against database
1. Query engine - unoptimised (defaults)
2. Query engine - with statistics
3. Query engine - with multi-phase optimiser
28. Benchmark run 2 - use table stats
Results
● Some queries noticeably faster
● Most work still being done in memory
● Query planning is very slow
Next, try: Planning phases
29. Benchmark run 3 - planning phases
Split planning into 2 phases (create separate Calcite Programs):
1. Logical phase:
Use rule-based optimiser (Hep), apply all logical optimisation rules
2. Physical phase:
Use cost-based optimiser (Volcano), apply all physical rules and conversions (JDBC, Druid, …)
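One way to express the two phases with Calcite's Programs helper; logicalRules and physicalRules are placeholders for whichever rule collections you enable:

import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.metadata.DefaultRelMetadataProvider;
import org.apache.calcite.tools.Program;
import org.apache.calcite.tools.Programs;

// Phase 1: rule-based (Hep) pass applying all logical rules
HepProgram hep = new HepProgramBuilder()
    .addRuleCollection(logicalRules)
    .build();
Program logicalPhase =
    Programs.of(hep, false, DefaultRelMetadataProvider.INSTANCE);

// Phase 2: cost-based (Volcano) pass with only physical rules & converters
Program physicalPhase = Programs.ofRules(physicalRules);

Program twoPhase = Programs.sequence(logicalPhase, physicalPhase);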
32. Benchmark run 3 - planning phases
Results
● Planning and execution are much faster
● Full query push down
● Almost all query time being spent in data source
● Some queries are faster than baseline
33. Conclusion - single data source
The goal: full query push down
Fastest way to get there:
● Apply most rules in rule-based optimiser
● Avoid giving cost-based optimiser too many rules
● Rely on underlying data source optimiser
35. Obstacles to efficiency
● Joins of large tables
● Large driver fetchSize
● Aggregations
● Other in-memory computations
Can lead to:
● Slow queries
● OOM
36. Solution
Use Spark as the execution engine:
● Scales out to many nodes
● Avoids OOM for large datasets
● Some Calcite support already exists
Bonus:
● Eases access to data lakes
37. Calcite support for Spark
Currently a few options are available:
● Calcite Spark adapter
○ Translates the Calcite plan => Spark plan
○ Sends the Spark plan to the cluster
○ - Needs more work to make the translation complete
● Calcite SparkSqlDialect
○ Use RelToSqlConverter to generate a query in Spark SQL
○ - Spark completely re-plans/optimises the query
○ - Spark's push downs are not as complete
● Send SQL queries directly to Spark
○ - Can't use the Calcite toolbox
○ - Spark's push downs are not as complete
38. A hybrid approach
1. Generate a Calcite plan
2. Identify pieces related to separate data sources
3. Squash those pieces into "Spark tables"
4. Construct a dataframe for every "Spark table" (sketched below)
5. Convert the remaining plan to Spark SQL
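Steps 3-4 amount to registering each pushed-down piece as a temp view in Spark; a sketch using Spark's JDBC source (jdbcUrlX is a placeholder, and the "query" option needs Spark 2.4+; older versions use "dbtable"):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("federated").getOrCreate();

// t1: the piece pushed down to data source X
Dataset<Row> t1 = spark.read().format("jdbc")
    .option("url", jdbcUrlX)
    .option("query", "select cust_id, sale_id from sales")
    .load();
t1.createOrReplaceTempView("t1");
// ... likewise t2 against data source Y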
39. Hybrid approach - example
Join two tables from different data sources, X & Y:

select cust_id, sale_id
from X.sales join Y.customer
using (cust_id)
where Y.customer.city = 'Berlin'
limit 10

EnumerableLimit(fetch=10)
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
40. Hybrid approach - example
Identify the data source pieces. t1 covers the X subtree:

t1: select cust_id, sale_id
    from X.sales

EnumerableLimit(fetch=10)
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])   ← t1
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
41. Hybrid approach - example
Identify the data source pieces. t2 covers the Y subtree:

t2: select cust_id
    from Y.customer
    where city = 'Berlin'

EnumerableLimit(fetch=10)
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])   ← t2
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
42. Hybrid approach - example
Use the Spark SqlDialect to transform the remaining plan into a Spark query:

SparkLimit(fetch=10)
  SparkHashJoin
    SparkTable(table=t1, ds=[X], query=[...])
    SparkTable(table=t2, ds=[Y], query=[...])

select *
from t1 join t2
using (cust_id)
limit 10
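The final conversion and execution might look like this in recent Calcite versions (relNode being the squashed plan; older releases use visitChild instead of visitRoot):

import org.apache.calcite.rel.rel2sql.RelToSqlConverter;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.SparkSqlDialect;

RelToSqlConverter converter = new RelToSqlConverter(SparkSqlDialect.DEFAULT);
SqlNode sqlNode = converter.visitRoot(relNode).asStatement();
String sparkSql = sqlNode.toSqlString(SparkSqlDialect.DEFAULT).getSql();

spark.sql(sparkSql).show(); // executes against the temp views t1 and t2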
43. Conclusion - Hybrid approach
● Full scalability of Spark
● Push-down ability of Calcite
● No bottleneck in an in-memory layer
● Results can be persisted to an external DB / file using Spark
● The approach would also work for other engines (Drill, Flink, Presto)