This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
4. Content
● Intro to Apache Calcite
● Calcite toolbox
● Performance - single data sources
● Federated queries
5. Intro to Apache Calcite
The secret sauce behind lots of existing tech (Hive, Flink, Druid, Phoenix, ...)
6. Intro to Apache Calcite
Important piece in the “deconstructed database” *
* https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
7. Intro to Apache Calcite
Pluggable framework:
● Standard SQL query parsing (lots of different dialects via Babel)
● Query rewriting
● Materializations (even auto-generation with lattices)
● Query optimisation (rule-based and cost-based optimisers)
● Data source connections, either as JDBC drivers or custom adapters
● In-memory execution model
● Streaming SQL
● ...
9. Calcite terminology & architecture
SqlNode: the SQL parse tree constructed by the parser
RelNode: the query plan (a relational algebra AST)
Program: the planner/optimiser
● Cost-based (Volcano)
● Rule-based (Hep)
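To make these pieces concrete, here is a minimal sketch of going from SQL text to a RelNode via Calcite's Frameworks API (assuming a SchemaPlus named rootSchema that you have already populated with tables; checked exceptions collapsed into Exception):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public static RelNode plan(SchemaPlus rootSchema, String sql) throws Exception {
  FrameworkConfig config = Frameworks.newConfigBuilder()
      .defaultSchema(rootSchema)
      .build();
  Planner planner = Frameworks.getPlanner(config);
  SqlNode parsed = planner.parse(sql);          // SqlNode: the parse tree
  SqlNode validated = planner.validate(parsed); // resolve names & types
  return planner.rel(validated).project();      // RelNode: the logical plan
}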
10. Babel parser (SQL parser)
Parse almost any SQL:
● Not dialect-specific parsing
● Avoids making assumptions about keywords
● Handles special syntax (e.g. cast using :: in Postgres)
Useful when:
● Impersonating a database server
● Clients are more familiar with a particular dialect
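Switching to Babel is essentially a parser-factory swap; a sketch assuming the calcite-babel module is on the classpath (older Calcite releases spell the builder SqlParser.configBuilder() instead of SqlParser.config()):

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.parser.babel.SqlBabelParserImpl;

// Babel accepts dialect-specific syntax such as the Postgres :: cast
SqlParser.Config config = SqlParser.config()
    .withParserFactory(SqlBabelParserImpl.FACTORY);
SqlNode node = SqlParser
    .create("select ts::varchar from events", config)
    .parseQuery();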
11. SqlShuttle / RelShuttle (Rewriters)
Uses the visitor pattern to transform one SqlNode/RelNode tree into another
Useful for:
● View rewrites
● Row level security
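For instance, a minimal SqlShuttle that redirects a table reference (say, to a row-level-filtered view; the table names here are purely illustrative, and a real rewriter would also check the identifier's context):

import org.apache.calcite.sql.SqlIdentifier;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.util.SqlShuttle;

// Rewrite references to "pageviews" into "pageviews_secure"
SqlShuttle rewriter = new SqlShuttle() {
  @Override public SqlNode visit(SqlIdentifier id) {
    if (id.isSimple() && "pageviews".equalsIgnoreCase(id.getSimple())) {
      return new SqlIdentifier("pageviews_secure", id.getParserPosition());
    }
    return id;
  }
};
SqlNode rewritten = query.accept(rewriter); // query: the parsed SqlNode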
12. RelBuilder (Plan builder)
● Programmatically build a RelNode
● Skip SQL parsing
Useful when:
● Query is generated by model / UI
● Avoid writing SQL by hand
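A small sketch of building the equivalent of "select name, id from customers where city = 'Berlin'" without any SQL text (config is the same FrameworkConfig used with the planner):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.RelBuilder;

RelBuilder builder = RelBuilder.create(config);
RelNode plan = builder
    .scan("customers")
    .filter(builder.call(SqlStdOperatorTable.EQUALS,
        builder.field("city"), builder.literal("Berlin")))
    .project(builder.field("name"), builder.field("id"))
    .build();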
13. Statistics
Provide metadata about physical tables.
● Number of rows (defaults to 100)
● Column uniqueness
● Distribution of values
Useful for:
● Improving planner cost estimates
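For custom tables, statistics are exposed by overriding Table.getStatistic(); a sketch with made-up numbers (~1M rows, column 0 unique):

import com.google.common.collect.ImmutableList;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.Statistic;
import org.apache.calcite.schema.Statistics;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;
import org.apache.calcite.util.ImmutableBitSet;

public class CustomersTable extends AbstractTable {
  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("id", SqlTypeName.BIGINT)
        .add("name", SqlTypeName.VARCHAR)
        .build();
  }

  // Planner sees ~1M rows and a unique key on column 0, not the 100-row default
  @Override public Statistic getStatistic() {
    return Statistics.of(1_000_000d, ImmutableList.of(ImmutableBitSet.of(0)));
  }
}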
14. Materializations
● Define materialization with SQL
● Automatically rewrite queries to use materializations
1. Materialize pageviews by country & city (pviews_by_loc)
2. Run the query:
select country, city, count(*)
from pageviews
group by country, city
order by country
3. Calcite automatically rewrites it to use the materialization:
select country, city, cnt
from pviews_by_loc
order by country
15. SqlDialect
● Convert to dialect specific SQL string
● Many SQL dialects supported (Oracle, MySQL, Postgres, Hive, …)
● Easy to plug in your own dialect
public class CustomDialect extends SqlDialect {
  public CustomDialect(Context context) { super(context); }

  @Override public boolean supportsNestedAggregations() { return false; }
  @Override public boolean supportsWindowFunctions() { return false; }

  @Override public SqlNode getCastSpec(RelDataType type) {
    // translate type names ...
    return super.getCastSpec(type);
  }

  @Override public void unparseCall(SqlWriter writer, SqlCall call,
      int leftPrec, int rightPrec) {
    // make changes to the SqlCall (SqlNode) ...
    super.unparseCall(writer, call, leftPrec, rightPrec);
  }
}
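Rendering is then a one-liner on any SqlNode; for example, with the built-in Postgres dialect:

import org.apache.calcite.sql.dialect.PostgresqlSqlDialect;

String sql = sqlNode.toSqlString(PostgresqlSqlDialect.DEFAULT).getSql();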
16. Lattices
● Define dimensions on star schemas
● Combinations of dimensions form summary tables
● Materialized summary tables are called tiles
● The trick is figuring out which tiles to pick
20. Measure - benchmark
Use a standard SQL benchmark tool: TPC
TPC-H:
● 22 queries
● configurable data size: 1 GB, 10 GB, 30 GB, ...
● query generators for lots of dialects
The aim is to measure Calcite's overhead, so exact configuration & hardware are not so important here.
22. Query push down
select name, id from customers order by name limit 10
✔ Full query push down:
JdbcToEnumerableConverter
  JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])

✘ Sort and limit in memory:
EnumerableSort(sort0=[$1], dir0=[ASC], fetch=[10])
  JdbcToEnumerableConverter
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])
23. Benchmark runs
4 benchmark runs:
0. Direct against database
1. Query engine - unoptimised (defaults)
2. Query engine - with statistics
3. Query engine - with multi-phase optimiser
28. Benchmark run 2 - use table stats
Results
● Some queries noticeably faster
● Most work still being done in memory
● Query planning is very slow
Next, try: Planning phases
29. Benchmark run 3 - planning phases
Split planning into 2 phases (create separate Calcite Programs):
1. Logical phase:
Use rule-based optimiser (Hep), apply all logical optimisation rules
2. Physical phase:
Use cost-based optimiser (Volcano), apply all physical rules and conversions (JDBC, Druid, …)
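One way to express the two phases with Calcite's Programs helper; logicalRules and physicalRules are placeholders for whichever rule collections you enable:

import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.metadata.DefaultRelMetadataProvider;
import org.apache.calcite.tools.Program;
import org.apache.calcite.tools.Programs;

// Phase 1: rule-based (Hep) pass applying all logical rules
HepProgram hep = new HepProgramBuilder()
    .addRuleCollection(logicalRules)
    .build();
Program logicalPhase =
    Programs.of(hep, false, DefaultRelMetadataProvider.INSTANCE);

// Phase 2: cost-based (Volcano) pass with only physical rules & converters
Program physicalPhase = Programs.ofRules(physicalRules);

Program twoPhase = Programs.sequence(logicalPhase, physicalPhase);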
32. Benchmark run 3 - planning phases
Results
● Planning and execution are much faster
● Full query push down
● Almost all query time being spent in data source
● Some queries are faster than baseline
33. Conclusion - single data source
The goal: full query push down
Fastest way to get there:
● Apply most rules in rule-based optimiser
● Avoid giving cost-based optimiser too many rules
● Rely on underlying data source optimiser
35. Obstacles to efficiency
● Joins of large tables
● Large driver fetchSize
● Aggregations
● Other in-memory computations
Can lead to:
● Slow queries
● OOM
36. Solution
Use Spark as the execution engine:
● Scales out to many nodes
● Avoids OOM for large datasets
● Some Calcite support already exists
Bonus:
● Eases access to data lakes
37. Calcite support for Spark
Currently a few options are available:
● Calcite Spark adapter
○ Translates the Calcite plan => Spark plan
○ Sends the Spark plan to the cluster
○ - Needs more work to make the translation complete
● Calcite SparkSqlDialect
○ Use RelToSqlConverter to generate a query in Spark SQL
○ - Spark completely re-plans/optimises the query
○ - Spark's push downs are not as complete
● Send SQL queries directly to Spark
○ - Can't use the Calcite toolbox
○ - Spark's push downs are not as complete
38. A hybrid approach
1. Generate a Calcite plan
2. Identify pieces related to separate data sources
3. Squash those pieces into "Spark tables"
4. Construct a dataframe for every "Spark table" (sketched below)
5. Convert the remaining plan to Spark SQL
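Steps 3-4 amount to registering each pushed-down piece as a temp view in Spark; a sketch using Spark's JDBC source (jdbcUrlX is a placeholder, and the "query" option needs Spark 2.4+; older versions use "dbtable"):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("federated").getOrCreate();

// t1: the piece pushed down to data source X
Dataset<Row> t1 = spark.read().format("jdbc")
    .option("url", jdbcUrlX)
    .option("query", "select cust_id, sale_id from sales")
    .load();
t1.createOrReplaceTempView("t1");
// ... likewise t2 against data source Y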
39. Hybrid approach - example
Join two tables from different data sources, X & Y:

select cust_id, sale_id
from X.sales join Y.customer
using (cust_id)
where Y.customer.city = 'Berlin'
limit 10

EnumerableLimit(fetch=10)
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
40. Hybrid approach - example
Identify the data source pieces. t1 covers the X subtree:

t1: select cust_id, sale_id
    from X.sales

EnumerableLimit(fetch=10)
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])   ← t1
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
41. Hybrid approach - example
Identify the data source pieces. t2 covers the Y subtree:

t2: select cust_id
    from Y.customer
    where city = 'Berlin'

EnumerableLimit(fetch=10)
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])   ← t2
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
42. Hybrid approach - example
Use the Spark SqlDialect to transform the remaining plan into a Spark query:

SparkLimit(fetch=10)
  SparkHashJoin
    SparkTable(table=t1, ds=[X], query=[...])
    SparkTable(table=t2, ds=[Y], query=[...])

select *
from t1 join t2
using (cust_id)
limit 10
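The final conversion and execution might look like this in recent Calcite versions (relNode being the squashed plan; older releases use visitChild instead of visitRoot):

import org.apache.calcite.rel.rel2sql.RelToSqlConverter;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.SparkSqlDialect;

RelToSqlConverter converter = new RelToSqlConverter(SparkSqlDialect.DEFAULT);
SqlNode sqlNode = converter.visitRoot(relNode).asStatement();
String sparkSql = sqlNode.toSqlString(SparkSqlDialect.DEFAULT).getSql();

spark.sql(sparkSql).show(); // executes against the temp views t1 and t2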
43. Conclusion - Hybrid approach
● Full scalability of Spark
● Push-down ability of Calcite
● No bottleneck in an in-memory layer
● Results can be persisted to an external DB / file using Spark
● The approach would also work for other engines (Drill, Flink, Presto)