SlideShare a Scribd company logo
1 of 109
Download to read offline
><
INTRODUCTIONTO
INTRODUCTION TO APACHE CALCITE
APACHE CALCITE
JORDAN HALTERMAN
1
WHAT IS APACHE
CALCITE?
next 2
><INTRODUCTION TO APACHE CALCITE 3
What is Apache Calcite?
• A framework for building SQL databases
• Developed over more than ten years
• Written in Java
• Previously known as Optiq
• Previously known as Farrago
• Became an Apache project in 2013
• Led by Julian Hyde at Hortonworks
><INTRODUCTION TO APACHE CALCITE 4
Projects using Calcite
• Apache Hive
• Apache Drill
• Apache Flink
• Apache Phoenix
• Apache Samza
• Apache Storm
• Apache everything…
><INTRODUCTION TO APACHE CALCITE 5
What is Apache Calcite?
• SQL parser
• SQL validation
• Query optimizer
• SQL generator
• Data federator
><INTRODUCTION TO APACHE CALCITE
Parse
Queries are parsed using
a JavaCC generated
parser
Validate
Queries are validated
against known database
metadata
Optimize
Logical plans are optimized
and converted into physical
expressions
Execute
P h y s i c a l p l a n s a r e
converted into application-
specific executions
01 02 03 04
Stages of query execution
6
COMPONENTS next 7
><INTRODUCTION TO APACHE CALCITE 8
Components of Calcite
• Catalog - Defines metadata and namespaces
that can be accessed in SQL queries
• SQL parser - Parses valid SQL queries into an
abstract syntax tree (AST)
• SQL validator - Validates abstract syntax trees
against metadata provided by the catalog
• Query optimizer - Converts AST into logical
plans, optimizes logical plans, and converts
logical expressions into physical plans
• SQL generator - Converts physical plans to
SQL
CATALOG next 9
><INTRODUCTION TO APACHE CALCITE 10
Calcite Catalog
• Defines namespaces that can be accessed in Calcite
queries
• Schema
• A collection of schemas and tables
• Can be arbitrarily nested
• Table
• Represents a single data set
• Fields defined by a RelDataType
• RelDataType
• Represents fields in a data set
• Supports all SQL data types, including structs and
><INTRODUCTION TO APACHE CALCITE 11
Schema
• A collection of schemas and tables
• Schemas can be arbitrarily nested
><INTRODUCTION TO APACHE CALCITE 12
Schema
• A collection of schemas and tables
• Schemas can be arbitrarily nested
><INTRODUCTION TO APACHE CALCITE 13
Table
• Represents a single data set
• Fields are defined by a RelDataType
><INTRODUCTION TO APACHE CALCITE 14
Table
• Represents a single data set
• Fields are defined by a RelDataType
><INTRODUCTION TO APACHE CALCITE 15
RelDataType
• Represents the data type of an object
• Supports all SQL data types, including
structs and arrays
• Similar to Spark’s DataType
><INTRODUCTION TO APACHE CALCITE 16
RelDataType
><INTRODUCTION TO APACHE CALCITE 17
RelDataType
data type enum
><INTRODUCTION TO APACHE CALCITE 18
Statistic
• Provide table statistics used in optimization
><INTRODUCTION TO APACHE CALCITE 19
Statistic
• Provide table statistics used in optimization
><INTRODUCTION TO APACHE CALCITE 20
Usage of the Calcite catalog
><INTRODUCTION TO APACHE CALCITE 21
Usage of the Calcite catalog
schema
><INTRODUCTION TO APACHE CALCITE 22
Usage of the Calcite catalog
schema table
><INTRODUCTION TO APACHE CALCITE 23
Usage of the Calcite catalog
schema table
data type
><INTRODUCTION TO APACHE CALCITE 24
Usage of the Calcite catalog
schema table
data typedata type field
SQL PARSER next 25
><INTRODUCTION TO APACHE CALCITE 26
Calcite SQL parser
• LL(k) parser written in JavaCC
• Input queries are parsed into an abstract
syntax tree (AST)
• Tokens are represented in Calcite by
SqlNode
• SqlNode can also be converted back to a
SQL string via the unparse method
><INTRODUCTION TO APACHE CALCITE 27
JavaCC
• Java Compiler Compiler
• Created in 1996 at Sun Microsystems
• Generates Java code from a domain-
specific language
• ANTLR is the modern alternative used in
projects like Hive and Drill
• JavaCC has sparse documentation
><INTRODUCTION TO APACHE CALCITE 28
JavaCC
><INTRODUCTION TO APACHE CALCITE 29
JavaCC
><INTRODUCTION TO APACHE CALCITE 30
JavaCC
tokens
><INTRODUCTION TO APACHE CALCITE 31
JavaCC
tokens
or
><INTRODUCTION TO APACHE CALCITE 32
JavaCC
tokens
or
function call
><INTRODUCTION TO APACHE CALCITE 33
JavaCC
tokens
Java code
or
function call
><INTRODUCTION TO APACHE CALCITE 34
SqlNode
• SqlNode represents an element in an
abstract syntax tree
><INTRODUCTION TO APACHE CALCITE 35
SqlNode
• SqlNode represents an element in an
abstract syntax tree
select
><INTRODUCTION TO APACHE CALCITE 36
SqlNode
• SqlNode represents an element in an
abstract syntax tree
identifiersselect
><INTRODUCTION TO APACHE CALCITE 37
SqlNode
• SqlNode represents an element in an
abstract syntax tree
identifiersselect operator
><INTRODUCTION TO APACHE CALCITE 38
SqlNode
• SqlNode represents an element in an
abstract syntax tree
identifiersselect operator identifier
><INTRODUCTION TO APACHE CALCITE 39
SqlNode
• SqlNode represents an element in an
abstract syntax tree
identifiersselect operator identifier
data type
><INTRODUCTION TO APACHE CALCITE 40
SqlNode
• SqlNode represents an element in an
abstract syntax tree
identifiersselect operator identifier
data type
identifier
><INTRODUCTION TO APACHE CALCITE 41
SqlNode
• SqlNode’s unparse method converts a
SQL element back into a string
><INTRODUCTION TO APACHE CALCITE 42
SqlNode
• SqlNode’s unparse method converts a
SQL element back into a string
><INTRODUCTION TO APACHE CALCITE 43
SqlNode
><INTRODUCTION TO APACHE CALCITE 44
SqlNode
• SqlDialect indicates the capitalization
and quoting rules of specific databases
><INTRODUCTION TO APACHE CALCITE 45
SqlNode
• SqlDialect indicates the capitalization
and quoting rules of specific databases
QUERY OPTIMIZER next 46
><INTRODUCTION TO APACHE CALCITE 47
Query Plans
• Query plans represent the steps necessary
to execute a query
><INTRODUCTION TO APACHE CALCITE 48
Query Plans
• Query plans represent the steps necessary
to execute a query
><INTRODUCTION TO APACHE CALCITE 49
Query Plans
• Query plans represent the steps necessary
to execute a query
table scan
table
scan
><INTRODUCTION TO APACHE CALCITE 50
Query Plans
• Query plans represent the steps necessary
to execute a query
inner join table scan
table
scan
><INTRODUCTION TO APACHE CALCITE 51
Query Plans
• Query plans represent the steps necessary
to execute a query
filter
inner join table scan
table
scan
><INTRODUCTION TO APACHE CALCITE 52
Query Plans
• Query plans represent the steps necessary
to execute a query
filter
inner join
project
table scan
table
scan
><INTRODUCTION TO APACHE CALCITE 53
Query Plans
• Query plans represent the steps necessary
to execute a query
filter
inner join
project
table scan
table
scan
><INTRODUCTION TO APACHE CALCITE 54
Query Optimization
• Optimize logical plan
• Goal is typically to try to reduce the amount
of data that must be processed early in the
plan
• Convert logical plan into a physical plan
• Physical plan is engine specific and
represents the physical execution stages
><INTRODUCTION TO APACHE CALCITE 55
Query Optimization
• Prune unused fields
• Merge projections
• Convert subqueries to joins
• Reorder joins
• Push down projections
• Push down filters
><INTRODUCTION TO APACHE CALCITE 56
Query Optimization
><INTRODUCTION TO APACHE CALCITE 57
Query Optimization
><INTRODUCTION TO APACHE CALCITE 58
Query Optimization
><INTRODUCTION TO APACHE CALCITE 59
Query Optimization
push down
project
><INTRODUCTION TO APACHE CALCITE 60
Query Optimization
push down
project
push down
filter
><INTRODUCTION TO APACHE CALCITE 61
Query Optimization
><INTRODUCTION TO APACHE CALCITE 62
Key Concepts
Relational algebra
Row expressions
Traits
Conventions
Rules
Planners
Programs
><INTRODUCTION TO APACHE CALCITE 63
Key Concepts
Relational algebra
Row expressions
Traits
Conventions
Rules
Planners
Programs
RelNode
RexNode
RelTrait
Convention
RelOptRule
RelOptPlanner
Program
><INTRODUCTION TO APACHE CALCITE 64
Relational Algebra
• RelNode represents a relational expression
• Largely equivalent to Spark’s DataFrame
methods
• Logical algebra
• Physical algebra
><INTRODUCTION TO APACHE CALCITE 65
Relational Algebra
TableScan
Project
Filter
Aggregate
Join
Union
Intersect
Sort
><INTRODUCTION TO APACHE CALCITE 66
Relational Algebra
TableScan
Project
Filter
Aggregate
Join
Union
Intersect
Sort
SparkTableScan
SparkProject
SparkFilter
SparkAggregate
SparkJoin
SparkUnion
SparkIntersect
SparkSort
><INTRODUCTION TO APACHE CALCITE 67
Row Expressions
• RexNode represents a row-level expression
• Largely equivalent to Spark’s Column
functions
• Projection fields
• Filter condition
• Join condition
• Sort fields
><INTRODUCTION TO APACHE CALCITE 68
Row Expressions
Input column ref
Literal
Struct field access
Function call
Window expression
><INTRODUCTION TO APACHE CALCITE 69
Row Expressions
Input column ref
Literal
Struct field access
Function call
Window expression
RexInputRef
RexLiteral
RexFieldAccess
RexCall
RexOver
><INTRODUCTION TO APACHE CALCITE 70
Row Expressions
><INTRODUCTION TO APACHE CALCITE 71
Row Expressions
input ref
><INTRODUCTION TO APACHE CALCITE 72
Row Expressions
input ref
function call
><INTRODUCTION TO APACHE CALCITE 73
Traits
• Defined by the RelTrait interface
• Represent a trait of a relational expression
that does not alter execution
• Traits are used to validate plan output
• Three primary trait types:
• Convention
• RelCollation
• RelDistribution
><INTRODUCTION TO APACHE CALCITE 74
Conventions
• Convention is a type of RelTrait
• A Convention is associated with a
RelNode interface
• SparkConvention, JdbcConvention,
EnumerableConvention, etc
• Conventions are used to represent a single
data source
• Inputs to a relational expression must be in
the same convention
><INTRODUCTION TO APACHE CALCITE 75
Conventions
><INTRODUCTION TO APACHE CALCITE 76
Conventions
Spark convention
><INTRODUCTION TO APACHE CALCITE 77
Conventions
Spark convention
JDBC
convention
><INTRODUCTION TO APACHE CALCITE 78
Conventions
Spark convention
JDBC
convention
converter
><INTRODUCTION TO APACHE CALCITE 79
Rules
• Rules are used to modify query plans
• Defined by the RelOptRule interface
• Two types of rules: converters and
transformers
• Converter rules implement Converter and
convert from one convention to another
• Rules are matched to elements of a query
plan using pattern matching
• onMatch is called for matched rules
• Converter rules applied via convert
><INTRODUCTION TO APACHE CALCITE 80
Converter Rule
><INTRODUCTION TO APACHE CALCITE 81
Converter Rule
expression type
><INTRODUCTION TO APACHE CALCITE 82
Converter Rule
expression type
input convention
><INTRODUCTION TO APACHE CALCITE 83
Converter Rule
expression type
input convention
converted convention
><INTRODUCTION TO APACHE CALCITE 84
Converter Rule
expression type
input convention
converted convention
converter function
><INTRODUCTION TO APACHE CALCITE 85
Pattern Matching
><INTRODUCTION TO APACHE CALCITE 86
Pattern Matching
><INTRODUCTION TO APACHE CALCITE 87
Pattern Matching
no match
:-(
><INTRODUCTION TO APACHE CALCITE 88
Pattern Matching
no match
:-(
><INTRODUCTION TO APACHE CALCITE 89
Pattern Matching
match!
no match
:-(
><INTRODUCTION TO APACHE CALCITE 90
Planners
• Planners implement the RelOptPlanner
interface
• Two types of planners:
• HepPlanner
• VolcanoPlanner
><INTRODUCTION TO APACHE CALCITE 91
Heuristic Optimization
• HepPlanner is a heuristic optimizer similar
to Spark’s optimizer
• Applies all matching rules until none can be
applied
• Heuristic optimization is faster than cost-
based optimization
• Risk of infinite recursion if rules make
opposing changes to the plan
><INTRODUCTION TO APACHE CALCITE 92
Cost-based Optimization
• VolcanoPlanner is a cost-based
optimizer
• Applies matching rules iteratively, selecting
the plan with the cheapest cost on each
iteration
• Costs are provided by relational expressions
• Not all possible plans can be computed
• Stops optimization when the cost does not
significantly improve through a determinable
number of iterations
><INTRODUCTION TO APACHE CALCITE 93
Cost-based Optimization
• Cost is provided by each RelNode
• Cost is represented by RelOptCost
• Cost typically includes row count, I/O, and
CPU cost
• Cost estimates are relative
• Statistics are used to improve accuracy of
cost estimations
• Calcite provides utilities for computing
various resource-related statistics for use in
cost estimations
><INTRODUCTION TO APACHE CALCITE 94
Cost-based Optimization
><INTRODUCTION TO APACHE CALCITE 95
Cost-based Optimization
><INTRODUCTION TO APACHE CALCITE 96
Cost-based Optimization
><INTRODUCTION TO APACHE CALCITE 97
Cost-based Optimization
><INTRODUCTION TO APACHE CALCITE 98
Cost-based Optimization
PUTTING IT ALL
TOGETHER
next 99
><INTRODUCTION TO APACHE CALCITE 100
Putting it all together
><INTRODUCTION TO APACHE CALCITE 101
Putting it all together
><INTRODUCTION TO APACHE CALCITE 102
Putting it all together
><INTRODUCTION TO APACHE CALCITE 103
Putting it all together
><INTRODUCTION TO APACHE CALCITE 104
Putting it all together
><INTRODUCTION TO APACHE CALCITE 105
Putting it all together
><INTRODUCTION TO APACHE CALCITE 106
Putting it all together
><INTRODUCTION TO APACHE CALCITE 107
Putting it all together
><INTRODUCTION TO APACHE CALCITE 108
Putting it all together
><INTRODUCTION TO APACHE CALCITE 109
Putting it all together

More Related Content

What's hot

Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 

What's hot (20)

Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 

Similar to Introduction to Apache Calcite

Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Aman Sinha
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaSpark Summit
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 
PL/SQL Tips and Techniques Webinar Presentation
PL/SQL Tips and Techniques Webinar PresentationPL/SQL Tips and Techniques Webinar Presentation
PL/SQL Tips and Techniques Webinar PresentationEmbarcadero Technologies
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKai Wähner
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Dan Hotka's Top 10 Oracle 12c New Features
Dan Hotka's Top 10 Oracle 12c New FeaturesDan Hotka's Top 10 Oracle 12c New Features
Dan Hotka's Top 10 Oracle 12c New FeaturesEmbarcadero Technologies
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4boxu42
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectSaltlux Inc.
 
3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sqlŁukasz Grala
 
Data Science at Scale with Apache Spark and Zeppelin Notebook
Data Science at Scale with Apache Spark and Zeppelin NotebookData Science at Scale with Apache Spark and Zeppelin Notebook
Data Science at Scale with Apache Spark and Zeppelin NotebookCarolyn Duby
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!HostedbyConfluent
 
Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
Pydata london meetup - RiakTS, PySpark and Python by Stephen EtheridgePydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
Pydata london meetup - RiakTS, PySpark and Python by Stephen EtheridgeEmmanuel Marchal
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Fabian Hueske
 

Similar to Introduction to Apache Calcite (20)

Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
PL/SQL Tips and Techniques Webinar Presentation
PL/SQL Tips and Techniques Webinar PresentationPL/SQL Tips and Techniques Webinar Presentation
PL/SQL Tips and Techniques Webinar Presentation
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
KSQL Intro
KSQL IntroKSQL Intro
KSQL Intro
 
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Dan Hotka's Top 10 Oracle 12c New Features
Dan Hotka's Top 10 Oracle 12c New FeaturesDan Hotka's Top 10 Oracle 12c New Features
Dan Hotka's Top 10 Oracle 12c New Features
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
 
3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql3 CityNetConf - sql+c#=u-sql
3 CityNetConf - sql+c#=u-sql
 
Data Science at Scale with Apache Spark and Zeppelin Notebook
Data Science at Scale with Apache Spark and Zeppelin NotebookData Science at Scale with Apache Spark and Zeppelin Notebook
Data Science at Scale with Apache Spark and Zeppelin Notebook
 
Spark sql meetup
Spark sql meetupSpark sql meetup
Spark sql meetup
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!
 
Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
Pydata london meetup - RiakTS, PySpark and Python by Stephen EtheridgePydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 

Recently uploaded

OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 

Recently uploaded (20)

OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 

Introduction to Apache Calcite

  • 1. >< INTRODUCTIONTO INTRODUCTION TO APACHE CALCITE APACHE CALCITE JORDAN HALTERMAN 1
  • 3. ><INTRODUCTION TO APACHE CALCITE 3 What is Apache Calcite? • A framework for building SQL databases • Developed over more than ten years • Written in Java • Previously known as Optiq • Previously known as Farrago • Became an Apache project in 2013 • Led by Julian Hyde at Hortonworks
  • 4. ><INTRODUCTION TO APACHE CALCITE 4 Projects using Calcite • Apache Hive • Apache Drill • Apache Flink • Apache Phoenix • Apache Samza • Apache Storm • Apache everything…
  • 5. ><INTRODUCTION TO APACHE CALCITE 5 What is Apache Calcite? • SQL parser • SQL validation • Query optimizer • SQL generator • Data federator
  • 6. ><INTRODUCTION TO APACHE CALCITE Parse Queries are parsed using a JavaCC generated parser Validate Queries are validated against known database metadata Optimize Logical plans are optimized and converted into physical expressions Execute P h y s i c a l p l a n s a r e converted into application- specific executions 01 02 03 04 Stages of query execution 6
  • 8. ><INTRODUCTION TO APACHE CALCITE 8 Components of Calcite • Catalog - Defines metadata and namespaces that can be accessed in SQL queries • SQL parser - Parses valid SQL queries into an abstract syntax tree (AST) • SQL validator - Validates abstract syntax trees against metadata provided by the catalog • Query optimizer - Converts AST into logical plans, optimizes logical plans, and converts logical expressions into physical plans • SQL generator - Converts physical plans to SQL
  • 10. ><INTRODUCTION TO APACHE CALCITE 10 Calcite Catalog • Defines namespaces that can be accessed in Calcite queries • Schema • A collection of schemas and tables • Can be arbitrarily nested • Table • Represents a single data set • Fields defined by a RelDataType • RelDataType • Represents fields in a data set • Supports all SQL data types, including structs and
  • 11. ><INTRODUCTION TO APACHE CALCITE 11 Schema • A collection of schemas and tables • Schemas can be arbitrarily nested
  • 12. ><INTRODUCTION TO APACHE CALCITE 12 Schema • A collection of schemas and tables • Schemas can be arbitrarily nested
  • 13. ><INTRODUCTION TO APACHE CALCITE 13 Table • Represents a single data set • Fields are defined by a RelDataType
  • 14. ><INTRODUCTION TO APACHE CALCITE 14 Table • Represents a single data set • Fields are defined by a RelDataType
  • 15. ><INTRODUCTION TO APACHE CALCITE 15 RelDataType • Represents the data type of an object • Supports all SQL data types, including structs and arrays • Similar to Spark’s DataType
  • 16. ><INTRODUCTION TO APACHE CALCITE 16 RelDataType
  • 17. ><INTRODUCTION TO APACHE CALCITE 17 RelDataType data type enum
  • 18. ><INTRODUCTION TO APACHE CALCITE 18 Statistic • Provide table statistics used in optimization
  • 19. ><INTRODUCTION TO APACHE CALCITE 19 Statistic • Provide table statistics used in optimization
  • 20. ><INTRODUCTION TO APACHE CALCITE 20 Usage of the Calcite catalog
  • 21. ><INTRODUCTION TO APACHE CALCITE 21 Usage of the Calcite catalog schema
  • 22. ><INTRODUCTION TO APACHE CALCITE 22 Usage of the Calcite catalog schema table
  • 23. ><INTRODUCTION TO APACHE CALCITE 23 Usage of the Calcite catalog schema table data type
  • 24. ><INTRODUCTION TO APACHE CALCITE 24 Usage of the Calcite catalog schema table data typedata type field
  • 26. ><INTRODUCTION TO APACHE CALCITE 26 Calcite SQL parser • LL(k) parser written in JavaCC • Input queries are parsed into an abstract syntax tree (AST) • Tokens are represented in Calcite by SqlNode • SqlNode can also be converted back to a SQL string via the unparse method
  • 27. ><INTRODUCTION TO APACHE CALCITE 27 JavaCC • Java Compiler Compiler • Created in 1996 at Sun Microsystems • Generates Java code from a domain- specific language • ANTLR is the modern alternative used in projects like Hive and Drill • JavaCC has sparse documentation
  • 28. ><INTRODUCTION TO APACHE CALCITE 28 JavaCC
  • 29. ><INTRODUCTION TO APACHE CALCITE 29 JavaCC
  • 30. ><INTRODUCTION TO APACHE CALCITE 30 JavaCC tokens
  • 31. ><INTRODUCTION TO APACHE CALCITE 31 JavaCC tokens or
  • 32. ><INTRODUCTION TO APACHE CALCITE 32 JavaCC tokens or function call
  • 33. ><INTRODUCTION TO APACHE CALCITE 33 JavaCC tokens Java code or function call
  • 34. ><INTRODUCTION TO APACHE CALCITE 34 SqlNode • SqlNode represents an element in an abstract syntax tree
  • 35. ><INTRODUCTION TO APACHE CALCITE 35 SqlNode • SqlNode represents an element in an abstract syntax tree select
  • 36. ><INTRODUCTION TO APACHE CALCITE 36 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect
  • 37. ><INTRODUCTION TO APACHE CALCITE 37 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect operator
  • 38. ><INTRODUCTION TO APACHE CALCITE 38 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect operator identifier
  • 39. ><INTRODUCTION TO APACHE CALCITE 39 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect operator identifier data type
  • 40. ><INTRODUCTION TO APACHE CALCITE 40 SqlNode • SqlNode represents an element in an abstract syntax tree identifiersselect operator identifier data type identifier
  • 41. ><INTRODUCTION TO APACHE CALCITE 41 SqlNode • SqlNode’s unparse method converts a SQL element back into a string
  • 42. ><INTRODUCTION TO APACHE CALCITE 42 SqlNode • SqlNode’s unparse method converts a SQL element back into a string
  • 43. ><INTRODUCTION TO APACHE CALCITE 43 SqlNode
  • 44. ><INTRODUCTION TO APACHE CALCITE 44 SqlNode • SqlDialect indicates the capitalization and quoting rules of specific databases
  • 45. ><INTRODUCTION TO APACHE CALCITE 45 SqlNode • SqlDialect indicates the capitalization and quoting rules of specific databases
  • 47. ><INTRODUCTION TO APACHE CALCITE 47 Query Plans • Query plans represent the steps necessary to execute a query
  • 48. ><INTRODUCTION TO APACHE CALCITE 48 Query Plans • Query plans represent the steps necessary to execute a query
  • 49. ><INTRODUCTION TO APACHE CALCITE 49 Query Plans • Query plans represent the steps necessary to execute a query table scan table scan
  • 50. ><INTRODUCTION TO APACHE CALCITE 50 Query Plans • Query plans represent the steps necessary to execute a query inner join table scan table scan
  • 51. ><INTRODUCTION TO APACHE CALCITE 51 Query Plans • Query plans represent the steps necessary to execute a query filter inner join table scan table scan
  • 52. ><INTRODUCTION TO APACHE CALCITE 52 Query Plans • Query plans represent the steps necessary to execute a query filter inner join project table scan table scan
  • 53. ><INTRODUCTION TO APACHE CALCITE 53 Query Plans • Query plans represent the steps necessary to execute a query filter inner join project table scan table scan
  • 54. ><INTRODUCTION TO APACHE CALCITE 54 Query Optimization • Optimize logical plan • Goal is typically to try to reduce the amount of data that must be processed early in the plan • Convert logical plan into a physical plan • Physical plan is engine specific and represents the physical execution stages
  • 55. ><INTRODUCTION TO APACHE CALCITE 55 Query Optimization • Prune unused fields • Merge projections • Convert subqueries to joins • Reorder joins • Push down projections • Push down filters
  • 56. ><INTRODUCTION TO APACHE CALCITE 56 Query Optimization
  • 57. ><INTRODUCTION TO APACHE CALCITE 57 Query Optimization
  • 58. ><INTRODUCTION TO APACHE CALCITE 58 Query Optimization
  • 59. ><INTRODUCTION TO APACHE CALCITE 59 Query Optimization push down project
  • 60. ><INTRODUCTION TO APACHE CALCITE 60 Query Optimization push down project push down filter
  • 61. ><INTRODUCTION TO APACHE CALCITE 61 Query Optimization
  • 62. ><INTRODUCTION TO APACHE CALCITE 62 Key Concepts Relational algebra Row expressions Traits Conventions Rules Planners Programs
  • 63. ><INTRODUCTION TO APACHE CALCITE 63 Key Concepts Relational algebra Row expressions Traits Conventions Rules Planners Programs RelNode RexNode RelTrait Convention RelOptRule RelOptPlanner Program
  • 64. ><INTRODUCTION TO APACHE CALCITE 64 Relational Algebra • RelNode represents a relational expression • Largely equivalent to Spark’s DataFrame methods • Logical algebra • Physical algebra
  • 65. ><INTRODUCTION TO APACHE CALCITE 65 Relational Algebra TableScan Project Filter Aggregate Join Union Intersect Sort
  • 66. ><INTRODUCTION TO APACHE CALCITE 66 Relational Algebra TableScan Project Filter Aggregate Join Union Intersect Sort SparkTableScan SparkProject SparkFilter SparkAggregate SparkJoin SparkUnion SparkIntersect SparkSort
  • 67. ><INTRODUCTION TO APACHE CALCITE 67 Row Expressions • RexNode represents a row-level expression • Largely equivalent to Spark’s Column functions • Projection fields • Filter condition • Join condition • Sort fields
  • 68. ><INTRODUCTION TO APACHE CALCITE 68 Row Expressions Input column ref Literal Struct field access Function call Window expression
  • 69. ><INTRODUCTION TO APACHE CALCITE 69 Row Expressions Input column ref Literal Struct field access Function call Window expression RexInputRef RexLiteral RexFieldAccess RexCall RexOver
  • 70. ><INTRODUCTION TO APACHE CALCITE 70 Row Expressions
  • 71. ><INTRODUCTION TO APACHE CALCITE 71 Row Expressions input ref
  • 72. ><INTRODUCTION TO APACHE CALCITE 72 Row Expressions input ref function call
  • 73. ><INTRODUCTION TO APACHE CALCITE 73 Traits • Defined by the RelTrait interface • Represent a trait of a relational expression that does not alter execution • Traits are used to validate plan output • Three primary trait types: • Convention • RelCollation • RelDistribution
  • 74. ><INTRODUCTION TO APACHE CALCITE 74 Conventions • Convention is a type of RelTrait • A Convention is associated with a RelNode interface • SparkConvention, JdbcConvention, EnumerableConvention, etc • Conventions are used to represent a single data source • Inputs to a relational expression must be in the same convention
  • 75. ><INTRODUCTION TO APACHE CALCITE 75 Conventions
  • 76. ><INTRODUCTION TO APACHE CALCITE 76 Conventions Spark convention
  • 77. ><INTRODUCTION TO APACHE CALCITE 77 Conventions Spark convention JDBC convention
  • 78. ><INTRODUCTION TO APACHE CALCITE 78 Conventions Spark convention JDBC convention converter
  • 79. ><INTRODUCTION TO APACHE CALCITE 79 Rules • Rules are used to modify query plans • Defined by the RelOptRule interface • Two types of rules: converters and transformers • Converter rules implement Converter and convert from one convention to another • Rules are matched to elements of a query plan using pattern matching • onMatch is called for matched rules • Converter rules applied via convert
  • 80. ><INTRODUCTION TO APACHE CALCITE 80 Converter Rule
  • 81. ><INTRODUCTION TO APACHE CALCITE 81 Converter Rule expression type
  • 82. ><INTRODUCTION TO APACHE CALCITE 82 Converter Rule expression type input convention
  • 83. ><INTRODUCTION TO APACHE CALCITE 83 Converter Rule expression type input convention converted convention
  • 84. ><INTRODUCTION TO APACHE CALCITE 84 Converter Rule expression type input convention converted convention converter function
  • 85. ><INTRODUCTION TO APACHE CALCITE 85 Pattern Matching
  • 86. ><INTRODUCTION TO APACHE CALCITE 86 Pattern Matching
  • 87. ><INTRODUCTION TO APACHE CALCITE 87 Pattern Matching no match :-(
  • 88. ><INTRODUCTION TO APACHE CALCITE 88 Pattern Matching no match :-(
  • 89. ><INTRODUCTION TO APACHE CALCITE 89 Pattern Matching match! no match :-(
  • 90. ><INTRODUCTION TO APACHE CALCITE 90 Planners • Planners implement the RelOptPlanner interface • Two types of planners: • HepPlanner • VolcanoPlanner
  • 91. ><INTRODUCTION TO APACHE CALCITE 91 Heuristic Optimization • HepPlanner is a heuristic optimizer similar to Spark’s optimizer • Applies all matching rules until none can be applied • Heuristic optimization is faster than cost- based optimization • Risk of infinite recursion if rules make opposing changes to the plan
  • 92. ><INTRODUCTION TO APACHE CALCITE 92 Cost-based Optimization • VolcanoPlanner is a cost-based optimizer • Applies matching rules iteratively, selecting the plan with the cheapest cost on each iteration • Costs are provided by relational expressions • Not all possible plans can be computed • Stops optimization when the cost does not significantly improve through a determinable number of iterations
  • 93. ><INTRODUCTION TO APACHE CALCITE 93 Cost-based Optimization • Cost is provided by each RelNode • Cost is represented by RelOptCost • Cost typically includes row count, I/O, and CPU cost • Cost estimates are relative • Statistics are used to improve accuracy of cost estimations • Calcite provides utilities for computing various resource-related statistics for use in cost estimations
  • 94. ><INTRODUCTION TO APACHE CALCITE 94 Cost-based Optimization
  • 95. ><INTRODUCTION TO APACHE CALCITE 95 Cost-based Optimization
  • 96. ><INTRODUCTION TO APACHE CALCITE 96 Cost-based Optimization
  • 97. ><INTRODUCTION TO APACHE CALCITE 97 Cost-based Optimization
  • 98. ><INTRODUCTION TO APACHE CALCITE 98 Cost-based Optimization
  • 100. ><INTRODUCTION TO APACHE CALCITE 100 Putting it all together
  • 101. ><INTRODUCTION TO APACHE CALCITE 101 Putting it all together
  • 102. ><INTRODUCTION TO APACHE CALCITE 102 Putting it all together
  • 103. ><INTRODUCTION TO APACHE CALCITE 103 Putting it all together
  • 104. ><INTRODUCTION TO APACHE CALCITE 104 Putting it all together
  • 105. ><INTRODUCTION TO APACHE CALCITE 105 Putting it all together
  • 106. ><INTRODUCTION TO APACHE CALCITE 106 Putting it all together
  • 107. ><INTRODUCTION TO APACHE CALCITE 107 Putting it all together
  • 108. ><INTRODUCTION TO APACHE CALCITE 108 Putting it all together
  • 109. ><INTRODUCTION TO APACHE CALCITE 109 Putting it all together