Apache Kylin 2.0
From classic OLAP to real-time data warehouse
李扬 | Li, Yang
Co-founder & CTO
All rights reserved ©Kyligence Inc.
http://kyligence.io
What is Apache Kylin
[Diagram: BI and visualization tools (interactive reporting, dashboards) on top of Apache Kylin, the OLAP engine on Hadoop (Hive, HBase, HDFS)]
- 3,000 billion rows, < 1 sec query latency
@toutiao.com, top 1 news feed app in China
- 60+ dimensions cube
@CPIC, top 3 insurance group in China
- JDBC / ODBC / REST API
- BI integration
Why Kylin is Fast
select
l_returnflag,
o_orderstatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price
…
from
v_lineitem
inner join v_orders on l_orderkey = o_orderkey
where
l_shipdate <= '1998-09-16'
group by
l_returnflag,
o_orderstatus
order by
l_returnflag,
o_orderstatus;
A sample query: report revenue by “returnflag” and “orderstatus” over a time period.
[Diagram: query plan over the tables with Join, Filter, Aggr. and Sort steps; online cost is O(N)]
Why Kylin is Fast
[Diagram: the online plan (Tables, Join, Filter, Aggr., Sort) costs O(N); with a precalculated cuboid the plan shrinks to Cuboid, Filter, Sort]
O(flag x status x days) = O(1)
Precalculate the Kylin Cube
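The idea on this slide can be sketched in a few lines of Python. This is an illustration only, not Kylin's build or query engine: the sample rows, the `build_cuboid` and `query` helpers and the in-memory dict are all assumptions. The cuboid holds at most flags x statuses x days entries, so query cost is bounded by that product rather than by the number of fact rows.

```python
from collections import defaultdict

# Hypothetical joined fact rows: (returnflag, orderstatus, shipdate, quantity)
ROWS = [
    ("A", "F", "1998-09-01", 10),
    ("A", "F", "1998-09-01", 5),
    ("N", "O", "1998-09-10", 7),
    ("N", "O", "1998-09-20", 3),
    ("R", "F", "1998-08-15", 4),
]


def build_cuboid(rows):
    """Precalculate sum(quantity) per (flag, status, day): one O(N) pass at build time."""
    cuboid = defaultdict(int)
    for flag, status, day, qty in rows:
        cuboid[(flag, status, day)] += qty
    return dict(cuboid)


def query(cuboid, max_day):
    """Answer 'sum_qty by flag, status where shipdate <= max_day' from the cuboid only."""
    result = defaultdict(int)
    for (flag, status, day), qty in cuboid.items():
        if day <= max_day:  # ISO dates compare correctly as strings
            result[(flag, status)] += qty
    return sorted(result.items())


CUBOID = build_cuboid(ROWS)
RESULT = query(CUBOID, "1998-09-16")
print(RESULT)
```

Adding more fact rows for the same flags, statuses and days makes the build pass longer but leaves the cuboid, and therefore the query, the same size.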
Kylin is about Precalculation
[Diagram: cuboid lattice over the dimensions time, item, location, supplier]
0-D (apex) cuboid: (no dimensions)
1-D cuboids: time; item; location; supplier
2-D cuboids: time, item; time, location; time, supplier; item, location; item, supplier; location, supplier
3-D cuboids: time, item, location; time, item, supplier; time, location, supplier; item, location, supplier
4-D (base) cuboid: time, item, location, supplier
- Based on the Cube Theory
- Model and Cube define the space of precalculation
- Build Engine carries out the precalculation
- Query Engine runs SQL on the precalculated result
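The space of precalculation for the four dimensions on this slide can be enumerated directly. The sketch below is illustrative (the `cuboids_by_layer` helper is a made-up name, not a Kylin API). It shows that a full cube over n dimensions has 2^n cuboids, grouped into layers by dimension count.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids_by_layer(dimensions):
    """Return {k: [cuboids with k dimensions]} for k = 0..len(dimensions)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


LATTICE = cuboids_by_layer(DIMENSIONS)
for k, layer in LATTICE.items():
    print(f"{k}-D: {len(layer)} cuboid(s)")

# Total number of cuboids in the full lattice
TOTAL = sum(len(layer) for layer in LATTICE.values())
print("total:", TOTAL)
```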
O(1) regardless of Data Size
[Chart: response time vs. data size. Online calculation grows as O(N); Apache Kylin stays at O(1).]
Snowflake Schema Support
Runs TPC-H benchmark.
Kylin 1.0 Star Schema Limitation
Precalculate the join of 1 level of lookups
- Support star schema only
- Does not allow same-named columns from different tables
- Does not allow a table to join itself
- Difficult to support real-world business cases
[Diagram: star schema. The fact table LINEORDER joins the lookup tables DATES, PART, CUSTOMER and SUPPLIER.]
Kylin 2.0 Snowflake Schema
Precalculate unlimited levels of lookups
- Snowflake schema support (KYLIN-1875)
- Allows a table to be joined multiple times
- Big metadata change at Model level
- Many bug fixes regarding joins and sub-queries
- Supports complex models of any kind and flexible queries on the models
[Diagram: snowflake schema. ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION and REGION connected by joins.]
TPC-H on Kylin 2.0
TPC-H is a benchmark for decision support systems.
- Popular among commercial RDBMS & DW solutions
- Queries and data have broad industry-wide relevance
- Examine large volumes of data
- Execute queries with a high degree of complexity
- Give answers to critical business questions
Kylin 2.0 runs all 22 TPC-H queries. (KYLIN-2467)
- Precalculation can answer very complex queries
- Goal is functionality at this stage
- Try it: https://github.com/Kyligence/kylin-tpch
[Diagram: TPC-H schema with ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION and REGION]
Complex Query 1
TPC-H query 07
- 0.17 sec (Hive+Tez 35.23 sec)
- 2 sub-queries
select
supp_nation,
cust_nation,
l_year,
sum(volume) as revenue
from
(
select
n1.n_name as supp_nation,
n2.n_name as cust_nation,
l_shipyear as l_year,
l_saleprice as volume
from
v_lineitem
inner join supplier on s_suppkey = l_suppkey
inner join v_orders on l_orderkey = o_orderkey
inner join customer on o_custkey = c_custkey
inner join nation n1 on s_nationkey = n1.n_nationkey
inner join nation n2 on c_nationkey = n2.n_nationkey
where
(
(n1.n_name = 'KENYA' and n2.n_name = 'PERU')
or (n1.n_name = 'PERU' and n2.n_name = 'KENYA')
)
and l_shipdate between '1995-01-01' and '1996-12-31'
) as shipping
group by
supp_nation,
cust_nation,
l_year
order by
supp_nation,
cust_nation,
l_year
[Diagram: query plan. LINEITEM is joined with SUPPLIER, ORDERS, CUSTOMER and NATION (twice), with Proj., Filter, Aggr. and Sort steps.]
Complex Query 2
TPC-H query 11
- 3.42 sec (Hive+Tez 15.87 sec)
- 4 sub-queries, 1 online join
with q11_part_tmp_cached as (
select
ps_partkey,
sum(ps_partvalue) as part_value
from
v_partsupp
inner join supplier on ps_suppkey = s_suppkey
inner join nation on s_nationkey = n_nationkey
where
n_name = 'GERMANY'
group by ps_partkey
),
q11_sum_tmp_cached as (
select
sum(part_value) as total_value
from
q11_part_tmp_cached
)
select
ps_partkey,
part_value
from (
select
ps_partkey,
part_value,
total_value
from
q11_part_tmp_cached, q11_sum_tmp_cached
) a
where
part_value > total_value * 0.0001
order by
part_value desc;
[Diagram: query plan. Two sub-plans each join PARTSUPP with SUPPLIER and NATION and apply Proj., Filter and Aggr.; their results are combined by a Join, followed by Filter and Sort.]
Complex Query 3
TPC-H query 12
- 7.66 sec (Hive+Tez 12.64 sec)
- 5 sub-queries, 2 online joins
with in_scope_data as(
select
l_shipmode,
o_orderpriority
from
v_lineitem inner join v_orders on l_orderkey = o_orderkey
where
l_shipmode in ('REG AIR', 'MAIL')
and l_receiptdelayed = 1
and l_shipdelayed = 0
and l_receiptdate >= '1995-01-01'
and l_receiptdate < '1996-01-01'
), all_l_shipmode as(
select
distinct l_shipmode
from
in_scope_data
), high_line as(
select
l_shipmode,
count(*) as high_line_count
from
in_scope_data
where
o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH'
group by l_shipmode
), low_line as(
select
l_shipmode,
count(*) as low_line_count
from
in_scope_data
where
o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH'
group by l_shipmode
)
select
al.l_shipmode, hl.high_line_count, ll.low_line_count
from
all_l_shipmode al
left join high_line hl on al.l_shipmode = hl.l_shipmode
left join low_line ll on al.l_shipmode = ll.l_shipmode
order by
al.l_shipmode
[Diagram: query plan. Three sub-plans each join LINEITEM with ORDERS and apply Proj., Filter and Aggr.; their results are combined by two Joins, followed by Filter and Sort.]
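The timings quoted for the three complex queries can be turned into speedup ratios with a few lines of arithmetic. The numbers below are copied from the slides; the `speedup` helper is only for illustration.

```python
# (query, Kylin seconds, Hive+Tez seconds) as quoted on the slides
TIMINGS = [
    ("TPC-H query 07", 0.17, 35.23),
    ("TPC-H query 11", 3.42, 15.87),
    ("TPC-H query 12", 7.66, 12.64),
]


def speedup(kylin_sec, hive_sec):
    """How many times faster Kylin answered than Hive+Tez."""
    return hive_sec / kylin_sec


SPEEDUPS = {name: speedup(k, h) for name, k, h in TIMINGS}
for name, ratio in SPEEDUPS.items():
    print(f"{name}: {ratio:.1f}x")
```

On these three data points the ratio shrinks as the number of online joins grows (none listed, 1, 2). That is consistent with precalculation helping most when little work is left to do at query time, though three queries are too few to generalize from.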
More than MOLAP
- Supports complex data models and sub-queries; Runs TPC-H
- Percentile / Window / Time functions
[Chart: SQL maturity vs. speed, positioning Kylin 1.0, Kylin 2.0, MOLAP, Analytics DW and DW on Hadoop]
Spark Cubing
Halves the build time.
A Bit of History
Kylin 1.5 attempted Spark cubing, but the feature was never released.
- It was a port of the MR in-mem cubing algorithm.
- It exploited memory to build the whole cube in one round.
- No improvement was observed.
- Spark did nothing differently from MR.
Spark Cubing in 2.0
[Diagram: RDD-1 through RDD-5, one RDD per layer of cuboids]
Kylin 2.0 did a complete rework based on the layered cubing algorithm.
- Each layer of cuboids is an RDD.
- The parent RDD is cached for the next round.
- Each RDD is exported to a sequence file, the same output format as MR.
- “map” translates to “flatMap” and “reduce” to “reduceByKey”, so most code gets reused.
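The layer-by-layer flow can be sketched without a cluster. The code below is plain Python, not Kylin's or Spark's code: `flat_map` and `reduce_by_key` are small stand-ins for the RDD operations named above, the sample records are invented, and the rule that decides which parent produces which child cuboid is an assumption chosen to avoid double counting.

```python
# Hypothetical base-cuboid records: ((time, item, location), measure)
BASE = [
    (("2017-01", "pen", "SH"), 2),
    (("2017-01", "pen", "BJ"), 3),
    (("2017-02", "ink", "SH"), 5),
]


def flat_map(records, fn):
    """Stand-in for RDD.flatMap: apply fn and flatten the results."""
    return [out for rec in records for out in fn(rec)]


def reduce_by_key(pairs, fn):
    """Stand-in for RDD.reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return list(merged.items())


def drop_one_dimension(record):
    """Emit one child record per dimension that may be aggregated away.

    A dropped dimension is marked None. Only positions after the last
    dropped one are dropped, so every child cuboid is derived from exactly
    one parent cuboid and no measure is counted twice.
    """
    key, measure = record
    dropped = [i for i, v in enumerate(key) if v is None]
    start = dropped[-1] + 1 if dropped else 0
    for i in range(start, len(key)):
        yield key[:i] + (None,) + key[i + 1:], measure


def layered_cubing(base):
    """Build the cube layer by layer; each layer is computed from its parent."""
    layers = [reduce_by_key(base, lambda a, b: a + b)]
    while True:
        children = flat_map(layers[-1], drop_one_dimension)
        if not children:
            break
        layers.append(reduce_by_key(children, lambda a, b: a + b))
    return layers


LAYERS = layered_cubing(BASE)
for depth, layer in enumerate(LAYERS):
    print(f"layer {depth}: {len(layer)} records")
```

In this sketch the `layers` list plays the role of the cached parent RDD: each round reads only the previous layer, never the raw input.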
DAG Calculating the 3rd Layer
Spark Cubing vs. MR Layered Cubing
Halves the build time. The advantage decreases as data size increases.
- 4-node cluster
- Spark 1.6.3 on YARN
- 24 vcores, 30 GB memory
- 3 data sets of increasing size: 0.15 GB / 2.5 GB / 8 GB
Spark Cubing vs. MR In-mem Cubing
Almost as fast, and more adaptable to general data sets.
- In-mem cubing expects sharded data and works poorly on random data sets.
- Spark cubing is more adaptable to different kinds of data distribution.
Near Real-time Streaming
Build latency down to a few minutes.
New in Kylin 1.6
[Diagram: in-mem cubing (new) feeds Kylin; BI tools, web apps… query Kylin with ANSI SQL]
Demo of Twitter Analysis
http://hub.kyligence.io
Incremental build triggers every 2 minutes, build finishes in 3 minutes.
- 8-node cluster on AWS, 3 Kafka brokers
- Twitter sample feed, 10K+ messages per second
- Cube has 9 dimensions and 3 measures
- 2 jobs running at the same time
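The figures on this slide fit together with simple arithmetic. The sketch below is a back-of-the-envelope model, not something Kylin computes: both helper functions are illustrative, and the worst-case latency assumes a message that arrives just after a trigger waits one full interval for the next build and then for that build to finish.

```python
import math


def concurrent_builds(trigger_interval_min, build_duration_min):
    """Maximum number of builds in flight when one starts every interval."""
    return math.ceil(build_duration_min / trigger_interval_min)


def worst_case_latency_min(trigger_interval_min, build_duration_min):
    """Upper bound, under the stated assumption, from arrival to queryable."""
    return trigger_interval_min + build_duration_min


JOBS = concurrent_builds(2, 3)
LATENCY = worst_case_latency_min(2, 3)
print(JOBS, "concurrent builds; worst-case latency", LATENCY, "minutes")
```

With a 2-minute trigger and a 3-minute build this gives 2 overlapping jobs, matching the slide, and a latency of a few minutes.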
Summary
Apache Kylin 2.0
- Kylin 2.0 Beta download available.
- Snowflake schema support
- Runs TPC-H benchmark
- Time / Window / Percentile functions
- Spark cubing
- Near real-time streaming
What is next
- Hadoop 3.0 support (Erasure Coding)
- Spark cubing enhancement
- Connect more sources (JDBC, SparkSQL)
- Alternative storage (Kudu?)
- Real-time support, lambda architecture
Thanks
See you at our next talk!

 

Recently uploaded (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Apache Kylin 2.0: From classic OLAP to real-time data warehouse

  • 1. Apache Kylin 2.0 From classic OLAP to real-time data warehouse 李扬 | Li, Yang Co-founder & CTO
  • 2. What is Apache Kylin. All rights reserved ©Kyligence Inc. http://kyligence.io [Diagram labels: BI, Visualization, Interactive Reporting, Dashboard; Apache Kylin (OLAP Engine); Hadoop: Hive, HBase, HDFS] - 3,000 billion rows, < 1 sec query latency @toutiao.com, top 1 news feed app in China - 60+ dimensions cube @CPIC, top 3 insurance group in China - JDBC / ODBC / RestAPI - BI integration
  • 3. Why Kylin is Fast. A sample query: Report revenue by “returnflag” and “orderstatus” over a time period. select l_returnflag, o_orderstatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price … from v_lineitem inner join v_orders on l_orderkey = o_orderkey where l_shipdate <= '1998-09-16' group by l_returnflag, o_orderstatus order by l_returnflag, o_orderstatus; [Diagram: execution plan with operators Tables, Join, Filter, Aggr., Sort; cost O(N)]
  • 4. Why Kylin is Fast. Precalculate the Kylin Cube. [Diagram: the O(N) plan (Tables, Join, Filter, Aggr., Sort) next to the precalculated plan (Cuboid, Filter, Sort); O(flag x status x days) = O(1)]
  • 5. Kylin is about Precalculation. [Diagram: cuboid lattice] 0-D (apex) cuboid; 1-D cuboids: time; item; location; supplier; 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier); 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier); 4-D (base) cuboid: (time, item, location, supplier). - Based on the Cube Theory - Model and Cube define the space of precalculation - Build Engine carries out the precalculation - Query Engine runs SQL on the precalculated result
  • 6. O(1) regardless of Data Size. [Chart: Response Time against Data Size; Online Calculation is O(N), Apache Kylin is O(1)]
  • 7. Snowflake Schema Support. Runs TPC-H benchmark.
  • 8. Kylin 1.0 Star Schema Limitation. Precalculate the join of 1 level of lookups - Support star schema only - Don’t allow same name columns from different tables - Don’t allow table joining itself - Difficult to support real world business cases [Diagram: LINEORDER joined with DATES, PART, CUSTOMER, SUPPLIER]
  • 9. Kylin 2.0 Snowflake Schema. Precalculate unlimited levels of lookups - Snowflake schema support (KYLIN-1875) - Allow a table to be joined multiple times - Big metadata change at Model level - Many bug fixes regarding joins and sub-queries - Support complex models of any kind, support flexible queries on the models [Diagram: multi-level joins among ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION, REGION]
  • 10. TPC-H on Kylin 2.0. TPC-H is a benchmark for decision support systems. - Popular among commercial RDBMS & DW solutions - Queries and data have broad industry-wide relevance - Examine large volumes of data - Execute queries with a high degree of complexity - Give answers to critical business questions. Kylin 2.0 runs all the 22 TPC-H queries. (KYLIN-2467) - Precalculation can answer very complex queries - Goal is functionality at this stage - Try it: https://github.com/Kyligence/kylin-tpch [Diagram: ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION, REGION]
  • 11. Complex Query 1. TPC-H query 07 - 0.17 sec (Hive+Tez 35.23 sec) - 2 sub-queries. select supp_nation, cust_nation, l_year, sum(volume) as revenue from ( select n1.n_name as supp_nation, n2.n_name as cust_nation, l_shipyear as l_year, l_saleprice as volume from v_lineitem inner join supplier on s_suppkey = l_suppkey inner join v_orders on l_orderkey = o_orderkey inner join customer on o_custkey = c_custkey inner join nation n1 on s_nationkey = n1.n_nationkey inner join nation n2 on c_nationkey = n2.n_nationkey where ( (n1.n_name = 'KENYA' and n2.n_name = 'PERU') or (n1.n_name = 'PERU' and n2.n_name = 'KENYA') ) and l_shipdate between '1995-01-01' and '1996-12-31' ) as shipping group by supp_nation, cust_nation, l_year order by supp_nation, cust_nation, l_year [Diagram: relational plan tree with Sort, Aggr., Filter, Proj. and five Join operators over LINEITEM, SUPPLIER, ORDER, CUSTOMER, NATION, NATION]
  • 12. Complex Query 2. TPC-H query 11 - 3.42 sec (Hive+Tez 15.87 sec) - 4 sub-queries, 1 online join. with q11_part_tmp_cached as ( select ps_partkey, sum(ps_partvalue) as part_value from v_partsupp inner join supplier on ps_suppkey = s_suppkey inner join nation on s_nationkey = n_nationkey where n_name = 'GERMANY' group by ps_partkey ), q11_sum_tmp_cached as ( select sum(part_value) as total_value from q11_part_tmp_cached ) select ps_partkey, part_value from ( select ps_partkey, part_value, total_value from q11_part_tmp_cached, q11_sum_tmp_cached ) a where part_value > total_value * 0.0001 order by part_value desc; [Diagram: relational plan tree with two branches, each aggregating PARTSUPP joined with SUPPLIER and NATION, combined by an online Join, then Filter and Sort]
  • 13. Complex Query 3. TPC-H query 12 - 7.66 sec (Hive+Tez 12.64 sec) - 5 sub-queries, 2 online joins. with in_scope_data as( select l_shipmode, o_orderpriority from v_lineitem inner join v_orders on l_orderkey = o_orderkey where l_shipmode in ('REG AIR', 'MAIL') and l_receiptdelayed = 1 and l_shipdelayed = 0 and l_receiptdate >= '1995-01-01' and l_receiptdate < '1996-01-01' ), all_l_shipmode as( select distinct l_shipmode from in_scope_data ), high_line as( select l_shipmode, count(*) as high_line_count from in_scope_data where o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' group by l_shipmode ), low_line as( select l_shipmode, count(*) as low_line_count from in_scope_data where o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' group by l_shipmode ) select al.l_shipmode, hl.high_line_count, ll.low_line_count from all_l_shipmode al left join high_line hl on al.l_shipmode = hl.l_shipmode left join low_line ll on al.l_shipmode = ll.l_shipmode order by al.l_shipmode [Diagram: relational plan tree with three branches, each applying Aggr., Filter and Proj. over ORDERS joined with LINEITEM, combined by two online Joins, then Filter and Sort]
  • 14. More than MOLAP - Supports complex data models and sub-queries; Runs TPC-H - Percentile / Window / Time functions [Chart: axes SQL Maturity and Speed; labels Kylin 1.0, Kylin 2.0, MOLAP, Analytics DW, DW on Hadoop]
  • 15. Spark Cubing. Halves the build time.
  • 16. A Bit of History. Kylin 1.5 attempted Spark cubing, but the feature was never released. - It was a port of the MR in-mem cubing algorithm. - Exploits memory and builds the whole cube in one round. - No improvement observed. - Spark did nothing differently than MR.
  • 17. Spark Cubing in 2.0. Kylin 2.0 did a complete rework based on the Layered Cubing algorithm. - Each layer of cuboids is an RDD. - The parent RDD is cached for the next round. - The RDD is exported to a sequence file, the same output format as MR. - Translate “map” to “flatMap” and “reduce” to “reduceByKey”; most code gets reused. [Diagram: RDD-1, RDD-2, RDD-3, RDD-4, RDD-5]
  • 18. DAG Calculating the 3rd Layer
  • 19. Spark Cubing vs. MR Layered Cubing. Halves the build time. The advantage decreases as data size increases. - 4-node cluster - Spark 1.6.3 on YARN - 24 vcores, 30 GB memory - 3 data sets of increasing size: 0.15 GB / 2.5 GB / 8 GB
  • 20. Spark Cubing vs. MR In-mem Cubing. Almost equally fast, and more adaptable to general data sets. - In-mem cubing expects sharded data, works poorly on random data sets. - Spark cubing is more adaptable to different kinds of data distribution.
  • 21. Near Real-time Streaming. Build latency down to a few minutes.
  • 22. New in Kylin 1.6. [Diagram labels: In-mem Cubing; Kylin; BI Tools, Web App…; ANSI SQL; New]
  • 23. Demo of Twitter Analysis. http://hub.kyligence.io Incremental build triggers every 2 minutes, build finishes in 3 minutes. - 8-node cluster on AWS, 3 Kafka brokers - Twitter sample feed, 10+ K messages per second - Cube has 9 dimensions and 3 measures - 2 jobs running at the same time
  • 24.
  • 25.
  • 26. Summary. Apache Kylin 2.0 - Kylin 2.0 Beta download available. - Snowflake schema support - Runs TPC-H benchmark - Time / Window / Percentile functions - Spark cubing - Near real-time streaming. What is next - Hadoop 3.0 support (Erasure Coding) - Spark cubing enhancement - Connect more sources (JDBC, SparkSQL) - Alternative storage (Kudu?) - Real-time support, lambda architecture
  • 27. Thanks See you on our next talk!
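
The cuboid lattice on slide 5 can be reproduced with a few lines of code. The following is a minimal sketch, not Kylin code: it only enumerates every subset of the four dimensions named on the slide and groups them by layer. The function name `cuboid_lattice` is made up for this illustration.

```python
from itertools import combinations

# The four dimensions shown on slide 5.
DIMENSIONS = ("time", "item", "location", "supplier")


def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by its size (the layer)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(DIMENSIONS)

# Walk from the 4-D base cuboid down to the 0-D apex cuboid.
for layer, cuboids in sorted(lattice.items(), reverse=True):
    print(f"{layer}-D: {len(cuboids)} cuboid(s)")

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
```

With N dimensions there are 2^N cuboids in total, which is why the Model and Cube definitions matter: they bound how much of this space is actually precalculated.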
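
Slide 17 says the layered algorithm maps “map” to “flatMap” and “reduce” to “reduceByKey”, with each layer feeding the next. The sketch below imitates that flow in plain Python so it runs without a cluster. It is an illustration under assumptions, not Kylin's implementation: `flat_map` and `reduce_by_key` are local stand-ins for the Spark operations of similar name, `None` marks a dimension that has been aggregated away, and the rule in `spawn_children` (only drop dimensions to the left of the leftmost dropped one) is one simple way to make sure each child cuboid is derived from exactly one parent.

```python
from operator import add


def flat_map(func, records):
    """Stand-in for a flatMap: apply func and flatten the results."""
    return [out for rec in records for out in func(rec)]


def reduce_by_key(func, pairs):
    """Stand-in for a reduceByKey: merge values that share a key."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


def spawn_children(record):
    """Emit one record per child cuboid by dropping one more dimension."""
    key, measure = record
    dropped = [i for i, v in enumerate(key) if v is None]
    limit = min(dropped) if dropped else len(key)
    # Dropping only positions left of `limit` gives every child a single parent,
    # so no measure is counted twice.
    for i in range(limit):
        yield key[:i] + (None,) + key[i + 1:], measure


def build_cube(rows, n_dims):
    """Return the layers from the N-D base cuboid down to the 0-D apex."""
    base = reduce_by_key(add, [(tuple(r[:n_dims]), r[n_dims]) for r in rows])
    layers = [base]
    for _ in range(n_dims):
        parent = layers[-1]  # plays the role of the cached parent layer
        layers.append(reduce_by_key(add, flat_map(spawn_children, parent.items())))
    return layers


# Hypothetical rows: (time, item, location, quantity).
rows = [
    ("2017-01", "pen", "NY", 3),
    ("2017-01", "pen", "SF", 4),
    ("2017-02", "ink", "NY", 5),
]
layers = build_cube(rows, 3)
```

Here `layers[0]` is the base cuboid and `layers[3]` holds the single apex cell. Each round reads only the previous layer, which is the property that lets a cached parent layer speed up the next round.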

Editor's Notes

  1. Thank you all for coming. Thanks for your time. My name is Yang, or 李扬 in Chinese. I am co-founder and CTO of Kyligence, and also a PMC member of Apache Kylin. The topic today is Apache Kylin 2.0. It is in beta at the moment and is planned for release in April. I will first briefly introduce what Apache Kylin is and then talk about its important new features, including snowflake schema support, Spark cubing, and streaming.
  2. So what is Apache Kylin? It is an OLAP engine on Hadoop, perhaps the most popular one at the moment. If you google “OLAP on Hadoop”, Kylin is the first result, as I tried this morning. It sits on top of the Hadoop infrastructure and exposes your relational data to upper applications via the standard SQL interface. Kylin can handle very big data sets and is very fast in terms of query latency. For example, the biggest Kylin instance we know of in production is at toutiao.com, the top 1 news feed app in China. It has a table of 3,000 billion rows, and with Kylin, the average query response time is below 1 second. Kylin can handle very complex data models, and we will talk more about snowflake support soon. The widest cube we know of is at CPIC (中国太平洋保险, China Pacific Insurance), the top 3 insurance group in China. It contains more than 60 dimensions. Kylin also provides standard JDBC / ODBC / RestAPI interfaces, so it integrates very well with existing BI tools like Tableau and PowerBI, you name it.
  3. Why is Kylin so fast? We can show it with an example. Imagine a retail scenario: I want to report revenue by “returnflag” and “orderstatus” within a date range, to see the total amount of successful transactions, canceled transactions, returned transactions, etc. A typical SQL query will look like this. To execute it, we compile it into a relational expression like the one in the diagram. This is the so-called execution plan. From the execution plan, we can see that executing the query involves scanning all the rows in the tables, joining them together, applying the date range filter, summing revenue by “returnflag” and “orderstatus”, and finally producing the sorted result. It is easy to see that the time complexity of such an execution is at least O(N), where N is the total number of rows in the tables, because each table row is visited at least once. And that assumes the joined tables are perfectly partitioned and indexed, such that the expensive join operator can also finish in linear time, which is not very likely in real cases. Anyway, O(N) is the best you can get with ad-hoc SQL processing.
  4. How can Kylin go beyond O(N)? By precalculation. If I know the query pattern in advance, I can precalculate the Aggregate, Join, and Table Scan operators to create a cuboid. If cuboid sounds unfamiliar, you can think of it as a materialized summary table. Here the summary table is the transaction amount grouped by “returnflag”, “orderstatus”, and “date”. There is a fixed number of return flags and order statuses, and let’s say the date range is limited too: for 3 years, there are about 1,000 days. That means the number of rows in the summary table is at most “flag x status x days”, which is a constant in big O notation. So if we execute the same SQL on the precalculated cuboid, the maximum number of rows to process is a constant. That is why Kylin can be faster.
  5. So Kylin is all about precalculation. The core idea is based on the classic cube theory and is developed from there into the SQL on big data domain. Kylin provides Model and Cube to help you define the space of precalculation. It has a Build Engine that executes the precalculation in a distributed system using MR or Spark, and a Query Engine that lets you run SQL on top of the precalculated result. The key here is modeling. If you have a good understanding of your data and business analysis requirements, you can capture the necessary precalculation with the right cube.
  6. Once you have the right precalculation, most of your queries (if not all) can be transformed into cube queries like the one we have just seen. The execution time complexity is then reduced to O(1), which gives very fast queries regardless of the original data size. That is the brief introduction to Apache Kylin.
  7. Next we will talk about the Snowflake Schema support, which is the most important new feature in Kylin 2.0.
  8. Kylin 1.0 has a limitation: it only supports the star schema data model. That means it only allows precalculation of joins with 1 level of lookup tables, as the diagram shows. This makes it difficult to support real world business cases, which are often more complicated than a star schema.
  9. Kylin 2.0 introduced big changes to the model metadata and supports the snowflake data model out of the box. It allows precalculation of unlimited levels of lookup tables, as the diagram shows. There were also many bug fixes and improvements regarding joins and sub-queries. As a result, Kylin 2.0 is able to support very complex data models and queries.
  10. To demonstrate the greatly improved modeling and SQL capability, we made an effort to run the TPC-H queries on Kylin 2.0. For those who are not familiar with TPC-H, the following is quoted from the TPC-H website: it is a decision support benchmark, popular among commercial RDBMS and DW solutions. It includes both queries and data that have broad industry-wide relevance. The queries are designed to examine large volumes of data, to have a high degree of complexity, and to give answers to critical business questions. Kylin 2.0 can run all the 22 TPC-H queries. We have put up a page with all the steps and resources needed to reproduce the work, so anyone who wants to verify it can give it a try in their own environment. This shows that with a precalculated cube, Kylin is also very flexible and can answer very complex queries. Another note: the goal here is not to compare performance with other TPC-H results. On one hand, according to the TPC-H spec, precalculation is not permitted across tables, so in that sense Kylin does not qualify as a valid system to compare with other TPC-H results. On the other hand, we haven’t done performance tuning for the TPC-H queries yet. We just had enough time to let all the queries pass. The room for performance improvement is still very big.
  11. Let me quickly show you some interesting TPC-H queries. On the left side is the SQL. Its font size is very small and it is not intended to be read. I highlighted the sub-queries in different colors, so we get a feeling for the complexity. On the right side is the relational expression of the query in tree form. In the tree, I marked the precalculated nodes with a white background and red text, so they look lightweight. The other, solid nodes are subject to online calculation and will be the cause if a query runs slow. This is TPC-H query 07. As you can see, almost all of the relational operators are precalculated. The remaining Sort / Proj / Filter work on very few records, so the query is very fast. It takes Kylin only 0.17 seconds to run, while on the same hardware and the same data set, Hive+Tez runs for 35.23 seconds on this query. That shows the difference between precalculation and online calculation.
  12. This is TPC-H query 11. It has 4 sub-queries in terms of SQL and is more complex than the previous query. Its relational expression tree is more complex too: it includes two branches, each loading from a precalculated cuboid. Finally, the results of the branches are joined again, which is a very heavy online computation. With the share of online calculation increased, the query takes longer in Kylin: 3.42 seconds. A full online calculation is still much slower and takes 15.87 seconds to run.
  13. This is an even more complex query, TPC-H query 12. It contains 5 sub-queries, and the SQL font size is even smaller than on the previous pages to fit the SQL on screen. The relational form is uglier too: it reads 3 cuboids in 3 branches and joins them together online. There are 2 heavy online join operators. As expected, the more online calculation there is, the slower the query gets. It takes 7.66 seconds to run in Kylin and 12.64 seconds to run in Hive+Tez. As more heavy operators stay online, the advantage of precalculation decreases.
  14. To sum it up, Kylin 2.0 is much more than multidimensional analysis. With snowflake schema enabled, it can support very complex relational models and answer very complex queries. It can run the TPC-H benchmark, and it has enhanced or added many other SQL features like Percentile functions, Window functions, and Time functions. Compared to Kylin 1.0, Kylin 2.0 is a big step forward in terms of SQL maturity. Given the right model design, it can answer queries of any kind of complexity. Compared to other data warehouses on Hadoop, Kylin 2.0 may still be a little behind in terms of SQL, e.g. it does not have UDF yet, but the gap is very small now. And in terms of query latency, because of precalculation, Kylin has a big advantage.
  15. Next we will talk about Spark Cubing, which cuts the build time by half.
  16. Since 1.5, we have been trying to do cubing on Spark. However, at that time the attempt was not successful. The first attempt was a port of the MR in-mem cubing algorithm to Spark, because that was the easiest to implement at the time. The algorithm uses as much memory as it can and builds the whole cube in one round, so Spark’s memory cache is not an advantage, since the MR job does the same. The result was no obvious improvement compared to MapReduce.
  17. So 2.0 did a complete rework based on the layered cubing algorithm. In layered cubing, the cube is calculated layer by layer. Each layer’s output is the next layer’s input. We use an RDD to abstract the layer data, so the parent RDD can be cached to speed up the processing of the next layer. An RDD can be exported to a sequence file using the same format as MapReduce, which keeps the cube data format compatible. Implementation-wise, the original mapper translates into “flatMap” and the original reducer translates into “reduceByKey”. Most of the code gets reused.
  18. This is a sample of a Spark job, calculating the 3rd layer of a cube. The first 2 stages are skipped, because they have been calculated and cached by the previous layer’s job. The later two stages do the additional processing to calculate the 3rd layer of the cube.
  19. Let us compare Spark cubing performance with MR layered cubing. The red bar is Spark and the blue bar is MR. The test is done on a 4-node cluster with Spark 1.6.3 on YARN, and 24 vcores and 30 GB memory allocated to Spark. We tested 3 data sets of increasing size: 0.15 GB, 2.5 GB, and 8 GB. The first two data sets are very small compared to 30 GB of memory, so Spark can cache everything in memory, and the result is that the Spark build time is about 50% of MR layered cubing. The 8 GB data set is much bigger, and since cubing causes data expansion, it cannot all fit in memory, so the advantage of Spark decreases a little, as the diagram shows.
  20. Now compare Spark cubing to MR in-mem cubing. The red bar is still Spark, and the blue bar is MR in-mem cubing. They are almost equally fast. However, remember that in-mem cubing is very picky about data distribution. It is only effective when the data is sharded or nearly sharded. If you force in-mem cubing to run on random data, it performs even slower than MR layered cubing. In the diagram, the data sets are all sharded for in-mem cubing. Spark cubing, on the other hand, shows good performance consistently, regardless of the data distribution. So in general, Spark cubing is still a better choice than in-mem cubing.
  21. Last thing is about near real-time streaming.
  22. This is actually a 1.6 feature. However, I feel that it has not been marketed well enough, so here it is again. Starting from 1.6, Kylin can connect to Kafka as a source, just like Hive. Using the in-mem cubing algorithm, we can trigger micro incremental builds very frequently, e.g. every 2 minutes. The result is many small cube segments, and they can be queried to give near real-time results.
  23. To show that this really works, we have put up a demo site to analyze Twitter messages in real time. It runs on an 8-node AWS cluster with 3 Kafka brokers. The input is the Twitter sample feed, which has 10+ K messages per second. The cube is of average complexity, with 9 dimensions and 3 measures. With this setup, the incremental build is triggered every 2 minutes and finishes in 3 minutes. That’s why you will see 2 jobs running at the same time, as the screenshot shows. That is perfectly OK as long as your cluster continues to complete jobs at a fixed rate. As a result, the system has about 5 minutes of delay in terms of real-time-ness.
  24. The demo shows Twitter message trends by language and by device. The left diagram is the message trend by language over a whole day. We can see that English message volume goes up in the US daytime, while Asian message volume goes down because it is night in Asia.
  25. There is also the tag cloud, which shows the most recent hot topics, and below it the trend of the hottest tags. All the charts are real-time up to the latest 5 minutes.
  26. To summarize: Apache Kylin is about to reach its 2.0 release. It has rich features like snowflake schema support, it runs the TPC-H benchmark, and it has many enhanced SQL functions. It has Spark cubing that halves the build time, and it has streaming capability. 2.0 is still in beta at the moment, and there is a beta package which you can download and try. We are very eager to hear feedback from the community, good or bad. After fixing any critical issues, the plan is to release in April. As to Kylin’s roadmap, there is a lot we want to do. Hadoop 3.0 with erasure coding could save a lot of cube storage, something we will definitely catch up with. Spark cubing has much room for improvement too, e.g. at the moment it does not source from Kafka yet. Connecting more sources is another frequently asked requirement. We could source from JDBC, or maybe SparkSQL. Alternative storage, like Kudu? Last but not least, a lambda architecture to support true real-time. That is all. Thanks again for your time.
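
Note 4 describes a cuboid as a materialized summary table grouped by “returnflag”, “orderstatus”, and “date”. The following is a minimal sketch of that idea in plain Python, not Kylin code. The rows are invented sample data, and `build_cuboid` and `query` are hypothetical helper names. The point is that the query only touches the summary cells, whose count is bounded by flag x status x days no matter how many raw rows were loaded.

```python
from collections import defaultdict

# Hypothetical joined rows: (returnflag, orderstatus, shipdate, quantity, extendedprice).
rows = [
    ("N", "O", "1998-09-01", 10, 100.0),
    ("N", "O", "1998-09-01", 5, 50.0),
    ("R", "F", "1998-09-10", 2, 20.0),
    ("N", "O", "1998-09-20", 7, 70.0),
]


def build_cuboid(rows):
    """Precalculate sums per (flag, status, day): one pass over all N rows."""
    cuboid = defaultdict(lambda: [0, 0.0])
    for flag, status, day, qty, price in rows:
        cell = cuboid[(flag, status, day)]
        cell[0] += qty
        cell[1] += price
    return dict(cuboid)


def query(cuboid, max_day):
    """Answer the sample query from the cuboid alone (filter, aggregate, sort)."""
    result = defaultdict(lambda: [0, 0.0])
    for (flag, status, day), (qty, price) in cuboid.items():
        if day <= max_day:  # ISO dates compare correctly as strings
            cell = result[(flag, status)]
            cell[0] += qty
            cell[1] += price
    return sorted((key, tuple(sums)) for key, sums in result.items())


cuboid = build_cuboid(rows)
report = query(cuboid, "1998-09-16")
```

The expensive pass over the raw rows happens once at build time. Every later query with the same shape scans only the cuboid.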
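
Notes 8 and 9 mention that Kylin 1.0 could not join the same table more than once, while TPC-H query 07 joins NATION twice (as n1 for the supplier and n2 for the customer). The sketch below shows only the shape of such a query, using Python's built-in sqlite3 module rather than Kylin. The `shipment` table and all the rows are invented for the illustration and are far simpler than the real TPC-H schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table nation   (n_nationkey integer primary key, n_name text);
    create table supplier (s_suppkey integer primary key, s_nationkey integer);
    create table customer (c_custkey integer primary key, c_nationkey integer);
    -- Hypothetical fact table standing in for the lineitem/orders join.
    create table shipment (supp_key integer, cust_key integer, volume real);

    insert into nation   values (1, 'KENYA'), (2, 'PERU');
    insert into supplier values (10, 1), (11, 2);
    insert into customer values (20, 2), (21, 1);
    insert into shipment values
        (10, 20, 100.0), (10, 20, 50.0), (11, 21, 30.0), (10, 21, 999.0);
""")

# NATION appears twice under different aliases, once per role.
rows = conn.execute("""
    select n1.n_name as supp_nation, n2.n_name as cust_nation,
           sum(volume) as revenue
    from shipment
    inner join supplier on s_suppkey = supp_key
    inner join customer on c_custkey = cust_key
    inner join nation n1 on s_nationkey = n1.n_nationkey
    inner join nation n2 on c_nationkey = n2.n_nationkey
    where (n1.n_name = 'KENYA' and n2.n_name = 'PERU')
       or (n1.n_name = 'PERU' and n2.n_name = 'KENYA')
    group by supp_nation, cust_nation
    order by supp_nation, cust_nation
""").fetchall()
conn.close()
```

The two aliases give the same lookup table two distinct roles, which is the pattern the 2.0 model metadata change makes possible to precalculate.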
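
The timing figures in note 23 follow from simple arithmetic. The sketch below is one reading of those figures and rests on an assumption: a message that arrives just after a trigger waits up to one full interval for the next build to start, then one build duration before it becomes queryable. The function name is hypothetical.

```python
import math


def streaming_build_profile(trigger_interval_min, build_duration_min):
    """Return (overlapping build jobs, worst-case data delay in minutes)."""
    # A new job starts every interval while earlier ones are still running.
    concurrent_jobs = math.ceil(build_duration_min / trigger_interval_min)
    # Assumed worst case: wait for the next trigger, then for the build to finish.
    worst_case_delay_min = trigger_interval_min + build_duration_min
    return concurrent_jobs, worst_case_delay_min


jobs, delay = streaming_build_profile(trigger_interval_min=2, build_duration_min=3)
print(f"{jobs} concurrent jobs, about {delay} minutes of delay")
```

With the demo's settings this gives 2 overlapping jobs and roughly 5 minutes of delay, matching the numbers quoted in the note.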