As data lakes hold larger volumes and more real-time data, managing that data and serving analytics and applications from it becomes more complex. Faced with different service interfaces, inconsistent data definitions, and engines tuned for different scenarios, business users lose confidence in both the quality of the data and the efficiency of getting insight from it.
2. About Me
Dong Li is a Founding Member and Head of Product and Innovation
at Kyligence, an Apache Kylin Core Developer (Committer) and member of
the Project Management Committee (PMC) where he focuses on big data
technology development.
Previously, he was a Senior Engineer in eBay’s Global Analytics
Infrastructure Department and a Software Development Engineer for
Microsoft Cloud Computing and Enterprise Products.
4. Customer Background
• A fast-growing SaaS company in the US
• 1800 customers in 40+ countries
• Used by 1/3 of the Fortune 500
• 8 Billion transactions per year
• Dashboards for end users
5. Landscape & Challenges
• Source Data in AWS RDS
• Materialized views used for dashboards
• Slow queries take 5+ seconds
• 4+ hours to refresh materialized views every day
• Bottleneck at ~10 concurrent users
• Couldn’t provide flexible dashboards
• Number of views keeps increasing
[Diagram labels: OLTP (RDS), OLAP (RDS), Materialized View, Dashboard; steps: Export, ETL, SQL]
6. Expectations for the Future Data Platform
• Flexible dashboards for end users
• High performance (< 2s), high concurrency (> 100 users)
• Easy to scale
• Low data preparation latency (< 1 hour)
• Flexible for new requirements
• Enterprise-grade security: data recovery, row/column level access etc.
• Totally on AWS
• Low TCO
• Open Platform for Machine learning, Internal Analytics etc.
9. Apache Kylin: Managing Your Most Valuable Data
• OLAP Data Modelling
• Speed Up Analytics Using Pre-Calculation
• ANSI SQL Interface
• High Concurrency and High Performance
• Batch & Streaming Together
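To make the pre-calculation bullet concrete: in the simplest case, with no pruning, a full cube materializes one aggregate table (a cuboid) for every subset of the model's dimensions, so N dimensions yield 2^N cuboids. The sketch below only enumerates those subsets. The function name and the dimension list are hypothetical, chosen for illustration, and are not taken from a real Kylin model.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions; each subset is one cuboid."""
    cuboids = []
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            cuboids.append(combo)
    return cuboids


# Hypothetical dimension list, for illustration only.
dims = ["l_returnflag", "o_orderstatus", "l_shipdate"]
cuboids = enumerate_cuboids(dims)
print(len(cuboids))  # 2 ** 3 = 8
```

The exponential growth is why the set of dimensions in a model is usually chosen with care instead of including every column.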
[Diagram labels: Presentation, Visualization, Data Mart, Big Data Platform, Data Source; Hive, Impala, Spark SQL, Kafka; MapReduce … Spark]
10. Apache Kylin Community & Adoptions
1000+ Global Adoptions
Leading Open Source OLAP
[Charts: GitHub Stars, JIRA Issues]
11. Star Schema Benchmark
Star schema benchmark:
http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
[Chart: SQL Latency for SSB queries 1.1 to 4.3, Kylin vs. SQL on Hadoop; y-axis: Latency (s); lower is better]
[Chart: Data Volume Scale, Kylin vs. SQL on Hadoop; x-axis: Data Scale, y-axis: Latency (s); lower is better]
13. How Does Kylin Accelerate Queries?
• Kylin uses Apache Calcite as the SQL parser and optimizer
select
  l_returnflag,
  o_orderstatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price
from
  v_lineitem
  inner join v_orders on l_orderkey = o_orderkey
where
  l_shipdate <= '1998-09-16'
group by
  l_returnflag,
  o_orderstatus
order by
  l_returnflag,
  o_orderstatus;
[Diagram: Parse SQL to an execution plan; plan operators: Sort, Aggr, Filter, Join, Tables; cost O(N)]
14. How Does Kylin Accelerate Queries?
• Kylin optimizes and adapts the plan to an OLAP cube.
• With less processing, Kylin can return the result instantly.
[Diagram: Pick the best matched cube and rewrite the plan (Sort, Aggr, Filter, Join, Tables) to Sort, Filter, Cube. The steps replaced by the cube have already been completed in the cube build; cost O(1)]
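The rewrite on this slide can be imitated on a small scale. The sketch below uses Python's built-in sqlite3 module as a stand-in. It is not Kylin's storage format or Calcite's actual rewrite rule. The sample rows are made up, and the table name lineitem_cube is hypothetical. It builds the join and aggregation once, then answers the query from slide 13 against the pre-aggregated table and checks that the result matches the raw query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE v_orders (o_orderkey INTEGER, o_orderstatus TEXT);
    CREATE TABLE v_lineitem (
        l_orderkey INTEGER, l_returnflag TEXT, l_shipdate TEXT,
        l_quantity INTEGER, l_extendedprice INTEGER
    );
""")
# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO v_orders VALUES (?, ?)",
                 [(1, "F"), (2, "O"), (3, "F")])
conn.executemany("INSERT INTO v_lineitem VALUES (?, ?, ?, ?, ?)", [
    (1, "R", "1998-01-10", 5, 500),
    (1, "R", "1998-01-10", 3, 300),
    (2, "N", "1998-09-01", 7, 700),
    (3, "R", "1998-10-05", 2, 200),  # excluded by the date predicate
    (3, "A", "1998-03-20", 4, 400),
])

# Build step: join and aggregate once, keeping every column that the
# query filters or groups on as a dimension of the pre-aggregated table.
conn.execute("""
    CREATE TABLE lineitem_cube AS
    SELECT l_returnflag, o_orderstatus, l_shipdate,
           SUM(l_quantity) AS sum_qty,
           SUM(l_extendedprice) AS sum_base_price
    FROM v_lineitem INNER JOIN v_orders ON l_orderkey = o_orderkey
    GROUP BY l_returnflag, o_orderstatus, l_shipdate
""")

RAW_QUERY = """
    SELECT l_returnflag, o_orderstatus,
           SUM(l_quantity), SUM(l_extendedprice)
    FROM v_lineitem INNER JOIN v_orders ON l_orderkey = o_orderkey
    WHERE l_shipdate <= '1998-09-16'
    GROUP BY l_returnflag, o_orderstatus
    ORDER BY l_returnflag, o_orderstatus
"""
# Rewritten query: no join, and it only re-aggregates stored partial sums.
CUBE_QUERY = """
    SELECT l_returnflag, o_orderstatus,
           SUM(sum_qty), SUM(sum_base_price)
    FROM lineitem_cube
    WHERE l_shipdate <= '1998-09-16'
    GROUP BY l_returnflag, o_orderstatus
    ORDER BY l_returnflag, o_orderstatus
"""
raw_result = conn.execute(RAW_QUERY).fetchall()
cube_result = conn.execute(CUBE_QUERY).fetchall()
print(raw_result == cube_result)
```

In this sketch the rewritten query still scans the pre-aggregated rows, so its cost depends on the number of distinct dimension combinations, not on the number of raw line items. That is the intuition behind the O(N) versus O(1) labels in the diagrams.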
15. Apache Kylin
[Diagram: The architecture of Apache Kylin v4.0.0-alpha. Labels: BI Tools, Apps, Machine Learning; SQL; Runtime Workload; Offload Workload; Scan & filter; Extract; Load]
16. Use Case: Online Shopping Reporting
The most visited website in Japan
https://techblog.yahoo.co.jp/oss/apache-kylin/
• We provide a reporting system that shows statistics for store owners,
e.g. impressions, clicks, and sales.
• Our reporting system previously used Impala as its backend database.
- It took a long time (about 60 sec) to show the Web UI.
• To lower the latency, we moved to Apache Kylin.
- Average latency < 1 sec in most cases
• Thanks to the low latency of Kylin, we can now focus on adding features for users.
17. Apache Kylin 4.0 Roadmap: Cloud Native
[Diagram labels: Data analytics; Interactive Reporting Dashboard; OLAP / Data mart; Apache Kylin; Container Service (K8S, Docker); Resource Orchestration; Metadata; Security; Data Lake: Source file, Streams, Parquet on Object Storage (S3, ADLS)]
• Fewer dependencies, more lightweight
• Automated scaling
• Lower computing and storage cost
• Automated DevOps
18. Data Is the Next Oil
The world’s most valuable resource is no longer oil, but data. -- The Economist
China’s Datasphere is expected to grow 30% on average over the next 7 years and will be the largest Datasphere of all regions by 2025. -- IDC
175 Zettabytes by 2025 -- IDC
21. What is missing here?
[Diagram: Data Analysts, Marketing Users, and Operation Analysts work through applications (Reporting, Dashboard, Ad-Hoc, Data-as-a-Services, Machine Learning) over SQL / MDX; the Data Lake holds EDW Datasets, Data Lake Datasets, and Cloud Datasets; the layer between them is marked "? ? ?"]
22. Unified Semantic Layer
[Diagram: a governed data platform sits between the applications (Reporting, Dashboard, Ad-Hoc, Data-as-a-Services, Machine Learning; accessed over SQL / MDX by Data Analysts, Marketing Users, and Operation Analysts) and the Data Lake (EDW Datasets, Data Lake Datasets, Cloud Datasets). It provides Managed Datasets and Managed KPIs, with Intelligent Cubing for managed data at scale.]
Managed Datasets (sample model)
• CUSTOMER: CUSTOMER NUMBER, CUSTOMER NAME, CUSTOMER CITY, CUSTOMER POST, CUSTOMER ST, CUSTOMER ADDR, CUSTOMER PHONE, CUSTOMER FAX
• ORDER: ORDER NUMBER, ORDER DATE, STATUS
• ORDER ITEM BACKORDERED: QUANTITY
• ITEM: ITEM NUMBER, QUANTITY, DESCRIPTION
• ORDER ITEM SHIPPED: QUANTITY, SHIP DATE
Managed KPIs
• Finance KPI, ERP KPI, Accounting KPI, Marketing KPI, Sales KPI, ……
One-stop Governed Platform
• Data as a service
• Single source of truth
• Managed golden data
Intelligent Data Platform
• Machine learning recommendations from SQL history
• Optimized for PB-scale data
• High performance and high concurrency
Analyst-Friendly Platform
• Supports the most popular BI tools
• Supports standard SQL/MDX
• Reduces engineering effort
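One way to picture "Managed KPIs" as a single source of truth: each KPI is defined once, and every consumer gets its SQL generated from that one definition. The sketch below is a hypothetical illustration, not the product's API. The Kpi class, the KPIS registry, the kpi_sql function, and the table and column names (loosely based on the sample model on this slide) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """A governed KPI: one name, one aggregate expression, one source table."""
    name: str
    expression: str
    table: str


# Hypothetical registry; in a governed platform this would be shared metadata.
KPIS = {
    "shipped_quantity": Kpi(
        name="shipped_quantity",
        expression="SUM(quantity)",
        table="order_item_shipped",
    ),
}


def kpi_sql(kpi_name, group_by=()):
    """Generate SQL for a KPI so every tool uses the same definition."""
    kpi = KPIS[kpi_name]
    columns = ", ".join(group_by)
    select_prefix = f"{columns}, " if columns else ""
    sql = f"SELECT {select_prefix}{kpi.expression} AS {kpi.name} FROM {kpi.table}"
    if columns:
        sql += f" GROUP BY {columns}"
    return sql


print(kpi_sql("shipped_quantity", ["ship_date"]))
```

Because dashboards, ad-hoc queries, and services would all call the same function, a change to the KPI definition would reach every consumer at once.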