Accelerate big data analytics with Apache Kylin
Shaofeng Shi, 史少锋
Apache Kylin committer & PMC
About me
• Shaofeng Shi, 史少锋
  shaofengshi@apache.org, shaofeng.shi@kyligence.io
• Apache Kylin committer & PMC member; joined the Kylin project in 2014 at eBay;
• Now chief architect at Kyligence Inc.
Agenda
• Apache Kylin background
• Why an OLAP cube is needed for big data
• How Kylin builds, persists, and queries cubes on Hadoop
• Performance benchmark
• Use cases
Background
• Huge amounts of data are collected today.
  • Transactions, server/app logs, mobile/IoT, etc.;
• Hadoop is the de facto standard for big data.
• Hadoop was not designed for low-latency SQL queries.
  • MapReduce, Hive, Pig, Spark… are for batch processing;
• Query performance decreases dramatically as data volume increases, and the user experience suffers.
• Challenge: how to keep data analytics fast as data grows?
[Chart: query performance vs. data volume]
MPP and SQL on Hadoop
• Massively Parallel Processing (MPP) and SQL-on-Hadoop technologies try to solve the problem.
  • Amazon Redshift, Greenplum, Presto, Impala, Spark SQL, etc.;
• MPP accelerates queries in the following ways:
  • Distributing data across multiple nodes and processing it in parallel;
  • Optimizing I/O with column-oriented storage, compression encoding, etc.;
  • In-memory processing;
MPP’s limitations
• Performance
  • Handling massive data consumes a lot of resources; when resources are insufficient, performance degrades. Latency ranges from tens of seconds to tens of minutes.
• Concurrency
  • CPU/memory/network intensive, so it is hard to serve many concurrent users;
• Scalability
  • Data needs to be redistributed when scaling out, which takes hours;
  • The master node becomes a bottleneck as cluster size increases. Cluster size is limited to one or two hundred nodes;
A real OLAP solution for big data
• High performance
• High scalability
• Standards compliant
• High concurrency
• Ease of use
What is Apache Kylin
[Architecture diagram: data sources (Hive / Kafka / RDBMS); Apache Kylin on Hadoop (MR/Spark, HBase) as OLAP / data mart; consumers: interactive reporting, dashboards, data analytics]
- Build OLAP cube on Hadoop
- Support TB to PB data
- Sub-second query latency
- ANSI-SQL
- JDBC / ODBC / REST API
- BI integration
- Web GUI
- LDAP/SSO
Apache Kylin: Extreme OLAP Engine for Big Data
https://kylin.apache.org
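A minimal sketch of the REST access path listed above, in Python. It only builds the HTTP request and does not send it. The endpoint path and payload fields follow the Kylin REST query API as commonly documented, so verify them against your Kylin version; the host, port, credentials, project name, and the helper `build_query_request` are assumptions for illustration.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, sql, project, limit=50000):
    """Build (but do not send) a POST request for Kylin's query endpoint."""
    payload = {
        "sql": sql,
        "project": project,      # assumed project name, supplied by the caller
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    # HTTP Basic authentication: base64("user:password")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical local server, credentials, and project, for illustration only.
request = build_query_request(
    "http://localhost:7070",
    "ADMIN",
    "KYLIN",
    "select count(*) from v_lineitem",
    project="learn_kylin",
)
# urllib.request.urlopen(request) would send it to a running Kylin server.
```

BI tools normally go through the JDBC or ODBC drivers instead; the REST path is convenient for scripts and web apps.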
OLAP cube
• An OLAP cube is a data structure optimized for very fast data analysis.
• OLAP cubes have been used by traditional data warehouses for decades.
• For analysts, the cube concept is easier to grasp than the alternatives.
• Why not use cubes for big data OLAP?
Picture is from https://www.guru99.com/online-analytical-processing.html#4
Benefits from cube for big data
[Diagram: source data → classification, aggregation, and sorting → OLAP cube]
✓ Performance: querying a cube can be 1000x faster than querying raw data;
✓ Ease of use: cubes are topic oriented, so analysts can self-serve;
✓ Cost efficiency: build once, use many times, saving substantial computing resources.
✓ Feasibility: with Hadoop, building a cube for a large dataset becomes easy.
Apache Kylin is built with mainstream technologies
[Architecture diagram: Apache Kylin serving Data Analyst, BI Tools, Web App…]
Offline data flow:
1. Extract data to Hadoop
2. Build the cube
3. Load
Online data flow:
4. Send SQL query
5. Parse, optimize & rewrite query
6. Scan & filter cube
Kylin can build all the cuboids you want
• A cube with N dimensions has 2^N cuboids;
• Cuboid = materialized view;
• All cuboids are consistent (updated atomically);
• Users can define a partial cube with rules.
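To make the 2^N count concrete, here is a small Python sketch that enumerates every cuboid (every subset of the dimensions) and derives a bitmask ID for each. The dimension names are borrowed from the query example later in the deck; the helper names and the bit order (leftmost dimension = most significant bit) are assumptions for illustration, not Kylin's internal code.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, largest (base cuboid) first."""
    cuboids = []
    for size in range(len(dimensions), -1, -1):
        for combo in combinations(dimensions, size):
            cuboids.append(combo)
    return cuboids


def cuboid_id(dimensions, cuboid):
    """Bitmask ID: one bit per dimension, set when the cuboid keeps it."""
    bits = 0
    for dim in dimensions:
        bits = (bits << 1) | (1 if dim in cuboid else 0)
    return bits


dims = ["l_returnflag", "o_orderstatus", "l_shipdate"]
all_cuboids = enumerate_cuboids(dims)

for cuboid in all_cuboids:
    print(format(cuboid_id(dims, cuboid), "03b"), cuboid)
```

With 3 dimensions this lists 8 cuboids; with 20 dimensions it would be over a million, which is why partial cubes defined by rules matter.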
How Kylin builds cubes on Hadoop
• Kylin automatically generates the cubing jobs, then executes and tracks them.
How Kylin builds cubes on Hadoop (cont.)
• Overall cubing steps:
  ➢ Extract source data from Hive/Kafka/RDBMS to HDFS;
  ➢ Build dimension dictionaries for encoding;
  ➢ Build cuboids with MapReduce or Spark;
  ➢ Convert the cube to HBase format;
  ➢ Load into HBase.
[Diagram: by-layer cubing]
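The by-layer idea can be sketched in a few lines of Python: the base cuboid (all N dimensions) is aggregated from the source rows, and each cuboid in the next layer is aggregated from a parent that has one more dimension, never from the raw data again. This is an in-memory illustration with SUM as the only measure; the function names and sample rows are assumptions, and the real jobs run as MapReduce or Spark stages.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(rows, keep):
    """Group (dims_tuple, value) pairs by the positions in `keep`, summing values."""
    out = defaultdict(int)
    for dims, value in rows:
        out[tuple(dims[i] for i in keep)] += value
    return dict(out)


def build_by_layer(records, n_dims):
    """Build all cuboids layer by layer, keyed by the tuple of kept dimension indexes."""
    base = tuple(range(n_dims))
    cube = {base: aggregate(records, base)}          # layer 0: base cuboid
    for size in range(n_dims - 1, -1, -1):           # layers with fewer dimensions
        for child in combinations(range(n_dims), size):
            # Any parent with exactly one extra dimension will do.
            missing = next(d for d in range(n_dims) if d not in child)
            parent = tuple(sorted(child + (missing,)))
            keep = [parent.index(d) for d in child]
            cube[child] = aggregate(cube[parent].items(), keep)
    return cube


# Hypothetical source rows: (country, channel, year) -> quantity
records = [
    (("US", "web", "2018"), 10),
    (("US", "app", "2018"), 5),
    (("CN", "web", "2018"), 7),
]
cube = build_by_layer(records, 3)
```

Each layer reads only the output of the previous one, so the volume processed shrinks as dimensions are dropped.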
How Kylin persists cubes into HBase
• A cube is composed of:
  • Cuboids
  • Dimensions
  • Metrics (measures)
• Cuboid ID + dimension values → row key
• Metrics → HBase column value
[Diagram: logical table for cuboid 00011111]
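A simplified Python sketch of the mapping on this slide: dimension values are dictionary-encoded to small integers, then the cuboid ID and the encoded values are concatenated into a row key. The 8-byte cuboid ID and 1-byte dimension width are assumptions chosen for readability, and real row keys may carry extra fields, so treat this as an illustration of the idea rather than the storage format.

```python
import struct


def build_dictionary(values):
    """Map each distinct dimension value to a compact integer ID (sorted order)."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def encode_row_key(cuboid_id, dim_ids, width=1):
    """Row key = big-endian cuboid ID followed by fixed-width dimension IDs."""
    key = struct.pack(">Q", cuboid_id)
    for dim_id in dim_ids:
        key += dim_id.to_bytes(width, "big")
    return key


# Hypothetical dimension values for two of the kept dimensions.
flag_dict = build_dictionary(["A", "N", "R"])
status_dict = build_dictionary(["F", "O", "P"])

row_key = encode_row_key(0b00011111, [flag_dict["R"], status_dict["F"]])
measures = {"sum_qty": 3785523}  # illustrative value stored in the HBase column
```

Because the cuboid ID leads the key, all rows of one cuboid sit next to each other, which is what makes a range scan per cuboid possible.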
How Kylin queries cubes with SQL
select
l_returnflag,
o_orderstatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price
from
v_lineitem
inner join v_orders on l_orderkey = o_orderkey
where
l_shipdate <= '1998-09-16'
group by
l_returnflag,
o_orderstatus
order by
l_returnflag,
o_orderstatus;
[Diagram: parse SQL to an execution plan: Sort → Aggr. → Filter → Join → Tables, O(N)]
• Kylin uses Apache Calcite as the SQL parser and optimizer;
How Kylin queries cubes with SQL (cont.)
[Diagram: the plan Sort → Aggr. → Filter → Join → Tables is rewritten to Sort → Filter → Cube (O(1)) by picking the best matched cuboid. The steps dropped from the plan have already been completed in the cube build.]
• Kylin optimizes and adapts the plan to the OLAP cube.
• With less processing, results are returned almost instantly.
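One way to picture "pick the best matched cuboid" is the Python sketch below: a cuboid can answer a query if it contains every column used in GROUP BY and in filters, and among those the smallest one is preferred. Choosing by estimated row count is a simplifying assumption, and the cuboid catalog and row counts are made up for illustration.

```python
def pick_cuboid(cuboids, group_by, filters):
    """Return the smallest cuboid covering all queried dimensions, or None."""
    needed = set(group_by) | set(filters)
    candidates = [dims for dims in cuboids if needed <= dims]
    if not candidates:
        return None
    return min(candidates, key=lambda dims: cuboids[dims])


# Hypothetical catalog: cuboid dimensions -> estimated row count.
cuboids = {
    frozenset({"l_returnflag", "o_orderstatus"}): 6,
    frozenset({"l_returnflag", "o_orderstatus", "l_shipdate"}): 1_000_000,
    frozenset({"l_returnflag", "o_orderstatus", "l_shipdate", "l_shipmode"}): 7_000_000,
}

# The query on the previous slide groups by two columns and filters on l_shipdate.
best = pick_cuboid(
    cuboids,
    group_by=["l_returnflag", "o_orderstatus"],
    filters=["l_shipdate"],
)
```

Without the `l_shipdate` filter, the same query could be served from the 6-row cuboid.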
How Kylin queries cubes with SQL (cont.)
• Translate the cube query into an HBase table scan;
  • Group by, filter -> identify the best cuboid
  • Filters -> HBase scan ranges (row key start/end), fuzzy row filter
  • Aggregations -> measure columns
• Scan the HBase table and translate the HBase result into a cube result;
  • An HBase coprocessor is used for condition pushdown and region-side aggregation.
  • The HBase result (row key + column value) is decoded into the cube result (dimensions + measures)
• Let Calcite do the final round of processing (filtering, sorting, grouping, window, etc.).
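The "filters -> scan ranges" step can be illustrated with a plain prefix scan over sorted row keys, simulated here in Python with `bisect` instead of an HBase client. The key layout (8-byte cuboid ID plus 1-byte dimension IDs) and the helper names are assumptions carried over for illustration; fuzzy row filters and coprocessor aggregation are not modeled.

```python
import bisect
import struct


def make_key(cuboid_id, *dim_ids):
    """Hypothetical row key: 8-byte cuboid ID followed by 1-byte dimension IDs."""
    return struct.pack(">Q", cuboid_id) + bytes(dim_ids)


def prefix_stop(prefix):
    """Smallest key that sorts after every key starting with `prefix`."""
    stop = bytearray(prefix)
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return b""  # empty stop row means "scan to the end"
    stop[-1] += 1
    return bytes(stop)


def scan(sorted_keys, start, stop):
    """Return keys in [start, stop), like a range scan on a sorted table."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop) if stop else len(sorted_keys)
    return sorted_keys[lo:hi]


table = sorted([
    make_key(30, 0, 0),
    make_key(31, 0, 0),
    make_key(31, 0, 1),
    make_key(31, 1, 0),
])

# Filter "first dimension = 0" on cuboid 31 becomes a start/stop row pair.
start_row = make_key(31, 0)
stop_row = prefix_stop(start_row)
hits = scan(table, start_row, stop_row)
```

Only filters on leading row key columns narrow the range this way, which is why row key design affects scan cost.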
Performance benchmark
Tested on the SSB (Star Schema Benchmark) dataset: 200x to 1700x faster than Apache Hive.
As data size increases, the response time stays stable.
KAP was the name of the Kyligence release.
Advanced features
➢ Build a partial cube manually (with rules) or automatically (cube planner);
➢ Support ultra-high-cardinality (UHC) dimensions (e.g., cell phone numbers, cardinality > 100 million);
➢ Support precise count distinct on UHC dimensions (with bitmaps), with roll-up at any level;
➢ Support full and incremental data loads, and allow refreshing historical data;
➢ Support Kafka and RDBMS (MySQL, Redshift, etc.) as data sources;
➢ Multi-node deployment for HA;
➢ Read-write separation deployment (dedicated HBase) for high performance;
➢ Real-time OLAP (v3.0) is currently in alpha;
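Why a bitmap allows precise count distinct to roll up at any level can be shown with a short Python sketch. Each cell stores the set of dictionary-encoded user IDs as a bitmap; rolling up is a bitwise OR, and the distinct count is the number of set bits. Python integers stand in for a compressed bitmap here, and the sample IDs and dates are made up.

```python
def to_bitmap(user_ids):
    """Set one bit per dictionary-encoded user ID."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap):
    """Exact distinct count = number of set bits."""
    return bin(bitmap).count("1")


def roll_up(bitmaps):
    """Merge finer-grained cells into a coarser one with bitwise OR."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return merged


# Hypothetical daily cells: encoded IDs of users seen each day.
daily = {
    "2018-08-01": to_bitmap([1, 2, 3]),
    "2018-08-02": to_bitmap([2, 3, 4]),
    "2018-08-03": to_bitmap([4, 5]),
}

three_day = roll_up(daily.values())
naive_sum = sum(cardinality(b) for b in daily.values())
```

Summing the daily counts would give 8, while the merged bitmap reports 5 distinct users, which is the precise answer because users seen on several days are counted once.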
Scenarios for Apache Kylin
✓ Dashboards, reporting, and business intelligence on big data (supports Tableau, Cognos, MSTR, Qlik, Superset, Zeppelin…)
✓ Multi-dimensional data exploration
✓ Traditional data warehouse offload/migration to Hadoop
✓ User behavior analysis (PV, UV, retention, funnel, etc.)
✓ Transparent query acceleration
Use case: Meituan & Dianping
• The biggest O2O service provider in China
  • Food delivery, ratings, hotels, travel, bikes (Mobike), etc.;
  • More than 300 million active users and 3 million merchants;
• Apache Kylin has been the central OLAP platform since 2015, serving thousands of business analysts from all business lines.
• As of August 2018, Kylin held 8.9 trillion rows in total, with 971 TB of cube storage.
• 3.8 million queries per day; 50% of queries < 200 ms; 90% of queries < 1.2 s
Use case: Yahoo! Japan
• Data analysis for Yahoo! Japan’s online shop.
• Uses Kylin to replace Impala and meet business analysts’ low-latency requirements;
• Query latency reduced from 1 minute to < 1 s;
• With Kylin’s cross-region deployment, the cube rather than the raw data is pushed to the data center near the analysts, saving bandwidth;
• Read more: https://techblog.yahoo.co.jp/oss/apache-kylin/
Use case: Telecoming
• About Telecoming
  • A mobile payments (Direct Carrier Billing) and digital marketing company in Spain.
• To meet the demand for analytics over large volumes of data, it deployed Hadoop and Kylin to support interactive analytical queries.
• “Thanks to this new architecture, Telecoming has improved the quality and productivity of its decision-making processes, which translates into better results for their business.”
• Read more: http://www.stratebi.com/-/big-data-marketing-telecoming
Use case: OLX Group
• About OLX Group
  • A global online marketplace headquartered in Amsterdam, owned by Naspers, operating in 45 countries;
• Previously built its data warehouse on Amazon Redshift; Redshift couldn’t scale, so it is moving to a data lake on S3;
• Needs a solution that lets analysts do interactive analysis with Tableau.
• Tried several solutions; in the end, Kylin + EMR solved the challenge.
• “Kylin is the game changer with its extreme fast performance and seamless integration with Tableau.”
Picture from https://www.olx.co.id/
Useful links
• Apache Kylin home: https://kylin.apache.org/
• Apache Kylin source code: https://github.com/apache/kylin
• Join the Kylin mailing lists: dev@kylin.apache.org and user@kylin.apache.org
• Kylin meetup: https://www.meetup.com/Apache-Kylin-Meetup/
• Need an enterprise solution? Contact Kyligence: https://kyligence.io/ or info@kyligence.io
See a demo?
Thank you!

Accelerating Big Data Analytics with Apache Kylin

  • 1. Accelerate big data analytics with Apache Kylin Shaofeng Shi, 史少锋 Apache Kylin committer & PMC
  • 2. • Shaofeng Shi, 史少锋 shaofengshi@apache.org, shaofeng.shi@kyligence.io • Apache Kylin committer & PMC, joined Kylin project since 2014 in eBay; • Now chief architect at Kyligence Inc. About me
  • 3. • Apache Kylin background • Why OLAP cube is needed for big data • How Kylin build/persist/query cube on Hadoop • Performance benchmark • Use cases Agenda
  • 4. • Huge amount of data be collected today. • Transactions, server/app logs, mobile/IoT, etc; • Hadoop is the de facto standard for big data. • Hadoop was not designed for low-latency SQL query. • MapReduce, Hive, Pig, Spark… are for batch processing; • Query performance decreases dramatically as data volume increases. User experience is bad. • Challenge: how to keep high performance data analytics as data grows? Background 0 2 4 6 8 10 12 1 2 3 4 5 6 7 8 Query performance Data volume
  • 5. • Massive Parallel Processing (MPP) and SQL on Hadoop technologies tries to solve the problem. • Amazon Redshift, Greenplum, Presto, Impala, Spark SQL, etc; • MPP accelerate queries in the following ways: • Distribute data into multiple nodes, processing in parallel; • Optimize I/O with column-oriented storage, compression encoding, etc; • In-memory processing; MPP and SQL on Hadoop
  • 6. • Performance • Handling massive data costs much resource; When resource is insufficient, performance is downgraded. Latency from tens of seconds to tens of minutes. • Concurrency • CPU/memory/network intensive, hard to serve multiple concurrent users; • Scalability • Data need be redistributed when scale out, taking hours; • Master node becomes a bottleneck as cluster size increase. Cluster size is limited to 1-2 hundreds; MPP’s limitation
  • 7. A real OLAP solution for big data High performance High scalability Standard complaint High concurrency + Ease of use
  • 8. What is Apache Kylin Data analytics Hive / Kafka / RDBMS Apache Kylin MR/Spark HBase Interactive Reporting Dashboard OLAP / Data mart Hadoop - Build OLAP cube on Hadoop - Support TB to PB data - Sub-second query latency - ANSI-SQL - JDBC / ODBC / REST API - BI integration - Web GUI - LDAP/SSO Apache Kylin: Extreme OLAP Engine for Big Data https://kylin.apache.org
  • 9. OLAP cube • OLAP cube is a data structure optimized for very quick data analysis. • OLAP cube has been adopted by traditional Data Warehouses for decades. • For analyst, the cube concept is easier to be accepted than others. • Why not use cube for big data OLAP? Picture is from https://www.guru99.com/online-analytical- processing.html#4
  • 10. Benefits from cube for big data OLAP Cube Classification, aggregation, and sorting ü Performance: querying cube can be 1000x faster than querying raw data; ü Ease of use: cube is topic oriented, analysts can self-serve; ü Cost efficiency: build once, use many times. Much computing resources can be saved. ü Feasibility: with Hadoop, building cube for a large dataset becomes easy. Source data
  • 11. Apache Kylin is built with mainstream technologies Apache Kylin Data Analyst, BI Tools, Web App… 4. Send SQL query Online data flow Offline data flow 6. Scan & filter cube 1. Extract data to Hadoop 2. Build the cube 3. Load 5. Parse, optimize & rewrite query
  • 12. Kylin can build all cuboids your want • One cube has 2^N cuboids; (N = dimension number) • Cuboid = Materialized view; • All cuboids are consistent (updated atomically); • User can define partial cube with rules.
  • 13. • Kylin automatically generate the cubing jobs, and then execute/track them. How Kylin build cube on Hadoop
  • 14. How Kylin builds a cube on Hadoop (cont.) • Overall cubing steps: ➢ Extract source data from Hive/Kafka/RDBMS to HDFS; ➢ Build dimension dictionaries for encoding; ➢ Build cuboids with MapReduce or Spark; ➢ Convert cube to HBase format; ➢ Load into HBase. [Figure: By-layer cubing]
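The by-layer idea in the figure can be sketched in a few lines: aggregate the raw rows into the base cuboid, then derive each lower layer from a cuboid in the layer above instead of rescanning the source. This is an in-memory illustration under simplifying assumptions (a single SUM measure, hypothetical dimension names and data); the real build runs as MapReduce or Spark jobs.

```python
from collections import defaultdict
from itertools import combinations

def roll_up(parent_dims, parent_data, child_dims):
    """Aggregate a parent cuboid into a child that keeps a subset of its dimensions."""
    idx = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, total in parent_data.items():
        child[tuple(key[i] for i in idx)] += total
    return dict(child)

def build_by_layer(dims, rows):
    """rows: list of (dimension values, measure). Returns {cuboid dims: {key: sum}}."""
    base = defaultdict(int)
    for key, measure in rows:
        base[tuple(key)] += measure
    cube = {tuple(dims): dict(base)}
    # Layer with N-1 dimensions first, then N-2, ... down to the grand total.
    for size in range(len(dims) - 1, -1, -1):
        for child_dims in combinations(dims, size):
            # Any cuboid from the layer above that contains all child dimensions works.
            parent_dims = next(
                p for p in cube
                if len(p) == size + 1 and set(child_dims) <= set(p)
            )
            cube[child_dims] = roll_up(parent_dims, cube[parent_dims], child_dims)
    return cube

dims = ["year", "country"]  # hypothetical dimensions
rows = [(("2018", "CN"), 10), (("2018", "US"), 5), (("2019", "CN"), 7)]
cube = build_by_layer(dims, rows)
print(cube[("year",)])
print(cube[()])
```

Rolling up from the parent layer is valid here because SUM is distributive; measures such as precise count distinct need a mergeable representation instead of a plain number.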
  • 15. How Kylin persists a cube into HBase • A cube is composed of: • Cuboids • Dimensions • Metrics (measures) • Cuboid ID + dimension values → row key • Metrics → HBase column value [Figure: logical table for cuboid 00011111]
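A minimal sketch of the "cuboid ID + dimension values → row key" mapping follows. The layout is an assumption made for illustration: an 8-byte big-endian cuboid ID followed by fixed-width 2-byte dictionary IDs, with made-up ID values. Kylin's actual key format has more detail than this.

```python
import struct

def encode_row_key(cuboid_id, dim_ids, width=2):
    """Concatenate an 8-byte cuboid ID with fixed-width dictionary IDs.

    dim_ids are the dictionary-encoded values of the dimensions present
    in the cuboid, in cube dimension order (assumed layout).
    """
    key = struct.pack(">Q", cuboid_id)
    for dim_id in dim_ids:
        key += dim_id.to_bytes(width, "big")
    return key

cuboid_id = 0b00011111                 # the cuboid shown on the slide: 5 dimensions present
dim_ids = [3, 0, 12, 7, 1]             # hypothetical dictionary IDs, one per dimension
row_key = encode_row_key(cuboid_id, dim_ids)
print(row_key.hex())  # 000000000000001f00030000000c00070001
```

Because the cuboid ID leads the key, all rows of one cuboid sort next to each other, so a query that targets a single cuboid can read one contiguous key range.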
  • 16. How Kylin queries a cube with SQL • Example query: select l_returnflag, o_orderstatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price from v_lineitem inner join v_orders on l_orderkey = o_orderkey where l_shipdate <= '1998-09-16' group by l_returnflag, o_orderstatus order by l_returnflag, o_orderstatus; • Parse SQL to an execution plan [Diagram: plan operators Sort, Aggr., Filter, Join, Tables; O(N)] • Kylin uses Apache Calcite as the SQL parser and optimizer.
  • 17. How Kylin queries a cube with SQL (cont.) • [Diagram: the original plan (Sort, Aggr., Filter, Join, Tables) is rewritten to (Sort, Filter, Cube); annotations: "These steps have already been completed in the cube build."; "Pick the best matched cuboid"; O(1)] • Kylin optimizes and adapts the plan to the OLAP cube. • With less processing, results are returned almost instantly.
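One way to picture "pick the best matched cuboid" is as a bitmask lookup: set one bit per dimension the query touches, then choose the smallest materialized cuboid whose bits cover that mask. The sketch below is an assumption-laden illustration, not Kylin's implementation; the column names come from the example query, except `l_shipmode`, which is a hypothetical extra dimension.

```python
def query_cuboid_id(all_dims, query_dims):
    """Set one bit per dimension used by the query (group-by and filter columns)."""
    n = len(all_dims)
    cuboid_id = 0
    for d in query_dims:
        cuboid_id |= 1 << (n - 1 - all_dims.index(d))
    return cuboid_id

def pick_cuboid(target, materialized):
    """Return the materialized cuboid covering `target` with the fewest dimensions."""
    candidates = [c for c in materialized if c & target == target]
    return min(candidates, key=lambda c: (bin(c).count("1"), c))

all_dims = ["l_returnflag", "o_orderstatus", "l_shipdate", "l_shipmode"]
target = query_cuboid_id(all_dims, ["l_returnflag", "o_orderstatus", "l_shipdate"])

full_cube = {0b1111, 0b1110, 0b1100}
partial_cube = {0b1111, 0b1100}          # exact match was pruned by rules
print(bin(pick_cuboid(target, full_cube)))     # 0b1110
print(bin(pick_cuboid(target, partial_cube)))  # 0b1111
```

With a full cube the target mask is itself a cuboid, so the lookup needs no search; with a partial cube the query falls back to a larger cuboid and aggregates the extra dimension away at query time.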
  • 18. How Kylin queries a cube with SQL (cont.) • Translate the cube query into an HBase table scan: • Group by, Filter -> identify the best cuboid • Filters -> HBase scan ranges (row key start/end), fuzzy row filter • Aggregations -> measure columns • Scan the HBase table and translate the HBase result into the cube result: • An HBase coprocessor is used for condition pushdown and region-side aggregation. • The HBase result (row key + column value) is decoded into the cube result (dimensions + measures). • Let Calcite do the final round of processing (filtering, sorting, grouping, window, etc.).
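The "filters -> scan ranges" step can be sketched as follows, assuming a simplified row key of an 8-byte cuboid ID followed by 2-byte dictionary IDs, and a range filter on the leading dimension. A sorted Python list stands in for the HBase table; all IDs and names are hypothetical.

```python
import bisect
import struct

def make_key(cuboid_id, dim_ids, width=2):
    """Assumed layout: 8-byte cuboid ID, then fixed-width dictionary IDs."""
    return struct.pack(">Q", cuboid_id) + b"".join(
        d.to_bytes(width, "big") for d in dim_ids
    )

def scan_range(cuboid_id, lo_id, hi_id, width=2):
    """Turn lo_id <= leading dimension <= hi_id into [start, stop) row keys.

    Assumes hi_id + 1 still fits in `width` bytes.
    """
    prefix = struct.pack(">Q", cuboid_id)
    start = prefix + lo_id.to_bytes(width, "big")
    stop = prefix + (hi_id + 1).to_bytes(width, "big")
    return start, stop

def scan(sorted_keys, start, stop):
    """Return the keys in [start, stop), as a range scan over a sorted table would."""
    i = bisect.bisect_left(sorted_keys, start)
    j = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[i:j]

# Cuboid 0b11 holds (date_id, flag_id); cuboid 0b10 holds (date_id,) only.
table = sorted(
    [make_key(0b11, [d, f]) for d in range(5) for f in range(2)]
    + [make_key(0b10, [d]) for d in range(5)]
)
start, stop = scan_range(0b11, lo_id=0, hi_id=2)   # date_id <= 2
hits = scan(table, start, stop)
print(len(hits))  # 6 rows: date_id 0..2, flag_id 0..1
```

A filter on a non-leading dimension cannot be expressed as one contiguous range, which is where the fuzzy row filter mentioned on the slide comes in.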
  • 19. Performance benchmark • Tested on the SSB dataset: 200x to 1700x faster than Apache Hive. • As data size increases, the response time stays stable. • Note: KAP was the name of the Kyligence release.
  • 20. Advanced features ➢ Build a partial cube manually (with rules) or automatically (cube planner); ➢ Support UHC (ultra-high-cardinality) dimensions (e.g., cell number, cardinality > 100 million); ➢ Support precise count distinct on UHC dimensions (with bitmap), with roll-up at any level; ➢ Support full and incremental data loads, and allow refreshing historical data; ➢ Support Kafka and RDBMS (MySQL, Redshift, etc.) as data sources; ➢ Multi-node deployment for HA; ➢ Read-write separation deployment (dedicated HBase) for high performance; ➢ Real-time OLAP (v3.0) is currently in alpha.
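Why a bitmap allows precise count distinct to "roll up at any level" can be shown with a small sketch. A plain Python integer stands in for the compressed bitmap a real engine would use, and user IDs are assumed to be already dictionary-encoded to small integers; the dates and IDs are made up.

```python
def to_bitmap(user_ids):
    """Set one bit per (dictionary-encoded) user ID."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap

def cardinality(bitmap):
    """Number of distinct IDs stored in the bitmap."""
    return bin(bitmap).count("1")

# One bitmap per day, as a cuboid with a date dimension would store it.
daily = {
    "2018-08-01": to_bitmap([1, 2, 3]),
    "2018-08-02": to_bitmap([2, 3, 4]),
    "2018-08-03": to_bitmap([4, 5]),
}

# Rolling up to a coarser level is a bitwise OR of the stored bitmaps.
rolled = 0
for bitmap in daily.values():
    rolled |= bitmap

print(sum(cardinality(b) for b in daily.values()))  # 8, double-counts repeat visitors
print(cardinality(rolled))                          # 5, the exact distinct count
```

Adding up daily counts overstates the total, while merging the bitmaps stays exact, which is the property that lets the measure be aggregated across any combination of dimensions.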
  • 21. Scenarios for Apache Kylin ✓ Dashboard, reporting and business intelligence on big data (supports Tableau, Cognos, MSTR, Qlik, Superset, Zeppelin…) ✓ Multi-dimensional data exploration ✓ Traditional data warehouse offload/migration to Hadoop ✓ User behavior analysis (PV, UV, retention, funnel, etc.) ✓ Transparent query acceleration
  • 22. Use case: Meituan & Dianping • The biggest O2O service provider in China • Catering takeaway, rating, hotel, travel, bike (mobike), etc.; • More than 300 million active users and 3 million merchants; • Apache Kylin has been the central OLAP platform since 2015, serving thousands of business analysts from all business lines. • As of August 2018, Kylin held 8.9 trillion rows in total, with 971 TB of cube storage. • 3.8 million queries per day; 50% of queries < 200 ms; 90% of queries < 1.2 s
  • 23. Use case: Yahoo! Japan • Yahoo! Japan's online shop data analysis. • Uses Kylin to replace Impala and meet business analysts' low-latency requirement; • Query latency reduced from 1 minute to < 1 s; • With Kylin's cross-region deployment, the cube, instead of raw data, is pushed to the data center near the analysts, saving bandwidth; • Read more: https://techblog.yahoo.co.jp/oss/apache-kylin/
  • 24. Use case: Telecoming • About Telecoming • Mobile payments (Direct Carrier Billing) and digital marketing company in Spain. • To meet the demand for analytics over large volumes of data, it deployed Hadoop and Kylin to support interactive analytical queries. • “Thanks to this new architecture, Telecoming has improved the quality and productivity of its decision-making processes, which translates into better results for their business.” • Read more: http://www.stratebi.com/-/big-data-marketing-telecoming
  • 25. Use case: OLX Group • About OLX Group • A global online marketplace, headquartered in Amsterdam, owned by Naspers, operating in 45 countries; • Previously built its data warehouse on Amazon Redshift; Redshift couldn't scale, prompting a move to a data lake on S3; • Needed a solution that lets analysts do interactive analysis with Tableau. • Tried several solutions; in the end, Kylin + EMR solved the challenge. • “Kylin is the game changer with its extreme fast performance and seamless integration with Tableau.” Picture from https://www.olx.co.id/
  • 26. Useful links • Apache Kylin home: https://kylin.apache.org/ • Apache Kylin source code: https://github.com/apache/kylin • Join the Kylin mailing lists: dev@kylin.apache.org and user@kylin.apache.org • Kylin meetup: https://www.meetup.com/Apache-Kylin-Meetup/ • Need an enterprise solution? Contact Kyligence: https://kyligence.io/ or info@kyligence.io