Design Cube in Kylin
dev@kylin.incubator.apache.org
Before You Start
• Kylin is a MOLAP engine on Hadoop.
• Understanding how Kylin works helps a lot with cube design.
– http://www.slideshare.net/YangLi43/apache-kylin-deep-dive-2014-dec
• This deck summarizes best practices and
patterns on how to design an efficient cube.
– For detailed steps to create a cube, check out
https://github.com/KylinOLAP/Kylin/wiki/Kylin-Cube-Creation-Tutorial
Overview
• Identify Star Schema
• Design Cube
– Dimensions
– Measures
– Incremental Build
– Advanced Options
• Build and Verify
Identify Star Schema
• Kylin builds a cube from a star schema of Hive
tables.
• One fact table that has ever-growing records, like
transactions.
• A few dimension tables that are relatively static,
like users and products.
• Hive tables must be synced into Kylin first.
Know Cardinalities of Columns
• Cardinalities have significant impact on cube size and query
latency.
– High Cardinality: > 1,000
– Ultra High Cardinality: > 1,000,000
• Avoid UHC as much as possible.
– If it is only used as an indicator, put the indicator in the cube instead.
– Try to categorize values or derive features from the UHC column
rather than putting the original value in the cube.
• To find column cardinalities
– select count(distinct A) from T
– or search for a dedicated profiling tool
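As a minimal sketch of the cardinality check above, the snippet below runs the count(distinct ...) query against an in-memory SQLite table standing in for a Hive table, then buckets the result with the thresholds from this slide. The table T, the column A, the sample data, the helper classify_cardinality, and the "normal" label are illustrative assumptions, not part of Kylin.

```python
import sqlite3

HIGH = 1_000            # threshold for high cardinality, from the slide
ULTRA_HIGH = 1_000_000  # threshold for ultra high cardinality (UHC), from the slide

def classify_cardinality(n: int) -> str:
    """Bucket a distinct-value count using the thresholds above."""
    if n > ULTRA_HIGH:
        return "ultra high"
    if n > HIGH:
        return "high"
    return "normal"  # label chosen for this sketch

# In-memory stand-in for a Hive table T with a column A (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A INTEGER)")
conn.executemany("INSERT INTO T VALUES (?)", [(i % 1500,) for i in range(5000)])

(cardinality,) = conn.execute("SELECT COUNT(DISTINCT A) FROM T").fetchone()
print(cardinality, classify_cardinality(cardinality))
```

On a real Hive table the same query can be expensive, so it may be worth running it once per candidate dimension and recording the results.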
Cube Concepts
Cube = all combinations of dimensions
Cuboid = one combination of dimensions
Curse of dimensionality: an N-dimension cube has 2^N cuboids
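To make the 2^N growth concrete, here is a small sketch that enumerates every cuboid of a three-dimension cube. The function name all_cuboids and the dimension names are made up for illustration; Kylin itself does not expose such a function.

```python
from itertools import combinations

def all_cuboids(dimensions):
    """Return every subset of the dimensions, i.e. every cuboid of a full cube."""
    return [
        combo
        for size in range(len(dimensions), -1, -1)
        for combo in combinations(dimensions, size)
    ]

cuboids = all_cuboids(["A", "B", "C"])
print(len(cuboids))  # 2 ** 3 cuboids for 3 dimensions
for cuboid in cuboids:
    print(cuboid)
```

Each extra dimension doubles the count, which is why the following slides focus on ways to prune combinations.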
Design Dimensions
• 15 dimensions or fewer is ideal.
– More than that slows down the cube build and
increases query latency.
– Does the user really need a report with 15+ dimensions?
– You can define multiple cubes on one star schema to
fulfill different analysis scenarios.
• Control the total number of dimensions.
– Mandatory dimension
– Hierarchy dimension
– Derived dimension
Mandatory Dimension
• A dimension that is present in every query.
– like Date
• A mandatory dimension cuts the number of cuboids in half.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A is Mandatory
A B C
A B -
A - C
A - -
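The table above can be reproduced with a short sketch that keeps only the cuboids containing the mandatory dimension. The helper cuboids_with_mandatory is a hypothetical name used for illustration only.

```python
from itertools import combinations

def cuboids_with_mandatory(dimensions, mandatory):
    """Enumerate cuboids, keeping only those that contain every mandatory dimension."""
    required = set(mandatory)
    return [
        combo
        for size in range(len(dimensions), -1, -1)
        for combo in combinations(dimensions, size)
        if required <= set(combo)
    ]

kept = cuboids_with_mandatory(["A", "B", "C"], mandatory=["A"])
print(kept)  # 4 of the 8 cuboids survive
```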
Hierarchy Dimension
• Dimensions that form a “contains” relationship, where the
parent level is required for the child level to make sense.
– like Year -> Month -> Day; or Country -> City
• A hierarchy reduces the number of combinations from 2^N to N+1.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A->B->C is Hierarchy
A B C
A B -
A - -
- - -
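A hierarchy only ever needs its prefixes, which is where N+1 comes from. The sketch below shows this for the A->B->C example; hierarchy_cuboids is an illustrative helper, not a Kylin function.

```python
def hierarchy_cuboids(levels):
    """Cuboids of a hierarchy are its prefixes: a child never appears without its parents."""
    return [tuple(levels[:depth]) for depth in range(len(levels), -1, -1)]

kept = hierarchy_cuboids(["A", "B", "C"])
print(kept)  # N + 1 = 4 cuboids instead of 2 ** 3 = 8
```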
Derived Dimension
• Dimensions on a lookup table that can be derived from its PK.
– like User ID derives [Name, Age, Gender]
• Derived dimensions reduce the number of combinations from 2^N to 2,
at the cost of extra runtime aggregation.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A, B, C are Derived by ID
ID
-
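The "extra runtime aggregation" can be pictured as follows: only the ID-level cuboid is stored, and a query on a derived attribute is answered by mapping each ID through the lookup table and re-aggregating. This is a simplified sketch with made-up data and names (users, id_cuboid, aggregate_by_derived); it is not how Kylin stores or scans cuboids internally.

```python
from collections import defaultdict

# Hypothetical lookup table: user ID (PK) -> derived attributes.
users = {
    1: {"gender": "F", "age": 30},
    2: {"gender": "M", "age": 41},
    3: {"gender": "F", "age": 25},
}

# Pre-aggregated cuboid keyed only by the ID dimension: ID -> SUM(amount).
id_cuboid = {1: 100.0, 2: 50.0, 3: 25.0}

def aggregate_by_derived(cuboid, lookup, attribute):
    """Roll an ID-level cuboid up to a derived attribute at query time."""
    totals = defaultdict(float)
    for key, value in cuboid.items():
        totals[lookup[key][attribute]] += value
    return dict(totals)

print(aggregate_by_derived(id_cuboid, users, "gender"))
```

The roll-up touches one row per ID, so the cost grows with the cardinality of the PK.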
The Order of Dimensions
• Finally, define dimensions in the following order.
– Mandatory dimensions
– Dimensions that are heavily used in filters
– High cardinality dimensions
– Low cardinality dimensions
• Putting filter dimensions first helps cut down query scan ranges.
• Putting high cardinality dimensions first helps calculate the cube
efficiently.
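One way to apply this ordering is a single sort with a composite key. The sketch below treats the list as a strict priority order and breaks ties by descending cardinality; that reading, the Dimension class, and the sample dimensions are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    cardinality: int
    mandatory: bool = False
    filter_heavy: bool = False

def order_dimensions(dims):
    """Mandatory first, then filter-heavy, then by descending cardinality."""
    return sorted(
        dims,
        key=lambda d: (not d.mandatory, not d.filter_heavy, -d.cardinality),
    )

dims = [
    Dimension("gender", 3),
    Dimension("seller_id", 500_000),
    Dimension("country", 200, filter_heavy=True),
    Dimension("date", 3_650, mandatory=True),
]
print([d.name for d in order_dimensions(dims)])
```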
Define Measures
• Kylin currently supports
– Sum
– Count
– Max
– Min
– Average
– Distinct Count (based on HyperLogLog)
• Distinct Count is a very heavy data type.
– An error rate below 1.22% takes 64 KB per cell.
– Convince the user to accept the widest tolerable error rate.
– Distinct Count is slower to build and query compared to other
measures.
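The 64 KB and 1.22% figures are consistent with a HyperLogLog sketch of 2^16 registers. The arithmetic below is a sketch under two assumptions that the deck does not state: one byte per register, and a quoted bound of roughly three times the standard HyperLogLog error estimate of 1.04 / sqrt(registers). The function name hll_cell is illustrative.

```python
import math

def hll_cell(precision_bits: int):
    """Return (registers, bytes per cell, approximate error bound) for a HyperLogLog sketch."""
    registers = 2 ** precision_bits
    size_bytes = registers                   # assumption: one byte per register
    std_error = 1.04 / math.sqrt(registers)  # standard HyperLogLog error estimate
    return registers, size_bytes, 3 * std_error  # assumption: bound is about 3 standard errors

for bits in (10, 12, 14, 16):
    registers, size_bytes, bound = hll_cell(bits)
    print(f"precision={bits}: {size_bytes // 1024} KB per cell, error < {bound:.2%}")
```

Dropping from 16 to 14 precision bits cuts the cell size by a factor of four while only doubling the error bound, which is the trade-off behind asking for the widest tolerable error rate.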
Incremental Build
• Kylin supports incremental builds along a time dimension, if enabled.
• After setting a start time, cube segments can be built daily (or at any
period), processing only the incremental data.
• A segment can be refreshed relatively cheaply to reflect changes in
the Hive table.
• As the number of segments increases, queries slow down a
bit.
• Merge segments to keep the total number below 10 for best
performance.
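A simple merge policy that follows the guideline above might look like this. It is only a sketch: segments are modelled as (start, end) date pairs with an exclusive end, and the choice to merge the two oldest adjacent segments first is an assumption of this example, not Kylin's own merge behaviour.

```python
from datetime import date, timedelta

MAX_SEGMENTS = 10  # guideline from the slide: keep the segment count below 10

def merge_oldest(segments, max_segments=MAX_SEGMENTS):
    """Merge the two oldest adjacent segments until fewer than max_segments remain."""
    segments = sorted(segments)
    while len(segments) >= max_segments:
        (start, _), (_, end) = segments[0], segments[1]
        segments[:2] = [(start, end)]
    return segments

# Twelve hypothetical daily segments starting 2015-01-01.
first = date(2015, 1, 1)
daily = [(first + timedelta(days=i), first + timedelta(days=i + 1)) for i in range(12)]
merged = merge_oldest(daily)
print(len(merged), merged[0])
```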
Advanced Options
• Leave advanced options as they are if you are not sure what they mean.
• Aggregation groups give the finest control over which cuboids to build.
– Partial cube: only combinations within the same group are built.
– For a cube with 30 dimensions, dividing the dimensions into 3 groups reduces the
cuboid count from about 1 billion to about 3 thousand.
• 2^30 => 2^10 + 2^10 + 2^10
– It is a tradeoff between online aggregation and offline pre-aggregation.
• A query is efficient when all involved dimensions come from a single aggregation
group; otherwise runtime aggregation will slow the query down.
– Capture query patterns with your aggregation groups.
– Keep fewer than 10 dimensions in one group, or the cube will be huge.
– A dimension can appear in multiple groups.
– Creating a second cube with different aggregation groups is also an option.
• Rowkeys are generated in the order of the dimensions; there is no need to change them.
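The cuboid arithmetic and the "single group" rule can be checked with a few lines. This sketch ignores cuboids shared between overlapping groups, and the group contents and function names are hypothetical.

```python
def full_cube_cuboids(num_dimensions: int) -> int:
    """Cuboid count of a full cube."""
    return 2 ** num_dimensions

def grouped_cuboids(group_sizes) -> int:
    """Cuboid count when each group is built as its own partial cube (overlaps ignored)."""
    return sum(2 ** size for size in group_sizes)

def covered_by_one_group(query_dims, groups) -> bool:
    """True if some single aggregation group contains every dimension in the query."""
    return any(set(query_dims) <= set(group) for group in groups)

print(full_cube_cuboids(30))          # about 1 billion
print(grouped_cuboids([10, 10, 10]))  # about 3 thousand

groups = [{"date", "country", "city"}, {"date", "seller", "category"}]
print(covered_by_one_group(["date", "city"], groups))    # served by the first group
print(covered_by_one_group(["city", "seller"], groups))  # spans two groups
```

A query like the last one is the case the slide warns about: it spans two groups, so no pre-built cuboid matches it exactly.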
Build and Verify
• Once the cube is created, build it; it is then ready to verify.
• Check the expansion rate of your cube.
– Under 10 times is ideal.
• Notes on the SQL
– Write queries against the original Hive tables; cubes are
transparent at query time.
– Sanity check: select count(*) from fact
– Make sure the join relationships (inner or left) match the cube
definition exactly.
– Kylin works best with a group by clause.
– Write date constants as date '1970-01-01'
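The expansion rate check can be scripted once the two sizes are known. The sketch below assumes the expansion rate is the cube size divided by the source data size; the byte counts and function names are invented for illustration.

```python
IDEAL_MAX_EXPANSION = 10.0  # guideline from the slide: under 10 times is ideal

def expansion_rate(cube_bytes: int, source_bytes: int) -> float:
    """Ratio of cube size to source data size (assumed definition)."""
    return cube_bytes / source_bytes

def is_ideal(cube_bytes: int, source_bytes: int) -> bool:
    """True when the expansion rate is under the guideline."""
    return expansion_rate(cube_bytes, source_bytes) < IDEAL_MAX_EXPANSION

# Hypothetical sizes: 2 GiB of source data and a 7 GiB cube.
source_bytes = 2 * 1024 ** 3
cube_bytes = 7 * 1024 ** 3
print(expansion_rate(cube_bytes, source_bytes), is_ideal(cube_bytes, source_bytes))
```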
Q & A
Thanks!
