Kylin and Druid Presentation
INTRODUCTION TO OLAP
● OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view.
● OLAP allows users to analyze database information from multiple database systems at one time.
● OLAP data is stored in multidimensional databases.
DATA WAREHOUSING ARCHITECTURE
[Figure: External sources and operational databases are extracted, transformed, loaded, and refreshed into the data warehouse, which has a metadata repository and monitoring & administration. The warehouse serves OLAP servers, which feed analysis, query/reporting, and data mining tools.]
● A multidimensional cube can combine data from disparate data sources and store the information in a fashion that is logical for business users.
THE OLAP CUBE
● An OLAP cube is a data structure that allows fast analysis of data.
● The arrangement of data into cubes overcomes a limitation of relational databases, which are not well suited to fast, interactive analysis of large volumes of data.
● The OLAP cube consists of numeric facts called measures, which are categorized by dimensions.
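A minimal sketch of the idea of measures categorized by dimensions, assuming a tiny hypothetical fact table (the product, region, and quarter dimensions and the sales figures are invented for illustration). The cube is just a mapping from a tuple of dimension values to an aggregated measure, and a roll-up sums over the dimensions that are dropped.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter) are dimensions, sales is the measure.
facts = [
    ("laptop", "east", "Q1", 100.0),
    ("laptop", "west", "Q1", 150.0),
    ("phone", "east", "Q1", 80.0),
    ("phone", "east", "Q2", 120.0),
]

DIMENSIONS = ("product", "region", "quarter")


def build_cube(rows):
    """Sum the measure for every distinct combination of dimension values."""
    cube = defaultdict(float)
    for *dims, sales in rows:
        cube[tuple(dims)] += sales
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate the cube down to the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(float)
    for key, value in cube.items():
        result[tuple(key[i] for i in idx)] += value
    return dict(result)


cube = build_cube(facts)
by_product = roll_up(cube, ["product"])
print(by_product)
```

Rolling up to other dimension subsets, for example `roll_up(cube, ["region", "quarter"])`, gives a different view of the same measure.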
[Figure: OLAP cube]
TWO TYPES OF DATABASE ACTIVITY
● OLTP (Online Transaction Processing)
● OLAP (Online Analytical Processing)
OLTP vs. OLAP
● On-Line Transaction Processing (OLTP):
– technology used to perform updates on operational or transactional systems (e.g., point-of-sale systems)
● On-Line Analytical Processing (OLAP):
– technology used to perform complex analysis of the data in a data warehouse
OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the dimensionality of the enterprise as understood by the user.
[source: OLAP Council: www.olapcouncil.org]
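The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `sales` table and its rows are hypothetical, and a real deployment would run the analytical query against a data warehouse rather than the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, item TEXT, amount REAL)"
)

# OLTP-style work: many small writes, each touching a single row.
with conn:
    conn.executemany(
        "INSERT INTO sales (store, item, amount) VALUES (?, ?, ?)",
        [("north", "tea", 4.0), ("north", "coffee", 6.0), ("south", "tea", 5.0)],
    )
    conn.execute("UPDATE sales SET amount = 4.5 WHERE id = 1")

# OLAP-style work: one read-only query that scans and aggregates many rows.
totals = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```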
[Figure: OLTP vs. OLAP]
TYPES OF OLAP
● Relational OLAP (ROLAP):
– Relational and specialized relational DBMS to store and manage warehouse data
– OLAP middleware to support missing pieces:
  ◦ Optimization for each DBMS backend
  ◦ Aggregation navigation logic
  ◦ Additional tools and services
– Examples: MicroStrategy, MetaCube (Informix)
– Extended RDBMS with multidimensional data mapped to standard relational operations
● Multidimensional OLAP (MOLAP):
– Array-based storage structures
– Direct access to array data structures
– Operations implemented directly on multidimensional data
– Example: Essbase (Arbor)
● Hybrid OLAP (HOLAP):
– A hybrid approach in which the aggregated totals are stored in a multidimensional database while the detail data is stored in the relational database. This balances the data efficiency of the ROLAP model against the performance of the MOLAP model.
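The HOLAP split can be sketched as follows, assuming a hypothetical `sales` table: a plain dictionary stands in for the multidimensional store of aggregated totals, and an in-memory SQLite database stands in for the relational store of detail rows. It shows only the routing idea, not how any particular product implements it.

```python
import sqlite3

# Relational side: detail rows (stand-in for the relational database).
detail = sqlite3.connect(":memory:")
detail.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
detail.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "tea", 4.0), ("east", "coffee", 6.0), ("west", "tea", 5.0)],
)

# Multidimensional side: pre-aggregated totals per region (stand-in for the cube).
totals_by_region = dict(
    detail.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)


def total_for(region):
    """Summary query: answered from the pre-aggregated store, no table scan."""
    return totals_by_region[region]


def drill_down(region):
    """Detail query: falls through to the relational store."""
    cur = detail.execute(
        "SELECT product, amount FROM sales WHERE region = ? ORDER BY product",
        (region,),
    )
    return cur.fetchall()


print(total_for("east"), drill_down("east"))
```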
ROLAP vs. MOLAP
Characteristic | ROLAP                                                              | MOLAP
---------------+--------------------------------------------------------------------+-------------------------------------------------------------------------
Schema         | Uses star schema; additional dimensions can be added dynamically   | Uses data cubes; additional dimensions require recreating the data cube
Database size  | Medium to large                                                    | Small to medium
Architecture   | Client/server                                                      | Client/server
Access         | Supports ad-hoc requests                                           | Limited to pre-defined dimensions
Resources      | High                                                               | Very high
Flexibility    | High                                                               | Low
Scalability    | High                                                               | Low
Speed          | Good with small data sets; average for medium to large data sets   | Faster for small to medium data sets; average for large data sets
BENEFITS OF OLAP
● One main benefit of OLAP is consistency of information and calculations.
● "What if" scenarios are some of the most popular uses of OLAP software and are made far more practical by multidimensional processing.
● It allows a manager to pull data from an OLAP database in broad or specific terms.
● OLAP creates a single platform for all information and business needs: planning, budgeting, forecasting, reporting, and analysis.
BENEFITS OF OLAP (contd.)
● Marketing and sales analysis
● Consumer goods industries
● Financial services industry (insurance, banks, etc.)
● Database marketing
Apache Kylin – What?
● Open source
● Distributed Analytics Engine
● Provides SQL interface
● Multi-dimensional analysis (OLAP) on Hadoop
● Faster and more user-responsive than relational online
analytical processing (ROLAP)
The Fundamental Idea
● The idea behind Kylin is not brand new.
● Its techniques are to store pre-calculated results to serve analysis queries, to generate each level's cuboids with all possible combinations of dimensions, and to calculate all metrics at the different levels.
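A small sketch of cuboid generation, assuming three hypothetical dimensions (year, country, category) and a single SUM measure over an invented `price` column. With n dimensions there are 2^n cuboids, one per subset of dimensions. This is not Kylin's build code, only the combinatorial idea behind it.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "country", "category")

# Hypothetical fact rows.
rows = [
    {"year": 2016, "country": "US", "category": "toys", "price": 10.0},
    {"year": 2016, "country": "US", "category": "books", "price": 20.0},
    {"year": 2017, "country": "DE", "category": "toys", "price": 30.0},
]


def all_cuboids(dimensions):
    """Every subset of the dimensions, from the full (base) cuboid down to the empty one."""
    for size in range(len(dimensions), -1, -1):
        yield from combinations(dimensions, size)


def build_cuboid(rows, dims, measure="price"):
    """Pre-aggregate SUM(measure) grouped by the given dimensions."""
    agg = defaultdict(float)
    for row in rows:
        agg[tuple(row[d] for d in dims)] += row[measure]
    return dict(agg)


cube = {dims: build_cuboid(rows, dims) for dims in all_cuboids(DIMENSIONS)}
print(len(cube))
print(cube[("year",)])
```

Because the number of cuboids doubles with every added dimension, the choice of dimensions has a direct effect on build time and storage.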
From Relational to Key-Value
● Pre-computation avoids large table scans and the long delay before an answer comes back.
● It therefore makes sense to calculate those values once and store them for later use.
● This process generates all of the dimension combinations and their measure values.
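One way to picture the move from relational rows to key-value pairs is sketched below. The key layout (a bitmask identifying the cuboid, followed by the dimension values) is an assumption made for illustration and is not claimed to be Kylin's actual row-key format; the dimensions and stored values are hypothetical.

```python
DIMENSIONS = ("year", "country", "category")


def cuboid_id(dims):
    """Bitmask with one bit per dimension present in the cuboid."""
    mask = 0
    for position, name in enumerate(DIMENSIONS):
        if name in dims:
            mask |= 1 << (len(DIMENSIONS) - 1 - position)
    return mask


def make_key(dims, values):
    """Flat key: cuboid id (as 3 bits) followed by the dimension values."""
    return "|".join([format(cuboid_id(dims), "03b"), *map(str, values)])


# Hypothetical pre-aggregated results stored as key -> measure value.
store = {
    make_key(("year", "country"), (2016, "US")): 30.0,
    make_key(("year",), (2016,)): 30.0,
    make_key(("year",), (2017,)): 30.0,
}


def lookup(dims, values):
    """Answer an aggregate query with one key lookup instead of a table scan."""
    return store.get(make_key(dims, values))


print(make_key(("year", "country"), (2016, "US")), lookup(("year",), (2017,)))
```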
Github Page
How It Works
● Read source data from Hive (which is stored on HDFS)
● Run MapReduce jobs to pre-calculate the cube
● Store cube data in HBase
● Leverage ZooKeeper for job coordination
Apache Foundation Blog, December 2015
● Apache Kylin is the best OLAP engine on Big Data so far.
● While other OLAP engines struggle with the data volume, Kylin enables query responses in milliseconds.
● Kylin is starting to be used as a near-real-time data streaming storage and analytics engine.
Advantages
● Kylin integrates well with BI tools such as Tableau and Excel.
● Kylin supports MOLAP cubes and performs very well for complex queries on data sets with billions of rows.
Limitations
● Real-time support has not been built yet.
● Kylin only supports the star schema. You are limited to a single fact table for each cube.
Druid – Key Features
● Open source.
● Distributed architecture.
● Real-time ingestion.
● Column-oriented for speed.
● Fast filtering.
● Operational simplicity.
● Support for OLAP queries.
Druid Architecture
Types of Nodes:
Historical Nodes
➢ Backbone of the Druid cluster.
➢ Download segments and serve queries over them.
Broker Nodes
➢ Clients send queries to a broker node to get data from Druid.
➢ Scatter queries to the nodes that hold the relevant segments (brokers know the segments' locations).
➢ Gather and merge the results.
Coordinator Nodes
➢ Manage segments on historical nodes.
➢ Load new segments, drop old segments, and move segments to balance load.
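The division of labor between the three node types can be sketched with toy classes. Everything here is a simplified assumption for illustration: the class names, the segment ids, the (page, clicks) rows, and the "fewest segments" placement rule are hypothetical and are not Druid's API or its actual balancing strategy.

```python
from collections import Counter


class HistoricalNode:
    """Holds segments and answers queries over them."""

    def __init__(self):
        self.segments = {}  # segment id -> list of (page, clicks) rows

    def load(self, segment_id, rows):
        self.segments[segment_id] = rows

    def drop(self, segment_id):
        self.segments.pop(segment_id, None)

    def query(self, segment_id):
        """Partial result: clicks per page within one segment."""
        partial = Counter()
        for page, clicks in self.segments[segment_id]:
            partial[page] += clicks
        return partial


class Coordinator:
    """Assigns each new segment to the historical node holding the fewest segments."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.locations = {}  # segment id -> node

    def assign(self, segment_id, rows):
        node = min(self.nodes, key=lambda n: len(n.segments))
        node.load(segment_id, rows)
        self.locations[segment_id] = node


class Broker:
    """Scatters a query to the nodes holding the segments, then merges the partial results."""

    def __init__(self, locations):
        self.locations = locations

    def query(self, segment_ids):
        merged = Counter()
        for segment_id in segment_ids:
            merged += self.locations[segment_id].query(segment_id)
        return merged


nodes = [HistoricalNode(), HistoricalNode()]
coordinator = Coordinator(nodes)
coordinator.assign("2017-01-01", [("home", 3), ("about", 1)])
coordinator.assign("2017-01-02", [("home", 2)])
broker = Broker(coordinator.locations)
result = broker.query(["2017-01-01", "2017-01-02"])
print(result)
```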
Ingestion Methods
● Streaming (real-time):
– Used when the dataset originates in a streaming system such as Kafka.
– Kafka lets you process streams of records as they occur.
– The Kafka cluster stores streams of records in categories called topics.
– Each record consists of a key, a value, and a timestamp.
● File-based (batch):
– Load data from HDFS, local files, etc. in batches.
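The two ingestion styles can be contrasted with a small in-memory model. This sketch does not use a Kafka client or any Druid API; `Record`, `TopicLog`, and `batch_load` are hypothetical names that only mirror the concepts of a record (key, value, timestamp), a topic, and batch loading.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Record:
    """A record as described above: a key, a value, and a timestamp."""
    key: str
    value: dict
    timestamp: float = field(default_factory=time.time)


class TopicLog:
    """In-memory stand-in for a set of topics; not a Kafka client."""

    def __init__(self):
        self._topics = defaultdict(list)

    def append(self, topic, record):
        self._topics[topic].append(record)
        return len(self._topics[topic]) - 1  # offset of the new record

    def read_from(self, topic, offset):
        """Streaming-style read: everything at or after `offset`."""
        return self._topics[topic][offset:]


def batch_load(rows, batch_size):
    """File-based style: yield rows in fixed-size batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```

A streaming consumer would keep calling `read_from` with its last seen offset, while a batch job would iterate over `batch_load` once per file.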
Segments
● Druid stores its index in segment files, which are partitioned by time (timestamp).
● Data structure of a segment file:
– Columnar: the data for each column is laid out in separate data structures.
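Time partitioning and columnar layout can be sketched together. The rows below are hypothetical, and the hourly granularity is an assumption chosen for the example; the point is only that rows are first grouped by time bucket and then stored column by column inside each bucket.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical input rows.
rows = [
    {"timestamp": "2011-01-01T00:01:35Z", "page": "Justin Bieber", "added": 25},
    {"timestamp": "2011-01-01T00:03:45Z", "page": "Justin Bieber", "added": 42},
    {"timestamp": "2011-01-01T01:01:00Z", "page": "Ke$ha", "added": 17},
]


def hour_bucket(ts):
    """Truncate an ISO-8601 UTC timestamp to the start of its hour."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return parsed.replace(minute=0, second=0).isoformat()


def build_segments(rows):
    """Group rows by hour, then store each segment column by column."""
    segments = defaultdict(lambda: defaultdict(list))
    for row in rows:
        segment = segments[hour_bucket(row["timestamp"])]
        for column, value in row.items():
            segment[column].append(value)
    return {bucket: dict(columns) for bucket, columns in segments.items()}


segments = build_segments(rows)
for bucket, columns in sorted(segments.items()):
    print(bucket, columns["added"])
```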
● A segment consists of the timestamp column, dimension columns, and metric columns.
● The timestamp and metric columns are simple: each is an array of integer or floating-point values. Values in metric columns are pulled out to perform aggregations.
● Dimension columns are different because they support filter and group-by operations. Each dimension column requires:
➢ A dictionary that encodes column values:
{
"Justin Bieber": 0,
"Ke$ha": 1
}
➢ Column data:
[0, 0, 1, 1]
➢ Bitmaps, one for each unique value of the column:
value="Justin Bieber": [1,1,0,0]
value="Ke$ha": [0,0,1,1]
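The three structures for a dimension column can be built in a few lines. This is a sketch under stated assumptions: ids are assigned in order of first appearance, bitmaps are plain Python lists rather than compressed bitmaps, and the `added` metric values are hypothetical.

```python
def encode_dimension(values):
    """Return (dictionary, column data, bitmaps) for one dimension column."""
    dictionary = {}
    column = []
    for value in values:
        code = dictionary.setdefault(value, len(dictionary))
        column.append(code)
    bitmaps = {
        value: [1 if code == target else 0 for code in column]
        for value, target in dictionary.items()
    }
    return dictionary, column, bitmaps


def bitmap_or(a, b):
    """Rows matching either value (an OR filter)."""
    return [x | y for x, y in zip(a, b)]


def rows_matching(bitmap):
    """Row indexes whose bit is set."""
    return [i for i, bit in enumerate(bitmap) if bit]


page = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
added = [25, 42, 17, 170]  # hypothetical metric column

dictionary, column, bitmaps = encode_dimension(page)
kesha_rows = rows_matching(bitmaps["Ke$ha"])
kesha_added = sum(added[i] for i in kesha_rows)
print(dictionary, column, bitmaps, kesha_added)
```

A filter on a dimension value reads only that value's bitmap to find the matching rows, and the metric array is then touched only at those positions.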
Druid vs. Apache Kylin
Characteristic         | Druid                               | Apache Kylin
-----------------------+-------------------------------------+------------------------------------------------------------
Query speed            | Very fast                           | Fast
Type of analysis       | Real-time analysis                  | Focuses on OLAP cases; real-time analysis under development
SQL support            | Absent                              | Present
Fault tolerance        | All nodes                           | Needs to be set up
BI tools integration   | Under development                   | Present (Tableau or Excel)
Integration with Kafka | Present                             | Absent
Complex queries        | Bad for big data sets               | Good performance
Storage type           | Bitmap index                        | OLAP cube
Underlying technology  | Own computation and storage cluster | Hadoop for cube build, HBase for storage
Miscellaneous Points to Consider
● Druid has limited support for table joins.
● Apache Kylin supports the star schema.
● Modern corporations are increasingly looking for near-real-time analytics and insights to make actionable decisions.
● Druid is working toward integration with BI tools through Apache Hive, an effort at Hortonworks. (https://ko.hortonworks.com/blog/apache-hive-druid-part-1-3/)
● Earlier versions of Druid were released under the GPL v2 license. The latest version of Druid is under the Apache License v2, as is Apache Kylin.
● At the time of this presentation, Druid's GitHub project has 181 contributors, whereas Apache Kylin's has 60.
References - OLAP & OLTP
● http://en.wikipedia.org/wiki/Online_analytical_processing
● http://www.dmreview.com/issues/19971101/964-l.html
● http://en.wikipedia.org/wiki/Extract,_transform,_load
● http://www.olapreport.com/Applications.html
References - Apache Kylin
● https://mail-archives.apache.org/mod_mbox/kylin-dev/201503.mbox/%3CCAKmQrOY0fjZLUU0MGo5aajZ2uLb3T0qJknHQd+Wv1oxd5PKixQ@mail.gmail.com%3E
● https://dzone.com/articles/apache-kylin-for-olap-on-hadoop
● http://kylin.apache.org/docs16/
● https://github.com/apache/kylin
● https://resources.zaloni.com/blog/apache-kylin-for-olap-on-hadoop
● https://en.wikipedia.org/wiki/Apache_Kylin
● http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/
References - Druid
● http://druid.io/docs/latest/design/
● http://druid.io/docs/latest/tutorials/ingestion.html
● http://druid.io/docs/latest/design/segments.html
● https://en.wikipedia.org/wiki/Druid_(open-source_data_store)
References - Druid vs Apache Kylin
● https://www.slideshare.net/freepsw/olap-for-big-data-druid-vs-apache-kylin-vs-apache-lens
● http://markmail.org/message/mf6gfzdwfqwtbtv6#query:+page:1+mid:sp7ek7x5pawjlxb6+state:results
● https://ko.hortonworks.com/blog/apache-hive-druid-part-1-3/
● https://github.com/druid-io/druid
● https://github.com/apache/kylin
THANK YOU!!
