Kylin and Druid Presentation

INTRODUCTION TO OLAP
● OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view.
● OLAP allows users to analyze database information from multiple database systems at one time.
● OLAP data is stored in multidimensional databases.

DATA WAREHOUSING ARCHITECTURE
[Figure: data warehousing architecture. External sources and operational databases are extracted, transformed, loaded, and refreshed into the data warehouse, which has a metadata repository and monitoring & administration; the warehouse serves OLAP servers, which feed analysis, query/reporting, and data mining tools.]

● A multidimensional cube can combine data from disparate data sources and store the information in a fashion that is logical for business users.

THE OLAP CUBE
● An OLAP cube is a data structure that allows fast analysis of data.
● Arranging data into cubes overcomes a limitation of relational databases, which are not well suited to near-instantaneous analysis and display of large amounts of data.
● An OLAP cube consists of numeric facts called measures, which are categorized by dimensions.
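
As a minimal sketch of the measure/dimension idea above, the following Python sums one measure for every combination of the chosen dimensions. The fact rows, the dimension names, and the `roll_up` helper are all hypothetical and exist only for illustration; a real OLAP server stores and indexes cubes very differently.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions (region, product, year)
# and one measure (sales). Invented for illustration only.
FACTS = [
    {"region": "EU", "product": "laptop", "year": 2016, "sales": 120.0},
    {"region": "EU", "product": "phone", "year": 2016, "sales": 80.0},
    {"region": "US", "product": "laptop", "year": 2016, "sales": 200.0},
    {"region": "US", "product": "laptop", "year": 2017, "sales": 150.0},
]


def roll_up(facts, dimensions, measure="sales"):
    """Sum `measure` for every distinct combination of the given dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_region = roll_up(FACTS, ["region"])
by_region_year = roll_up(FACTS, ["region", "year"])
print(by_region)
print(by_region_year)
```

Passing an empty dimension list yields the grand total, which corresponds to the most aggregated view of the cube.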

TWO TYPES OF DATABASE ACTIVITY
● OLTP
◦ (Online Transaction Processing)
● OLAP
◦ (Online Analytical Processing)

OLTP vs. OLAP
● Online Transaction Processing (OLTP):
– technology used to perform updates on operational or transactional systems (e.g., point-of-sale systems)
● Online Analytical Processing (OLAP):
– technology used to perform complex analysis of the data in a data warehouse
"OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the dimensionality of the enterprise as understood by the user."
[source: OLAP Council: www.olapcouncil.org]

TYPES OF OLAP
● Relational OLAP (ROLAP):
◦ Relational and specialized relational DBMSs store and manage the warehouse data
◦ OLAP middleware supports the missing pieces
◦ Optimized for each DBMS backend
◦ Aggregation navigation logic
◦ Additional tools and services
◦ Example: MicroStrategy, MetaCube (Informix)
◦ An extended RDBMS maps multidimensional data onto standard relational operations
● Multidimensional OLAP (MOLAP):
◦ Array-based storage structures
◦ Direct access to array data structures
◦ Operations implemented directly on multidimensional data
◦ Example: Essbase (Arbor)
● Hybrid Online Analytical Processing (HOLAP):
A hybrid approach in which the aggregated totals are stored in a multidimensional database while the detail data is stored in a relational database. It balances the data efficiency of the ROLAP model against the performance of the MOLAP model.

ROLAP vs. MOLAP
Characteristic | ROLAP | MOLAP
Schema | Uses a star schema; additional dimensions can be added dynamically | Uses data cubes; additional dimensions require re-creating the data cube
Database size | Medium to large | Small to medium
Architecture | Client/server | Client/server
Access | Supports ad-hoc requests | Limited to predefined dimensions
Resources | High | Very high
Flexibility | High | Low
Scalability | High | Low
Speed | Good with small data sets; average for medium to large data sets | Faster for small to medium data sets; average for large data sets

BENEFITS OF OLAP
● One main benefit of OLAP is consistency of information and calculations.
● "What if" scenarios are among the most popular uses of OLAP software, and multidimensional processing makes them far more practical.
● OLAP lets a manager pull data from an OLAP database in broad or specific terms.
● OLAP creates a single platform for all information and business needs: planning, budgeting, forecasting, reporting, and analysis.

BENEFITS OF OLAP (contd.): application areas
● Marketing and sales analysis
● Consumer goods industries
● Financial services industry (insurance, banks, etc.)
● Database marketing

Apache Kylin – What?
● Open source
● Distributed Analytics Engine
● Provides SQL interface
● Multi-dimensional analysis (OLAP) on Hadoop
● Faster and more user-responsive than relational online
analytical processing (ROLAP)

The Fundamental Idea
● The idea behind Kylin is not brand new.
● It builds on established techniques: storing pre-calculated results to serve analysis queries, generating each level's cuboids from all possible combinations of dimensions, and calculating all metrics at each level.
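
To make "all possible combinations of dimensions" concrete, here is a small Python sketch that enumerates the cuboids of a cube level by level. The dimension names and the `enumerate_cuboids` helper are assumptions made for illustration; this is not Kylin's build code, only the combinatorics behind it. A full cube over N dimensions has 2^N cuboids.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, grouped by level.

    The level is the number of dimensions a cuboid keeps; the top level
    keeps all of them and level 0 is the grand total.
    """
    levels = {}
    for size in range(len(dimensions), -1, -1):
        levels[size] = [tuple(c) for c in combinations(dimensions, size)]
    return levels


# Hypothetical dimension names, chosen only for illustration.
cuboids = enumerate_cuboids(["region", "product", "year"])
for level, members in cuboids.items():
    print(level, members)

total = sum(len(members) for members in cuboids.values())
print(total)  # 2**3 = 8 cuboids for 3 dimensions
```

The exponential growth in the number of cuboids is why the dimension count matters so much for cube build time and storage.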

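From Relational to Key-Value
● Serving queries from stored key-value results avoids large table scans and the long delays they cause.
● Because an aggregate over unchanged data returns the same values every time, it makes sense to calculate those values once and store them for later use.
● The build process generates every dimension combination together with its measure values.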
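
The move from table scans to key lookups can be sketched as follows. This is a toy under stated assumptions: the rows, the key layout (a tuple of dimension/value pairs), and the `build_store` helper are hypothetical, and a plain Python dict stands in for the key-value store.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows, invented for illustration.
ROWS = [
    {"region": "EU", "product": "laptop", "sales": 120.0},
    {"region": "EU", "product": "phone", "sales": 80.0},
    {"region": "US", "product": "laptop", "sales": 200.0},
]
DIMENSIONS = ("region", "product")


def build_store(rows, dimensions, measure="sales"):
    """Pre-aggregate the measure for every combination of dimensions.

    Each key is a tuple of (dimension, value) pairs; each value is the
    summed measure for the rows matching that key.
    """
    store = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = tuple((d, row[d]) for d in dims)
                store[key] += row[measure]
    return dict(store)


STORE = build_store(ROWS, DIMENSIONS)

# "Total sales where region = EU" is now a single key lookup, not a scan.
print(STORE[(("region", "EU"),)])
print(STORE[()])  # grand total
```

The cost is paid once at build time; every later query that matches a stored key is answered without touching the raw rows.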

How It Works
● Read data from Hive (which is stored on HDFS)
● Run MapReduce jobs to pre-calculate the cube
● Store cube data in HBase
● Leverage ZooKeeper for job coordination

Apache Software Foundation Blog, December 2015
● Describes Apache Kylin as the best OLAP engine on Big Data so far.
● Reports that while other OLAP engines struggle with the data volume, Kylin enables query responses in milliseconds.
● Notes that users are starting to leverage Kylin as a near-real-time streaming storage and analytics engine.

Advantages
● Kylin integrates well with BI tools such as Tableau or Excel.
● Kylin supports MOLAP cubes and performs very well for complex queries on billion-row data sets.

Limitations
● Real-time support has not yet been built.
● Kylin only supports the star schema. You are limited to a
single fact table for each cube.

Druid – Key Features
● Open source.
● Distributed architecture.
● Real-time ingestion.
● Column-oriented for speed.
● Fast filtering.
● Operational simplicity.
● Support for OLAP queries.

Druid Architecture
Types of Nodes:
Historical Nodes
➢ Backbone of a Druid cluster.
➢ Download segments and serve queries over them.
Broker Nodes
➢ Clients send queries to a broker node to get data from Druid.
➢ Know where the segments are located, so they can scatter each query to the right historical nodes.
➢ Gather and merge the results.
Coordinator Nodes
➢ Manage segments on historical nodes.
➢ Load new segments, drop old segments, and move segments to balance load.
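
The broker's scatter/gather role described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the node names, the segment identifiers, the in-memory `NODE_DATA`, and both functions. It is not Druid's API and involves no networking; it only shows a query being routed by segment location and the partial results being merged.

```python
from collections import Counter

# Hypothetical cluster state: which historical node serves which segment,
# and the rows each node has loaded. All names are illustrative.
SEGMENT_LOCATIONS = {
    "2017-01-01": "historical-1",
    "2017-01-02": "historical-2",
    "2017-01-03": "historical-1",
}
NODE_DATA = {
    "historical-1": {
        "2017-01-01": [("Justin Bieber", 3), ("Ke$ha", 1)],
        "2017-01-03": [("Ke$ha", 4)],
    },
    "historical-2": {
        "2017-01-02": [("Justin Bieber", 2)],
    },
}


def query_historical(node, segments):
    """Historical node: aggregate counts over the segments it serves."""
    partial = Counter()
    for segment in segments:
        for artist, count in NODE_DATA[node][segment]:
            partial[artist] += count
    return partial


def broker_query(segments):
    """Broker: scatter the query by segment location, then merge the partials."""
    plan = {}
    for segment in segments:
        plan.setdefault(SEGMENT_LOCATIONS[segment], []).append(segment)
    merged = Counter()
    for node, node_segments in plan.items():
        merged.update(query_historical(node, node_segments))
    return dict(merged)


result = broker_query(["2017-01-01", "2017-01-02", "2017-01-03"])
print(result)
```

Because each historical node returns an already-aggregated partial result, the broker only has to merge small result sets rather than raw rows.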

Ingestion Methods
● Streaming (real-time):
– Use when your dataset originates in a streaming system such as Kafka.
– Kafka lets you process streams of records as they occur.
– The Kafka cluster stores streams of records in categories called topics.
– Each record consists of a key, a value, and a timestamp.
● File-based (batch):
– Load data from HDFS, local files, etc. in batches.
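
The record and topic model from the streaming bullets above can be pictured with a toy in-memory log. This is a sketch only: `Record` and `InMemoryLog` are hypothetical classes written for this example and are not the Kafka client API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A stream record: a key, a value, and a timestamp (ms since epoch)."""
    key: str
    value: str
    timestamp: int


class InMemoryLog:
    """Toy stand-in for a cluster that stores records in named topics."""

    def __init__(self):
        self._topics = defaultdict(list)

    def append(self, topic, record):
        """Add a record to the end of a topic."""
        self._topics[topic].append(record)

    def read(self, topic, offset=0):
        """Return records in arrival order, starting at `offset`."""
        return self._topics[topic][offset:]


log = InMemoryLog()
log.append("page-views", Record("user-1", "/home", 1483228800000))
log.append("page-views", Record("user-2", "/cart", 1483228801000))
print(log.read("page-views", offset=1))
```

A streaming ingestion task would read from such a topic continuously, whereas batch ingestion would load a complete file in one pass.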

Segments
● Druid stores its index in segment files, which are partitioned by time (timestamp).
● Data structure of a segment file:
– Columnar: the data for each column is laid out in separate data structures.
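
Time partitioning can be illustrated by bucketing rows on their timestamp. The sketch below assumes hourly or daily buckets and millisecond UTC timestamps; the row contents and the `segment_key` and `partition_rows` helpers are hypothetical, not Druid's segment format or naming scheme.

```python
from collections import defaultdict
from datetime import datetime, timezone


def segment_key(timestamp_ms, granularity="hour"):
    """Return the start of the time bucket (ISO string) a row falls into."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    if granularity == "hour":
        bucket = moment.replace(minute=0, second=0, microsecond=0)
    elif granularity == "day":
        bucket = moment.replace(hour=0, minute=0, second=0, microsecond=0)
    else:
        raise ValueError(f"unsupported granularity: {granularity}")
    return bucket.isoformat()


def partition_rows(rows, granularity="hour"):
    """Group rows into per-interval buckets keyed by interval start."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_key(row["timestamp"], granularity)].append(row)
    return dict(segments)


# Hypothetical rows; timestamps are milliseconds since the Unix epoch (UTC).
ROWS = [
    {"timestamp": 1483228800000, "artist": "Justin Bieber", "plays": 3},  # 2017-01-01T00:00:00Z
    {"timestamp": 1483230600000, "artist": "Ke$ha", "plays": 1},          # 2017-01-01T00:30:00Z
    {"timestamp": 1483232400000, "artist": "Ke$ha", "plays": 4},          # 2017-01-01T01:00:00Z
]
segments = partition_rows(ROWS)
print(sorted(segments))
```

A query restricted to a time range then only needs the buckets whose intervals overlap that range.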

● A segment consists of the timestamp column, dimension columns, and metric columns.
● The timestamp and metric columns are simple: each is an array of integer or floating-point values. Values in metric columns are pulled out to perform aggregations.
● Dimension columns are different because they support filter and group-by operations. Each one requires three data structures:
➢ A dictionary that encodes column values as integer IDs
{
"Justin Bieber": 0,
"Ke$ha": 1
}
➢ Column data: the column's values, encoded with the dictionary
[0, 0, 1, 1]
➢ Bitmaps, one for each unique value of the column, marking the rows that contain that value
value="Justin Bieber": [1,1,0,0]
value="Ke$ha": [0,0,1,1]
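
The three structures above can be rebuilt from a raw column in a few lines. This is a minimal sketch assuming plain Python lists for the bitmaps (real segments use compressed bitmap formats); `encode_dimension` and `rows_matching_any` are hypothetical helper names, not Druid functions.

```python
def encode_dimension(values):
    """Build the three structures for one dimension column.

    Returns (dictionary, column_data, bitmaps):
      dictionary  maps each distinct value to an integer ID (sorted order),
      column_data is the column rewritten as those IDs,
      bitmaps     holds, per value, a 0/1 list marking the rows containing it.
    """
    dictionary = {value: idx for idx, value in enumerate(sorted(set(values)))}
    column_data = [dictionary[value] for value in values]
    bitmaps = {
        value: [1 if row_value == value else 0 for row_value in values]
        for value in dictionary
    }
    return dictionary, column_data, bitmaps


def rows_matching_any(bitmaps, wanted):
    """OR the bitmaps of the wanted values and return matching row indexes."""
    n_rows = len(next(iter(bitmaps.values())))
    combined = [0] * n_rows
    for value in wanted:
        combined = [a | b for a, b in zip(combined, bitmaps[value])]
    return [i for i, bit in enumerate(combined) if bit]


column = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
dictionary, column_data, bitmaps = encode_dimension(column)
print(dictionary)   # {'Justin Bieber': 0, 'Ke$ha': 1}
print(column_data)  # [0, 0, 1, 1]
print(bitmaps)
print(rows_matching_any(bitmaps, ["Ke$ha"]))  # [2, 3]
```

Filters become bitwise operations on the bitmaps, so only the matching rows of the metric columns need to be read for aggregation.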

Druid vs Apache Kylin
Feature | DRUID | APACHE KYLIN
Query speed | Very fast | Fast
Type of analysis | Real-time analysis | Focuses on OLAP cases; real-time analysis under development
SQL support | Absent | Present
Fault tolerance | All nodes | Need to set up
BI tools integration | Under development | Present (Tableau or Excel)
Integration with Kafka | Present | Absent
Complex queries | Bad for big data sets | Good performance
Storage type | Bitmap index | OLAP cube
Underlying technology | Own computation and storage cluster | Hadoop for cube build, HBase for storage

Miscellaneous Points to Consider…
● Druid has limited support for table joins.
● Apache Kylin supports the star schema.
● Modern corporations increasingly look for near-real-time analytics and insights to make actionable decisions.
● Druid is working toward BI tool integration through Apache Hive, at Hortonworks.
(https://ko.hortonworks.com/blog/apache-hive-druid-part-1-3/)
● Earlier versions of Druid were released under the GPL v2 license. The latest version of Druid is under the Apache License v2, as is Apache Kylin.
● Druid's GitHub project has 181 contributors, whereas Apache Kylin's has 60.

References - OLAP & OLTP
● http://en.wikipedia.org/wiki/Online_analytical_processing
● http://www.dmreview.com/issues/19971101/964-l.html
● http://en.wikipedia.org/wiki/Extract,_transform,_load
● http://www.olapreport.com/Applications.html

References - Apache Kylin
● https://mail-archives.apache.org/mod_mbox/kylin-dev/201503.mbox/%3CCAKmQrOY0fjZLUU0MGo5aajZ2uLb3T0qJknHQd+Wv1oxd5PKixQ@mail.gmail.com%3E
● https://dzone.com/articles/apache-kylin-for-olap-on-hadoop
● http://kylin.apache.org/docs16/
● https://github.com/apache/kylin
● https://resources.zaloni.com/blog/apache-kylin-for-olap-on-hadoop
● https://en.wikipedia.org/wiki/Apache_Kylin
● http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/

References - Druid
● http://druid.io/docs/latest/design/
● http://druid.io/docs/latest/tutorials/ingestion.html
● http://druid.io/docs/latest/design/segments.html
● https://en.wikipedia.org/wiki/Druid_(open-source_data_store)

References - Druid vs Apache Kylin
● https://www.slideshare.net/freepsw/olap-for-big-data-druid-vs-apache-kylin-vs-apache-lens
● http://markmail.org/message/mf6gfzdwfqwtbtv6#query:+page:1+mid:sp7ek7x5pawjlxb6+state:results
● https://ko.hortonworks.com/blog/apache-hive-druid-part-1-3/
● https://github.com/druid-io/druid
● https://github.com/apache/kylin
