Druid meetup 4th_sql_on_druid

2017.06.08
SQL on Druid
4th Druid@Seoul Meetup
Yousun Jeong (jerryjung@sk.com)

Index
1. What is Druid?
2. Benchmark
3. SQL on Druid
4. Q&A

What is Druid?
http://druid.io/
Druid is an open-source data store designed for sub-second queries on
real-time and historical data. It is primarily used for business
intelligence (OLAP) queries on event data. Druid provides low-latency
(real-time) data ingestion, flexible data exploration, and fast data
aggregation. Existing Druid deployments have scaled to trillions of events
and petabytes of data. Druid is most commonly used to power user-facing
analytic applications.

Druid Features
http://www.popit.kr/ultra-fast_olap_druid/
https://hortonworks.com/blog/apache-hive-druid-part-1-3/

Pre-Aggregation & Roll-up
(Figure: roll-up example at minute granularity)
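
As a rough illustration of the roll-up idea, the sketch below truncates event timestamps to the minute and pre-aggregates rows that share the same minute and dimension value. The field names (`page`, `added`) and the sample rows are made up for this example; Druid itself does this at ingestion time, driven by the configured granularity and metrics.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (ISO timestamp, page, added). Not taken from the talk.
raw_events = [
    ("2017-06-08T10:00:05", "Druid", 10),
    ("2017-06-08T10:00:40", "Druid", 5),
    ("2017-06-08T10:00:59", "Hive", 7),
    ("2017-06-08T10:01:10", "Druid", 3),
]


def roll_up_by_minute(events):
    """Pre-aggregate events that share the same minute and dimension value."""
    rolled = defaultdict(lambda: {"count": 0, "sum_added": 0})
    for ts, page, added in events:
        # Truncate the timestamp to minute granularity.
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        key = (minute.isoformat(), page)
        rolled[key]["count"] += 1
        rolled[key]["sum_added"] += added
    return dict(rolled)


rolled = roll_up_by_minute(raw_events)
for (minute, page), metrics in sorted(rolled.items()):
    print(minute, page, metrics)
```

Four raw rows collapse into three stored rows here; the saving grows with the number of events that fall into the same time bucket with identical dimension values.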

Druid Architecture
https://en.wikipedia.org/wiki/Druid_(open-source_data_store)

Agenda
1. What is Druid?
2. Benchmark
3. SQL on Druid
4. Q&A

Druid vs Spark
http://www.popit.kr/druid-spark-performance/

Druid vs Spark(Cached) vs Spark(HDFS)
http://www.popit.kr/druid-spark-performance/
Interactive Analysis Capability

Druid vs Spark
http://www.popit.kr/druid-spark-performance/

Agenda
1. What is Druid?
2. Benchmark
3. SQL on Druid
4. Q&A

Things that bother me about Druid
Ingestion (JSON spec, excerpt)
"dataSchema" : {
  "dataSource" : "wikipedia",
  "parser" : {
    "type" : "string",
    "parseSpec" : {
      "format" : "json",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
        ...
Query (JSON query, excerpt)
{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "dimensions": ["country", "device"],
  ...
pyDruid
RDruid
Join?
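
To make the query-side pain point concrete, the sketch below assembles a complete native groupBy query as a Python dict and serializes it to JSON. The slide's fragment stops after `dimensions`; the `aggregations` and `intervals` values here are assumptions added only so that the query is complete, and `build_group_by` is a hypothetical helper, not part of pyDruid.

```python
import json


def build_group_by(datasource, dimensions, granularity, intervals, aggregations):
    """Assemble a Druid native groupBy query as a plain dict."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": granularity,
        "dimensions": list(dimensions),
        "aggregations": list(aggregations),
        "intervals": list(intervals),
    }


query = build_group_by(
    datasource="sample_datasource",
    dimensions=["country", "device"],
    granularity="day",
    # Assumed values: the slide does not show these two fields.
    intervals=["2017-06-01/2017-06-08"],
    aggregations=[{"type": "count", "name": "rows"}],
)
print(json.dumps(query, indent=2))
```

The equivalent SQL would be a single GROUP BY statement, which is the motivation for the SQL layers discussed in the rest of the talk.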

Hive Integration - Benchmark
http://www.popit.kr/ultra-fast_olap_druid2/

Hive Integration - Druid Storage Handler
Hive 2.2.0 or higher

Hive Integration - Types of Analytics

Hive Integration - Create Datasource
-- Register an existing Druid datasource as a Hive external table
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
-- Create a new Druid datasource from a Hive query (CTAS)
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker",
               "druid.segment.granularity" = "DAY")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Inference of Druid column types (timestamp, dimensions, metrics)
depends on the Hive column types
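
The inference step can be sketched roughly as follows. This is a simplified reading of the rule, not the storage handler's actual code: `__time` becomes the Druid timestamp, string-like columns become dimensions, and numeric columns become metrics. The Hive types assigned to the `src` columns below are assumptions, since the slide does not list them.

```python
NUMERIC_HIVE_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}
STRING_HIVE_TYPES = {"string", "varchar", "char"}


def infer_druid_role(column_name, hive_type):
    """Map a Hive column to a Druid role (simplified, illustrative rule)."""
    # Strip any length/precision suffix such as varchar(20) or decimal(10,2).
    base_type = hive_type.lower().split("(")[0].strip()
    if column_name == "__time":
        if base_type != "timestamp":
            raise ValueError("__time must be a timestamp column")
        return "timestamp"
    if base_type in STRING_HIVE_TYPES:
        return "dimension"
    if base_type in NUMERIC_HIVE_TYPES:
        return "metric"
    raise ValueError(f"unsupported Hive type: {hive_type}")


# Assumed Hive schema for the columns selected in the CTAS statement.
src_schema = [
    ("__time", "timestamp"),
    ("page", "string"),
    ("user", "string"),
    ("c_added", "int"),
    ("c_removed", "int"),
]
roles = {name: infer_druid_role(name, hive_type) for name, hive_type in src_schema}
print(roles)
```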

Hive Integration - Querying Druid
Automatic rewriting when query is expressed over Druid table
Powered by Apache Calcite
Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
Translate (sub)plan of operators into valid Druid JSON query
Druid query is encapsulated within Hive TableScan operator
Hive TableScan uses Druid input format
Submits query to Druid and generates records out of the query results
It might not be possible to push all computation to Druid
Our contract is that the query should always be executed
https://www.slideshare.net/HadoopSummit/interactive-analytics-at-scale-in-apache-hive-using-druid
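
The pattern-matching challenge can be illustrated with a toy heuristic that picks a Druid query type from a summary of the logical plan. This is only a sketch: `PlanSummary` and `choose_druid_query_type` are hypothetical names, and the real rules in Hive and Calcite are considerably more detailed. TopN is gated behind an `allow_approximate` flag here because Druid's TopN results are approximate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanSummary:
    """Hypothetical, heavily simplified view of a logical plan."""
    dimensions: List[str] = field(default_factory=list)    # non-time GROUP BY columns
    aggregations: List[str] = field(default_factory=list)  # names of aggregate outputs
    order_by_metric: Optional[str] = None                   # aggregate used in ORDER BY
    limit: Optional[int] = None


def choose_druid_query_type(plan, allow_approximate=False):
    """Pick a Druid query type for the plan (illustrative heuristic only)."""
    if not plan.aggregations:
        return "select"          # plain scan, nothing to aggregate
    if not plan.dimensions:
        return "timeseries"      # aggregates grouped by time only
    is_top_n_shape = (
        len(plan.dimensions) == 1
        and plan.order_by_metric in plan.aggregations
        and plan.limit is not None
    )
    if is_top_n_shape and allow_approximate:
        return "topN"
    return "groupBy"


plan = PlanSummary(dimensions=["user"], aggregations=["s"], order_by_metric="s", limit=10)
print(choose_druid_query_type(plan))
print(choose_druid_query_type(plan, allow_approximate=True))
```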

Hive Integration - Querying Druid
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
  BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
{
  "queryType": "groupBy",
  "dataSource": "users_index",
  "granularity": "all",
  "dimensions": ["user"],
  "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
  "limitSpec": {
    "limit": 10,
    "columns": [ { "dimension": "s", "direction": "descending" } ]
  },
  "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
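
One step of this translation is turning the year filter into a Druid interval. A minimal sketch of that step, assuming an inclusive year range and a half-open interval; `years_to_interval` is a hypothetical helper, not a function from Hive or Calcite.

```python
def years_to_interval(first_year, last_year):
    """Convert an inclusive year range into a half-open Druid interval string."""
    if last_year < first_year:
        raise ValueError("last_year must not be earlier than first_year")
    start = f"{first_year:04d}-01-01T00:00:00.000"
    # The interval end is exclusive, so it is the first instant of the next year.
    end = f"{last_year + 1:04d}-01-01T00:00:00.000"
    return f"{start}/{end}"


interval = years_to_interval(2010, 2011)
print(interval)
```

For the range 2010 to 2011 this yields the same interval string that appears in the JSON query on this slide.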

Hive Integration - Join
SELECT a.channel, b.col1
FROM
(
  SELECT `channel`, max(delta) as m, sum(added)
  FROM druid_table_1
  GROUP BY `channel`, `floor_year`(`__time`)
  ORDER BY m DESC
  LIMIT 1000
) a
JOIN
(
  SELECT col1, col2
  FROM hive_table_1
) b
ON a.channel = b.col2;
Query that runs across Druid and Hive

Spark Integration
CREATE TABLE if not exists orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
  timeDimensionColumn "l_shipdate",
  druidDatasource "tpch",
  druidHost "localhost",
  zkQualifyDiscoveryNames "true",
  columnMapping '{ "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty", "cn_name" : "c_nation", "cr_name" : "c_region", "sn_name" : "s_nation", "sr_name" : "s_region" } ',
  numProcessingThreadsPerHistorical '1',
  starSchema ' { "factTable" : "orderLineItemPartSupplier", "relations" : [] } ')
;
JAVA_TOOL_OPTIONS=-Duser.timezone=UTC sh start-sparklinedatathriftserver.sh \
  ~/server/spark-druid-olap/scripts/spl-accel-assembly-0.5.0-SNAPSHOT.jar \
  --driver-memory 19g --master yarn --deploy-mode client \
  --conf spark.scheduler.mode=FAIR --properties-file sparkline.properties
https://github.com/SparklineData/spark-druid-olap

Druid - Built-in SQL
Druid 0.10.0 or higher
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.Properties;
// Requires the Avatica JDBC driver on the classpath.
// Connect to /druid/v2/sql/avatica/ on your broker.
String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/";
// Set any connection context parameters you need here
// (see "Connection context" in the Druid SQL documentation).
// Or leave empty for default behavior.
Properties connectionProperties = new Properties();
try (Connection connection = DriverManager.getConnection(url, connectionProperties)) {
  try (ResultSet resultSet = connection.createStatement().executeQuery(
      "SELECT COUNT(*) AS cnt FROM data_source")) {
    while (resultSet.next()) {
      // Do something
    }
  }
}
Druid includes a native SQL layer with an Apache Calcite-based parser and planner.

Druid - Built-in SQL
{
  "query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"
}
SELECT * FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'druid' AND TABLE_NAME = 'foo'
SELECT x, COUNT(*)
FROM data_source_1
WHERE x IN (SELECT x FROM data_source_2 WHERE y = 'baz')
GROUP BY x
http://druid.io/docs/latest/querying/sql.html
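
The JSON body on this slide is what gets sent when SQL is issued over HTTP rather than JDBC: it is POSTed to `/druid/v2/sql/` on the broker. The sketch below only builds such a request with the Python standard library and does not send it. The broker address reuses `localhost:8082` from the JDBC example and is an assumption about the local setup; `build_sql_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_sql_request(sql, broker="http://localhost:8082"):
    """Build (but do not send) an HTTP POST carrying a Druid SQL query."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=broker + "/druid/v2/sql/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request("SELECT COUNT(*) FROM data_source WHERE foo = 'bar'")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually run the query (requires a reachable broker):
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```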

Agenda
1. What is Druid?
2. Benchmark
3. SQL on Druid
4. Q&A
