3. What is Druid?
http://druid.io/
Druid is an open-source data store designed for sub-second queries on
real-time and historical data. It is primarily used for business
intelligence (OLAP) queries on event data. Druid provides low-latency
(real-time) data ingestion, flexible data exploration, and fast data
aggregation. Existing Druid deployments have scaled to trillions of events
and petabytes of data. Druid is most commonly used to power user-facing
analytic applications.
18. Hive Integration - Types of Analytics
https://hortonworks.com/blog/apache-hive-druid-part-1-3/
http://www.popit.kr/ultra-fast_olap_druid2/
19. Hive Integration - Create Datasource
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker",
"druid.segment.granularity" = "DAY")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Inference of Druid column types (timestamp, dimensions, metrics)
depends on Hive column type
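The inference rule above can be sketched as a small mapping. This is only an illustration, assuming the commonly documented convention that the `__time` column becomes the Druid timestamp, string-like columns become dimensions, and the remaining (numeric) columns become metrics. The class name `DruidColumnRoles`, the `infer` method, and the Hive types assigned to the example columns are hypothetical and are not taken from Hive's code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: illustrates the inference rule, not Hive's actual code.
class DruidColumnRoles {

    enum Role { TIMESTAMP, DIMENSION, METRIC }

    // Assumed rule: `__time` is the timestamp, string-like types are
    // dimensions, and every other type is treated as a metric.
    static Role infer(String columnName, String hiveType) {
        String type = hiveType.toLowerCase(Locale.ROOT);
        if (columnName.equals("__time")) {
            return Role.TIMESTAMP;
        }
        if (type.equals("string") || type.startsWith("varchar") || type.startsWith("char")) {
            return Role.DIMENSION;
        }
        return Role.METRIC;
    }

    public static void main(String[] args) {
        // Columns from the CREATE TABLE ... AS SELECT example; types are assumed.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("__time", "timestamp");
        columns.put("page", "string");
        columns.put("user", "string");
        columns.put("c_added", "bigint");
        columns.put("c_removed", "bigint");
        for (Map.Entry<String, String> e : columns.entrySet()) {
            System.out.println(e.getKey() + " -> " + infer(e.getKey(), e.getValue()));
        }
    }
}
```

Under these assumptions, `page` and `user` would be stored as dimensions and `c_added` and `c_removed` as metrics.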
20. Hive Integration - Querying Druid
Automatic rewriting when query is expressed over Druid table
Powered by Apache Calcite
Main challenge: identify patterns in logical plan corresponding to different
kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
Translate (sub)plan of operators into valid Druid JSON query
Druid query is encapsulated within Hive TableScan operator
Hive TableScan uses Druid input format
Submits query to Druid and generates records out of the query results
It might not be possible to push all computation to Druid
Our contract is that the query should always be executed
https://www.slideshare.net/HadoopSummit/interactive-analytics-at-scale-in-apache-hive-using-druid
21. Hive Integration - Querying Druid
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
{
"queryType": "groupBy",
"dataSource": "users_index",
"granularity": "all",
"dimensions": [ "user" ],
"aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
"limitSpec": {
"type": "default",
"limit": 10,
"columns": [ {"dimension": "s", "direction": "descending" } ]
},
"intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
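The `intervals` value in the Druid query above follows from the SQL filter: `BETWEEN 2010 AND 2011` is inclusive on both ends, so the interval runs from the start of 2010 to the start of 2012 (exclusive). A minimal sketch of that one translation step is shown here. The class `YearRangeInterval` and its method are hypothetical names for illustration and do not reflect how Hive or Calcite implement the rewrite.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: shows how an inclusive year range maps to a
// start/end interval string of the form used in the query above.
class YearRangeInterval {

    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSS");

    // BETWEEN is inclusive, so the exclusive end of the interval is the
    // first instant of the year after lastYear.
    static String toInterval(int firstYear, int lastYear) {
        LocalDateTime start = LocalDateTime.of(firstYear, 1, 1, 0, 0);
        LocalDateTime end = LocalDateTime.of(lastYear + 1, 1, 1, 0, 0);
        return FMT.format(start) + "/" + FMT.format(end);
    }

    public static void main(String[] args) {
        // Prints 2010-01-01T00:00:00.000/2012-01-01T00:00:00.000
        System.out.println(toInterval(2010, 2011));
    }
}
```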
22. Hive Integration - Join
SELECT a.channel, b.col1
FROM
(
SELECT `channel`, max(delta) as m, sum(added)
FROM druid_table_1
GROUP BY `channel`, `floor_year`(`__time`)
ORDER BY m DESC
LIMIT 1000
) a
JOIN
(
SELECT col1, col2
FROM hive_table_1
) b
ON a.channel = b.col2;
Query that runs across Druid and Hive
24. Druid - Built-in SQL
Druid 0.10.0 or higher
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.Properties;

// Connect to /druid/v2/sql/avatica/ on your broker.
String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/";
// Set any connection context parameters you need here
// (see "Connection context" in the Druid SQL documentation).
// Or leave empty for default behavior.
Properties connectionProperties = new Properties();
try (Connection connection = DriverManager.getConnection(url, connectionProperties)) {
    try (ResultSet resultSet = connection.createStatement().executeQuery(
            "SELECT COUNT(*) AS cnt FROM data_source")) {
        while (resultSet.next()) {
            // Do something
        }
    }
}
Druid includes a native SQL layer with an Apache Calcite-based parser and planner.
25. Druid - Built-in SQL
{
"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"
}
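The JSON object above is the request body for Druid's SQL-over-HTTP endpoint. The sketch below builds such a request with the JDK's `java.net.http` client (Java 11 or later). It assumes the broker listens on `http://localhost:8082`, as in the JDBC example, and that the SQL endpoint is `/druid/v2/sql/`. The class `DruidSqlHttp` and its methods are hypothetical names, and the hand-written escaping covers only backslashes and double quotes, so a real client would be better served by a JSON library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for posting a SQL string to a Druid broker.
class DruidSqlHttp {

    // Wraps the SQL in the {"query": "..."} envelope, escaping backslashes
    // and double quotes (control characters are not handled here).
    static String toJsonBody(String sql) {
        String escaped = sql.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\":\"" + escaped + "\"}";
    }

    // Builds the POST request without sending it.
    static HttpRequest buildRequest(String brokerUrl, String sql) {
        return HttpRequest.newBuilder()
            .uri(URI.create(brokerUrl + "/druid/v2/sql/"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(toJsonBody(sql)))
            .build();
    }

    // Sends the request; this needs a reachable broker.
    static String send(HttpRequest request) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        String sql = "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'";
        HttpRequest request = buildRequest("http://localhost:8082", sql);
        // Only show what would be sent; call send(request) against a live broker.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(toJsonBody(sql));
    }
}
```

Keeping request construction separate from `send` makes the payload easy to inspect without a running cluster.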
SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE
TABLE_SCHEMA = 'druid' AND TABLE_NAME = 'foo'
SELECT x, COUNT(*)
FROM data_source_1
WHERE x IN (SELECT x FROM data_source_2 WHERE y = 'baz')
GROUP BY x
http://druid.io/docs/latest/querying/sql.html