SlideShare a Scribd company logo
1 of 34
Interactive Analytics at Scale
in Apache Hive using Druid
Jesús Camacho Rodríguez
DataWorks Summit Europe
April 5, 2017
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Motivation
 BI/OLAP applications that require interactive
visualization of complex data streams
– Real time bidding events
– User activity streams
– Voice call logs
– Network traffic flows
– Firewall events
– Application performance metrics
 Querying event data at large scale poses multiple challenges
Interactive analytics on event data
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid overview
 Development starts in 2011, open-sourced in late 2012
 Initial use case: interactive ad-analytics
 +150 contributors
 Main features
– Column-oriented distributed data store
– Batch and real-time ingestion
– Scalable to petabytes of data
– Sub-second response for arbitrary time-based
slice-and-dice
• Data partitioned by time dimension
• Automatic data summarization
• Approximate algorithms (hyperLogLog, theta)
Most Events per Day
30 Billion Events / Day
(Metamarkets)
Most Computed Metrics
1 Billion Metrics / Min
(Jolata)
Largest Cluster
200 Nodes
(Metamarkets)
Largest Hourly Ingestion
2TB per Hour
(Netflix)
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid architecture
Dashboards, BI tools
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Persistent storage
 Data in Druid is stored in segment files
 Partitioned by time, supports fast time-based slice-and-dice
 Ideally, segment files are each smaller than 1GB
 If files are large, smaller time partitions are needed
Time
Segment 1:
Monday
Segment 2:
Tuesday
Segment 3:
Wednesday
Segment 4:
Thursday
Segment 5:
Friday
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Segment data structures
 Within a segment
– Timestamp column
– Dimension columns
– Metric columns
– Indexes to facilitate fast lookup and aggregation
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Querying
 HTTP REST API
 Queries and results expressed in JSON
 Multiple query types
– Time boundary
– Segment metadata
– Timeseries
– TopN
– GroupBy
– Select
{
"queryType": "groupBy",
"dataSource": "product_sales_index",
"granularity": "all",
"dimension": "product_id",
"aggregations": [ { "type": "doubleSum", "name": "s", "fieldName": "sales" } ],
"limitSpec": {
"limit": 10,
"columns": [ {"dimension": "s", "direction": "descending" } ]
},
"intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
Important to use adequate type  Impact on query performance
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid + Apache Hive
 Integration brings benefits both to Druid and Apache Hive
– Indexing complex query results in Druid using Hive
– Introducing a SQL interface on top of Druid
– Being able to execute complex operations on Druid data
– Efficient execution of OLAP queries in Hive
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 User needs to provide Druid data sources information to Hive
 Two different options depending on requirements
– Register Druid data sources in Hive
• Data is already stored in Druid
– Create Druid data sources from Hive
• Data is stored in Hive
• User may want to pre-process the data before storing it in Druid
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 Simple CREATE EXTERNAL TABLE statement
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
Hive table name
Hive storage handler classname
Druid data source name
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query
Registering Druid data sources
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 Use Create Table As Select (CTAS) statement
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Hive table name
Hive storage handler classname
Druid data source name
Druid segment granularity
Creating Druid data sources
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 Use Create Table As Select (CTAS) statement
CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler’
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY”)
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type
Creating Druid data sources
Timestamp Dimensions Metrics
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid data sources in Hive
 File Sink operator uses Druid output format
– Creates segment files and register them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Creating Druid data sources
Select
File Sink
Original CTAS
physical plan
__time page user c_added c_removed
2011-01-01T01:05:00Z Justin Boxer 1800 25
2011-01-02T19:00:00Z Justin Reach 2912 42
2011-01-01T11:00:00Z Ke$ha Xeno 1953 17
2011-01-02T13:00:00Z Ke$ha Helz 3194 170
2011-01-02T18:00:00Z Miley Ashu 2232 34
CTAS query results
Table Scan
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
__time page user c_added c_removed __time_granularity
2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z
2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z
2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z
2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z
Druid data sources in Hive
 File Sink operator uses Druid output format
– Creates segment files and register them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Creating Druid data sources
Select
File Sink
Rewritten CTAS
physical plan CTAS query results
Table Scan
Reduce
Truncate timestamp to day granularity
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
 File Sink operator uses Druid output format
– Creates segment files and register them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z
2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z
2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z
2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z
Segment 2011-01-01
Segment 2011-01-02
Druid data sources in Hive
Creating Druid data sources
Select
File Sink
Rewritten CTAS
physical plan
Table Scan
Reduce
CTAS query results
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources
Querying Druid data sources
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Querying Druid data sources
 Automatic rewriting when query is expressed over Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries
(Timeseries, TopN, GroupBy, Select)
 Translate (sub)plan of operators into valid Druid JSON query
– Druid query is encapsulated within Hive TableScan operator
 Hive TableScan uses Druid input format
– Submits query to Druid and generates records out of the query results
 It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Top 10 users that have added more characters
from beginning of 2010 until the end of 2011
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Top 10 users that have added more characters
from beginning of 2010 until the end of 2011
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
Possible to express filters
on time dimension using
SQL standard functions
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
select
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Initially:
– Scan is executed in Druid (select query)
– Rest of the query is executed in Hive
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
select
Rewriting
rule
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
select
Rewriting
rule
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
groupBy
Rewriting
rule
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Druid query recognition (powered by Apache Calcite)
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
groupBy
Rewriting
rule
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Druid query recognition (powered by Apache Calcite)
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
{
"queryType": "groupBy",
"dataSource": "users_index",
"granularity": "all",
"dimension": "user",
"aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
"limitSpec": {
"limit": 10,
"columns": [ {"dimension": "s", "direction": "descending" } ]
},
"intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
Physical plan transformation
Apache Hive
Druid query
groupBy
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
Select
File SinkFile Sink
Table Scan
Query physical plan
Druid JSON query
Table Scan uses
Druid Input Format
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid input format
 Submits query to Druid and generates records out of the query results
 Current version
– Timeseries, TopN, and GroupBy queries are not partitioned
– Select queries: realtime and historical nodes are contacted directly
Node
Table Scan
Record reader
…
Timeseries, TopN, GroupBy
Node
Table Scan
Record reader
…
Table Scan
Record reader
… Node
Table Scan
Record reader
…
Table Scan
Record reader
…
Select
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources
Querying Druid data sources
Demonstration
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demonstration
 Implementation in Apache Hive 2.3 - Apache Hive 3.0
– Release in Q2 2017
– Relies on Druid 0.9.2 and Apache Calcite 1.12.0
 Current status (master)
– Registering, creating, overwritting and deleting Druid data sources
– Querying Druid from Hive
• Bypass broker for Druid Select queries
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demonstration
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Interactive Analytics at Scale in Hive using Druid
Introduction
Registering and creating Druid data sources
Querying Druid data sources
Demonstration
Road ahead
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Road ahead
 Tighten integration between Druid and Apache Hive/Apache Calcite
– Recognize more functions  Push more computation to Druid
– Support complex column types
– Close the gap between semantics of different systems
• Time zone handling, null values
 Broader perspective
– Materialized views support in Apache Hive
• Data stored in Apache Hive
• Create materialized view in Druid
– Denormalized star schema for a certain time period
• Automatic input query rewriting over the materialized view (Apache Calcite)
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Acknowledgments
 Apache Hive, Apache Calcite and Druid communities
– Slim Bouguerra, Julian Hyde, Nishant Bangarwa, Ashutosh Chauhan, Gunther Hagleitner, Carter
Shanklin, and many others
Thank You
@ApacheHive | @ApacheCalcite | @druidio
http://cwiki.apache.org/confluence/display/Hive/Druid+Integration
http://calcite.apache.org/docs/druid_adapter.html

More Related Content

What's hot

Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...DataWorks Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
FIWARE: Managing Context Information at large scale
FIWARE: Managing Context Information at large scaleFIWARE: Managing Context Information at large scale
FIWARE: Managing Context Information at large scaleFermin Galan
 
Hit Refresh with Oracle GoldenGate Microservices
Hit Refresh with Oracle GoldenGate MicroservicesHit Refresh with Oracle GoldenGate Microservices
Hit Refresh with Oracle GoldenGate MicroservicesBobby Curtis
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWAREFIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWAREFIWARE
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
GDPR Community Showcase for Apache Ranger and Apache Atlas
GDPR Community Showcase for Apache Ranger and Apache AtlasGDPR Community Showcase for Apache Ranger and Apache Atlas
GDPR Community Showcase for Apache Ranger and Apache AtlasDataWorks Summit
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkTakuya UESHIN
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 
Elastic Search
Elastic SearchElastic Search
Elastic SearchNavule Rao
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 

What's hot (20)

Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
 
Data Modeling with NGSI, NGSI-LD
Data Modeling with NGSI, NGSI-LDData Modeling with NGSI, NGSI-LD
Data Modeling with NGSI, NGSI-LD
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
YARN Federation
YARN Federation YARN Federation
YARN Federation
 
FIWARE: Managing Context Information at large scale
FIWARE: Managing Context Information at large scaleFIWARE: Managing Context Information at large scale
FIWARE: Managing Context Information at large scale
 
Hit Refresh with Oracle GoldenGate Microservices
Hit Refresh with Oracle GoldenGate MicroservicesHit Refresh with Oracle GoldenGate Microservices
Hit Refresh with Oracle GoldenGate Microservices
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWAREFIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
GDPR Community Showcase for Apache Ranger and Apache Atlas
GDPR Community Showcase for Apache Ranger and Apache AtlasGDPR Community Showcase for Apache Ranger and Apache Atlas
GDPR Community Showcase for Apache Ranger and Apache Atlas
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache Spark
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
OpenShift on OpenStack with Kuryr
OpenShift on OpenStack with KuryrOpenShift on OpenStack with Kuryr
OpenShift on OpenStack with Kuryr
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 

Similar to Interactive Analytics at Scale in Apache Hive Using Druid

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop EcosystemSlim Bouguerra
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsAaron Brooks
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDataWorks Summit
 
Time-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidTime-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidRaúl Marín
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDataWorks Summit
 
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Future of Data Meetup
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Understanding apache-druid
Understanding apache-druidUnderstanding apache-druid
Understanding apache-druidSuman Banerjee
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...InfluxData
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studydeep.bi
 
Scalable olap with druid
Scalable olap with druidScalable olap with druid
Scalable olap with druidKashif Khan
 

Similar to Interactive Analytics at Scale in Apache Hive Using Druid (20)

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
Time-series data analysis and persistence with Druid
Time-series data analysis and persistence with DruidTime-series data analysis and persistence with Druid
Time-series data analysis and persistence with Druid
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
 
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Understanding apache-druid
Understanding apache-druidUnderstanding apache-druid
Understanding apache-druid
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...
Tim Hall [InfluxData] | InfluxDB Roadmap | InfluxDays Virtual Experience Lond...
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 
Scalable olap with druid
Scalable olap with druidScalable olap with druid
Scalable olap with druid
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Interactive Analytics at Scale in Apache Hive Using Druid

  • 1. Interactive Analytics at Scale in Apache Hive using Druid Jesús Camacho Rodríguez DataWorks Summit Europe April 5, 2017
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Motivation  BI/OLAP applications that require interactive visualization of complex data streams – Real time bidding events – User activity streams – Voice call logs – Network traffic flows – Firewall events – Application performance metrics  Querying event data at large scale poses multiple challenges Interactive analytics on event data
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid overview  Development starts in 2011, open-sourced in late 2012  Initial use case: interactive ad-analytics  +150 contributors  Main features – Column-oriented distributed data store – Batch and real-time ingestion – Scalable to petabytes of data – Sub-second response for arbitrary time-based slice-and-dice • Data partitioned by time dimension • Automatic data summarization • Approximate algorithms (hyperLogLog, theta) Most Events per Day 30 Billion Events / Day (Metamarkets) Most Computed Metrics 1 Billion Metrics / Min (Jolata) Largest Cluster 200 Nodes (Metamarkets) Largest Hourly Ingestion 2TB per Hour (Netflix)
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid architecture Dashboards, BI tools
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Persistent storage  Data in Druid is stored in segment files  Partitioned by time, supports fast time-based slice-and-dice  Ideally, segment files are each smaller than 1GB  If files are large, smaller time partitions are needed Time Segment 1: Monday Segment 2: Tuesday Segment 3: Wednesday Segment 4: Thursday Segment 5: Friday
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Segment data structures  Within a segment – Timestamp column – Dimension columns – Metric columns – Indexes to facilitate fast lookup and aggregation
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Querying  HTTP REST API  Queries and results expressed in JSON  Multiple query types – Time boundary – Segment metadata – Timeseries – TopN – GroupBy – Select { "queryType": "groupBy", "dataSource": "product_sales_index", "granularity": "all", "dimension": "product_id", "aggregations": [ { "type": "doubleSum", "name": "s", "fieldName": "sales" } ], "limitSpec": { "limit": 10, "columns": [ {"dimension": "s", "direction": "descending" } ] }, "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ] } Important to use adequate type  Impact on query performance
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid + Apache Hive  Integration brings benefits both to Druid and Apache Hive – Indexing complex query results in Druid using Hive – Introducing a SQL interface on top of Druid – Being able to execute complex operations on Druid data – Efficient execution of OLAP queries in Hive
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Interactive Analytics at Scale in Hive using Druid Introduction Registering and creating Druid data sources
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  User needs to provide Druid data sources information to Hive  Two different options depending on requirements – Register Druid data sources in Hive • Data is already stored in Druid – Create Druid data sources from Hive • Data is stored in Hive • User may want to pre-process the data before storing it in Druid
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  Simple CREATE EXTERNAL TABLE statement CREATE EXTERNAL TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker"); Hive table name Hive storage handler classname Druid data source name ⇢ Broker node endpoint specified as a Hive configuration parameter ⇢ Automatic Druid data schema discovery: segment metadata query Registering Druid data sources
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  Use Create Table As Select (CTAS) statement CREATE TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY") AS SELECT __time, page, user, c_added, c_removed FROM src; Hive table name Hive storage handler classname Druid data source name Druid segment granularity Creating Druid data sources
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  Use Create Table As Select (CTAS) statement CREATE TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler’ TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY”) AS SELECT __time, page, user, c_added, c_removed FROM src; ⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on Hive column type Creating Druid data sources Timestamp Dimensions Metrics
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid data sources in Hive  File Sink operator uses Druid output format – Creates segment files and register them in Druid – Data needs to be partitioned by time granularity • Granularity specified as configuration parameter Creating Druid data sources Select File Sink Original CTAS physical plan __time page user c_added c_removed 2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T18:00:00Z Miley Ashu 2232 34 CTAS query results Table Scan
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved __time page user c_added c_removed __time_granularity 2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z 2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z 2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z 2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z 2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z Druid data sources in Hive  File Sink operator uses Druid output format – Creates segment files and register them in Druid – Data needs to be partitioned by time granularity • Granularity specified as configuration parameter Creating Druid data sources Select File Sink Rewritten CTAS physical plan CTAS query results Table Scan Reduce Truncate timestamp to day granularity
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved  File Sink operator uses Druid output format – Creates segment files and register them in Druid – Data needs to be partitioned by time granularity • Granularity specified as configuration parameter 2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z 2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z 2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z 2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z 2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z Segment 2011-01-01 Segment 2011-01-02 Druid data sources in Hive Creating Druid data sources Select File Sink Rewritten CTAS physical plan Table Scan Reduce CTAS query results
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Interactive Analytics at Scale in Hive using Druid Introduction Registering and creating Druid data sources Querying Druid data sources
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Querying Druid data sources  Automatic rewriting when query is expressed over Druid table – Powered by Apache Calcite – Main challenge: identify patterns in logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)  Translate (sub)plan of operators into valid Druid JSON query – Druid query is encapsulated within Hive TableScan operator  Hive TableScan uses Druid input format – Submits query to Druid and generates records out of the query results  It might not be possible to push all computation to Druid – Our contract is that the query should always be executed
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Top 10 users that have added more characters from beginning of 2010 until the end of 2011 Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Top 10 users that have added more characters from beginning of 2010 until the end of 2011 Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter Possible to express filters on time dimension using SQL standard functions
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query select Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Initially: – Scan is executed in Druid (select query) – Rest of the query is executed in Hive Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query select Rewriting rule Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Rewriting rules push computation into Druid – Need to check that operator meets some pre-conditions before pushing it to Druid Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query select Rewriting rule Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Rewriting rules push computation into Druid – Need to check that operator meets some pre-conditions before pushing it to Druid Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query groupBy Rewriting rule SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Rewriting rules push computation into Druid – Need to check that operator meets some pre-conditions before pushing it to Druid Druid query recognition (powered by Apache Calcite) Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query groupBy Rewriting rule SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Rewriting rules push computation into Druid – Need to check that operator meets some pre-conditions before pushing it to Druid Druid query recognition (powered by Apache Calcite) Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved { "queryType": "groupBy", "dataSource": "users_index", "granularity": "all", "dimension": "user", "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ], "limitSpec": { "limit": 10, "columns": [ {"dimension": "s", "direction": "descending" } ] }, "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ] } Physical plan transformation Apache Hive Druid query groupBy Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter Select File SinkFile Sink Table Scan Query physical plan Druid JSON query Table Scan uses Druid Input Format
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid input format  Submits query to Druid and generates records out of the query results  Current version – Timeseries, TopN, and GroupBy queries are not partitioned – Select queries: realtime and historical nodes are contacted directly Node Table Scan Record reader … Timeseries, TopN, GroupBy Node Table Scan Record reader … Table Scan Record reader … Node Table Scan Record reader … Table Scan Record reader … Select
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Interactive Analytics at Scale in Hive using Druid Introduction Registering and creating Druid data sources Querying Druid data sources Demonstration
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Demonstration  Implementation in Apache Hive 2.3 - Apache Hive 3.0 – Release in Q2 2017 – Relies on Druid 0.9.2 and Apache Calcite 1.12.0  Current status (master) – Registering, creating, overwritting and deleting Druid data sources – Querying Druid from Hive • Bypass broker for Druid Select queries
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Demonstration
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda Interactive Analytics at Scale in Hive using Druid Introduction Registering and creating Druid data sources Querying Druid data sources Demonstration Road ahead
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Road ahead  Tighten integration between Druid and Apache Hive/Apache Calcite – Recognize more functions  Push more computation to Druid – Support complex column types – Close the gap between semantics of different systems • Time zone handling, null values  Broader perspective – Materialized views support in Apache Hive • Data stored in Apache Hive • Create materialized view in Druid – Denormalized star schema for a certain time period • Automatic input query rewriting over the materialized view (Apache Calcite)
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Acknowledgments  Apache Hive, Apache Calcite and Druid communities – Slim Bouguerra, Julian Hyde, Nishant Bangarwa, Ashutosh Chauhan, Gunther Hagleitner, Carter Shanklin, and many others
  • 34. Thank You @ApacheHive | @ApacheCalcite | @druidio http://cwiki.apache.org/confluence/display/Hive/Druid+Integration http://calcite.apache.org/docs/druid_adapter.html

Editor's Notes

  1. - Add more info about materialized views?