© Cloudera, Inc. All rights reserved.
DRUID AND HIVE TOGETHER
USE CASES AND BEST PRACTICES
Nishant Bangarwa
© Cloudera, Inc. All rights reserved. 2
AGENDA
Motivation
Introduction to Druid
Hive and Druid
Performance Numbers
Demo
© Cloudera, Inc. All rights reserved. 3
Database popularity trend in the last 24 months
© Cloudera, Inc. All rights reserved. 4
Challenges with specialized DBs
• Each specialized DB has its own dialect and API
• Diverse security and audit mechanisms
• Different governance models
• Data from different sources needs to be combined on the client side
• Need a solution to provide performance without added complexity
© Cloudera, Inc. All rights reserved. 5
Query Federation with Apache Hive
Extensible Storage Handler
• Input Format
• Output Format
• SerDe
• Rules for pushing computations
• Filters, Aggregates, Sort, Limit, etc.
• Transforms SQL into the target system's dialect (see the sketch below)
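The same storage handler mechanism federates Hive to other engines. A minimal sketch, assuming Hive 3's JDBC storage handler (the class name and hive.sql.* properties follow its documented usage; the table, URL, and credentials below are placeholders):

CREATE EXTERNAL TABLE mysql_orders (order_id int, amount double)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql://dbhost/sales",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "secret",
  "hive.sql.table" = "orders");

Once registered, such a table can be joined with Druid-backed or native Hive tables in a single query, with filters and projections pushed down where the handler supports it.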
© Cloudera, Inc. All rights reserved. 6
Introduction to Apache Druid
High-performance analytics data store for timeseries data
© Cloudera, Inc. All rights reserved. 7
Companies Using Druid
http://druid.io/druid-powered
© Cloudera, Inc. All rights reserved. 8
When to use Druid ?
• Event Data/ Timeseries data
• Realtime – Need to analyze events as they happen.
• Delays can lead to business loss, e.g. fraud detection
• High Data Ingestion rate
• Scalable horizontally
• Queries generally involve aggregations and filtering on time (see the example below)
• Results for last quarter
• Aggregate comparisons over time, e.g. this week compared to last week
• Result set is much smaller than the actual dataset being queried
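For example, a typical Druid-shaped question expressed in HiveQL: a time-bounded aggregation whose result is tiny compared to the data scanned (the table and columns here are hypothetical):

SELECT page, SUM(views) AS total_views
FROM site_traffic
WHERE `__time` >= '2019-01-01 00:00:00' AND `__time` < '2019-04-01 00:00:00'
GROUP BY page
ORDER BY total_views DESC
LIMIT 20;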
© Cloudera, Inc. All rights reserved. 9
Common Use Cases
• User activity and behavior analysis
• clickstreams, viewstreams and activity streams
• measuring user engagement, tracking A/B test data for product releases, and
understanding usage patterns
• Application performance management
• operational data generated by applications
• identify bottlenecks and troubleshoot issues in real time
• IoT and device metrics
• Ingest machine-generated data in real time
• optimize hardware resources, identify issues, detect anomalies
• Digital marketing
• understand advertising campaign performance, click-through rates, conversion rates
© Cloudera, Inc. All rights reserved. 10
When NOT to use Druid ?
• Updating existing records using a primary key
• Updates need to be done by rebuilding segments (re-ingestion)
• Queries that dump the entire dataset
• Joining one big fact table to another big fact table
• Query latency is not critical for the business use case
• e.g. offline reporting systems
© Cloudera, Inc. All rights reserved. 11
Key Druid Features
• Column-oriented Storage
• Sub-Second query times
• Arbitrary slicing and dicing of data
• Native Search Indexes
• Horizontally Scalable
• Streaming and Batch Ingestion
• Automatic Data Summarization
• Time-based partitioning
• Flexible Schemas
• Rolling Upgrades
© Cloudera, Inc. All rights reserved. 12
Druid Concepts
Time Based Partitioning
1. Time partitioned Segment Files
2. Segments are versioned to support batch overrides
3. Query results are cached per segment
Example timeline: Segment 1 (version1, Monday), Segment 2 (version1, Tuesday), Segment 3 (version2, Wednesday), Segment 4 (version1, Thursday), Segment 5_1 and Segment 5_2 (version1, Friday)
© Cloudera, Inc. All rights reserved. 13
Druid Architecture
Diagram: streaming data is ingested by realtime index tasks, which hand segments off to historical nodes; batch data is loaded directly onto historical nodes; broker nodes route queries to both the realtime tasks and the historical nodes.
© Cloudera, Inc. All rights reserved. 15
Apache Hive and Apache Druid
Apache Hive:
• Large Scale Queries
• Joins, Subqueries
• Windowing Functions
• Transformations
• Complex Aggregations
• Advanced Sorting
• UDFs
Apache Druid:
• Queries to power visualizations
• Needles-in-a-haystack
• Dimensional Aggregates
• TopN queries
• Timeseries Queries
• Min/Max Values
• Streaming Ingestion
© Cloudera, Inc. All rights reserved. 16
Integration Benefits
1. Streaming Ingestion
2. Single SQL dialect and API
3. Central security controls and audit trail
4. Unified governance
5. Ability to combine data from multiple sources
6. Data independence
© Cloudera, Inc. All rights reserved. 17
Druid data sources in Hive
Registering Existing Druid data sources
Simple CREATE EXTERNAL TABLE statement
CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");
Hive table name
Hive storage handler classname
Druid data source name
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query
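A minimal sketch of the two points above, assuming the hive.druid.broker.address.default configuration property (host and port are placeholders):

SET hive.druid.broker.address.default=broker-host:8082;
DESCRIBE druid_table_1;             -- schema discovered via a Druid segment metadata query
SELECT COUNT(*) FROM druid_table_1; -- sanity check against the registered data source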
© Cloudera, Inc. All rights reserved. 18
Druid data sources in Hive
Creating Druid data sources
Use Create Table As Select (CTAS) statement
CREATE EXTERNAL TABLE druid_table
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY") AS
SELECT `__time`, page, `user`, c_added, c_removed FROM src;
Hive table name
Hive storage handler classname
Druid segment granularity
⇢ Inference of Druid column types (timestamp, dimensions, metrics) depends on the Hive column type
© Cloudera, Inc. All rights reserved. 19
Druid data sources in Hive
File Sink operator uses Druid output format
– Creates segment files and registers them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Optional Data Summarization
Original CTAS
physical plan
__time page user c_added c_removed
2011-01-01T01:05:00Z Justin Boxer 1800 25
2011-01-02T19:00:00Z Justin Reach 2912 42
2011-01-01T11:00:00Z Ke$ha Xeno 1953 17
2011-01-02T13:00:00Z Ke$ha Helz 3194 170
2011-01-02T18:00:00Z Miley Ashu 2232 34
CTAS query results (table above); physical plan: Table Scan → Select → File Sink
© Cloudera, Inc. All rights reserved. 20
Druid data sources in Hive
File Sink operator uses Druid output format
– Creates segment files and registers them in Druid
– Data needs to be partitioned by time granularity
• Granularity specified as configuration parameter
Optional Data Summarization
Rewritten CTAS
physical plan
CTAS query results (table below); rewritten physical plan: Table Scan → Select → Reduce → File Sink
__time page user c_added c_removed __time_granularity
2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z
2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z
2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z
2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z
2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z
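A hedged sketch of enabling the optional summarization at ingest: in addition to druid.segment.granularity, the integration exposes a druid.query.granularity table property that controls rollup within each segment (granularity values and source table follow the earlier example; treat the exact property name as per your Hive version's documentation):

CREATE EXTERNAL TABLE druid_table_rollup
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "DAY",
  "druid.query.granularity" = "HOUR") AS
SELECT `__time`, page, `user`, c_added, c_removed FROM src;

With this setting, rows that share the same dimension values within the same hour are pre-aggregated into a single Druid row at ingestion time.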
© Cloudera, Inc. All rights reserved. 21
Druid data sources in Hive
Creating Streaming Druid data sources
Use a CREATE EXTERNAL TABLE statement with Kafka properties
CREATE EXTERNAL TABLE druid_streaming
(`__time` timestamp, `dimension1` string, `metric1` int, `metric2` double, ...)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ( "druid.segment.granularity" = "DAY",
"kafka.bootstrap.servers" = "localhost:9092", "kafka.topic" = "topic1");
Hive table name
Hive storage handler classname
Druid segment granularity
Kafka related properties
© Cloudera, Inc. All rights reserved. 22
Druid data sources in Hive
Managing Streaming Ingestion from Hive
Use ALTER TABLE statements
ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'START');
ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'STOP');
ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'RESET');
Hive table name
Kafka related properties
⇢ RESET resets the Kafka offsets maintained by Druid for this ingestion
© Cloudera, Inc. All rights reserved. 23
Querying Druid datasources
• Automatic rewriting when query is expressed over Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in logical plan corresponding to different
kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
• Translate (sub)plan of operators into valid Druid JSON query
– Druid query is encapsulated within Hive TableScan operator
• Hive TableScan uses Druid input format
– Submits query to Druid and generates records out of the query results
• It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed (see the EXPLAIN sketch below)
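A hedged way to inspect the pushdown is EXPLAIN: in this integration the generated Druid query typically appears as TableScan properties such as druid.query.json and druid.query.type (exact property names can vary by version). Using the query from the following slides:

EXPLAIN
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;

If the whole plan is pushed, the output should reduce to a single Druid-backed TableScan (query type groupBy) feeding a File Sink.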
© Cloudera, Inc. All rights reserved. 24
Querying Druid datasources
Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
• Top 10 users that added the most characters from the beginning of 2010 until the end of 2011
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink
© Cloudera, Inc. All rights reserved. 25
Querying Druid datasources
Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
• Initial Plan:
– Scan is executed in Druid (select query)
– Rest of the query is executed in Hive
Query logical plan: Druid Scan (Druid select query) → Filter → Project → Aggregate → Sort Limit → Sink (executed in Apache Hive)
© Cloudera, Inc. All rights reserved. 26
Querying Druid datasources
Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink
• Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
© Cloudera, Inc. All rights reserved. 27
Querying Druid datasources
Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
• Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink (rewriting rule applied; current Druid query type: select)
© Cloudera, Inc. All rights reserved. 28
Querying Druid datasources
Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
• Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink (rewriting rule applied; current Druid query type: select)
© Cloudera, Inc. All rights reserved. 29
Querying Druid datasources
Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
• Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink (rewriting rule applied; current Druid query type: groupBy)
© Cloudera, Inc. All rights reserved. 30
Querying Druid datasources
Apache Hive - SQL query
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
• Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink (rewriting rules applied; final Druid query type: groupBy)
© Cloudera, Inc. All rights reserved. 31
Querying Druid datasources
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink (fully rewritten into a single Druid groupBy query)
Druid JSON query:
{
  "queryType": "groupBy",
  "dataSource": "users_index",
  "granularity": "all",
  "dimension": "user",
  "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
  "limitSpec": {
    "limit": 10,
    "columns": [ { "dimension": "s", "direction": "descending" } ]
  },
  "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
Query physical plan: Druid Scan → File Sink
© Cloudera, Inc. All rights reserved. 32
Druid input format
• Submits query to Druid and generates records out of the query results
• Current version
– Timeseries, TopN, and GroupBy queries are not partitioned; they are sent directly to the Druid broker
– Scan queries: realtime and historical nodes are contacted directly
Diagram: for Timeseries, TopN, and GroupBy queries, a single Table Scan / record reader talks to one node (the broker); for Select/Scan queries, multiple Table Scans / record readers contact the realtime and historical nodes in parallel.
© Cloudera, Inc. All rights reserved. 33
Performance and Scalability: Fast Facts
Most Events per Day: 300 Billion Events / Day (Metamarkets)
Most Computed Metrics: 1 Billion Metrics / Min (Jolata)
Largest Cluster: 200 Nodes (Snap Inc)
Largest Hourly Ingestion: 2TB per Hour (Netflix)
© Cloudera, Inc. All rights reserved. 34
Performance Numbers
• Query Latency
• average - 500ms
• 90%ile < 1sec
• 95%ile < 5sec
• 99%ile < 10 sec
• Query Volume
• 1000s of queries per minute
• Benchmarking code
• https://github.com/druid-io/druid-benchmark
© Cloudera, Inc. All rights reserved. 35
Performance Numbers
SSB Benchmark 1TB Scale
© Cloudera, Inc. All rights reserved. 36
Useful Resources
• Druid website – http://druid.io
• Druid User Group – users@druid.incubator.apache.org
• Druid Dev Group – dev@druid.incubator.apache.org
• Hive Druid Integration - https://cwiki.apache.org/confluence/display/Hive/Druid+Integration
• Blogs - https://hortonworks.com/blog/apache-hive-druid-part-1-3/
• Query Federation with Apache Hive - https://hortonworks.com/blog/query-federation-with-hive/
© Cloudera, Inc. All rights reserved.
THANK YOU
More Related Content

What's hot

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Monitoring Microservices
Monitoring MicroservicesMonitoring Microservices
Monitoring MicroservicesWeaveworks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglyTyler Wishnoff
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication confluent
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...HostedbyConfluent
 
Redis cluster
Redis clusterRedis cluster
Redis clusteriammutex
 
GCP for Apache Kafka® Users: Stream Ingestion and Processing
GCP for Apache Kafka® Users: Stream Ingestion and ProcessingGCP for Apache Kafka® Users: Stream Ingestion and Processing
GCP for Apache Kafka® Users: Stream Ingestion and Processingconfluent
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Edureka!
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 

What's hot (20)

Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Monitoring Microservices
Monitoring MicroservicesMonitoring Microservices
Monitoring Microservices
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
 
Redis cluster
Redis clusterRedis cluster
Redis cluster
 
GCP for Apache Kafka® Users: Stream Ingestion and Processing
GCP for Apache Kafka® Users: Stream Ingestion and ProcessingGCP for Apache Kafka® Users: Stream Ingestion and Processing
GCP for Apache Kafka® Users: Stream Ingestion and Processing
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 

Similar to Druid and Hive Together : Use Cases and Best Practices

Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionCloudera, Inc.
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...DataStax Academy
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionCloudera, Inc.
 
The new big data
The new big dataThe new big data
The new big dataAdam Doyle
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
 
Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTCloudera, Inc.
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionTimothy Spann
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 
Apache Druid Design and Future prospect
Apache Druid Design and Future prospectApache Druid Design and Future prospect
Apache Druid Design and Future prospectc-bslim
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
How DBAs can garner the power of the Oracle Public Cloud?
How DBAs can garner the  power of the Oracle Public  Cloud?How DBAs can garner the  power of the Oracle Public  Cloud?
How DBAs can garner the power of the Oracle Public Cloud?Gustavo Rene Antunez
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...Neo4j
 

Similar to Druid and Hive Together : Use Cases and Best Practices (20)

Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solution
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber Solution
 
The new big data
The new big dataThe new big data
The new big data
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
 
Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BT
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC Solution
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 
Apache Druid Design and Future prospect
Apache Druid Design and Future prospectApache Druid Design and Future prospect
Apache Druid Design and Future prospect
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
How DBAs can garner the power of the Oracle Public Cloud?
How DBAs can garner the  power of the Oracle Public  Cloud?How DBAs can garner the  power of the Oracle Public  Cloud?
How DBAs can garner the power of the Oracle Public Cloud?
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 

Recently uploaded (20)

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 

Druid and Hive Together : Use Cases and Best Practices

  • 1. © Cloudera, Inc. All rights reserved. DRUID AND HIVE TOGETHER USE CASES AND BEST PRACTICES Nishant Bangarwa
  • 2. © Cloudera, Inc. All rights reserved. 2 AGENDA Motivation Introduction to Druid Hive and Druid Performance Numbers Demo
  • 3. © Cloudera, Inc. All rights reserved. 3 Database popularity trend in last 24 months
  • 4. © Cloudera, Inc. All rights reserved. 4 Challenges with specialized DBs • Each specialized DB has different dialects and API • Diverse security and audit mechanisms • Different governance models • Data from different sources needs to be combined at client side • Need a solution to provide performance without added complexity
  • 5. © Cloudera, Inc. All rights reserved. 5 Query Federation with Apache Hive Extensible Storage Handler • Input Format • Output Format • SerDe • Rules for pushing computations • Filters, Aggregates, Sort, Limit etc.. • Transform from SQL to special dialects
  • 6. © Cloudera, Inc. All rights reserved. 6 Introduction to Apache Druid High performance analytics data store for timeseries data
  • 7. © Cloudera, Inc. All rights reserved. 7 Companies Using Druid http://druid.io/druid-powered
  • 8. © Cloudera, Inc. All rights reserved. 8 When to use Druid ? • Event Data/ Timeseries data • Realtime – Need to analyze events as they happen. • Delays can lead to business loss e.g. Fraud Detection • High Data Ingestion rate • Scalable horizontally • Queries generally involve aggregations and filtering on time • Results for last quarter • Aggregate comparisons over time, this week compared to last week etc. • Result set is much smaller than the actual dataset being queried
  • 9. © Cloudera, Inc. All rights reserved. 9 Common Use Cases • User activity and behavior analysis • clickstreams, viewstreams and activity streams • measuring user engagement, tracking A/B test data for product releases, and understanding usage patterns • Application performance management • operational data generated by applications • identify bottlenecks and troubleshoot issues in Realtime • IoT and device metrics • Ingest machine generated data in real-time • optimize hardware resources, identify issues, anomaly detection. • Digital marketing • understand advertising campaign performance, click through rates, conversion rates
  • 10. © Cloudera, Inc. All rights reserved. 10 When NOT to use Druid ? • updating existing records using a primary key • updates need to be done via Rebuilding Segments (Re-Ingestion) • Queries involve dumping entire dataset • joining one big fact table to another big fact table • query latency is not very important for business use case • offline reporting system
  • 11. © Cloudera, Inc. All rights reserved. 11 Key Druid Features • Column-oriented Storage • Sub-Second query times • Arbitrary slicing and dicing of data • Native Search Indexes • Horizontally Scalable • Streaming and Batch Ingestion • Automatic Data Summarization • Time based partition • Flexible Schemas • Rolling Upgrades
  • 12. © Cloudera, Inc. All rights reserved. 12 Druid Concepts Time Based Partitioning 1. Time partitioned Segment Files 2. Segments are versioned to support batch overrides 3. By Segment Query Results are Cached Segment 5_1: version1 Friday Time Segment 1: version1 Monday Segment 2: version1 Tuesday Segment 3: version2 Wednesday Segment 4: version1 Thursday Segment 5_2: version1 Friday
  • 13. © Cloudera, Inc. All rights reserved. 13 Druid Architecture Realtime Nodes Historical Nodes Batch Data Historical Nodes Broker Nodes Realtime Index Tasks Streaming Data Historical Nodes Handoff
  • 14. © Cloudera, Inc. All rights reserved. 15 Apache Hive and Apache Druid • Large Scale Queries • Joins, Subqueries • Windowing Functions • Transformations • Complex Aggregations • Advanced Sorting • UDFs • Queries to power visualizations • Needles-in-a-haystack • Dimensional Aggregates • TopN queries • Timeseries Queries • Min/Max Values • Streaming Ingestion
  • 15. © Cloudera, Inc. All rights reserved. 16 Integration Benefits 1. Streaming Ingestion 2. Single SQL dialect and API 3. Central security controls and audit trail 4. Unified governance 5. Ability to combine data from multiple sources 6. Data independence
  • 16. © Cloudera, Inc. All rights reserved. 17 Druid data sources in Hive Registering Existing Druid data sources Simple CREATE EXTERNAL TABLE statement CREATE EXTERNAL TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker"); Hive table name Hive storage handler classname Druid data source name ⇢ Broker node endpoint specified as a Hive configuration parameter ⇢ Automatic Druid data schema discovery: segment metadata query
  • 17. © Cloudera, Inc. All rights reserved. 18 Druid data sources in Hive Creating Druid data sources Use Create Table As Select (CTAS) statement CREATE EXTERNAL TABLE druid_table STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.segment.granularity" = "DAY”) AS SELECT time, page, user, c_added, c_removed FROM src; Hive table name Hive storage handler classname Druid segmentgranularity ⇢ Inference of Druid column types (timestamp,dimensions,metrics)dependson Hivecolumntype
  • 18. © Cloudera, Inc. All rights reserved. 19 Druid data sources in Hive File Sink operator uses Druid output format – Creates segment files and register them in Druid – Data needs to be partitioned by time granularity • Granularity specified as configuration parameter Optional Data Summarization Original CTAS physical plan __time page user c_added c_removed 2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T18:00:00Z Miley Ashu 2232 34 CTAS query results File Sink Select Table Scan
  • 19. © Cloudera, Inc. All rights reserved. 20 Druid data sources in Hive File Sink operator uses Druid output format – Creates segment files and register them in Druid – Data needs to be partitioned by time granularity • Granularity specified as configuration parameter Optional Data Summarization Rewritten CTAS physical plan CTAS query results File Sink Select Table Scan __time page user c_added c_removed __time_granularity 2011-01-01T01:05:00Z Justin Boxer 1800 25 2011-01-01T00:00:00Z 2011-01-02T19:00:00Z Justin Reach 2912 42 2011-01-02T00:00:00Z 2011-01-01T11:00:00Z Ke$ha Xeno 1953 17 2011-01-01T00:00:00Z 2011-01-02T13:00:00Z Ke$ha Helz 3194 170 2011-01-02T00:00:00Z 2011-01-02T18:00:00Z Miley Ashu 2232 34 2011-01-02T00:00:00Z Reduce
  • 20. Druid data sources in Hive: Creating Streaming Druid data sources
  Use CREATE EXTERNAL TABLE statement with an explicit schema:
  CREATE EXTERNAL TABLE druid_streaming (
    `__time` timestamp,
    `dimension1` string,
    `metric1` int,
    `metric2` double,
    ...)
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
  TBLPROPERTIES (
    "druid.segment.granularity" = "DAY",
    "kafka.bootstrap.servers" = "localhost:9092",
    "kafka.topic" = "topic1");
  (druid_streaming = Hive table name, DruidStorageHandler = Hive storage handler classname, "DAY" = Druid segment granularity, kafka.* = Kafka related properties)
  • 21. Druid data sources in Hive: Managing Streaming Ingestion from Hive
  Use ALTER TABLE statement:
  ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'START');
  ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'STOP');
  ALTER TABLE druid_streaming SET TBLPROPERTIES('druid.kafka.ingestion' = 'RESET');
  ⇢ RESET will reset the Kafka offsets maintained by Druid for ingestion
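  Once ingestion has been started, the streaming table is queryable like any other Druid-backed table while events keep arriving from Kafka; a minimal sketch using the columns declared on the previous slide:

  SELECT `dimension1`, SUM(`metric1`) AS total
  FROM druid_streaming
  WHERE `__time` >= '2019-01-01 00:00:00'   -- illustrative time filter
  GROUP BY `dimension1`
  ORDER BY total DESC
  LIMIT 10;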
  • 22. Querying Druid datasources
  • Automatic rewriting when a query is expressed over a Druid table
  – Powered by Apache Calcite
  – Main challenge: identify patterns in the logical plan corresponding to different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
  • Translate (sub)plan of operators into a valid Druid JSON query
  – Druid query is encapsulated within the Hive TableScan operator
  • Hive TableScan uses Druid input format
  – Submits the query to Druid and generates records out of the query results
  • It might not be possible to push all computation to Druid
  – Our contract is that the query should always be executed
  • 23. Querying Druid datasources: Apache Hive SQL query
  SELECT `user`, sum(`c_added`) AS s
  FROM druid_table_1
  WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
  GROUP BY `user`
  ORDER BY s DESC
  LIMIT 10;
  • Top 10 users that have added the most characters from the beginning of 2010 until the end of 2011
  Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink
  • 24. Querying Druid datasources (same SQL query and logical plan as above)
  • Initial plan:
  – Scan is executed in Druid (select query)
  – Rest of the query is executed in Hive
  • 25. Querying Druid datasources (same SQL query and logical plan as above)
  • Rewriting rules push computation into Druid
  – Need to check that an operator meets some pre-conditions before pushing it to Druid
  • 26.–27. Querying Druid datasources (same SQL query and logical plan as above)
  • The rewriting rules first fold the Filter and Project operators into the Druid query, which at this stage is still a select query
  • 28.–29. Querying Druid datasources (same SQL query and logical plan as above)
  • Once the Aggregate and Sort/Limit operators are also pushed, the Druid query becomes a groupBy query
  • 30. Querying Druid datasources
  Query logical plan as above; rewritten physical plan: Druid Scan (groupBy) → File Sink
  Druid JSON query:
  {
    "queryType": "groupBy",
    "dataSource": "users_index",
    "granularity": "all",
    "dimensions": ["user"],
    "aggregations": [
      { "type": "longSum", "name": "s", "fieldName": "c_added" }
    ],
    "limitSpec": {
      "type": "default",
      "limit": 10,
      "columns": [
        { "dimension": "s", "direction": "descending" }
      ]
    },
    "intervals": ["2010-01-01T00:00:00.000/2012-01-01T00:00:00.000"]
  }
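  A convenient way to check what the rewrite produced is to EXPLAIN the statement: when pushdown succeeds, the Druid-handled TableScan carries the serialized query in its properties (property names such as druid.query.json and druid.query.type may differ slightly across Hive versions, so treat this as a sketch):

  EXPLAIN
  SELECT `user`, sum(`c_added`) AS s
  FROM druid_table_1
  WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
  GROUP BY `user`
  ORDER BY s DESC
  LIMIT 10;
  -- In the plan output, look for the TableScan over druid_table_1 and inspect its
  -- properties, e.g. druid.query.type = groupBy and the generated JSON shown above.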
  • 31. Druid input format
  • Submits query to Druid and generates records out of the query results
  • Current version:
  – Timeseries, TopN, and GroupBy queries are not partitioned; they are sent directly to the Druid broker (a single Table Scan / record reader)
  – Select (scan) queries are partitioned: realtime and historical nodes are contacted directly (one Table Scan / record reader per node)
  • 32. Performance and Scalability: Fast Facts
  Most Events per Day: 300 Billion Events / Day (Metamarkets)
  Most Computed Metrics: 1 Billion Metrics / Min (Jolata)
  Largest Cluster: 200 Nodes (Snap Inc)
  Largest Hourly Ingestion: 2TB per Hour (Netflix)
  • 33. Performance Numbers
  • Query Latency
  • average - 500ms
  • 90%ile < 1 sec
  • 95%ile < 5 sec
  • 99%ile < 10 sec
  • Query Volume
  • 1000s of queries per minute
  • Benchmarking code - https://github.com/druid-io/druid-benchmark
  • 34. Performance Numbers: SSB Benchmark, 1TB Scale
  • 35. Useful Resources
  • Druid website - http://druid.io
  • Druid User Group - users@druid.incubator.apache.org
  • Druid Dev Group - dev@druid.incubator.apache.org
  • Hive Druid Integration - https://cwiki.apache.org/confluence/display/Hive/Druid+Integration
  • Blogs - https://hortonworks.com/blog/apache-hive-druid-part-1-3/
  • Query Federation with Apache Hive - https://hortonworks.com/blog/query-federation-with-hive/
  • 36. THANK YOU