Introduction to Apache Druid
Benjamin Hopp
ben@imply.io
San Francisco Airport Marriott Waterfront
Real-Time Analytics at Scale
https://www.druidsummit.org/
Register Using Code “Webinar50” and Receive 50% Off!
The Problem
● Data Exploration
● Data Ingestion
● Data Availability
Key features
● Column oriented
● High concurrency
● Scalable to 1000s of servers, millions of messages/sec
● Continuous, real-time ingest
● Query through SQL
● Target query latency sub-second to a few seconds
Open core
Imply’s open engine, Druid, is becoming a standard part of modern data infrastructure.
Druid
● Next generation analytics engine
● Widely adopted
Workflow transformation
● Subsecond speed unlocks new workflows
● Self-service explanations of data patterns
● Make data fun again
Where Druid fits in
Data lakes
Message buses
Raw data Storage Analyze Application
What is Druid?
Druid combines ideas from three kinds of systems to power a new type of analytics application:
Search platform
● Real-time ingestion
● Flexible schema
● Full text search
OLAP
● Batch ingestion
● Efficient storage
● Fast analytic queries
Timeseries database
● Optimized for time-based datasets
● Time-based functions
Druid Architecture
Master Server
Coordinator
● Controls where segments get loaded
Overlord
● Manages streaming supervisor tasks
Query Server
Broker
● Serves queries
Router
● Provides API gateway
● Consolidated user interface
Data Server
Historical
● Stores data segments
● Handles most of the query computation
MiddleManager
● Controls ingestion tasks (peons)
Peons
● Ingest and index streaming data
● Serve data that is in-flight (not yet handed off to Historicals)
Druid’s logical data model
Timestamp Dimensions Metrics
Querying Druid
● JSON via API
● SQL via API
● SQL via Unified Console
● Pivot
● Other JDBC BI tools
Imply Pivot
Streaming Ingestion
● Kafka (supervisor type: kafka)
○ How it works: Druid reads directly from Apache Kafka
○ Can ingest late data: Yes
○ Exactly-once guarantees: Yes
● Kinesis (supervisor type: kinesis)
○ How it works: Druid reads directly from Amazon Kinesis
○ Can ingest late data: Yes
○ Exactly-once guarantees: Yes
● Tranquility (no supervisor)
○ How it works: Tranquility, a library that ships separately from Druid, pushes data into Druid
○ Can ingest late data: No (late data is dropped based on the windowPeriod config)
○ Exactly-once guarantees: No
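For example, a minimal Kafka supervisor spec POSTed to the Overlord might look like the sketch below. Topic, broker address, and schema are illustrative, and the exact dataSchema layout varies by Druid version:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "wikipedia",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "iso" },
        "dimensionsSpec": { "dimensions": ["page", "city"] }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "longSum", "name": "sum_added", "fieldName": "added" },
      { "type": "longSum", "name": "sum_deleted", "fieldName": "deleted" }
    ],
    "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "MINUTE" }
  },
  "ioConfig": {
    "topic": "wikipedia",
    "consumerProperties": { "bootstrap.servers": "kafka01:9092" },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H"
  }
}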
Batch Ingestion
● Native batch (simple)
○ Parallel? No - each task is single-threaded
○ Can append or overwrite? Yes, both
○ File formats: text formats (CSV, TSV, JSON)
○ Rollup modes: perfect if forceGuaranteedRollup=true in the tuningConfig
○ Partitioning: hash-based partitioning is supported when forceGuaranteedRollup=true in the tuningConfig
● Native batch (parallel)
○ Parallel? Yes, if the firehose is splittable and maxNumConcurrentSubTasks > 1 in the tuningConfig (see the firehose documentation for details)
○ Can append or overwrite? Yes, both
○ File formats: text formats (CSV, TSV, JSON)
○ Rollup modes: perfect if forceGuaranteedRollup=true in the tuningConfig
○ Partitioning: hash-based partitioning (when forceGuaranteedRollup=true)
● Hadoop-based
○ Parallel? Yes, always
○ Can append or overwrite? Overwrite only
○ File formats: any Hadoop InputFormat
○ Rollup modes: always perfect
○ Partitioning: hash-based or range-based partitioning via partitionsSpec
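For example, a native parallel batch task sketch, assuming JSON files in a local directory (paths and schema illustrative, and assuming the firehose used is splittable so that maxNumConcurrentSubTasks > 1 takes effect):
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["page", "city"] }
        }
      },
      "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY" }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": { "type": "local", "baseDir": "/data/wikipedia", "filter": "*.json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  }
}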
Data structures in Apache Druid
How data is structured
● Druid stores data in immutable segments
● Column-oriented, compressed format
● Dictionary-encoded at the column level
● Bitmap index compression: Concise and Roaring
○ Roaring is typically recommended - faster for boolean operations such as filters
● Rollup (partial aggregation)
Choose column types carefully
String column: indexed (fast filtering), but slower aggregation and grouping
Numeric column: not indexed (slower filtering), but fast aggregation and grouping
Segments Arranged by Time Chunks
Druid’s logical data model
Timestamp Dimensions Metrics
Druid Segments
timestamp page city added deleted
2011-01-01T00:01:35Z Justin Bieber SF 10 5
2011-01-01T00:03:45Z Justin Bieber LA 25 37
2011-01-01T00:05:62Z Justin Bieber SF 15 19
→ Segment 2011-01-01T00/2011-01-01T01
2011-01-01T01:06:33Z Ke$ha LA 30 45
2011-01-01T01:08:51Z Ke$ha LA 16 8
2011-01-01T01:09:17Z Miley Cyrus DC 75 10
→ Segment 2011-01-01T01/2011-01-01T02
2011-01-01T02:23:30Z Miley Cyrus DC 22 12
2011-01-01T02:49:33Z Miley Cyrus DC 90 41
→ Segment 2011-01-01T02/2011-01-01T03
Anatomy of a Druid Segment
Physical storage format - each column is stored separately:
__time (LONG): 1293840000000, repeated for all eight rows (raw long values)
page (STRING):
DATA: 0 0 0 1 1 2 2 2 (dictionary-encoded values, one per row)
DICT: Justin = 0, Ke$ha = 1, Miley = 2 (dictionary, encoded sorted)
INDEX: [0,1,2] → 11100000, [3,4] → 00011000, [5,6,7] → 00000111 (one bitmap per dictionary value, stored compressed)
city (STRING):
DATA: 2 1 2 1 1 0 0 0
DICT: DC = 0, LA = 1, SF = 2
INDEX: [0,2] → 10100000, [1,3,4] → 01011000, [5,6,7] → 00000111
added (LONG): 1800 2912 1953 3194 5690 1100 8423 9080 (raw values)
removed (LONG): 25 42 17 170 112 67 53 94 (raw values)
Filter Query Path
timestamp page
2011-01-01T00:01:35Z Justin Bieber
2011-01-01T00:03:45Z Justin Bieber
2011-01-01T00:05:62Z Justin Bieber
2011-01-01T00:06:33Z Ke$ha
2011-01-01T00:08:51Z Ke$ha
Justin Bieber → [1 1 1 0 0]
Ke$ha → [0 0 0 1 1]
JB or KS → [1 1 1 1 1] (the two bitmaps are ORed together; only matching rows are read)
Optimize segment size
Ideally 300-700 MB (~5 million rows) per segment.
To control segment size:
● Alter segment granularity
● Specify a partition spec
● Use automatic compaction
Controlling Segment Size
● Segment granularity - increase if there is only one file per segment and it is < 200 MB
"segmentGranularity": "HOUR"
● Max rows per segment - increase if a single segment is < 200 MB
"maxRowsPerSegment": 5000000
Compaction
● Combines small segments into larger segments
● Useful for late-arriving data
● Task submitted to Overlord
{
"type" : "compact",
"dataSource" : "wikipedia",
"interval" : "2017-01-01/2018-01-01"
}
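A compaction task is submitted to the Overlord like any other task, via its task API; for example (host is illustrative; 8090 is the default Overlord port):
curl -X POST -H 'Content-Type: application/json' \
  -d @compaction-task.json \
  http://overlord-host:8090/druid/indexer/v1/task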
Rollup
● Pre-aggregation at ingestion time (enabled via the spec settings sketched below)
● Saves space, better compression
● Query performance boost
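A minimal sketch of the dataSchema settings that produce the rollup shown on the next slide, assuming added and deleted input columns:
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": "HOUR",
  "rollup": true
},
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "longSum", "name": "sum_added", "fieldName": "added" },
  { "type": "longSum", "name": "sum_deleted", "fieldName": "deleted" }
]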
Rollup
Raw input rows:
timestamp page city added deleted
2011-01-01T00:01:35Z Justin Bieber SF 10 5
2011-01-01T00:03:45Z Justin Bieber SF 25 37
2011-01-01T00:05:62Z Justin Bieber SF 15 19
2011-01-01T00:06:33Z Ke$ha LA 30 45
2011-01-01T00:08:51Z Ke$ha LA 16 8
2011-01-01T00:09:17Z Miley Cyrus DC 75 10
2011-01-01T00:11:25Z Miley Cyrus DC 11 25
2011-01-01T00:23:30Z Miley Cyrus DC 22 12
2011-01-01T00:49:33Z Miley Cyrus DC 90 41
Rolled-up rows (hourly granularity):
timestamp page city count sum_added sum_deleted
2011-01-01T00:00:00Z Justin Bieber SF 3 50 61
2011-01-01T00:00:00Z Ke$ha LA 2 46 53
2011-01-01T00:00:00Z Miley Cyrus DC 4 198 88
Roll-up vs no roll-up
Do roll-up when:
● Working with space constraints.
● There is no need to retain high-cardinality dimensions (like user ID or precise location information).
● Maximizing price/performance.
Don’t roll-up when:
● You need the ability to retrieve individual events.
● You may need to group or filter on any column.
Partitioning beyond time
● Druid always partitions by time
● Beyond time, you decide which dimension to partition on
● Partition by a dimension you often filter on (see the sketch below)
● Improves locality, compression, storage size, and query performance
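A hash-partitioning sketch for a native batch tuningConfig, assuming page is a commonly filtered dimension (exact field names vary slightly across Druid versions):
"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "hashed",
    "partitionDimensions": ["page"]
  }
}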
Modeling data for fast search
Exact match or prefix filtering:
○ Uses binary search on the dictionary
○ Only the dictionary + index section of the dimension is needed
○ Example: if you frequently search SSNs by their last 4 digits, store them reversed (123-45-6789 becomes 9876-54-321) so the search becomes a prefix match
select count(*) from wikiticker where "comment" like 'A%' -- prefix match: can use the dictionary's binary search
select count(*) from wikiticker where "comment" like '%A%' -- infix match: must scan every dictionary entry
Approx Algorithms
● Data sketches are lossy data structures
● They trade accuracy for reduced storage and improved performance
● Summarize data at ingestion time using sketches
● Improves roll-up, reduces memory footprint
Summarize with data sketches
Raw input rows:
timestamp page userid city added deleted
2011-01-01T00:01:35Z Justin Bieber user11 SF 10 5
2011-01-01T00:03:45Z Justin Bieber user22 SF 25 37
2011-01-01T00:05:62Z Justin Bieber user11 SF 15 19
2011-01-01T00:06:33Z Ke$ha user33 LA 30 45
2011-01-01T00:08:51Z Ke$ha user33 LA 16 8
2011-01-01T00:09:17Z Miley Cyrus user11 DC 75 10
2011-01-01T00:11:25Z Miley Cyrus user44 DC 11 25
2011-01-01T00:23:30Z Miley Cyrus user44 DC 22 12
2011-01-01T00:49:33Z Miley Cyrus user55 DC 90 41
Rolled-up rows:
timestamp page city count sum_added sum_deleted userid_sketch
2011-01-01T00:00:00Z Justin Bieber SF 3 50 61 sketch_obj
2011-01-01T00:00:00Z Ke$ha LA 2 46 53 sketch_obj
2011-01-01T00:00:00Z Miley Cyrus DC 4 198 88 sketch_obj
When close enough is good enough
Approximate queries can provide up to 99% accuracy while greatly improving performance:
● Bloom Filters
○ Self joins
● Theta Sketches
○ Union/intersection/difference
● HLL Sketches
○ Count distinct
● Quantile Sketches
○ Median, percentiles
When close enough is good enough
● Hashes can be calculated at query time or at ingestion
○ Pre-computed hashes can save up to 50% of query time
● The K value determines precision and performance
● Default values will count within 5% accuracy 99% of the time
● HLL and Theta sketches can both provide COUNT DISTINCT support, but HLL will do it faster and more accurately with a smaller data footprint
● Theta Sketches are more flexible, but require more storage
Best Practices for Querying Druid
Use Druid SQL
● Easier to learn/more familiar
● Druid will attempt to make intelligent query-type choices (timeseries vs. topN vs. groupBy)
● There are some limitations - for example, multi-value dimensions and some aggregations are not supported
SQL
Explain Plan
EXPLAIN PLAN FOR
SELECT channel, SUM(added)
FROM wikipedia
WHERE commentLength >= 50
GROUP BY channel
ORDER BY SUM(added) DESC
LIMIT 3
Native JSON
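The native JSON equivalent of the SQL above is a topN query; a sketch, with the interval and aggregator names illustrative:
{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "intervals": ["2016-06-27/2016-06-28"],
  "granularity": "all",
  "dimension": "channel",
  "metric": "sum_added",
  "threshold": 3,
  "filter": {
    "type": "bound",
    "dimension": "commentLength",
    "lower": "50",
    "ordering": "numeric"
  },
  "aggregations": [
    { "type": "longSum", "name": "sum_added", "fieldName": "added" }
  ]
}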
Pick your query carefully
● TimeBoundary - Returns min/max timestamp for given interval.
● Timeseries - When you don’t want to group by dimension
● TopN - When you want to group by a single dimension
○ Approximate if > 1000 dimension values
● GroupBy - Least performant/most flexible
● Scan - For returning streaming raw data
○ Perfect ordering not preserved
● Select - For returning paginated raw data
● Search - Returns dimensions that match text search
Datasources
● Table
○ Most basic, queries from single Druid table
● Union
○ Equivalent of UNION ALL
○ Returns results of multiple queries
● Query
○ Equivalent of a Sub-Query
○ Results of inner query used as datasource for outer query
○ Increases load on broker
Filters
● Interval
○ Matches a time range; can be used on __time or any column with a millisecond timestamp
● Selector
○ Matches a single dimension against a value
● Column Comparison
○ Compares two columns, e.g. ColA == ColB
● Search
○ Filters on partial string matches
● In
○ Matches against a list of values
Filters (cont.)
● Like Filter
○ Equivalent of SQL LIKE
○ Can perform better than the Search filter for prefix-only searching
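For example, a native like filter matching comments that begin with "A" (dimension name follows the earlier wikiticker example):
{ "type": "like", "dimension": "comment", "pattern": "A%" }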
A note about Extraction Functions
Most filters also support extraction functions. Their performance varies greatly; where possible, applying functions at ingestion time rather than query time improves performance.
Context
● A queryId can be specified - prefixes are very useful for debugging in Clarity
● Can control timeout, cache usage, and other fine-tuning parameters
● minTopNThreshold
○ Default 1000 - the minimum number of values each segment returns for merging topN results; can be increased to improve precision
● skipEmptyBuckets
○ Stops zero-filling of timeseries queries
https://druid.apache.org/docs/latest/querying/query-context.html
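For example, a context object attached to a native query (values illustrative):
"context": {
  "queryId": "dashboard-topn-42",
  "timeout": 60000,
  "useCache": true,
  "skipEmptyBuckets": true
}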
Using Lookups
● Lookups are key/value pairs stored on every node
○ Stored in memory
○ Alpha: stored on disk in PalDB format
● Lookups are loaded via the Coordinator API
● Can be queried with either native JSON or SQL queries (see the example below)
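In Druid SQL, the LOOKUP function applies a registered lookup at query time; a sketch, assuming a lookup named city_names has been registered (column and lookup names illustrative):
SELECT LOOKUP(city, 'city_names') AS city_name, COUNT(*) AS edits
FROM wikipedia
GROUP BY 1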
Virtual Columns and Expressions
● Used to manipulate columns at query time, including for lookups
{
  "type": "expression",
  "name": "outputRowName",
  "expression": "replace(inputColumn, 'foo', 'bar')",
  "outputType": "STRING"
}
Other Approximate SQL functions
● APPROX_COUNT_DISTINCT
○ Uses HyperUnique - can operate on a regular dimension or a cardinality column
● APPROX_COUNT_DISTINCT_DS_HLL
○ Same as above, but using DataSketches (DS)
○ More ability to tune precision
● APPROX_QUANTILE
○ Calculates quantiles using the ApproxHistogram algorithm
● APPROX_QUANTILE_DS
○ Calculates quantiles using DataSketches
● APPROX_QUANTILE_FIXED_BUCKETS
○ Calculates a fixed-bucket histogram
○ Faster to compute than the other APPROX_QUANTILE variants
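For example, combining two of these in one query (column names follow the earlier sketch example; the druid-datasketches extension must be loaded):
SELECT
  APPROX_COUNT_DISTINCT_DS_HLL(userid) AS unique_users,
  APPROX_QUANTILE_DS(added, 0.5) AS median_added
FROM wikipedia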
More Details: https://druid.apache.org/docs/latest/querying/sql.html
Stay in touch
@druidio
https://imply.io
https://druid.apache.org/
Ben Hopp
Benjamin.hopp@imply.io
LinkedIn: benhopp
@implydata