Druid is a real-time analytics engine that allows for fast querying of large datasets. It is column-oriented, supports high concurrency, and can scale to thousands of servers and millions of messages per second. Druid uses an immutable segment-based architecture to store data efficiently and supports real-time and batch ingestion of data. It allows for complex queries through SQL and JSON and uses approximations and data sketches to improve query performance for large datasets.
2. Druid Summit: Real-Time Analytics at Scale
San Francisco Airport Marriott Waterfront
https://www.druidsummit.org/
Register using code "Webinar50" and receive 50% off!
4. Key features
● Column-oriented
● High concurrency
● Scalable to 1000s of servers, millions of messages/sec
● Continuous, real-time ingest
● Query through SQL
● Target query latency: sub-second to a few seconds
5. Open core
Imply's open engine, Druid, is becoming a standard part of modern data infrastructure.
Druid
● Next-generation analytics engine
● Widely adopted
Workflow transformation
● Sub-second speed unlocks new workflows
● Self-service exploration of data patterns
● Make data fun again
6. Where Druid fits in
[Diagram: raw data from data lakes and message buses flows through the pipeline (raw data, storage, analyze, application), with Druid in the analyze stage]
7. What is Druid?
Druid combines ideas from three kinds of systems to power a new type of analytics application:
Search platform
● Real-time ingestion
● Flexible schema
● Full text search
OLAP
● Batch ingestion
● Efficient storage
● Fast analytic queries
Timeseries database
● Optimized for time-based datasets
● Time-based functions
11. Data Server
Historical
● Stores data segments
● Handles most of the computation workload
MiddleManager
● Controls ingestion tasks (peons)
Peons
● Ingest and index streaming data
● Serve data that is still in-flight
15. Streaming Ingestion
Kafka
● Supervisor type: kafka
● How it works: Druid reads directly from Apache Kafka.
● Can ingest late data: yes
● Exactly-once guarantees: yes
Kinesis
● Supervisor type: kinesis
● How it works: Druid reads directly from Amazon Kinesis.
● Can ingest late data: yes
● Exactly-once guarantees: yes
Tranquility
● Supervisor type: N/A
● How it works: Tranquility, a library that ships separately from Druid, is used to push data into Druid.
● Can ingest late data: no (late data is dropped based on the windowPeriod config)
● Exactly-once guarantees: no
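For reference, a minimal Kafka supervisor spec might look like the sketch below. The topic name, broker address, and column choices are illustrative rather than from this deck, and exact fields vary by Druid version:
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["page", "city"] },
      "metricsSpec": [
        { "type": "longSum", "name": "sum_added", "fieldName": "added" }
      ],
      "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "MINUTE", "rollup": true }
    },
    "ioConfig": {
      "topic": "wikipedia-events",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "kafka01:9092" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}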
16. Batch Ingestion
Native batch (simple)
● Parallel: no; each task is single-threaded.
● Can append or overwrite: yes, both.
● File formats: text file formats (CSV, TSV, JSON).
● Rollup modes: perfect if forceGuaranteedRollup = true in the tuningConfig.
● Partitioning options: hash-based partitioning is supported when forceGuaranteedRollup = true in the tuningConfig.
Native batch (parallel)
● Parallel: yes, if the firehose is splittable and maxNumConcurrentSubTasks > 1 in the tuningConfig. See the firehose documentation for details.
● Can append or overwrite: yes, both.
● File formats: text file formats (CSV, TSV, JSON).
● Rollup modes: perfect if forceGuaranteedRollup = true in the tuningConfig.
● Partitioning options: hash-based partitioning (when forceGuaranteedRollup = true).
Hadoop-based
● Parallel: yes, always.
● Can append or overwrite: overwrite only.
● File formats: any Hadoop InputFormat.
● Rollup modes: always perfect.
● Partitioning options: hash-based or range-based partitioning via partitionsSpec.
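A skeletal native parallel batch task might look like the following. Paths and the sub-task count are illustrative; newer Druid versions use inputSource/inputFormat in place of the firehose mentioned above:
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["page", "city"] },
      "granularitySpec": { "segmentGranularity": "DAY", "queryGranularity": "HOUR", "rollup": true }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/tmp/wikipedia", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  }
}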
18. How data is structured
● Druid stores data in immutable segments
● Column-oriented, compressed format
● Dictionary-encoded at the column level
● Bitmap index compression: Concise and Roaring
○ Roaring is typically recommended; it is faster for boolean operations such as filters
● Rollup (partial aggregation)
19. Choose column types carefully
String columns
● Indexed: filtering on them is fast
● Aggregation and grouping are slower
Numeric columns
● Not indexed: filtering is slower
● Fast aggregation and fast grouping
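Column types are declared per column in the dimensionsSpec at ingestion time. A fragment for illustration; the column names come from the examples later in this deck (note that in the rollup examples, added and deleted are ingested as metrics rather than dimensions):
"dimensionsSpec": {
  "dimensions": [
    "page",
    "city",
    { "type": "long", "name": "added" },
    { "type": "long", "name": "deleted" }
  ]
}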
22. Druid Segments
timestamp | page | city | added | deleted
Segment 2011-01-01T00/2011-01-01T01
2011-01-01T00:01:35Z | Justin Bieber | SF | 10 | 5
2011-01-01T00:03:45Z | Justin Bieber | LA | 25 | 37
2011-01-01T00:05:62Z | Justin Bieber | SF | 15 | 19
Segment 2011-01-01T01/2011-01-01T02
2011-01-01T01:06:33Z | Ke$ha | LA | 30 | 45
2011-01-01T01:08:51Z | Ke$ha | LA | 16 | 8
2011-01-01T01:09:17Z | Miley Cyrus | DC | 75 | 10
Segment 2011-01-01T02/2011-01-01T03
2011-01-01T02:23:30Z | Miley Cyrus | DC | 22 | 12
2011-01-01T02:49:33Z | Miley Cyrus | DC | 90 | 41
23. Anatomy of a Druid Segment
Physical storage format:
__time (LONG): 1293840000000, 1293840000000, 1293840000000, 1293840000000, 1293840000000, 1293840000000, 1293840000000, 1293840000000
page (STRING), dictionary-encoded (sorted) with a bitmap index (stored compressed):
● DICT: Justin = 0, Ke$ha = 1, Miley = 2
● DATA: 0, 0, 0, 1, 1, 2, 2, 2
● INDEX: Justin = rows [0,1,2] (11100000); Ke$ha = rows [3,4] (00011000); Miley = rows [5,6,7] (00000111)
city (STRING), dictionary-encoded (sorted) with a bitmap index (stored compressed):
● DICT: DC = 0, LA = 1, SF = 2
● DATA: 2, 1, 2, 1, 1, 0, 0, 0
● INDEX: SF = rows [0,2] (10100000); LA = rows [1,3,4] (01011000); DC = rows [5,6,7] (00000111)
added (LONG): 1800, 2912, 1953, 3194, 5690, 1100, 8423, 9080
removed (LONG): 25, 42, 17, 170, 112, 67, 53, 94
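To see why the bitmap index matters, consider how a filtered query against this segment could be evaluated. This is a worked illustration, not Druid output, and the datasource name is hypothetical:
-- Filter two string dimensions; each filter resolves to a bitmap
SELECT SUM(added)
FROM wikipedia
WHERE city = 'LA' AND page = 'Ke$ha'
-- city = 'LA'    -> bitmap 01011000 (rows 1, 3, 4)
-- page = 'Ke$ha' -> bitmap 00011000 (rows 3, 4)
-- bitwise AND    -> 00011000, so only rows 3 and 4 are read:
-- SUM(added) = 3194 + 5690 = 8884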
25. Optimize segment size
Ideally 300-700 MB (~5 million rows)
To control segment size:
● Alter segment granularity
● Specify a partition spec
● Use automatic compaction
26. Controlling Segment Size
● Segment granularity: increase if there is only 1 file per segment and it is < 200 MB
"segmentGranularity": "HOUR"
● Max rows per segment: increase if a single segment is < 200 MB
"maxRowsPerSegment": 5000000
27. Compaction
● Combines small segments into larger segments
● Useful for late-arriving data
● Task submitted to the Overlord
{
  "type": "compact",
  "dataSource": "wikipedia",
  "interval": "2017-01-01/2018-01-01"
}
29. Rollup
Raw data:
timestamp | page | city | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | SF | 10 | 5
2011-01-01T00:03:45Z | Justin Bieber | SF | 25 | 37
2011-01-01T00:05:62Z | Justin Bieber | SF | 15 | 19
2011-01-01T00:06:33Z | Ke$ha | LA | 30 | 45
2011-01-01T00:08:51Z | Ke$ha | LA | 16 | 8
2011-01-01T00:09:17Z | Miley Cyrus | DC | 75 | 10
2011-01-01T00:11:25Z | Miley Cyrus | DC | 11 | 25
2011-01-01T00:23:30Z | Miley Cyrus | DC | 22 | 12
2011-01-01T00:49:33Z | Miley Cyrus | DC | 90 | 41
After rollup (aggregated to the hour):
timestamp | page | city | count | sum_added | sum_deleted
2011-01-01T00:00:00Z | Justin Bieber | SF | 3 | 50 | 61
2011-01-01T00:00:00Z | Ke$ha | LA | 2 | 46 | 53
2011-01-01T00:00:00Z | Miley Cyrus | DC | 4 | 198 | 88
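The rollup shown above corresponds to a granularitySpec with rollup enabled plus a metricsSpec defining the aggregations; a fragment for illustration (exact placement depends on the ingestion method):
"granularitySpec": {
  "queryGranularity": "HOUR",
  "rollup": true
},
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "longSum", "name": "sum_added", "fieldName": "added" },
  { "type": "longSum", "name": "sum_deleted", "fieldName": "deleted" }
]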
30. Roll-up vs no roll-up
Do roll-up
● Working under space constraints.
● No need to retain high-cardinality dimensions (like user ID or precise location information).
● Maximize price/performance.
Don't roll-up
● Need the ability to retrieve individual events.
● May need to group or filter on any column.
31. Partitioning beyond time
● Druid always partitions by time
● Beyond that, partition by a dimension you often filter on
● Improves locality, compression, storage size, and query performance
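For example, secondary partitioning on a frequently filtered dimension can be expressed in the tuningConfig; a sketch using single-dimension (range) partitioning, with the dimension name and target size illustrative and availability depending on Druid version:
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "city",
    "targetRowsPerSegment": 5000000
  }
}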
32. Modeling data for fast search
Exact match or prefix filtering:
○ Uses binary search
○ Only the dictionary + index section of the dimension is needed
○ Example: if you frequently search SSNs by their last 4 digits, store them reversed (123-45-6789 becomes 9876-54-321) so the search becomes a prefix match
select count(*) from wikiticker where "comment" like 'A%'
select count(*) from wikiticker where "comment" like '%A%'
The first query is a prefix filter and can binary-search the sorted dictionary; the second is an infix filter and must check every dictionary entry.
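As a sketch of the reversed-SSN trick, assuming a hypothetical ssn_reversed column populated at ingestion time:
-- Find rows whose SSN ends in 6789; '9876-' is the reversed suffix,
-- so this is a fast prefix filter instead of a slow infix scan
select count(*) from wikiticker where "ssn_reversed" like '9876-%'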
33. Approx Algorithms
● Data sketches are lossy data structures
● They trade off accuracy for reduced storage and improved performance
● Summarize data at ingestion time using sketches
● Improves rollup and reduces the memory footprint
34. Summarize with data sketches
Raw data:
timestamp | page | userid | city | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | user11 | SF | 10 | 5
2011-01-01T00:03:45Z | Justin Bieber | user22 | SF | 25 | 37
2011-01-01T00:05:62Z | Justin Bieber | user11 | SF | 15 | 19
2011-01-01T00:06:33Z | Ke$ha | user33 | LA | 30 | 45
2011-01-01T00:08:51Z | Ke$ha | user33 | LA | 16 | 8
2011-01-01T00:09:17Z | Miley Cyrus | user11 | DC | 75 | 10
2011-01-01T00:11:25Z | Miley Cyrus | user44 | DC | 11 | 25
2011-01-01T00:23:30Z | Miley Cyrus | user44 | DC | 22 | 12
2011-01-01T00:49:33Z | Miley Cyrus | user55 | DC | 90 | 41
After rollup with a userid sketch:
timestamp | page | city | count | sum_added | sum_deleted | userid_sketch
2011-01-01T00:00:00Z | Justin Bieber | SF | 3 | 50 | 61 | sketch_obj
2011-01-01T00:00:00Z | Ke$ha | LA | 2 | 46 | 53 | sketch_obj
2011-01-01T00:00:00Z | Miley Cyrus | DC | 4 | 198 | 88 | sketch_obj
35. When close enough is good enough
Approximate queries can provide up to 99% accuracy while greatly improving performance:
● Bloom filters
○ Self joins
● Theta sketches
○ Union / intersection / difference
● HLL sketches
○ Count distinct
● Quantile sketches
○ Median, percentiles
36. When close enough is good enough
● Hashes can be calculated at ingestion time or at query time
○ Pre-computed hashes can save up to 50% of query time
● The k value determines precision and performance
● Default values will count within 5% accuracy 99% of the time
● HLL and Theta sketches can both provide COUNT DISTINCT support, but HLL will do it faster and more accurately with a smaller data footprint
● Theta sketches are more flexible, but require more storage
38. Use Druid SQL
● Easier to learn and more familiar
● Will attempt to make intelligent query-type choices (timeseries vs. topN vs. groupBy)
● There are some limitations: for example, multi-value dimensions and certain aggregations are not fully supported
40. Explain Plan
EXPLAIN PLAN FOR
SELECT channel, SUM(added)
FROM wikipedia
WHERE commentLength >= 50
GROUP BY channel
ORDER BY SUM(added) DESC
LIMIT 3
42. Pick your query carefully
● TimeBoundary: returns the min/max timestamp for a given interval
● Timeseries: when you don't want to group by dimension
● TopN: when you want to group by a single dimension
○ Approximate if > 1000 dimension values
● GroupBy: least performant, most flexible
● Scan: for returning streaming raw data
○ Perfect ordering is not preserved
● Select: for returning paginated raw data
● Search: returns dimensions that match a text search
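A minimal native topN query for illustration; the datasource and column names follow the earlier examples, and the interval is arbitrary:
{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "intervals": ["2011-01-01/2011-01-02"],
  "granularity": "all",
  "dimension": "city",
  "metric": "sum_added",
  "threshold": 3,
  "aggregations": [
    { "type": "longSum", "name": "sum_added", "fieldName": "added" }
  ]
}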
43. Datasources
● Table
○ Most basic; queries a single Druid table
● Union
○ Equivalent of UNION ALL
○ Returns the combined results of multiple queries
● Query
○ Equivalent of a sub-query
○ Results of the inner query are used as the datasource for the outer query
○ Increases load on the Broker
44. Filters
● Interval
○ Matches a time range; can be used on __time or any column with a millisecond timestamp
● Selector
○ Matches a single dimension to a value
● Column Comparison
○ Compares two columns, e.g. ColA == ColB
● Search
○ Filters on partial string matches
● In
○ Matches against a list of values
45. Filters (cont)
● Like Filter
○ Equivalent of SQL LIKE
○ Can perform better than the Search filter for prefix-only searching
A note about extraction functions:
Most filters also support extraction functions. Their performance varies greatly; where possible, applying transformations at ingestion time instead of query time improves performance.
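Filters compose; for example, an AND of a selector and an IN filter, sketched with columns from the earlier examples:
"filter": {
  "type": "and",
  "fields": [
    { "type": "selector", "dimension": "city", "value": "SF" },
    { "type": "in", "dimension": "page", "values": ["Justin Bieber", "Ke$ha"] }
  ]
}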
46. Context
● A queryId can be specified; prefixes are very useful for debugging in Clarity
● Can control timeout, cache usage, and other fine-tuning parameters
● minTopNThreshold
○ Default 1000; specifies the minimum number of records to return when merging topN results. Can be increased to improve precision
● skipEmptyBuckets
○ Stops zero-filling of timeseries queries
https://druid.apache.org/docs/latest/querying/query-context.html
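Context parameters are passed as a top-level context object on the query; a fragment with illustrative values:
"context": {
  "queryId": "dashboard-topn-42",
  "timeout": 60000,
  "useCache": true,
  "skipEmptyBuckets": true
}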
47. Using Lookups
● Lookups are key/value pairs stored on every node
○ Stored in memory
○ Alpha: stored on disk in PalDB format
● Lookups are loaded via the Coordinator API
● Can be queried with either JSON or SQL queries
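In SQL, a lookup is applied with the LOOKUP function; the lookup name and column below are illustrative:
SELECT LOOKUP("city", 'city_to_region') AS region, COUNT(*)
FROM wikiticker
GROUP BY LOOKUP("city", 'city_to_region')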
48. Virtual Columns and Expressions
● Used to manipulate columns at query time, including for lookups
{
  "type": "expression",
  "name": "outputRowName",
  "expression": "replace(inputColumn, 'foo', 'bar')",
  "outputType": "STRING"
}
49. Other Approximate SQL functions
● APPROX_COUNT_DISTINCT
○ Uses HyperUnique; can be a dimension or a cardinality column
● APPROX_COUNT_DISTINCT_DS_HLL
○ Same as above, but implemented with DataSketches (DS)
○ More ability to tune precision
● APPROX_QUANTILE
○ Calculates quantiles using the ApproxHistogram algorithm
● APPROX_QUANTILE_DS
○ Calculates quantiles using DataSketches
● APPROX_QUANTILE_FIXED_BUCKETS
○ Calculates a fixed-bucket histogram
○ Faster calculation than the other APPROX_QUANTILE functions
More details: https://druid.apache.org/docs/latest/querying/sql.html
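For example, combining the DataSketches variants in one query (requires the druid-datasketches extension; column names follow the sketch example earlier, and precision parameters are left at their defaults):
SELECT
  APPROX_COUNT_DISTINCT_DS_HLL("userid") AS unique_users,
  APPROX_QUANTILE_DS("added", 0.5) AS median_added
FROM wikipedia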