This document summarizes a typical day for a Druid architect. It covers common tasks such as evaluating production clusters, analyzing data and queries, and recommending optimizations. The architect asks stakeholders questions to understand usage and helps evaluate whether Druid is a good fit. When advising on Druid, the architect considers factors such as data sources, query types, and technology stacks. The document also provides tips on configuring clusters for performance and controlling segment size.
1. A Day in the Life of a Druid Architect
Benjamin Hopp
Senior Solutions Architect @ Imply
ben@imply.io
2. San Francisco Airport Marriott Waterfront
Real-Time Analytics at Scale
https://www.druidsummit.org/
4. What do I do?
Productionalization
Implementation
Recommendation
Education
5. Ask a lot of Questions
● What is the use-case?
○ Is it a good fit for Druid?
● Who are the stakeholders?
○ End users - running queries
○ Data Engineers - ingesting data
○ Cluster Administrators - managing services
● How are they using the cluster?
● Where is the data coming from?
● What are the issues or concerns?
● Where does Druid fit in the technology stack?
6. When to use Druid
Search platform
● Real-time ingestion
● Flexible schema
● Full text search
OLAP
● Batch ingestion
● Efficient storage
● Fast analytic queries
Timeseries database
● Optimized for time-based datasets
● Time-based functions
7. When NOT to use Druid
● OLTP
● Individual record update/delete
● Big join operations
8. Where Druid fits in
Raw data (data lakes, message buses) → Storage → Analyze → Application
11. Pick your servers
Data nodes
● Large-ish
● Scales with size of data and query volume
● Lots of cores, lots of memory, fast NVMe disk
Query nodes
● Medium-ish
● Scales with concurrency and # of Data nodes
● Typically CPU bound
Master nodes
● Small-ish
● Coordinator scales with # of segments
● Overlord scales with # of supervisors and tasks
15. Optimize segment size
Ideally 300-700 MB (~5 million rows)
To control segment size:
● Alter segment granularity
● Specify partition spec
● Use automatic compaction
16. Controlling Segment Size
● Number of tasks - keep to the lowest number that supports the max ingestion rate.
● Segment granularity - increase if there is only 1 file per segment and it is < 200 MB:
"segmentGranularity": "HOUR"
● Max rows per segment - increase if a single segment is < 200 MB:
"maxRowsPerSegment": 5000000
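Put together, these two settings sit in different parts of a native batch ingestion spec: segmentGranularity inside the dataSchema's granularitySpec, and maxRowsPerSegment inside the tuningConfig. A partial sketch (required fields such as the input source and dimensions are omitted; the dataSource name is illustrative):

```json
{
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "granularitySpec": {
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000
    }
  }
}
```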
17. Compaction
● Combines small segments into larger segments
● Useful for late-arriving data
● Submitted as a task to the Overlord
{
"type" : "compact",
"dataSource" : "wikipedia",
"interval" : "2017-01-01/2018-01-01"
}
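A compaction task like this is POSTed to the Overlord's task-submission endpoint (/druid/indexer/v1/task). A minimal Python sketch; the Overlord host and port (8090 is the usual default) are assumptions for illustration:

```python
import json
import urllib.request

# The compaction task spec from the slide above.
task = {
    "type": "compact",
    "dataSource": "wikipedia",
    "interval": "2017-01-01/2018-01-01",
}

def submit_task(task, overlord_url="http://localhost:8090"):
    """POST a task spec to the Overlord's task endpoint; returns the response
    (which contains the new task id) as a dict."""
    req = urllib.request.Request(
        overlord_url + "/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```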
19. Rollup
Raw data:
timestamp            | page          | city | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | SF   | 10    | 5
2011-01-01T00:03:45Z | Justin Bieber | SF   | 25    | 37
2011-01-01T00:05:62Z | Justin Bieber | SF   | 15    | 19
2011-01-01T00:06:33Z | Ke$ha         | LA   | 30    | 45
2011-01-01T00:08:51Z | Ke$ha         | LA   | 16    | 8
2011-01-01T00:09:17Z | Miley Cyrus   | DC   | 75    | 10
2011-01-01T00:11:25Z | Miley Cyrus   | DC   | 11    | 25
2011-01-01T00:23:30Z | Miley Cyrus   | DC   | 22    | 12
2011-01-01T00:49:33Z | Miley Cyrus   | DC   | 90    | 41

Rolled up to hourly granularity:
timestamp            | page          | city | count | sum_added | sum_deleted
2011-01-01T00:00:00Z | Justin Bieber | SF   | 3     | 50        | 61
2011-01-01T00:00:00Z | Ke$ha         | LA   | 2     | 46        | 53
2011-01-01T00:00:00Z | Miley Cyrus   | DC   | 4     | 198       | 88
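The transformation above can be sketched in plain Python: truncate each timestamp to the hour, group by (hour, page, city), and keep only the aggregates. This is a simulation of what Druid does at ingestion time, not Druid code:

```python
from collections import defaultdict

# Raw rows from the slide: (timestamp, page, city, added, deleted)
raw = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "SF", 10, 5),
    ("2011-01-01T00:03:45Z", "Justin Bieber", "SF", 25, 37),
    ("2011-01-01T00:05:62Z", "Justin Bieber", "SF", 15, 19),
    ("2011-01-01T00:06:33Z", "Ke$ha", "LA", 30, 45),
    ("2011-01-01T00:08:51Z", "Ke$ha", "LA", 16, 8),
    ("2011-01-01T00:09:17Z", "Miley Cyrus", "DC", 75, 10),
    ("2011-01-01T00:11:25Z", "Miley Cyrus", "DC", 11, 25),
    ("2011-01-01T00:23:30Z", "Miley Cyrus", "DC", 22, 12),
    ("2011-01-01T00:49:33Z", "Miley Cyrus", "DC", 90, 41),
]

def rollup(rows):
    """Truncate timestamps to the hour and aggregate per (hour, page, city)."""
    agg = defaultdict(lambda: [0, 0, 0])  # [count, sum_added, sum_deleted]
    for ts, page, city, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"  # "2011-01-01T00:01:35Z" -> "2011-01-01T00:00:00Z"
        key = (hour, page, city)
        agg[key][0] += 1
        agg[key][1] += added
        agg[key][2] += deleted
    return dict(agg)

rolled = rollup(raw)
# rolled[("2011-01-01T00:00:00Z", "Miley Cyrus", "DC")] == [4, 198, 88]
```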
20. Summarize with data sketches
Raw data:
timestamp            | page          | userid | city | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | user11 | SF   | 10    | 5
2011-01-01T00:03:45Z | Justin Bieber | user22 | SF   | 25    | 37
2011-01-01T00:05:62Z | Justin Bieber | user11 | SF   | 15    | 19
2011-01-01T00:06:33Z | Ke$ha         | user33 | LA   | 30    | 45
2011-01-01T00:08:51Z | Ke$ha         | user33 | LA   | 16    | 8
2011-01-01T00:09:17Z | Miley Cyrus   | user11 | DC   | 75    | 10
2011-01-01T00:11:25Z | Miley Cyrus   | user44 | DC   | 11    | 25
2011-01-01T00:23:30Z | Miley Cyrus   | user44 | DC   | 22    | 12
2011-01-01T00:49:33Z | Miley Cyrus   | user55 | DC   | 90    | 41

Rolled up, with a sketch column in place of the raw userids:
timestamp            | page          | city | count | sum_added | sum_deleted | userid_sketch
2011-01-01T00:00:00Z | Justin Bieber | SF   | 3     | 50        | 61          | sketch_obj
2011-01-01T00:00:00Z | Ke$ha         | LA   | 2     | 46        | 53          | sketch_obj
2011-01-01T00:00:00Z | Miley Cyrus   | DC   | 4     | 198       | 88          | sketch_obj
21. Choose column types carefully
String column
● indexed (bitmap index: fast filtering)
● slower aggregation and grouping
Numeric column
● not indexed (filters must scan)
● fast aggregation
● fast grouping
22. Partitioning beyond time
● Druid always partitions by time
● Then decide which dimension to partition on next
● Partition on a dimension you often filter on
● Improves locality, compression, storage size, and query performance
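With parallel native batch ingestion (0.17 adds single-dimension range partitioning, per the release highlights later in this deck), secondary partitioning can be expressed roughly like this in the tuningConfig (a sketch; the dimension name and row target are illustrative):

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "channel",
    "targetRowsPerSegment": 5000000
  }
}
```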
25. Use Druid SQL
● Easier to learn / more familiar
● Attempts to make intelligent query type choices (timeseries vs topN vs groupBy)
● Some limitations, such as multi-value dimensions; not all aggregations are supported
26. Explain Plan
EXPLAIN PLAN FOR
SELECT channel, SUM(added)
FROM wikipedia
WHERE commentLength >= 50
GROUP BY channel
ORDER BY SUM(added) DESC
LIMIT 3
27. Pick your query carefully
● TimeBoundary - returns the min/max timestamp for a given interval
● Timeseries - when you don't need to group by dimension
● TopN - when you group by a single dimension
○ Approximate if > 1000 dimension values
● GroupBy - least performant / most flexible
● Scan - for returning streaming raw data
○ Perfect ordering not preserved
● Select - for returning paginated raw data
● Search - returns dimensions that match a text search
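For example, a SQL query that groups on one dimension with a LIMIT (like the EXPLAIN PLAN example earlier) typically plans to a native topN. A hand-written sketch of such a query; the interval is illustrative and the exact plan Druid generates may differ:

```json
{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "intervals": ["2016-06-27/2016-06-28"],
  "granularity": "all",
  "dimension": "channel",
  "metric": "sum_added",
  "threshold": 3,
  "filter": {
    "type": "bound",
    "dimension": "commentLength",
    "lower": "50",
    "ordering": "numeric"
  },
  "aggregations": [
    { "type": "longSum", "name": "sum_added", "fieldName": "added" }
  ]
}
```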
28. Using Lookups
● Use lookups for dimensions that change, to avoid re-indexing data
● Lookups are key/value pairs stored on every node
● Loaded from a file or via a JDBC connection to an external database
● Lookups are loaded into the Java heap, so large lookups need larger heaps
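A statically defined map lookup, as posted to the Coordinator's lookup config API, looks roughly like this (a sketch; the tier "__default", the lookup name, and the key/value pairs are illustrative):

```json
{
  "__default": {
    "country_name": {
      "version": "v1",
      "lookupExtractorFactory": {
        "type": "map",
        "map": {
          "US": "United States",
          "DE": "Germany"
        }
      }
    }
  }
}
```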
33. Druid 0.17.0 Highlights
● Native batch - binary inputs & more
○ Supports binary formats such as ORC, Parquet, and Avro
○ Native batch tasks can now read from HDFS
○ Single-dimension range partitioning for parallel native batch
● Compaction improvements
○ Parallel index task split hints and parallel auto-compaction
○ Stateful auto-compaction
● Parallel query merge on Brokers
○ The Broker can now optionally merge query results in parallel using multiple threads
34. Druid 0.17.0 Highlights
● ...and more!
○ Improved SQL-compatible null handling
○ New Dropwizard emitter which supports counter, gauge, meter, timer, and histogram metric types
○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in a new sys.supervisors system table
○ Fast Historical start with deferred loading of segments until query time
○ New readiness and self-discovery resources
○ Task assignment based on MiddleManager categories
○ Security updates
37. Druid 0.16.0 Highlights
● Native parallel batch shuffle
○ A two-phase shuffle system allows 'perfect rollup' and partitioning on dimensions
● Query vectorization, phase one
○ Speeds up queries by reducing the number of method calls
● Indexer process
○ An alternative to the MiddleManager + Peon task execution system that is easier to configure and deploy
● Improved web console
○ Kafka & Kinesis support!
○ Point-and-click reindexing
38. Druid 0.17.0
Our first release as a top-level Apache project!
Coming soon (really soon).
43. …and beyond!!
● SQL joins
○ A multi-phase project to add full SQL join support to Druid. Coming up first: sub-queries and lookups
● Windowed aggregations
○ For example, moving average and cumulative sum aggregations
● Dynamic query prioritization & laning
○ Mix 'heavy' and 'light' workloads in the same cluster without heavy workloads blocking light ones
● Extended query vectorization support
○ Richer support for query vectorization against more query types
46. Stay in touch
Join the community!
http://druid.io/community
Free training hosted by Imply!
https://imply.io/druid-days
Follow the Druid project on Twitter!
@druidio