Our journey with Druid - from initial research to full production scale

Danny Ruchman + Itai Yaffe
Nielsen

Introduction
Danny Ruchman
● Software Engineer and team manager
● Focused on big data processing solutions
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years

Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 3 years ago
● A Data company
● Machine learning models for insights
● Business decisions
● Targeting

Nielsen Marketing Cloud - questions we try to answer
● How many users of a certain profile can we reach?
○ E.g. for a campaign for fancy women's sneakers
● How many hits did a specific web page get in a given date range?

The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● In other words: find, in real time, the number of distinct elements in a data stream that may contain repeated elements

Possible solutions
● Naive: store everything
● Bit vector: store only 1 bit per device
○ 10B devices: 1.25 GB/day
○ 10B devices * 80K attributes: 100 TB/day
● Approximate
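
The bit-vector figures above follow from simple arithmetic. A quick check, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and one bit per (device, attribute) pair per day; the function name is only illustrative:

```python
DEVICES = 10_000_000_000   # 10B devices, from the slide
ATTRIBUTES = 80_000        # 80K attributes, from the slide
BITS_PER_BYTE = 8


def bit_vector_bytes(devices: int, attributes: int = 1) -> int:
    """Bytes needed per day when every (device, attribute) pair costs 1 bit."""
    return devices * attributes // BITS_PER_BYTE


single = bit_vector_bytes(DEVICES)
per_attribute = bit_vector_bytes(DEVICES, ATTRIBUTES)

print(f"{single / 1e9:.2f} GB/day")         # one bit vector for all devices
print(f"{per_attribute / 1e12:.0f} TB/day")  # one bit vector per attribute
```

Even at one bit per device, keeping a vector per attribute is what pushes the daily volume into the 100 TB range, which is why the approximate approach is attractive.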

Our journey
● Elasticsearch
○ Indexing data
■ Indexing 250 GB of daily data took 10 hours
■ Indexing affected query time
○ Querying
■ Low concurrency
■ Every query scans all the shards of the corresponding index

What we tried
● Preprocessing
● Statistical algorithms (e.g. HyperLogLog)

ThetaSketch
● K Minimum Values (KMV)
● Estimates set cardinality
● Supports set-theoretic operations
● ThetaSketch mathematical framework - generalization of KMV
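
To make the KMV idea concrete, here is a minimal sketch in plain Python. It is a teaching toy, not the DataSketches implementation: the class name, the SHA-1 based hash, and the sketch size K = 1024 are all assumptions made for this example. It keeps the K smallest hash values, estimates cardinality as (K - 1) / theta, and supports union and intersection.

```python
import hashlib

K = 1024  # sketch size; hypothetical value chosen for this example


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KmvSketch:
    """Keeps the k smallest distinct hash values seen so far."""

    def __init__(self, k: int = K, hashes=()):
        self.k = k
        self.hashes = sorted(set(hashes))[:k]

    @classmethod
    def of(cls, items, k: int = K) -> "KmvSketch":
        return cls(k, (hash01(i) for i in items))

    @property
    def theta(self) -> float:
        # With fewer than k values the sketch is exact, so the threshold is 1.0.
        return self.hashes[-1] if len(self.hashes) >= self.k else 1.0

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))
        return (self.k - 1) / self.theta

    def union(self, other: "KmvSketch") -> "KmvSketch":
        # The k smallest hashes of the union are contained in the two sketches.
        return KmvSketch(min(self.k, other.k), self.hashes + other.hashes)


def intersection_estimate(a: KmvSketch, b: KmvSketch) -> float:
    """Count common hashes below the smaller threshold and scale by it."""
    theta = min(a.theta, b.theta)
    common = set(a.hashes) & set(b.hashes)
    return sum(1 for h in common if h < theta) / theta


x = KmvSketch.of(f"device-{i}" for i in range(0, 60_000))
y = KmvSketch.of(f"device-{i}" for i in range(40_000, 100_000))
print(round(x.estimate()), round(x.union(y).estimate()),
      round(intersection_estimate(x, y)))
```

The true values here are 60,000 devices in X, 100,000 in the union and 20,000 in the intersection. The estimates land close to those, with the intersection being the noisiest because fewer retained hashes contribute to it.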

ThetaSketch error
Relative error (%) by K value
Number of Std Dev      1        2
Confidence Interval    68.27%   95.45%
K = 16,384             0.78%    1.56%
K = 32,768             0.55%    1.10%
K = 65,536             0.39%    0.78%
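
The error figures above are consistent with a relative standard error of roughly 1/sqrt(K) per standard deviation. That formula is inferred from the numbers rather than stated on the slide, so treat the following as a sanity check under that assumption:

```python
import math


def relative_error_pct(k: int, std_devs: int = 1) -> float:
    """Approximate relative error in percent, assuming RSE ~ 1/sqrt(k)."""
    return 100.0 * std_devs / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    one_sigma = relative_error_pct(k, 1)
    two_sigma = relative_error_pct(k, 2)
    print(f"K={k:>6,}  1 std dev: {one_sigma:.2f}%  2 std dev: {two_sigma:.2f}%")
```

Quadrupling K halves the error, so accuracy has to be traded against sketch size and therefore against storage and query cost.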

DRUID
"Very fast, highly scalable columnar data-store"

Why is it cool?
● Store trillions of events, petabytes of data
● Sub-second analytic queries
● Highly scalable
● Cost effective

Roll-up - Simple Count (LongSumAggregator)
Raw events:
Timestamp   Attribute  Device ID
2016-11-15  11111      3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15  11111      3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15  11111      5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15  22222      5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15  33333      5dd59f9bd068f802a7c6dd832bf60d02
After roll-up:
Timestamp   Attribute  Simple Count
2016-11-15  11111      3
2016-11-15  22222      1
2016-11-15  33333      1

Roll-up - Count Distinct (ThetaSketchAggregator)
Raw events:
Timestamp   Attribute  Device ID
2016-11-15  11111      3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15  11111      3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15  11111      5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15  22222      5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15  33333      5dd59f9bd068f802a7c6dd832bf60d02
After roll-up:
Timestamp   Attribute  Count Distinct
2016-11-15  11111      2
2016-11-15  22222      1
2016-11-15  33333      1
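
The two roll-ups above can be reproduced with a few lines of plain Python. This is only an illustration of the semantics: Druid itself would store a Theta sketch per rolled-up row rather than an exact set of device IDs, and the function name is made up for the example. The third attribute is written as 33333, matching the rolled-up tables.

```python
from collections import defaultdict

ROWS = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(rows):
    """Group by (timestamp, attribute); return (simple count, distinct count)."""
    counts = defaultdict(int)
    devices = defaultdict(set)  # exact set standing in for a Theta sketch
    for timestamp, attribute, device_id in rows:
        key = (timestamp, attribute)
        counts[key] += 1
        devices[key].add(device_id)
    return {key: (counts[key], len(devices[key])) for key in counts}


for (timestamp, attribute), (simple, distinct) in roll_up(ROWS).items():
    print(timestamp, attribute, simple, distinct)
```

Attribute 11111 shows the difference between the two aggregators: three events, but only two distinct devices.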

Query performance benchmark
[Chart: average response time vs. number of concurrent queries, Druid vs. Elasticsearch]

Guidelines and pitfalls
● Setup is not easy

● Monitoring your system - important metrics (incomplete list):
○ Broker query time
○ Historical query time
○ Historical query wait time
○ Pending segments
○ Broker query TTFB
○ ...

● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases
Datasource 1:
Timestamp   Attribute       Count Distinct
2016-11-15  US              XXXXXX
2016-11-15  Porsche Intent  XXXXXX
...         ...             ...
Datasource 2 (with a Region dimension):
Timestamp   Attribute       Region  Count Distinct
2016-11-15  Porsche Intent  US      XXXXXX
...         ...             ...     ...

● Query optimization
○ Combine multiple queries into a single query
○ Use filters
○ Use the groupBy v2 engine (default since 0.10.0)
○ Use timeseries rather than groupBy queries (where applicable)
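
As an illustration of the points above, here is a sketch that builds a single Druid native timeseries query answering several questions at once: one filtered Theta sketch aggregator per attribute, plus a post-aggregation estimating their intersection. The JSON shapes follow the native query format and the DataSketches extension as we understand them; the datasource name, the `attribute` dimension and the `device_sketch` metric are hypothetical.

```python
import json


def selector(dimension: str, value: str) -> dict:
    """Druid selector filter: dimension == value."""
    return {"type": "selector", "dimension": dimension, "value": value}


def unique_devices_query(datasource, interval, attributes,
                         sketch_field="device_sketch"):
    """One timeseries query with a sketch per attribute and their intersection."""
    aggregations = [
        {
            "type": "filtered",
            "filter": selector("attribute", attribute),
            "aggregator": {
                "type": "thetaSketch",
                "name": f"sketch_{i}",
                "fieldName": sketch_field,
            },
        }
        for i, attribute in enumerate(attributes)
    ]
    intersection = {
        "type": "thetaSketchEstimate",
        "name": "devices_in_all_attributes",
        "field": {
            "type": "thetaSketchSetOp",
            "name": "intersection",
            "func": "INTERSECT",
            "fields": [
                {"type": "fieldAccess", "fieldName": agg["aggregator"]["name"]}
                for agg in aggregations
            ],
        },
    }
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        # Top-level filter so only rows for the requested attributes are scanned.
        "filter": {"type": "in", "dimension": "attribute",
                   "values": list(attributes)},
        "aggregations": aggregations,
        "postAggregations": [intersection],
    }


query = unique_devices_query(
    "devices_by_attribute",          # hypothetical datasource name
    "2016-11-15/2016-11-16",
    ["US", "Porsche Intent"],
)
print(json.dumps(query, indent=2))
```

The resulting JSON would be posted to the Broker. Keeping the intersection inside one query avoids a round trip per attribute.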

● Batch Ingestion
○ EMR tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by a factor of 4
■ Reduced storage used by a factor of 10
○ Concurrent ingestion tasks (one per EMR cluster and datasource)
■ Set the worker select strategy to fillCapacityWithAffinity
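
For reference, a sketch of what the worker select strategy setting might look like as an Overlord dynamic configuration payload. The overall shape (`selectStrategy` with an `affinityConfig` mapping datasources to MiddleManager hosts) is our recollection of the Druid 0.10.x documentation and should be checked against the version in use; the datasource names and host names are hypothetical.

```python
import json


def worker_select_strategy(affinity: dict) -> dict:
    """Build a dynamic worker config that pins datasources to workers."""
    return {
        "selectStrategy": {
            "type": "fillCapacityWithAffinity",
            # Assumed shape: datasource name -> list of "host:port" workers.
            "affinityConfig": {"affinity": affinity},
        }
    }


config = worker_select_strategy({
    "datasource_a": ["middlemanager-1.example:8091"],  # hypothetical hosts
    "datasource_b": ["middlemanager-2.example:8091"],
})
print(json.dumps(config, indent=2))
```

With one ingestion task per EMR cluster and datasource, pinning each datasource to its own workers keeps concurrent tasks from competing for the same capacity.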

● Batch Ingestion (WIP)
○ Action - pre-aggregating the data in a Spark Streaming app
■ Aggregating the data by key
● groupBy().agg() for simple counts
● combineByKey() for distinct count (using the DataSketches packages); requires setting isInputThetaSketch=true on the ingestion task
■ Increased micro-batch interval from 30 minutes to 1 hour
○ Result:
■ ~2000x fewer output records, and the total size of output files is less than 1% of the previous version's
■ 10x fewer nodes in the EMR cluster running the MapReduce ingestion job
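
The pre-aggregation step can be pictured with a plain-Python imitation of the combineByKey contract (create a combiner, merge a value into it, merge two combiners). This is not Spark code and not the actual job: exact Python sets stand in for Theta sketches, and all names are made up for the example.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Imitate combineByKey: combine within each partition, then across them."""
    partials = []
    for partition in partitions:
        combined = {}
        for key, value in partition:
            if key in combined:
                combined[key] = merge_value(combined[key], value)
            else:
                combined[key] = create_combiner(value)
        partials.append(combined)

    result = {}
    for combined in partials:
        for key, combiner in combined.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result


# Two partitions of ((timestamp, attribute), device_id) events.
partitions = [
    [(("2016-11-15", "11111"), "dev-a"), (("2016-11-15", "11111"), "dev-a")],
    [(("2016-11-15", "11111"), "dev-b"), (("2016-11-15", "22222"), "dev-b")],
]

sketches = combine_by_key(
    partitions,
    create_combiner=lambda device: {device},           # start a new "sketch"
    merge_value=lambda sketch, device: sketch | {device},
    merge_combiners=lambda left, right: left | right,  # union of two "sketches"
)
for key, sketch in sketches.items():
    print(key, len(sketch))
```

Each key ends up as a single record carrying a mergeable structure instead of one record per event, which is where the large drop in output records comes from.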

● Community

Future work
● Improving accuracy for small set <-> big set intersections
● Improving query performance
○ groupBy V2
○ NVMe SSDs
○ Switching to timeseries query type where applicable
○ Apply less granular aggregation where applicable (e.g. 1 month rather than 1 day)
● Upgrading Druid to 0.11.0
● Exploring option of tiering of query processing nodes
○ Reporting vs interactive queries
○ Hot vs cold data
● Using SQL interface (experimental)
● Using Lookups (experimental)

What have we learned?
● Druid is a columnar, time series data-store
● Can store trillions of events and serve analytic queries in under a second
● Highly-scalable, cost-effective
● Widely used among Big Data companies
● Can be used for:
○ Distinct count (via ThetaSketch)
○ Simple counts
● Setup is not easy
● Ingestion has little effect on query performance (deep storage usage)
● Provides very good visibility
● Improve query performance by carefully designing your data model and building
your queries

QUESTIONS?
Join us - https://www.comeet.co/jobs/nielsen/33.000
Big Data Architect
Java & Machine Learning Developer
Junior Big Data Developer
And more...

THANK YOU!
https://www.linkedin.com/in/danny-ruchman-70211a27/
https://www.linkedin.com/in/itaiy/

Druid vs ES
                      DRUID          ES
Daily data            10 TB/day      250 GB/day
Indexing time         4 hours/day    10 hours/day
Storage               15 GB/day      2.5 TB (total)
Query response time   280ms-350ms    500ms-6000ms
Cost                  $55K/month     $80K/month
