Using Druid for interactive count-distinct queries at scale

At NMC (Nielsen Marketing Cloud) we need to present to our clients the number of unique users who meet given criteria. The condition is typically a set-theoretic expression over a stream of events for a given time range. Historically, we used Elasticsearch to answer these types of questions; however, we encountered major scaling issues. In this presentation we detail the journey of researching, benchmarking and productionizing a new technology, Druid, with DataSketches, to overcome the limitations we were facing.

Data & Analytics

USING DRUID
FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE

Introduction
Yakir Buskilla
● Software Architect
● Focusing on Big Data and Machine Learning problems
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years

Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 2 years ago
● A leader in the Ad Tech and Marketing Tech industry
● What do we do?
○ Data as a Service (DaaS)
○ Software as a Service (SaaS)

The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements

Possible solutions
● Naive: store everything
● Bit vector: store only 1 bit per device
○ 10B devices - 1.25 GB/day
○ 10B devices * 80K attributes - 100 TB/day
● Approximate
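The bit-vector figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and one bit per (device, attribute) pair per day; the function name is illustrative only.

```python
# Back-of-the-envelope storage cost of the bit-vector approach.
DEVICES = 10_000_000_000   # 10B devices (from the slide)
ATTRIBUTES = 80_000        # 80K attributes (from the slide)
BITS_PER_BYTE = 8


def bit_vector_bytes(devices: int, attributes: int = 1) -> float:
    """Bytes needed to store one bit per (device, attribute) pair."""
    return devices * attributes / BITS_PER_BYTE


one_attribute = bit_vector_bytes(DEVICES)
all_attributes = bit_vector_bytes(DEVICES, ATTRIBUTES)

print(f"{one_attribute / 1e9:.2f} GB/day")    # 1.25 GB/day
print(f"{all_attributes / 1e12:.0f} TB/day")  # 100 TB/day
```

A single bit vector is cheap, but one vector per attribute per day is what pushes the approach to 100 TB/day.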

Our journey
● Elasticsearch
○ Indexing data
■ Indexing 250 GB of daily data took 10 hours
■ Indexing affected query time
○ Querying
■ Low concurrency
■ Each query scans all the shards of the corresponding index

What we tried
● Preprocessing
● Statistical algorithms (e.g., HyperLogLog)

ThetaSketch
● K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
● ThetaSketch mathematical framework - generalization of KMV
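To make the KMV idea concrete, here is a minimal toy sketch in Python. It is not the DataSketches implementation: the class name, the SHA-1 based hash, and the device ID strings are all assumptions chosen for illustration. Each item is hashed to a number in [0, 1), only the K smallest hashes are kept, and the cardinality is estimated from the K-th smallest value. Intersection follows the theta idea of counting common hashes below a shared threshold.

```python
import hashlib


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-1."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Toy K Minimum Values sketch: keeps the k smallest distinct hashes."""

    def __init__(self, k: int, values=()):
        self.k = k
        self.values = sorted(set(values))[:k]

    @classmethod
    def from_items(cls, items, k: int) -> "KMVSketch":
        return cls(k, (_hash01(x) for x in items))

    def theta(self) -> float:
        """Sampling threshold: the k-th smallest hash, or 1.0 if not full."""
        return self.values[-1] if len(self.values) == self.k else 1.0

    def estimate(self) -> float:
        if len(self.values) < self.k:
            return float(len(self.values))     # fewer than k hashes: exact
        return (self.k - 1) / self.values[-1]  # classic KMV estimator

    def union(self, other: "KMVSketch") -> "KMVSketch":
        return KMVSketch(min(self.k, other.k), self.values + other.values)


def intersection_estimate(a: KMVSketch, b: KMVSketch) -> float:
    """Estimate |A and B| from hashes present in both sketches below theta."""
    theta = min(a.theta(), b.theta())
    common = set(a.values) & set(b.values)
    return sum(1 for v in common if v < theta) / theta


# Hypothetical device IDs: 20,000 per set, 10,000 shared.
a = KMVSketch.from_items((f"dev-{i}" for i in range(20_000)), k=1024)
b = KMVSketch.from_items((f"dev-{i}" for i in range(10_000, 30_000)), k=1024)
print("union        ~", round(a.union(b).estimate()))
print("intersection ~", round(intersection_estimate(a, b)))
```

The estimates are approximate by design; a larger K tightens them at the cost of more memory per sketch.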

ThetaSketch error
K        1 std dev (68.27% confidence)   2 std dev (95.45% confidence)
16,384   0.78%                           1.56%
32,768   0.55%                           1.10%
65,536   0.39%                           0.78%
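The error figures above are consistent with a relative standard error of roughly 1/sqrt(K). That formula is an assumption inferred from the numbers on the slide rather than something the deck states, and the exact bounds of the DataSketches library may differ slightly. A short check:

```python
import math


def relative_error_pct(k: int, num_std_dev: int = 1) -> float:
    """Approximate relative error, in percent, for a sketch of size k.

    Assumes the error scales as num_std_dev / sqrt(k).
    """
    return 100.0 * num_std_dev / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    one_sigma = relative_error_pct(k, 1)
    two_sigma = relative_error_pct(k, 2)
    print(f"{k:>6,}  {one_sigma:.2f}%  {two_sigma:.2f}%")
```

Quadrupling K (16,384 to 65,536) halves the error, which is the trade-off between accuracy and memory or storage.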

DRUID
“Very fast highly scalable columnar data-store”

Roll-up
Raw events:
Timestamp    Attribute   Device ID
2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   22222       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   11111       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   22222       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   33333       5dd59f9bd068f802a7c6dd832bf60d02
After roll-up with ThetaSketchAggregator:
Timestamp    Attribute   Count Distinct
2016-11-15   11111       2
2016-11-15   22222       2
2016-11-15   33333       1
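The roll-up on this slide can be imitated in plain Python. This is only an illustration of the grouping logic: a Python set stands in for the theta sketch that Druid would store per row, and the third attribute is written as 33333, as in the rolled-up table.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, attribute, device_id).
events = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(rows):
    """Group rows by (timestamp, attribute) and collect distinct device IDs.

    The set is a stand-in for a theta sketch, which would hold a bounded
    sample of hashes instead of every ID.
    """
    groups = defaultdict(set)
    for timestamp, attribute, device_id in rows:
        groups[(timestamp, attribute)].add(device_id)
    return groups


rolled = roll_up(events)
for (timestamp, attribute), devices in sorted(rolled.items()):
    print(timestamp, attribute, len(devices))
```

Five raw rows collapse into three rolled-up rows, which is why the stored data ends up much smaller than the ingested data.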

Guidelines and pitfalls
● Setup is not easy

Guidelines and pitfalls
● Monitoring your system

Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases
Old data model:
Timestamp    Attribute        Count Distinct
2016-11-15   US               XXXXXX
2016-11-15   Porsche Intent   XXXXXX
...          ...              ...
New data model:
Timestamp    Attribute        Region   Count Distinct
2016-11-15   Porsche Intent   US       XXXXXX
...          ...              ...      ...

Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into a single query
○ Use filters
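As an illustration of both points, the sketch below builds one Druid native timeseries query that computes two per-attribute sketches and their intersection in a single request, with a top-level filter so only relevant rows are scanned. The datasource name `device_attributes`, the dimension `attribute`, the metric `device_sketch`, and the output names are hypothetical. The JSON shapes follow the DataSketches theta sketch extension as I understand it, so check them against the documentation for your Druid version.

```python
import json


def filtered_sketch(name: str, dimension: str, value: str,
                    metric: str = "device_sketch") -> dict:
    """A thetaSketch aggregator restricted to rows where dimension == value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        "aggregator": {"type": "thetaSketch", "name": name, "fieldName": metric},
    }


query = {
    "queryType": "timeseries",
    "dataSource": "device_attributes",  # hypothetical datasource name
    "granularity": "all",
    "intervals": ["2016-11-15/2016-11-16"],
    # Filter early: only rows for the attributes of interest are scanned.
    "filter": {"type": "in", "dimension": "attribute",
               "values": ["11111", "22222"]},
    # One query, two sketches, instead of two separate queries.
    "aggregations": [
        filtered_sketch("attr_11111", "attribute", "11111"),
        filtered_sketch("attr_22222", "attribute", "22222"),
    ],
    "postAggregations": [
        {
            "type": "thetaSketchEstimate",
            "name": "devices_in_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "both_sketch",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "attr_11111"},
                    {"type": "fieldAccess", "fieldName": "attr_22222"},
                ],
            },
        }
    ],
}

print(json.dumps(query, indent=2))
```

The printed JSON would be sent to a Druid broker over the REST API; the code only builds the request body and does not contact a cluster.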

Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4x
■ Reduced storage used by 10x

Summary
                         ES              DRUID
Daily data volume        250GB/day       10TB/day
Indexing time            10 Hours/day    4 Hours/day
Storage                  2.5TB (total)   15GB/day
Query latency            500ms-6000ms    280ms-350ms
Cost (entire solution)   $80K/month      $55K/month

THANK YOU!
https://www.linkedin.com/in/itaiy/
https://www.linkedin.com/in/yakirbuskilla/

Editor's Notes

Introduce ourselves and NMC.
DaaS: a marketplace for device-level data, connecting buyers and sellers. SaaS: the Nielsen Marketing Cloud platform, which helps brands connect with their customers using our big data sets and our analytics tools.
Our serving layer (front end) aggregates data from various online and offline sources. We aggregate around 10B events per day.
Past… Mention “cardinality” and “real-time dashboard”. Explain the need to union and intersect.
Bit vector: Elasticsearch/Redis is an example of such a system.
We tried introducing a new cluster dedicated to indexing only, then using backup and restore to move the data to the second cluster. This method was very expensive and only partially helpful. Tuning for better performance also didn’t help much.
Preprocessing: too many combinations; the formula length is not bounded (show some numbers). HyperLogLog: the implementation in Elasticsearch was too slow (done at query time); set operations increase the error dramatically.
Unions and intersections increase the error. The problematic case is the intersection of a very small set with a very big set.
The larger the K, the smaller the error. However, a larger K means more memory and storage are needed.
So we talked about statistical algorithms, which is nice, but we needed a practical solution… Druid supports the ThetaSketch algorithm out of the box (OOTB).
Timeseries database: the first thing you need to know about Druid. Column types: timestamp, dimensions, metrics. Together they comprise a datasource. There are different types of roll-ups (sum, count, etc.). Aggregation is done at ingestion time (the outcome is much smaller in size). At query time, it’s closer to a key-value search.
We have 3 types of processes: ingestion, querying, management. All processes are decoupled and scalable. Ingestion (real time, e.g. from Kafka; batch: talk about deep storage and how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion). Lambda architecture.
Explain the tuple and what is happening during the aggregation.
Setup is not easy. Separate config/servers/tuning. This caused the deployment to take a few months. Use the Druid recommendation for production configuration.
Monitoring your system: Druid has built-in support for Graphite (exports many metrics).
Data modeling: if using ThetaSketch, reduce the number of intersections (show a slide of the old and new data model). It didn’t solve all use cases, but it gives you an idea of how you can approach the problem. Different datasources: e.g. lower accuracy for faster queries vs. higher accuracy with slightly slower queries.
Combine multiple queries over the REST API. There can be billions of rows, so filter the data as part of the query (as early as possible).
EMR tuning (spot instances (80% cost reduction), Druid MR prod config). Use Parquet.
Ingestion doesn’t affect queries, and we get sub-second responses for even 100s or 1000s of concurrent queries. Cost is for the entire solution (Druid cluster, EMR, etc.). With Druid and ThetaSketch, we’ve improved our ingestion volume, query performance, and concurrency by an order of magnitude at a lower cost compared to our old solution (we’ve achieved a more performant, scalable, cost-effective solution).
