Using Druid for Interactive Count-Distinct Queries at Scale

Ido Shilon

Introduction
Yakir Buskilla
● Software Architect
● Focusing on Big Data and Machine Learning problems
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years

Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 2 years ago
● A leader in the Ad Tech and Marketing Tech industry
● What do we do?
○ Data as a Service (DaaS)
○ Software as a Service (SaaS)

The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements

Possible solutions
● Naive: store everything
● Bit vector: store only 1 bit per device
○ 10B devices: 1.25 GB/day
○ 10B devices * 80K attributes: 100 TB/day
● Approximate
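
The bit-vector figures on this slide follow from simple arithmetic. The snippet below is a minimal check, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and one bit per device per attribute per day; the function name is illustrative only.

```python
# Back-of-the-envelope check of the bit-vector storage figures.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).
DEVICES = 10_000_000_000   # 10B devices
ATTRIBUTES = 80_000        # 80K attributes
BITS_PER_BYTE = 8


def bit_vector_bytes(devices: int, attributes: int = 1) -> float:
    """Bytes needed to keep one bit per (device, attribute) pair."""
    return devices * attributes / BITS_PER_BYTE


single_vector_gb = bit_vector_bytes(DEVICES) / 10**9
all_attributes_tb = bit_vector_bytes(DEVICES, ATTRIBUTES) / 10**12

print(f"one bit vector:      {single_vector_gb:.2f} GB/day")
print(f"one per attribute:   {all_attributes_tb:.0f} TB/day")
```

A single vector is cheap; the cost comes from keeping one vector per attribute, which is what makes the approach impractical at 80K attributes.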

Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data took 10 hours to index
■ Indexing affected query time
○ Querying
■ Low concurrency
■ Scans on all the shards of the corresponding index

What we tried
● Preprocessing
● Statistical algorithms (e.g. HyperLogLog)

ThetaSketch
● K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
● ThetaSketch mathematical framework - generalization of KMV
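
To make the KMV idea concrete, here is a minimal, illustrative sketch in Python. It is not the ThetaSketch implementation used by Druid. It assumes a SHA-256 based hash mapped to [0, 1) and the common estimator (k - 1) / (k-th smallest hash); the class and function names are hypothetical.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps the k smallest distinct hash values seen so far."""

    def __init__(self, k: int):
        self.k = k
        self._heap = []        # max-heap of kept hashes (stored negated)
        self._members = set()  # same hashes, for duplicate detection

    def _add_hash(self, h: float) -> None:
        if h in self._members:
            return  # repeated element: the sketch is unchanged
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def add(self, item: str) -> None:
        self._add_hash(hash01(item))

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        return (self.k - 1) / -self._heap[0]

    def union(self, other: "KMVSketch") -> "KMVSketch":
        """Set union: keep the k smallest hashes of both sketches combined."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged


sketch = KMVSketch(k=1024)
for i in range(20_000):
    sketch.add(f"device-{i}")
    sketch.add(f"device-{i}")  # duplicates do not change the estimate
print(f"estimated distinct devices: {sketch.estimate():.0f}")
```

The sketch stores at most k values regardless of how many devices it sees, and two sketches can be merged without access to the raw data, which is what makes the approach suitable for pre-aggregation.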

ThetaSketch error
Number of Std Dev      1         2
Confidence Interval    68.27%    95.45%
K = 16,384             0.78%     1.56%
K = 32,768             0.55%     1.10%
K = 65,536             0.39%     0.78%
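
The tabulated errors are consistent with a relative standard error of roughly 1/sqrt(K). That formula is an observation about the numbers above, not something stated on the slide, so treat the snippet below as a sanity check under that assumption.

```python
import math


def relative_standard_error(k: int) -> float:
    """Approximate relative standard error of a sketch of size k (assumed 1/sqrt(k))."""
    return 1.0 / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    rse = relative_standard_error(k)
    # One standard deviation, then two standard deviations.
    print(f"K = {k:>6,}   {rse:.2%}   {2 * rse:.2%}")
```

Doubling K shrinks the error by a factor of about 1.41 while doubling the memory and storage each sketch needs, which is the trade-off behind choosing a sketch size.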

“Very fast highly scalable columnar data-store”
DRUID

Roll-up (ThetaSketchAggregator)

Raw events:
Timestamp     Attribute    Device ID
2016-11-15    11111        3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15    22222        3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15    11111        5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15    22222        5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15    33333        5dd59f9bd068f802a7c6dd832bf60d02

After roll-up:
Timestamp     Attribute    Count Distinct
2016-11-15    11111        2
2016-11-15    22222        2
2016-11-15    33333        1
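
The roll-up on this slide can be reproduced in a few lines. The sketch below uses exact Python sets as a stand-in for the ThetaSketch aggregator (an assumption made for readability; Druid stores a sketch per row, not the raw IDs), and writes the last attribute as 33333 to match the rolled-up output.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, attribute, device ID).
raw_events = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(events):
    """Group events by (timestamp, attribute) and collect distinct device IDs.

    A set stands in for the sketch that would be built at ingestion time.
    """
    groups = defaultdict(set)
    for timestamp, attribute, device_id in events:
        groups[(timestamp, attribute)].add(device_id)
    return dict(groups)


rolled = roll_up(raw_events)
for (timestamp, attribute), devices in sorted(rolled.items()):
    print(timestamp, attribute, len(devices))
```

Five raw rows collapse into three stored rows; at production volumes this reduction is what keeps the stored data small and the queries close to key lookups.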

Guidelines and pitfalls
● Setup is not easy

Guidelines and pitfalls
● Monitoring your system

Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases

Old data model:
Timestamp     Attribute         Count Distinct
2016-11-15    US                XXXXXX
2016-11-15    Porsche Intent    XXXXXX
...           ...               ...

New data model:
Timestamp     Attribute         Region    Count Distinct
2016-11-15    Porsche Intent    US        XXXXXX
...           ...               ...       ...
Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into single query
○ Use filters
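
As an illustration of both points, the sketch below builds one Druid native query that filters rows up front and computes two sketches plus their intersection and union in a single request. The JSON field names follow the Druid native query and DataSketches extension formats as I understand them, so check them against the documentation for your Druid version; the datasource `devices`, the dimension `attribute`, and the metric `device_sketch` are hypothetical names.

```python
import json


def filtered_sketch(name: str, attribute_value: str) -> dict:
    """A thetaSketch aggregator restricted to rows with one attribute value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": "attribute", "value": attribute_value},
        "aggregator": {"type": "thetaSketch", "name": name, "fieldName": "device_sketch"},
    }


def set_op_estimate(name: str, func: str) -> dict:
    """Estimate the cardinality of a set operation over the two sketches."""
    return {
        "type": "thetaSketchEstimate",
        "name": name,
        "field": {
            "type": "thetaSketchSetOp",
            "name": f"{name}_sketch",
            "func": func,
            "fields": [
                {"type": "fieldAccess", "fieldName": "a_sketch"},
                {"type": "fieldAccess", "fieldName": "b_sketch"},
            ],
        },
    }


def build_query(attr_a: str, attr_b: str, interval: str) -> dict:
    return {
        "queryType": "timeseries",
        "dataSource": "devices",  # hypothetical datasource name
        "granularity": "all",
        "intervals": [interval],
        # Top-level filter: only scan rows for the two attributes of interest.
        "filter": {"type": "in", "dimension": "attribute", "values": [attr_a, attr_b]},
        "aggregations": [
            filtered_sketch("a_sketch", attr_a),
            filtered_sketch("b_sketch", attr_b),
        ],
        "postAggregations": [
            set_op_estimate("a_and_b", "INTERSECT"),
            set_op_estimate("a_or_b", "UNION"),
        ],
    }


query = build_query("11111", "22222", "2016-11-15/2016-11-16")
print(json.dumps(query, indent=2))
```

The resulting JSON would be sent as the body of a single HTTP POST to the broker, replacing what would otherwise be separate queries for each attribute and each set operation.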

Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4x
■ Reduced storage used by 10x

Summary
                       DRUID             ES
Daily data volume      10 TB/day         250 GB/day
Ingestion time         4 hours/day       10 hours/day
Stored data            15 GB/day         2.5 TB (total)
Query latency          280-350 ms        500-6,000 ms
Cost                   $55K/month        $80K/month

Using Druid for Interactive Count-Distinct Queries at Scale

1. Yakir Buskilla + Itai Yaffe Nielsen USING DRUID FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE

4. NMC high-level architecture

6. The need

7. The need

12. KMV intuition

16. Druid architecture

17. How do we use Druid

23. Guidelines and pitfalls ● Community

25. THANK YOU!

Editor's Notes

Intro of us + NMC
DaaS: a marketplace for device-level data, connecting buyers and sellers. SaaS: the Nielsen Marketing Cloud platform, which helps brands connect with their customers using our big data sets and our analytics tools.
Our serving layer (front end) aggregates data from various online and offline sources. We aggregate around 10B events per day.
Past… Mention “cardinality” and “real-time dashboard”. Explain the need to union and intersect.
Bit vector: Elasticsearch and Redis are examples of such a system.
We tried introducing a new cluster dedicated to indexing only, then using backup and restore to move the data to the second cluster. This method was very expensive and only partially helpful. Tuning for better performance also didn't help much.
Preprocessing: too many combinations; the formula length is not bounded (show some numbers). HyperLogLog: the implementation in Elasticsearch was too slow (done at query time), and set operations increase the error dramatically.
Unions and intersections increase the error. The problematic case is the intersection of a very small set with a very big set.
The larger the K, the smaller the error. However, a larger K means more memory and storage are needed.
So we talked about statistical algorithms, which is nice, but we needed a practical solution… Druid supports the ThetaSketch algorithm out of the box (OOTB).
Time-series database: the first thing you need to know about Druid. Column types: timestamp, dimensions, metrics. Together they comprise a datasource. Aggregation is done at ingestion time (the outcome is much smaller in size). At query time, it's closer to a key-value search.
We have 3 types of processes: ingestion, querying, and management. All processes are decoupled and scalable. Ingestion (real time, e.g. from Kafka; batch: talk about deep storage and how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion). Lambda architecture.
Explain the tuple and what is happening during the aggregation.
Setup is not easy: separate config/servers/tuning. This caused the deployment to take a few months. Use the Druid recommendations for production configuration.
Monitoring your system: Druid has built-in support for Graphite (exports many metrics).
Data modeling: if using ThetaSketch, reduce the number of intersections (show a slide of the old and new data model). It didn't solve all use cases, but it gives you an idea of how you can approach the problem. Different datasources: e.g. lower accuracy for faster queries vs. higher accuracy with slightly slower queries.
Combine multiple queries over the REST API. There can be billions of rows, so filter the data as part of the query.
EMR tuning (spot instances (80% cost reduction), Druid MR prod config). Use Parquet. The picture here - maybe money??
Ingestion doesn't affect queries, and responses are sub-second even for 100s or 1000s of concurrent queries. Cost is for the entire solution (Druid cluster, EMR, etc.). With Druid and ThetaSketch, we've improved our ingestion volume, query performance, and concurrency by an order of magnitude at a lower cost compared to our old solution (we've achieved a more performant, scalable, cost-effective solution).
