Druid - DevconTLV X


The document discusses estimating set cardinalities with Druid, an open-source columnar data store. It describes how Druid uses Theta Sketches to estimate cardinalities and to support set-theoretic operations, which makes it possible to calculate unique counts over attribute expressions and date ranges efficiently. It then compares Druid with Elasticsearch in a benchmark, reporting query latencies of 280-350 ms for Druid versus 500-6000 ms for Elasticsearch, at a monthly cost of $40K versus $100K (2.5x less).


Nov 2016
DRUID: Estimating Set Cardinalities

DRUID
“Very fast highly scalable columnar data-store”

Roll-Up
Raw events:
Event Time   Id                                 Attribute   Daily Unique   Monthly Unique
2016-11-15   3a4c1f2d84a5c179435c1fea86e6ae02   11111       1              1
2016-11-15   3a4c1f2d84a5c179435c1fea86e6ae02   22222       1              1
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02   11111       0              0
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02   22222       1              0
2016-11-15   5dd59f9bd068f802a7c6dd832bf60d02   33333       1              0
Rolled up with SumAggregator:
Event Time   Attribute   Daily Unique Count   Monthly Unique Count
2016-11-15   111111      1                    1
2016-11-15   222222      2                    1
2016-11-15   333333      1                    0
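
The roll-up above can be sketched in a few lines of Python. This is only an illustration of summing the per-event flags by (event time, attribute), not Druid's implementation; the function name roll_up and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict

# Raw events copied from the slide:
# (event_time, device_id, attribute, daily_unique, monthly_unique)
RAW_EVENTS = [
    ("2016-11-15", "3a4c1f2d84a5c179435c1fea86e6ae02", "11111", 1, 1),
    ("2016-11-15", "3a4c1f2d84a5c179435c1fea86e6ae02", "22222", 1, 1),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "11111", 0, 0),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "22222", 1, 0),
    ("2016-11-15", "5dd59f9bd068f802a7c6dd832bf60d02", "33333", 1, 0),
]


def roll_up(events):
    """Drop the device id and sum both flags per (event_time, attribute)."""
    totals = defaultdict(lambda: [0, 0])
    for event_time, _device_id, attribute, daily, monthly in events:
        counts = totals[(event_time, attribute)]
        counts[0] += daily
        counts[1] += monthly
    return {key: tuple(counts) for key, counts in totals.items()}


if __name__ == "__main__":
    for (event_time, attribute), (daily, monthly) in sorted(roll_up(RAW_EVENTS).items()):
        print(event_time, attribute, daily, monthly)
```

Five raw rows collapse into three rolled-up rows, one per attribute, which is where the storage saving comes from.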

Count-Distinct problem
❏ Find the number of distinct elements in a data stream with repeated elements
❏ eXelate business question
  ❏ How many unique devices has eXelate encountered:
    ❏ for a given set-theoretic expression of attributes (segments, labels, regions, etc.)
    ❏ over a given date range

Count-Distinct Approaches
• Store everything
• Store only 1 bit per device
  • 10B devices - 1.25 GB/day
  • 10B devices * 80K attributes - 100 TB/day
• Approximate
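
The storage figures for the 1-bit-per-device approach follow from simple arithmetic. The sketch below reproduces them, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes); the helper name bitmap_bytes is made up for the example.

```python
BITS_PER_BYTE = 8
GB = 10**9   # decimal gigabyte (assumption)
TB = 10**12  # decimal terabyte (assumption)


def bitmap_bytes(devices: int, attributes: int = 1) -> float:
    """Bytes needed to store one bit per (device, attribute) pair."""
    return devices * attributes / BITS_PER_BYTE


DEVICES = 10 * 10**9   # 10B devices
ATTRIBUTES = 80_000    # 80K attributes

per_day_single_gb = bitmap_bytes(DEVICES) / GB            # one bit per device
per_day_all_tb = bitmap_bytes(DEVICES, ATTRIBUTES) / TB   # one bit per device per attribute

if __name__ == "__main__":
    print(f"{per_day_single_gb} GB/day, {per_day_all_tb} TB/day")
```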

ThetaSketch
• K Minimum Values (KMV)
• Estimate set cardinality
• Supports set-theoretic operations
• ThetaSketch mathematical framework - generalization of KMV
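
To make the KMV idea concrete, here is a minimal sketch in Python. It is an illustrative toy, not the DataSketches library or Druid's implementation: it hashes every item to the unit interval, retains the k smallest hash values below a threshold theta, and estimates the cardinality as (retained count) / theta. All function names are assumptions made for this example, and a real sketch would be built in a streaming fashion rather than by sorting every hash.

```python
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in the unit interval."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch(items, k):
    """Return (theta, retained): at most k hash values, all below theta."""
    hashes = sorted({hash01(x) for x in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)        # small set: nothing discarded, count is exact
    return hashes[k], set(hashes[:k])  # theta is the (k+1)-th smallest hash


def estimate(sk):
    """Estimated cardinality: retained count scaled up by 1 / theta."""
    theta, retained = sk
    return len(retained) / theta


def union(a, b, k):
    """Merge two sketches, then trim back to the k smallest values."""
    theta = min(a[0], b[0])
    merged = sorted(v for v in a[1] | b[1] if v < theta)
    if len(merged) > k:
        theta, merged = merged[k], merged[:k]
    return theta, set(merged)


def intersection(a, b):
    """Keep only hash values present in both sketches, below the smaller theta."""
    theta = min(a[0], b[0])
    return theta, {v for v in a[1] & b[1] if v < theta}


if __name__ == "__main__":
    x = sketch((f"device-{i}" for i in range(0, 60_000)), k=4096)
    y = sketch((f"device-{i}" for i in range(40_000, 100_000)), k=4096)
    print(round(estimate(union(x, y, k=4096))))   # true value: 100000
    print(round(estimate(intersection(x, y))))    # true value: 20000
```

Because union and intersection both return a sketch of the same shape, they can be nested to evaluate a larger set-theoretic expression before taking the final estimate.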

ThetaSketch Error
Sketch size (k)   Relative error at 1 std dev   Relative error at 2 std dev
                  (68.27% confidence)           (95.45% confidence)
128               8.87%                         17.75%
8,192             1.10%                         2.21%
16,384            0.78%                         1.56%
32,768            0.55%                         1.10%
65,536            0.39%                         0.78%
131,072           0.28%                         0.55%
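
The error figures above are consistent with a relative standard error of 1 / sqrt(k - 1), where k is the sketch size. The slide does not state this formula; it is an observation from the numbers, and the function name relative_error is an assumption for this example.

```python
import math

SKETCH_SIZES = [128, 8_192, 16_384, 32_768, 65_536, 131_072]


def relative_error(k: int, std_devs: int = 1) -> float:
    """Relative error bound at the given number of standard deviations."""
    return std_devs / math.sqrt(k - 1)


if __name__ == "__main__":
    for k in SKETCH_SIZES:
        one = relative_error(k, 1) * 100
        two = relative_error(k, 2) * 100
        print(f"{k:>7,}  {one:.2f}%  {two:.2f}%")
```

Quadrupling the sketch size roughly halves the error, which is the trade-off to weigh when choosing k.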

Solution using Elasticsearch
• Document structure
  {
    "id": "3a4c1f2d84a5c179435c1fea86e6ae02",
    "events": [
      {
        "date": "15-11-2016",
        "attributes": [ "111111", "222222" ]
      },
      {
        "date": "16-11-2016",
        "attributes": [ "222222", "333333" ]
      }
    ]
  }
• Exploit the Elasticsearch inverted index
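
As a rough illustration of why an inverted index helps here, the toy below builds an in-memory index from (date, attribute) to the set of device ids and answers a set-theoretic question with ordinary set operations. It is plain Python, not the Elasticsearch API; the second document and the helper names are hypothetical.

```python
from collections import defaultdict

DOCS = [
    {   # document from the slide
        "id": "3a4c1f2d84a5c179435c1fea86e6ae02",
        "events": [
            {"date": "15-11-2016", "attributes": ["111111", "222222"]},
            {"date": "16-11-2016", "attributes": ["222222", "333333"]},
        ],
    },
    {   # hypothetical second device, added only for the example
        "id": "5dd59f9bd068f802a7c6dd832bf60d02",
        "events": [{"date": "15-11-2016", "attributes": ["222222"]}],
    },
]


def build_index(docs):
    """Map each (date, attribute) pair to the set of device ids that carry it."""
    index = defaultdict(set)
    for doc in docs:
        for event in doc["events"]:
            for attribute in event["attributes"]:
                index[(event["date"], attribute)].add(doc["id"])
    return index


def devices(index, attribute, dates):
    """Devices seen with the attribute on any of the given dates."""
    found = set()
    for date in dates:
        found |= index.get((date, attribute), set())
    return found


if __name__ == "__main__":
    index = build_index(DOCS)
    dates = ["15-11-2016", "16-11-2016"]
    with_222222 = devices(index, "222222", dates)
    with_333333 = devices(index, "333333", dates)
    print(len(with_222222 & with_333333))  # devices with both attributes
    print(len(with_222222 - with_333333))  # devices with 222222 but not 333333
```

The exact sets give exact answers, but every query has to touch full posting sets, which hints at the cost and latency issues listed on the next slide.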

Elasticsearch Issues
• Indexing data
  • 250 GB of daily data takes 10 hours to index
  • Indexing affects query time
  • Large index - 2.5 TB
• Querying
  • Low concurrency
  • Each query spans all the machines in the cluster
• Cost
  • $100K monthly

What We Tried
• Pre-processing
  • Too many combinations
• HyperLogLog
  • No good support for set-theoretic operations
  • Calculated during query time

Druid Solution
• Input rows: (timestamp, device_id, attribute)
• Aggregator: ThetaSketchAggregator
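
A query against such a data source might look roughly like the following. The sketch builds a Druid native query as a Python dict, using the thetaSketch aggregator and the thetaSketchSetOp / thetaSketchEstimate post-aggregators from Druid's DataSketches extension as I understand them; the exact fields should be checked against the Druid version in use. The data source name, the sketch column name, and the function name are hypothetical and do not come from the slides.

```python
import json


def unique_devices_query(data_source, sketch_column, attribute_a, attribute_b, interval):
    """Build a native query estimating the devices that have both attributes."""

    def filtered_sketch(name, attribute):
        # One theta sketch per attribute, restricted by a selector filter.
        return {
            "type": "filtered",
            "filter": {"type": "selector", "dimension": "attribute", "value": attribute},
            "aggregator": {"type": "thetaSketch", "name": name, "fieldName": sketch_column},
        }

    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            filtered_sketch("a_devices", attribute_a),
            filtered_sketch("b_devices", attribute_b),
        ],
        "postAggregations": [
            {
                "type": "thetaSketchEstimate",
                "name": "devices_in_a_and_b",
                "field": {
                    "type": "thetaSketchSetOp",
                    "name": "a_and_b",
                    "func": "INTERSECT",
                    "fields": [
                        {"type": "fieldAccess", "fieldName": "a_devices"},
                        {"type": "fieldAccess", "fieldName": "b_devices"},
                    ],
                },
            }
        ],
    }


if __name__ == "__main__":
    query = unique_devices_query(
        data_source="device_events",      # hypothetical data source name
        sketch_column="device_id_sketch",  # hypothetical sketch column name
        attribute_a="111111",
        attribute_b="222222",
        interval="2016-11-01/2016-12-01",
    )
    print(json.dumps(query, indent=2))
```

Swapping "INTERSECT" for "UNION" or "NOT" would express the other set-theoretic operations over the same per-attribute sketches.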

Benchmark
• Druid Cluster: 1x Broker (r3.8xlarge), 8x Historical (r3.8xlarge)
• Elasticsearch Cluster: 20 nodes (r3.8xlarge)

Druid vs. ES
                 DRUID          ES
Data             10TB           250GB
Indexing time    4 Hours        10 Hours
Index size       160GB          2.5TB
Query latency    280ms-350ms    500ms-6000ms
Cost             $40K/mo        $100K/mo

We Are Hiring
❏ Web application team leader
❏ Frontend developer
❏ Java developer & machine learning
❏ Senior Java developer
❏ IT Production Engineer
❏ Node.js Developer
http://exelate.com/about-us/careers


