The document discusses estimating set cardinalities using Druid, an open-source data store. It describes how Druid uses Theta Sketches to estimate cardinalities and support set operations, which lets it efficiently calculate unique counts over attributes and time ranges. The document compares Druid's performance to Elasticsearch on a 10 TB dataset, finding that Druid can process the data 4x faster while using fewer resources and costing 2.5x less per month.
6. Count-Distinct problem
❏ Find the number of distinct elements in a data stream with repeated elements
❏ eXelate business question
❏ How many unique devices has eXelate encountered:
❏ for a given set-theoretic expression of attributes (segments, labels, regions, etc.)
❏ over a given date range
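To make the business question concrete, here is a minimal exact baseline using plain Python sets. The device IDs, attribute names, dates, and the `devices` helper are all hypothetical and only illustrate what a "set-theoretic expression over a date range" means. This is the naive approach, not how eXelate or Druid computes the answer.

```python
from datetime import date

# Hypothetical observations: (day, attribute) -> device IDs seen that day.
observations = {
    (date(2016, 1, 1), "sports"): {"d1", "d2", "d3"},
    (date(2016, 1, 1), "news"): {"d2", "d4"},
    (date(2016, 1, 2), "sports"): {"d3", "d5"},
    (date(2016, 1, 2), "news"): {"d5", "d6"},
}


def devices(attribute, start, end):
    """Union of device IDs seen with `attribute` on any day in [start, end]."""
    result = set()
    for (day, attr), ids in observations.items():
        if attr == attribute and start <= day <= end:
            result |= ids
    return result


start, end = date(2016, 1, 1), date(2016, 1, 2)
sports = devices("sports", start, end)
news = devices("news", start, end)

both = len(sports & news)         # devices in sports AND news
either = len(sports | news)       # devices in sports OR news
only_sports = len(sports - news)  # devices in sports AND NOT news
print(both, either, only_sports)
```

Exact sets answer any expression, but they require keeping every device ID per attribute per day, which is the cost the following slides address.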
8. Count-Distinct Approaches
• Store everything
• Store only 1 bit per device
• 10B Devices - 1.25 GB/day
• 10B Devices * 80K attributes - 100 TB/day
• Approximate
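The storage figures for the one-bit-per-device approach can be checked with a short calculation. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and one bitmap per attribute per day. The function name `bitmap_bytes` is illustrative.

```python
# Back-of-the-envelope storage for a one-bit-per-device bitmap.
DEVICES = 10_000_000_000  # 10B devices (from the slide)
ATTRIBUTES = 80_000       # 80K attributes (from the slide)
BITS_PER_BYTE = 8


def bitmap_bytes(devices: int, attributes: int = 1) -> int:
    """Bytes needed to keep one bit per (device, attribute) pair."""
    return devices * attributes // BITS_PER_BYTE


single = bitmap_bytes(DEVICES)
all_attributes = bitmap_bytes(DEVICES, ATTRIBUTES)

print(f"{single / 1e9:.2f} GB/day")          # 1.25 GB/day
print(f"{all_attributes / 1e12:.0f} TB/day")  # 100 TB/day
```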
9. ThetaSketch
• K Minimum Values (KMV)
• Estimate set cardinality
• Supports set-theoretic operations
• ThetaSketch mathematical framework - generalization of KMV
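The KMV idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the DataSketches library or Druid's implementation: the class `KMVSketch`, the helper `unit_hash`, and the sketch size `K = 4096` are all hypothetical. Each item is hashed to a value in [0, 1); the sketch keeps the k smallest hashes plus a threshold `theta`, and estimates cardinality as (retained count) / theta. Union and intersection work on the retained hashes below the smaller of the two thresholds.

```python
import hashlib

K = 4096  # hypothetical sketch size; larger k gives lower error


def unit_hash(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    def __init__(self, k=K, hashes=(), theta=1.0):
        self.k = k
        self.theta = theta  # every retained hash is strictly below theta
        self.hashes = set(hashes)
        self._trim()

    def _trim(self):
        # Drop hashes at or above theta, then cap the sketch at k entries
        # by lowering theta to the (k+1)-th smallest hash.
        self.hashes = {x for x in self.hashes if x < self.theta}
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[: self.k])

    def add(self, item: str) -> None:
        x = unit_hash(item)
        if x < self.theta:
            self.hashes.add(x)
            if len(self.hashes) > self.k:
                self._trim()

    def estimate(self) -> float:
        # Exact while theta == 1.0 (sketch not yet full).
        return len(self.hashes) / self.theta

    def union(self, other: "KMVSketch") -> "KMVSketch":
        theta = min(self.theta, other.theta)
        return KMVSketch(min(self.k, other.k), self.hashes | other.hashes, theta)

    def intersect(self, other: "KMVSketch") -> "KMVSketch":
        theta = min(self.theta, other.theta)
        return KMVSketch(min(self.k, other.k), self.hashes & other.hashes, theta)


# Usage: two overlapping sets of hypothetical device IDs.
x, y = KMVSketch(), KMVSketch()
for i in range(20_000):
    x.add(f"device-{i}")
for i in range(10_000, 30_000):
    y.add(f"device-{i}")

print(round(x.union(y).estimate()))      # true value: 30000
print(round(x.intersect(y).estimate()))  # true value: 10000
```

Because the result of a set operation is itself a sketch, operations can be chained to evaluate larger set-theoretic expressions, at the price of growing error for small intersections.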
12. Elasticsearch Issues
• Indexing data
• 250 GB of daily data, 10 hours to index
• Indexing affects query time
• Large index - 2.5 TB
• Querying
• Low concurrency
• Each query spans all the machines in the cluster
• Cost
• $100K monthly
13. What We Tried
• Pre-processing
• Too many combinations
• HyperLogLog
• No good support for set-theoretic operations
• Calculated during query time
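One common explanation for the HyperLogLog bullet above is that HyperLogLog merges (unions) cleanly, but an intersection has to be derived by inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|, so the errors of three large estimates pile up on one small result. The sketch below only illustrates that arithmetic. The set sizes and the 2% relative error are assumed values, not figures from the slides, and `intersection_error_bound` is an illustrative name.

```python
def intersection_error_bound(size_a, size_b, size_union, rel_err):
    """Worst-case error of |A ∩ B| = |A| + |B| - |A ∪ B| when each
    of the three terms is estimated with relative error `rel_err`."""
    true_intersection = size_a + size_b - size_union
    abs_err = rel_err * (size_a + size_b + size_union)
    return true_intersection, abs_err, abs_err / true_intersection


# Assumed example: two sets of 1M devices that share only 10K devices.
true_i, abs_err, rel = intersection_error_bound(
    1_000_000, 1_000_000, 1_990_000, rel_err=0.02
)
print(f"true intersection: {true_i}")
print(f"worst-case absolute error: {abs_err:.0f}")
print(f"worst-case relative error: {rel:.1f}x the true value")
```

In this assumed case the worst-case error is several times larger than the intersection itself, which is why small intersections of large sets are hard to recover this way.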