Using Druid for interactive count-distinct queries at scale @ NMC

Using Druid for interactive count-distinct queries

Published in: Technology

  1. Yakir Buskilla + Itai Yaffe, Nielsen: Using Druid for Interactive Count-Distinct Queries at Scale
  2. Introduction ● Yakir Buskilla: Software Architect, focusing on Big Data and Machine Learning problems ● Itai Yaffe: Big Data Infrastructure Developer, dealing with Big Data challenges for the last 5 years
  3. Nielsen Marketing Cloud (NMC) ● eXelate was acquired by Nielsen 2 years ago ● A leader in the Ad Tech and Marketing Tech industry ● What do we do? ○ Data as a Service (DaaS) ○ Software as a Service (SaaS)
  4. NMC high-level architecture
  5. The need ● Nielsen Marketing Cloud business question ○ How many unique devices have we encountered: ■ over a given date range ■ for a given set of attributes (segments, regions, etc.) ● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements
  6. The need
  7. The need
  8. Possible solutions ● Naive: store everything ● Bit vector: store only 1 bit per device ○ 10B devices: 1.25 GB/day ○ 10B devices × 80K attributes: 100 TB/day ● Approx.: approximate
  9. Our journey ● Elasticsearch ○ Indexing data ■ 250 GB of daily data took 10 hours to index ■ Indexing affected query time ○ Querying ■ Low concurrency ■ Each query scans all the shards of the corresponding index
  10. What we tried ● Preprocessing ● Statistical algorithms (e.g., HyperLogLog)
  11. ThetaSketch ● K Minimum Values (KMV) ● Estimates set cardinality ● Supports set-theoretic operations [Venn diagram of sets X and Y] ● ThetaSketch mathematical framework: a generalization of KMV
  12. KMV intuition
  13. ThetaSketch error
      Number of Std Dev     1        2
      Confidence Interval   68.27%   95.45%
      Sketch size 16,384    0.78%    1.56%
      Sketch size 32,768    0.55%    1.10%
      Sketch size 65,536    0.39%    0.78%
  14. DRUID: “Very fast highly scalable columnar data-store”
  15. Roll-up (ThetaSketchAggregator)
      Raw events:
      Timestamp    Attribute   Device ID
      2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
      2016-11-15   22222       3a4c1f2d84a5c179435c1fea86e6ae02
      2016-11-15   11111       5dd59f9bd068f802a7c6dd832bf60d02
      2016-11-15   22222       5dd59f9bd068f802a7c6dd832bf60d02
      2016-11-15   33333       5dd59f9bd068f802a7c6dd832bf60d02
      After roll-up:
      Timestamp    Attribute   Count Distinct
      2016-11-15   11111       2
      2016-11-15   22222       2
      2016-11-15   33333       1
  16. Druid architecture
  17. How do we use Druid
  18. Guidelines and pitfalls ● Setup is not easy
  19. Guidelines and pitfalls ● Monitoring your system
  20. Guidelines and pitfalls ● Data modeling ○ Reduce the number of intersections ○ Different datasources for different use cases
      Datasource without a Region column:
      Timestamp    Attribute        Count Distinct
      2016-11-15   US               XXXXXX
      2016-11-15   Porsche Intent   XXXXXX
      ...          ...              ...
      Datasource with a Region column:
      Timestamp    Attribute        Region   Count Distinct
      2016-11-15   Porsche Intent   US       XXXXXX
      ...          ...              ...      ...
  21. Guidelines and pitfalls ● Query optimization ○ Combine multiple queries into a single query ○ Use filters
  22. Guidelines and pitfalls ● Batch ingestion ○ EMR tuning ■ 140-node cluster ● 85% spot instances => ~80% cost reduction ○ Druid input file format: Parquet vs. CSV ■ Reduced indexing time by 4x ■ Reduced storage used by 10x
  23. Guidelines and pitfalls ● Community
  24. Summary
      DRUID          ES
      10TB/day       250GB/day
      4 Hours/day    10 Hours/day
      15GB/day       2.5TB (total)
      280ms-350ms    500ms-6000ms
      $55K/month     $80K/month
  25. THANK YOU!
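
The bit-vector sizes quoted on slide 8 can be checked with a little arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and one bit per (device, attribute) pair per day; the function name is illustrative and does not come from the slides.

```python
# Back-of-the-envelope check of the bit-vector figures on slide 8.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes).
DEVICES = 10_000_000_000   # 10B devices (from the slide)
ATTRIBUTES = 80_000        # 80K attributes (from the slide)
BITS_PER_BYTE = 8


def bit_vector_bytes(devices: int, attributes: int = 1) -> int:
    """Bytes needed to store 1 bit per (device, attribute) pair."""
    return devices * attributes // BITS_PER_BYTE


single_attribute_bytes = bit_vector_bytes(DEVICES)
all_attributes_bytes = bit_vector_bytes(DEVICES, ATTRIBUTES)

print(single_attribute_bytes / 1e9, "GB/day")   # one bit vector for all devices
print(all_attributes_bytes / 1e12, "TB/day")    # one bit vector per attribute
```

The second figure is what rules out the bit-vector approach at this scale and motivates an approximate structure.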
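
The KMV idea from slides 11 and 12 can be illustrated with a toy implementation. This is a minimal sketch under stated assumptions, not the ThetaSketch library used in the talk: the class and function names are hypothetical, SHA-1 is an arbitrary choice of hash, and the estimator is the simple "retained hashes below theta, divided by theta" form.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values sketch (illustrative only)."""

    def __init__(self, k: int = 4096):
        self.k = k
        self._heap = []        # max-heap of retained hashes (stored negated)
        self._members = set()  # the same hashes, for O(1) duplicate checks

    @staticmethod
    def _hash(item: str) -> float:
        # Assumption: SHA-1 truncated to 64 bits and mapped into [0, 1).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item: str) -> None:
        h = self._hash(item)
        if h in self._members:
            return  # repeated element: the sketch is unchanged
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Swap the largest retained hash for the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def theta(self) -> float:
        # The k-th smallest hash once full; 1.0 while the sketch is still exact.
        return -self._heap[0] if len(self._heap) >= self.k else 1.0

    def estimate(self) -> float:
        t = self.theta()
        return sum(1 for h in self._members if h < t) / t


def union_estimate(a: KMVSketch, b: KMVSketch) -> float:
    t = min(a.theta(), b.theta())
    return sum(1 for h in a._members | b._members if h < t) / t


def intersection_estimate(a: KMVSketch, b: KMVSketch) -> float:
    t = min(a.theta(), b.theta())
    return sum(1 for h in a._members & b._members if h < t) / t


# Two overlapping device sets: 60,000 each, 20,000 in common.
x, y = KMVSketch(), KMVSketch()
for i in range(60_000):
    x.update(f"device-{i}")
for i in range(40_000, 100_000):
    y.update(f"device-{i}")

print(round(x.estimate()), round(union_estimate(x, y)),
      round(intersection_estimate(x, y)))
```

Each sketch keeps at most k hashes regardless of how many devices it has seen, which is why union and intersection stay cheap; the price is a relative error that grows as the intersection becomes small compared with the union.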
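
The error figures on slide 13 are consistent with a relative standard error of roughly 1/sqrt(k), where k is the sketch size. That formula is an approximation assumed here for illustration; the slide itself gives only the numbers.

```python
import math


def relative_standard_error(k: int) -> float:
    """Approximate relative standard error of a sketch with k entries.

    Assumption: error ~ 1/sqrt(k), which reproduces the slide's figures.
    """
    return 1.0 / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    rse = relative_standard_error(k)
    # One standard deviation covers ~68.27% of cases, two cover ~95.45%.
    print(f"k={k:>6}: 1 std dev = {rse:.2%}, 2 std dev = {2 * rse:.2%}")
```

Doubling k shrinks the error only by a factor of sqrt(2), so accuracy beyond about 1% becomes expensive in sketch size.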
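
The roll-up on slide 15 collapses raw events into one row per (timestamp, attribute). The sketch below reproduces that grouping in plain Python, using an exact `set` as a stand-in for the ThetaSketch aggregator; in Druid the per-row value would be an approximate sketch rather than a set, and the attribute IDs are written here as five digits.

```python
from collections import defaultdict

# Raw events as on slide 15: (timestamp, attribute, device ID).
raw_events = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(events):
    """Group events by (timestamp, attribute) and count distinct devices.

    A Python set stands in for the sketch; it is exact, not approximate.
    """
    devices_by_key = defaultdict(set)
    for timestamp, attribute, device_id in events:
        devices_by_key[(timestamp, attribute)].add(device_id)
    return {key: len(devices) for key, devices in devices_by_key.items()}


rolled_up = roll_up(raw_events)
for (timestamp, attribute), count in sorted(rolled_up.items()):
    print(timestamp, attribute, count)
```

Five raw rows become three stored rows, and the saving grows with the number of devices per attribute.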
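
Slide 21 recommends combining multiple queries into a single query and using filters. One way this might look is a single Druid native timeseries query with one filtered `thetaSketch` aggregator per attribute and a post-aggregation that intersects them. The datasource name `nmc_devices` and the column names `attribute` and `device_sketch` are hypothetical, and the JSON shape follows the Druid DataSketches extension as I understand it, so treat it as a sketch to check against the Druid documentation rather than the speakers' actual query.

```python
import json


def filtered_sketch(name: str, attribute: str) -> dict:
    """A thetaSketch aggregator restricted to rows with one attribute value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": "attribute", "value": attribute},
        "aggregator": {
            "type": "thetaSketch",
            "name": name,
            "fieldName": "device_sketch",  # hypothetical sketch column
        },
    }


query = {
    "queryType": "timeseries",
    "dataSource": "nmc_devices",           # hypothetical datasource name
    "granularity": "all",
    "intervals": ["2016-11-15/2016-11-16"],
    # Top-level filter: scan only rows for the attributes of interest.
    "filter": {"type": "in", "dimension": "attribute", "values": ["11111", "22222"]},
    # One request computes both per-attribute sketches...
    "aggregations": [
        filtered_sketch("attr_11111", "11111"),
        filtered_sketch("attr_22222", "22222"),
    ],
    # ...and their intersection, instead of issuing separate queries.
    "postAggregations": [
        {
            "type": "thetaSketchEstimate",
            "name": "devices_with_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "both_sketch",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "attr_11111"},
                    {"type": "fieldAccess", "fieldName": "attr_22222"},
                ],
            },
        }
    ],
}

print(json.dumps(query, indent=2))
```

The printed JSON is what would be posted to a Druid broker; building it in code makes it easy to add further attributes to the same request.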
