Finding the number of unique users out of 10 billion events per day is challenging. In this session, we're going to describe how re-architecting our data infrastructure around Druid and ThetaSketch enables our customers to obtain these insights in real time.
To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) with real-time analytics tools to profile their target audiences. Specifically, we provide them with the ability to see the number of unique users who meet a given criterion.
Historically, we used Elasticsearch to answer these types of questions; however, we encountered major scaling and stability issues.
In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid, with ThetaSketch, to overcome the limitations we were facing.
We will also provide guidelines and best practices with regard to Druid.
Topics include:
* The need and possible solutions
* Intro to Druid and ThetaSketch
* How we use Druid
* Guidelines and pitfalls
Counting Unique Users in Real-Time: Here's a Challenge for You!
1. Counting Unique Users in Real-Time:
Here’s a Challenge for You!
Yakir Buskilla & Itai Yaffe
Nielsen
2. Introduction
Yakir Buskilla Itai Yaffe
● VP R&D
● Focused on big data
processing and machine
learning solutions
● Tech Lead, Big Data group
● Dealing with Big Data
challenges since 2012
3. Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Business decisions
● Targeting
5. Nielsen Marketing Cloud - questions we try to answer
1. How many unique users of a certain profile can we reach?
E.g. a campaign for young women who love tech
2. How many impressions did a campaign receive?
7. The need for Count Distinct
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find, in real time, the number of distinct elements in a data stream
which may contain repeated elements
8. Possible solutions for Count Distinct
● Naive - store everything
● Bit vector - store only 1 bit per device
○ 10B devices - 1.25 GB/day
○ 10B devices * 80K attributes - 100 TB/day
● Approximate
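The bit-vector arithmetic on the slide can be checked quickly (decimal units assumed):

```python
# Back-of-the-envelope check of the bit-vector option (decimal units):
devices = 10_000_000_000              # 10B devices per day
attributes = 80_000                   # attributes in the taxonomy
bytes_per_attribute = devices / 8     # 1 bit per device, converted to bytes

print(bytes_per_attribute / 1e9)                 # -> 1.25 (GB/day, one attribute)
print(bytes_per_attribute * attributes / 1e12)   # -> 100.0 (TB/day, all attributes)
```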
9. Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data, 10 hours
■ Affects query time
○ Querying
■ Low concurrency
■ Scans all the shards of the
corresponding index
11. What we tried
● Preprocessing
● Statistical algorithms (e.g. HyperLogLog)
12. ThetaSketch
● K Minimum Values (KMV)
● Estimates set cardinality
● Supports set-theoretic operations (unions, intersections)
● The ThetaSketch mathematical framework is a generalization of KMV
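To illustrate the idea (a toy sketch, not the DataSketches implementation): a KMV estimator hashes every element to [0, 1), keeps only the k smallest distinct hash values, and estimates cardinality from the k-th smallest one.

```python
import hashlib

def _hash01(x):
    # Map an item to a pseudo-random float in [0, 1)
    digest = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_estimate(stream, k=1024):
    """Toy K-Minimum-Values sketch: if the k-th smallest of n uniform
    hash values is m, then n is approximately (k - 1) / m."""
    mins = set()
    threshold = 1.0
    for item in stream:
        h = _hash01(item)
        if h < threshold and h not in mins:
            mins.add(h)
            if len(mins) > k:          # keep only the k smallest values
                mins.discard(max(mins))
                threshold = max(mins)
    if len(mins) < k:
        return len(mins)               # fewer than k distinct elements seen
    return int((k - 1) / max(mins))

# 150K events, but only 50K distinct "devices"
print(kmv_estimate((i % 50_000 for i in range(150_000)), k=1024))
```

Because two such sketches can be merged by taking the k smallest values of their union, the estimator also supports the set-theoretic operations mentioned above.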
14. ThetaSketch error - error as a function of K

Number of Std Devs      1         2
Confidence Interval     68.27%    95.45%
K = 16,384              0.78%     1.56%
K = 32,768              0.55%     1.10%
K = 65,536              0.39%     0.78%
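The table is consistent with the theta sketch's relative standard error being roughly 1/sqrt(K) (a commonly cited approximation); a quick sanity check:

```python
import math

# Relative standard error of a theta sketch is approximately 1/sqrt(K);
# one and two standard deviations reproduce the table above.
for k in (16_384, 32_768, 65_536):
    one_sd = 1 / math.sqrt(k)
    two_sd = 2 / math.sqrt(k)
    print(f"K={k:>6}: {one_sd:.2%} (68.27% CI), {two_sd:.2%} (95.45% CI)")
```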
17. Why is it cool?
● Stores trillions of events, petabytes of data
● Sub-second analytic queries
● Highly scalable
● Cost effective
● Decoupled architecture
○ E.g. ingestion is separated from querying
19. Roll-up - Count Distinct

Raw data:

Timestamp    Attribute   Device ID
2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   11111       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   22222       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   33333       5dd59f9bd068f802a7c6dd832bf60d02

After roll-up (ThetaSketchAggregator):

Timestamp    Attribute   Count Distinct*
2016-11-15   11111       2*
2016-11-15   22222       1*
2016-11-15   33333       1*

* What is actually stored is a ThetaSketch object. The actual result is
calculated in real-time, which allows us to do UNIONs and INTERSECTs
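In Druid, this roll-up is declared in the ingestion spec's metricsSpec. A hypothetical fragment (the metric and field names are made up for illustration) might look like:

```python
import json

# Hypothetical metricsSpec fragment for a Druid ingestion spec:
# - thetaSketch rolls all device IDs in a (timestamp, attribute) group
#   into a single sketch object ("size" is the nominal K)
# - longSum covers the "simple count" use case
metrics_spec = [
    {"type": "thetaSketch", "name": "devices_sketch",
     "fieldName": "device_id", "size": 65536},
    {"type": "longSum", "name": "event_count", "fieldName": "count"},
]

print(json.dumps(metrics_spec, indent=2))
```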
25. Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases

Old data model (region encoded as an attribute):

Timestamp    Attribute        Count Distinct
2016-11-15   US               XXXXXX
2016-11-15   Porsche Intent   XXXXXX
2016-11-15   ...              ...

New data model (region as a separate dimension):

Timestamp    Attribute        Region   Count Distinct
2016-11-15   Porsche Intent   US       XXXXXX
2016-11-15   ...              ...      ...
26. Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into a single query
○ Use filters
○ Use timeseries rather than groupBy queries (where applicable)
○ Use the groupBy v2 engine (default since 0.10.0)
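For example, a filtered timeseries query with a thetaSketch aggregation might look roughly like this (the datasource, dimension, and metric names are hypothetical):

```python
import json

# Hypothetical Druid timeseries query: filter early, then let the
# thetaSketch aggregator estimate unique devices per day.
query = {
    "queryType": "timeseries",
    "dataSource": "devices",          # hypothetical datasource name
    "granularity": "day",
    "intervals": ["2016-11-15/2016-11-16"],
    "filter": {"type": "selector", "dimension": "attribute", "value": "11111"},
    "aggregations": [
        {"type": "thetaSketch", "name": "unique_devices",
         "fieldName": "devices_sketch"}
    ],
}

print(json.dumps(query, indent=2))
```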
27. Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4X
■ Reduced used storage by 10X
28. Guidelines and pitfalls
● Batch Ingestion
○ Action - pre-aggregating the data in a Spark app
■ Aggregating data by key
● groupBy() - for simple counts
● combineByKey() - for distinct counts (using the DataSketches packages)
■ Decreasing execution frequency
● E.g. every 1 hour (rather than every 30 minutes)
○ Result:
■ # of output records is ~2000X smaller and total size of output files is less
than 1%, compared to the previous version
■ 10X fewer nodes in the EMR cluster running the MapReduce ingestion job
■ Another 80% cost reduction, $2.64M/year -> $0.47M/year!
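The combineByKey() shape above can be illustrated in plain Python (sets stand in for the Theta sketches that the real Spark job creates and merges; the names are illustrative):

```python
# Plain-Python stand-in for Spark's combineByKey() pre-aggregation.
# In the real job, the three functions create and merge Theta sketches
# (DataSketches); here sets play that role purely to show the shape.
events = [("11111", "dev-a"), ("11111", "dev-a"),
          ("11111", "dev-b"), ("22222", "dev-b")]

def create_combiner(device):          # first value seen for a key
    return {device}

def merge_value(acc, device):         # fold another value into a combiner
    acc.add(device)
    return acc

def merge_combiners(a, b):            # merge partial results (sketch union)
    return a | b

combined = {}
for key, device in events:
    if key in combined:
        combined[key] = merge_value(combined[key], device)
    else:
        combined[key] = create_combiner(device)

print({k: len(v) for k, v in combined.items()})  # -> {'11111': 2, '22222': 1}
```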
30. Future work
● Research ways to improve accuracy for small set <-> large set intersections
● Further improve query performance
● Explore the option of tiering query processing nodes
○ Reporting vs interactive queries
○ Hot vs cold data
● Version upgrades
31. What have we learned?
● Answering Count Distinct queries in real-time is a challenge!
○ Approximation algorithms FTW!
● Druid provides a concrete implementation of the ThetaSketch mathematical
framework
○ A columnar, time series data-store
○ Can store trillions of events and serve analytic queries with sub-second latency
○ Highly-scalable, cost-effective and widely used among Big Data companies
○ Can be used for:
■ Distinct count (via ThetaSketch)
■ Simple counts
● Words of wisdom:
○ Setup is not easy, using online resources (documentation, community) can help
○ Ingestion has little effect on query performance (deep storage usage)
○ Provides very good visibility
○ Improve query performance by carefully designing your data model and building your queries
32. Want to know more?
● Women in Big Data
○ A world-wide program that aims:
■ To inspire, connect, grow, and champion success of women in Big Data.
■ To grow women's representation in the Big Data field to over 25% by 2020
○ Visit the website (https://www.womeninbigdata.org/) and join the Women in Big Data Luncheon
today (12:30PM, http://tinyurl.com/y2mycox4)!
● Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
○ Tomorrow, 2:50 PM - 3:30 PM Room 127-128, http://tinyurl.com/y5vfmq5p
● NMC Tech Blog - https://medium.com/nmc-techblog
35. Druid vs ES

                  DRUID         ES
Ingested data     10TB/day      250GB/day
Ingestion time    4 hours/day   10 hours/day
Storage           15GB/day      2.5TB (total)
Query latency     280ms-350ms   500ms-6000ms
Cost              $55K/month    $80K/month
Editor's Notes
Thank you for coming to hear our story about the challenge of counting unique users in real-time
We will try to make it interesting and valuable for you
Yakir, At NMC since August 2015
Leading the R&D and managing the site in Israel
Going to talk about
NMC - about us, high-level architecture
The questions we try to answer
Put you in the context of our path to Druid
Itai, At NMC since May 2014
Tech lead of the big data group
One of the engineers who led the Druid research and implementation effort
Will talk about Druid from idea to production, and give super cool tips for beginners
Questions - at the end of the session
Data Company - which means that we get or buy data from our partners in various ways, online and offline
We enrich the data - which in our case means generating attributes
Attribute - something we assign to a device based on the data that we have, for example Sports Fan, Eats Organic Food, etc.
The enriched data that we generate helps support our clients’ business decisions and also allows them to target the relevant audiences
Nielsen marketing cloud or NMC in short
A group inside Nielsen,
Born from eXelate, a company that was acquired by Nielsen in March 2015
Nielsen is a data company and so are we, and we had a strong business relationship until at some point they decided to go for it and acquired eXelate
Data company meaning
Buying and onboarding data into NMC from data providers, customers and Nielsen data
We have huge high quality dataset
Enrich the data using machine learning models in order to create more relevant, quality insights
Categorize and sell according to a need
Helping brands make intelligent business decisions
E.g. targeting in the digital marketing world
Meaning, help fit ads to viewers
For example, a street sign can fit only a very small % of the people who see it, vs.
online ads that can fit the profile of the individual who sees them
More interesting to the user
Higher chance they will click the ad
Better ROI for the marketer
A few words on NMC data pipeline architecture:
Frontend layer:
Receives all the online and offline data traffic
Bare metal on different data centers (3 in US, 2 in EU ,3 in APAC)
near real time - high throughput/low latency challenges
Backend layer
AWS cloud-based
Processes all the frontend layer outputs
ETLs - load data into data sources, aggregated and raw
Applications layer
Also in the cloud
Variety of apps above all our data sources
Web - NMC
data configurations (segments, audiences etc)
campaign analysis , campaign management tools etc.
visualized profile graphs
reports
What are the questions we try to answer in NMC that help our customers make business decisions?
There are a lot of questions, but these lead to what Druid came to solve
Translating from human problem to technical problem:
UU (distinct) count
Simple count
Demo
Danny talked about the 2 main use-cases - counting unique users and counting hits (or “simple counts”). The first one is somewhat harder, so this is going to be at the focus of my part of the presentation
Past…
Mention “cardinality” and “real-time dashboard”
Explain the need to union and intersect
Who’s familiar with the count-distinct problem?
For the first 2 solutions, we need to store data per device per attribute per day
Bit vector - Elasticsearch/Redis is an example of such a system
Approximation has a certain error rate
Who is familiar with Elasticsearch?
In ES, we stored the raw data, where every device was a document, and each such document contained all events for that device
A screen in our SaaS application can generate up to thousands of queries
We tried to introduce a new cluster dedicated to indexing only and then use backup and restore to the second cluster
This method was very expensive and was partially helpful
Tuning for better performance also didn’t help too much
The story about the demo when we were at the bar (December 20th, 2016)...
Preprocessing - too many combinations - the formula length is not bounded (show some numbers)
HyperLogLog
- Implementation in Elasticsearch was too slow (done at query time)
- Set operations increase the error dramatically
ThetaSketch is based on KMV
Explain what KMV is and the effect of the K size
Unions and intersections increase the error
The problematic case is the intersection of a very small set with a very big set
The larger the K, the smaller the error
However, a larger K means more memory & storage needed
Demo - http://content.research.neustar.biz/blog/kmv.html
So we talked about statistical algorithms, which is nice, but we needed a practical solution…
Supports the ThetaSketch algorithm out of the box
Open source, written in Java (works for us, as we know Java…)
Who’s familiar with Druid?
Just to give you a sense of where Druid is used in production...
http://druid.io/druid-powered.html
I’ll try to cover all these reasons in the next slides
Timeseries database - first thing you need to know about Druid
Column types:
Timestamp
Dimensions
Metrics
Together they comprise a Datasource
Aggregation is done at ingestion time (the outcome is much smaller in size)
At query time, it’s closer to a key-value search
We’re just using a different type of aggregator (the ThetaSketch aggregator) to get count distinct, but everything else is essentially the same.
The one big difference is multiple ingestions of the same data:
For ThetaSketch - not a problem, due to the fact that it “samples” the data (chooses the K minimal values);
Whereas for Sum - we’re going to get wrong numbers (e.g. 2X as big if we ingest the data twice)
To mitigate it, we’ve added a simple meta-data store to prevent ingesting the same data twice (you can look at it as some kind of “checkpointing”)
We have 3 types of processes - ingestion, querying, management. All processes are decoupled and scalable
Ingestion (real time - e.g. from Kafka; batch - talk about deep storage, how data is aggregated at ingestion time)
Querying (brokers, historicals, query performance during ingestion vs ES)
Lambda architecture (for those who don’t know - it’s “a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.”)
Explain the tuple and what is happening during the aggregation
Mention it says “ThetaSketchAggregator”, but again - we also use LongSumAggregator for simple counts
We ingest a lot more data today than we did in ES (10B events/day in TBs of data vs 250GB in ES)
I mentioned our meta-data store earlier - each component in the flow updates that meta-data store, to prevent ingestion of the same data twice
We can see that while ES response time is exponentially increasing, Druid response time is relatively stable
Benchmark using :
Druid Cluster : 1x Broker (r3.8xlarge) , 8x Historical (r3.8xlarge)
Elasticsearch Cluster : 20 nodes (r3.8xlarge)
This is how we use it, now switching to how we got there and the pains...
Setup is not easy
Separate config/servers/tuning
Caused the deployment to take a few months
Use the Druid recommendation for Production configuration
Monitoring Your System
Druid has built-in support for Graphite (exports many metrics), and so does Spark. We also export metrics to Graphite from our ingestion tasks (written in Python) and from the NMC backend (aesreporter) to provide a complete, end-to-end view of the system.
In this example, query time is very high due to a high number of pending segments (i.e. segments that are queued to be scanned in order to answer the query)
Data Modeling
If using ThetaSketch - reduce the number of intersections (show a slide of the old and new data model). In this example, US is a very large set, Porsche intent is (probably) a small set. It didn’t solve all use-cases, but it gives you an idea of how you can approach the problem
Different datasources - e.g. lower accuracy (i.e. lower K) for faster queries vs higher accuracy with slightly slower queries
Combine multiple queries over the REST API (explain why?)
There can be billions of rows, so filter the data as part of the query
Switching from groupBy to timeseries query seems to have solved the “io.druid.java.util.common.IAE: Not enough capacity for even one row! Need[1,509,995,528] but have[0].” we had
groupBy v2 offers better performance and memory management (e.g generates per-segment results using a fully off-heap map)
EMR tuning (spot instances (80% cost reduction, but it comes with a risk of being outbid, in which case nodes are lost), Druid MR prod config)
Use Parquet
Affinity - use fillCapacityWithAffinity to ingest data from multiple EMR clusters to the same Druid cluster (but different datasources) concurrently, see http://druid.io/docs/latest/configuration/indexing-service.html#affinity
Why? Ingestion still takes a lot of time and resources
There was almost no “penalty” on the Spark Streaming app (with the new version of the app)
For distinct counts, we use the DataSketches packages plus combineByKey(). This requires setting isInputThetaSketch=true on the ingestion task
Decreasing execution frequency (e.g. every 1 hour instead of every 30 minutes) allows a more significant aggregation ratio
Ingestion has little effect on query + sub-second response for even 100s or 1000s of concurrent queries
With Druid and ThetaSketch, we’ve improved our ingestion volume and query performance and concurrency by an order of magnitude with a lesser cost, compared to our old solution
(We’ve achieved a more performant, scalable, cost-effective solution)
Nice comparison of open-source OLAP systems for Big Data here - https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
Cost is for the entire solution (Druid cluster, EMR, etc.)