HyperLogLog in Hive – How to count sheep efficiently?
Phillip Capper: Whitecliffs Sheep
@bzamecnik
Agenda
● the problem – count distinct elements
● exact counting
● fast approximate counting – using HLL in Hive
● comparing performance and accuracy
● appendix – a bit of theory of probabilistic counting
○ how does it work?
The problem: count distinct elements
● e.g. the number of unique visitors
● each visitor can make a lot of clicks
● typically grouped in various ways
● "set cardinality estimation" problem
Small data solutions
● sort the data O(N*log(N)) and skip duplicates O(N)
○ O(N) space
● put data into a hash or tree set and iterate
○ hash set: O(N) expected build (O(N^2) worst case), O(N) iteration
○ tree set: O(N*log(N)) build, O(N) iteration
○ both O(N) space
● but: we have big data
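For data that fits in memory, the two approaches in the list above can be sketched in a few lines of Python. This is only an illustration; the function names and the sample `clicks` list are made up for the example.

```python
def count_distinct_sorted(values):
    """Sort the data (O(N log N)), then skip duplicates in one pass (O(N))."""
    ordered = sorted(values)
    count = 0
    for i, v in enumerate(ordered):
        # A value is new if it differs from its predecessor in sorted order.
        if i == 0 or v != ordered[i - 1]:
            count += 1
    return count


def count_distinct_hashed(values):
    """Put the data into a hash set and count its elements."""
    return len(set(values))


# Hypothetical click log: three distinct visitors, six clicks.
clicks = ["u1", "u2", "u1", "u3", "u2", "u1"]
print(count_distinct_sorted(clicks))  # 3
print(count_distinct_hashed(clicks))  # 3
```

Both variants keep every distinct value in memory, which is the part that stops scaling.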
Example:
~100M unique values in 5B rows each day
32 bytes per value -> 3 GB unique, 150 GB total
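The sizes in this example can be checked with simple arithmetic. The short sketch below assumes the slide's "GB" means binary gigabytes (GiB), which is what makes the rounded figures come out as 3 and about 150.

```python
BYTES_PER_VALUE = 32
UNIQUE_VALUES = 100_000_000    # ~100M unique values per day
TOTAL_ROWS = 5_000_000_000     # ~5B rows per day
GIB = 2 ** 30                  # assumption: sizes are given in GiB

unique_gib = UNIQUE_VALUES * BYTES_PER_VALUE / GIB
total_gib = TOTAL_ROWS * BYTES_PER_VALUE / GIB

print(f"unique: {unique_gib:.1f} GiB, total: {total_gib:.1f} GiB")
# unique: 3.0 GiB, total: 149.0 GiB
```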
Problems with counting big data
● data is partitioned
○ across many machines
○ in time
● we can't sum cardinality of each partition
○ since the subsets are generally not disjoint
○ we would overestimate
count(part1) + count(part2) >= count(part1 ∪ part2)
● we need to merge estimators and then estimate
cardinality
count(estimator(part1) ∪ estimator(part2))
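A toy illustration of why per-partition counts cannot simply be summed, using exact Python sets in place of estimators. The partitions and visitor names are invented for the example.

```python
# Hypothetical partitions of visitor IDs; "bob" appears in both.
part1 = {"alice", "bob", "carol"}
part2 = {"bob", "dave"}


def count_distinct(values):
    """Exact distinct count."""
    return len(set(values))


sum_of_counts = count_distinct(part1) + count_distinct(part2)
count_of_union = count_distinct(part1 | part2)

print(sum_of_counts)   # 5 (bob is counted twice)
print(count_of_union)  # 4
```

Merging first and counting afterwards gives the right answer. An HLL plays the role of the set here, but in a small, fixed amount of memory.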
Exact counting in Hive
SELECT COUNT(DISTINCT user_id)
FROM events;
single reducer!
Exact counting in Hive – subquery
SELECT COUNT(*) FROM (
SELECT 1 FROM events
GROUP BY user_id
) unique_guids;
Or more concisely:
SELECT COUNT(*) FROM (
SELECT DISTINCT user_id
FROM events
) unique_guids;
many reducers, two phases
cannot be combined with other aggregations in the same query
Exact counting in Hive
● hive.optimize.distinct.rewrite
○ lets Hive rewrite COUNT(DISTINCT) into a subquery
○ since Hive 1.2.0
Probabilistic counting
● fast results, but approximate
● practical example of using HLL in Hive
● more theory in the appendix
Implementations of HLL as Hive UDFs
● klout/brickhouse
○ single option
○ no JAR, some tests
○ based on HLL++ from stream-lib (quite fast)
● jdmaturen/hive-hll
○ no options (they are in API, but not implemented!)
○ no JAR, no tests
○ compatible with java-hll, pg-hll, js-hll
● t3rmin4t0r/hive-hll-udf
○ no options, no JAR, no tests
UDFs in Hive
● User-Defined Functions
● a function is registered from a class (loaded from a JAR)
● the JAR needs to be on HDFS (otherwise it fails)
● you can choose the UDF name at will
● they work in both HiveServer2/Beeline and Hive CLI
ADD JAR hdfs:///path/to/the/library.jar;
CREATE TEMPORARY FUNCTION foo_func
AS 'com.example.foo.FooUDF';
● Usage:
SELECT foo_func(...) FROM ...;
General UDFs API for HLL
● to_hll(value)
○ aggregates values into an HLL
○ UDAF (aggregation function)
○ also hashes each value
○ can optionally be configured (e.g. for precision)
● union_hlls(hll)
○ unions multiple HLLs
○ UDAF
● hll_approx_count(hll)
○ estimates cardinality from an HLL
○ UDF
An HLL can be stored as a binary or string type.
Example usage
● Estimate of total unique visitors:
SELECT hll_approx_count(to_hll(user_id))
FROM events;
● Estimate of total events + unique visitors at once:
SELECT
count(*) AS total_events,
hll_approx_count(to_hll(user_id))
AS unique_visitors
FROM events;
Example usage
● Compute each daily estimator once:
CREATE TABLE daily_user_hll AS
SELECT date, to_hll(user_id) AS users_hll
FROM events
GROUP BY date;
● Then quickly aggregate and estimate:
SELECT hll_approx_count(union_hlls(users_hll))
AS user_count
FROM daily_user_hll
WHERE date BETWEEN '2015-01-01' AND '2015-01-31';
Brickhouse – installation
https://github.com/klout/brickhouse - Hive UDF
https://github.com/addthis/stream-lib - HLL++
$ git clone https://github.com/klout/brickhouse
$ cd brickhouse
# disable maven-javadoc-plugin in pom.xml (since it fails)
$ mvn package
$ wget http://central.maven.org/maven2/com/clearspring/analytics/stream/2.3.0/stream-2.3.0.jar
$ scp target/brickhouse-0.7.1-SNAPSHOT.jar stream-2.3.0.jar cluster-host:
cluster-host$ hdfs dfs -copyFromLocal *.jar /user/me/hive-libs
Brickhouse – usage
ADD JAR /user/zamecnik/lib/brickhouse-0.7.1-15f5e8e.jar;
ADD JAR /user/zamecnik/lib/stream-2.3.0.jar;
CREATE TEMPORARY FUNCTION to_hll AS 'brickhouse.udf.hll.HyperLogLogUDAF';
CREATE TEMPORARY FUNCTION union_hlls AS 'brickhouse.udf.hll.UnionHyperLogLogUDAF';
CREATE TEMPORARY FUNCTION hll_approx_count AS 'brickhouse.udf.hll.EstimateCardinalityUDF';
to_hll(value, [bit_precision])
● bit_precision: 4 to 16 (default 6)
Hive-hll usage
ADD JAR /user/zamecnik/lib/hive-hll-0.1-2807db.jar;
CREATE TEMPORARY FUNCTION hll_hash AS 'com.kresilas.hll.HashUDF';
CREATE TEMPORARY FUNCTION to_hll AS 'com.kresilas.hll.AddAggUDAF';
CREATE TEMPORARY FUNCTION union_hlls AS 'com.kresilas.hll.UnionAggUDAF';
CREATE TEMPORARY FUNCTION hll_approx_count AS 'com.kresilas.hll.CardinalityUDF';
Hive-hll usage
We have to explicitly hash the value:
SELECT
hll_approx_count(to_hll(hll_hash(user_id)))
FROM events;
Options for creating HLL:
to_hll(x, [log2m, regwidth, expthresh, sparseon])
These are currently ignored; the values are hardcoded to:
[log2m=11, regwidth=5, expthresh=-1, sparseon=true]
Nice things
● HLLs are additive
○ can be computed once
○ various partitions can be merged and estimated for
cardinality later
● we can count multiple unique columns at once
○ no need to subquery
○ we can do wild grouping (by country, browser, …)
● HLLs take only little space
Rolling window
-- keep a reasonable number of tasks for a month of data
SET mapreduce.input.fileinputformat.split.maxsize=5368709120;
-- keep low number of output files (HLLs are quite small)
SET hive.merge.mapredfiles=true;
-- maximum precision
SET hivevar:hll_precision=16;
-- HLL for each day
CREATE TABLE guids_parquet_hll AS
SELECT
'${year}' AS year,
'${month}' AS month,
day,
to_hll(guid, ${hll_precision}) AS guid_hll
FROM parquet.dump_${year}_${month}
GROUP BY day;
Rolling window
-- for each day, estimate the number of guids over the last 7 days
-- (the current row plus the 6 preceding rows)
CREATE TABLE zamecnik.guids_parquet_rolling_30_day_count
AS
SELECT
`date`,
hll_approx_count(guids_union) AS guid_count
FROM (
SELECT
concat(`year`, '-', `month`, '-', `day`) as `date`,
union_hlls(guid_hll) OVER w AS guids_union
FROM guids_parquet_hll
WINDOW w AS (
ORDER BY `year`, `month`, `day` ROWS 6 PRECEDING
)
) rolling_guids;
Pitfalls
● when JARs are not on HDFS the query fails (why?)
● computing on many days of raw clickstream fails in
Beeline (works in Hive CLI), parquet is ok
● HIVE-9073 WINDOW + custom UDAF → NPE
○ fixed in Hive 1.2.0
● DISTRO-631
Approximation error
● Typically < 1-2 %
● Can be controlled by the parameters
● Example: 1 year of guids
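As a rough guide to how the precision parameter controls the error, the original HyperLogLog analysis gives a relative standard error of about 1.04/√m for m = 2^p registers. The sketch below tabulates that formula. It assumes the UDF precision settings (`bit_precision`, `log2m`) correspond to p; HLL++ implementations may behave somewhat differently, especially at small cardinalities.

```python
import math


def hll_standard_error(precision_bits):
    """Approximate relative standard error of HLL with 2**precision_bits registers."""
    m = 2 ** precision_bits
    return 1.04 / math.sqrt(m)


for p in (6, 11, 16):
    print(f"p={p:2d}  registers={2 ** p:6d}  error={hll_standard_error(p):.2%}")
# p= 6  registers=    64  error=13.00%
# p=11  registers=  2048  error=2.30%
# p=16  registers= 65536  error=0.41%
```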
Appendix – more interesting things
Probabilistic counting
● trade-off: some approximation error for far better
performance and memory consumption
● sketch - streaming & probabilistic algorithm
● KMV - k minimal values
● linear counter
● loglog counter
LogLog counter
● run length of initial zeros
● multiple estimators (registers)
● stochastic averaging
○ single hash function
○ multiple buckets
● hash → (register index, run length)
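The last step, hash → (register index, run length), might look like the following sketch. It assumes a 64-bit hash whose top p bits select the register. The helper names are made up, and SHA-1 from the standard library stands in for a murmur3 hash only to keep the example dependency-free. As in the HyperLogLog paper, the stored value is the run of leading zeros plus one.

```python
import hashlib


def hash64(value):
    """64-bit hash of a string (first 8 bytes of SHA-1; a stand-in for murmur3)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def split_hash(h, p):
    """Split a 64-bit hash into (register index, rank).

    index: the top p bits
    rank:  number of leading zeros in the remaining 64 - p bits, plus one
    """
    rest_bits = 64 - p
    index = h >> rest_bits
    rest = h & ((1 << rest_bits) - 1)
    rank = rest_bits - rest.bit_length() + 1
    return index, rank


# Hand-built hash: index bits 0011, then two zeros before the first 1-bit.
print(split_hash((3 << 60) | (1 << 57), 4))  # (3, 3)
print(split_hash(hash64("sheep-42"), 4))
```

Each register keeps the maximum rank seen for its bucket, so one hash function serves all buckets (stochastic averaging).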
Linear counter
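import math
import mmh3                       # third-party MurmurHash3 binding
from bitarray import bitarray     # third-party bit array

m = 20                            # size of the register
register = bitarray(m)            # register, m bits
register.setall(0)                # bitarray(m) is not zero-initialized

def add(value):
    h = mmh3.hash(value) % m      # select bit index
    register[h] = 1               # = max(1, register[h])

def cardinality():
    u_n = register.count(0)       # number of zeros
    v_n = u_n / m                 # relative number of zeros
    n_hat = -m * math.log(v_n)    # estimate of the set cardinality
    return n_hat                  # undefined once every bit is set (v_n = 0)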
HyperLogLog (HLL)
● structure like loglog counter
● harmonic mean to combine registers
● correction for small and large cardinalities
● values need to be hashed well – murmur3
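Putting these points together, a minimal HyperLogLog could be sketched as below. This is an illustrative toy, not the code used by any of the Hive UDFs. It uses a 64-bit SHA-1 prefix instead of murmur3 to stay within the standard library, applies only the small-range (linear counting) correction, and omits the large-range correction, which matters mainly for 32-bit hashes. The class and parameter names are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be in 7..16 for the alpha formula used here")
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        rest_bits = 64 - self.p
        index = h >> rest_bits                       # top p bits pick the register
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1     # leading zeros + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, m >= 128
        # Harmonic mean of 2**register across all registers.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)           # small range: linear counting
        return raw


hll = HyperLogLog(p=12)
for i in range(50_000):
    hll.add(f"user-{i % 25_000}")    # 25,000 distinct IDs, each seen twice
print(round(hll.cardinality()))      # close to 25000
```

With p=12 the expected relative error is about 1.6 %, so the printed estimate should land within a few hundred of 25,000.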
HLL union
● just take max of each register value
● no loss – same result as HLL of union of streams
● parallelizable
● union preserves error bound, intersection/diff do not
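The "no loss" property can be checked directly: taking the register-wise maximum of two sketches gives exactly the registers of a sketch built from the combined stream. The sketch below builds bare register arrays under the same assumptions as before (64-bit SHA-1 prefix as the hash, top P bits as the index). The helper names and sample streams are invented.

```python
import hashlib

P = 10          # precision bits (assumed for the example)
M = 1 << P      # number of registers


def registers_of(values):
    """Build HLL registers for a stream of values."""
    regs = [0] * M
    for v in values:
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - P
        index = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1
        regs[index] = max(regs[index], rank)
    return regs


def union(a, b):
    """Union of two HLLs of the same size: max of each register pair."""
    return [max(x, y) for x, y in zip(a, b)]


# Two overlapping streams (users 4000..5999 appear in both).
part1 = [f"user-{i}" for i in range(0, 6000)]
part2 = [f"user-{i}" for i in range(4000, 10000)]

merged = union(registers_of(part1), registers_of(part2))
direct = registers_of(part1 + part2)
print(merged == direct)  # True
```

Because max is associative and commutative, partitions can be sketched in parallel and merged in any order.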
Further reading
● very nice explanation of HLL
● Probabilistic Data Structures For Web Analytics And
Data Mining
● Sketch of the Day: HyperLogLog — Cornerstone of a
Big Data Infrastructure
● HyperLogLog in Pure SQL
● Use Subqueries to Count Distinct 50X Faster
● It is possible to combine HLL of different sizes
Papers
● HyperLogLog in Practice: Algorithmic Engineering of
a State of The Art Cardinality Estimation Algorithm
● https://github.com/addthis/stream-lib#cardinality
Other problems & structures
● set membership – bloom filter
● top-k elements – count-min-sketch, stream-summary

HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 

HyperLogLog in Hive - How to count sheep efficiently?

  • 1. HyperLogLog in Hive How to count sheep efficiently? Phillip Capper: Whitecliffs Sheep @bzamecnik
  • 2. Agenda ● the problem – count distinct elements ● exact counting ● fast approximate counting – using HLL in Hive ● comparing performance and accuracy ● appendix – a bit of theory of probabilistic counting ○ how it works?
  • 3. The problem: count distinct elements ● eg. the number of unique visitors ● each visitor can make a lot of clicks ● typically grouped in various ways ● "set cardinality estimation" problem
  • 4. Small data solutions ● sort the data O(N*log(N)) and skip duplicates O(N) ○ O(N) space ● put data into a hash or tree set and iterate ○ hash set: O(N^2) worst case build, O(N) iteration ○ tree set: O(N*log(N)) build, O(N) iteration ○ both O(N) space ● but: we have big data Example: ~100M unique values in 5B rows each day 32 bytes per value -> 3 GB unique, 150 GB total
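The two small-data approaches on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the sample `clicks` list are made up here, not taken from the talk.

```python
def count_distinct_sorted(values):
    """Sort (O(N log N)), then skip duplicates in one linear pass."""
    count = 0
    previous = object()  # sentinel that compares unequal to every real value
    for v in sorted(values):
        if v != previous:
            count += 1
            previous = v
    return count


def count_distinct_hashed(values):
    """Put everything into a hash set and take its size (O(N) space)."""
    return len(set(values))


# Hypothetical clickstream: three visitors, six clicks.
clicks = ["alice", "bob", "alice", "carol", "bob", "alice"]
print(count_distinct_sorted(clicks), count_distinct_hashed(clicks))
```

Both variants keep every distinct value in memory, which is exactly what stops working at the scale quoted in the example (about 3 GB of unique values per day).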
  • 5. Problems with counting big data ● data is partitioned ○ across many machines ○ in time ● we can't sum cardinality of each partition ○ since the subsets are generally not disjoint ○ we would overestimate count(part1) + count(part2) >= count(part1 ∪ part2) ● we need to merge estimators and then estimate cardinality count(estimator(part1) ∪ estimator(part2))
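A tiny worked example of the overestimate, using exact Python sets in place of estimators. The partition contents are invented for illustration.

```python
# Two partitions (e.g. two days) that share some visitors.
part1 = {"alice", "bob", "carol"}
part2 = {"bob", "carol", "dave"}

summed = len(part1) + len(part2)   # counts bob and carol twice
exact = len(part1 | part2)         # merge first, then count

print(summed, exact)  # 6 4
```

The sum is correct only when the partitions are disjoint, so whatever is kept per partition has to be mergeable before the count is taken.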
  • 6. SELECT COUNT(DISTINCT user_id) FROM events; single reducer! Exact counting in Hive
  • 7. Exact counting in Hive – subquery SELECT COUNT(*) FROM ( SELECT 1 FROM events GROUP BY user_id ) unique_guids; Or more concisely: SELECT COUNT(*) FROM ( SELECT DISTINCT user_id FROM events ) unique_guids; many reducers two phases cannot combine more aggregations
  • 8. Exact counting in Hive ● hive.optimize.distinct.rewrite ○ allows rewriting COUNT(DISTINCT) to a subquery ○ since Hive 1.2.0
  • 9.
  • 10. Probabilistic counting ● fast results, but approximate ● practical example of using HLL in Hive ● more theory in the appendix
  • 11. ● klout/brickhouse ○ single option ○ no JAR, some tests ○ based on HLL++ from stream-lib (quite fast) ● jdmaturen/hive-hll ○ no options (they are in API, but not implemented!) ○ no JAR, no tests ○ compatible with java-hll, pg-hll, js-hll ● t3rmin4t0r/hive-hll-udf ○ no options, no JAR, no tests Implementations of HLL as Hive UDFs
  • 12. ● User-Defined Functions ● function registered from a class (loaded from JAR) ● JAR needs to be on HDFS (otherwise it fails) ● you can choose the UDF name at will ● work both in HiveServer2/Beeline and Hive CLI ADD JAR hdfs:///path/to/the/library.jar; CREATE TEMPORARY FUNCTION foo_func AS 'com.example.foo.FooUDF'; ● Usage: SELECT foo_func(...) FROM ...; UDFs in Hive
  • 13. ● to_hll(value) ○ aggregate values to HLL ○ UDAF (aggregation function) ○ + hash each value ○ optionally can be configured (eg. for precision) ● union_hlls(hll) ○ union multiple HLLs ○ UDAF ● hll_approx_count(hll) ○ estimate cardinality from a HLL ○ UDF HLL can be stored as binary or string type. General UDFs API for HLL
  • 14. ● Estimate of total unique visitors: SELECT hll_approx_count(to_hll(user_id)) FROM events; ● Estimate of total events + unique visitors at once: SELECT count(*) AS total_events, hll_approx_count(to_hll(user_id)) AS unique_visitors FROM events; Example usage
  • 15. Example usage ● Compute each daily estimator once: CREATE TABLE daily_user_hll AS SELECT date, to_hll(user_id) AS users_hll FROM events GROUP BY date; ● Then quickly aggregate and estimate: SELECT hll_approx_count(union_hlls(users_hll)) AS user_count FROM daily_user_hll WHERE date BETWEEN '2015-01-01' AND '2015-01-31';
  • 16. https://github.com/klout/brickhouse - Hive UDF https://github.com/addthis/stream-lib - HLL++ $ git clone https://github.com/klout/brickhouse disable maven-javadoc-plugin in pom.xml (since it fails) $ mvn package $ wget http://central.maven.org/maven2/com/clearspring/analytics/stream/2.3.0/stream-2.3.0.jar $ scp target/brickhouse-0.7.1-SNAPSHOT.jar stream-2.3.0.jar cluster-host: cluster-host$ hdfs dfs -copyFromLocal *.jar /user/me/hive-libs Brickhouse – installation
  • 17. Brickhouse – usage ADD JAR /user/zamecnik/lib/brickhouse-0.7.1-15f5e8e.jar; ADD JAR /user/zamecnik/lib/stream-2.3.0.jar; CREATE TEMPORARY FUNCTION to_hll AS 'brickhouse.udf.hll.HyperLogLogUDAF'; CREATE TEMPORARY FUNCTION union_hlls AS 'brickhouse.udf.hll.UnionHyperLogLogUDAF'; CREATE TEMPORARY FUNCTION hll_approx_count AS 'brickhouse.udf.hll.EstimateCardinalityUDF'; to_hll(value, [bit_precision]) ● bit_precision: 4 to 16 (default 6)
  • 18. Hive-hll usage ADD JAR /user/zamecnik/lib/hive-hll-0.1-2807db.jar; CREATE TEMPORARY FUNCTION hll_hash AS 'com.kresilas.hll.HashUDF'; CREATE TEMPORARY FUNCTION to_hll AS 'com.kresilas.hll.AddAggUDAF'; CREATE TEMPORARY FUNCTION union_hlls AS 'com.kresilas.hll.UnionAggUDAF'; CREATE TEMPORARY FUNCTION hll_approx_count AS 'com.kresilas.hll.CardinalityUDF';
  • 19. We have to explicitly hash the value: SELECT hll_approx_count(to_hll(hll_hash(user_id))) FROM events; Options for creating HLL: to_hll(x, [log2m, regwidth, expthresh, sparseon]) hardcoded to: [log2m=11, regwidth=5, expthresh=-1, sparseon=true] Hive-hll usage
  • 20. Nice things ● HLLs are additive ○ can be computed once ○ various partitions can be merged and estimated for cardinality later ● we can count multiple unique columns at once ○ no need to subquery ○ we can do wild grouping (by country, browser, …) ● HLLs take only little space
  • 21. Rolling window -- keep a reasonable number of tasks for a month of data SET mapreduce.input.fileinputformat.split.maxsize=5368709120; -- keep a low number of output files (HLLs are quite small) SET hive.merge.mapredfiles=true; -- maximum precision SET hivevar:hll_precision=16; -- HLL for each day CREATE TABLE guids_parquet_hll AS SELECT '${year}' AS year, '${month}' AS month, day, to_hll(guid, ${hll_precision}) AS guid_hll FROM parquet.dump_${year}_${month} GROUP BY day;
  • 22. -- for each day, estimate the number of guids over the last 7 days (current day + 6 preceding) CREATE TABLE zamecnik.guids_parquet_rolling_30_day_count AS SELECT `date`, hll_approx_count(guids_union) AS guid_count FROM ( SELECT concat(`year`, '-', `month`, '-', `day`) AS `date`, union_hlls(guid_hll) OVER w AS guids_union FROM guids_parquet_hll WINDOW w AS ( ORDER BY `year`, `month`, `day` ROWS 6 PRECEDING ) ) rolling_guids; Rolling window
  • 23. ● when JARs are not on HDFS the query fails (why?) ● computing on many days of raw clickstream fails in Beeline (works in Hive CLI), parquet is ok ● HIVE-9073 WINDOW + custom UDAF → NPE ○ fixed in Hive 1.2.0 ● DISTRO-631 Pitfalls
  • 24. Approximation error ● Typically < 1-2 % ● Can be controlled by the parameters ● Example: 1 year of guids
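To make "can be controlled by the parameters" concrete: the standard analysis of HyperLogLog gives a relative standard error of roughly 1.04 / sqrt(m), where m = 2^precision is the number of registers. The sketch below just evaluates that formula for the precisions mentioned in this deck; whether a particular UDF matches it exactly (especially at small cardinalities) is an assumption, and the function name is made up here.

```python
import math


def hll_relative_error(bit_precision):
    """Approximate relative standard error of HLL with 2**bit_precision registers."""
    m = 2 ** bit_precision
    return 1.04 / math.sqrt(m)


# 6 = Brickhouse default, 11 = hive-hll's hardcoded log2m, 16 = maximum used above
for p in (6, 11, 16):
    print(f"precision={p:2d}  registers={2 ** p:5d}  error~{hll_relative_error(p):.2%}")
```

Each extra bit of precision doubles the number of registers and shrinks the error by a factor of about 1.41, so the 1-2 % range corresponds to roughly 11 to 14 bits.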
  • 25. Appendix – more interesting things
  • 26. ● trade-off: some approximation error for far better performance and memory consumption ● sketch - streaming & probabilistic algorithm ● KMV - k minimal values ● linear counter ● loglog counter Probabilistic counting
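As an illustration of the simplest sketch listed here, a KMV (k minimal values) counter might look as follows. It hashes each value to a number in [0, 1), keeps only the k smallest distinct hashes, and estimates the cardinality as (k - 1) / (k-th smallest hash). The helper names are hypothetical, and SHA-1 from the standard library stands in for a faster hash such as murmur3.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def kmv_estimate(values, k=1024):
    """Estimate the number of distinct values while storing at most k hashes."""
    kept = set()   # the k smallest distinct hashes seen so far
    heap = []      # same hashes, negated, so heap[0] is the largest kept hash
    for v in values:
        h = hash01(v)
        if h in kept:
            continue
        if len(heap) < k:
            kept.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))      # fewer than k distinct values: exact
    return (k - 1) / -heap[0]        # k-th smallest hash is the largest kept


# 60,000 events from 20,000 distinct (made-up) users.
events = [f"user-{i % 20000}" for i in range(60000)]
print(round(kmv_estimate(events)))
```

Memory is bounded by k regardless of the input size; the price is a relative error on the order of 1 / sqrt(k).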
  • 27. LogLog counter ● run length of initial zeros ● multiple estimators (registers) ● stochastic averaging ○ single hash function ○ multiple buckets ● hash → (register index, run length)
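The "hash → (register index, run length)" step can be sketched like this, assuming a 64-bit hash whose top p bits select the register. The stored quantity is the run of leading zeros in the remaining bits plus one (the position of the first 1 bit), which is the usual convention; `split_hash` and `update` are illustrative names only.

```python
def split_hash(h, p):
    """Split a 64-bit hash into (register index, rank).

    index: the top p bits
    rank:  number of leading zeros in the remaining 64 - p bits, plus one
    """
    width = 64 - p
    index = h >> width
    rest = h & ((1 << width) - 1)
    rank = width - rest.bit_length() + 1
    return index, rank


def update(registers, h, p):
    """Stochastic averaging: each register keeps the maximum rank it has seen."""
    index, rank = split_hash(h, p)
    registers[index] = max(registers[index], rank)


p = 4
registers = [0] * (1 << p)
# Hand-built hash: index bits 1010, then two zero bits, then a one.
h = (0b1010 << 60) | (1 << 57)
update(registers, h, p)
print(split_hash(h, p), registers[10])  # (10, 3) 3
```

A single hash function thus feeds many small estimators at once, which is what the slide calls stochastic averaging.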
  • 28. Linear counter
        import math
        import mmh3
        from bitarray import bitarray
        m = 20                          # size of the register
        register = bitarray(m)          # register, m bits
        register.setall(0)              # start with all bits cleared
        def add(value):
            h = mmh3.hash(value) % m    # select bit index
            register[h] = 1             # = max(1, register[h])
        def cardinality():
            u_n = register.count(0)     # number of zeros (must be > 0)
            v_n = float(u_n) / m        # relative number of zeros
            n_hat = -m * math.log(v_n)  # estimate of the set cardinality
            return n_hat
  • 29. ● structure like loglog counter ● harmonic mean to combine registers ● correction for small and large cardinalities ● values need to be hashed well – murmur3 HyperLogLog (HLL)
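A compact, unoptimized sketch of the ingredients on this slide: LogLog-style registers, a harmonic mean to combine them, and the small-range (linear counting) correction. Several things here are assumptions rather than the talk's code: the class name, the use of SHA-1 from the standard library instead of murmur3, and the omission of the large-range correction, which matters for 32-bit hashes but not for the 64-bit hash used below. The bias constant is the one from the original HLL analysis and is valid for m >= 128.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        if p < 7:
            raise ValueError("this sketch uses the alpha formula for m >= 128")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        width = 64 - self.p
        index = h >> width                           # top p bits
        rest = h & ((1 << width) - 1)
        rank = width - rest.bit_length() + 1         # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # Harmonic mean of 2**register values, scaled by alpha * m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)           # small range: linear counting
        return raw


hll = HyperLogLog(p=12)
for i in range(50000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")    # duplicates do not change the registers
print(round(hll.cardinality()))
```

With p=12 the sketch holds 4096 small registers and should land within a few percent of the true 50,000.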
  • 30. HLL union ● just take max of each register value ● no loss – same result as HLL of union of streams ● parallelizable ● union preserves error bound, intersection/diff do not
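The "no loss" claim can be checked directly: merging two register arrays by element-wise maximum gives exactly the registers obtained by feeding both streams into one sketch. This is a minimal illustration with invented helper names and SHA-1 standing in for murmur3; it builds only the registers and leaves out the estimation step.

```python
import hashlib

P = 10            # bits used for the register index
M = 1 << P        # number of registers


def registers_for(values):
    """Build HLL-style registers (max rank per bucket) for a stream of values."""
    regs = [0] * M
    width = 64 - P
    for v in values:
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> width
        rest = h & ((1 << width) - 1)
        rank = width - rest.bit_length() + 1
        regs[index] = max(regs[index], rank)
    return regs


def union(a, b):
    """Union of two sketches: take the max of each register."""
    return [max(x, y) for x, y in zip(a, b)]


# Two overlapping partitions (made-up user ids).
part1 = [f"user-{i}" for i in range(0, 3000)]
part2 = [f"user-{i}" for i in range(2000, 5000)]

merged = union(registers_for(part1), registers_for(part2))
direct = registers_for(part1 + part2)
print(merged == direct)  # True
```

Because max is associative and commutative, partitions can be merged in any order and in parallel, which is what makes the daily-HLL tables shown earlier work.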
  • 31. Further reading ● very nice explanation of HLL ● Probabilistic Data Structures For Web Analytics And Data Mining ● Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure ● HyperLogLog in Pure SQL ● Use Subqueries to Count Distinct 50X Faster ● It is possible to combine HLL of different sizes
  • 32. Papers ● HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm ● https://github.com/addthis/stream-lib#cardinality
  • 33. Other problems & structures ● set membership – bloom filter ● top-k elements – count-min-sketch, stream-summary