Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes that apply poorly to real-time use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average conversion price, it is obvious that raw data can be summarized efficiently, for example on a daily basis or using simple in-stream counters. Computing more advanced metrics, such as the number of unique visitors or the most frequent items, is more challenging and requires substantial resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics, trading estimation precision for memory consumption.
2. Outline
• Basic Idea
• HyperLogLog
• Bloom filters
• Count-min sketches
• References & further reading
3. Basic Idea
• Problem: you have a lot of data to count,
track, or otherwise analyze
• Often an approximate answer is sufficient,
especially if you can place bounds on how
wrong the approximation is likely to be
• I/O is the most expensive operation, so keeping the summary small enough to stay in memory is the win
4. HyperLogLog
• Originally described by Flajolet and colleagues in 2007
• Can estimate cardinalities well beyond 10^9 with a relative accuracy
(standard error) of 2% while using only 1.5 kB of memory
• Hashing turns any input into a uniformly distributed value
• If the maximum number of leading zeros observed is n, an estimate
for the number of distinct elements in the set is 2^n
• Tuning precision: the set is split into multiple subsets, each tracked separately
• Results are combined with a harmonic mean, plus low- and high-range sampling adjustments
• Increasing the number of bits in your hash increases the highest
possible cardinality you can accurately approximate
• Commonly used hash function: MurmurHash
• An implementation ships with Redis (the PFADD / PFCOUNT commands)
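The scheme above can be sketched in Python. This is an illustrative toy, not a production implementation: it uses SHA-1 in place of Murmur, 2^p registers, and the commonly cited bias constant and small-range (linear counting) correction:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch with 2**p registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                 # number of registers (subsets)
        self.registers = [0] * self.m

    def _hash(self, value):
        # 64-bit uniform hash; Murmur is the usual choice, SHA-1 works for a demo
        digest = hashlib.sha1(str(value).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        x = self._hash(value)
        idx = x & (self.m - 1)          # low p bits select a register
        w = x >> self.p                 # remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in w
        rank = (64 - self.p) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # harmonic mean of 2**register values, scaled by the bias constant
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / z
        # low-range adjustment: fall back to linear counting
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)
```

With p = 10 (1024 registers, roughly 1 kB of state) the standard error is about 1.04/sqrt(1024) ≈ 3%, which matches the accuracy claim above.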
5. Bloom filters
• Conceived by Burton Howard Bloom in 1970
• Used to test whether an element is a member of a set
• A query returns either "possibly in set" or "definitely not in set"
• An empty Bloom filter is a bit array of m bits, all set to 0
• k hash functions each map a set element to one of the m array
positions with a uniform random distribution
• To add an element, the bits at all k positions are set to 1
• To query, hash with the same hash functions and check the k positions:
if any bit is 0, the element is definitely not in the set; if all are 1, it may be present
• For an optimal value of k with a 1% error rate, each element requires only
about 9.6 bits, regardless of the size of the elements
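The add/query procedure above can be sketched in Python. This is an illustrative toy: it derives the k positions by double hashing from a single SHA-256 digest, a common trick that stands in for k independent hash functions:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: m bits, k derived hash positions."""

    def __init__(self, m, k):
        self.m = m                          # number of bits
        self.k = k                          # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # double hashing: position_i = (h1 + i*h2) mod m
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # set the bit at each of the k positions
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # all k bits set -> "possibly in set"; any bit clear -> "definitely not"
        return all((self.bits[pos >> 3] >> (pos & 7)) & 1
                   for pos in self._positions(item))
```

Sizing it at 9.6 bits per element with k = 7 (e.g. m = 9600 for 1000 elements) reproduces the roughly 1% false-positive rate quoted above, and membership tests never produce false negatives.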
1. Choose a ballpark value n for the number of elements in the set
2. Choose a value for m, the number of bits
3. Calculate the optimal value of k [ k = (m/n) ln 2 ]
4. Calculate the error rate p [ p = (1 - e^(-kn/m))^k ]
If p is unacceptable, return to step 2 and change m;
otherwise we're done.
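The sizing steps above take a few lines of Python to check; the n and m values below are arbitrary examples chosen to hit the 9.6-bits-per-element figure:

```python
import math

def bloom_params(n, m):
    """Optimal k and resulting false-positive rate for n items in m bits."""
    k = max(1, round((m / n) * math.log(2)))   # step 3: k = (m/n) ln 2
    p = (1 - math.exp(-k * n / m)) ** k        # step 4: p = (1 - e^(-kn/m))^k
    return k, p

# 1,000,000 expected elements in 9,600,000 bits = 9.6 bits/element
k, p = bloom_params(n=1_000_000, m=9_600_000)  # k = 7, p ≈ 0.01
```

If the computed p is too high, increase m and repeat, exactly as the loop above prescribes.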
6. Count-min sketches
• Invented by Graham Cormode and S. Muthu Muthukrishnan in 2003
• Estimates how many of each item there are in a collection
• A sketch is a compact summary of a large amount of data: a 2D array
of w columns and d rows
• Each cell is a counter
• Each row is indexed by a
corresponding hash function
• The estimated frequency of an item is the minimum of its d counters,
e.g. min(a, b, c, d)
• w limits the magnitude of the error [ error <= 2n/w ]
• d controls the probability that the estimate exceeds that error bound
[ P(error <= 2n/w) >= 1 - (1/2)^d ]
• Works best on skewed data
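The w-by-d counter array can be sketched in Python. This is an illustrative toy: each row's hash function is simulated by salting a SHA-256 digest with the row index:

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: d rows of w counters."""

    def __init__(self, w, d):
        self.w = w      # columns: bounds the error to 2n/w
        self.d = d      # rows: failure probability shrinks as (1/2)**d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        # one hash function per row, simulated by salting with the row number
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        # increment one counter in every row
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # take the minimum across rows; collisions only inflate counters,
        # so the estimate never undercounts
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))
```

Because collisions can only add to a counter, estimates are always greater than or equal to the true count, and on skewed data the heavy hitters dominate their cells, which is why the technique works best there.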
7. References & further reading
Other Topics
• MinHash is a technique for quickly estimating how similar two sets are
• Quotient filters are AMQs (approximate membership query structures) and provide many of the same
benefits as Bloom filters
• A skip list is a data structure that allows fast search within an ordered sequence of elements
HyperLogLog
1. http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html
2. http://www.slideshare.net/c.titus.brown/2013-py-con-awesome-big-data-algorithms
3. http://en.wikipedia.org/wiki/HyperLogLog
Bloom filters
1. http://billmill.org/bloomfilter-tutorial/
2. http://en.wikipedia.org/wiki/Bloom_filter
Count-min sketches
1. http://www.slideshare.net/StampedeCon/a-survey-of-probabilistic-data-structures-stampedecon-2012
2. http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
3. https://sites.google.com/site/countminsketch/home