SlideShare a Scribd company logo
1 of 7
Probabilistic
Data Structures
By,
Shrinivas Vasala
Outline
• Basic Idea
• Hyperloglog
• Bloom filters
• Count-min sketches
• References & further reading
2
Basic Idea
• Problem: you have a lot of data to count,
track or otherwise analyze
• Often an approximate answer is sufficient,
especially if you can place bounds on how
wrong the approximation is likely to be
• IO is the most expensive
3
Hyperloglog
• Originally described by Flajolet and colleagues in 2007
• Can estimate cardinalities well beyond 109 with a relative accuracy
(standard error) of 2% while only using 1.5kb of memory
• Hashing turns anything into uniform distribution
• If the maximum number of leading zeros observed is ’n’, an estimate
for the number of distinct elements in the set is 2n
• Tuning Precision : set is split into multiple subsets
• Harmonic mean + low/high sampling adjustments  Results
• Increasing the number of bits of your hash increases the highest
possible number you can accurately approximate
• Commonly used hashing algorithm: Murmur
• Implementation in Redis
4
Bloom filters
• Conceived by Burton Howard Bloom in 1970
• Used to test whether an element is a member of a set
• Query returns either "possibly in set" or "definitely not in set“
• An empty Bloom filter is a bit array of ‘m’ bits, all set to 0
• ‘k’ hash functions map the set element to one of the m array
positions with a uniform random distribution
• The bits at all these positions are set to 1
• To query, hash with same hash functions and check if all positions
are set, if not its surely not in the set else maybe present in set
• For an optimal value of k with 1% error each element requires only
about 9.6 bits — regardless of the size of the elements
5
1. Choose a ballpark value for no. of elements in the set ‘n’
2. Choose a value for m
3. Calculate the optimal value of k [ k = (m/n) ln 2 ]
4. Calculate the error rate ‘p’ [ p = (1-e-kn/m)k ]
If p is unacceptable, return to step 2 and change m;
otherwise we're done.
Count-min sketches
• Invented by Graham Cormode and S. Muthu Muthukrishnan in 2003
• How many of each item there is in an collection
• Sketch is a compact summary of a large amount of data, is a 2D array
of w columns and d rows
• Each box is a counter
• Each row is indexed by a
corresponding hash function
• Estimated frequency for
‘Something’ is min(a,b,c,d)
• ‘w’ limits the magnitude of the error [ error <= 2 * n/w ]
• ‘d’ controls the probability that the estimation is greater than the error
[probability limit exceeded = 1 – (1/2) ** d]
• Works best on Skewed data
References & further reading
Other Topics
• MinHash is a technique for quickly estimating how similar two sets are
• Quotient filters are AMQs (approximate membership query) and, provide many of the same
benefits as Bloom filters
• Skip list is a data structure that allows fast search within an ordered sequence of elements
Hyperloglog
1. http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html
2. http://www.slideshare.net/c.titus.brown/2013-py-con-awesome-big-data-algorithms
3. http://en.wikipedia.org/wiki/HyperLogLog
Bloom filters
1. http://billmill.org/bloomfilter-tutorial/
2. http://en.wikipedia.org/wiki/Bloom_filter
Count-min sketches
1. http://www.slideshare.net/StampedeCon/a-survey-of-probabilistic-data-structures-stampedecon-
2012
2. http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
3. https://sites.google.com/site/countminsketch/home
7

More Related Content

What's hot

Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Kira
 
Bloom filter
Bloom filterBloom filter
Bloom filterwang ping
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filterxlight
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Flink Forward
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsAlbert Bifet
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonChristopher Conlan
 
Bloom filter
Bloom filterBloom filter
Bloom filterfeng lee
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsChristopher Conlan
 
STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.Albert Bifet
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Concept of hashing
Concept of hashingConcept of hashing
Concept of hashingRafi Dar
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data ScienceAlbert Bifet
 

What's hot (20)

Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Bloom filters
Bloom filtersBloom filters
Bloom filters
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data Scientists
 
STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.STRIP: stream learning of influence probabilities.
STRIP: stream learning of influence probabilities.
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Hashing
HashingHashing
Hashing
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Concept of hashing
Concept of hashingConcept of hashing
Concept of hashing
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Hash tables
Hash tablesHash tables
Hash tables
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
 

Viewers also liked

Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easynathanmarz
 
ReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений СафроненкоReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений СафроненкоPechaKucha Ukraine
 
HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?bzamecnik
 
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]Qrator Labs
 
Hyper loglog
Hyper loglogHyper loglog
Hyper loglognybon
 
Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017Roman Elizarov
 

Viewers also liked (7)

Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
 
ReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений СафроненкоReqLabs PechaKucha Евгений Сафроненко
ReqLabs PechaKucha Евгений Сафроненко
 
Big Data aggregation techniques
Big Data aggregation techniquesBig Data aggregation techniques
Big Data aggregation techniques
 
HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?HyperLogLog in Hive - How to count sheep efficiently?
HyperLogLog in Hive - How to count sheep efficiently?
 
Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]Анализ количества посетителей на сайте [Считаем уникальные элементы]
Анализ количества посетителей на сайте [Считаем уникальные элементы]
 
Hyper loglog
Hyper loglogHyper loglog
Hyper loglog
 
Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017Deep dive into Coroutines on JVM @ KotlinConf 2017
Deep dive into Coroutines on JVM @ KotlinConf 2017
 

Similar to Probabilistic data structures

Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili SaghafiMachine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili SaghafiProfessor Lili Saghafi
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
GBM package in r
GBM package in rGBM package in r
GBM package in rmark_landry
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means ClusteringJunghoon Kim
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3Open Analytics
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
Random finite set filters for superpositon type sensors
Random finite set filters for superpositon type sensorsRandom finite set filters for superpositon type sensors
Random finite set filters for superpositon type sensorsDaniel Hauschildt
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)roshmat
 
Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Kush Kulshrestha
 
Performance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering AlgorithmPerformance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering AlgorithmIOSR Journals
 
Decision Trees
Decision TreesDecision Trees
Decision TreesStudent
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structuresYoav chernobroda
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Maninda Edirisooriya
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptxJK970901
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 

Similar to Probabilistic data structures (20)

Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili SaghafiMachine learning by using python Lesson One Part 2 By Professor Lili Saghafi
Machine learning by using python Lesson One Part 2 By Professor Lili Saghafi
 
Amortized analysis
Amortized analysisAmortized analysis
Amortized analysis
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means Clustering
 
2013 open analytics_countingv3
2013 open analytics_countingv32013 open analytics_countingv3
2013 open analytics_countingv3
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Random finite set filters for superpositon type sensors
Random finite set filters for superpositon type sensorsRandom finite set filters for superpositon type sensors
Random finite set filters for superpositon type sensors
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees Machine Learning Algorithm - Decision Trees
Machine Learning Algorithm - Decision Trees
 
Performance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering AlgorithmPerformance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering Algorithm
 
F017132529
F017132529F017132529
F017132529
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
 
K-means and GMM
K-means and GMMK-means and GMM
K-means and GMM
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 

Recently uploaded

Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 

Recently uploaded (20)

Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 

Probabilistic data structures

  • 2. Outline • Basic Idea • Hyperloglog • Bloom filters • Count-min sketches • References & further reading 2
  • 3. Basic Idea • Problem: you have a lot of data to count, track or otherwise analyze • Often an approximate answer is sufficient, especially if you can place bounds on how wrong the approximation is likely to be • IO is the most expensive 3
  • 4. Hyperloglog • Originally described by Flajolet and colleagues in 2007 • Can estimate cardinalities well beyond 109 with a relative accuracy (standard error) of 2% while only using 1.5kb of memory • Hashing turns anything into uniform distribution • If the maximum number of leading zeros observed is ’n’, an estimate for the number of distinct elements in the set is 2n • Tuning Precision : set is split into multiple subsets • Harmonic mean + low/high sampling adjustments  Results • Increasing the number of bits of your hash increases the highest possible number you can accurately approximate • Commonly used hashing algorithm: Murmur • Implementation in Redis 4
  • 5. Bloom filters • Conceived by Burton Howard Bloom in 1970 • Used to test whether an element is a member of a set • Query returns either "possibly in set" or "definitely not in set“ • An empty Bloom filter is a bit array of ‘m’ bits, all set to 0 • ‘k’ hash functions map the set element to one of the m array positions with a uniform random distribution • The bits at all these positions are set to 1 • To query, hash with same hash functions and check if all positions are set, if not its surely not in the set else maybe present in set • For an optimal value of k with 1% error each element requires only about 9.6 bits — regardless of the size of the elements 5 1. Choose a ballpark value for no. of elements in the set ‘n’ 2. Choose a value for m 3. Calculate the optimal value of k [ k = (m/n) ln 2 ] 4. Calculate the error rate ‘p’ [ p = (1-e-kn/m)k ] If p is unacceptable, return to step 2 and change m; otherwise we're done.
  • 6. Count-min sketches • Invented by Graham Cormode and S. Muthu Muthukrishnan in 2003 • How many of each item there is in an collection • Sketch is a compact summary of a large amount of data, is a 2D array of w columns and d rows • Each box is a counter • Each row is indexed by a corresponding hash function • Estimated frequency for ‘Something’ is min(a,b,c,d) • ‘w’ limits the magnitude of the error [ error <= 2 * n/w ] • ‘d’ controls the probability that the estimation is greater than the error [probability limit exceeded = 1 – (1/2) ** d] • Works best on Skewed data
  • 7. References & further reading Other Topics • MinHash is a technique for quickly estimating how similar two sets are • Quotient filters are AMQs (approximate membership query) and, provide many of the same benefits as Bloom filters • Skip list is a data structure that allows fast search within an ordered sequence of elements Hyperloglog 1. http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html 2. http://www.slideshare.net/c.titus.brown/2013-py-con-awesome-big-data-algorithms 3. http://en.wikipedia.org/wiki/HyperLogLog Bloom filters 1. http://billmill.org/bloomfilter-tutorial/ 2. http://en.wikipedia.org/wiki/Bloom_filter Count-min sketches 1. http://www.slideshare.net/StampedeCon/a-survey-of-probabilistic-data-structures-stampedecon- 2012 2. http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch 3. https://sites.google.com/site/countminsketch/home 7