HyperLogLog Intuition Without the Hard Math

HyperLogLog probabilistic data structures make some potent big data analytics magic. Making HLL work requires hard math. Understanding why it works does not.

HyperLogLog Intuition (without the hard math)
Sim Simeonov, Founder & CTO, Swoop
@simeons / sim at swoop dot com

The following presentation takes poetic
license in order to provide intuition

Q: How do we quickly count the number of
distinct things in some collection?
A: Since “things” is fuzzy, hash them to
simplify the problem to…
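A minimal sketch of the hashing step, assuming SHA-256 truncated to 64 bits as the hash (the deck does not name a hash function, and `hash_to_int` is a hypothetical helper). Equal things always map to the same number, so counting distinct things becomes counting distinct numbers:

```python
import hashlib

def hash_to_int(thing):
    """Map any printable value to a pseudo-random integer in [0, 2**64)."""
    digest = hashlib.sha256(repr(thing).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# A mixed bag of "things" with repeats.
things = ["apple", "pear", "apple", ("user", 42), ("user", 42), 3.14]
hashes = [hash_to_int(t) for t in things]

# Exact counting still needs to remember every hash; the rest of the deck
# is about avoiding that.
distinct_hashes = len(set(hashes))
print(distinct_hashes)
```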

Q: How do we quickly determine the
cardinality (size) of a set of n numbers?
A: Quickly means using fewer resources.
Assume we only have k buckets…

Distribute n items randomly in k buckets
$E(\text{distance}) \cong \frac{k}{n}$
$E(\text{min}) \cong \frac{k}{n}$
$\Rightarrow n \cong \frac{k}{E(\text{min})}$
more buckets == greater precision

We can estimate n from k and
the position of the first bucket…
without keeping any of the n numbers
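The claim can be checked with a small simulation. This is a sketch under assumed values (k, n, the number of trials and the seed are all arbitrary choices, and `smallest_position` is a hypothetical helper): each trial drops n items at random positions in [0, k) and remembers only the smallest position.

```python
import random

def smallest_position(n, k, rng):
    """Drop n items at random positions in [0, k) and keep only the minimum."""
    smallest = float(k)
    for _ in range(n):
        smallest = min(smallest, rng.random() * k)
    return smallest

k, n, trials = 1_000_000, 500, 2_000
rng = random.Random(42)  # fixed seed so the run is repeatable

# Average the minimum over many trials to approximate E(min).
mins = [smallest_position(n, k, rng) for _ in range(trials)]
mean_min = sum(mins) / trials

# Invert E(min) ~ k/n to recover n without having stored any item.
n_estimate = k / mean_min
print(f"E(min) ~ {mean_min:.0f}, k/n = {k / n:.0f}, n estimate ~ {n_estimate:.0f}")
```

A single trial on its own is very noisy, because one minimum can land almost anywhere near the start of the range. That is the motivation for the next slide.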

Q: How do we improve the precision
of our estimate?
A: Use a collection of buckets and use the
mean of the estimates created from each.
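One way to read this, sketched under assumptions: each collection gets its own salted hash (the salting scheme, `position` and `estimate_cardinality` are hypothetical, not taken from the deck), keeps only its minimum, and the minimums are averaged before inverting. Dividing k by the average minimum is the same as taking the harmonic mean of the per-collection estimates k/min; a plain arithmetic mean of those estimates would be dominated by any collection that happened to see a tiny minimum.

```python
import hashlib

K = 2**32  # number of bucket positions in each collection

def position(item, salt):
    """Hypothetical salted hash: the item's position in collection `salt`, in [0, K)."""
    data = f"{salt}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def estimate_cardinality(items, collections=256):
    """Keep one minimum per collection, then estimate n as K / mean(min)."""
    mins = [K] * collections
    for item in items:
        for salt in range(collections):
            mins[salt] = min(mins[salt], position(item, salt))
    mean_min = sum(mins) / collections
    return K / mean_min

# 1,000 distinct values, each seen twice; duplicates hash identically
# and therefore cannot change any minimum.
stream = [i % 1_000 for i in range(2_000)]
estimate = estimate_cardinality(stream)
print(round(estimate))
```

Memory use is 256 numbers regardless of the stream length. The cost in this naive version is 256 hash computations per item, which the partitioning trick later in the deck removes.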

HLL sketch ≅ a distribution of mins
[Figure: distribution of mins, with the true mean marked]

HyperLogLog sketches are reaggregatable
because min reaggregates with min
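A small sketch of why this works, using a toy sketch format of one minimum per salted collection (the format and the `position`, `sketch` and `merge` helpers are assumptions for illustration, not the spark-alchemy format). The minimum over a union of two sets is the smaller of the two minimums, so merging two sketches gives exactly the sketch of the combined data:

```python
import hashlib

K = 2**32          # bucket positions per collection
COLLECTIONS = 64   # number of minimums kept in one sketch

def position(item, salt):
    """Hypothetical salted hash: the item's position in collection `salt`."""
    data = f"{salt}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def sketch(items):
    """Toy sketch: the smallest position seen in each collection."""
    mins = [K] * COLLECTIONS
    for item in items:
        for salt in range(COLLECTIONS):
            mins[salt] = min(mins[salt], position(item, salt))
    return mins

def merge(a, b):
    """Reaggregate two sketches: element-wise min."""
    return [min(x, y) for x, y in zip(a, b)]

monday = sketch(range(0, 600))        # users 0..599
tuesday = sketch(range(400, 1000))    # users 400..999, overlapping Monday
both_days = sketch(range(0, 1000))    # built from the raw data of both days

merged = merge(monday, tuesday)
print(merged == both_days)
```

The overlap between the two days is not double counted, which is what makes pre-aggregated sketches safe to roll up by day, week or any other grouping.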

Making it work in the real world
• Data is not uniformly distributed…
  • Hash it!
• How do we get many “samples” from one set of hashes?
  • Partition them!
• Can we get a good estimate for the mean?
  • Yes, with some fancy math & empirical corrections.
• Do we actually have to keep the minimums?
  • No, just keep the number of 0s before the first 1 in binary form.
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
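The four answers above can be combined into a simplified HyperLogLog. This is a sketch under assumptions, not the spark-alchemy implementation: SHA-256 truncated to 64 bits stands in for the hash, P = 10 is an arbitrary precision, and the constant and small-range correction follow the commonly published form of the algorithm. A small hash value starts with many 0 bits, so keeping the largest run of leading 0s in each partition is a compressed way of keeping the minimum.

```python
import hashlib
import math

P = 10                             # the first P bits of a hash pick the partition
M = 1 << P                         # number of partitions (registers)
ALPHA = 0.7213 / (1 + 1.079 / M)   # bias-correction constant, valid for M >= 128

def hash64(item):
    """Assumed hash: first 64 bits of SHA-256."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def rank(w, width):
    """Number of 0s before the first 1 in the width-bit word w, plus one."""
    return width - w.bit_length() + 1

def add(registers, item):
    h = hash64(item)
    idx = h >> (64 - P)                    # partition: top P bits
    rest = h & ((1 << (64 - P)) - 1)       # remaining 64 - P bits
    registers[idx] = max(registers[idx], rank(rest, 64 - P))

def estimate(registers):
    # Harmonic mean of the per-partition estimates 2**rank, scaled by ALPHA * M.
    raw = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)     # small-range correction
    return raw

registers = [0] * M
for i in range(100_000):
    add(registers, i % 50_000)             # 50,000 distinct values, each seen twice
approx = estimate(registers)
print(round(approx))
```

With 1,024 registers the typical error is a few percent, and each register only has to hold a number no larger than 55. Two such sketches merge with an element-wise max, the counterpart of the min merge on the earlier slide.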

Improving patient outcomes

LEADING HEALTH DATA
• 280M unique US patients
• 7 years longitudinal data
• De-identified, HIPAA-safe
1st Party Data: Proprietary tech to integrate data
NPI Data: Attributed to the patient
Claims: ICD 9 or 10, CPT, Rx and J codes

LEADING CONSUMER DATA
• 300M US Consumers
• 3,500+ consumer attributes
• De-identified, privacy-safe
Lifestyle: Magazine subscriptions, Catalog purchases
Psychographics: Animal lover, Fisherman
Demographics: Property records, Internet transactions

Petabyte scale privacy-preserving ML/AI

Calls to Action
• Experiment with the HLL functions in spark-alchemy.
• Keep big data in Spark only and interop with HLL sketches.
Do you want to make Spark great while improving millions of lives?
Let’s talk.
sim at swoop dot com

11. http://bit.ly/spark-alchemy

12. http://bit.ly/spark-records
