Pre-aggregation is a powerful analytics technique as long as the measures being computed are reaggregable. Counts reaggregate with SUM, minimums with MIN, maximums with MAX, etc. The odd one out is distinct counts, which are not reaggregable (see the illustration just below).
Traditionally, the non-reaggregability of distinct counts leads to an implicit restriction: whichever system computes distinct counts has to have access to the most granular data and touch every row at query time. Because of this, in typical analytics architectures, where fast query response times are required, raw data has to be duplicated between Spark and another system such as an RDBMS. This talk is for everyone who computes or consumes distinct counts and for everyone who doesn’t understand the magical power of HyperLogLog (HLL) sketches.
We will break through the limits of traditional analytics architectures using the advanced HLL functionality and cross-system interoperability of the spark-alchemy open-source library, whose capabilities go beyond what is possible with OSS Spark, Redshift or even BigQuery. We will uncover patterns for 1000x gains in analytic query performance without data duplication and with significantly less capacity.
We will explore real-world use cases from Swoop’s petabyte-scale systems, improve data privacy when running analytics over sensitive data, and even see how a real-time analytics frontend running in a browser can be provisioned with data directly from Spark.
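As a quick illustration of reaggregability, here is a minimal Scala sketch (the data is made up): per-partition counts recombine with SUM, but per-partition distinct counts do not.

// Minimal illustration of reaggregability with plain Scala collections.
val partitions = Seq(Seq("a", "b", "b"), Seq("b", "c"))
// Counts reaggregate: summing per-partition counts gives the global count.
val totalCount = partitions.map(_.size).sum             // 5, same as partitions.flatten.size
// Distinct counts do not: summing per-partition distinct counts overcounts
// any element that appears in more than one partition.
val naiveDistinct = partitions.map(_.distinct.size).sum // 4
val trueDistinct  = partitions.flatten.distinct.size    // 3 ("a", "b", "c")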
2. Improving patient outcomes
Leading health data:
• 280M unique US patients
• 7 years longitudinal data
• De-identified, HIPAA-safe
• Claims (ICD 9 or 10, CPT, Rx and J codes) and NPI data, attributed to the patient
• 1st-party data, integrated with proprietary tech
Leading consumer data:
• 300M US consumers
• 3,500+ consumer attributes: lifestyle (magazine subscriptions, catalog purchases), psychographics (animal lover, fisherman), demographics (property records, internet transactions)
• De-identified, privacy-safe
Petabyte-scale privacy-preserving ML/AI
9. root
|-- date: date
|-- generic: string
|-- brand: string
|-- product: string
|-- patient_id: long
|-- doctor_id: long
Demo system: prescriptions in 2018
• Narrow sample
• 10.7 billion rows / 150 GB
• Small-ish Spark 2.4 cluster
• 80 cores, 600 GB RAM
• Delta Lake, fully cached
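A hypothetical setup for a demo like this, assuming a spark session and an illustrative path (neither is from the talk):

// Load the Delta Lake table and cache it fully, matching the demo setup.
val prescriptions = spark.read.format("delta").load("/data/prescriptions_2018")
prescriptions.createOrReplaceTempView("prescriptions")
spark.sql("CACHE TABLE prescriptions")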
10. select * from prescriptions
Result columns: brand name, generic name, National Drug Code (NDC)
11. select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescriptions
group by 1
order by 1
Count scripts, generics & brands by month
Time: 145 secs
Input: 10.7B rows / 10 GB
Shuffle: 39M rows / 1 GB
13. Preaggregate by generic & brand by month
create table prescription_counts_by_month as
select
to_date(date_trunc("month", date)) as date,
generic,
brand,
count(*) as scripts
from prescriptions
group by 1, 2, 3
14. select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescription_counts_by_month
group by 1
order by 1
Count scripts, generics & brands by month v2
Time: 3 secs (50x faster)
Input: 2.6M rows / 100 MB
Shuffle: 2.6M rows / 100 MB
15. select *, raw_count / agg_count as row_reduction
from
(select count(*) as raw_count from prescriptions)
cross join
(select count(*) as agg_count from prescription_counts_by_month)
Only 50x faster, despite a ~4,000x row reduction (10.7B / 2.6M), because of job startup cost
16. High row reduction is only possible when preaggregating low-cardinality dimensions, such as generic (7K) and brand (20K), but not product (350K) or patient_id (300M+)
The curse of high cardinality (1 of 2)
17. Small shuffles are only possible with low-cardinality count(distinct …); a plan sketch follows below
The curse of high cardinality (2 of 2)
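One way to see the shuffle problem, sketched here: Spark plans count(distinct x) as a two-phase aggregation whose first phase is keyed by (group key, x), so one row per distinct pair must cross the shuffle.

// Inspect the physical plan of a high-cardinality distinct count (illustrative).
spark.sql("""
select to_date(date_trunc("month", date)) as date,
count(distinct patient_id) as patients
from prescriptions
group by 1
""").explain()
// The plan shows a partial aggregation grouped by (date, patient_id):
// shuffle volume grows with the cardinality of patient_id.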
18. select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(distinct patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Adding a high-cardinality distinct count
Time: 370 secs :(
Input: 10.7B rows / 21 GB
Shuffle: 7.5B rows / 102 GB
20. select
to_date(date_trunc("month", date)) as date,
approx_count_distinct(generic) as generics,
approx_count_distinct(brand) as brands,
approx_count_distinct(patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Approximate counting, default 5% relative error (tunable; see the sketch below)
Time: 120 secs (3x faster)
Input: 10.7B rows / 21 GB
Shuffle: 6K rows / 7 MB
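approx_count_distinct accepts an optional maximum relative standard deviation argument, so the accuracy/cost tradeoff is tunable; a sketch:

// Tighten the default 0.05 (5%) error target to 2%; each sketch uses more memory.
spark.sql("""
select to_date(date_trunc("month", date)) as date,
approx_count_distinct(patient_id, 0.02) as patients
from prescriptions
group by 1
""")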
26. Preaggregate with HLL sketches
create table prescription_counts_by_month_hll as
select
to_date(date_trunc("month", date)) as date,
generic,
brand,
count(*) as scripts,
hll_init_agg(patient_id) as patient_ids
from prescriptions
group by 1, 2, 3
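Note that hll_init_agg and the other hll_* functions come from spark-alchemy, not built-in Spark SQL; a sketch of the one-time registration step, assuming the library's documented API:

// Register spark-alchemy's HLL functions so they can be called from SQL.
com.swoop.alchemy.spark.expressions.hll.HLLFunctionRegistration.registerFunctions(spark)
// The hll_* functions also accept an optional error parameter,
// e.g. hll_init_agg(patient_id, 0.02) for a 2% target error.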
27. select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
hll_cardinality(hll_merge(patient_ids)) as patients,
count(*) as scripts
from prescription_counts_by_month_hll
group by 1
order by 1
Reaggregate and present with HLL sketches
Time: 7 secs (50x faster)
Input: 2.6M rows / 200 MB
Shuffle: 2.6M rows / 100 MB
32. Making it work in the real world
• Data is not uniformly distributed…
• Hash it!
• How do we get many “samples” from one set of hashes?
• Partition them!
• Can we get a good estimate for the mean?
• Yes, with some fancy math & empirical corrections.
• Do we actually have to keep the minimums?
• No, just keep the number of 0s before the first 1 bit of the binary hash (see the toy implementation below).
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
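A toy Scala implementation ties those answers together (illustrative only: the real algorithm adds small/large-range bias corrections and sparse encodings, omitted here):

import scala.util.hashing.MurmurHash3

// Toy HyperLogLog: hash each item, use the top p bits to pick one of 2^p
// registers ("partitions"), keep per register the maximum rank (count of 0s
// before the first 1 bit, plus one), then combine with a harmonic mean.
def toyHll(items: Seq[String], p: Int = 12): Double = {
  val m = 1 << p
  val registers = new Array[Int](m)
  for (item <- items) {
    val h = MurmurHash3.stringHash(item)                // spread values uniformly
    val idx = h >>> (32 - p)                            // top p bits pick a register
    val rank = Integer.numberOfLeadingZeros(h << p) + 1 // 0s before the first 1
    registers(idx) = math.max(registers(idx), rank)
  }
  val alpha = 0.7213 / (1 + 1.079 / m)                  // empirical correction constant
  alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
}

// toyHll((1 to 100000).map(_.toString)) lands within a few percent of 100000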
36. • High-performance interactive analytics
• Preaggregate in Spark, push to Postgres / Citus, reaggregate there
• Better privacy
• HLL sketches contain no identifiable information
• Unions across columns
• No added error
• Intersections across columns
• Use the inclusion/exclusion principle (sketched below); increases estimate error
Other benefits of using HLL sketches
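A sketch of the intersection pattern via inclusion/exclusion (the table and column names are hypothetical, and hll_row_merge, which merges sketches across columns, is assumed from spark-alchemy's docs):

// |A ∩ B| ≈ |A| + |B| - |A ∪ B|, all three cardinalities estimated from sketches.
spark.sql("""
select hll_cardinality(hll_merge(a_ids))
+ hll_cardinality(hll_merge(b_ids))
- hll_cardinality(hll_row_merge(hll_merge(a_ids), hll_merge(b_ids)))
as approx_intersection
from sketches
""")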
37. • Experiment with the HLL functions in spark-alchemy
• Can you keep big data in Spark only and interop with HLL sketches?
• We’d welcome a PR that adds BigQuery support to spark-alchemy
• Last but not least, do you want to build tools to make Spark great
while improving the lives of millions of patients?
Calls to Action
sim at swoop dot com