Pre-aggregation is a powerful analytics technique as long as the measures being computed are reaggregable. Counts reaggregate with SUM, minimums with MIN, maximums with MAX, etc. The odd one out is distinct counts, which are not reaggregable (see the illustration just below).
Traditionally, the non-reaggregability of distinct counts leads to an implicit restriction: whichever system computes distinct counts has to have access to the most granular data and touch every row at query time. Because of this, in typical analytics architectures, where fast query response times are required, raw data has to be duplicated between Spark and another system such as an RDBMS. This talk is for everyone who computes or consumes distinct counts and for everyone who doesn’t understand the magical power of HyperLogLog (HLL) sketches.
We will break through the limits of traditional analytics architectures using the advanced HLL functionality and cross-system interoperability of the spark-alchemy open-source library, whose capabilities go beyond what is possible with OSS Spark, Redshift or even BigQuery. We will uncover patterns for 1000x gains in analytic query performance without data duplication and with significantly less capacity.
We will explore real-world use cases from Swoop’s petabyte-scale systems, improve data privacy when running analytics over sensitive data, and even see how a real-time analytics frontend running in a browser can be provisioned with data directly from Spark.
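As a quick illustration of reaggregability, here is a minimal Scala sketch (the data is made up): per-partition counts recombine with SUM, but per-partition distinct counts do not.

// Minimal illustration of reaggregability with plain Scala collections.
val partitions = Seq(Seq("a", "b", "b"), Seq("b", "c"))
// Counts reaggregate: summing per-partition counts gives the global count.
val totalCount = partitions.map(_.size).sum             // 5, same as partitions.flatten.size
// Distinct counts do not: summing per-partition distinct counts overcounts
// any element that appears in more than one partition.
val naiveDistinct = partitions.map(_.distinct.size).sum // 4
val trueDistinct  = partitions.flatten.distinct.size    // 3 ("a", "b", "c")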
2. Improving patient outcomes
Leading health data:
• 280M unique US patients
• 7 years longitudinal data
• De-identified, HIPAA-safe
• Claims (ICD 9 or 10, CPT, Rx and J codes) and NPI data, attributed to the patient
• 1st-party data, integrated with proprietary tech
Leading consumer data:
• 300M US consumers
• 3,500+ consumer attributes: lifestyle (magazine subscriptions, catalog purchases), psychographics (animal lover, fisherman), demographics (property records, internet transactions)
• De-identified, privacy-safe
Petabyte-scale privacy-preserving ML/AI
9. root
|-- date: date
|-- generic: string
|-- brand: string
|-- product: string
|-- patient_id: long
|-- doctor_id: long
Demo system: prescriptions in 2018
• Narrow sample
• 10.7 billion rows / 150 GB
• Small-ish Spark 2.4 cluster
• 80 cores, 600 GB RAM
• Delta Lake, fully cached
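A hypothetical setup for a demo like this, assuming a spark session and an illustrative path (neither is from the talk):

// Load the Delta Lake table and cache it fully, matching the demo setup.
val prescriptions = spark.read.format("delta").load("/data/prescriptions_2018")
prescriptions.createOrReplaceTempView("prescriptions")
spark.sql("CACHE TABLE prescriptions")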
10. select * from prescriptions
Result columns: brand name, generic name, National Drug Code (NDC)
11. select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescriptions
group by 1
order by 1
Count scripts, generics & brands by month
Time: 145 secs
Input: 10.7B rows / 10 GB
Shuffle: 39M rows / 1 GB
13. Preaggregate by generic & brand by month
create table prescription_counts_by_month as
select
to_date(date_trunc("month", date)) as date,
generic,
brand,
count(*) as scripts
from prescriptions
group by 1, 2, 3
14. select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescription_counts_by_month
group by 1
order by 1
Count scripts, generics & brands by month v2
Time: 3 secs (50x faster)
Input: 2.6M rows / 100 MB
Shuffle: 2.6M rows / 100 MB
15. select *, raw_count / agg_count as row_reduction
from
(select count(*) as raw_count from prescriptions)
cross join
(select count(*) as agg_count from prescription_counts_by_month)
Only 50x faster, despite a ~4,000x row reduction (10.7B / 2.6M), because of job startup cost
16. High row reduction is only possible when preaggregating low-cardinality dimensions, such as generic (7K) and brand (20K), but not product (350K) or patient_id (300M+)
The curse of high cardinality (1 of 2)
17. Small shuffles are only possible with low-cardinality count(distinct …); a plan sketch follows below
The curse of high cardinality (2 of 2)
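One way to see the shuffle problem, sketched here: Spark plans count(distinct x) as a two-phase aggregation whose first phase is keyed by (group key, x), so one row per distinct pair must cross the shuffle.

// Inspect the physical plan of a high-cardinality distinct count (illustrative).
spark.sql("""
select to_date(date_trunc("month", date)) as date,
count(distinct patient_id) as patients
from prescriptions
group by 1
""").explain()
// The plan shows a partial aggregation grouped by (date, patient_id):
// shuffle volume grows with the cardinality of patient_id.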
18. select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(distinct patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Adding a high-cardinality distinct count
Time: 370 secs :(
Input: 10.7B rows / 21 GB
Shuffle: 7.5B rows / 102 GB
20. select
to_date(date_trunc("month", date)) as date,
approx_count_distinct(generic) as generics,
approx_count_distinct(brand) as brands,
approx_count_distinct(patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Approximate counting, default 5% relative error (tunable; see the sketch below)
Time: 120 secs (3x faster)
Input: 10.7B rows / 21 GB
Shuffle: 6K rows / 7 MB
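approx_count_distinct accepts an optional maximum relative standard deviation argument, so the accuracy/cost tradeoff is tunable; a sketch:

// Tighten the default 0.05 (5%) error target to 2%; each sketch uses more memory.
spark.sql("""
select to_date(date_trunc("month", date)) as date,
approx_count_distinct(patient_id, 0.02) as patients
from prescriptions
group by 1
""")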
26. Preaggregate with HLL sketches
create table prescription_counts_by_month_hll as
select
to_date(date_trunc("month", date)) as date,
generic,
brand,
count(*) as scripts,
hll_init_agg(patient_id) as patient_ids
from prescriptions
group by 1, 2, 3
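Note that hll_init_agg and the other hll_* functions come from spark-alchemy, not built-in Spark SQL; a sketch of the one-time registration step, assuming the library's documented API:

// Register spark-alchemy's HLL functions so they can be called from SQL.
com.swoop.alchemy.spark.expressions.hll.HLLFunctionRegistration.registerFunctions(spark)
// The hll_* functions also accept an optional error parameter,
// e.g. hll_init_agg(patient_id, 0.02) for a 2% target error.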
27. select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
hll_cardinality(hll_merge(patient_ids)) as patients,
count(*) as scripts
from prescription_counts_by_month_hll
group by 1
order by 1
Reaggregate and present with HLL sketches
Time: 7 secs (50x faster)
Input: 2.6M rows / 200 MB
Shuffle: 2.6M rows / 100 MB
32. Making it work in the real world
• Data is not uniformly distributed…
• Hash it!
• How do we get many “samples” from one set of hashes?
• Partition them!
• Can we get a good estimate for the mean?
• Yes, with some fancy math & empirical corrections.
• Do we actually have to keep the minimums?
• No, just keep the number of 0s before the first 1 bit of the binary hash (see the toy implementation below).
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
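A toy Scala implementation ties those answers together (illustrative only: the real algorithm adds small/large-range bias corrections and sparse encodings, omitted here):

import scala.util.hashing.MurmurHash3

// Toy HyperLogLog: hash each item, use the top p bits to pick one of 2^p
// registers ("partitions"), keep per register the maximum rank (count of 0s
// before the first 1 bit, plus one), then combine with a harmonic mean.
def toyHll(items: Seq[String], p: Int = 12): Double = {
  val m = 1 << p
  val registers = new Array[Int](m)
  for (item <- items) {
    val h = MurmurHash3.stringHash(item)                // spread values uniformly
    val idx = h >>> (32 - p)                            // top p bits pick a register
    val rank = Integer.numberOfLeadingZeros(h << p) + 1 // 0s before the first 1
    registers(idx) = math.max(registers(idx), rank)
  }
  val alpha = 0.7213 / (1 + 1.079 / m)                  // empirical correction constant
  alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
}

// toyHll((1 to 100000).map(_.toString)) lands within a few percent of 100000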
36. • High-performance interactive analytics
• Preaggregate in Spark, push to Postgres / Citus, reaggregate there
• Better privacy
• HLL sketches contain no identifiable information
• Unions across columns
• No added error
• Intersections across columns
• Use the inclusion/exclusion principle (sketched below); increases estimate error
Other benefits of using HLL sketches
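A sketch of the intersection pattern via inclusion/exclusion (the table and column names are hypothetical, and hll_row_merge, which merges sketches across columns, is assumed from spark-alchemy's docs):

// |A ∩ B| ≈ |A| + |B| - |A ∪ B|, all three cardinalities estimated from sketches.
spark.sql("""
select hll_cardinality(hll_merge(a_ids))
+ hll_cardinality(hll_merge(b_ids))
- hll_cardinality(hll_row_merge(hll_merge(a_ids), hll_merge(b_ids)))
as approx_intersection
from sketches
""")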
37. • Experiment with the HLL functions in spark-alchemy
• Can you keep big data in Spark only and interop with HLL sketches?
• We’d welcome a PR that adds BigQuery support to spark-alchemy
• Last but not least, do you want to build tools to make Spark great
while improving the lives of millions of patients?
Calls to Action
sim at swoop dot com