Igor Sfiligoi, UC San Diego & SDSC, November 2020
SPEEDING MICROBIOME RESEARCH BY THREE ORDERS OF MAGNITUDE
Presented at NVIDIA Virtual Theater
[Chart: wallclock time in hours for Original CPU vs. Xeon Gold, V100, A100, and RTX8000; annotated 8000 (hours), 25x, and 6000x]
THE CONTEXT
Microbiome research has expanded over the years from analyzing handfuls of samples to hundreds of thousands.
What worked at small scale does not work at today's scales!
UniFrac is one such tool; it is used for comparing microbiome profiles to one another. One important field is studying the impact of the gut population of microbes, which influences diseases ranging from ulcers to heart disease to autism to COVID-19.
My colleagues at UCSD were hurting due to excessive runtimes of the tool, so they asked me to explore whether porting it to GPUs would be an advantageous option.
Since I am here to talk, I think you can guess how that worked out.
THE SCIENCE
WE ARE WHAT WE EAT
The microbiome is critical for health:
• Produces compounds your body needs which it cannot otherwise produce
• Disruptions in the microbiome are associated with a range of diseases
• Many non-communicable diseases, like Alzheimer's, various cancers, cardiovascular disease, and many more are associated with the microbiome
On a more recent theme:
• Many high-risk populations for COVID-19 also have diseases known to be associated with the microbiome
https://www.biotechniques.com/multimedia/archive/00252/microbiome2_252150a.jpg
KNIGHT LAB AT UCSD LEADING AMERICAN GUT PROJECT
Collecting specimens, DNA sequencing samples, and analyzing the results
Daniel McDonald, Rob Knight
SAMPLE RELATIONSHIPS
A fundamental component of microbiome analysis is understanding how entire microbial communities relate to each other. This requires pairwise comparisons of all samples in a dataset.
DISTANCE MATRIX
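To make the quadratic cost concrete, below is a minimal, self-contained sketch of building such a distance matrix. The Euclidean metric is a toy stand-in (UniFrac plays this role in the talk), and all names here are illustrative, not the actual UniFrac API:

#include <cmath>
#include <cstddef>
#include <vector>

// Toy stand-in metric; UniFrac plays this role in the talk.
double distance(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); i++) {
        const double d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// N samples -> N*(N-1)/2 pairwise comparisons,
// hence the quadratic runtime growth discussed later.
std::vector<std::vector<double>>
distance_matrix(const std::vector<std::vector<double>>& samples) {
    const std::size_t n = samples.size();
    std::vector<std::vector<double>> dm(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = i + 1; j < n; j++)
            dm[i][j] = dm[j][i] = distance(samples[i], samples[j]);
    return dm;
}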
UNIFRAC DISTANCE
• Incorporates information on the evolutionary relatedness of community members by including the phylogeny of the observed organisms in the computation.
• Other measures, such as Euclidean distance, implicitly assume all organisms are equally related.
Lozupone and Knight, Applied and Environmental Microbiology, 2005
Samples whose organisms are very similar from an evolutionary perspective will have a small UniFrac distance.
On the other hand, two samples composed of very different organisms will have a large UniFrac distance.
A distance metric
https://en.wikipedia.org/wiki/UniFrac
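As a toy numeric example (all values invented for illustration), unweighted UniFrac can be read as the unshared branch length divided by the total branch length observed in either sample, echoing the u^v and u|v terms that appear in the kernels later in the talk:

#include <cstdio>

// Toy unweighted UniFrac between two samples. For every branch of the
// phylogenetic tree we have its length and whether each sample contains
// any organism under it. All values are invented for illustration.
int main() {
    const int n_branches = 5;
    const double length[n_branches] = {0.3, 0.2, 0.5, 0.1, 0.4};
    const bool in_a[n_branches] = {true, true,  false, true, false};
    const bool in_b[n_branches] = {true, false, false, true, true};

    double unshared = 0.0, total = 0.0;
    for (int i = 0; i < n_branches; i++) {
        if (in_a[i] != in_b[i]) unshared += length[i]; // exactly one sample
        if (in_a[i] || in_b[i]) total    += length[i]; // at least one sample
    }
    // unshared = 0.2 + 0.4 = 0.6; total = 0.3 + 0.2 + 0.1 + 0.4 = 1.0
    std::printf("UniFrac distance = %.3f\n", unshared / total); // 0.600
    return 0;
}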
COMPUTING ON GPU
STARTING WITH STRIPED UNIFRAC
A recent (2018) algorithm, optimized for both speed and parallelism (on CPUs)
It allowed microbiome researchers to analyze tens of thousands of samples from modern studies
But going into the 100k range and beyond was becoming too expensive
Runtime scales approximately quadratically: N samples require N(N-1)/2 pairwise comparisons
State of the art as of early 2020
[Figure: projected runtimes using the early 2020 CPU-only code. From: Striped UniFrac: enabling microbiome analysis at unprecedented scale]
COULD PORTING TO GPU ALLOW US TO DRASTICALLY REDUCE THE RUNTIME?
Let me take a look!
Igor Sfiligoi
Where is most of the time spent?
Turns out to be a tight double loop
• With many iterations
• All independent
Conceptually not too far away from BLAS
• Should fly on GPUs!
Simple stack sampling method
for (unsigned int stripe = start; stripe < stop; stripe++) {
    dm_stripe = dm_stripes[stripe];
    for (unsigned int k = 0; k < n_samples; k++) {
        unsigned int l = (k + stripe + 1) % n_samples;
        double u1 = emb[k];
        double v1 = emb[k + stripe + 1];
        ...
        dm_stripe[k] += fabs(u1 - v1) * length;
    }
}
OpenACC makes it easy to have a first port
Almost as easy as adding a decorator
• Too bad arrays of pointers are not well supported
• Thus required a bit of refactoring
Done in less than a week
• A couple of days FTE
8x speedup CPU -> GPU (chip vs chip)
#pragma acc parallel loop collapse(2) \
            present(emb, dm_stripes_buf)
for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int k = 0; k < n_samples; k++) {
        int idx = (stripe - start_idx) * n_samples;
        double *dm_stripe = dm_stripes_buf + idx;
        unsigned int l = (k + stripe + 1) % n_samples;
        double u1 = emb[k];
        double v1 = emb[k + stripe + 1];
        ...
        dm_stripe[k] += fabs(u1 - v1) * length;
    }
}
Runtime on 25k samples:
• Intel Xeon E5-2680 v4 (using all 14 cores): 800 minutes (13 hours)
• NVIDIA Tesla V100 (using all 84 SMs): 92 minutes (1.5 hours)
But how it is used is just as important
The emb input buffers must be prepared for each invocation
• Data movement latency
The main buffer is traversed in full every time
• No cache reuse
Large number of invocations
• GPU invocation overhead
initialize(dm_stripe_buf);
#pragma acc data copy(dm_stripe_buf)
for (unsigned int k = 0; k < (tree.nparens / 2) - 1; k++) {
    // must be run sequentially on the CPU;
    // logic and deep function nesting rewrites all of the emb buffer
    embed(emb, tree, k);
    // on GPU
    #pragma acc data copyin(emb)
    run_loop(dm_stripe_buf, emb, tree.getlen(k));
}
return dm_stripe_buf;
Bad for both CPU and GPU code paths
Batching to the rescue
Batching many emb buffers
• Improves memory locality for the main buffer
• Reduces GPU invocation overhead and allows for overlap with CPU
• At the expense of more memory use
Cache-awareness in the loop becomes very important
Additional 8x speedup on GPU (total 64x)
And a decent 4x speedup on CPU
#pragma acc parallel loop collapse(3) async \
            present(emb, dm_stripes_buf, length)
for (sk) {        // swap order and tile
  for (stripe) {
    for (unsigned int ik = 0; ik < step_size; ik++) {
      unsigned int k = sk * step_size + ik;
      unsigned int l = (k + stripe + 1) % n_samples;
      ...
      double my_stripe = dm_stripe[k];
      #pragma acc loop seq
      for (unsigned int e = 0; e < filled_embs; e++) {
        uint64_t offset = n_samples * e;
        double u = emb[offset + k];
        double v = emb[offset + k + stripe + 1];
        my_stripe += fabs(u - v) * length[e];
      }
      ...
      dm_stripe[k] += my_stripe;
    }
  }
}
#ifdef _OPENACC
std::swap(emb, emb_alt);
#endif
Runtime on 25k samples:
• Intel Xeon E5-2680 v4 (using all 14 cores): 193 minutes (3.2 hours)
• NVIDIA Tesla V100 (using all 84 SMs): 12 minutes
WE WERE PRETTY HAPPY WITH SPEEDUP
Switching to fp32 added an additional boost
[Chart, Spring 2020: 80x and 600x speedups]
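The fp32/fp64 duality maps naturally onto a templated kernel; the TFloat alias seen in the packed-sums code later hints at this. Here is a minimal sketch of keeping one source for both precisions (an illustration, not the actual UniFrac code):

#include <cmath>
#include <cstdint>

// Illustrative only: one templated kernel body serving both precisions,
// mirroring the TFloat alias that appears in the packed-sums code later.
template <typename TFloat>
void unifrac_step(TFloat *dm_stripe, const TFloat *emb, const TFloat *length,
                  unsigned int n_samples, unsigned int stripe,
                  unsigned int filled_embs) {
    for (unsigned int k = 0; k < n_samples; k++) {
        TFloat my_stripe = dm_stripe[k];
        for (unsigned int e = 0; e < filled_embs; e++) {
            const uint64_t offset = uint64_t(n_samples) * e;
            const TFloat u = emb[offset + k];
            const TFloat v = emb[offset + k + stripe + 1];
            my_stripe += std::fabs(u - v) * length[e];
        }
        dm_stripe[k] = my_stripe;
    }
}

// fp64 and fp32 variants compiled from the same source:
template void unifrac_step<double>(double *, const double *, const double *,
                                   unsigned int, unsigned int, unsigned int);
template void unifrac_step<float>(float *, const float *, const float *,
                                  unsigned int, unsigned int, unsigned int);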
RETHINKING THE ALGORITHM
SEVERAL FLAVORS OF UNIFRAC
There are several versions of UniFrac.
Two of the popular ones are true FP compute (like the previous slides).
One is binary in nature.
We expected the binary version to be significantly faster.
But it was not!
Binary operations only in the tight loop
Moreover, the same emb buffer is being read multiple times
• FP -> bool conversion every single time
Full FP logic still needed
#pragma acc parallel loop ...
for (sk) {
  for (stripe) {
    for (unsigned int ik = 0; ik < step_size; ik++) {
      unsigned int k = sk * step_size + ik;
      unsigned int l = (k + stripe + 1) % n_samples;
      ...
      double my_stripe = dm_stripe[k];
      for (unsigned int e = 0; e < filled_embs; e++) {
        uint64_t offset = n_samples * e;
        bool u = emb[offset + k] > 0;
        bool v = emb[offset + k + stripe + 1] > 0;
        my_stripe += (u ^ v) * length[e];
      }
      ...
      dm_stripe[k] += my_stripe;
    }
  }
}
BINARY PRE-PROCESSING AND PACKING
Computing FP -> bool before invoking the loop saves a lot of compute
Packing 8 bools into a single UINT8 saves a lot of memory (size and access)
With only 256 possible byte values, all combinations can be pre-computed, too
• Just memory lookups and sums in the loop now (see the sketch after the timings below)
Runtime on 25k samples:
• NVIDIA Tesla V100 (using all 84 SMs): 2.5 minutes
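Below is a minimal, self-contained sketch of the packing and the 256-entry lookup idea; pack_embs and build_psum are illustrative helper names, not the actual UniFrac API:

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack 8 presence/absence flags (emb[i] > 0) into each byte.
std::vector<uint8_t> pack_embs(const std::vector<double>& emb) {
    std::vector<uint8_t> packed((emb.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < emb.size(); i++)
        if (emb[i] > 0) packed[i / 8] |= uint8_t(1u << (i % 8));
    return packed;
}

// For every possible byte value, pre-compute the sum of the branch
// lengths whose bits are set; the hot loop then needs only lookups.
std::vector<double> build_psum(const double lengths[8]) {
    std::vector<double> psum(256, 0.0);
    for (unsigned int b = 0; b < 256; b++)
        for (unsigned int bit = 0; bit < 8; bit++)
            psum[b] += ((b >> bit) & 1) * lengths[bit];
    return psum;
}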
LOTS OF ZEROES EVERYWHERE!
Since there are only 256 combinations, I got curious and checked the distribution
• >90% of the time the value is 0!
We were doing a huge amount of no-op compute (adding zero)
• The emb buffer is basically a sparse matrix!
Using UINT64 and adding a simple if (o1 != 0) check gets me another 3x speedup
• Ran out of time for further optimizations
Runtime on 25k samples:
• NVIDIA Tesla V100 (using all 84 SMs): 45 seconds
Basically a sparse matrix problem
#pragma acc parallel loop ...
for (sk) {
  for (stripe) {
    for (unsigned int ik = 0; ik < step_size; ik++) {
      unsigned int k = sk * step_size + ik;
      unsigned int l = (k + stripe + 1) % n_samples;
      ...
      double my_stripe = dm_stripe[k];
      for (unsigned int e = 0; e < filled_embs; e++) {
        uint64_t offset = n_samples * e;
        uint64_t u = emb_packed[offset + k];
        uint64_t v = emb_packed[offset + k + stripe + 1];
        uint64_t o1 = u | v;
        if (o1 != 0) {  // zeros are prevalent
          my_stripe += psum[         (o1        & 0xff)] +
                       psum[0x100 + ((o1 >>  8) & 0xff)] +
                       ...
                       psum[0x700 +  (o1 >> 56)        ];
        }
      }
      ...
      dm_stripe[k] += my_stripe;
    }
  }
}
#pragma acc parallel loop ...
for (unsigned int emb_el = 0; emb_el < embs_els; emb_el++) {
  for (unsigned int sub8 = 0; sub8 < 8; sub8++) {
    unsigned int emb8 = emb_el * 8 + sub8;
    TFloat *psum = &(sums[emb8 << 8]);
    TFloat *pl   = &(lengths[emb8 * 8]);
    for (unsigned int b8_i = 0; b8_i < 0x100; b8_i++) {
      psum[b8_i] = (((b8_i >> 0) & 1) * pl[0]) +
                   (((b8_i >> 1) & 1) * pl[1]) +
                   ...
                   (((b8_i >> 7) & 1) * pl[7]);
    }
  }
}
Sparse packed version
WORKS EVEN BETTER ON LARGER PROBLEMS
                         25k          50k           115k          300k
Original, Xeon Gold CPU  30k seconds  2.5k minutes  30k minutes   8k hours
Latest, Xeon Gold CPU    290 seconds  16.5 minutes  180 minutes   33 hours
Latest, V100 GPU         45 seconds   2.2 minutes   13 minutes    1.9 hours
Latest, A100 GPU         33 seconds   1.72 minutes  9.8 minutes   1.4 hours
Latest, RTX8000 GPU      29 seconds   1.58 minutes  9.4 minutes   1.3 hours
Speedup vs original      1000x        1500x         3000x         6000x
CPU speed is now often the limiting factor
The original code was single threaded
• Relying on partitioning of the problem
• GPUs prefer the full problem in the loop
Using OpenMP for CPU parallelization
• Together with OpenACC for GPUs, they make for a great pair
GPU compute is just so fast!
initialize(dm_stripe_buf);
#pragma acc data copy(dm_stripe_buf)
for (unsigned int k0 = 0; k0 < (tree.nparens / 2) - 1; k0 += chunk) {
    #pragma omp parallel for
    for (unsigned int i = 0; i < chunk; i += 64) {
        embed_packed(emb[i], tree, k0 + i);
        fill_lengths(lengths, tree, k0 + i);
    }
    #pragma acc update device(emb)
    #pragma acc wait
    #pragma acc data copyin(lengths)
    run_loop(dm_stripe_buf, emb, lengths);
}
#pragma acc wait
return dm_stripe_buf;
IN SUMMARY
THE PORTING TO GPUS WAS A MAJOR SUCCESS
Most of the time was spent in a tight loop
Easy to port to GPUs using OpenACC
But a deep understanding of the code was critical for maximum speedup
We gained significantly more from the better algorithm than from better HW
Having a single code base between CPU and GPU helped a lot in improving both code paths
Optimizing one side usually led to discoveries for the other
GPUs are still way faster than CPUs, so HW does matter
Way beyond our greatest hopes: 6000x
ENABLING SCIENCE THAT WOULD OTHERWISE NOT BE POSSIBLE
A 300k-sample run now computes in about an hour on a single node, versus a heroic HPC job before: a 6000x speedup
ACKNOWLEDGMENTS
This work was partially funded by US National Science Foundation (NSF) grants
DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and by
US National Institutes of Health (NIH) grant DP1-AT010885.