Adapted slides from the presentation at the SC20 NVIDIA Virtual Theater: https://gateway.on24.com/wcc/eh/1640195/lp/2819881/Speeding-Microbiome-Research-by-Three-Orders-of-Magnitude/
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another. It is used for studying the impact of the gut population of microbes, which influence diseases ranging from ulcers to heart disease to autism to COVID-19. Computing UniFrac on modest sample sizes used to take a workday on a server-class, CPU-only node, while modern datasets would require a large compute cluster to be feasible. We thus decided to port the code to GPUs, using OpenACC to keep a single code base. The differences in parallelism between CPU and GPU compute made us rethink the compute logic of the code, and we incrementally improved the runtime by over three orders of magnitude. Computing the same modest sample size now takes only 30 seconds on an NVIDIA A100 Tensor Core GPU, four minutes on an NVIDIA GeForce GTX 1050 GPU-equipped laptop, and about five minutes on a server-class CPU. The difference is even bigger for cutting-edge datasets, where we observe approximately a 6000x speedup on an A100 GPU and a 250x speedup for the CPU-only code base. To fully utilize the larger GPUs, however, we also need a large CPU, since some of the code is still CPU-only. Using OpenACC-supported async data transfers allows us to fully overlap compute between the two chips.
Speeding Microbiome Research by Three Orders of Magnitude
Igor Sfiligoi, UC San Diego & SDSC, November 2020
SPEEDING MICROBIOME RESEARCH BY THREE ORDERS OF MAGNITUDE
Presented at the NVIDIA Virtual Theater
[Figure: bar chart of wallclock time, in hours, for Original CPU, Xeon Gold, V100, A100, and RTX8000; chart annotations: 8000, 25x, 6000x]
THE CONTEXT
Microbiome research has expanded over the years from analyzing handfuls of
samples to hundreds of thousands.
What worked at small scale does not work at the more recent scales!
UniFrac is one such tool and is used for comparing microbiome profiles to one
another. One important field is studying the impact of the gut population of
microbes, which influence diseases ranging from ulcers to heart disease to
autism to COVID-19.
My colleagues at UCSD were hurting due to excessive runtimes of the tool, so
they asked me to explore whether porting it to GPUs would be an advantageous option.
Since I am here to talk, I think you can guess how that worked out.
WE ARE WHAT WE EAT
The microbiome is critical for health:
• Produces compounds your body needs which it cannot otherwise produce
• Disruptions in the microbiome are associated with a range of diseases
• Many non-communicable diseases, like Alzheimer's, various cancers, cardiovascular disease and much more, are associated with the microbiome
On a more recent theme:
• Many high-risk populations for COVID-19 also have diseases known to be associated with the microbiome
https://www.biotechniques.com/multimedia/archive/00252/microbiome2_252150a.jpg
KNIGHT LAB AT UCSD LEADING AMERICAN GUT PROJECT
Collecting specimens, DNA sequencing samples, and analyzing the results
Daniel McDonald, Rob Knight
SAMPLE RELATIONSHIPS
A fundamental component of microbiome analysis is understanding how entire microbial communities relate to each other. This requires pairwise comparisons of all samples in a dataset.
[Figure: distance matrix]
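To make the all-pairs structure concrete, here is a minimal illustrative sketch (not the actual UniFrac implementation) of filling such a symmetric distance matrix; the per-pair metric below is a simple Euclidean stand-in:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Stand-in per-pair metric (plain Euclidean distance); UniFrac would
    // instead compare the two profiles using the phylogenetic tree.
    double pair_distance(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }

    // Fill the full n x n distance matrix from n sample profiles.
    // The work is n*(n-1)/2 pairs, i.e. it grows quadratically with n.
    std::vector<double> distance_matrix(const std::vector<std::vector<double>>& samples) {
        const size_t n = samples.size();
        std::vector<double> dm(n * n, 0.0);
        for (size_t i = 0; i < n; i++) {
            for (size_t j = i + 1; j < n; j++) {
                const double d = pair_distance(samples[i], samples[j]);
                dm[i * n + j] = d;  // the matrix is symmetric,
                dm[j * n + i] = d;  // so fill both halves
            }
        }
        return dm;
    }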
UNIFRAC DISTANCE
• Incorporates information on the evolutionary relatedness of community members by including the phylogeny of the observed organisms in the computation.
• Other measures, such as Euclidean distance, implicitly assume all organisms are equally related.
Lozupone and Knight, Applied and Environmental Microbiology, 2005
Samples whose organisms are evolutionarily very similar will have a small UniFrac distance. On the other hand, two samples composed of very different organisms will have a large UniFrac distance.
A distance metric
https://en.wikipedia.org/wiki/UniFrac
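For reference, the unweighted form of the metric can be written over the branches i of the phylogenetic tree, with branch lengths b_i (a simplified statement; see the citation above for the exact definitions and the weighted variants):

    d_{\mathrm{UniFrac}}(A,B) =
      \frac{\sum_i b_i \,\left| \mathbf{1}[i \in A] - \mathbf{1}[i \in B] \right|}
           {\sum_i b_i \,\max\!\left(\mathbf{1}[i \in A],\, \mathbf{1}[i \in B]\right)}

where \mathbf{1}[i \in A] is 1 if branch i leads to at least one organism observed in sample A, and 0 otherwise. Branches shared by both samples cancel in the numerator, which is why evolutionarily similar communities end up with a small distance.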
STARTING WITH STRIPED UNIFRAC
A recent (2018) algorithm that is optimized for both speed and parallelism (on CPUs).
It allowed microbiome researchers to analyze tens of thousands of samples from modern studies.
But going into the 100k range and beyond was becoming too expensive: the runtime scales approximately quadratically with the number of samples.
State of the art as of early 2020.
From: Striped UniFrac: enabling microbiome analysis at unprecedented scale
[Figure: projected runtimes using the early 2020 CPU-only code]
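A back-of-the-envelope count makes the quadratic scaling concrete: the number of pairwise distances for n samples is

    \binom{n}{2} = \frac{n(n-1)}{2}, \qquad
    \binom{25\,000}{2} \approx 3.1 \times 10^{8}, \qquad
    \binom{300\,000}{2} \approx 4.5 \times 10^{10},

so going from 25k to roughly 300k samples multiplies the number of pairs by about 140x, before even counting the per-pair cost.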
COULD PORTING TO GPU ALLOW US TO DRASTICALLY REDUCE THE RUNTIME?
Let me take a look!
Igor Sfiligoi
Where is most of the time spent?
Turns out to be a tight double loop
• With many iterations
• All independent
Conceptually not too far away from BLAS
• Should fly on GPUs!
Found with a simple stack sampling method.

for (unsigned int stripe = start; stripe < stop; stripe++) {
    dm_stripe = dm_stripes[stripe];
    for (unsigned int k = 0; k < n_samples; k++) {
        unsigned int l = (k + stripe + 1) % n_samples;
        double u1 = emb[k];
        double v1 = emb[k + stripe + 1];
        …
        dm_stripe[k] += fabs(u1 - v1) * length;
    }
}
OpenACC makes it easy to have a first port
Almost as easy as adding a decorator
• Too bad arrays of pointers are not well supported
• Thus it required a bit of refactoring (a sketch of the flattening follows the runtime numbers below)
Done in less than a week
• A couple of days FTE
8x speedup CPU -> GPU (chip vs chip)

#pragma acc parallel loop collapse(2) present(emb, dm_stripes_buf)
for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int k = 0; k < n_samples; k++) {
        int idx = (stripe - start_idx) * n_samples;
        double *dm_stripe = dm_stripes_buf + idx;
        unsigned int l = (k + stripe + 1) % n_samples;
        double u1 = emb[k];
        double v1 = emb[k + stripe + 1];
        …
        dm_stripe[k] += fabs(u1 - v1) * length;
    }
}

Runtime on a 25k-sample dataset:
Intel Xeon E5-2680 v4 (using all 14 cores): 800 minutes (13 hours)
NVIDIA Tesla V100 (using all 84 SMs): 92 minutes (1.5 hours)
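The refactoring mentioned above boiled down to replacing the array of per-stripe pointers with one flat buffer plus index arithmetic, which OpenACC maps to the device much more easily. A minimal sketch of the idea, with illustrative names (not the actual UniFrac data structures):

    #include <cstddef>

    // Before: an array of per-stripe pointers, awkward to map to the GPU as a unit.
    //   double *dm_stripes[n_stripes];   // each points to n_samples doubles
    //   dm_stripes[stripe][k] += ...;
    //
    // After: one contiguous buffer; a single data clause covers everything.
    void accumulate_stripes(double *dm_stripes_buf,
                            unsigned int n_stripes, unsigned int n_samples) {
        const size_t total = (size_t)n_stripes * n_samples;
        #pragma acc data copy(dm_stripes_buf[0:total])
        {
            #pragma acc parallel loop collapse(2) present(dm_stripes_buf)
            for (unsigned int stripe = 0; stripe < n_stripes; stripe++) {
                for (unsigned int k = 0; k < n_samples; k++) {
                    double *dm_stripe = dm_stripes_buf + (size_t)stripe * n_samples;
                    dm_stripe[k] += 1.0;  // placeholder for the real per-pair update
                }
            }
        }
    }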
But how it is used is just as important
The emb input buffers must be prepared for each invocation
• Data movement latency
The main buffer is traversed in full every time
• No cache reuse
Large number of invocations
• GPU invocation overhead

initialize(dm_stripe_buf);
#pragma acc data copy(dm_stripe_buf)
for (unsigned int k = 0; k < (tree.nparens / 2) - 1; k++) {
    // must be run sequentially on the CPU
    // (logic and deep function nesting);
    // rewrites all of the emb buffer
    embed(emb, tree, k);
    // on GPU
    #pragma acc data copyin(emb)
    run_loop(dm_stripe_buf, emb, tree.getlen(k));
}
return dm_stripe_buf;

Bad for both CPU and GPU code paths
Batching to the rescue
Batching many emb buffers
• Improves memory locality for the main buffer
• Reduces GPU invocation overhead and allows for overlap with the CPU (see the sketch below)
• At the expense of more memory use
Cache-awareness in the loop becomes very important
Additional 8x speedup on GPU (total 64x)
And a decent 4x speedup on CPU

#pragma acc parallel loop collapse(3) async present(emb, dm_stripes_buf, length)
for (sk) {  // swap loop order and tile
    for (stripe) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk * step_size + ik;
            unsigned int l = (k + stripe + 1) % n_samples;
            …
            double my_stripe = dm_stripe[k];
            #pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples * e;
                double u = emb[offset + k];
                double v = emb[offset + k + stripe + 1];
                my_stripe += fabs(u - v) * length[e];
            }
            …
            dm_stripe[k] = my_stripe;
        }
    }
}
#ifdef _OPENACC
std::swap(emb, emb_alt);
#endif

Runtime on a 25k-sample dataset:
Intel Xeon E5-2680 v4 (using all 14 cores): 193 minutes (3.2 hours)
NVIDIA Tesla V100 (using all 84 SMs): 12 minutes
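The async clause and the emb/emb_alt swap are what enable the CPU/GPU overlap: while the GPU kernel works on one batch, the CPU already prepares the next batch in the other buffer. Below is a stripped-down sketch of that producer/consumer pattern; the buffer size, fill_next_batch and the trivial kernel body are illustrative placeholders, not the actual UniFrac driver:

    #include <algorithm>   // std::swap

    static const unsigned int batch_size = 4096;

    // Placeholder CPU-side producer; the real code embeds tree nodes here.
    static void fill_next_batch(double *buf, unsigned int batch) {
        for (unsigned int k = 0; k < batch_size; k++) buf[k] = batch + k;
    }

    // out must have at least batch_size entries.
    void run_batches(double *out, unsigned int n_batches) {
        double *emb     = new double[batch_size];
        double *emb_alt = new double[batch_size];

        // Keep both batch buffers and the output resident on the device.
        #pragma acc enter data create(emb[0:batch_size], emb_alt[0:batch_size]) copyin(out[0:batch_size])

        for (unsigned int batch = 0; batch < n_batches; batch++) {
            fill_next_batch(emb_alt, batch);   // CPU work, overlaps the async GPU kernel

            #pragma acc wait                    // previous batch's kernel must be finished
            std::swap(emb, emb_alt);            // hand the freshly filled buffer to the GPU

            #pragma acc update device(emb[0:batch_size])
            #pragma acc parallel loop async present(emb, out)
            for (unsigned int k = 0; k < batch_size; k++)
                out[k] += emb[k];               // placeholder for the real per-pair update
        }

        #pragma acc wait                        // drain the last kernel
        #pragma acc exit data copyout(out[0:batch_size]) delete(emb[0:batch_size], emb_alt[0:batch_size])
        delete[] emb;
        delete[] emb_alt;
    }

On a CPU-only build the OpenACC directives compile to no-ops and the same loop simply runs serially, which is how the single code base is preserved.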
WE WERE PRETTY HAPPY WITH SPEEDUP
Switching to fp32 added an additional boost
[Figure: speedups of 80x and 600x, Spring 2020]
SEVERAL FLAVORS OF UNIFRAC
There are several versions of UniFrac.
Two of the popular ones are true FP compute (like the previous slides); one is binary in nature.
We expected the binary version to be significantly faster, but it was not!
Binary operations only in the tight loop
Moreover, the same emb buffer is being read multiple times
• FP -> bool conversion every single time
Full FP logic still needed

#pragma acc parallel loop …
for (sk) {
    for (stripe) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk * step_size + ik;
            unsigned int l = (k + stripe + 1) % n_samples;
            …
            double my_stripe = dm_stripe[k];
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples * e;
                bool u = emb[offset + k] > 0;
                bool v = emb[offset + k + stripe + 1] > 0;
                my_stripe += (u ^ v) * length[e];
            }
            …
            dm_stripe[k] = my_stripe;
        }
    }
}
BINARY PRE-PROCESSING AND PACKING
Computing FP -> bool before invoking the loop saves a lot of compute
Packing 8 bools into a single UINT8 saves a lot of memory (size and access)
I can pre-compute all 256 combinations, too
• Just memory lookups and sums in the loop now

Runtime on a 25k-sample dataset:
NVIDIA Tesla V100 (using all 84 SMs): 2.5 minutes
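A minimal sketch of that pre-processing, with illustrative names (not the exact production code): turn the FP values into presence/absence bits once, pack eight of them per byte, and pre-compute the weighted sum selected by every possible byte value so the hot loop reduces to table lookups:

    #include <cstdint>
    #include <cstddef>

    // Pack 8 presence/absence bits (emb[i] > 0) into each byte, in one pass.
    void pack_bits(const double *emb, size_t n, uint8_t *packed) {
        for (size_t i = 0; i < n; i += 8) {
            uint8_t byte = 0;
            for (unsigned int b = 0; b < 8 && (i + b) < n; b++)
                byte |= (emb[i + b] > 0.0) ? (uint8_t)(1u << b) : 0;
            packed[i / 8] = byte;
        }
    }

    // For one group of 8 branch lengths, pre-compute the sum selected by every
    // possible byte value; the hot loop then just does psum[byte] lookups.
    void build_psum(const double lengths[8], double psum[256]) {
        for (unsigned int v = 0; v < 256; v++) {
            double s = 0.0;
            for (unsigned int b = 0; b < 8; b++)
                if (v & (1u << b)) s += lengths[b];
            psum[v] = s;
        }
    }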
LOTS OF ZEROES EVERYWHERE!
Since I have only 256 combinations, I got curious and checked the distribution
• More than 90% of the time it is a 0!
We were doing a huge amount of no-op compute (adding zero)
• The emb buffer is basically a sparse matrix!
Using UINT64 and adding a simple if (!= zero) check gets me another 3x speedup
• Ran out of time for further optimizations

Runtime on a 25k-sample dataset:
NVIDIA Tesla V100 (using all 84 SMs): 45 seconds

Basically a sparse matrix problem
Sparse packed version:

#pragma acc parallel loop …
for (sk) {
    for (stripe) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk * step_size + ik;
            unsigned int l = (k + stripe + 1) % n_samples;
            …
            double my_stripe = dm_stripe[k];
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples * e;
                uint64_t u = emb_packed[offset + k];
                uint64_t v = emb_packed[offset + k + stripe + 1];
                uint64_t o1 = u | v;
                if (o1 != 0) {  // zeros are prevalent
                    my_stripe += psum[        (o1       & 0xff)] +
                                 psum[0x100 + ((o1 >> 8) & 0xff)] +
                                 …
                                 psum[0x700 + ((o1 >> 56)      )];
                }
            }
            …
            dm_stripe[k] = my_stripe;
        }
    }
}

Pre-computing the per-byte sums (psum):

#pragma acc parallel loop …
for (unsigned int emb_el = 0; emb_el < embs_els; emb_el++) {
    for (unsigned int sub8 = 0; sub8 < 8; sub8++) {
        unsigned int emb8 = emb_el * 8 + sub8;
        TFloat *psum = &(sums[emb8 << 8]);
        TFloat *pl = &(lengths[emb8 * 8]);
        for (unsigned int b8_i = 0; b8_i < 0x100; b8_i++) {
            psum[b8_i] = (((b8_i >> 0) & 1) * pl[0]) +
                         (((b8_i >> 1) & 1) * pl[1]) +
                         …
                         (((b8_i >> 7) & 1) * pl[7]);
        }
    }
}
CPU speed now often the limiting factor
The original code was single threaded
• Relying on partitioning of the problem
• GPUs prefer the full problem in the loop
Using OpenMP for CPU parallelization
• Together with OpenACC for the GPUs, they make for a great pair
GPU compute is just so fast!

initialize(dm_stripe_buf);
#pragma acc data copy(dm_stripe_buf)
for (unsigned int k0 = 0; k0 < (tree.nparens / 2) - 1; k0 += chunk) {
    #pragma omp parallel for
    for (unsigned int i = 0; i < chunk; i += 64) {
        embed_packed(emb[i], tree, k0 + i);
        fill_lengths(lengths, tree, k0 + i);
    }
    #pragma acc update device(emb)
    #pragma acc wait
    #pragma acc data copyin(lengths)
    run_loop(dm_stripe_buf, emb, lengths);
}
#pragma acc wait
return dm_stripe_buf;
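As a side note, with the NVIDIA HPC SDK compilers this mixed OpenMP/OpenACC setup can be built from the same sources, for example by passing nvc++ both the -mp and -acc flags; the exact flags depend on the toolchain and target.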
THE PORTING TO GPUS WAS A MAJOR SUCCESS
Most of the time was spent in a tight loop
Easy to port to GPUs using OpenACC
But a deep understanding of the code was critical for maximum speedup
We gained significantly more from a better algorithm than from better HW
Having a single code base between CPU and GPU helped a lot in improving both code paths
Optimizing one side usually led to discoveries for the other
GPUs are still way faster than CPUs, so HW does matter
Way beyond our greatest hopes: 6000x
ENABLING SCIENCE THAT WOULD OTHERWISE NOT BE POSSIBLE
300k samples can now be computed in about an hour on a single node, vs. a heroic HPC job: a 6000x speedup
Supported by NSF grants DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and NIH grant DP1-AT010885.
ACKNOWLEDGMENTS
This work was partially funded by US National Science Foundation (NSF) grants
DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and by
US National Institutes of Health (NIH) grant DP1-AT010885.