Adapted slides from the presentation at the SC20 NVIDIA Virtual Theater: https://gateway.on24.com/wcc/eh/1640195/lp/2819881/Speeding-Microbiome-Research-by-Three-Orders-of-Magnitude/
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another. It is used for studying the impact of the gut population of microbes, which influence diseases ranging from ulcers to heart disease to autism to COVID-19. Computing UniFrac on modest sample sizes used to take a workday on a server-class, CPU-only node, while modern datasets would require a large compute cluster to be feasible. We thus decided to port the code to GPUs, using OpenACC to keep a single code base. The differences in parallelism between CPU and GPU compute made us rethink the compute logic of the code, and we incrementally improved the runtime by over three orders of magnitude. Computing the same modest sample size now takes only 30 seconds on an NVIDIA A100 Tensor Core GPU, four minutes on an NVIDIA GeForce GTX 1050 GPU-equipped laptop, and about five minutes on a server-class CPU. The difference is even bigger for cutting-edge datasets, where we observe approximately a 6000x speedup on an A100 GPU and a 250x speedup for the CPU-only code base. To fully utilize the larger GPUs, however, we also need a large CPU, since some of the code is still CPU-only. Using OpenACC-supported async data transfers allows us to fully overlap compute between the two chips.
Speeding Microbiome Research by Three Orders of Magnitude
Igor Sfiligoi, UC San Diego & SDSC, November 2020
SPEEDING MICROBIOME RESEARCH BY THREE ORDERS OF MAGNITUDE
Presented at the NVIDIA Virtual Theater
[Figure: bar chart of wallclock time, in hours, for Original CPU, Xeon Gold, V100, A100, and RTX8000; chart annotations: 8000, 25x, 6000x]
THE CONTEXT
Microbiome research has expanded over the years from analyzing handfuls of
samples to hundreds of thousands.
What worked at small scale does not work at the more recent scales!
UniFrac is one such tool and is used for comparing microbiome profiles to one
another. One important field is studying the impact of the gut population of
microbes, which influence diseases ranging from ulcers to heart disease to
autism to COVID-19.
My colleagues at UCSD were hurting due to excessive runtimes of the tool, so
they asked me to explore whether porting it to GPUs would be an advantageous option.
Since I am here to talk, I think you can guess how that worked out.
WE ARE WHAT WE EAT
The microbiome is critical for health:
• Produces compounds your body needs which it cannot otherwise produce
• Disruptions in the microbiome are associated with a range of diseases
• Many non-communicable diseases, like Alzheimer's, various cancers, cardiovascular disease and much more, are associated with the microbiome
On a more recent theme:
• Many high-risk populations for COVID-19 also have diseases known to be associated with the microbiome
https://www.biotechniques.com/multimedia/archive/00252/microbiome2_252150a.jpg
KNIGHT LAB AT UCSD LEADING AMERICAN GUT PROJECT
Collecting specimens, DNA sequencing samples, and analyzing the results
Daniel McDonald, Rob Knight
SAMPLE RELATIONSHIPS
A fundamental component of microbiome analysis is understanding how entire microbial communities relate to each other. This requires pairwise comparisons of all samples in a dataset.
[Figure: distance matrix]
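To make the all-pairs structure concrete, here is a minimal illustrative sketch (not the actual UniFrac implementation) of filling such a symmetric distance matrix; the per-pair metric below is a simple Euclidean stand-in:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Stand-in per-pair metric (plain Euclidean distance); UniFrac would
    // instead compare the two profiles using the phylogenetic tree.
    double pair_distance(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }

    // Fill the full n x n distance matrix from n sample profiles.
    // The work is n*(n-1)/2 pairs, i.e. it grows quadratically with n.
    std::vector<double> distance_matrix(const std::vector<std::vector<double>>& samples) {
        const size_t n = samples.size();
        std::vector<double> dm(n * n, 0.0);
        for (size_t i = 0; i < n; i++) {
            for (size_t j = i + 1; j < n; j++) {
                const double d = pair_distance(samples[i], samples[j]);
                dm[i * n + j] = d;  // the matrix is symmetric,
                dm[j * n + i] = d;  // so fill both halves
            }
        }
        return dm;
    }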
UNIFRAC DISTANCE
• Incorporates information on the evolutionary relatedness of community members by including the phylogeny of the observed organisms in the computation.
• Other measures, such as Euclidean distance, implicitly assume all organisms are equally related.
Lozupone and Knight, Applied and Environmental Microbiology, 2005
Samples whose organisms are evolutionarily very similar will have a small UniFrac distance. On the other hand, two samples composed of very different organisms will have a large UniFrac distance.
A distance metric
https://en.wikipedia.org/wiki/UniFrac
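For reference, the unweighted form of the metric can be written over the branches i of the phylogenetic tree, with branch lengths b_i (a simplified statement; see the citation above for the exact definitions and the weighted variants):

    d_{\mathrm{UniFrac}}(A,B) =
      \frac{\sum_i b_i \,\left| \mathbf{1}[i \in A] - \mathbf{1}[i \in B] \right|}
           {\sum_i b_i \,\max\!\left(\mathbf{1}[i \in A],\, \mathbf{1}[i \in B]\right)}

where \mathbf{1}[i \in A] is 1 if branch i leads to at least one organism observed in sample A, and 0 otherwise. Branches shared by both samples cancel in the numerator, which is why evolutionarily similar communities end up with a small distance.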
STARTING WITH STRIPED UNIFRAC
A recent (2018) algorithm that is optimized for both speed and parallelism (on CPUs).
It allowed microbiome researchers to analyze tens of thousands of samples from modern studies.
But going into the 100k range and beyond was becoming too expensive: the runtime scales approximately quadratically with the number of samples.
State of the art as of early 2020.
From: Striped UniFrac: enabling microbiome analysis at unprecedented scale
[Figure: projected runtimes using the early 2020 CPU-only code]
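A back-of-the-envelope count makes the quadratic scaling concrete: the number of pairwise distances for n samples is

    \binom{n}{2} = \frac{n(n-1)}{2}, \qquad
    \binom{25\,000}{2} \approx 3.1 \times 10^{8}, \qquad
    \binom{300\,000}{2} \approx 4.5 \times 10^{10},

so going from 25k to roughly 300k samples multiplies the number of pairs by about 140x, before even counting the per-pair cost.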
COULD PORTING TO GPU ALLOW US TO DRASTICALLY REDUCE THE RUNTIME?
Let me take a look!
Igor Sfiligoi
Where is most of the time spent?
Turns out to be a tight double loop
• With many iterations
• All independent
Conceptually not too far away from BLAS
• Should fly on GPUs!
Found with a simple stack sampling method.

for (unsigned int stripe = start; stripe < stop; stripe++) {
    dm_stripe = dm_stripes[stripe];
    for (unsigned int k = 0; k < n_samples; k++) {
        unsigned int l = (k + stripe + 1) % n_samples;
        double u1 = emb[k];
        double v1 = emb[k + stripe + 1];
        …
        dm_stripe[k] += fabs(u1 - v1) * length;
    }
}
OpenACC makes it easy to have a first port
Almost as easy as adding a decorator
• Too bad arrays of pointers are not well supported
• Thus it required a bit of refactoring (a sketch of the flattening follows the runtime numbers below)
Done in less than a week
• A couple of days FTE
8x speedup CPU -> GPU (chip vs chip)

#pragma acc parallel loop collapse(2) present(emb, dm_stripes_buf)
for (unsigned int stripe = start; stripe < stop; stripe++) {
    for (unsigned int k = 0; k < n_samples; k++) {
        int idx = (stripe - start_idx) * n_samples;
        double *dm_stripe = dm_stripes_buf + idx;
        unsigned int l = (k + stripe + 1) % n_samples;
        double u1 = emb[k];
        double v1 = emb[k + stripe + 1];
        …
        dm_stripe[k] += fabs(u1 - v1) * length;
    }
}

Runtime on a 25k-sample dataset:
Intel Xeon E5-2680 v4 (using all 14 cores): 800 minutes (13 hours)
NVIDIA Tesla V100 (using all 84 SMs): 92 minutes (1.5 hours)
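The refactoring mentioned above boiled down to replacing the array of per-stripe pointers with one flat buffer plus index arithmetic, which OpenACC maps to the device much more easily. A minimal sketch of the idea, with illustrative names (not the actual UniFrac data structures):

    #include <cstddef>

    // Before: an array of per-stripe pointers, awkward to map to the GPU as a unit.
    //   double *dm_stripes[n_stripes];   // each points to n_samples doubles
    //   dm_stripes[stripe][k] += ...;
    //
    // After: one contiguous buffer; a single data clause covers everything.
    void accumulate_stripes(double *dm_stripes_buf,
                            unsigned int n_stripes, unsigned int n_samples) {
        const size_t total = (size_t)n_stripes * n_samples;
        #pragma acc data copy(dm_stripes_buf[0:total])
        {
            #pragma acc parallel loop collapse(2) present(dm_stripes_buf)
            for (unsigned int stripe = 0; stripe < n_stripes; stripe++) {
                for (unsigned int k = 0; k < n_samples; k++) {
                    double *dm_stripe = dm_stripes_buf + (size_t)stripe * n_samples;
                    dm_stripe[k] += 1.0;  // placeholder for the real per-pair update
                }
            }
        }
    }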
But how it is used is just as important
The emb input buffers must be prepared for each invocation
• Data movement latency
The main buffer is traversed in full every time
• No cache reuse
Large number of invocations
• GPU invocation overhead

initialize(dm_stripe_buf);
#pragma acc data copy(dm_stripe_buf)
for (unsigned int k = 0; k < (tree.nparens / 2) - 1; k++) {
    // must be run sequentially on the CPU
    // (logic and deep function nesting);
    // rewrites all of the emb buffer
    embed(emb, tree, k);
    // on GPU
    #pragma acc data copyin(emb)
    run_loop(dm_stripe_buf, emb, tree.getlen(k));
}
return dm_stripe_buf;

Bad for both CPU and GPU code paths
Batching to the rescue
Batching many emb buffers
• Improves memory locality for the main buffer
• Reduces GPU invocation overhead and allows for overlap with the CPU (see the sketch below)
• At the expense of more memory use
Cache-awareness in the loop becomes very important
Additional 8x speedup on GPU (total 64x)
And a decent 4x speedup on CPU

#pragma acc parallel loop collapse(3) async present(emb, dm_stripes_buf, length)
for (sk) {  // swap loop order and tile
    for (stripe) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk * step_size + ik;
            unsigned int l = (k + stripe + 1) % n_samples;
            …
            double my_stripe = dm_stripe[k];
            #pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples * e;
                double u = emb[offset + k];
                double v = emb[offset + k + stripe + 1];
                my_stripe += fabs(u - v) * length[e];
            }
            …
            dm_stripe[k] = my_stripe;
        }
    }
}
#ifdef _OPENACC
std::swap(emb, emb_alt);
#endif

Runtime on a 25k-sample dataset:
Intel Xeon E5-2680 v4 (using all 14 cores): 193 minutes (3.2 hours)
NVIDIA Tesla V100 (using all 84 SMs): 12 minutes
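The async clause and the emb/emb_alt swap are what enable the CPU/GPU overlap: while the GPU kernel works on one batch, the CPU already prepares the next batch in the other buffer. Below is a stripped-down sketch of that producer/consumer pattern; the buffer size, fill_next_batch and the trivial kernel body are illustrative placeholders, not the actual UniFrac driver:

    #include <algorithm>   // std::swap

    static const unsigned int batch_size = 4096;

    // Placeholder CPU-side producer; the real code embeds tree nodes here.
    static void fill_next_batch(double *buf, unsigned int batch) {
        for (unsigned int k = 0; k < batch_size; k++) buf[k] = batch + k;
    }

    // out must have at least batch_size entries.
    void run_batches(double *out, unsigned int n_batches) {
        double *emb     = new double[batch_size];
        double *emb_alt = new double[batch_size];

        // Keep both batch buffers and the output resident on the device.
        #pragma acc enter data create(emb[0:batch_size], emb_alt[0:batch_size]) copyin(out[0:batch_size])

        for (unsigned int batch = 0; batch < n_batches; batch++) {
            fill_next_batch(emb_alt, batch);   // CPU work, overlaps the async GPU kernel

            #pragma acc wait                    // previous batch's kernel must be finished
            std::swap(emb, emb_alt);            // hand the freshly filled buffer to the GPU

            #pragma acc update device(emb[0:batch_size])
            #pragma acc parallel loop async present(emb, out)
            for (unsigned int k = 0; k < batch_size; k++)
                out[k] += emb[k];               // placeholder for the real per-pair update
        }

        #pragma acc wait                        // drain the last kernel
        #pragma acc exit data copyout(out[0:batch_size]) delete(emb[0:batch_size], emb_alt[0:batch_size])
        delete[] emb;
        delete[] emb_alt;
    }

On a CPU-only build the OpenACC directives compile to no-ops and the same loop simply runs serially, which is how the single code base is preserved.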
WE WERE PRETTY HAPPY WITH SPEEDUP
Switching to fp32 added an additional boost
[Figure: speedups of 80x and 600x, Spring 2020]
SEVERAL FLAVORS OF UNIFRAC
There are several versions of UniFrac.
Two of the popular ones are true FP compute (like the previous slides); one is binary in nature.
We expected the binary version to be significantly faster, but it was not!
Binary operations only in the tight loop
Moreover, the same emb buffer is being read multiple times
• FP -> bool conversion every single time
Full FP logic still needed

#pragma acc parallel loop …
for (sk) {
    for (stripe) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk * step_size + ik;
            unsigned int l = (k + stripe + 1) % n_samples;
            …
            double my_stripe = dm_stripe[k];
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples * e;
                bool u = emb[offset + k] > 0;
                bool v = emb[offset + k + stripe + 1] > 0;
                my_stripe += (u ^ v) * length[e];
            }
            …
            dm_stripe[k] = my_stripe;
        }
    }
}
BINARY PRE-PROCESSING AND PACKING
Computing FP -> bool before invoking the loop saves a lot of compute
Packing 8 bools into a single UINT8 saves a lot of memory (size and access)
I can pre-compute all 256 combinations, too
• Just memory lookups and sums in the loop now

Runtime on a 25k-sample dataset:
NVIDIA Tesla V100 (using all 84 SMs): 2.5 minutes
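A minimal sketch of that pre-processing, with illustrative names (not the exact production code): turn the FP values into presence/absence bits once, pack eight of them per byte, and pre-compute the weighted sum selected by every possible byte value so the hot loop reduces to table lookups:

    #include <cstdint>
    #include <cstddef>

    // Pack 8 presence/absence bits (emb[i] > 0) into each byte, in one pass.
    void pack_bits(const double *emb, size_t n, uint8_t *packed) {
        for (size_t i = 0; i < n; i += 8) {
            uint8_t byte = 0;
            for (unsigned int b = 0; b < 8 && (i + b) < n; b++)
                byte |= (emb[i + b] > 0.0) ? (uint8_t)(1u << b) : 0;
            packed[i / 8] = byte;
        }
    }

    // For one group of 8 branch lengths, pre-compute the sum selected by every
    // possible byte value; the hot loop then just does psum[byte] lookups.
    void build_psum(const double lengths[8], double psum[256]) {
        for (unsigned int v = 0; v < 256; v++) {
            double s = 0.0;
            for (unsigned int b = 0; b < 8; b++)
                if (v & (1u << b)) s += lengths[b];
            psum[v] = s;
        }
    }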
LOTS OF ZEROES EVERYWHERE!
Since I have only 256 combinations, I got curious and checked the distribution
• More than 90% of the time it is a 0!
We were doing a huge amount of no-op compute (adding zero)
• The emb buffer is basically a sparse matrix!
Using UINT64 and adding a simple if (!= zero) check gets me another 3x speedup
• Ran out of time for further optimizations

Runtime on a 25k-sample dataset:
NVIDIA Tesla V100 (using all 84 SMs): 45 seconds

Basically a sparse matrix problem
Sparse packed version:

#pragma acc parallel loop …
for (sk) {
    for (stripe) {
        for (unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk * step_size + ik;
            unsigned int l = (k + stripe + 1) % n_samples;
            …
            double my_stripe = dm_stripe[k];
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples * e;
                uint64_t u = emb_packed[offset + k];
                uint64_t v = emb_packed[offset + k + stripe + 1];
                uint64_t o1 = u | v;
                if (o1 != 0) {  // zeros are prevalent
                    my_stripe += psum[        (o1       & 0xff)] +
                                 psum[0x100 + ((o1 >> 8) & 0xff)] +
                                 …
                                 psum[0x700 + ((o1 >> 56)      )];
                }
            }
            …
            dm_stripe[k] = my_stripe;
        }
    }
}

Pre-computing the per-byte sums (psum):

#pragma acc parallel loop …
for (unsigned int emb_el = 0; emb_el < embs_els; emb_el++) {
    for (unsigned int sub8 = 0; sub8 < 8; sub8++) {
        unsigned int emb8 = emb_el * 8 + sub8;
        TFloat *psum = &(sums[emb8 << 8]);
        TFloat *pl = &(lengths[emb8 * 8]);
        for (unsigned int b8_i = 0; b8_i < 0x100; b8_i++) {
            psum[b8_i] = (((b8_i >> 0) & 1) * pl[0]) +
                         (((b8_i >> 1) & 1) * pl[1]) +
                         …
                         (((b8_i >> 7) & 1) * pl[7]);
        }
    }
}
CPU speed now often the limiting factor
The original code was single threaded
• Relying on partitioning of the problem
• GPUs prefer the full problem in the loop
Using OpenMP for CPU parallelization
• Together with OpenACC for the GPUs, they make for a great pair
GPU compute is just so fast!

initialize(dm_stripe_buf);
#pragma acc data copy(dm_stripe_buf)
for (unsigned int k0 = 0; k0 < (tree.nparens / 2) - 1; k0 += chunk) {
    #pragma omp parallel for
    for (unsigned int i = 0; i < chunk; i += 64) {
        embed_packed(emb[i], tree, k0 + i);
        fill_lengths(lengths, tree, k0 + i);
    }
    #pragma acc update device(emb)
    #pragma acc wait
    #pragma acc data copyin(lengths)
    run_loop(dm_stripe_buf, emb, lengths);
}
#pragma acc wait
return dm_stripe_buf;
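As a side note, with the NVIDIA HPC SDK compilers this mixed OpenMP/OpenACC setup can be built from the same sources, for example by passing nvc++ both the -mp and -acc flags; the exact flags depend on the toolchain and target.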
THE PORTING TO GPUS WAS A MAJOR SUCCESS
Most of the time was spent in a tight loop
Easy to port to GPUs using OpenACC
But a deep understanding of the code was critical for maximum speedup
We gained significantly more from a better algorithm than from better HW
Having a single code base between CPU and GPU helped a lot in improving both code paths
Optimizing one side usually led to discoveries for the other
GPUs are still way faster than CPUs, so HW does matter
Way beyond our greatest hopes: 6000x
ENABLING SCIENCE THAT WOULD OTHERWISE NOT BE POSSIBLE
300k samples can now be computed in about an hour on a single node, vs. a heroic HPC job: a 6000x speedup
Supported by NSF grants DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and NIH grant DP1-AT010885.
ACKNOWLEDGMENTS
This work was partially funded by US National Science Foundation (NSF) grants
DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and by
US National Institutes of Health (NIH) grant DP1-AT010885.