SlideShare a Scribd company logo
1 of 1
Download to read offline
Porting and optimizing UniFrac for GPUs
Reducing microbiome analysis runtimes by orders of magnitude
Igor Sfiligoi Daniel McDonald Rob Knight
(isfiligoi@sdsc.edu) (danielmcdonald@ucsd.edu) (robknight@ucsd.edu)
University of California San Diego - La Jolla, CA, USA
This work was partially funded by US National Science Foundation (NSF) grants OAC-1826967, OAC-1541349 and CNS-1730158, and by US National Institutes of Health (NIH) grant DP1-AT010885.
Abstract
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another
(“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many
independent subproblems, and exhibits near linear scaling on multiple nodes, but does not scale linearly with the
number of CPU cores on a single node, indicating memory-bound compute.
Massively parallel algorithms, especially memory-heavy ones, are natural candidates for GPU compute. In this
poster we describe steps undertaken in porting and optimizing Striped Unifrac to GPUs. We chose to do the porting
using OpenACC, since it allows for a single code base for both GPU and CPU compute. The choice proved to be
wise, as the optimizations aimed at speeding up execution on GPUs helped CPU execution, too.
The GPUs have proven to be an excellent resource for this type of problem, reducing runtimes from days to hours.
for(unsigned int stripe = start;
stripe < stop; stripe++) {
dm_stripe = dm_stripes[stripe];
for(unsigned int j = 0;
j < n_samples / 4; j++) {
int k = j * 4;
double u1 = emb[k];
double u2 = emb[k+1];
double v1 = emb[k + stripe + 1];
double v2 = emb[k + stripe + 2];
…
dm_stripe[k] += (u1-v1)*length;
dm_stripe[k+1] += (u2-v2)*length;
}
}
#pragma acc parallel loop collapse(2) 
present(emb,dm_stripes_buf)
for(unsigned int stripe = start;
stripe < stop; stripe++) {
for(unsigned int j = 0;
j < n_samples / 4; j++) {
int idx =(stripe-start_idx)*n_samples;
double *dm_stripe =dm_stripes_buf+idx;
int k = j * 4;
double u1 = emb[k];
double u2 = emb[k+1];
double v1 = emb[k + stripe + 1];
double v2 = emb[k + stripe + 2];
…
dm_stripe[k] += (u1-v1)*length;
dm_stripe[k+1] += (u2-v2)*length;
}
}
#pragma acc parallel loop collapse(2) 
present(emb,dm_stripes_buf,length)
for(unsigned int stripe = start;
stripe < stop; stripe++) {
for(unsigned int k = 0;
k < n_samples ; k++) {
…
double my_stripe = dm_stripe[k];
#pragma acc loop seq
for (unsigned int e=0;
e<filled_embs; e++) {
uint64_t offset = n_samples*e;
double u = emb[offset+k];
double v = emb[offset+k+stripe+ 1];
my_stripe += (u-v)*length[e];
}
…
dm_stripe[k] += (u1-v1)*length;
}
}
#pragma acc parallel loop collapse(3) 
present(emb,dm_stripes_buf,length)
for(unsigned int sk = 0;
sk < sample_steps ; sk++) {
for(unsigned int stripe = start;
stripe < stop; stripe++) {
for(unsigned int ik = 0;
ik < step_size ; ik++) {
unsigned int k = sk*step_size + ik;
…
double my_stripe = dm_stripe[k];
#pragma acc loop seq
for (unsigned int e=0;
e<filled_embs; e++) {
uint64_t offset = n_samples*e;
double u = emb[offset+k];
double v = emb[offset+k+stripe+ 1];
my_stripe += (u-v)*length[e];
}
…
dm_stripe[k] += (u1-v1)*length;
}
}
}
Most time spent in small
area of the code, which is
implemented as a set of
tight loops.
OpenACC makes it trivial toport
to GPU compute.
Just decorate with a pragma.
But needed minor refactoring to have a unified buffer.
Memory writes much more
expensive than memory reads,
so cluster reads and minimize writes.
Also undo manual unrolls: were optimal for CPU, bad for GPU.
Reorder loops to maximize
cache reuse.
Intel Xeon E5-2680 v4 CPU
(using all 14 cores)
800 minutes (13 hours)
NVIDIA Tesla V100
(using all 84 SMs)
92 minutes (1.5 hours)
NVIDIA Tesla V100
(using all 84 SMs)
33 minutes
NVIDIA Tesla V100
(using all 84 SMs)
12 minutes
Measuring runtime computing
UniFrac on EMP dataset.
UniFrac was originally designed
and always implemented using
fp64 math operations. Thehigher-
precision floating point math was
used to maximize reliability of
the results.
Can we use fp32?
On mobile and gaming GPUs
fp64 compute is 32x slower
than fp32 compute.
We can!
We compared the results of the compute using
fp32-enabled and fp64-only code, using the same
EMP input, and observed a near identical result.
(Mantel R2 0.99999; p<0.001, comparing
pairwise distances in the two matrices).
E5-2680 v4 CPU
Original New
GPU
V100
GPU
2080TI
GPU
1080TI
GPU
1080
GPU
Mobile 1050
fp64 800 193 12 59 77 99 213
fp32 - 190 9.5 19 31 36 64
Runtime computing UniFrac on EMP dataset (Single Chip, in minutes)
Per chip
(in minutes)
128x CPU
E5-2680 v4
Original New
128x GPU
V100
4x GPU
V100
16x GPU
2080TI
16x GPU
1080TI
fp64 415 97 14 29 184 252
fp32 - 91 12 20 32 82
Aggregated
(in chip hours)
128x
E5-2680 v4 CPU
Original New
128x GPU
V100
4x GPU
V100
16x GPU
2080TI
16x GPU
1080TI
fp64 890 207 30 1.9 49 67
fp32 - 194 26 1.3 8.5 22
Runtime computing UniFrac on dataset containing 113,721 samples (Using multiple Chips)
Conclusions
Our work now allows many microbiome analyses which were
previously relegated to large compute clusters to be performed with
much lower resource requirements. Even the largest datasets currently
envisaged could be processed in reasonable time with just a handful of
server-class GPUs, while smaller but still sizable datasets like the
EMP can be processed even on GPU-enabled workstations.
We also explored the use of lower-precision floating point math to
effectively exploit consumer-grade GPUs, which are typical in
desktop and laptop setups. We conclude that fp32 math yields
nearly identical results to fp64, and should be adequate for the vast
majority of studies, making compute on GPU-enabled personal
devices, even laptops, a sufficient resource for this otherwise rate-
limiting step for many researchers.
An ordination plot of unweighted UniFrac distances over 113,721 samples
Xeon E5-2680 v4 CPU
(using all 14 cores)
193minutes (~3 hours)
Properly aligning the memory
buffers and picking the right value
for the grouping parameters can
make a 5x difference in speed.

More Related Content

What's hot

Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorchMayur Bhangale
 
Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Seiya Tokui
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDAprithan
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorchJun Young Park
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developersAbdul Muneer
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017Yu-Hsun (lymanblue) Lin
 
Fast indexes with roaring #gomtl-10
Fast indexes with roaring #gomtl-10 Fast indexes with roaring #gomtl-10
Fast indexes with roaring #gomtl-10 Daniel Lemire
 
Comparison of deep learning frameworks from a viewpoint of double backpropaga...
Comparison of deep learning frameworks from a viewpoint of double backpropaga...Comparison of deep learning frameworks from a viewpoint of double backpropaga...
Comparison of deep learning frameworks from a viewpoint of double backpropaga...Kenta Oono
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to ChainerSeiya Tokui
 
IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」Preferred Networks
 
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...Preferred Networks
 
Trelles_QnormBOSC2009
Trelles_QnormBOSC2009Trelles_QnormBOSC2009
Trelles_QnormBOSC2009bosc
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowAI Frontiers
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Overview of Chainer and Its Features
Overview of Chainer and Its FeaturesOverview of Chainer and Its Features
Overview of Chainer and Its FeaturesSeiya Tokui
 
GTC Japan 2016 Chainer feature introduction
GTC Japan 2016 Chainer feature introductionGTC Japan 2016 Chainer feature introduction
GTC Japan 2016 Chainer feature introductionKenta Oono
 

What's hot (20)

Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorch
 
Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+
 
Chainer v3
Chainer v3Chainer v3
Chainer v3
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDA
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
 
Chainer v4 and v5
Chainer v4 and v5Chainer v4 and v5
Chainer v4 and v5
 
Fast indexes with roaring #gomtl-10
Fast indexes with roaring #gomtl-10 Fast indexes with roaring #gomtl-10
Fast indexes with roaring #gomtl-10
 
Comparison of deep learning frameworks from a viewpoint of double backpropaga...
Comparison of deep learning frameworks from a viewpoint of double backpropaga...Comparison of deep learning frameworks from a viewpoint of double backpropaga...
Comparison of deep learning frameworks from a viewpoint of double backpropaga...
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to Chainer
 
IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
 
Faster Python, FOSDEM
Faster Python, FOSDEMFaster Python, FOSDEM
Faster Python, FOSDEM
 
Trelles_QnormBOSC2009
Trelles_QnormBOSC2009Trelles_QnormBOSC2009
Trelles_QnormBOSC2009
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
 
Deep parking
Deep parkingDeep parking
Deep parking
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Overview of Chainer and Its Features
Overview of Chainer and Its FeaturesOverview of Chainer and Its Features
Overview of Chainer and Its Features
 
GTC Japan 2016 Chainer feature introduction
GTC Japan 2016 Chainer feature introductionGTC Japan 2016 Chainer feature introduction
GTC Japan 2016 Chainer feature introduction
 

Similar to Porting and optimizing UniFrac for GPUs

Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareDaniel Blezek
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilersAnastasiaStulova
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rFerdinand Jamitzky
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascaleinside-BigData.com
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法MITSUNARI Shigeo
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Rakib Hossain
 
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra Umbra Software
 
Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel Intel® Software
 
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverS4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverPraveen Narayanan
 

Similar to Porting and optimizing UniFrac for GPUs (20)

Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics Hardware
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
 
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
 
Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel
 
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverS4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
 

More from Igor Sfiligoi

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROIgor Sfiligoi
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...Igor Sfiligoi
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Igor Sfiligoi
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingIgor Sfiligoi
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesIgor Sfiligoi
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateIgor Sfiligoi
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsIgor Sfiligoi
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeIgor Sfiligoi
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Igor Sfiligoi
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputIgor Sfiligoi
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsIgor Sfiligoi
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROIgor Sfiligoi
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstIgor Sfiligoi
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyIgor Sfiligoi
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Igor Sfiligoi
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsIgor Sfiligoi
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksIgor Sfiligoi
 
Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Igor Sfiligoi
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsIgor Sfiligoi
 

More from Igor Sfiligoi (20)

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYRO
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accounting
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resources
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rate
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance compute
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific Output
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobs
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYRO
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with Admiralty
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public Clouds
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud links
 
Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the Clouds
 

Recently uploaded

Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...
Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...
Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...Miss joya
 
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.MiadAlsulami
 
Ahmedabad Call Girls CG Road 🔝9907093804 Short 1500 💋 Night 6000
Ahmedabad Call Girls CG Road 🔝9907093804  Short 1500  💋 Night 6000Ahmedabad Call Girls CG Road 🔝9907093804  Short 1500  💋 Night 6000
Ahmedabad Call Girls CG Road 🔝9907093804 Short 1500 💋 Night 6000aliya bhat
 
Book Call Girls in Kasavanahalli - 7001305949 with real photos and phone numbers
Book Call Girls in Kasavanahalli - 7001305949 with real photos and phone numbersBook Call Girls in Kasavanahalli - 7001305949 with real photos and phone numbers
Book Call Girls in Kasavanahalli - 7001305949 with real photos and phone numbersnarwatsonia7
 
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original PhotosCall Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original Photosnarwatsonia7
 
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...Miss joya
 
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Booking
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment BookingHousewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Booking
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Bookingnarwatsonia7
 
Call Girls Jayanagar Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Jayanagar Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Jayanagar Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Jayanagar Just Call 7001305949 Top Class Call Girl Service Availablenarwatsonia7
 
Call Girls Electronic City Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Electronic City Just Call 7001305949 Top Class Call Girl Service A...Call Girls Electronic City Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Electronic City Just Call 7001305949 Top Class Call Girl Service A...narwatsonia7
 
Call Girls Thane Just Call 9910780858 Get High Class Call Girls Service
Call Girls Thane Just Call 9910780858 Get High Class Call Girls ServiceCall Girls Thane Just Call 9910780858 Get High Class Call Girls Service
Call Girls Thane Just Call 9910780858 Get High Class Call Girls Servicesonalikaur4
 
See the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy PlatformSee the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy PlatformKweku Zurek
 
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment BookingCall Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment Bookingnarwatsonia7
 
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Availablenarwatsonia7
 
Aspirin presentation slides by Dr. Rewas Ali
Aspirin presentation slides by Dr. Rewas AliAspirin presentation slides by Dr. Rewas Ali
Aspirin presentation slides by Dr. Rewas AliRewAs ALI
 
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...narwatsonia7
 
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call NowKolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call NowNehru place Escorts
 
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...Miss joya
 
Call Girls Service Chennai Jiya 7001305949 Independent Escort Service Chennai
Call Girls Service Chennai Jiya 7001305949 Independent Escort Service ChennaiCall Girls Service Chennai Jiya 7001305949 Independent Escort Service Chennai
Call Girls Service Chennai Jiya 7001305949 Independent Escort Service ChennaiNehru place Escorts
 
High Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service Jaipur
High Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service JaipurHigh Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service Jaipur
High Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service Jaipurparulsinha
 
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...narwatsonia7
 

Recently uploaded (20)

Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...
Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...
Russian Call Girls in Pune Riya 9907093804 Short 1500 Night 6000 Best call gi...
 
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
 
Ahmedabad Call Girls CG Road 🔝9907093804 Short 1500 💋 Night 6000
Ahmedabad Call Girls CG Road 🔝9907093804  Short 1500  💋 Night 6000Ahmedabad Call Girls CG Road 🔝9907093804  Short 1500  💋 Night 6000
Ahmedabad Call Girls CG Road 🔝9907093804 Short 1500 💋 Night 6000
 
Book Call Girls in Kasavanahalli - 7001305949 with real photos and phone numbers
Book Call Girls in Kasavanahalli - 7001305949 with real photos and phone numbersBook Call Girls in Kasavanahalli - 7001305949 with real photos and phone numbers
Book Call Girls in Kasavanahalli - 7001305949 with real photos and phone numbers
 
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original PhotosCall Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
Call Girl Service Bidadi - For 7001305949 Cheap & Best with original Photos
 
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...
 
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Booking
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment BookingHousewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Booking
Housewife Call Girls Hoskote | 7001305949 At Low Cost Cash Payment Booking
 
Call Girls Jayanagar Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Jayanagar Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Jayanagar Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Jayanagar Just Call 7001305949 Top Class Call Girl Service Available
 
Call Girls Electronic City Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Electronic City Just Call 7001305949 Top Class Call Girl Service A...Call Girls Electronic City Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Electronic City Just Call 7001305949 Top Class Call Girl Service A...
 
Call Girls Thane Just Call 9910780858 Get High Class Call Girls Service
Call Girls Thane Just Call 9910780858 Get High Class Call Girls ServiceCall Girls Thane Just Call 9910780858 Get High Class Call Girls Service
Call Girls Thane Just Call 9910780858 Get High Class Call Girls Service
 
See the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy PlatformSee the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy Platform
 
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment BookingCall Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
Call Girl Koramangala | 7001305949 At Low Cost Cash Payment Booking
 
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service AvailableCall Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
Call Girls Hsr Layout Just Call 7001305949 Top Class Call Girl Service Available
 
Aspirin presentation slides by Dr. Rewas Ali
Aspirin presentation slides by Dr. Rewas AliAspirin presentation slides by Dr. Rewas Ali
Aspirin presentation slides by Dr. Rewas Ali
 
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...
Russian Call Girl Brookfield - 7001305949 Escorts Service 50% Off with Cash O...
 
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call NowKolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
Kolkata Call Girls Services 9907093804 @24x7 High Class Babes Here Call Now
 
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...
Low Rate Call Girls Pune Esha 9907093804 Short 1500 Night 6000 Best call girl...
 
Call Girls Service Chennai Jiya 7001305949 Independent Escort Service Chennai
Call Girls Service Chennai Jiya 7001305949 Independent Escort Service ChennaiCall Girls Service Chennai Jiya 7001305949 Independent Escort Service Chennai
Call Girls Service Chennai Jiya 7001305949 Independent Escort Service Chennai
 
High Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service Jaipur
High Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service JaipurHigh Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service Jaipur
High Profile Call Girls Jaipur Vani 8445551418 Independent Escort Service Jaipur
 
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
Call Girls Kanakapura Road Just Call 7001305949 Top Class Call Girl Service A...
 

Porting and optimizing UniFrac for GPUs

  • 1. Porting and optimizing UniFrac for GPUs Reducing microbiome analysis runtimes by orders of magnitude Igor Sfiligoi Daniel McDonald Rob Knight (isfiligoi@sdsc.edu) (danielmcdonald@ucsd.edu) (robknight@ucsd.edu) University of California San Diego - La Jolla, CA, USA This work was partially funded by US National Science Foundation (NSF) grants OAC-1826967, OAC-1541349 and CNS-1730158, and by US National Institutes of Health (NIH) grant DP1-AT010885. Abstract UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another (“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many independent subproblems, and exhibits near linear scaling on multiple nodes, but does not scale linearly with the number of CPU cores on a single node, indicating memory-bound compute. Massively parallel algorithms, especially memory-heavy ones, are natural candidates for GPU compute. In this poster we describe steps undertaken in porting and optimizing Striped Unifrac to GPUs. We chose to do the porting using OpenACC, since it allows for a single code base for both GPU and CPU compute. The choice proved to be wise, as the optimizations aimed at speeding up execution on GPUs helped CPU execution, too. The GPUs have proven to be an excellent resource for this type of problem, reducing runtimes from days to hours. for(unsigned int stripe = start; stripe < stop; stripe++) { dm_stripe = dm_stripes[stripe]; for(unsigned int j = 0; j < n_samples / 4; j++) { int k = j * 4; double u1 = emb[k]; double u2 = emb[k+1]; double v1 = emb[k + stripe + 1]; double v2 = emb[k + stripe + 2]; … dm_stripe[k] += (u1-v1)*length; dm_stripe[k+1] += (u2-v2)*length; } } #pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf) for(unsigned int stripe = start; stripe < stop; stripe++) { for(unsigned int j = 0; j < n_samples / 4; j++) { int idx =(stripe-start_idx)*n_samples; double *dm_stripe =dm_stripes_buf+idx; int k = j * 4; double u1 = emb[k]; double u2 = emb[k+1]; double v1 = emb[k + stripe + 1]; double v2 = emb[k + stripe + 2]; … dm_stripe[k] += (u1-v1)*length; dm_stripe[k+1] += (u2-v2)*length; } } #pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf,length) for(unsigned int stripe = start; stripe < stop; stripe++) { for(unsigned int k = 0; k < n_samples ; k++) { … double my_stripe = dm_stripe[k]; #pragma acc loop seq for (unsigned int e=0; e<filled_embs; e++) { uint64_t offset = n_samples*e; double u = emb[offset+k]; double v = emb[offset+k+stripe+ 1]; my_stripe += (u-v)*length[e]; } … dm_stripe[k] += (u1-v1)*length; } } #pragma acc parallel loop collapse(3) present(emb,dm_stripes_buf,length) for(unsigned int sk = 0; sk < sample_steps ; sk++) { for(unsigned int stripe = start; stripe < stop; stripe++) { for(unsigned int ik = 0; ik < step_size ; ik++) { unsigned int k = sk*step_size + ik; … double my_stripe = dm_stripe[k]; #pragma acc loop seq for (unsigned int e=0; e<filled_embs; e++) { uint64_t offset = n_samples*e; double u = emb[offset+k]; double v = emb[offset+k+stripe+ 1]; my_stripe += (u-v)*length[e]; } … dm_stripe[k] += (u1-v1)*length; } } } Most time spent in small area of the code, which is implemented as a set of tight loops. OpenACC makes it trivial toport to GPU compute. Just decorate with a pragma. But needed minor refactoring to have a unified buffer. Memory writes much more expensive than memory reads, so cluster reads and minimize writes. Also undo manual unrolls: were optimal for CPU, bad for GPU. Reorder loops to maximize cache reuse. Intel Xeon E5-2680 v4 CPU (using all 14 cores) 800 minutes (13 hours) NVIDIA Tesla V100 (using all 84 SMs) 92 minutes (1.5 hours) NVIDIA Tesla V100 (using all 84 SMs) 33 minutes NVIDIA Tesla V100 (using all 84 SMs) 12 minutes Measuring runtime computing UniFrac on EMP dataset. UniFrac was originally designed and always implemented using fp64 math operations. Thehigher- precision floating point math was used to maximize reliability of the results. Can we use fp32? On mobile and gaming GPUs fp64 compute is 32x slower than fp32 compute. We can! We compared the results of the compute using fp32-enabled and fp64-only code, using the same EMP input, and observed a near identical result. (Mantel R2 0.99999; p<0.001, comparing pairwise distances in the two matrices). E5-2680 v4 CPU Original New GPU V100 GPU 2080TI GPU 1080TI GPU 1080 GPU Mobile 1050 fp64 800 193 12 59 77 99 213 fp32 - 190 9.5 19 31 36 64 Runtime computing UniFrac on EMP dataset (Single Chip, in minutes) Per chip (in minutes) 128x CPU E5-2680 v4 Original New 128x GPU V100 4x GPU V100 16x GPU 2080TI 16x GPU 1080TI fp64 415 97 14 29 184 252 fp32 - 91 12 20 32 82 Aggregated (in chip hours) 128x E5-2680 v4 CPU Original New 128x GPU V100 4x GPU V100 16x GPU 2080TI 16x GPU 1080TI fp64 890 207 30 1.9 49 67 fp32 - 194 26 1.3 8.5 22 Runtime computing UniFrac on dataset containing 113,721 samples (Using multiple Chips) Conclusions Our work now allows many microbiome analyses which were previously relegated to large compute clusters to be performed with much lower resource requirements. Even the largest datasets currently envisaged could be processed in reasonable time with just a handful of server-class GPUs, while smaller but still sizable datasets like the EMP can be processed even on GPU-enabled workstations. We also explored the use of lower-precision floating point math to effectively exploit consumer-grade GPUs, which are typical in desktop and laptop setups. We conclude that fp32 math yields nearly identical results to fp64, and should be adequate for the vast majority of studies, making compute on GPU-enabled personal devices, even laptops, a sufficient resource for this otherwise rate- limiting step for many researchers. An ordination plot of unweighted UniFrac distances over 113,721 samples Xeon E5-2680 v4 CPU (using all 14 cores) 193minutes (~3 hours) Properly aligning the memory buffers and picking the right value for the grouping parameters can make a 5x difference in speed.