High Performance Computing (HPC) applications are mapped to a cluster of multi-core processors communicating over high speed interconnects. Additional computational power is harnessed with hardware accelerators such as Graphics Processing Unit (GPU) cards and Field Programmable Gate Arrays (FPGAs). Particle Image Velocimetry (PIV) is an embarrassingly parallel application that can benefit from acceleration using hybrid architectures. The PIV application is mapped to an Nvidia GPU system, resulting in a 3x speedup over a dual quad-core Intel processor implementation. The design methodology used to implement the PIV application on a specialized FPGA platform under development is briefly described, and the resulting performance benefit is analyzed.
Accelerating Particle Image Velocimetry using Hybrid Architectures
1. Accelerating Particle Image Velocimetry using Hybrid Architectures
Vivek Venugopal
Cameron Patterson
Kevin Shinpaugh
2. Cardiovascular Disease (CVD)
[Pie chart: causes of death in the United States for 2005, comparing Cardiovascular Disease, Cancer, Chronic Lung/Respiratory Disease, Accidents, Alzheimer's Disease, and Other (reference: American Heart Association)]
3. Research@AEThER Lab
[Figures: intravascular pressure distribution within the heart; location of stent; region of interest]
• Research focuses on cardiovascular fluid dynamics and stent hemodynamics, useful in designing intravascular stents for treating CVD.
4. Research@AEThER Lab
[Photos: PIV setup; DPIV camera capturing the laser-illuminated fluid flow. (Top left): photo of entire setup before installation. (Top right): data acquisition and control. (Bottom left): 3-axis motorized traverse for focusing. (Bottom right): standard DPIV camera.]
• The AEThER lab uses the Particle Image Velocimetry (PIV) technique to track the motion of particles in an illuminated flow field.
5. Motivation
[Diagram: the PIV experimental setup produces Case 1 through Case 100 for each of stent configurations 1 through 20; each case = 1250 image pairs x 5 MB = 6.25 GB]
• Time for FlowIQ analysis of one image pair on a 2 GHz Xeon processor = 16 minutes.
• 1250 image pairs x 100 cases x 20 stent configurations = 2.6 years
6. FFT-based PIV algorithm
[Diagram: an interrogation zone of (16 x 16), (32 x 32), or (64 x 64) pixels is taken from Image 1 (time t, zone 1) and from Image 2 (time t + dt, zone 2); each zone is passed through an FFT, the two spectra are multiplied, an IFFT yields the correlation plane, and a reduction step produces the motion vector]
• Peak detection using FFT blocks
• Reduction block: sub-pixel algorithm and filtering to determine the velocity value (a sketch of this stage follows below)
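To make the pipeline concrete, here is a minimal CUDA sketch of the correlation stage for one interrogation zone, assuming single-precision complex zone data already resident on the GPU; the function and kernel names are illustrative, not the actual FlowIQ kernels. cuFFT supplies the transforms, and the conjugate multiply plays the role of the complex_mul/complex_conj kernels profiled on slide 12.

```cuda
// Minimal sketch: FFT -> conjugate multiply -> IFFT -> (peak search).
#include <cufft.h>
#include <cuda_runtime.h>

#define ZONE 32              // (32 x 32) interrogation zone
#define N (ZONE * ZONE)

// Multiply the zone 1 spectrum by the complex conjugate of the zone 2
// spectrum; the IFFT of this product is the cross-correlation plane.
__global__ void conj_multiply(cufftComplex *a, const cufftComplex *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        cufftComplex p;
        p.x = a[i].x * b[i].x + a[i].y * b[i].y;  // Re(a * conj(b))
        p.y = a[i].y * b[i].x - a[i].x * b[i].y;  // Im(a * conj(b))
        a[i] = p;
    }
}

// d_zone1 and d_zone2 each hold one interrogation zone in device memory.
void correlate_zone(cufftComplex *d_zone1, cufftComplex *d_zone2)
{
    cufftHandle plan;
    cufftPlan2d(&plan, ZONE, ZONE, CUFFT_C2C);

    cufftExecC2C(plan, d_zone1, d_zone1, CUFFT_FORWARD);  // FFT of zone 1
    cufftExecC2C(plan, d_zone2, d_zone2, CUFFT_FORWARD);  // FFT of zone 2
    conj_multiply<<<(N + 255) / 256, 256>>>(d_zone1, d_zone2);
    cufftExecC2C(plan, d_zone1, d_zone1, CUFFT_INVERSE);  // correlation plane
                                                          // (unnormalized)
    // A reduction kernel (find_corrmax in the profile) then locates the
    // correlation peak; a sub-pixel fit refines it into the motion vector.
    cufftDestroy(plan);
}
```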
7. Implementation platform
• NVIDIA Unified Driver Architecture (UDA). The NVIDIA Tesla C1060 GPU computing board is an add-in card based on the Tesla C1060 GPU. It has a PCI Express full-height form factor and is targeted as a high performance computing (HPC) solution for PCI Express host workstations. GPU computing board-level products do not have display connectors and are specifically designed for computing; processor clocks, memory configuration, and computing features will evolve differently than graphics board products.
• System G: 325 Apple Mac Pro nodes, 2 quad-core Intel Xeon processors with 8 GB RAM per node, running Linux CentOS
• Tesla C1060 specifications: one GPU with 240 streaming cores (thread processors) clocked at 1.3 GHz, 4 GB dedicated onboard memory; fits in one full-length, dual slot with one open PCI Express x16 slot
[Photos: NVIDIA Tesla C1060 GPU computing board]
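These specifications can be confirmed at runtime; the short program below is a sketch (assumed, not part of the deck) that queries device 0 through the CUDA runtime API.

```cuda
// Print the properties quoted above for device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // On a Tesla C1060 this reports 30 multiprocessors (240 scalar cores),
    // a ~1.3 GHz core clock, and 4 GB of global memory.
    printf("%s: %d MPs, %.2f GHz, %zu MB\n", prop.name,
           prop.multiProcessorCount, prop.clockRate / 1e6,
           prop.totalGlobalMem >> 20);
    return 0;
}
```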
11. PRO-PART design flow
[Diagram: the inputs to the PRO-PART design flow are a CFG representation of the application algorithm plus a component specification of the implementation platform; the flow targets the NOFIS platform. Stages: the CFG captures the communication flow; the component specification and process partitioning fix the communication resource specification; the communication cores are then configured and their values generated automatically; the result is mapped to hardware]
• Automatic synthesis of communication cores depending on the process partitioning
• Library of cores depending on the communication link specification
• Determine max. flow control values and configure the parameters
12. CUDA implementation results
[Bar chart: per-kernel GPU time in microseconds (log scale, 1 to 1,000,000) for the CUDA kernel functions zone_create, real2complex, complex_mul, complex_conj, conj_symm, c2c_radix2_sp, find_corrmax, transpose, x_field, y_field, indx_reorder, meshgrid_x, meshgrid_y, fft_indx, cc_indx, and memcopy, comparing CUDA 2.1 with memcopy, CUDA 2.2 with memcopy, and CUDA 2.2 with zerocopy. CUDA profile graph comparison.]
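The memcopy versus zerocopy comparison refers to how image data reaches the GPU. The sketch below contrasts the two styles under stated assumptions: the buffer names are hypothetical, and zero-copy requires CUDA 2.2 or later with mapped pinned memory.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

void transfer_styles(size_t bytes)
{
    // Must be set before the CUDA context is created for zero-copy to work.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Style 1: explicit memcopy -- the host buffer is staged into device
    // memory, which appears as the memcopy bar in the profile.
    float *h_buf = (float *)malloc(bytes);
    float *d_buf;
    cudaMalloc((void **)&d_buf, bytes);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    // Style 2: zero-copy (CUDA 2.2+) -- mapped, pinned host memory that
    // kernels read directly over PCI Express; no separate memcopy call.
    float *h_mapped, *d_alias;
    cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_alias, h_mapped, 0);
    // kernel<<<grid, block>>>(d_alias);  // would read host memory in place

    cudaFreeHost(h_mapped);
    cudaFree(d_buf);
    free(h_buf);
}
```

Zero-copy avoids the staging copy entirely, which is why the memcopy bar disappears in the CUDA 2.2 zerocopy profile, at the cost of every kernel access traversing PCI Express.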
13. Results
[Bar chart: execution time in seconds per execution device, approximately CPU 1.05 s, FPGA 0.65 s, GPU 0.34 s]
• Linear speedup with the number of cores
• GPU: 3x speedup
14. Conclusions
• Nvidia's Tesla C1060 GPU is faster than both the sequential CPU implementation and the NOFIS FPGA implementation for the PIV application
• FPGAs with custom communication interfaces provide better throughput
• The PRO-PART design flow is expected to reduce design development time (work in progress)
• The ideal hybrid platform consists of CPU and GPU for computation and FPGA for data movement
16. Research Justification
[Diagrams: a streaming architecture with a large number of PEs (PE 1 through PE 36) requires more than one board, so four ML310 boards are connected in a mesh over Aurora links, driven by the same clock value but different clock sources. Inside an ML310, PEs (PE 1 through PE 9) are connected through FSL and Aurora switches with a local clock source.]
18. PRO-PART design flow
[Diagram: a partitioned design mapped across two boards. Inside each ML310, PEs and I/O blocks are connected by FSL; nodes N1 through N5 and links l1 through l7 model the communication graph, with data_in1/data_out1 on ML310 board 1 and data_in2/data_out2 on ML310 board 2.]
19. Evaluating PRO-PART + NOFIS
Current implementation:
• Time-consuming and extensive simulations
• Hardware efficient due to hand-coded RTL
• Manual redesign loop: design capture, partitioning and scheduling of processes, select parameters for the customizable cores, map the values on the hardware
Proposed implementation (work in progress):
• Shorter design cycle
• Potential increase in resources, but meets performance goals
• Automated flow: the CFG captures the communication flow, component specification, partitioning and communication resource specification, configure and generate values for communication cores, mapping to hardware