High Performance Computing (HPC) applications are mapped to a cluster of multi-core processors communicating over high speed interconnects. Additional computational power is harnessed with hardware accelerators such as Graphics Processing Unit (GPU) cards and Field Programmable Gate Arrays (FPGAs). Particle Image Velocimetry (PIV) is an embarrassingly parallel application that can benefit from acceleration using hybrid architectures. The PIV application is mapped to an Nvidia GPU system, resulting in a 3x speedup over a dual quad-core Intel processor implementation. The design methodology used to implement the PIV application on a specialized FPGA platform under development is briefly described, and the resulting performance benefit is analyzed.
Accelerating Particle Image Velocimetry using Hybrid Architectures
1. Accelerating Particle Image Velocimetry using Hybrid Architectures
Vivek Venugopal
Cameron Patterson
Kevin Shinpaugh
2. Cardiovascular Disease (CVD)
[Pie chart: causes of death in the United States for 2005, comparing Cardiovascular Disease, Cancer, Chronic Lung/Respiratory Disease, Accidents, Alzheimer's Disease, and Other (reference: American Heart Association)]
3. Research@AEThER Lab
[Figures: intravascular pressure distribution within the heart; location of stent; region of interest]
• Research focuses on cardiovascular fluid dynamics and stent hemodynamics, useful in designing intravascular stents for treating CVD.
4. Research@AEThER Lab
[Photos: PIV setup; DPIV camera capturing the laser-illuminated fluid flow. (Top left): photo of entire setup before installation. (Top right): data acquisition and control. (Bottom left): 3-axis motorized traverse for focusing. (Bottom right): standard DPIV camera.]
• The AEThER lab uses the Particle Image Velocimetry (PIV) technique to track the motion of particles in an illuminated flow field.
5. Motivation
[Diagram: the PIV experimental setup produces Case 1 through Case 100 for each of stent configurations 1 through 20; each case = 1250 image pairs x 5 MB = 6.25 GB]
• Time for FlowIQ analysis of one image pair on a 2 GHz Xeon processor = 16 minutes.
• 1250 image pairs x 100 cases x 20 stent configurations = 2.6 years
6. FFT-based PIV algorithm
[Diagram: an interrogation zone of (16 x 16), (32 x 32), or (64 x 64) pixels is taken from Image 1 (time t, zone 1) and from Image 2 (time t + dt, zone 2); each zone is passed through an FFT, the two spectra are multiplied, an IFFT yields the correlation plane, and a reduction step produces the motion vector]
• Peak detection using FFT blocks
• Reduction block: sub-pixel algorithm and filtering to determine the velocity value (a sketch of this stage follows below)
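To make the pipeline concrete, here is a minimal CUDA sketch of the correlation stage for one interrogation zone, assuming single-precision complex zone data already resident on the GPU; the function and kernel names are illustrative, not the actual FlowIQ kernels. cuFFT supplies the transforms, and the conjugate multiply plays the role of the complex_mul/complex_conj kernels profiled on slide 12.

```cuda
// Minimal sketch: FFT -> conjugate multiply -> IFFT -> (peak search).
#include <cufft.h>
#include <cuda_runtime.h>

#define ZONE 32              // (32 x 32) interrogation zone
#define N (ZONE * ZONE)

// Multiply the zone 1 spectrum by the complex conjugate of the zone 2
// spectrum; the IFFT of this product is the cross-correlation plane.
__global__ void conj_multiply(cufftComplex *a, const cufftComplex *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        cufftComplex p;
        p.x = a[i].x * b[i].x + a[i].y * b[i].y;  // Re(a * conj(b))
        p.y = a[i].y * b[i].x - a[i].x * b[i].y;  // Im(a * conj(b))
        a[i] = p;
    }
}

// d_zone1 and d_zone2 each hold one interrogation zone in device memory.
void correlate_zone(cufftComplex *d_zone1, cufftComplex *d_zone2)
{
    cufftHandle plan;
    cufftPlan2d(&plan, ZONE, ZONE, CUFFT_C2C);

    cufftExecC2C(plan, d_zone1, d_zone1, CUFFT_FORWARD);  // FFT of zone 1
    cufftExecC2C(plan, d_zone2, d_zone2, CUFFT_FORWARD);  // FFT of zone 2
    conj_multiply<<<(N + 255) / 256, 256>>>(d_zone1, d_zone2);
    cufftExecC2C(plan, d_zone1, d_zone1, CUFFT_INVERSE);  // correlation plane
                                                          // (unnormalized)
    // A reduction kernel (find_corrmax in the profile) then locates the
    // correlation peak; a sub-pixel fit refines it into the motion vector.
    cufftDestroy(plan);
}
```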
7. Implementation platform
• NVIDIA Unified Driver Architecture (UDA). The NVIDIA Tesla C1060 GPU computing board is an add-in card based on the Tesla C1060 GPU. It has a PCI Express full-height form factor and is targeted as a high performance computing (HPC) solution for PCI Express host workstations. GPU computing board-level products do not have display connectors and are specifically designed for computing; processor clocks, memory configuration, and computing features will evolve differently than graphics board products.
• System G: 325 Apple Mac Pro nodes, 2 quad-core Intel Xeon processors with 8 GB RAM per node, running Linux CentOS
• Tesla C1060 specifications: one GPU with 240 streaming cores (thread processors) clocked at 1.3 GHz, 4 GB dedicated onboard memory; fits in one full-length, dual slot with one open PCI Express x16 slot
[Photos: NVIDIA Tesla C1060 GPU computing board]
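These specifications can be confirmed at runtime; the short program below is a sketch (assumed, not part of the deck) that queries device 0 through the CUDA runtime API.

```cuda
// Print the properties quoted above for device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // On a Tesla C1060 this reports 30 multiprocessors (240 scalar cores),
    // a ~1.3 GHz core clock, and 4 GB of global memory.
    printf("%s: %d MPs, %.2f GHz, %zu MB\n", prop.name,
           prop.multiProcessorCount, prop.clockRate / 1e6,
           prop.totalGlobalMem >> 20);
    return 0;
}
```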
11. PRO-PART design flow
[Diagram: the inputs to the PRO-PART design flow are a CFG representation of the application algorithm plus a component specification of the implementation platform; the flow targets the NOFIS platform. Stages: the CFG captures the communication flow; the component specification and process partitioning fix the communication resource specification; the communication cores are then configured and their values generated automatically; the result is mapped to hardware]
• Automatic synthesis of communication cores depending on the process partitioning
• Library of cores depending on the communication link specification
• Determine max. flow control values and configure the parameters
12. CUDA implementation results
[Bar chart: per-kernel GPU time in microseconds (log scale, 1 to 1,000,000) for the CUDA kernel functions zone_create, real2complex, complex_mul, complex_conj, conj_symm, c2c_radix2_sp, find_corrmax, transpose, x_field, y_field, indx_reorder, meshgrid_x, meshgrid_y, fft_indx, cc_indx, and memcopy, comparing CUDA 2.1 with memcopy, CUDA 2.2 with memcopy, and CUDA 2.2 with zerocopy. CUDA profile graph comparison.]
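The memcopy versus zerocopy comparison refers to how image data reaches the GPU. The sketch below contrasts the two styles under stated assumptions: the buffer names are hypothetical, and zero-copy requires CUDA 2.2 or later with mapped pinned memory.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

void transfer_styles(size_t bytes)
{
    // Must be set before the CUDA context is created for zero-copy to work.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Style 1: explicit memcopy -- the host buffer is staged into device
    // memory, which appears as the memcopy bar in the profile.
    float *h_buf = (float *)malloc(bytes);
    float *d_buf;
    cudaMalloc((void **)&d_buf, bytes);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    // Style 2: zero-copy (CUDA 2.2+) -- mapped, pinned host memory that
    // kernels read directly over PCI Express; no separate memcopy call.
    float *h_mapped, *d_alias;
    cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_alias, h_mapped, 0);
    // kernel<<<grid, block>>>(d_alias);  // would read host memory in place

    cudaFreeHost(h_mapped);
    cudaFree(d_buf);
    free(h_buf);
}
```

Zero-copy avoids the staging copy entirely, which is why the memcopy bar disappears in the CUDA 2.2 zerocopy profile, at the cost of every kernel access traversing PCI Express.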
13. Results
[Bar chart: execution time in seconds per execution device, approximately CPU 1.05 s, FPGA 0.65 s, GPU 0.34 s]
• Linear speedup with the number of cores
• GPU: 3x speedup
14. Conclusions
• Nvidia's Tesla C1060 GPU is faster than both the sequential CPU implementation and the NOFIS FPGA implementation for the PIV application
• FPGAs with custom communication interfaces provide better throughput
• The PRO-PART design flow is expected to reduce design development time (work in progress)
• The ideal hybrid platform consists of CPU and GPU for computation and FPGA for data movement
16. Research Justification
[Diagrams: a streaming architecture with a large number of PEs (PE 1 through PE 36) requires more than one board, so four ML310 boards are connected in a mesh over Aurora links, driven by the same clock value but different clock sources. Inside an ML310, PEs (PE 1 through PE 9) are connected through FSL and Aurora switches with a local clock source.]
18. PRO-PART design flow
[Diagram: a partitioned design mapped across two boards. Inside each ML310, PEs and I/O blocks are connected by FSL; nodes N1 through N5 and links l1 through l7 model the communication graph, with data_in1/data_out1 on ML310 board 1 and data_in2/data_out2 on ML310 board 2.]
19. Evaluating PRO-PART + NOFIS
Current implementation:
• Time-consuming and extensive simulations
• Hardware efficient due to hand-coded RTL
• Manual redesign loop: design capture, partitioning and scheduling of processes, select parameters for the customizable cores, map the values on the hardware
Proposed implementation (work in progress):
• Shorter design cycle
• Potential increase in resources, but meets performance goals
• Automated flow: the CFG captures the communication flow, component specification, partitioning and communication resource specification, configure and generate values for communication cores, mapping to hardware