Accelerating Particle Image Velocimetry using Hybrid Architectures
Vivek Venugopal
Cameron Patterson
Kevin Shinpaugh
Cardio-Vascular Disease (CVD)
[Pie chart: Cause of Deaths in United States for 2005 (reference: American Heart Association). Legend: Cardio-Vascular Disease (CVD), Cancer, Other, Chronic Lung Respiratory Disease, Accidents, Alzheimer's Disease. Slice labels: 37%, 24%, 24%, 6%, 5%, 5%.]
Research@AEThER Lab




[Figure panels: Pressure distribution within the heart; Location of Region of Interest; Intravascular stent]



•   Research focuses on cardiovascular fluid dynamics
    and stent hemodynamics useful in designing
    intravascular stents for treating CVD.
Research@AEThER Lab
[Figure panels: PIV setup; DPIV camera capturing the laser-illuminated fluid flow]

•   The AEThER lab uses the Particle Image Velocimetry
    (PIV) technique to track the motion of particles in an
    illuminated flow field.
Motivation

[Diagram: PIV experimental setup -> Stent configuration 1, 2, ..., 20 -> Case 1, Case 2, ..., Case 100. Each case = 1250 image pairs x 5 MB = 6.25 GB]

•   Time for FlowIQ analysis of one image pair on 2GHz
    Xeon processor = 16 minutes.
•   1250 image pairs x 100 cases x 20 stent
    configurations = 2.6 years
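
The workload figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up here, and the constants are the slide's own numbers. The per-pair processing time is deliberately left as a parameter, since the slide does not show how its 2.6-year total was derived.

```python
# Workload arithmetic for the Motivation slide (all constants are slide figures).
PAIRS_PER_CASE = 1250   # image pairs per case
MB_PER_PAIR = 5         # size of one image pair in MB
CASES = 100             # number of cases, as read from the slide
STENT_CONFIGS = 20      # number of stent configurations


def case_size_gb(pairs=PAIRS_PER_CASE, mb_per_pair=MB_PER_PAIR):
    """Data volume of one case in GB (decimal units: 1 GB = 1000 MB)."""
    return pairs * mb_per_pair / 1000


def total_pairs(pairs=PAIRS_PER_CASE, cases=CASES, configs=STENT_CONFIGS):
    """Number of image pairs across every case and stent configuration."""
    return pairs * cases * configs


def serial_years(seconds_per_pair, n_pairs):
    """Wall-clock years needed to process n_pairs one after another."""
    return seconds_per_pair * n_pairs / (365 * 24 * 3600)


print(case_size_gb())   # 6.25
print(total_pairs())    # 2500000
```

Calling `serial_years(t, total_pairs())` with a measured per-pair time `t` in seconds gives the serial processing time for the whole study.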
FFT-based PIV algorithm
[Diagram: zone 1 of Image 1 (time t) -> FFT; zone 2 of Image 2 (time t + dt) -> FFT; both spectra -> Multiplication -> IFFT -> Reduction -> Motion Vector. Zone size: (16 x 16) or (32 x 32) or (64 x 64)]

    •    Peak detection using FFT blocks
    •    Reduction block: sub-pixel algorithm and filtering to
         determine the velocity value
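
The FFT -> Multiplication -> IFFT -> Reduction pipeline can be sketched in plain Python as below. This is a minimal illustration under several assumptions, not the authors' implementation: zones are square with a power-of-two side, the reduction step only finds the integer-pixel correlation peak (the sub-pixel algorithm and filtering are left out), and the names `fft`, `fft2`, `motion_vector` and `make_zone` are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT of a power-of-two-length list (unnormalised)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(a, inverse=False):
    """2-D FFT: transform every row, then every column."""
    rows = [fft(r, inverse) for r in a]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def motion_vector(zone1, zone2):
    """Integer-pixel displacement (dy, dx) of zone2 relative to zone1."""
    n = len(zone1)
    f1, f2 = fft2(zone1), fft2(zone2)
    # Multiplication block: conj(F1) * F2 is the spectrum of the
    # circular cross-correlation of the two zones.
    prod = [[f1[i][j].conjugate() * f2[i][j] for j in range(n)]
            for i in range(n)]
    corr = fft2(prod, inverse=True)
    # Reduction block (simplified): locate the correlation peak.
    _, py, px = max((corr[i][j].real, i, j)
                    for i in range(n) for j in range(n))
    # Peak positions past n/2 are negative shifts (circular wrap-around).
    dy = py - n if py > n // 2 else py
    dx = px - n if px > n // 2 else px
    return dy, dx


def make_zone(n, particles):
    """n x n zone with unit-intensity particles at the given (row, col) positions."""
    zone = [[0.0] * n for _ in range(n)]
    for y, x in particles:
        zone[y][x] = 1.0
    return zone


zone1 = make_zone(8, [(2, 3), (5, 1)])
zone2 = make_zone(8, [(3, 5), (6, 3)])   # same particles moved by (+1, +2)
print(motion_vector(zone1, zone2))       # (1, 2)
```

Dividing the displacement by dt (and scaling by the pixel size) would give the velocity for that zone; both quantities depend on the experimental setup and are not modelled here.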
Implementation platform
[Figure: NVIDIA Tesla GPU computing board, with vendor description and specifications]
 •   System G: 325 Apple Mac Pro nodes, 2 quad-core
     Intel Xeon processors with 8 GB RAM per node
     running Linux CentOS
 •   Tesla C1060: 240 streaming cores clocked at 1.3
     GHz with 4 GB onboard memory
GPU platform
[Figure, left: CUDA hardware model for Tesla C1060. The GPU holds Multiprocessor 1 to Multiprocessor 30; each multiprocessor has Processor 1 to Processor 8 with their own registers, shared memory, an instruction unit, a constant cache and a texture cache, all above GPU memory.]
[Figure, right: CUDA programming model. The host (CPU) launches kernel 1 and kernel 2 on the device (GPU). Each kernel runs as a grid of blocks, Block (0,0) to Block (1,2); each block, e.g. Block (1,1), contains threads Thread (0,0) to Thread (2,3).]

•         Nvidia Tesla C1060: 30 streaming multi-processors
          with 8 cores each (SIMD)
•         Kernel-based execution
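
The grid/block/thread hierarchy of the programming model can be mimicked serially in plain Python. The sketch below is not CUDA code and says nothing about how the PIV kernels are written; `launch` and `scale_kernel` are hypothetical names, and indices are ordered (row, column) to match the Block and Thread labels in the figure. The grid and block shapes are the ones drawn on the slide.

```python
from itertools import product

SMS = 30           # streaming multiprocessors on the Tesla C1060 (from the slide)
CORES_PER_SM = 8   # cores per multiprocessor (from the slide)


def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 2-D kernel launch serially: one call per (block, thread)."""
    for by, bx in product(range(grid_dim[0]), range(grid_dim[1])):
        for ty, tx in product(range(block_dim[0]), range(block_dim[1])):
            kernel((by, bx), (ty, tx), block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, src, dst, factor):
    """Each thread handles one element, found from its block and thread index."""
    row = block_idx[0] * block_dim[0] + thread_idx[0]
    col = block_idx[1] * block_dim[1] + thread_idx[1]
    # Guard threads that fall outside the data.
    if row < len(src) and col < len(src[0]):
        dst[row][col] = factor * src[row][col]


# A 2 x 3 grid of 3 x 4 blocks covers a 6 x 12 array.
src = [[r * 12 + c for c in range(12)] for r in range(6)]
dst = [[0] * 12 for _ in range(6)]
launch(scale_kernel, (2, 3), (3, 4), src, dst, 2)

print(SMS * CORES_PER_SM)   # 240
print(dst[5][11])           # 142
```

On the real device the per-thread calls run concurrently across the 30 x 8 = 240 cores rather than in a Python loop.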
GPU platform
[Figure: on the CPU, the PIV sequential code in C consists of 15 functions looped 3096 times. In the CUDA-enabled program on the GPU, each function becomes a row of kernel instances, one per loop index (index 0, index 1, ..., index n): kernel 1_0, kernel 1_1, ..., kernel 1_n; kernel 2_0, ..., kernel 2_n; kernel 3_0, ..., kernel 3_n; through kernel 15_0, ..., kernel 15_n. Each row is followed by cudaThreadSynchronize().]
n = loop size = 3096
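
The restructuring shown on this slide, from a loop over indices that calls every function in turn to one row of per-index kernels per function with a barrier after each row, can be sketched in Python. This is an illustration only: the three `stages` are stand-ins for the 15 PIV functions, the thread pool stands in for the GPU, and the two versions agree only when the loop iterations are independent of each other.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(functions, items):
    """CPU version: the loop over items is outermost, functions run inside it."""
    out = []
    for item in items:
        for f in functions:
            item = f(item)
        out.append(item)
    return out


def run_kernel_rows(functions, items, workers=8):
    """GPU-style version: each function becomes one row of per-index tasks."""
    data = list(items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for f in functions:
            # list() waits for every task in the row; it stands in for the
            # cudaThreadSynchronize() barrier between kernel rows.
            data = list(pool.map(f, data))
    return data


# Hypothetical stand-ins for the per-index functions.
stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]

same = run_kernel_rows(stages, range(3096)) == run_sequential(stages, range(3096))
print(same)   # True
```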
FPGA platform
[Figure: ML310 board 1 and ML310 board 2 linked by Aurora. Inside each board, processing elements PE1 to PE4 are connected in pairs (PE1-PE2, PE3-PE4) over FSL and surrounded by Aurora switches.]
  Network Of FPGAs with Integrated Aurora Switches (NOFIS)
PRO-PART design flow
[Figure: input specifications (CFG representation of application algorithm + component specification of implementation platform) -> PRO-PART design flow -> NOFIS platform. Automated steps: CFG captures the communication flow, combined with the component specification -> partitioning and communication resource specification -> configure and generate values for communication cores -> mapping to hardware.]

•   Automatic synthesis of communication cores depending
    on the Process Partitioning
•   Library of cores depending on the communication link
•   Determine max. flow control and configure the parameters
CUDA implementation results
[Bar chart: CUDA Profile graph comparison. Vertical axis: CUDA kernel functions (zone_create, real2complex, complex_mul, complex_conj, conj_symm, c2c_radix2_sp, find_corrmax, transpose, x_field, y_field, indx_reorder, meshgrid_x, meshgrid_y, fft_indx, cc_indx, memcopy). Horizontal axis: GPU Time in usecs, log scale from 1 to 1000000. Series: CUDA 2.1 with memcopy, CUDA 2.2 with memcopy, CUDA 2.2 with zerocopy.]
Results
[Bar chart: Time in seconds (axis 0 to 1.1) per Execution Device. Legend: CPU, GPU, FPGA. Bar labels, left to right: 1.05, 0.34, 0.65.]

 •   Speedup scales linearly with the number of cores
 •   GPU 3x speedup
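
The 3x figure can be checked against the bar labels on the chart. The sketch below assumes the bars appear in legend order (CPU, GPU, FPGA), which the extracted slide does not state explicitly; the `speedup` helper is a hypothetical name.

```python
# Bar labels read from the Results chart, in seconds.
# Assumption: bars are drawn in legend order (CPU, GPU, FPGA).
times = {"CPU": 1.05, "GPU": 0.34, "FPGA": 0.65}


def speedup(baseline, device, t=times):
    """How many times faster `device` is than `baseline`."""
    return t[baseline] / t[device]


print(round(speedup("CPU", "GPU"), 2))    # 3.09
print(round(speedup("CPU", "FPGA"), 2))   # 1.62
```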
Conclusions

•   Nvidia's Tesla C1060 GPU faster than both the
    sequential CPU implementation and the NOFIS
    FPGA implementation for the PIV application
•   FPGAs with custom communication interfaces provide
    better throughput
•   PRO-PART design flow expected to reduce design
    development time (work in progress)
•   Ideal Hybrid platform consisting of CPU and GPU for
    computation and FPGA for data movement
Discussion
             Questions
Research Justification
[Figure, left: streaming architecture with a large number of PEs (PE 1 to PE 36, driven by one clock source), requiring more than 1 board.]
[Figure, right: ML310 boards 1 to 4 connected over Aurora in a mesh, driven by the same clock value but different sources.]
[Figure, bottom: inside an ML310, PE 1 to PE 9 connected over FSL and surrounded by Aurora switches, with a clock source.]
Communication Flow Graph

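[Figure: processes a1, a2, b1, b2, c1, d1, e1 mapped to a communication flow graph with nodes N1 (two instances), N2 (two instances), N3, N4 and N5, joined by links l1 to l7.]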
PRO-PART design flow
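[Figure, left: the communication flow graph (nodes N1 to N5, links l1 to l7) partitioned between ML310 board1 and ML310 board2.]
[Figure, right: inside ML310, four PEs connected through I/O blocks and FSL links, with inputs data_in1 and data_in2 and outputs data_out1 and data_out2.]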
Evaluating PRO-PART + NOFIS
Current implementation
•   Time-consuming and extensive simulations
•   Hardware efficient due to hand-coded RTL
[Flow (manual, with redesign loop): design capture -> partitioning and scheduling of processes -> select parameters for the customizable cores -> map the values on the hardware]

Proposed implementation (work in progress)
•   Shorter design cycle
•   Potential increase in resource but meets performance goals
[Flow (automated): CFG captures the communication flow, combined with the component specification -> partitioning and communication resource specification -> configure and generate values for communication cores -> mapping to hardware]
NOFIS Hardware




[Photo labels: ML310; ML310 Infiniband adapter board]

How to Manage Global Discount in Odoo 17 POS
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 

Accelerating Particle Image Velocimetry using Hybrid Architectures

  • 1. Accelerating Particle Image Velocimetry using Hybrid Architectures Vivek Venugopal Cameron Patterson Kevin Shinpaugh
  • 2. Cardio-Vascular Disease (CVD) [Pie chart: Cause of Deaths in United States for 2005 (reference: American Heart Association). Legend: Cardio Vascular Disease (CVD), Cancer, Other, Chronic Lung Respiratory Disease, Accidents, Alzheimer's Disease. Slice labels: 37%, 24%, 24%, 6%, 5%, 5%.]
  • 3. Research@AEThER Lab [Figures: pressure distribution within the heart; location of region of interest; intravascular stent.] • Research focuses on cardiovascular fluid dynamics and stent hemodynamics useful in designing intravascular stents for treating CVD.
  • 4. Research@AEThER Lab [Figures: PIV setup; DPIV camera capturing the laser-illuminated fluid flow.] • The AEThER lab uses the Particle Image Velocimetry (PIV) technique to track the motion of particles in an illuminated flow field.
  • 5. Motivation [Diagram: PIV experimental setup; stent configuration 1, stent configuration 2, through stent configuration 20; Case 1, Case 2, through Case 100.] Each case = 1250 image pairs x 5 MB = 6.25 GB • Time for FlowIQ analysis of one image pair on 2 GHz Xeon processor = 16 minutes. • 1250 image pairs x 100 cases x 20 stent configurations = 2.6 years
  • 6. FFT-based PIV algorithm [Block diagram: Image 1 (time t), zone 1 -> FFT; Image 2 (time t + dt), zone 2 -> FFT; both FFT outputs -> Multiplication -> IFFT -> Reduction -> Motion Vector. Zone size: (16 x 16) or (32 x 32) or (64 x 64).] • Peak detection using FFT blocks • Reduction block: sub-pixel algorithm and filtering to determine the velocity value
  • 7. Implementation platform [Figure: Nvidia Tesla C1060 board with specification text.] • System G: 325 Apple Mac Pro nodes, 2 quad-core Intel Xeon processors with 8 GB RAM per node running Linux CentOS • Tesla C1060: 240 streaming cores clocked at 1.3 GHz with 4 GB onboard memory
  • 8. GPU platform [Diagram, CUDA programming model: the host (CPU) launches kernel 1 and kernel 2 on the device (GPU); each kernel runs as a grid of blocks, Block (0,0) to Block (1,2), and each block holds a 2-D array of threads, Thread (0,0) to Thread (2,3).] [Diagram, CUDA hardware model for Tesla C1060: GPU with Multiprocessor 1 to Multiprocessor 30; each multiprocessor has shared memory, registers, Processor 1 to Processor 8, an instruction unit, a constant cache and a texture cache; GPU memory sits below the multiprocessors.] • Nvidia Tesla C1060: 30 streaming multi-processors with 8 cores each (SIMD) • Kernel-based execution
  • 9. GPU platform [Diagram: on the CPU, PIV sequential code in C consisting of 15 functions looped 3096 times. On the GPU, the CUDA enabled program runs kernel 1 to kernel 15, each across index 0 to index n (kernel 10, kernel 11, ..., kernel 1n through kernel 150, kernel 151, ..., kernel 15n), with cudaThreadSynchronize() after each kernel.] n = loop size = 3096
  • 10. FPGA platform [Diagram: ML310 board 1 and ML310 board 2, each holding PE1 to PE4 connected through FSL links and Aurora switches, with Aurora links between the boards.] Network Of FPGAs with Integrated Aurora Switches (NOFIS)
  • 11. PRO-PART design flow [Diagram: Input specifications (CFG representation of application algorithm + component specification of implementation platform) -> PRO-PART design flow -> NOFIS platform. Flow boxes: CFG captures the communication flow; component specification; partitioning and communication resource specification; automated: configure and generate values for communication cores; mapping to hardware.] • Automatic synthesis of communication cores depending on the Process Partitioning • Library of cores depending on the communication link • Determine max. flow control and configure the parameters
  • 12. CUDA implementation results [Bar chart: CUDA Profile graph comparison. Vertical axis: CUDA kernel functions (zone_create, real2complex, complex_mul, complex_conj, conj_symm, c2c_radix2_sp, find_corrmax, transpose, x_field, y_field, indx_reorder, meshgrid_x, meshgrid_y, fft_indx, cc_indx, memcopy). Horizontal axis: GPU Time in usecs, ticks from 1 to 1000000 in powers of ten. Series: CUDA 2.1 with memcopy, CUDA 2.2 with memcopy, CUDA 2.2 with zerocopy.]
  • 13. Results [Bar chart: Time in seconds by Execution Device (CPU, GPU, FPGA); bar labels 1.05, 0.65, 0.34.] • Speedup scales linearly with the number of cores • GPU 3x speedup
  • 14. Conclusions • Nvidia's Tesla C1060 GPU faster than both the sequential CPU implementation and the NOFIS FPGA implementation for the PIV application • FPGAs with custom communication interfaces provide better throughput • PRO-PART design flow expected to reduce design development time (work in progress) • Ideal Hybrid platform consisting of CPU and GPU for computation and FPGA for data movement
  • 15. Discussion Questions
  • 16. Research Justification [Diagram: ML310 board 1 to ML310 board 4 connected in a mesh over Aurora links, carrying PE 1 to PE 36; the boards are driven by the same clock value but different clock sources.] Streaming architecture with large number of PEs requiring more than 1 board [Diagram: inside an ML310, PE 1 to PE 9 connected by FSL links and Aurora switches, with one clock source.]
  • 17. Communication Flow Graph [Diagram: graph with node types N1 to N5, node instances a1, a2, b1, b2, c1, d1, e1, and links l1 to l7.]
  • 18. PRO-PART design flow [Diagram: the communication flow graph (nodes N1 to N5, links l1 to l7) mapped onto ML310 board1 and ML310 board2. Inside ML310: PEs connected by FSL links, with I/O blocks for data_in1, data_in2, data_out1 and data_out2.]
  • 19. Evaluating PRO-PART + NOFIS Current implementation: • Time-consuming and extensive simulations • Hardware efficient due to hand-coded RTL [Flow: component design capture (specification); partitioning and scheduling of processes; manual redesign loop: select parameters for the customizable cores; map the values on the hardware.] Proposed implementation (work in progress): • Shorter design cycle • Potential increase in resource but meets performance goals [Flow: CFG captures the communication flow; partitioning and communication resource specification; automated: configure and generate values for communication cores; mapping to hardware.]
  • 20. NOFIS Hardware ML310 Infiniband adapter board ML310
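
The FFT-based pipeline on slide 6 (FFT of each zone, multiplication, IFFT, then peak detection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' C or CUDA code: the function names (`fft`, `fft2`, `shift_zone`, `piv_displacement`) are hypothetical, the multiplication step is assumed to be the complex-conjugate product that yields a circular cross-correlation, zones are assumed square with a power-of-two side, and the sub-pixel refinement and filtering of the reduction block are left out.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def fft2(a, inverse=False):
    """2-D FFT of a square list-of-lists: transform rows, then columns."""
    rows = [fft(r, inverse) for r in a]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def shift_zone(zone, dy, dx):
    """Helper for the demo: circularly shift a zone by (dy, dx) pixels."""
    n = len(zone)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[(i + dy) % n][(j + dx) % n] = zone[i][j]
    return out

def piv_displacement(zone1, zone2):
    """Integer-pixel shift (dy, dx) of zone2 relative to zone1."""
    n = len(zone1)
    f1, f2 = fft2(zone1), fft2(zone2)
    # Multiplication block: conj(F1) * F2 is the spectrum of the
    # circular cross-correlation of the two zones (assumed form).
    prod = [[f1[i][j].conjugate() * f2[i][j] for j in range(n)]
            for i in range(n)]
    # IFFT block; the 1/n^2 scale is skipped because it does not move the peak.
    corr = fft2(prod, inverse=True)
    # Peak detection: location of the largest correlation value.
    _, pi, pj = max((corr[i][j].real, i, j)
                    for i in range(n) for j in range(n))
    # Indices past n/2 wrap around to negative shifts.
    dy = pi - n if pi > n // 2 else pi
    dx = pj - n if pj > n // 2 else pj
    return dy, dx

# Demo: three "particles" in an 8 x 8 zone, moved by 2 rows and -1 column.
zone1 = [[0.0] * 8 for _ in range(8)]
zone1[1][2], zone1[4][5], zone1[6][1] = 1.0, 0.8, 0.6
zone2 = shift_zone(zone1, 2, -1)
print(piv_displacement(zone1, zone2))
```

Dividing the returned pixel shift by the inter-frame time dt gives a velocity in pixels per unit time; a real PIV code would refine the peak to sub-pixel accuracy first.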
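
The restructuring shown on slide 9, where 15 functions looped 3096 times become 15 kernels each covering all 3096 indices with a synchronization after every kernel, amounts to a loop interchange. The Python sketch below only illustrates that idea. It assumes loop iterations are independent of each other (the slides do not state this), and the names `run_sequential`, `run_kernel_style` and the arithmetic `stages` are hypothetical stand-ins for the real PIV functions.

```python
def run_sequential(stages, n):
    """CPU form: for each loop index, run every stage in order."""
    out = []
    for i in range(n):
        v = i
        for stage in stages:
            v = stage(v)
        out.append(v)
    return out

def run_kernel_style(stages, n):
    """GPU form: each stage acts as one 'kernel' applied to all indices."""
    data = list(range(n))
    for stage in stages:
        # On a GPU this comprehension would be one thread per index.
        data = [stage(v) for v in data]
        # A barrier belongs here; the slide uses cudaThreadSynchronize().
    return data

# 15 placeholder stages and the loop size from the slide (n = 3096).
stages = [lambda v, k=k: (v * 3 + k) % 1009 for k in range(15)]
same = run_sequential(stages, 3096) == run_kernel_style(stages, 3096)
print(same)
```

The two orderings agree only because no stage reads another index's data; a stage with cross-index dependencies would need its inputs fully written before the next kernel starts, which is the role of the synchronization call between kernels.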