Massively Parallel Computing
                         CS 264 / CSCI E-292
Lecture #4: Intermediate-level CUDA | February 15th, 2011




                Nicolas Pinto (MIT, Harvard)
                       pinto@mit.edu
Administrivia

• HW1: due Fri 2/18/11 (this week)
• Projects: think about it, consult the staff
• New guest lecturers!
  • Max Lin (Google), Kurt Messersmith et al. (Amazon),
     David Rich et al. (Microsoft)
During this course (adapted for CS264),
we’ll try to “ ” and use existing material ;-)
Today
yey!!
Outline

• CUDA Language & APIs (overview)
• Threading/Execution (cont’d)
• Memory/Communication (cont’d)
• Tools
• Libraries
Language

• CUDA provides a set of built-in vector types:
  • char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4
  • short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4
  • int1, uint1, int2, uint2, int3, uint3, int4, uint4
  • long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4
  • float1, float2, float3, float4

                                        Spring 2008, Johns Hopkins University
                                        Copyright © Matthew Bolitho 2008
Language

• Can construct a vector type with special function:
    make_<type name>(...)

• Can access elements of a vector type with .x, .y, .z, .w:
    vecvar.x
Language

• dim3 is a special vector type

• Same as uint3, except can be constructed from a scalar to form a vector:
    (scalar, 1, 1)
Language

• CUDA provides four global, built-in variables
  • threadIdx, blockIdx, blockDim, gridDim

• Variables are of type dim3 or uint3

• Accessible only from device code
• Cannot take address
• Cannot assign value
Language

Runtime Math Library

  There are two types of runtime math operations in single precision
    __funcf(): direct mapping to hardware ISA
      Fast but lower accuracy (see prog. guide for details)
      Examples: __sinf(x), __expf(x), __powf(x,y)
    funcf(): compile to multiple instructions
      Slower but higher accuracy (5 ulp or less)
      Examples: sinf(x), expf(x), powf(x,y)

  The -use_fast_math compiler option forces every funcf() to compile to __funcf()

© NVIDIA Corporation 2010

  Unit of Least Precision (ULP) is the gap between the floating-point
  numbers nearest a given real number
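The accuracy trade-off between the two families can be measured directly. The following is a minimal sketch, not taken from the slides: the kernel name `sin_both` and the helper `max_fast_math_error` are hypothetical, and the code assumes it is compiled without `-use_fast_math` (with that flag both calls become the intrinsic and the difference is zero).

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Evaluate sin(x) both ways on the device.
__global__ void sin_both(const float *x, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // library version: several instructions
        fast[i]     = __sinf(x[i]);  // intrinsic: maps to the hardware unit
    }
}

// Hypothetical helper: largest |sinf(x) - __sinf(x)| over n samples in [0, 1).
float max_fast_math_error(int n)
{
    std::vector<float> x(n), acc(n), fst(n);
    for (int i = 0; i < n; ++i) x[i] = (float)i / n;

    size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_acc = nullptr, *d_fst = nullptr;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_acc, bytes);
    cudaMalloc((void **)&d_fst, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);

    sin_both<<<(n + 255) / 256, 256>>>(d_x, d_acc, d_fst, n);

    // cudaMemcpy waits for the kernel before copying the results back.
    cudaMemcpy(acc.data(), d_acc, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(fst.data(), d_fst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_acc); cudaFree(d_fst);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = fmaxf(worst, fabsf(acc[i] - fst[i]));
    return worst;
}
```

Calling `max_fast_math_error(1024)` from `main` and printing the result gives a feel for how much accuracy the intrinsic gives up on a small input range; larger arguments generally widen the gap.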
API

CUDA APIs

  API allows the host to manage the devices
    • Allocate memory & transfer data
    • Launch kernels

  CUDA C Runtime API
    • High level of abstraction - start here!

  CUDA C Driver API (aka “Device” API)
    • More control, more verbose

  (OpenCL: similar to CUDA C Driver API)
API

• The host API is exposed as two different APIs:
  • The low level Device API (prefix: cu)
  • The high level Runtime API (prefix: cuda)

• Some things can be done through both APIs, others are specialized
• Can be mixed together (with care)
API

• All GPU computing is performed on a device
• To allocate memory, run a program, etc. on the hardware, we need a device context

• Device contexts are bound 1:1 with host threads (just like OpenGL)
  • So, each host thread may have at most one device context
  • And, each device context is accessible from only one host thread
API

• All device API calls return an error/success code of type: CUresult
• All runtime API calls return an error/success code of type: cudaError_t

• An integer value with zero = no error

• cudaGetLastError, cudaGetErrorString
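A common way to use these return codes is to wrap every runtime call in a checking macro. This is a minimal sketch under stated assumptions: `CUDA_CHECK`, `noop` and `launch_and_check` are hypothetical names, not part of the CUDA toolkit, and `cudaDeviceSynchronize` is used as the current replacement for the older `cudaThreadSynchronize` that appears elsewhere in these slides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical convenience macro: abort with a readable message on any error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void noop() {}

// Returns true if a trivial kernel launched and ran without error.
bool launch_and_check()
{
    noop<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());        // errors raised by the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran
    return true;
}
```

Kernel launches return no value, which is why the launch is followed by `cudaGetLastError`; ordinary calls such as `cudaMalloc` can be wrapped directly, as in `CUDA_CHECK(cudaMalloc(&p, bytes))`.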
API

• Runtime API calls automatically initialize
• Device API programs must call cuInit first
Runtime API
   (high-level)
API

• Runtime API provides a simplified interface for creating a context:
  • cudaGetDeviceCount
  • cudaSetDevice

• And the useful:
  • cudaChooseDevice
API

• Allocate/Free memory:
  • cudaMalloc, cudaFree

• Initialize memory:
  • cudaMemset

• Copy memory:
  • cudaMemcpy
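The four calls above are enough for a complete round trip through device memory. A minimal sketch follows; the helper name `device_memory_round_trip` is hypothetical, and error handling is reduced to the allocation check to keep the example short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: zero a device buffer, overwrite its first half from
// the host, read everything back, and verify both halves.
bool device_memory_round_trip(int n)
{
    size_t bytes = n * sizeof(int);
    int half = n / 2;
    std::vector<int> h_in(half), h_out(n, -1);
    for (int i = 0; i < half; ++i) h_in[i] = i + 1;

    int *d_buf = nullptr;
    if (cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) return false;

    cudaMemset(d_buf, 0, bytes);                         // initialize
    cudaMemcpy(d_buf, h_in.data(), half * sizeof(int),
               cudaMemcpyHostToDevice);                  // host -> device
    cudaMemcpy(h_out.data(), d_buf, bytes,
               cudaMemcpyDeviceToHost);                  // device -> host
    cudaFree(d_buf);

    for (int i = 0; i < n; ++i) {
        int expected = (i < half) ? i + 1 : 0;
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

Note that `cudaMemset` sets bytes, not elements, so it is mainly useful for zeroing; filling with an arbitrary `int` pattern needs a kernel or a host-side copy.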
API

• In .cu files, we can use the <<< >>> function invocation operator

• The compiler generates all the calls to the device API needed to set up the execution environment
“Device” Driver API
       (low-level)
API

• The first (optional?) step is to enumerate the available devices
  • cuDeviceGetCount
  • cuDeviceGet
  • cuDeviceGetName
  • cuDeviceTotalMem
  • cuDeviceGetAttribute
  • …
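The enumeration step can be sketched with the driver API as follows. This is an illustration rather than the lecture's own code: `list_devices` is a hypothetical helper, the program is assumed to be linked against the driver library (for example with `-lcuda`), and only the multiprocessor count is queried through `cuDeviceGetAttribute`.

```cuda
#include <cstdio>
#include <cuda.h>   // driver API header

// Hypothetical helper: prints one line per device and returns the device
// count, or -1 if the driver API could not be initialized or queried.
int list_devices()
{
    if (cuInit(0) != CUDA_SUCCESS) return -1;   // required before any other call

    int count = 0;
    if (cuDeviceGetCount(&count) != CUDA_SUCCESS) return -1;

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[256];
        size_t total_bytes = 0;
        int sm_count = 0;

        if (cuDeviceGet(&dev, i) != CUDA_SUCCESS) return -1;
        cuDeviceGetName(name, (int)sizeof(name), dev);
        cuDeviceTotalMem(&total_bytes, dev);
        cuDeviceGetAttribute(&sm_count,
                             CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);

        printf("device %d: %s, %zu MiB, %d multiprocessors\n",
               i, name, total_bytes >> 20, sm_count);
    }
    return count;
}
```

No context is created here; enumeration only needs `cuInit`, which is why it is a cheap first step before choosing a device for `cuCtxCreate`.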
API

• Once we choose a device with cuDeviceGet we get a device handle of type CUdevice
• Can now create a context with cuCtxCreate

• Once we have a context (CUcontext) we can allocate memory, call a GPU function etc.
• Context is implicitly associated with creating thread

• To synchronize all threads (CPU host with GPU threads) call cuCtxSynchronize
  • Waits for all GPU tasks to finish
API

• Allocate/Free memory:
  • cuMemAlloc, cuMemFree

• Initialize memory:
  • cuMemsetD8, cuMemsetD16, cuMemsetD32

• Copy memory:
  • cuMemcpyHtoD, cuMemcpyDtoH, cuMemcpyDtoD
API

• A module is a blob of GPU code/data along with some type information
  • .cubin files

• A module is created by loading a .cubin with cuModuleLoad or cuModuleLoadData

• Module can be unloaded with cuModuleUnload

• Loading a module also copies it to the device

• Can then get the address of functions and global variables:
  • cuModuleGetFunction
  • cuModuleGetGlobal
  • cuModuleGetTexRef
API

• Once a module is loaded, and we have a function pointer, we can call a function
• We must set up the execution environment first

• Execution environment includes:
  • Thread Block Size
  • Shared Memory Size
  • Function Parameters
  • Grid Size

• Thread Block Size:
  • cuFuncSetBlockShape

• Shared Memory Size:
  • cuFuncSetSharedSize

• Function Parameters:
  • cuParamSetSize, cuParamSeti, cuParamSetf, cuParamSetv

• Grid Size is set at the same time as the function invocation:
  • cuLaunchGrid
Outline

• CUDA Language & APIs (overview)
• Threading/Execution (cont’d)
• Memory/Communication (cont’d)
• Tools
• Libraries
Threading Hierarchy
Execution Model
Software → Hardware

  Thread → Thread Processor
    Threads are executed by thread processors

  Thread Block → Multiprocessor
    Thread blocks are executed on multiprocessors
    Thread blocks do not migrate
    Several concurrent thread blocks can reside on one multiprocessor - limited
    by multiprocessor resources (shared memory and register file)

  Grid → Device
    A kernel is launched as a grid of thread blocks
    Only one kernel can execute on a device at one time

                                             © 2008 NVIDIA Corporation.
Thread Batching


  Kernel launches a grid of thread blocks
        Threads within a block cooperate via shared memory
        Threads within a block can synchronize
        Threads in different blocks cannot cooperate
  Allows programs to transparently scale to
  different GPUs
  [Figure: a Grid made of Thread Block 0, Thread Block 1, …, Thread Block N-1, each with its own Shared Memory]
Transparent Scalability


         Hardware is free to schedule thread blocks on any
         processor
              A kernel scales across parallel multiprocessors

  [Figure: a kernel grid of eight blocks (Block 0 to Block 7). One device runs them two blocks at a time, another device runs them four blocks at a time]
Thread Arithmetic
Indexing Arrays: Example
   In this example, the red entry would have an index of 21:
      0   1   2   3   4   5   6   7   8   9 10 11 12 13 14 15 16 17 18 19 20 21

     M = 8 threads/block
     blockIdx.x = 2

          int index = threadIdx.x + blockIdx.x * M;
                    = 5 + 2 * 8;
                    = 21;
Addition with Threads and Blocks

    The blockDim.x is a built-in variable for threads per block:
          int index = threadIdx.x + blockIdx.x * blockDim.x;

    A combined version of our vector addition kernel to use blocks and threads:
          __global__ void add( int *a, int *b, int *c ) {
                      int index = threadIdx.x + blockIdx.x * blockDim.x;
                      c[index] = a[index] + b[index];
          }
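A complete host program around this indexing scheme might look like the sketch below. It is an illustration under assumptions: the wrapper `run_vector_add` and the extra length parameter `n` are not in the slides, and the bounds check is added so that the array length does not have to be a multiple of the block size.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same indexing as on the slide, plus a guard for a partially filled last block.
__global__ void add(const int *a, const int *b, int *c, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// Hypothetical host wrapper; returns true when c[i] == a[i] + b[i] for all i.
bool run_vector_add(int n, int threads_per_block)
{
    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

    size_t bytes = n * sizeof(int);
    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // Round the block count up so every element gets a thread.
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    add<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    for (int i = 0; i < n; ++i)
        if (c[i] != a[i] + b[i]) return false;
    return true;
}
```

With `run_vector_add(1000, 8)` the launch uses 125 blocks of 8 threads, matching the M = 8 threads/block picture used in the indexing example.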
Control Flow
Control Flow Divergence

 What happens if you have the following code?
          if(foo(threadIdx.x))
          {
            do_A();
          }
          else
          {
            do_B();
          }
Control Flow Divergence

  [Figure: a branch splits execution into Path A and Path B]
Control Flow Divergence

 Nested branches are handled as well
if(foo(threadIdx.x))
{
  if(bar(threadIdx.x))
    do_A();
  else
    do_B();
}
else
  do_C();
Control Flow Divergence

  [Figure: two nested branches split execution into Path A, Path B and Path C]
Control Flow Divergence

 You don't have to worry about divergence for correctness (*)
 You might have to think about it for performance
    Depends on your branch conditions
Control Flow Divergence

 Performance drops off with the degree of divergence

   switch(threadIdx.x % N)
   {
     case 0:
       ...
     case 1:
       ...
   }
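The difference between a divergent and a non-divergent branch condition can be illustrated with two kernels that compute the same kind of result. This is a sketch with hypothetical names (`divergent`, `warp_uniform`, `check_divergence_kernels`); it only checks correctness, which divergence never affects, and leaves timing to a profiler.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Branch on lane parity: even and odd threads of the same warp take
// different paths, so the warp executes both paths one after the other.
__global__ void divergent(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) out[i] = 2 * i;
    else                      out[i] = 3 * i;
}

// Branch on the warp index: all threads of a warp agree, so no divergence.
__global__ void warp_uniform(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = 2 * i;
    else                                   out[i] = 3 * i;
}

// Hypothetical check: both kernels must match a CPU reference.
bool check_divergence_kernels(int n)
{
    const int block = 128;
    int warp = 32;
    cudaDeviceGetAttribute(&warp, cudaDevAttrWarpSize, 0);

    std::vector<int> h(n);
    int *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(int)) != cudaSuccess) return false;
    int blocks = (n + block - 1) / block;
    bool ok = true;

    divergent<<<blocks, block>>>(d, n);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) {
        int t = i % block;                       // threadIdx.x of element i
        if (h[i] != ((t % 2 == 0) ? 2 * i : 3 * i)) ok = false;
    }

    warp_uniform<<<blocks, block>>>(d, n);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) {
        int t = i % block;
        if (h[i] != (((t / warp) % 2 == 0) ? 2 * i : 3 * i)) ok = false;
    }

    cudaFree(d);
    return ok;
}
```

Both kernels do the same amount of useful work per thread; the second simply arranges the condition so that it changes only at warp boundaries.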
Divergence

  [Plot: Performance (y-axis, 0 to 35) against Divergence (x-axis, 0 to 18)]
Occupancy

  Thread instructions are executed sequentially, so executing
  other warps is the only way to hide latencies and keep the
  hardware busy

  Occupancy = Number of warps running concurrently on a
  multiprocessor divided by maximum number of warps that can
  run concurrently

  Limited by resource usage:
    Registers
    Shared memory
Blocks per Grid Heuristics

  # of blocks > # of multiprocessors
    So all multiprocessors have at least one block to execute

  # of blocks / # of multiprocessors > 2
    Multiple blocks can run concurrently in a multiprocessor
    Blocks that aren't waiting at a __syncthreads() keep the hardware busy
    Subject to resource availability - registers, shared memory

  # of blocks > 100 to scale to future devices
    Blocks executed in pipeline fashion
    1000 blocks per grid will scale across multiple generations
Outline

• CUDA Language & APIs (overview)
• Threading/Execution (cont’d)
• Memory/Communication (cont’d)
• Tools
• Libraries
Kernel Memory Access

  Per-thread
    • Registers: on-chip
    • Local Memory: off-chip, uncached

  Per-block
    • Shared Memory: on-chip, small, fast

  Per-device
    • Global Memory: off-chip, large, uncached
      • Persistent across kernel launches (Kernel 0, Kernel 1, ... over time)
      • Kernel I/O
Registers
• Large number of registers per stream processor (1024 in some hardware)
• Zero-clock cycle access
• Store either 32bit integer or 32bit float
Local Memory
• A small portion of global memory that is private to a stream processor
• Often used as overflow from registers
• Slow to access (same as global memory)
Shared Memory
• A block of memory that is shared by all stream processors in a multi-processor

• 16KB per block, stored in 16x1KB banks
• Very fast to access (i.e. as fast as registers!) without bank conflicts
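A small example of threads in a block cooperating through shared memory is an in-block reversal. This is a sketch with hypothetical names (`reverse_in_block`, `run_block_reverse`, the `BLOCK` constant); consecutive threads touch consecutive 32-bit words, which is the access pattern that avoids bank conflicts.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block, also the shared tile size (assumption)

// Each block reverses its own BLOCK-element slice of data in place.
__global__ void reverse_in_block(int *data)
{
    __shared__ int tile[BLOCK];            // one tile per thread block
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;

    tile[t] = data[base + t];              // stage the slice in shared memory
    __syncthreads();                       // wait until the whole tile is loaded
    data[base + t] = tile[BLOCK - 1 - t];  // read another thread's element
}

// Hypothetical host wrapper; returns true if every slice came back reversed.
bool run_block_reverse(int num_blocks)
{
    int n = num_blocks * BLOCK;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reverse_in_block<<<num_blocks, BLOCK>>>(d);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int b = 0; b < num_blocks; ++b)
        for (int t = 0; t < BLOCK; ++t)
            if (h[b * BLOCK + t] != b * BLOCK + (BLOCK - 1 - t)) return false;
    return true;
}
```

Without the `__syncthreads()` barrier a thread could read a tile element that its neighbour has not written yet, which is the kind of hazard that only shows up when threads share data.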
Global Memory

• Different types of “global memory”
  • Linear Memory
  • Texture Memory
  • Constant Memory

• Note that local, global, constant and texture memory all come from the same physical memory pool
• Just differ in access patterns, caching, etc.
Global Memory
• The large block of memory shared by all multi-processors on the compute device
• Size depends on device - 256MB to 1.5GB
• High bandwidth ~ 100GB/s

• Slow to access - several hundred clock cycle latency.
• Un-cached
Constant Memory
Constants set by CPU, read by GPU
Each SM has 8kiB cache for constants
Optimized for broadcast
  Accessing different elements forces serialisation
Can speed some calculations
Can relieve register pressure
Constant Memory
Declared at file scope
__constant__ float dc_myConst;

Set via cudaMemcpyToSymbol API call
float h_myConst = 3.14f;
cudaMemcpyToSymbol( dc_myConst, &h_myConst, sizeof(float) );

Accessed by name in kernel
__global__ void MyKernel( ... ) {
   ....
   float myVal = dc_myConst+1;
   ....
}
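Put together, a complete constant-memory example could look like the sketch below. The names `dc_scale`, `scale` and `run_scale` are hypothetical; the pattern shown (every thread reading the same constant) is the broadcast case that the constant cache is optimized for.

```cuda
#include <vector>
#include <cuda_runtime.h>

__constant__ float dc_scale;   // file scope, set by the host, read by kernels

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= dc_scale;   // all threads read the same address
}

// Hypothetical host wrapper; returns true if every element was scaled by s.
bool run_scale(int n, float s)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return false;

    // The symbol itself is passed, together with a pointer to the host value.
    cudaMemcpyToSymbol(dc_scale, &s, sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < n; ++i)
        if (h[i] != (float)i * s) return false;
    return true;
}
```

Passing the scale factor as a kernel argument would work just as well here; constant memory starts to pay off when the data is a table that many threads index identically.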
Textures

Textures are essentially look up tables
  Can only be written by the host
Cached on each multiprocessor (8kiB)
  Optimised for 2D spatial locality
Hardware interpolation possible
  Limited precision
Can clamp or wrap at boundaries
Textures

Declaration and setup rather involved
  See programming guide
Accessed in kernels via texture fetches:
tex1D, tex2D, tex3D, etc.

Co-ordinates at texel centres
  Have to take care when accessing elements
Textures

Can improve load coalescing from global memory
If whole texture fits in 8kiB cache, has grid lifetime
Clamping/wrapping can aid edge case handling
Have to test to determine benefits
General Principles

 Memory access patterns are crucial
 Even CPUs are typically memory bound
 GPUs have 100x the floating-point (FP) throughput of CPUs
   Only 10x memory bandwidth
 Have to keep the GPU busy
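PC Architecture

  [Figure: PC block diagram. CPU, Northbridge and Southbridge, with DRAM on the Memory Bus (25+ GB/s), the CPU on the Front Side Bus, the Graphics Card on the PCI-Express Bus (8 GB/s) with 160+ GB/s to its VRAM, and the PCI Bus, SATA and Ethernet (3+ Gb/s) behind the Southbridge]
                                                                                 modified from Matthew Bolitho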
PCIe Transfers
  (first thing to optimize?)

Typical Approach
   1. Copy data from CPU Memory to GPU Memory
   2. Run CUDA Kernel(s)
   3. Copy data from GPU memory to CPU Memory

   [Figure: CPU Memory, CPU, Chipset, GPU and GPU Memory linked over PCIe, annotated with bandwidths*]

                     *Averaged observed bandwidth
Processing Flow

           [Figure: CPU and GPU connected by the PCI Bus]

           1. Copy input data from CPU memory to GPU
              memory
           2. Load GPU program and execute,
              caching data on chip for performance
           3. Copy results from GPU memory to CPU
              memory
PCIe Transfers

PCIe 2.0 x16 bus has
  Latency of 10 µs (observed)
  Bandwidth of 8GB/s (theory), 5 GB/s (observed)
A lot of calculations can happen in these times
PCIe Transfers
PCIe transfers occur via DMA
  GPU reads pages direct from CPU memory
  Very bad if page gets moved mid-transfer
CUDA maintains internal pinned memory buffers
  Used for cudaMemcpy calls
  Data staged through these
Synchronous Functions
Standard CUDA C functions are synchronous
Kernel launches are:
   Runtime API: Asynchronous
   Driver API: cuLaunchGrid or cuLaunchGridAsync
Synchronous functions block on any prior asynchronous
kernel launches

Example
cudaMemcpy
mykernel     Returns immediately
cudaMemcpy   Waits for mykernel to complete, then does not return
             until copy is complete.

              cudaSetDeviceFlags() function
              sets behavior. Tradeoff between CPU
              cycles and response speed
               cudaDeviceScheduleSpin
               cudaDeviceScheduleYield
               cudaDeviceBlockingSync
              Driver API has equivalent
              context creation flags
Asynchronous APIs
All Memory operations can also be asynchronous, and return
immediately
Memory must be allocated as pinned, using:
   cuMemHostAlloc()
   cudaHostAlloc()
   Pinned memory allows direct DMA transfers by the GPU to and from
   host memory; it is locked to a physical address
   Older version of these functions,
   cuMemAllocHost() and cudaMallocHost(), also work.
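Pinned allocation and asynchronous copies can be combined as in the sketch below. The helper name `pinned_async_round_trip` is hypothetical, stream 0 is used throughout, and `cudaStreamSynchronize(0)` stands in for the point where the CPU finally needs the data.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: copy n floats host -> device -> host with
// asynchronous copies from page-locked (pinned) host buffers.
bool pinned_async_round_trip(int n)
{
    size_t bytes = n * sizeof(float);
    float *h_in = nullptr, *h_out = nullptr, *d_buf = nullptr;

    bool ok = cudaHostAlloc((void **)&h_in,  bytes, cudaHostAllocDefault) == cudaSuccess
           && cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocDefault) == cudaSuccess
           && cudaMalloc((void **)&d_buf, bytes) == cudaSuccess;

    if (ok) {
        for (int i = 0; i < n; ++i) { h_in[i] = (float)i; h_out[i] = -1.0f; }

        // Both calls return immediately; the copies are queued in stream 0.
        cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, 0);
        cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, 0);

        // ... the CPU could do unrelated work here ...

        cudaStreamSynchronize(0);   // wait until both copies have finished
        for (int i = 0; i < n; ++i)
            if (h_out[i] != h_in[i]) ok = false;
    }

    if (d_buf) cudaFree(d_buf);
    if (h_out) cudaFreeHost(h_out);
    if (h_in)  cudaFreeHost(h_in);
    return ok;
}
```

The host buffers must not be modified between issuing the copies and the synchronization point, since the DMA engine may still be reading them.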
Asynchronous APIs (cont.)
Copies & kernels are queued up in the GPU
Any launch overhead is overlapped

Synchronous calls should be done outside critical
sections - some of these are expensive!
   Initialization
   Memory allocations
   Stream / Event creation
   Interop resource
Example
cudaMemcpyAsync ( void * dst,
                  const void * src,
                  size_t count,
                  enum cudaMemcpyKind kind,              More on streams
                  cudaStream_t stream)                   soon, for now assume
                                                         stream = 0

cudaMemcpyAsync                      Returns immediately
mykernel                             Returns immediately
cudaMemcpyAsync                      Returns immediately

 CPU does other stuff here

cudaThreadSynchronize();             Waits for everything on the GPU to finish
                                     then returns
Events can be used to monitor completion
 cudaEvent_t / CUevent
       Created by: cudaEventCreate() / cuEventCreate()

cudaEvent_t HtoDdone;
cudaEventCreate(&HtoDdone, 0);
cudaMemcpyAsync(dest, source, bytes, cudaMemcpyHostToDevice, 0);
cudaEventRecord(HtoDdone);
mykernel
cudaMemcpyAsync(dest, source, bytes, cudaMemcpyDeviceToHost, 0);
CPU can do stuff here
cudaEventSynchronize(HtoDdone);                    Waits just for everything before
                                                   cudaEventRecord(HtoDdone)
                                                   to complete, then returns
The first memory copy is done, so the memory at
source could be used again by the CPU
cudaThreadSynchronize();                           Waits for everything on the GPU
                                                   to finish then returns
Strategy: Overlap Acquisition with Transfer
             Allocate 2 pinned CPU buffers, ping-pong between them
             int bufNum = 0;
             void * pCPUbuf[2];
             ...  allocate buffers
             while (!done)
             {
                   // Concurrent: the copy, the kernels and GrabMyFrame
                   cudaMemcpyAsync(pGPUbuf, pCPUbuf[(bufNum+1)%2], size,
                                   cudaMemcpyHostToDevice, 0);
                   mykernel1
                   mykernel2
                   GrabMyFrame(pCPUbuf[bufNum]);

                 cudaThreadSynchronize();
                 bufNum++; bufNum %= 2;
             }
CUDA Streams

   NVIDIA GPUs with Compute Capability >= 1.1 have a
   dedicated DMA engine
   DMA transfers over PCIe can be
   concurrent with CUDA kernel execution*
   Streams allow independent concurrent in-order
   queues of execution
           cudaStream_t, CUstream
           cudaStreamCreate(), cuStreamCreate()
   Multiple streams exist within a single context, they
   share memory and other resources

   [Figure: a GPU with a Copy Engine and a Compute Engine, both connected through the Memory Controller to GPU Memory]

*1D copies only: cudaMemcpy2DAsync cannot overlap
Stream Parameter
All Async function varieties have a
stream parameter
Runtime kernel launch
  <<< GridSize, BlockSize, SMEM Size, Stream >>>

Driver API
  cuLaunchGridAsync( function, width, height,
    stream )

Copies & kernel launches with the same stream
parameter execute in-order
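The stream parameter can be exercised with two streams that each copy, process and copy back their own half of a buffer. This is a sketch with hypothetical names (`add_one`, `run_two_streams`); whether the two streams actually overlap on a given device depends on its copy and compute engines, so the example only verifies the result.

```cuda
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical helper: two streams, each handling n_per_stream elements.
bool run_two_streams(int n_per_stream)
{
    const int kStreams = 2;
    int total = kStreams * n_per_stream;
    size_t chunk_bytes = n_per_stream * sizeof(float);

    float *h = nullptr, *d = nullptr;
    if (cudaHostAlloc((void **)&h, kStreams * chunk_bytes,
                      cudaHostAllocDefault) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d, kStreams * chunk_bytes) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < total; ++i) h[i] = (float)i;

    cudaStream_t stream[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&stream[s]);

    // Work within one stream runs in order; the two streams are independent.
    for (int s = 0; s < kStreams; ++s) {
        int off = s * n_per_stream;
        cudaMemcpyAsync(d + off, h + off, chunk_bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        add_one<<<(n_per_stream + 255) / 256, 256, 0, stream[s]>>>(d + off,
                                                                   n_per_stream);
        cudaMemcpyAsync(h + off, d + off, chunk_bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
    }

    bool ok = true;
    for (int i = 0; i < total; ++i)
        if (h[i] != (float)i + 1.0f) ok = false;

    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

The third launch-configuration argument (`0`) is the dynamic shared memory size; the stream is the fourth, matching the order GridSize, BlockSize, SMEM Size, Stream.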
Scheduling on GPU

  [Figure: independent tasks A and B, each a sequence of copies and kernels, are issued into two queues, the Copy Engine and the Compute Engine]

  [Figure: Copy Engine and Compute Engine queue contents for one issue order of tasks A and B]
            Engine queues are filled in the order code
            is executed

  [Figure: Copy Engine and Compute Engine queue contents for a second issue order of the same copies and kernels]
Host Memory Mapping, a.k.a. Zero-Copy
The easy way to achieve copy/compute overlap!
1. Enable Host Mapping*
     Runtime: cudaSetDeviceFlags() with cudaDeviceMapHost flag
     Driver: cuCtxCreate() with CU_CTX_MAP_HOST
2. Allocate pinned CPU memory
     Runtime: cudaHostAlloc(), use cudaHostAllocMapped flag
     Driver: cuMemHostAlloc(), use CU_MEMHOSTALLOC_DEVICEMAP
3. Get a CUDA device pointer to this memory
     Runtime: cudaHostGetDevicePointer()
     Driver: cuMemHostGetDevicePointer()
4. Just use that pointer in your kernels!

Unified Memory Architecture (UMA) GPUs: zero-copy
eliminates data transfer altogether!

(*Check the canMapHostMemory / CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY
 device property flag to see if zero-copy is available)
Zero-copy guidelines
Data is transferred over the PCIe bus automatically.

Use when data is only read/written once
Use for very small amounts of data (new variables,
CPU/GPU communication)
Use when compute/memory ratio is very high and
occupancy is high, so latency over PCIe is hidden
Coalescing is critically important!
PCIe Transfers Optimization

 PCIe bus is slow
 Try to minimize transfers
 Use pinned memory on host whenever possible
 Try to perform copies asynchronously
Outline

• CUDA Language & APIs (overview)
• Threading/Execution (cont’d)
• Memory/Communication (cont’d)
• Tools
• Libraries
CUDA-GDB


                Extended version of GDB with support for C for CUDA

              Supported on Linux 32bit/64bit systems

                Seamlessly debug both the host|CPU and device|GPU code
                  Set breakpoints on any source line or symbol name
                  Single step executes only one warp except on sync threads
                  Access and print all CUDA memory allocations, local, global,
                  constant and shared vars.

          Walkthrough example with source code: see the CUDA-GDB manual
© NVIDIA Corporation 2010
Linux GDB integration with Emacs
Linux GDB integration with DDD
CUDA-MemCheck

                 Detects/tracks memory errors
                            Out of bounds accesses
                            Misaligned accesses (types must be aligned on their size)
                 Integrated into CUDA-GDB
                 Linux and WinXP
                 Win7 and Vista support coming




CUDA Driver: Low-level profiling support
          1. Set up environment variables
                            export CUDA_PROFILE=1
                            export CUDA_PROFILE_CSV=1
                            export CUDA_PROFILE_CONFIG=config.txt
                            export CUDA_PROFILE_LOG=profile.csv

          2. Set up configuration file
                   FILE "config.txt":
                       gpustarttimestamp
                       instructions

          3. Run application
                            matrixMul

          4. View profiler output
                   FILE "profile.csv":
                       # CUDA_PROFILE_LOG_VERSION 1.5
                       # CUDA_DEVICE 0 GeForce 8800 GT
                       # CUDA_PROFILE_CSV 1
                       # TIMESTAMPFACTOR fa292bb1ea2c12c
                       gpustarttimestamp,method,gputime,cputime,occupancy,instructions
                       115f4eaa10e3b220,memcpyHtoD,7.328,12.000
                       115f4eaa10e5dac0,memcpyHtoD,5.664,4.000
                       115f4eaa10e95ce0,memcpyHtoD,7.328,6.000
                       115f4eaa10f2ea60,_Z10dmatrixmulPfiiS_iiS_,19.296,40.000,0.333,4352
                       115f4eaa10f443a0,memcpyDtoH,7.776,36.000


CUDA Visual Profiler - Overview
             Performance analysis tool to fine tune CUDA applications

             Supported on Linux/Windows/Mac platforms

             Functionality:

                      Execute a CUDA application and collect profiling data

                      Multiple application runs to collect data for all hardware performance counters

                      Profiling data for all kernels and memory transfers

                      Analyze profiling data



CUDA Visual Profiler: data for kernels
CUDA Visual Profiler: computed data for kernels
            Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate

            Global memory read throughput (Gigabytes/second)

            Global memory write throughput (Gigabytes/second)

            Overall global memory access throughput (Gigabytes/second)

            Global memory load efficiency

            Global memory store efficiency




CUDA Visual Profiler: data for memory transfers

             Memory transfer type and direction
             (D=Device, H=Host, A=cuArray)
                e.g. H to D: Host to Device

                      Synchronous / Asynchronous

             Memory transfer size, in bytes

             Stream ID




CUDA Visual Profiler: data analysis views
             Views:
                 Summary table
                 Kernel table
                 Memcopy table
                 Summary plot
                 GPU Time Height plot
                 GPU Time Width plot
                 Profiler counter plot
                 Profiler table column plot
                 Multi-device plot
                 Multi-stream plot
             Analyze profiler counters
             Analyze kernel occupancy

CUDA Visual Profiler: Misc.
             Multiple sessions

             Compare views for different sessions

             Comparison Summary plot

             Profiler projects: save & load

            Import/Export profiler data
            (.CSV format)




Outline

• CUDA Language & APIs (overview)
• Threading/Execution (cont’d)
• Memory/Communication (cont’d)
• Tools
• Libraries
CUBLAS
CUBLAS
         CUDA accelerated BLAS (Basic Linear Algebra
         Subprograms)
                   Create matrix and vector objects in GPU memory space
                   Fill objects with data
                   Call sequence of CUBLAS functions
                   Retrieve data from GPU (optionally)

                     while( i++ < max_iter && deltaNew > stop_tol )
                     {
                       cublasSgemv (...);
                       float alpha = deltaNew / cublasSdot(n, d_d, 1, d_y, 1);
                       cublasSaxpy(n, alpha, d_d, 1, d_x, 1);

                       // every 50 iterations, restart residual
                       if (i % 50 == 0) {
                         cublasSgemv (...);
                         cublasScopy(n, d_b, 1, d_r, 1);
                         cublasSaxpy(n, -1.0, d_y, 1, d_r, 1);
                       }
                       else
                         cublasSaxpy(n, -alpha, d_y, 1, d_r, 1);
                     ...

© NVIDIA Corporation 2009
CUBLAS Features
         Single precision data:
                   Level 1 (vector-vector O(N) )
                   Level 2 (matrix-vector O(N^2) )
                   Level 3 (matrix-matrix O(N^3) )
         Complex single precision data:
                   Level 1
                   CGEMM
         Double precision data:
                   Level 1: DASUM, DAXPY, DCOPY, DDOT, DNRM2,
                   DROT, DROTM, DSCAL, DSWAP, IDAMAX, IDAMIN
                   Level 2: DGEMV, DGER, DSYR, DTRSV
                   Level 3: ZGEMM, DGEMM, DTRSM, DTRMM, DSYMM,
                   DSYRK, DSYR2K


CUBLAS Performance: CPU vs GPU




                               CUBLAS: CUDA 2.3, Tesla C1060
                            MKL 10.0.3: Intel Core2 Extreme, 3.00GHz
CUBLAS Performance
[Figure: speedup vs. MKL (y-axis, 0x to 12x) against matrix dimensions NxN (x-axis, 1024 to 7168); series: MKL, v3.1, v3.2]
                  Up to 2x average speedup over CUBLAS 3.1
                  Less variation in performance for different dimensions vs. 3.1
                  Average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}
                  Versions 3.2 & 3.1 on NVIDIA Tesla C2050 GPU
                  MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
CULA
CULA (LAPACK for heterogeneous systems)
GPU Accelerated Linear Algebra

  • Dense linear algebra
  • C/C++ & FORTRAN
  • 150+ Routines

 Partnership
  Developed in partnership with NVIDIA

 MATLAB Interface
  • 15+ functions
  • Up to 10x speedup

 Supercomputer Speeds
  Performance 7x of
CULA - Performance
 Supercomputing Speeds
  This graph shows the relative speed of many CULA functions when compared to

  (Fermi) and an Intel Core i7 860. More at www.culatools.com
CUSPARSE
Sparse Matrix Performance: CPU vs. GPU
                  Multiplication of a sparse matrix by multiple vectors
[Figure: bar chart, y-axis 0x to 35x; legend: "Non-transposed", "Transposed", MKL 10.2]
 Average speedup across S,D,C,Z

 CUSPARSE 3.2 on NVIDIA Tesla C2050 GPU
 MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
CUFFT
CUFFT
         CUFFT is the CUDA FFT library
         Computes parallel FFT on an NVIDIA GPU

                   Plan contains information about optimal configuration for a
                   given transform.
                   Plans can be persisted to prevent recalculation.
                   Good fit for CUFFT because different kinds of FFTs require
                   different thread/block/grid configurations.




CUFFT Features


         1D, 2D and 3D transforms of complex and real-
         valued data
         Batched execution for doing multiple 1D transforms
         in parallel
         1D transform size up to 8M elements
         2D and 3D transform sizes in the range [2,16384]
         In-place and out-of-place transforms for real and
         complex data.




CUFFT Example
                                                  Complex 2D transform
                       #define NX 256
                       #define NY 128

                       cufftHandle plan;
                       cufftComplex *idata, *odata;
                       cudaMalloc((void**)&idata, sizeof(cufftComplex)*NX*NY);
                       cudaMalloc((void**)&odata, sizeof(cufftComplex)*NX*NY);

                       /* Create a 2D FFT plan. */
                       cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

                       /* Use the CUFFT plan to transform the signal out of place. */
                       cufftExecC2C(plan, idata, odata, CUFFT_FORWARD);

                       /* Inverse transform the signal in place. */
                       cufftExecC2C(plan, odata, odata, CUFFT_INVERSE);

                       /* Destroy the CUFFT plan. */
                       cufftDestroy(plan);

                       cudaFree(idata);
                       cudaFree(odata);




CUFFT Performance: CPU vs GPU




                                 cuFFT 2.3, NVIDIA Tesla C1060 GPU
                        MKL 10.1r1, Quad-Core Intel Core i7 (Nehalem) 3.2GHz
CUFFT 3.2: Improved Radix-3, -5, -7
[Figure: two panels, "Radix-3 (SP, ECC off)" and "Radix-3 (DP, ECC off)"; GFLOPS (y-axis) vs. log3(size) (x-axis, 1 to 15); series: C2070 R3.2, C2070 R3.1, MKL]

         Radix-5, -7 and mixed radix improvements not shown

         CUFFT 3.2 & 3.1 on NVIDIA Tesla C2070 GPU
         MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
CUDPP
CUDPP
A library of high-performance parallel primitives for CUDA
   M. Harris (NVIDIA), J. Owens (UCD), S. Sengupta (UCD), A. Davidson
   (UCD), S. Tzeng (UCD), Y. Zhang (UCD)


Algorithms
   cudppScan, cudppSegmentedScan, cudppReduce
   cudppSort, cudppRand, cudppSparseMatrixVectorMultiply


Additional algorithms in progress
   Graphs, more sorting, trees, hashing, autotuning
CUDPP Example
CUDPPConfiguration config = { CUDPP_SCAN,
 CUDPP_ADD, CUDPP_FLOAT, CUDPP_OPTION_FORWARD };

CUDPPHandle plan;
CUDPPResult result = cudppPlan(&plan,
                               config,
                               numElements,
                               1, 0);
cudppScan(plan, d_odata, d_idata, numElements);
More?
Thrust
Objectives

•   Programmer productivity
           •   Rapidly develop complex applications
           •   Leverage parallel primitives
•   Encourage generic programming
           •   Don’t reinvent the wheel
           •   E.g. one reduction to rule them all
•   High performance
           •   With minimal programmer effort
•   Interoperability
           •   Integrates with CUDA C/C++ code
© 2008 NVIDIA Corporation
Thrust
C++ template library for CUDA
   Mimics Standard Template Library (STL)
Containers
   thrust::host_vector<T>
   thrust::device_vector<T>
Algorithms
   thrust::sort()
   thrust::reduce()
   thrust::inclusive_scan()
   Etc.
Thrust Example
// generate 16M random numbers on the host
thrust::host_vector<int> h_vec(1 << 24);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;

// sort data on the device
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
Thrust on Google Code
 Quick Start Guide
 Examples
 Documentation
 Mailing List (thrust-users)
Can I use Thrust?
 Extensively tested
    600+ unit tests
 Open Source
    Permissive License (Apache v2)
 Active community
    200+ members on cusp-users
More?
PyCUDA
PyCUDA


         3rd party open source, written by Andreas Klöckner
         Exposes all of CUDA via Python bindings
         Compiles CUDA on the fly
                   presents CUDA as an interpreted language
         Integration with numpy
         Handles memory management, resource allocation
         CUDA programs are Python strings
                   Metaprogramming: modify source code on the fly
                   Like a really complex pre-processor
         http://mathema.tician.de/software/pycuda


PyCUDA Example
   import pycuda.driver as cuda
   import pycuda.autoinit  # creates a CUDA context on import
   import numpy
   from pycuda.compiler import SourceModule

   # 4x4 single-precision input on the host, copied to a device buffer
   a = numpy.random.randn(4, 4).astype(numpy.float32)
   a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
   cuda.memcpy_htod(a_gpu, a)

   mod = SourceModule("""
       __global__ void doublify(float *a)
       {
         int idx = threadIdx.x + threadIdx.y*4;
         a[idx] *= 2.0f;
       }
       """)
   func = mod.get_function("doublify")
   func(a_gpu, block=(4, 4, 1))  # one 4x4 thread block, one thread per element

   a_doubled = numpy.empty_like(a)
   cuda.memcpy_dtoh(a_doubled, a_gpu)
   print(a_doubled)
   print(a)
More?
CURAND
RNG Performance: CPU vs. GPU
                              Generating 100K Sobol' Samples
[Figure: bar chart, y-axis 0x to 25x; bars for SP and DP under Uniform and Normal distributions; legend: CURAND 3.2, MKL 10.2]
  CURAND 3.2 on NVIDIA Tesla C2050 GPU
  MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
OpenVIDIA
OpenVIDIA
 Open source, supported by NVIDIA
 Computer Vision Workbench (CVWB)
    GPU imaging & computer vision

    Demonstrates most commonly used image
    processing primitives on CUDA

    Demos, code & tutorials/information




   http://openvidia.sourceforge.net
and many more...
References
•   CUDA C Programming Guide 
•   CUDA C Best Practices Guide 
•   CUDA Reference Manual 
•   API Reference, PTX ISA 2.2 
•   CUDA-GDB User Manual 
•   Visual Profiler Manual  
•   User Guides: CUBLAS, CUFFT, CUSPARSE, CURAND

http://developer.nvidia.com/object/gpucomputing.html
one more thing
           or two...
Life/Code Hacking #2.x
                Speed {listen,read,writ}ing




accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.1
                                                   Speed listening
• Step 1: Collect
 • online videos, tutorials, podcasts, etc.
 • audiobooks
 • youtube-dl, get_flash_videos, jDownloader,
   ffmpeg, mplayer, etc.
 • etc.
Life/Code Hacking #2.1
                                                   Speed listening

• Step 2: Accelerate (time-stretch)
 • VLC (Playback > Faster)
 • sox $f{,.1.8X.mp3} tempo 1.8 50
 • iPod ? mp3splt -t 5.00 -o small-@n large.mp3

Life/Code Hacking #2.1
                                                   Speed listening



• Step 3: chill or do more ;-)


Demo
CO ME

[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patternsnpinto
 

Massively Parallel Computing CS 264 Lecture

  • 1. Massively Parallel Computing CS 264 / CSCI E-292 Lecture #4: Intermediate-level CUDA | February 15th, 2011 Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • 2. Administrivia • HW1: due Fri 2/18/11 (this week) • Projects: think about it, consult the staff • New guest lecturers! • Max Lin (Google), Kurt Messersmith et al. (Amazon), David Rich et al. (Microsoft)
  • 3. During this course, we’ll try to “ ” (adapted for CS264) and use existing material ;-)
  • 5. Outline • CUDA Language & APIs (overview) • Threading/Execution (cont’d) • Memory/Communication (cont’d) • Tools • Libraries
  • 6. Outline • CUDA Language & APIs (overview) • Threading/Execution (cont’d) • Memory/Communication (cont’d) • Tools • Libraries
  • 7. Language • CUDA provides a set of built-in vector types: • char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4 • short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4 • int1, uint1, int2, uint2, int3, uint3, int4, uint4 • long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4 • float1, float2, float3, float4
  • 8. Language • Can construct a vector type with a special function: make_<type>(...) • Can access elements of a vector type with .x, .y, .z, .w
  • 9. Language • dim3 is a special vector type • Same as uint3, except it can be constructed from a scalar to form a vector: (scalar, 1, 1)
  • 10. Language • CUDA provides four global, built-in variables: threadIdx, blockIdx, blockDim, gridDim • Of type dim3 or uint3 • Accessible only from device code • Cannot take address • Cannot assign value
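The vector types, `dim3` and the built-in index variables can be seen together in a small program. This is a minimal sketch, not taken from the lecture: the kernel `scale` and the helper `run_scale` are hypothetical names, and the block size of 4 is an arbitrary choice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales x, y, z of each element and stores the global index in w.
__global__ void scale(float4 *v, float s, int n)
{
    // threadIdx, blockIdx and blockDim are the read-only built-in variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = v[i];
        v[i] = make_float4(p.x * s, p.y * s, p.z * s, (float)i);
    }
}

// Runs the kernel on n elements and returns element `pick` (w is -1 on a CUDA error).
float4 run_scale(int n, int pick)
{
    float4 result = make_float4(0.0f, 0.0f, 0.0f, -1.0f);
    float4 *h = new float4[n];
    for (int i = 0; i < n; ++i) h[i] = make_float4(1.0f, 2.0f, 3.0f, 0.0f);

    float4 *d = 0;
    if (cudaMalloc(&d, n * sizeof(float4)) == cudaSuccess) {
        cudaMemcpy(d, h, n * sizeof(float4), cudaMemcpyHostToDevice);
        dim3 block(4);                              // built from a scalar: (4, 1, 1)
        dim3 grid((n + block.x - 1) / block.x);     // enough blocks to cover n
        scale<<<grid, block>>>(d, 2.0f, n);
        if (cudaMemcpy(h, d, n * sizeof(float4), cudaMemcpyDeviceToHost) == cudaSuccess)
            result = h[pick];
        cudaFree(d);
    }
    delete[] h;
    return result;
}

int main()
{
    float4 r = run_scale(8, 5);
    printf("(%g, %g, %g, %g)\n", r.x, r.y, r.z, r.w);
    return r.w < 0.0f;   // non-zero exit code if a CUDA call failed
}
```

Compile with `nvcc`; the `.x`/`.y`/`.z`/`.w` accessors work the same way on the host and on the device.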
  • 11. Language • Runtime Math Library • There are two types of runtime math operations in single precision • __funcf(): direct mapping to hardware ISA • Fast but lower accuracy (see prog. guide for details) • Examples: __sinf(x), __expf(x), __powf(x,y) • funcf(): compile to multiple instructions • Slower but higher accuracy (5 ulp or less) • Examples: sinf(x), expf(x), powf(x,y) • The -use_fast_math compiler option forces every funcf() to compile to __funcf() • Unit of Least Precision (ULP) is the gap between the floating-point numbers nearest a given real number © NVIDIA Corporation 2010
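To see the accuracy trade-off between the two families of math functions, one can evaluate both on the device and compare. A minimal sketch, assuming a single-thread kernel is enough for illustration; `sines` and `sin_gap` are hypothetical names. If the file is compiled with `-use_fast_math`, both calls map to the intrinsic and the gap is zero.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void sines(float x, float *out)
{
    out[0] = sinf(x);     // library version: several instructions, higher accuracy
    out[1] = __sinf(x);   // intrinsic: maps to hardware, lower accuracy
}

// Returns |sinf(x) - __sinf(x)| as computed on the device, or -1 on a CUDA error.
float sin_gap(float x)
{
    float h[2] = {0.0f, 0.0f};
    float *d = 0;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return -1.0f;
    sines<<<1, 1>>>(x, d);
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess ? fabsf(h[0] - h[1]) : -1.0f;
}

int main()
{
    float gap = sin_gap(1.0f);
    printf("|sinf - __sinf| at x = 1: %g\n", gap);
    return gap < 0.0f;
}
```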
  • 12. API CUDA APIs • API allows the host to manage the devices • Allocate memory & transfer data • Launch kernels • CUDA C Runtime API: high level of abstraction - start here! • CUDA C Driver API (aka “Device” API): more control, more verbose • (OpenCL: similar to CUDA C Driver API) © NVIDIA Corporation 2010
  • 13. API • The Host API is exposed as two different layers • The low-level device API (prefix: cu) • The high-level Runtime API (prefix: cuda) • Some things can be done through both APIs, others are specialized • Can be mixed together (with care)
  • 14. API • All GPU computing is performed on a device • To allocate memory, run a program, etc. on the hardware, we need a device context • Device contexts are bound 1:1 with host threads (just like OpenGL) • So, each host thread may have at most one device context • And, each device context is accessible from only one host thread
  • 15. API • All device API calls return an error/success code of type CUresult • All runtime API calls return an error/success code of type cudaError_t • An integer value, with zero = no error • cudaGetLastError, cudaGetErrorString
  • 16. API • Runtime API calls automatically initialize • Device API calls must call cuInit
  • 17. Runtime API (high-level)
  • 18. API • Runtime API provides a simplified interface for creating a context: • cudaGetDeviceCount • cudaSetDevice • And the useful: • cudaChooseDevice
  • 19. API • Allocate/Free memory: cudaMalloc, cudaFree • Initialize memory: cudaMemset • Copy memory: cudaMemcpy
  • 20. API • In .cu files, we can use the <<< >>> function invocation operator • The compiler generates all the device API calls needed to set up the execution environment
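The runtime calls listed on these slides all return a `cudaError_t`, so a common pattern is to wrap each call in a checking macro. The sketch below is one way to do that, not the lecture's code: the `CHECK` macro and `roundtrip` are hypothetical helpers, and device 0 is assumed to exist.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): abort with a message on any error code.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t e_ = (call);                                           \
        if (e_ != cudaSuccess) {                                           \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,             \
                    cudaGetErrorString(e_));                               \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Allocates device memory, fills it with cudaMemset, copies it back.
// Returns 1 if every byte arrived as expected, 0 otherwise.
int roundtrip()
{
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    CHECK(cudaSetDevice(0));              // assumption: use the first device

    const int n = 256;
    unsigned char host[n];
    unsigned char *dev = 0;
    CHECK(cudaMalloc(&dev, n));
    CHECK(cudaMemset(dev, 0x2A, n));      // initialize device memory
    CHECK(cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dev));

    int ok = 1;
    for (int i = 0; i < n; ++i)
        if (host[i] != 0x2A) ok = 0;
    return ok;
}

int main()
{
    printf("roundtrip %s\n", roundtrip() ? "ok" : "failed");
    return 0;
}
```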
  • 21. “Device” Driver API (low-level)
  • 22. API • The first (optional?) step is to enumerate the available devices • cuDeviceGetCount • cuDeviceGet • cuDeviceGetName • cuDeviceTotalMem • cuDeviceGetAttribute
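Device enumeration through the driver API can be sketched as follows. This is an illustrative sketch rather than the lecture's code; `list_devices` is a hypothetical helper, and the program has to be linked against the driver library (for example with `-lcuda`).

```cuda
#include <cstdio>
#include <cuda.h>

// Prints name and total memory of every device; returns the device count, or -1 on error.
int list_devices()
{
    if (cuInit(0) != CUDA_SUCCESS) return -1;   // driver API must be initialized explicitly

    int count = 0;
    if (cuDeviceGetCount(&count) != CUDA_SUCCESS) return -1;

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[128];
        size_t bytes = 0;
        if (cuDeviceGet(&dev, i) != CUDA_SUCCESS) return -1;
        if (cuDeviceGetName(name, (int)sizeof(name), dev) != CUDA_SUCCESS) return -1;
        if (cuDeviceTotalMem(&bytes, dev) != CUDA_SUCCESS) return -1;
        printf("device %d: %s, %zu MiB\n", i, name, bytes >> 20);
    }
    return count;
}

int main()
{
    return list_devices() < 0;
}
```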
  • 23. API • Once we choose a device with cuDeviceGet, we get a device handle of type CUdevice • Can now create a context with cuCtxCreate
  • 24. API • Once we have a context (CUcontext), we can allocate memory, call a GPU function, etc. • Context is implicitly associated with the creating thread • To synchronize all threads (CPU host with GPU threads), call cuCtxSynchronize • Waits for all GPU tasks to finish
  • 25. API • Allocate/Free memory: cuMemAlloc, cuMemFree • Initialize memory: cuMemset (cuMemsetD8, cuMemsetD16, cuMemsetD32) • Copy memory: cuMemcpyHtoD, cuMemcpyDtoH, cuMemcpyDtoD
  • 26. API • A module is a blob of GPU code/data along with some type information (cubin files) • A module is created by loading a cubin with cuModuleLoad or cuModuleLoadData • A module can be unloaded with cuModuleUnload
  • 27. API • Loading a module also copies it to the device • Can then get the address of functions and global variables: cuModuleGetFunction, cuModuleGetGlobal, cuModuleGetTexRef
  • 28. API • Once a module is loaded, and we have a function pointer, we can call a function • We must set up the execution environment first
  • 29. API • Execution environment includes: • Thread Block Size • Shared Memory Size • Function Parameters • Grid Size
  • 30. API • Thread Block Size: cuFuncSetBlockShape • Shared Memory Size: cuFuncSetSharedSize • Function Parameters: cuParamSetSize, cuParamSeti, cuParamSetf, cuParamSetv
  • 31. API • Grid Size is set at the same time as the function invocation: cuLaunchGrid
  • 32. Outline • CUDA Language & APIs (overview) • Threading/Execution (cont’d) • Memory/Communication (cont’d) • Tools • Libraries
  • 33. Threading Hierarchy Execution Model Software Hardware Threads are executed by thread Thread processors Processor Thread Thread blocks are executed on multiprocessors Thread blocks do not migrate Several concurrent thread blocks can Thread reside on one multiprocessor - limited Block Multiprocessor by multiprocessor resources (shared memory and register file) A kernel is launched as a grid of thread blocks ... Only one kernel can execute on a Grid device at one time Device © 2008 NVIDIA Corporation.
  • 34. Thread Batching Kernel launches a grid of thread blocks Threads within a block cooperate via shared memory Threads within a block can synchronize Threads in different blocks cannot cooperate Allows programs to transparently scale to different GPUs Grid Thread Block 0 Thread Block 1 Thread Block N-1 … Shared Memory Shared Memory Shared Memory © 2008 NVIDIA Corporation.
  • 35. Transparent Scalability Hardware is free to schedule thread blocks on any processor A kernel scales across parallel multiprocessors Kernel grid Device Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 0 Block 1 Block 0 Block 1 Block 2 Block 3 Block 6 Block 7 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 4 Block 5 Block 6 Block 7 © 2008 NVIDIA Corporation.
  • 37. Indexing Arrays: Example In this example, the red entry would have an index of 21: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 M = 8 threads/block blockIdx.x = 2 int index = threadIdx.x + blockIdx.x * M; = 5 + 2 * 8; = 21;
  • 38. Addition with Threads and Blocks • blockDim.x is a built-in variable for threads per block: int index = threadIdx.x + blockIdx.x * blockDim.x; • A combined version of our vector addition kernel to use blocks and threads: __global__ void add( int *a, int *b, int *c ) { int index = threadIdx.x + blockIdx.x * blockDim.x; c[index] = a[index] + b[index]; }
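A complete host program around the vector addition kernel might look like the sketch below. It keeps the slide's M = 8 threads per block; the extra length parameter `n`, the bounds check and the helper `run_add` are additions made here for illustration so that sizes which are not a multiple of 8 also work.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void add(const int *a, const int *b, int *c, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                       // guard: the last block may be only partly used
        c[index] = a[index] + b[index];
}

// Adds two n-element vectors on the GPU; returns the number of wrong results (-1 on error).
int run_add(int n)
{
    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

    size_t bytes = n * sizeof(int);
    int *da = 0, *db = 0, *dc = 0;
    bool ok = cudaMalloc(&da, bytes) == cudaSuccess &&
              cudaMalloc(&db, bytes) == cudaSuccess &&
              cudaMalloc(&dc, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);
        const int M = 8;                               // threads per block, as on the slide
        add<<<(n + M - 1) / M, M>>>(da, db, dc, n);    // round the block count up
        ok = cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(da); cudaFree(db); cudaFree(dc);          // cudaFree(0) is a no-op
    if (!ok) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (c[i] != 3 * i) ++wrong;
    return wrong;
}

int main()
{
    int wrong = run_add(22);             // 22 elements: indices 0..21, three blocks of 8
    printf("wrong results: %d\n", wrong);
    return wrong != 0;
}
```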
  • 40. Control Flow Divergence What happens if you have the following code? if(foo(threadIdx.x)) { do_A(); } else { do_B(); }
  • 41. Control Flow Divergence Branch Path A Path B
  • 42. Control Flow Divergence Nested branches are handled as well: if(foo(threadIdx.x)) { if(bar(threadIdx.x)) do_A(); else do_B(); } else do_C();
  • 43. Control Flow Divergence Branch Branch Path A Path B Path C
  • 44. Control Flow Divergence • You don’t have to worry about divergence for correctness (*) • You might have to think about it for performance • Depends on your branch conditions
  • 45. Control Flow Divergence Performance drops off with the degree of divergence: switch(threadIdx.x ...) { case 0: ... case 1: ... }
  • 46. Divergence [plot: Performance (y-axis) versus Divergence (x-axis)]
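The effect of divergence on run time can be probed with a small timing experiment. The sketch below is an assumption-laden illustration, not the benchmark behind the plot: the path is chosen with `threadIdx.x % ways` (a pattern picked here so that neighbouring threads of a warp take different paths), the work per path is arbitrary, and the compiler may still optimize parts of the branching away. `diverge` and `time_diverge` are hypothetical names, and `ways` must be at least 1.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread follows one of `ways` paths chosen by threadIdx.x % ways (assumed pattern).
__global__ void diverge(float *out, int ways)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int path = threadIdx.x % ways;
    float x = 1.0f;
    for (int p = 0; p < ways; ++p) {
        if (path == p) {                 // threads on other paths wait while this one runs
            for (int k = 0; k < 10000; ++k)
                x = x * 0.5f + (float)(p + 1);
        }
    }
    out[i] = x;
}

// Times one launch in milliseconds; returns -1 on a CUDA error.
float time_diverge(int ways)
{
    const int threads = 256, blocks = 256;
    float *d = 0;
    if (cudaMalloc(&d, threads * blocks * sizeof(float)) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    diverge<<<blocks, threads>>>(d, ways);
    cudaEventRecord(stop);

    float ms = -1.0f;
    if (cudaEventSynchronize(stop) == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main()
{
    for (int ways = 1; ways <= 32; ways *= 2)
        printf("%2d-way divergence: %.3f ms\n", ways, time_diverge(ways));
    return 0;
}
```

Absolute timings depend on the device and compiler version, so only the trend across `ways` is meaningful.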
  • 48. Occupancy • Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy • Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently • Limited by resource usage: • Registers • Shared memory © NVIDIA Corporation 2010
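The occupancy ratio defined above can be estimated programmatically. The sketch below uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, a runtime call that was added to CUDA after this lecture was given; the kernel `dummy` and the helper `occupancy` are hypothetical, and device 0 is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Returns the theoretical occupancy (0..1) of `dummy` for a block size, or -1 on error.
float occupancy(int blockSize)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1.0f;

    int blocksPerSM = 0;   // resident blocks per multiprocessor, limited by registers/shared memory
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy, blockSize, 0)
            != cudaSuccess)
        return -1.0f;

    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    int activeWarps   = blocksPerSM * warpsPerBlock;
    int maxWarps      = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return (float)activeWarps / (float)maxWarps;
}

int main()
{
    for (int bs = 32; bs <= 512; bs *= 2)
        printf("block size %3d: occupancy %.2f\n", bs, occupancy(bs));
    return 0;
}
```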
  • 49. Blocks per Grid Heuristics • # of blocks > # of multiprocessors • So all multiprocessors have at least one block to execute • # of blocks / # of multiprocessors > 2 • Multiple blocks can run concurrently in a multiprocessor • Blocks that aren’t waiting at a __syncthreads() keep the hardware busy • Subject to resource availability: registers, shared memory • # of blocks > 100 to scale to future devices • Blocks executed in pipeline fashion • 1000 blocks per grid will scale across multiple generations © NVIDIA Corporation 2010
  • 50. Outline • CUDA Language & APIs (overview) • Threading/Execution (cont’d) • Memory/Communication (cont’d) • Tools • Libraries
  • 51. Kernel Memory Access • Per-thread: Registers (on-chip); Local Memory (off-chip, uncached) • Per-block: Shared Memory (on-chip, small, fast) • Per-device: Global Memory (off-chip, large, uncached, persistent across kernel launches, kernel I/O)
  • 52. Registers • Large number of registers per stream processor (1024 in some hardware) • Zero-clock cycle access • Store either 32-bit integer or 32-bit float
  • 53. Local Memory • A small portion of global memory that is private to a stream processor • Often used as overflow from registers • Slow to access (same as global memory)
  • 54. Shared Memory • A block of memory that is shared by all stream processors in a multi-processor • 16kB per block, stored in 16 1kB banks • Very fast to access (i.e. as fast as registers), without bank conflicts
  • 55. Global Memory • Different types of “global memory”: • Linear Memory • Texture Memory • Constant Memory
  • 56. • Note that local, global, constant and texture memory all come from the same physical memory pool • Just differ in access patterns, caching, etc.
  • 57. Global Memory • The large block of memory shared by all multi-processors on the compute device • Size depends on device: 256MB to 1.5GB • High bandwidth: ~100GB/s • Slow to access: several hundred clock cycle latency • Not cached
  • 58. Constant Memory Constants set by CPU, read by GPU Each SM has 8kiB cache for constants Optimized for broadcast Accessing different elements forces serialisation Can speed some calculations Can relieve register pressure
  • 59. Constant Memory Declared at file scope: __constant__ float dc_myConst; Set via cudaMemcpyToSymbol API call: float myConst = 3.14f; cudaMemcpyToSymbol( dc_myConst, &myConst, sizeof(float) ); Accessed by name in kernel: __global__ void MyKernel( ... ) { .... float myVal = dc_myConst+1; .... }
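A runnable version of the constant-memory pattern is sketched below. The names `dc_myConst` and `MyKernel` follow the slide; the output array, the launch configuration and the helper `run_constant` are assumptions added for illustration. Note that `cudaMemcpyToSymbol` takes the symbol itself and a pointer to the host value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float dc_myConst;            // declared at file scope

__global__ void MyKernel(float *out)
{
    float myVal = dc_myConst + 1;         // accessed by name inside the kernel
    out[threadIdx.x] = myVal;
}

// Sets the constant to `value`, runs the kernel, returns what the last thread wrote
// (or -1 on a CUDA error).
float run_constant(float value)
{
    const int n = 4;
    float h[n] = {0.0f, 0.0f, 0.0f, 0.0f};
    float *d = 0;

    if (cudaMemcpyToSymbol(dc_myConst, &value, sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return -1.0f;

    MyKernel<<<1, n>>>(d);                // every thread reads the same constant (broadcast)
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess ? h[n - 1] : -1.0f;
}

int main()
{
    printf("dc_myConst + 1 = %f\n", run_constant(3.14f));
    return 0;
}
```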
  • 60. Textures Textures are essentially look up tables Can only be written by the host Cached on each multiprocessor (8kiB) Optimised for 2D spatial locality Hardware interpolation possible Limited precision Can clamp or wrap at boundaries
  • 61. Textures Declaration and setup rather involved See programming guide Accessed in kernels via texture fetches: tex1D, tex2D, tex3D, etc. Co-ordinates at texel centres Have to take care when accessing elements
  • 62. Textures Can improve load coalescing from global memory If whole texture fits in 8kiB cache, has grid lifetime Clamping/wrapping can aid edge case handling Have to test to determine benefits
  • 63. General Principles Memory access patterns are crucial Even CPUs are typically memory bound GPUs have 100x FP Only 10x memory bandwidth Have to keep the GPU busy
  • 64. PC Architecture [diagram labels: CPU • Front Side Bus • Northbridge • Memory Bus • DRAM • PCI-Express Bus • Graphics Card (CUDA) • Southbridge • PCI Bus • SATA • Ethernet • annotated bandwidths: 8 GB/s, 3+ Gb/s, 25+ GB/s, 160+ GB/s to VRAM] modified from Matthew Bolitho
  • 65. PCIe Transfers (first thing to optimize?)
  • 66. [chart: averaged observed bandwidth for copying data from CPU memory to GPU memory and from GPU memory to CPU memory]
  • 67. Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory © NVIDIA Corporation 2010
  • 68. Processing Flow (across the PCI bus)
      1. Copy input data from CPU memory to GPU memory
      2. Load GPU program and execute, caching data on chip for performance
  • 69. Processing Flow (across the PCI bus)
      1. Copy input data from CPU memory to GPU memory
      2. Load GPU program and execute, caching data on chip for performance
      3. Copy results from GPU memory to CPU memory
  • 70. PCIe Transfers
      • PCIe 2.0 x16 bus has:
          • Latency of 10 µs (observed)
          • Bandwidth of 8 GB/s (theory), 5 GB/s (observed)
      • A lot of calculations can happen in these times
  • 71. PCIe Transfers
      • PCIe transfers occur via DMA
          • GPU reads pages direct from CPU memory
          • Very bad if page gets moved mid-transfer
      • CUDA maintains internal pinned memory buffers
          • Used for cudaMemcpy calls
          • Data staged through these
  • 72. Synchronous Functions
      • Standard CUDA C functions are synchronous
      • Kernel launches are:
          • Runtime API: asynchronous
          • Driver API: cuLaunchGrid() or cuLaunchGridAsync()
      • Synchronous functions block on any prior asynchronous kernel launches
  • 73. Example
      • Call sequence: cudaMemcpy, myKernel, cudaMemcpy
      • The myKernel launch returns immediately
      • The second cudaMemcpy waits for myKernel to complete, then until the copy is complete
      • cudaSetDeviceFlags() sets the waiting behavior, a tradeoff between CPU cycles and response speed:
          • cudaDeviceScheduleSpin
          • cudaDeviceScheduleYield
          • cudaDeviceBlockingSync
      • Driver API has equivalent context creation flags
  • 74. Asynchronous APIs
      • All memory operations can also be asynchronous, and return immediately
      • Memory must be allocated as pinned, using cuMemHostAlloc() or cudaHostAlloc()
      • Pinned memory is locked to a physical address, which allows direct DMA transfers by the GPU
      • Older versions of these functions, cuMemAllocHost() and cudaMallocHost(), also work
  • 76. Example (more on streams soon; for now assume stream = 0)
      • cudaMemcpyAsync: returns immediately
      • myKernel: returns immediately
      • cudaMemcpyAsync: returns immediately
      • CPU does other stuff here
      • cudaThreadSynchronize: waits for everything on the GPU to finish, then returns
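The asynchronous sequence on this slide can be written out as a complete program. This is a hedged sketch rather than the lecture's listing: the kernel body, the buffer names and sizes are assumptions, and `cudaDeviceSynchronize()` is used where the slide names `cudaThreadSynchronize()`, which later CUDA releases deprecate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the slide's myKernel.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data = nullptr, *d_data = nullptr;
    // Pinned (page-locked) host memory, needed for truly asynchronous copies.
    cudaHostAlloc((void **)&h_data, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // All three operations are queued on stream 0 in order;
    // the calls normally return to the CPU without waiting for completion.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, 0);
    myKernel<<<(n + 255) / 256, 256, 0, 0>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0);

    // ... CPU does other stuff here ...

    // Wait for everything queued on the GPU to finish.
    cudaDeviceSynchronize();
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```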
  • 78. Strategy: Overlap Acquisition with Transfer
      • Allocate 2 pinned CPU buffers, ping-pong between them
      • [code listing, only partly recoverable: inside a while (!done) loop, cudaMemcpyAsync(...) copies one buffer to the device, myKernel1 and myKernel2 are launched, GrabMyFrame(...) fills the other buffer, cudaThreadSynchronize() waits, and the buffer index is incremented modulo 2]
  • 79. [diagram: GPU containing compute and DMA engines, a memory controller and GPU memory]
      • NVIDIA GPUs with Compute Capability >= 1.1 have a dedicated DMA engine
      • DMA transfers over PCIe can be concurrent with CUDA kernel execution
      • Streams allow independent concurrent in-order queues of execution
          • cudaStream_t, CUstream
          • cudaStreamCreate(), cuStreamCreate()
      • Multiple streams exist within a single context; they share memory and other resources
  • 80. Stream parameter
      • All Async function varieties have a stream parameter
      • Runtime kernel launch: <<< GridSize, BlockSize, Smem Size, Stream >>>
      • Driver API: cuLaunchGridAsync(function, width, height, stream)
      • Copies & kernel launches with the same stream parameter execute in-order
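A short sketch of the stream parameter in use may help here. It splits one array across two streams so that each stream runs its own copy, kernel, copy sequence in order. The kernel `addOne`, the chunk size and the use of exactly two streams are assumptions for illustration only; whether the two streams actually overlap depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for this illustration.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int nStreams = 2;
    const int chunk = 1 << 18;                       // elements per stream
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_data = nullptr, *d_data = nullptr;
    cudaHostAlloc((void **)&h_data, nStreams * chunkBytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_data, nStreams * chunkBytes);
    for (int i = 0; i < nStreams * chunk; ++i) h_data[i] = 0.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        float *h = h_data + s * chunk;
        float *d = d_data + s * chunk;
        // Operations that share a stream execute in order;
        // operations in different streams may overlap.
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();
    printf("first = %f, last = %f\n", h_data[0], h_data[nStreams * chunk - 1]);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```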
  • 81. Scheduling on GPU [diagram: independent tasks A and B, each a sequence of copy and kernel operations (Copy A1, Kernel A1, ...)]
  • 82. [diagram: copy and kernel operations for tasks A and B in the engine queues] Engine queues are filled in the order code is executed
  • 83. [diagram: copy and kernel operations for tasks A and B in the engine queues]
  • 86. PCIe Transfers Optimization
      • PCIe bus is slow
      • Try to minimize transfers
      • Use pinned memory on host whenever possible
      • Try to perform copies asynchronously
  • 87. Outline • CUDA Language & APIs (overview) • Threading/Execution (cont’d) • Memory/Communication (cont’d) • Tools • Libraries
  • 88. CUDA-GDB
      • Extended version of GDB with support for C for CUDA
      • Supported on Linux 32bit/64bit systems
      • Seamlessly debug both the host (CPU) and device (GPU) code
      • Set breakpoints on any source line or symbol name
      • Single step executes only one warp, except on sync threads
      • Access and print all CUDA memory allocations: local, global, constant and shared vars
      • Walkthrough example with source code: CUDA-GDB manual
  • 89. Linux GDB Integration with EMACS
  • 90. Linux GDB Integration with DDD
  • 91. CUDA-MemCheck
      • Detects/tracks memory errors
          • Out of bounds accesses
          • Misaligned accesses (types must be aligned on their size)
      • Integrated into CUDA-GDB
      • Linux and WinXP
      • Win7 and Vista support coming
  • 92. CUDA Driver Low-level Profiling support
      1. Set up environment variables
          export CUDA_PROFILE=1
          export CUDA_PROFILE_CSV=1
          export CUDA_PROFILE_CONFIG=config.txt
          export CUDA_PROFILE_LOG=profile.csv
      2. Set up configuration file, FILE "config.txt":
          gpustarttimestamp
          instructions
      3. Run application
          matrixMul
      4. View profiler output, FILE "profile.csv":
          header lines: # CUDA_PROFILE_LOG_VERSION, # CUDA_DEVICE 0 GeForce 8800 GT, # CUDA_PROFILE_CSV 1, # TIMESTAMPFACTOR
          columns: gpustarttimestamp,method,gputime,cputime,occupancy,instructions
          rows: memcpyHtoD, memcpyHtoD, the matrixMul kernel, memcpyDtoH (numeric values not recoverable)
  • 93. CUDA Visual Profiler - Overview
      • Performance analysis tool to fine tune CUDA applications
      • Supported on Linux/Windows/Mac platforms
      • Functionality:
          • Execute a CUDA application and collect profiling data
          • Multiple application runs to collect data for all hardware performance counters
          • Profiling data for all kernels and memory transfers
          • Analyze profiling data
  • 94. CUDA Visual Profiler data for kernels
  • 95. CUDA Visual Profiler computed data for kernels
      • Instruction throughput: ratio of achieved instruction rate to peak single issue instruction rate
      • Global memory read throughput (Gigabytes/second)
      • Global memory write throughput (Gigabytes/second)
      • Overall global memory access throughput (Gigabytes/second)
      • Global memory load efficiency
      • Global memory store efficiency
  • 96. CUDA Visual Profiler data for memory transfers
      • Memory transfer type and direction (D=Device, H=Host, A=cuArray), e.g. H to D: Host to Device
      • Synchronous / Asynchronous
      • Memory transfer size, in bytes
      • Stream ID
  • 97. CUDA Visual Profiler data analysis views
      • Views: Summary table, Kernel table, Memcopy table, Summary plot, GPU Time Height plot, GPU Time Width plot, Profiler counter plot, Profiler table column plot, Multi-device plot, Multi-stream plot
      • Analyze profiler counters
      • Analyze kernel occupancy
  • 98. CUDA Visual Profiler Misc.
      • Multiple sessions
      • Compare views for different sessions
      • Comparison Summary plot
      • Profiler projects save & load
      • Import/Export profiler data (.CSV format)
  • 99. Outline • CUDA Language & APIs (overview) • Threading/Execution (cont’d) • Memory/Communication (cont’d) • Tools • Libraries
  • 100. CUBLAS
  • 101. CUBLAS
      • CUDA accelerated BLAS (Basic Linear Algebra Subprograms)
      • Create matrix and vector objects in GPU memory space
      • Fill objects with data
      • Call sequence of CUBLAS functions
      • Retrieve data from GPU (optionally)
      • Example:
          while( i++ < max_iter && deltanew > stop_tol ) {
              cublasSgemv(...);
              float alpha = deltanew / cublasSdot(N, d_d, 1, d_y, 1);
              cublasSaxpy(N, alpha, d_d, 1, d_x, 1);
              // every 50 iterations, restart residual
              if( i % 50 == 0 ) {
                  cublasSgemv(...);
                  cublasScopy(N, d_b, 1, d_r, 1);
                  cublasSaxpy(N, -1.0, d_y, 1, d_r, 1);
              } else
                  cublasSaxpy(N, -alpha, d_y, 1, d_r, 1);
              ...
      © NVIDIA Corporation 2009
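The create, fill, call, retrieve workflow listed on this slide can be sketched end to end. Note the assumption: this sketch uses the handle-based `cublas_v2.h` interface found in later CUDA toolkits, whose call signatures differ from the handle-free calls shown on the slide; the vector contents are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 4;
    float h_x[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_y[n] = {10.0f, 20.0f, 30.0f, 40.0f};

    // Create vector objects in GPU memory space.
    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        printf("cublasCreate failed\n");
        return 1;
    }

    // Fill objects with data.
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    // Call a sequence of CUBLAS functions: y = alpha * x + y, then dot(x, y).
    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    float dot = 0.0f;
    cublasSdot(handle, n, d_x, 1, d_y, 1, &dot);

    // Retrieve data from the GPU.
    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);
    printf("y[0] = %f, dot = %f\n", h_y[0], dot);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Building this would typically require linking against the cuBLAS library (for example `-lcublas` with nvcc).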
  • 102. CUBLAS Features
      • Single precision data: Level 1 (vector-vector, O(N)), Level 2 (matrix-vector, O(N^2)), Level 3 (matrix-matrix, O(N^3))
      • Complex single precision data: Level 1, CGEMM
      • Double precision data:
          • Level 1: DASUM, DAXPY, DCOPY, DDOT, DNRM2, DROT, DROTM, DSCAL, DSWAP, ISAMAX, IDAMIN
          • Level 2: DGEMV, DGER, DSYR, DTRSV
          • Level 3: ZGEMM, DGEMM, DTRSM, DTRMM, DSYMM, DSYRK, DSYR2K
  • 103. CUBLAS Performance: CPU vs GPU [chart]
      • CUBLAS: CUDA 2.3, Tesla C1060
      • MKL 10.0.3: Intel Core2 Extreme, 3.00GHz
  • 104. [chart: speedup vs. MKL against matrix dimensions (NxN); series: MKL and two CUBLAS versions]
      • Up to 2x average speedup over CUBLAS 3.1
      • Less variation in performance for different dimensions vs. 3.1
      • Average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}
      • CUFFT 3.2 & 3.1 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
  • 105. CULA
  • 106. CULA (LAPACK for heterogeneous systems)
      • GPU Accelerated Linear Algebra
          • Dense linear algebra
          • C/C++ & FORTRAN
          • 150+ Routines
      • MATLAB Interface
          • 15+ functions
          • Up to 10x speedup
      • Partnership: developed in partnership with NVIDIA
      • Supercomputer Speeds: Performance 7x of
  • 107. CULA - Performance Supercomputing Speeds This graph shows the relative speed of many CULA functions when compared to (Fermi) and an Intel Core i7 860. More at www.culatools.com
  • 109. Sparse Matrix Performance: CPU vs. GPU
      • Multiplication of a sparse matrix by multiple vectors
      • [chart: average speedup across S,D,C,Z against MKL 10.2, scale 0x to 35x; series: "Non-transposed", "Transposed"]
      • CUSPARSE 3.2 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
  • 110. CUFFT
  • 111. CUFFT
      • CUFFT is the CUDA FFT library
      • Computes parallel FFT on an NVIDIA GPU
      • Plan contains information about optimal configuration for a given transform
      • Plans can be persisted to prevent recalculation
      • Good fit for CUFFT because different kinds of FFTs require different thread/block/grid configurations
  • 112. CUFFT Features
      • 1D, 2D and 3D transforms of complex and real-valued data
      • Batched execution for doing multiple 1D transforms in parallel
      • 1D transform size up to 8M elements
      • 2D and 3D transform sizes in the range [2,16384]
      • In-place and out-of-place transforms for real and complex data
  • 113. CUFFT Example: Complex 2D transform
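The slide's listing did not survive extraction, so the following is a minimal sketch of what a complex 2D transform with CUFFT might look like, not the original example. The sizes `NX` and `NY` and the zero-filled input are assumptions; the plan, execute and destroy calls are the standard CUFFT entry points.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int NX = 256, NY = 128;  // hypothetical transform size

    cufftComplex *d_data = nullptr;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * NX * NY);
    cudaMemset(d_data, 0, sizeof(cufftComplex) * NX * NY);

    // The plan holds the configuration chosen for this transform size and type.
    cufftHandle plan;
    if (cufftPlan2d(&plan, NX, NY, CUFFT_C2C) != CUFFT_SUCCESS) {
        printf("cufftPlan2d failed\n");
        return 1;
    }

    // In-place forward transform followed by the inverse.
    // CUFFT transforms are unnormalized, so the round trip scales data by NX * NY.
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    printf("done\n");
    return 0;
}
```

The same plan can be reused for every transform of this size and type, which is the point of creating it once up front.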
  • 114. CUFFT Performance: CPU vs GPU [chart]
      • CUFFT 2.3: NVIDIA Tesla C1060 GPU
      • MKL 10.1r1: Quad-Core Intel Core i7 (Nehalem) 3.2GHz
  • 115. CUFFT 3.2: Improved Radix-3, -5, -7
      • [two charts: Radix-3 (SP, ECC off) and Radix-3 (DP, ECC off); series: two CUFFT versions and MKL]
      • Radix-5, -7 and mixed radix improvements not shown
      • CUFFT 3.2 & 3.1 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
  • 116. CUDPP
  • 117. CUDPP
      • Library of high-performance parallel primitives for CUDA
      • M. Harris (NVIDIA), J. Owens (UCD), S. Sengupta (UCD), A. Davidson (UCD), S. Tzeng (UCD), Y. Zhang (UCD)
      • Algorithms
          • cudppScan, cudppSegmentedScan, cudppReduce
          • cudppSort, cudppRand, cudppSparseMatrixVectorMultiply
      • Additional algorithms in progress
          • Graphs, more sorting, trees, hashing, autotuning
  • 118. CUDPP Example
          CUDPPConfiguration config = { CUDPP_SCAN, CUDPP_ADD, CUDPP_FLOAT, CUDPP_OPTION_FORWARD };
          CUDPPHandle plan;
          CUDPPResult result = cudppPlan(&plan, config, numElements, 1, 0);
          cudppScan(plan, d_odata, d_idata, numElements);
  • 119. More?
  • 120. Thrust
  • 121. Objectives
      • Programmer productivity
          • Rapidly develop complex applications
          • Leverage parallel primitives
      • Encourage generic programming
          • Don’t reinvent the wheel
          • E.g. one reduction to rule them all
      • High performance
          • With minimal programmer effort
      • Interoperability
          • Integrates with CUDA C/C++ code
      © 2008 NVIDIA Corporation
  • 122. Thrust
      • C++ template library for CUDA
      • Mimics Standard Template Library (STL)
      • Containers
          • thrust::host_vector<T>
          • thrust::device_vector<T>
      • Algorithms
          • thrust::sort()
          • thrust::reduce()
          • thrust::inclusive_scan()
          • Etc.
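As a hedged illustration of the containers and algorithms named on this slide, the sketch below fills a `host_vector`, copies it to a `device_vector`, and runs `sort`, `reduce` and `inclusive_scan` on the device. The data values and vector length are arbitrary choices made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

int main()
{
    // Fill a host container with arbitrary values.
    thrust::host_vector<int> h(8);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 100;

    // Assignment between containers copies the data to the device.
    thrust::device_vector<int> d = h;

    // The algorithms run on the device because the iterators are device iterators.
    thrust::sort(d.begin(), d.end());
    int sum = thrust::reduce(d.begin(), d.end(), 0);

    thrust::device_vector<int> prefix(d.size());
    thrust::inclusive_scan(d.begin(), d.end(), prefix.begin());

    // Copy the sorted values back to the host for printing.
    h = d;
    int lastPrefix = prefix.back();
    printf("min = %d, max = %d, sum = %d, last prefix sum = %d\n",
           h[0], h[h.size() - 1], sum, lastPrefix);
    return 0;
}
```

After an inclusive scan the last prefix value equals the total, so `sum` and `lastPrefix` should agree.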
  • 124. Thrust on Google Code
      • Quick Start Guide
      • Examples
      • Documentation
      • Mailing List (thrust-users)
  • 125. [title not recoverable]
      • Extensively tested: over 100 unit tests
      • Open Source: Permissive License (Apache v2)
      • Active community: 200+ members on cusp-users
  • 126. More?
  • 127. PyCUDA
  • 128. PyCUDA
      • 3rd party open source, written by Andreas Klöckner
      • Exposes all of CUDA via Python bindings
      • Compiles CUDA on the fly: presents CUDA as an interpreted language
      • Integration with numpy
      • Handles memory management, resource allocation
      • CUDA programs are Python strings
      • Metaprogramming: modify source code on-the-fly, like a really complex pre-processor
      • http://mathema.tician.de/software/pycuda
  • 129. PyCUDA Example
  • 130. More?
  • 131. CURAND
  • 132. RNG Performance: CPU vs. GPU
      • Generating 100K Sobol' Samples
      • [chart: CURAND 3.2 vs. MKL 10.2, scale 0x to 25x, for SP and DP with Uniform and Normal distributions]
      • CURAND 3.2 on NVIDIA Tesla C2050 GPU; MKL 10.2.3.029 on Quad-Core Intel Core i7 (Nehalem)
  • 134. OpenVIDIA
      • Open source, supported by NVIDIA
      • Computer Vision Workbench (CVWB)
      • GPU imaging & computer vision
      • Demonstrates most commonly used image processing primitives on CUDA
      • Demos, code & tutorials/information
      • http://openvidia.sourceforge.net
  • 136. References
      • CUDA C Programming Guide
      • CUDA C Best Practices Guide
      • CUDA Reference Manual
      • API Reference, PTX ISA 2.2
      • CUDA-GDB User Manual
      • Visual Profiler Manual
      • User Guides: CUBLAS, CUFFT, CUSPARSE, CURAND
      • http://developer.nvidia.com/object/gpucomputing.html
  • 137. one more thing or two...
  • 138. Life/Code Hacking #2.x Speed {listen,read,writ}ing accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 139. Life/Code Hacking #2.1 Speed listening
  • 140. Life/Code Hacking #2.1 Speed listening
      • Step 1: Collect
          • online videos, tutorials, podcasts, etc.
          • audiobooks
          • youtube-dl, get_flash_videos, jDownloader, ffmpeg, mplayer, etc.
          • etc.
  • 141. Life/Code Hacking #2.1 Speed listening
      • Step 2: Accelerate (time-stretch)
          • VLC (Playback > Faster)
          • sox $f{,.1.8X.mp3} tempo 1.8 50
          • iPod ? mp3splt -t 5.00 -o small-@n large.mp3
  • 142. Life/Code Hacking #2.1 Speed listening
      • Step 3: chill or do more ;-)
  • 143. Demo
  • 144. CO ME