Introduction to SPU Optimizations
                          Part 1: Assembly Instructions



                              Pål-Kristian Engstad
                         pal_engstad@naughtydog.com




                                      March 5, 2010



Introduction




    These slides are used internally at Naughty Dog to introduce new programmers
    to our SPU programming methods. Due to popular interest, we are now
    making them public. Note that some of the tools we use have not been
    released to the public, but there are many alternatives out there that
    do similar things.

    The first set of slides introduces most of the SPU assembly instructions. Please
    read these carefully before reading the second set. Those slides go through a
    made-up example showing how one can improve performance drastically by
    knowing the hardware and by employing a technique called software
    pipelining.




SPU programming is Cool




   In these slides, we will go through all of the assembly instructions that exist on
   the SPU, giving you a quick introduction to the power of the SPUs.

       Each SPU has 256 kB of local memory.
       This local memory can be thought of as 1 cycle memory.
       Programs and data exist in the same local memory space.
       There are no memory protections in local memory!
       The only way to access external memory is through DMA.
       There is a significant delay between the time a DMA request is queued and
       the time it finishes.




SPU Execution Environment




       The SPU has 128 general purpose 128-bit wide registers.
       You can think of these as
           2 doubles (64-bit floating-point values),
           4 floats (32-bit floating-point values),
           4 words (32-bit integer values),
           8 half-words (16-bit integer values), or
           16 bytes (8-bit integer values).
       An SPU executes an even and an odd instruction each cycle.
           Even instructions are mostly arithmetic instructions, whereas
           the odd ones are load/store instructions, shuffles, branches and other
           special instructions.




Instruction Classes

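    The instruction set can be divided into classes, where the instructions in the
    same class issue on the same pipeline (even or odd) and have the same latency
    (how long it takes for the result to be ready). In the list below, {e6} means
    "even pipeline, 6-cycle latency" and {o4} means "odd pipeline, 4-cycle latency":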

        (SP) Single Precision {e6}
        (FX) FiXed {e2}
        (WS) Word Shift {e4}
        (LS) Load/Store {o6}
        (SH) SHuffle {o4}
        (FI) Fp Integer {e7}
        (BO) Byte Operations {e4}
        (BR) BRanch {o-}
        (HB) Hint Branch {o15}
        (CH) CHannel Operations {o6}
        (DP) Double Precision {e13}


Single Precision Floating Point Class (SP) [Even:6]

    The SP class of instructions has a latency of 6 cycles and a throughput of 1
    cycle. These are all even instructions.

    fa        a,   b,    c    ; a.f[n]             =   b.f[n] +      c.f[n]
    fs        a,   b,    c    ; a.f[n]             =   b.f[n] -      c.f[n]
    fm        a,   b,    c    ; a.f[n]             =   b.f[n] *      c.f[n]
    fma       a,   b,    c, d ; a.f[n]             =   b.f[n] *      c.f[n] + d.f[n]
    fms       a,   b,    c, d ; a.f[n]             =   b.f[n] *      c.f[n] - d.f[n]
    fnms      a,   b,    c, d ; a.f[n]             =   -(b.f[n]      * c.f[n] - d.f[n])

    The syntax here indicates that for each of the 4 32-bit floating point values in
    the register, the operation in the comment is executed.




Single Precision Floating Point Class (SP)



        No broadcast versions.
        No dot-products or cross-products.
        No “fnma” instruction.

    Example:

    If the registers r1 and r2 contain
            r1 = ( 1.0, 2.0, 3.0, 4.0 ),
            r2 = ( 0.0, -2.0, 1.0, 4.0 ),
    then after
            fa r0, r1, r2 ; r0 = r1 + r2
    r0 contains
            r0 = ( 1.0, 0.0, 4.0, 8.0 ).
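The lane-wise semantics of the SP class can be sketched in Python. This is only a rough model under stated assumptions: a register is a list of four floats, and the hypothetical helper f32 rounds to IEEE single precision, which is not exactly how the SPU rounds.

```python
import struct

def f32(x):
    """Round a Python float to single precision (hypothetical helper, IEEE rounding assumed)."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def fa(b, c):
    """Model of fa a, b, c: lane-wise add."""
    return [f32(x + y) for x, y in zip(b, c)]

def fnms(b, c, d):
    """Model of fnms a, b, c, d: lane-wise -(b * c - d), i.e. d - b * c."""
    return [f32(z - x * y) for x, y, z in zip(b, c, d)]

r1 = [1.0, 2.0, 3.0, 4.0]
r2 = [0.0, -2.0, 1.0, 4.0]
r0 = fa(r1, r2)
print(r0)  # [1.0, 0.0, 4.0, 8.0]
```

Each lane is computed independently of the others, which is why there is no cheap way to combine values across lanes without a shuffle.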




FiXed precision Class (FX) [Even:2]




    The FX class of instructions all have latency of just 2 cycles and all have a
    throughput of 1 cycle. These are even instructions.

    There are quite a few of them, and we can divide them further into:

        Integer Arithmetic Operations.
        Immediate Load Operations.
        Comparison Operations.
        Select Bit Operation.
        Logical Bit Operations.
        Extensions and Misc Operations.




FX: Arithmetic Operations

    The integer arithmetic operations “add” and “subtract from” work on either 4
    words at a time or 8 half-words at a time.

    ah        i,   j,    k       ;   i.h[n]       =   j.h[n] + k.h[n]
    ahi       i,   j,    s10     ;   i.h[n]       =   j.h[n] + ext(s10)
    a         i,   j,    k       ;   i.w[n]       =   j.w[n] + k.w[n]
    ai        i,   j,    s10     ;   i.w[n]       =   j.w[n] + ext(s10)
    sfh       i,   j,    k       ;   i.h[n]       =   -j.h[n] + k.h[n]
    sfhi      i,   j,    s10     ;   i.h[n]       =   -j.h[n] + ext(s10)
    sf        i,   j,    k       ;   i.w[n]       =   -j.w[n] + k.w[n]
    sfi       i,   j,    s10     ;   i.w[n]       =   -j.w[n] + ext(s10)




FX: Arithmetic Operations & Examples

   Notice the “subtract from” semantics: the first source operand is subtracted
   from the second. This differs from the floating-point subtract (fs). We think
   this choice was made mainly because it makes the immediate forms more powerful.

   ai     i,   i,   1    ;   i   =   i + 1, for each word in i
   ahi    i,   i,   -1   ;   i   =   i - 1, for each half-word in i
   sfi    i,   i,   0    ;   i   =   (-i), for each word in i
   sfhi   x,   x,   1    ;   x   =   1 - x, for each half-word in x
   sf     z,   y,   x    ;   z   =   x - y, for each word in z
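To make the operand order concrete, here is a small Python sketch of sf and sfi. It assumes a register is modelled as a list of four unsigned 32-bit words and that results wrap modulo 2^32; the function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF

def sf(j, k):
    """Model of sf i, j, k: i.w[n] = k.w[n] - j.w[n], wrapping to 32 bits."""
    return [(b - a) & MASK32 for a, b in zip(j, k)]

def sfi(j, s10):
    """Model of sfi i, j, s10: i.w[n] = s10 - j.w[n], wrapping to 32 bits."""
    assert -512 <= s10 <= 511, "s10 must fit in 10 signed bits"
    return [(s10 - a) & MASK32 for a in j]

x = [5, 6, 7, 8]
y = [1, 2, 3, 4]
z = sf(y, x)      # z = x - y
neg = sfi(x, 0)   # neg = -x (two's complement)
print(z, [hex(w) for w in neg])
```

With a plain "subtract" the immediate form could only compute i - constant, which ai already covers with a negative constant; "subtract from" gives constant - i instead.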




FX: Immediate Loads

   The SPU has some instructions that enable us to quickly set up register
   values. These immediate loads are also 2-cycle FX instructions:

   il        i,   s16     ;   i.w[n]       = ext(s16)
   ilh       i,   u16     ;   i.h[n]       = u16
   ila       i,   u18     ;   i.w[n]       = u18
   ilhu      i,   u16     ;   i.w[n]       = u16 << 16
   iohl      i,   u16     ;   i.w[n]       |= u16

   Example:

   ilhu ones, 0x3f80 ; ones = (1.0, 1.0, 1.0, 1.0)
   ila magic, 0x10203; magic = (0x00010203_00010203_00010203_00010203)
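As a rough illustration, the immediate loads can be modelled in Python, again assuming a register is a list of four 32-bit words. The ilhu/iohl pair shown at the end is a common way to build an arbitrary 32-bit constant; the constant chosen here (the bit pattern of pi as a float) is just an example and does not come from the slides.

```python
import struct

def ilhu(u16):
    """Model of ilhu i, u16: each word = u16 << 16."""
    return [(u16 & 0xFFFF) << 16] * 4

def iohl(i, u16):
    """Model of iohl i, u16: OR u16 into the low half of each word."""
    return [w | (u16 & 0xFFFF) for w in i]

def ila(u18):
    """Model of ila i, u18: each word = 18-bit unsigned immediate."""
    return [u18 & 0x3FFFF] * 4

def as_float(word):
    """Reinterpret a 32-bit pattern as a single-precision float (helper for this sketch)."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

ones = ilhu(0x3F80)                    # 0x3f800000 is 1.0
magic = ila(0x10203)                   # 0x00010203 in every word
pi_bits = iohl(ilhu(0x4049), 0x0FDB)   # 0x40490fdb, two instructions
print([as_float(w) for w in ones], hex(magic[0]), as_float(pi_bits[0]))
```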




FX: Logical Bit Operations

    These instructions work on each of the 128 bits in the registers.

    and       i,   j,    k       ;   i   =   j & k
    nand      i,   j,    k       ;   i   = ~(j & k)
    andc      i,   j,    k       ;   i   =   j & ~k
    or        i,   j,    k       ;   i   =   j | k
    nor       i,   j,    k       ;   i   = ~(j | k)
    orc       i,   j,    k       ;   i   =   j | ~k
    xor       i,   j,    k       ;   i   =   j ^ k
    eqv       i,   j,    k       ;   i   = ~(j ^ k)




FX: Logical Operations w/immediates


    andbi      i, j, u8 ; i.b[n] = j.b[n] & u8
    andhi      i, j, s10 ; i.h[n] = j.h[n] & ext(s10)
    andi       i, j, s10 ; i.w[n] = j.w[n] & ext(s10)

    orbi       i, j, u8 ; i.b[n] = j.b[n] | u8
    orhi       i, j, s10 ; i.h[n] = j.h[n] | ext(s10)
    ori        i, j, s10 ; i.w[n] = j.w[n] | ext(s10)

    xorbi      i, j, u8 ; i.b[n] = j.b[n] ^ u8
    xorhi      i, j, s10 ; i.h[n] = j.h[n] ^ ext(s10)
    xori       i, j, s10 ; i.w[n] = j.w[n] ^ ext(s10)




FX: Comparisons (Bytes)


   ceqb      i,    j,   k       ;   i.b[n]       =   (j.b[n]    ==    k.b[n])        ?   TRUE    :   FALSE
   ceqbi     i,    j,   su8     ;   i.b[n]       =   (j.b[n]    ==    su8)           ?   TRUE    :   FALSE
   cgtb      i,    j,   k       ;   i.b[n]       =   (j.b[n]    >     k.b[n])        ?   TRUE    :   FALSE (s)
   cgtbi     i,    j,   su8     ;   i.b[n]       =   (j.b[n]    >     su8)           ?   TRUE    :   FALSE (s)
   clgtb     i,    j,   k       ;   i.b[n]       =   (j.b[n]    >     k.b[n])        ?   TRUE    :   FALSE (u)
   clgtbi    i,    j,   su8     ;   i.b[n]       =   (j.b[n]    >     su8)           ?   TRUE    :   FALSE (u)

   TRUE = 0xFF
   FALSE = 0x00

   (s) means “signed” and (u) means “unsigned” compares.




FX: Comparisons (Halves)


   ceqh      i,    j,   k       ;   i.h[n]       =   (j.h[n]    ==    k.h[n])            ?   TRUE   :   FALSE
   ceqhi     i,    j,   s10     ;   i.h[n]       =   (j.h[n]    ==    ext(s10))          ?   TRUE   :   FALSE
   cgth      i,    j,   k       ;   i.h[n]       =   (j.h[n]    >     k.h[n])            ?   TRUE   :   FALSE   (s)
   cgthi     i,    j,   s10     ;   i.h[n]       =   (j.h[n]    >     ext(s10))          ?   TRUE   :   FALSE   (s)
   clgth     i,    j,   k       ;   i.h[n]       =   (j.h[n]    >     k.h[n])            ?   TRUE   :   FALSE   (u)
   clgthi    i,    j,   s10     ;   i.h[n]       =   (j.h[n]    >     ext(s10))          ?   TRUE   :   FALSE   (u)

   TRUE = 0xFFFF
   FALSE = 0x0000




FX: Comparisons (Words)


   ceq      i,    j,   k       ;   i.w[n]       =   (j.w[n]    ==    k.w[n])            ?   TRUE   :   FALSE
   ceqi     i,    j,   s10     ;   i.w[n]       =   (j.w[n]    ==    ext(s10))          ?   TRUE   :   FALSE
   cgt      i,    j,   k       ;   i.w[n]       =   (j.w[n]    >     k.w[n])            ?   TRUE   :   FALSE   (s)
   cgti     i,    j,   s10     ;   i.w[n]       =   (j.w[n]    >     ext(s10))          ?   TRUE   :   FALSE   (s)
   clgt     i,    j,   k       ;   i.w[n]       =   (j.w[n]    >     k.w[n])            ?   TRUE   :   FALSE   (u)
   clgti    i,    j,   s10     ;   i.w[n]       =   (j.w[n]    >     ext(s10))          ?   TRUE   :   FALSE   (u)

   TRUE = 0xFFFF_FFFF
   FALSE = 0x0000_0000




FX: Comparisons (Floats)


    fceq     i,    b,   c       ;   i.w[n]       =   (b.f[n] == c.f[n])                          ?   TRUE   :   FALSE
    fcmeq    i,    b,   c       ;   i.w[n]       =   (abs(b.f[n]) == abs(c.f[n]))                ?   TRUE   :   FALSE
    fcgt     i,    b,   c       ;   i.w[n]       =   (b.f[n] > c.f[n])                           ?   TRUE   :   FALSE
    fcmgt    i,    b,   c       ;   i.w[n]       =   (abs(b.f[n]) > abs(c.f[n]))                 ?   TRUE   :   FALSE

    TRUE = 0xFFFF_FFFF
    FALSE = 0x0000_0000

    Note: All zeros are equal, e.g.: 0.0 == -0.0.




FX: Select Bits

    This very important operation selects bits from j and k depending on the bits
    in the l register. It fits well with the comparison instructions given previously.

    selb i, j, k, l ; for each bit: i = (l == 0) ? j : k

    Notice that where a bit in l is 0, the result takes the corresponding bit from j;
    otherwise it takes the bit from k.

    Example: SIMD min/max

    fcgt mask, a, b      ; mask is all 1’s if a > b
    selb max, b, a, mask ; select a if a > b, else b
    selb min, a, b, mask ; select b if a > b, else a
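The min/max idiom can be checked with a small Python model. This is a sketch under assumptions: registers are lists of four 32-bit patterns, and the helpers fbits and bfloat (not part of any SPU tool) convert between floats and their bit patterns.

```python
import struct

def fbits(x):
    """Float -> 32-bit pattern (helper for this sketch)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bfloat(w):
    """32-bit pattern -> float (helper for this sketch)."""
    return struct.unpack(">f", struct.pack(">I", w))[0]

def fcgt(a, b):
    """Model of fcgt: all ones in a word where a > b, else all zeros."""
    return [0xFFFFFFFF if bfloat(x) > bfloat(y) else 0 for x, y in zip(a, b)]

def selb(j, k, l):
    """Model of selb: take the bit from j where l is 0, from k where l is 1."""
    return [((x & ~m) | (y & m)) & 0xFFFFFFFF for x, y, m in zip(j, k, l)]

a = [fbits(v) for v in (1.0, 5.0, -2.0, 0.5)]
b = [fbits(v) for v in (3.0, 4.0, -3.0, 0.5)]
mask = fcgt(a, b)
vmax = [bfloat(w) for w in selb(b, a, mask)]
vmin = [bfloat(w) for w in selb(a, b, mask)]
print(vmax, vmin)
```

Because selb works bit by bit, the same instruction serves bytes, half-words, words and floats; only the comparison that produced the mask differs.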




FX: Misc


   generate borrow bit
   bg   i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n])
                ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
   generate borrow bit with borrow
   bgx  i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
                ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
   generate carry bit
   cg   i, j, k ; i.w[n] = (j.w[n] + k.w[n]) > 0xffffffff ? 1 : 0
   generate carry bit with carry
   cgx  i, j, k ; tmp.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
                ; i.w[n] = tmp.w[n] > 0xffffffff ? 1 : 0




FX: Misc


   add with carry bit
   addx i, j, k ; i.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
   subtract with borrow bit
   sfx i, j, k ; i.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)

   sign-extend byte to half-word
   xsbh i, j    ; i.h[n] = ext(j.h[n] & 0xff)
   sign-extend half-word to word
   xshw i, j    ; i.w[n] = ext(j.w[n] & 0xffff)
   sign-extend word to double-word
   xswd i, j    ; i.d[n] = ext(j.d[n] & 0xffffffff)

   count leading zeros
   clz i, j     ; i.w[n] = leadingZeroCount(j.w[n])
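The carry instructions exist so that wider additions can be built from 32-bit ones. Below is a hedged Python sketch of a 64-bit add made from cg and addx. The register model (a list of four words holding two big-endian 64-bit values) and the move_carries helper are assumptions of this sketch; on real hardware that step would be a shuffle that moves each low-word carry up to its high word.

```python
MASK32 = 0xFFFFFFFF

def cg(j, k):
    """Model of cg: 1 where the 32-bit add overflows, else 0."""
    return [1 if a + b > MASK32 else 0 for a, b in zip(j, k)]

def addx(i, j, k):
    """Model of addx: j + k + (i & 1), wrapping to 32 bits."""
    return [(a + b + (c & 1)) & MASK32 for c, a, b in zip(i, j, k)]

def move_carries(c):
    """Stand-in for a shufb: carry of each low word goes to its high word."""
    return [c[1], 0, c[3], 0]

def add64(j, k):
    carry = move_carries(cg(j, k))
    return addx(carry, j, k)

def to_words(x, y):
    """Pack two 64-bit integers into four words (helper for this sketch)."""
    return [x >> 32, x & MASK32, y >> 32, y & MASK32]

def from_words(w):
    return ((w[0] << 32) | w[1], (w[2] << 32) | w[3])

a = to_words(0x00000001_FFFFFFFF, 0xFFFFFFFF_FFFFFFFF)
b = to_words(0x00000000_00000001, 0x00000000_00000001)
total = from_words(add64(a, b))
print([hex(v) for v in total])
```

The bg/sfx pair follows the same pattern for subtraction.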




Word Shift Class (WS) [Even:4]

    The WS class of instructions has a latency of 4 cycles and a throughput of 1
    cycle. These are all even instructions.

    shlh       i,   j,    k        ;   i.h[n]       =   j.h[n]    <<   (   k.h[n]        &   0x1f   )
    shlhi      i,   j,    imm      ;   i.h[n]       =   j.h[n]    <<   (   imm           &   0x1f   )
    shl        i,   j,    k        ;   i.w[n]       =   j.w[n]    <<   (   k.w[n]        &   0x3f   )
    shli       i,   j,    imm      ;   i.w[n]       =   j.w[n]    <<   (   imm           &   0x3f   )

    Notice that in the shlh and shl versions there is an independent shift amount
    for each element, i.e., this is truly SIMD!




Example


   ; Assume r0      = ( 1, 2, 4, 8 )
   ;        r1      = ( 1, 2, 3, 4 )
   shl r2, r0,      r1
   ; Now r2 =       ( 1<<1, 2<<2, 4<<3, 8<<4 )
   ;        =       ( 2, 8, 32, 128 )
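The example can be reproduced with a short Python model of shl, assuming a register is a list of four 32-bit words. It also shows what the 0x3f mask implies: counts from 32 to 63 shift everything out, and a count of 64 wraps back to 0.

```python
def shl(j, k):
    """Model of shl i, j, k: per-word left shift by (k.w[n] & 0x3f)."""
    out = []
    for value, count in zip(j, k):
        count &= 0x3F
        out.append((value << count) & 0xFFFFFFFF)
    return out

r0 = [1, 2, 4, 8]
r1 = [1, 2, 3, 4]
r2 = shl(r0, r1)
print(r2)                                # [2, 8, 32, 128]
print(shl([1, 1, 1, 1], [31, 32, 63, 64]))
```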




WS: Rotate left logical


    roth       i,   j,    k        ;   i.h[n]       =   j.h[n]    <^   (   k.h[n]        &   0x0f   )
    rothi      i,   j,    imm      ;   i.h[n]       =   j.h[n]    <^   (   imm           &   0x0f   )
    rot        i,   j,    k        ;   i.w[n]       =   j.w[n]    <^   (   k.w[n]        &   0x1f   )
    roti       i,   j,    imm      ;   i.w[n]       =   j.w[n]    <^   (   imm           &   0x1f   )

    <^ is my idiosyncratic symbol for rotate.




WS: Shift right logical


    rothm       i,   j,    k        ;   i.h[n]       =   j.h[n]    >>   (   -k.h[n]         &     0x1f   )
    rothmi      i,   j,    imm      ;   i.h[n]       =   j.h[n]    >>   (   -imm            &     0x1f   )
    rotm        i,   j,    k        ;   i.w[n]       =   j.w[n]    >>   (   -k.w[n]         &     0x3f   )
    rotmi       i,   j,    imm      ;   i.w[n]       =   j.w[n]    >>   (   -imm            &     0x3f   )

    Notice here that the shift amounts need to be negative in order to produce a
    proper shift. This is because this is actually a rotate left and then mask
    operation.
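The negative shift amounts are easy to get wrong, so here is a minimal Python model of rotm that follows the formula above. It assumes a register is a list of four 32-bit words and that the shift counts are given as signed Python integers.

```python
def rotm(j, k):
    """Model of rotm i, j, k: logical right shift by ((-k.w[n]) & 0x3f)."""
    out = []
    for value, count in zip(j, k):
        shift = (-count) & 0x3F
        out.append((value & 0xFFFFFFFF) >> shift)
    return out

x = [0x80000000] * 4
print([hex(w) for w in rotm(x, [-1, -4, -31, -32])])
print(rotm([0x80000000], [1]))   # a positive count shifts by 63, giving 0
```

In practice this means a variable right shift needs one extra instruction (for example sfi with 0) to negate the count first.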




WS: Shift right arithmetic


    rotmah    i,   j,   k        ;   i.h[n]       =   j.h[n]    >>   (   -k.h[n]         &     0x1f   )
    rotmahi   i,   j,   imm      ;   i.h[n]       =   j.h[n]    >>   (   -imm            &     0x1f   )
    rotma     i,   j,   k        ;   i.w[n]       =   j.w[n]    >>   (   -k.w[n]         &     0x3f   )
    rotmai    i,   j,   imm      ;   i.w[n]       =   j.w[n]    >>   (   -imm            &     0x3f   )




Load/Store Class (LS) [Odd:6]



    The load/store operations are odd instructions that work on the 256 kB local
    memory. They have a latency of 6 cycles, but the hardware has short-cuts in
    place so that you can read a written value immediately after the store. Do note:

        Memory wraps around, so you can never access memory outside the local
        store (LS).
        You can only load and store a whole quadword, so if you need to modify a
        part, you need to load the quadword value, merge in the modified part
        into the value and store the whole quadword back.
        Addresses are in units of bytes, unlike the VUs on the PS2.
        The load/store operations will use the value in the preferred word of the
        address register, i.e.: the first word.




LS: Loads


   lqa i, label18 ;             addr = label18
                  ;             range = 256kb (or +/- 128kb)
   lqd i, qoff(j) ;             addr = qoff * 16 + j.w[0]
                  ;             qoff is 10 bit signed, addr range = +/-8kb.
   lqr i, label14 ;             addr = ext(label14) + pc
                  ;             label14 range = +/- 8kb.
   lqx i, j, k    ;             addr = j.w[0] + k.w[0]




LS: Stores


    stqa i, label18 ;              addr = label18
                    ;              range = 256kb (or +/- 128kb)
    stqd i, qoff(j) ;              addr = qoff * 16 + j.w[0]
                    ;              qoff is 10 bit signed, addr range = +/-8kb.
    stqr i, label14 ;              addr = ext(label14) + pc
                    ;              label14 range = +/- 8kb.
    stqx i, j, k    ;              addr = j.w[0] + k.w[0]




Shuffle Class (SH) [Odd:4]




   The shuffle operations all have a 4-cycle latency and they are odd instructions.
   Most of the instructions in this class deal with the whole quadword.

   We can divide the SH class into:

       The Shuffle Bytes Instruction.
       Quadword left-shifts, rotates and right-shifts.
       Creation of Shuffle Masks.
       Form Select Instructions.
       Gather Bit Instructions.
       Reciprocal Estimate Instructions.




SH: Shuffle Bytes



   The ordering of bytes, half-words and words within the quadword is shown
   below. Notice that this is big-endian, not little-endian:

   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
   +-------+-------+-------+-------+-------+-------+-------+-------+
   |       0       |       1       |       2       |       3       |
   +---------------+---------------+---------------+---------------+

   The shuffle bytes instruction shufb takes three inputs: two source registers r0
   and r1, and a shuffle mask msk. The output register d is found by running the
   following logic for each byte position n of the mask:




SH: Shuffle Bytes




   Let x = msk.b[n], where n goes from 0 to 15:

       if x in 0 .. 0x7f:
            If (x & 0x10) == 0x00, then d.b[n] = r0.b[x & 0x0f].
            If (x & 0x10) == 0x10, then d.b[n] = r1.b[x & 0x0f].
       if x in 0x80 .. 0xbf: d.b[n] = 0x00
       if x in 0xc0 .. 0xdf: d.b[n] = 0xff
       if x in 0xe0 .. 0xff: d.b[n] = 0x80
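The rules above translate almost line for line into Python. This sketch assumes each register is a 16-byte bytes object in the big-endian order shown earlier; the sample values are made up for illustration.

```python
def shufb(r0, r1, msk):
    """Model of shufb d, r0, r1, msk following the rules above."""
    src = bytes(r0) + bytes(r1)      # bytes 0x00..0x0f are r0, 0x10..0x1f are r1
    out = bytearray(16)
    for n, x in enumerate(msk):
        if x < 0x80:
            out[n] = src[x & 0x1F]
        elif x < 0xC0:
            out[n] = 0x00
        elif x < 0xE0:
            out[n] = 0xFF
        else:
            out[n] = 0x80
    return bytes(out)

v = bytes(range(0xA0, 0xB0))         # made-up register contents
w = bytes(range(0xB0, 0xC0))
s_AAAA = bytes([0, 1, 2, 3] * 4)     # broadcast word 0
xs = shufb(v, v, s_AAAA)
mixed = shufb(v, w, bytes([0x80, 0xC0, 0xE0, 0x10] + [0x1F] * 12))
print(xs.hex(), mixed.hex())
```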

   This is very powerful stuff!




SH: Shufb Examples

   Previously, we mentioned that the SPU has no broadcast ability, but with a
   single shufb instruction we can broadcast one word into all words. We can
   create the shuffle masks directly with instructions, or we can simply load
   them using an LS class instruction.

   ila    s_AAAA, 0x10203     ; s_AAAA = 0x00_01_02_03 x 4
                 ; = 0x00010203_00010203_00010203_00010203
   orbi   s_CCCC, s_AAAA, 8   ; s_CCCC = 0x08_09_0a_0b x 4

   Using these masks, we can quickly create registers with all x’s, y’s, z’s or w’s:

   shufb xs, v, v, s_AAAA                      ; xs = (v.x, v.x, v.x, v.x)
   shufb zs, v, v, s_CCCC                      ; zs = (v.z, v.z, v.z, v.z)




SH: .dshuf

    Because the shuffle instruction is so useful, our “frontend” tool supports quick
    creation of shuffle masks. Using the .dshuf directive, we create shuffle masks
    that obey the following rules:

        If the length of the string is 4, we assume word-sized shuffles; if 8,
        half-word-sized; and if 16, byte-sized shuffles.
        Upper-case letters indicate sources from the first input; lower-case ones
        indicate sources from the second input.
        ’0’ indicates zeros, ’X’ ones and ’8’ 0x80’s.

    .dshuf "ABC0"     ; 0x00010203_04050607_08090a0b_80808080
    .dshuf "aX08"     ; 0x10111213_c0c0c0c0_80808080_e0e0e0e0
    .dshuf "aBC0aBC0" ; 0x1011_0203_0405_8080_1011_0203_0405_8080
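Since the frontend tool is not public, here is a hypothetical Python reimplementation of the rules, written only from the description and the three examples above. The function name dshuf and its behaviour on inputs not covered by the rules are assumptions.

```python
def dshuf(pattern):
    """Build a 16-byte shufb mask from a 4-, 8- or 16-character pattern."""
    size = {4: 4, 8: 2, 16: 1}[len(pattern)]          # bytes per element
    special = {"0": 0x80, "X": 0xC0, "8": 0xE0}
    mask = bytearray()
    for ch in pattern:
        if ch in special:
            mask += bytes([special[ch]] * size)
        else:
            base = 0x10 if ch.islower() else 0x00     # second or first input
            first = (ord(ch.upper()) - ord("A")) * size
            mask += bytes(base + first + i for i in range(size))
    return bytes(mask)

print(dshuf("ABC0").hex())
print(dshuf("aX08").hex())
print(dshuf("aBC0aBC0").hex())
```

Generating the mask at assembly time costs nothing at run time; the mask is then loaded once with an LS class instruction and reused.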




SH: Another Shufb Example

   We can create finite state machines, piping input into one end of the
   quadword while spitting out the result at another (such as the preferred
   word). Here’s an example of such a “delay machine”:

   ; in the data section:
   m_bcdA: .dshuf "bcdA"
   ; in the init section:
       lqa s_bcdA, 0(m_bcdA)
   ; in the loop:
       shufb state, input, state, s_bcdA ;                           state.x         =   state.y
                                         ;                           state.y         =   state.z
                                         ;                           state.z         =   state.w
                                         ;                           state.w         =   input.x




SH: Quadword Shift Left

    These instructions take the shift amount from the preferred byte (byte 3) or
    from an immediate value and shift the whole quadword to the left. There are
    versions that shift by a number of bytes as well as by a number of bits. For
    bit shifts, the shift amount is masked to be less than 8.

    SHift Left Quadword by                BYtes
    shlqby   i, j, k   ; i                = j << ((k.b[3] & 0x1f) * 8)
    SHift Left Quadword by                BYtes Immediate
    shlqbyi i, j, imm ; i                 = j << ((imm     & 0x1f) * 8)
    SHift Left Quadword by                BYtes using BIt count
    shlqbybi i, j, k   ; i                = j << (k.b[3] & 0xf8)
    SHift Left Quadword by                BIts
    shlqbi   i, j, k   ; i                = j << (k.b[3] & 0x07)
    SHift Left Quadword by                BIts Immediate
    shlqbii i, j, imm ; i                 = j << (imm     & 0x07)




SH: Quadword Rotate Left

   These follow the same pattern as left shifts:

   ROTate (left) Quadword                by BYtes
   rotqby   i, j, k   ; i                = j <^ ((k.b[3] & 0x0f) * 8)
   ROTate (left) Quadword                by BYtes Immediate
   rotqbyi i, j, imm ; i                 = j <^ ((imm    & 0x0f) * 8)
   ROTate (left) Quadword                by BYtes using BIt count
   rotqbybi i, j, k   ; i                = j <^ (k.b[3] & 0x78)
   ROTate (left) Quadword                by BIts
   rotqbi   i, j, k   ; i                = j <^ (k.b[3] & 0x07)
   ROTate (left) Quadword                by BIts Immediate
   rotqbii i, j, imm ; i                 = j <^ (imm    & 0x07)




SH: Quadword Shift Right

   Ditto for right shifts, though as in the WS class, we call them rotates with
   mask and use negative shift amounts:

   ROTate and Mask          Quadword by           BYtes
   rotmqby   i, j,          k   ; i = j           >> ((-k.b[3] & 0x1f) * 8)
   ROTate and Mask          Quadword by           BYtes Immediate
   rotmqbyi i, j,           imm ; i = j           >> ((-imm    & 0x1f) * 8)
   ROTate and Mask          Quadword by           BYtes using BIt count
   rotmqbybi i, j,          k   ; i = j           >> (-(k.b[3] & 0xf8) & 0xf8) (*)
   ROTate and Mask          Quadword by           BIts
   rotmqbi   i, j,          k   ; i = j           >> (-k.b[3]  & 0x07)
   ROTate and Mask          Quadword by           BIts Immediate
   rotmqbii i, j,           imm ; i = j           >> (-imm    & 0x07)




SH: Form Select Instructions

    These instructions are designed to expand a small number of bits into many bits of ones,
    and they are good for use with the selb operation.

    Form Select Mask for Bytes Immediate
    fsmbi i, u16 ; i.b[n] = ( (u16    << n) & 0x8000 ) ? 0xff : 0x00
    Form Select Mask for Bytes
    fsmb  i, j   ; i.b[n] = ( (j.h[1] << n) & 0x8000 ) ? 0xff : 0x00
    Form Select Mask for Halfwords
    fsmh  i, j   ; i.h[n] = ( (j.b[3] << n) & 0x80 ) ? 0xffff : 0x0000
    Form Select Mask for Words
    fsm   i, j   ; i.w[n] = ( (j.b[3] << n) & 0x8 ) ? 0xffffffff : 0x00000000

    Example:

    fsmbi selABCd, 0x000f; make select mask to get XYZ from first arg
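A small Python model shows what the selABCd mask looks like and how it combines with selb. Registers are modelled as 16-byte bytes objects; the input values first and second are made up for the example.

```python
def fsmbi(u16):
    """Model of fsmbi: bit n of u16 (counting from the left) expands to byte n."""
    return bytes(0xFF if (u16 << n) & 0x8000 else 0x00 for n in range(16))

def selb(j, k, l):
    """Model of selb on 16-byte registers: j where l is 0, k where l is 1."""
    return bytes(((a & ~m) | (b & m)) & 0xFF for a, b, m in zip(j, k, l))

sel_ABCd = fsmbi(0x000F)
first = bytes(range(0x00, 0x10))     # made-up register contents
second = bytes(range(0x10, 0x20))
result = selb(first, second, sel_ABCd)
print(sel_ABCd.hex())
print(result.hex())
```

The first three words come from the first argument and the last word from the second, which is what the name selABCd is meant to convey.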




SH: Gather Bits Instructions

    These are the opposite of the form select instructions, and can be used to
    quickly pack results from comparison operators into compact bytes or
    half-words. They all gather the rightmost bit from each element of the source
    register and pack each one into a single bit of the target.

    Gather    Bits     from Bytes
    gbb i,    j ;      i=0;for(n=0;n<16;n++){i.w[0]<<=1;i.w[0]|=(j.b[n]&1);}
    Gather    Bits     from Halfwords
    gbh i,    j ;      i=0;for(n=0;n< 8;n++){i.w[0]<<=1;i.w[0]|=(j.h[n]&1);}
    Gather    Bits     (from Words)
    gb i,     j ;      i=0;for(n=0;n< 4;n++){i.w[0]<<=1;i.w[0]|=(j.w[n]&1);}
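As a quick check of the bit order, here is a Python sketch of gbb, assuming the source register is a 16-byte bytes object and returning the value that would land in the preferred word.

```python
def gbb(j):
    """Model of gbb: gather the low bit of each byte, byte 0 ending up leftmost."""
    result = 0
    for byte in j:
        result = (result << 1) | (byte & 1)
    return result

# A byte-compare result: TRUE (0xff) in bytes 0, 2 and 3, FALSE elsewhere.
compare = bytes([0xFF, 0x00, 0xFF, 0xFF] + [0x00] * 12)
bits = gbb(compare)
print(hex(bits))   # 0xb000
```

The result has the same layout that fsmb expects, so the two instructions round-trip a byte mask through 16 bits.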




SH: How to generate masks for non-quadword stores.

    As seen in the section on load/store, there are no non-quadword load/store
    operations. A way to store a non-quadword value is to load the destination
    quadword, shuffle the value into the loaded quadword, and store the result back
    to the same location. To make generating the required shuffle masks easier,
    there are a few instructions that generate these control masks:

    Generate Controls for Byte Insertion (d-form)
    cbd i, imm(j)
    Generate Controls for Byte Insertion (x-form)
    cbx i, j, k




SH: How to generate masks for non-quadword stores.


    Generate Controls           for Halfword Insertion (d-form)
    chd i, imm(j)
    Generate Controls           for Halfword Insertion (x-form)
    chx i, j, k
    Generate Controls           for Word Insertion (d-form)
    cwd i, imm(j)
    Generate Controls           for Word Insertion (x-form)
    cwx i, j, k
    Generate Controls           for Doubleword Insertion (d-form)
    cdd i, imm(j)
    Generate Controls           for Doubleword Insertion (x-form)
    cdx i, j, k




SH: How to generate masks for non-quadword stores.

    Example: Store preferred byte into a table

    lqx qword, table, offset
    cbx mask, table, offset
    shufb qword, value, qword, mask
    stqx qword, table, offset
    ai offset, offset, 1
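The read-modify-write sequence can be walked through in Python. This is a sketch under assumptions: local store is a small bytearray, address wrap-around is ignored, and cbx is modelled as producing the identity mask for the second shufb input with 0x03 (the preferred byte of the first input) at the addressed position.

```python
def cbx(j, k):
    """Model of cbx: shufb mask that inserts the preferred byte at (j + k) & 0xf."""
    pos = (j + k) & 0xF
    mask = bytearray(range(0x10, 0x20))   # default: keep the second input
    mask[pos] = 0x03                      # byte 3 of the first input
    return bytes(mask)

def shufb(r0, r1, msk):
    """Reduced shufb model; the special values are included for completeness."""
    src = bytes(r0) + bytes(r1)
    special = {0x80: 0x00, 0xC0: 0xFF, 0xE0: 0x80}
    return bytes(src[x & 0x1F] if x < 0x80
                 else special[0xE0 if x >= 0xE0 else (x & 0xC0)] for x in msk)

def store_byte(memory, table, offset, value):
    base = (table + offset) & ~0xF
    qword = bytes(memory[base:base + 16])      # lqx qword, table, offset
    mask = cbx(table, offset)                  # cbx mask, table, offset
    merged = shufb(value, qword, mask)         # shufb qword, value, qword, mask
    memory[base:base + 16] = merged            # stqx qword, table, offset

memory = bytearray(range(64))
value = bytes([0, 0, 0, 0xEE] + [0] * 12)      # 0xEE in the preferred byte
store_byte(memory, 16, 5, value)
print(memory[16:32].hex())
```

Only the addressed byte changes; the other fifteen bytes of the quadword are written back unmodified.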




SH: Reciprocal Estimate Instructions

    The hardware supports two fast (4-cycle) estimate instructions that calculate the
    reciprocal recip(x) = 1/x and the reciprocal square root rsqrt(x) = 1/√x. These
    instructions work in conjunction with the “fi” instruction that we’ll explain
    in detail later. After the interpolation instruction, results are accurate to a
    precision of 12 bits, which is about half the floating-point precision of 23 bits.
    In order to improve the accuracy, one must perform another Taylor- or Euler-step.

    Do note that:

        \[ \mathrm{sqrt}(x) = \sqrt{x} = \frac{\sqrt{x}\,\sqrt{x}}{\sqrt{x}} = |x| \cdot \frac{1}{\sqrt{x}} = x \cdot \mathrm{rsqrt}(x), \]

    since x ≥ 0, so there is no need for a separate square-root function.




Improving precision on the reciprocal function

    Assuming we have the input in the x-register, we proceed to calculate

    frest      a,   x
    fi         b,   x, a                     ; b is good to 12 bits precision
    fnms       c,   b, x, one                ; c = 1 - b * x (the residual error)
    fma        b,   c, b, b                  ; b = b + c * b, good to 24 bits precision
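To see why one step roughly doubles the number of good bits, here is a numeric sketch in Python. It does not emulate frest or fi; instead it assumes a starting estimate with a relative error of 2^-12 and applies the same two arithmetic steps in double precision.

```python
def refine_reciprocal(x, b):
    """One refinement step, mirroring the fnms/fma pair above."""
    c = 1.0 - b * x      # fnms c, b, x, one
    return c * b + b     # fma  b, c, b, b

x = 3.0
estimate = (1.0 / x) * (1.0 + 2.0 ** -12)   # stand-in for the frest + fi result
refined = refine_reciprocal(x, estimate)
err_before = abs(estimate * x - 1.0)
err_after = abs(refined * x - 1.0)
print(err_before, err_after)
```

If the estimate is (1/x)(1 + e), the refined value is (1/x)(1 - e^2), so a 12-bit estimate becomes a roughly 24-bit result.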




Improving precision on the reciprocal square-root function


    frsqest   a,   x
    fi        b,   x,   a                  ; b is good to 12 bits precision
    fm        c,   b,   x                  ; (b and a can share register)
    fm        d,   b,   onehalf            ; (c and x can share register)
    fnms      c,   c,   b, one
    fma       b,   d,   c, b               ; b is good to 24 bits precision
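The same kind of numeric check works for the reciprocal square root. As before, this Python sketch replaces frsqest and fi with an assumed estimate that is off by a relative error of 2^-12, and uses double precision arithmetic.

```python
import math

def refine_rsqrt(x, b):
    """One refinement step, mirroring the four instructions after fi."""
    c = b * x            # fm   c, b, x
    d = b * 0.5          # fm   d, b, onehalf
    c = 1.0 - c * b      # fnms c, c, b, one   -> 1 - x * b^2
    return d * c + b     # fma  b, d, c, b     -> b + (b / 2) * (1 - x * b^2)

x = 2.0
estimate = (1.0 / math.sqrt(x)) * (1.0 + 2.0 ** -12)   # stand-in for frsqest + fi
refined = refine_rsqrt(x, estimate)
root = x * refined       # sqrt(x) = x * rsqrt(x)
print(refined, root)
```

The two fm instructions are independent of each other, so they can issue back to back while the first result is still in flight.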




SH: Or Across - The Final Instruction

    The last instruction in the SH class is a new addition.

    Or Across
    orx     i, j           ; i.w[0] = ( j.w[0] | j.w[1] | j.w[2] | j.w[3] );
                             i.w[1] = i.w[2] = i.w[3] = 0




Floating point / Integer Class (FI) [Even:7]

    The FI class of instructions has a latency of 7 cycles and a throughput of 1
    cycle. These are all even instructions. There are three types of
    instructions: integer multiplies, interpolation for reciprocal calculations,
    and fp/integer conversions.




FI: Integer Multiplies


    multiply lower halves signed
    mpy     i, j, k     ; i.w[n] = j.h[2n+1] * k.h[2n+1]
    multiply lower halves signed immediate
    mpyi    i, j, s10   ; i.w[n] = j.h[2n+1] * ext(s10)
    multiply lower halves unsigned
    mpyu    i, j, k     ; i.w[n] = j.h[2n+1] * k.h[2n+1]
    multiply lower halves unsigned immediate (immediate sign-extends)
    mpyui   i, j, s10   ; i.w[n] = j.h[2n+1] * ext(s10)




FI: Integer Multiplies


    multiply lower halves, add word
    mpya    i, j, k, l ; i.w[n] = j.h[2n+1] * k.h[2n+1] + l.w[n]
    multiply lower halves, shift result down 16 with sign extend
    mpys    i, j, k     ; i.w[n] = j.h[2n+1] * k.h[2n+1] >> 16
    multiply upper half j by lower half k, shift up 16
    mpyh    i, j, k     ; i.w[n] = j.h[2n] * k.h[2n+1] << 16
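
    There is no single 32-bit multiply in these lists, but the low 32 bits of a
    word product can be composed from the 16-bit multiplies. The sketch below
    models one word lane in Python. The helper names are ours, and the idiom
    itself is an illustration rather than something stated on the slides.

```python
MASK32 = 0xFFFFFFFF

def halves(w):
    """Split a 32-bit word into (upper, lower) unsigned 16-bit halves."""
    return (w >> 16) & 0xFFFF, w & 0xFFFF

def mpyu(j, k):
    """Model of mpyu for one word: lower half of j times lower half of k."""
    return (halves(j)[1] * halves(k)[1]) & MASK32

def mpyh(j, k):
    """Model of mpyh for one word: (upper half of j * lower half of k) << 16."""
    return ((halves(j)[0] * halves(k)[1]) << 16) & MASK32

def mul32(a, b):
    """Low 32 bits of a * b from three 16-bit multiplies and two adds."""
    return (mpyh(a, b) + mpyh(b, a) + mpyu(a, b)) & MASK32

product = mul32(0x12345678, 0x9ABCDEF0)
```

    Writing a = ah*2^16 + al and b = bh*2^16 + bl, the term ah*bh*2^32 vanishes
    modulo 2^32, which leaves (ah*bl + bh*al)*2^16 + al*bl. Only the low 16 bits
    of each cross product survive the shift, so signedness does not matter there.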




FI: Integer Multiplies


    multiply upper halves signed
    mpyhh   i, j, k     ; i.w[n] = j.h[2n] * k.h[2n]
    multiply upper halves unsigned
    mpyhhu i, j, k      ; i.w[n] = j.h[2n] * k.h[2n]
    multiply/accumulate upper halves
    mpyhha i, j, k      ; i.w[n] += j.h[2n] * k.h[2n]
    multiply/accumulate upper halves unsigned
    mpyhhau i, j, k     ; i.w[n] += j.h[2n] * k.h[2n]




FI: Conversions and FI instruction


    fi         a, b, c                 ; use after frest or frsqest

    cuflt      a,   j,    precis       ;   unsigned int to float
    csflt      a,   j,    precis       ;   signed int to float
    cfltu      i,   b,    precis       ;   float to unsigned int
    cflts      i,   b,    precis       ;   float to signed int

    Here precis is an immediate giving the number of fractional bits, so the
    integer is treated as a fixed-point value scaled by 2^precis, e.g.

    cuflt fp, val, 8 ; converts 0x80 into 0.5 (0x80 / 2^8)

    Also, please note that these instructions saturate to the min and max values of
    their precision.
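
    To make the role of precis concrete, here is a rough Python model of the
    conversions. It uses Python floats rather than true single precision, and
    the truncation toward zero in the float-to-int direction is an assumption of
    this sketch.

```python
U32_MAX = 0xFFFFFFFF
S32_MIN, S32_MAX = -(1 << 31), (1 << 31) - 1

def cuflt(val, precis):
    """Unsigned int to float: val is read with `precis` fractional bits."""
    return val / float(1 << precis)

def cfltu(f, precis):
    """Float to unsigned int, saturating at 0 and 2**32 - 1."""
    return max(0, min(U32_MAX, int(f * (1 << precis))))

def cflts(f, precis):
    """Float to signed int, saturating at the 32-bit signed limits."""
    return max(S32_MIN, min(S32_MAX, int(f * (1 << precis))))

half = cuflt(0x80, 8)          # the example from the slide
back = cfltu(half, 8)          # round trip to the fixed-point value
clamped = cflts(1e20, 0)       # far out of range, so it saturates
```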




Byte Operations (BO) [Even: 4]

    There are a couple of interesting instructions that help with multimedia and
    streaming logic.

    Count Ones in Bytes
    cntb    i, j     ; i.b[n] = numOneBits( j.b[n] )
    Average Bytes
    avgb    i, j, k ; i.b[n] = ( j.b[n] + k.b[n] + 1 ) / 2
    Absolute Difference in Bytes
    absdb   i, j, k ; i.b[n] = abs( j.b[n] - k.b[n] )
    Sum Bytes into Half-words
    sumb    i, j, k ; i.h[0] = k.b[0] + k.b[1] + k.b[2] + k.b[3];
                       i.h[1] = j.b[0] + j.b[1] + j.b[2] + j.b[3];
                       :
                       i.h[6] = k.b[12] + k.b[13] + k.b[14] + k.b[15];
                       i.h[7] = j.b[12] + j.b[13] + j.b[14] + j.b[15];
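
    These byte operations can be modelled in a few lines of Python, with a
    register written as a list of 16 bytes. The sum-of-absolute-differences
    example at the end is a hypothetical use we picked to show absdb and sumb
    working together.

```python
def cntb(j):
    """Count the one bits in each byte."""
    return [bin(b).count("1") for b in j]

def avgb(j, k):
    """Rounded average of each pair of bytes."""
    return [(a + b + 1) // 2 for a, b in zip(j, k)]

def absdb(j, k):
    """Absolute difference of each pair of bytes."""
    return [abs(a - b) for a, b in zip(j, k)]

def sumb(j, k):
    """Eight half-words: even slots sum k's words, odd slots sum j's words."""
    out = []
    for w in range(4):
        out.append(sum(k[4 * w:4 * w + 4]))
        out.append(sum(j[4 * w:4 * w + 4]))
    return out

# Hypothetical example: sum of absolute differences of two 16-byte rows.
row_a = [10, 20, 30, 40] * 4
row_b = [12, 18, 30, 45] * 4
diff = absdb(row_a, row_b)       # per-byte differences
partial = sumb(diff, diff)       # per-word sums, one in each half-word
sad = sum(partial[1::2])         # add the four per-word sums
```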




Branch Class (BR) [Odd:-]




    Branches on the SPU are costly. If a branch is taken and it has not been
    predicted, there is an 18-cycle penalty so that the chip can restart the pipe.
    There is no penalty for falling through a non-predicted branch. However, if you
    have predicted a branch and it is not taken, there is also an 18-cycle
    penalty. Branches and branch hints are all odd instructions.

    Note: Even a static branch needs to be predicted.

    Note: This is one of the reasons why diverging control-paths are so difficult to
    optimize for.
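
    The rules here can be turned into a back-of-the-envelope cost model. The
    Python sketch below is our own simplification: it assumes any hint arrives
    early enough to be fully effective, and the probabilities are made-up inputs.

```python
PENALTY = 18  # cycles, the figure quoted in the text

def expected_penalty(p_taken, hinted_taken):
    """Average penalty cycles per execution of one branch.

    p_taken      -- probability that the branch is taken
    hinted_taken -- True if a timely hint predicts the branch as taken
    """
    if hinted_taken:
        # Penalty only when the predicted branch does not occur.
        return (1.0 - p_taken) * PENALTY
    # Not predicted: falling through is free, taking the branch is not.
    return p_taken * PENALTY

loop_back = expected_penalty(p_taken=0.99, hinted_taken=True)
loop_back_unhinted = expected_penalty(p_taken=0.99, hinted_taken=False)
coin_flip = expected_penalty(p_taken=0.5, hinted_taken=True)
```

    A loop branch that is almost always taken becomes nearly free once hinted,
    whereas a 50/50 branch costs about 9 cycles on average whichever way it is
    predicted.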




BR: Unconditional Branches


    Branch Relative
    br    brTo    ; goto label address
    Branch Relative and Set Link
    brsl  i, brTo ; gosub label address, i.w[0] = return address (*)
    Branch Indirect
    bi    i       ; goto i.w[0]
    Branch Indirect and Set Link
    bisl  i, j    ; gosub j.w[0], i.w[0] = return address (*)
    BRanch Absolute
    bra   brTo    ; goto brTo
    BRanch Absolute and Set Link
    brasl i, brTo ; gosub label address, i.w[0] = return address (*)

    (*): These instructions have a 4 cycle latency for the return register. Note:
    The bi instructions have enable/disable interrupt versions, e.g.: bie, bid,
    bisle, bisld.



BR: Conditional Branches (Relative)


    Branch on Zero
    brz   i, brTo ; branch if i.w[0] == 0
    Branch on Not Zero
    brnz  i, brTo ; branch if i.w[0] != 0
    Branch on Zero Half-word
    brhz  i, brTo ; branch if i.h[1] == 0
    Branch on Not Zero Half-word
    brhnz i, brTo ; branch if i.h[1] != 0




BR: Conditional Branches (Indirect)


    Branch Indirect on Zero
    biz   i, j ; branch to j.w[0] if i.w[0] == 0
    Branch Indirect on Not Zero
    binz  i, j ; branch to j.w[0] if i.w[0] != 0
    Branch Indirect on Zero Half-word
    bihz  i, j ; branch to j.w[0] if i.h[1] == 0
    Branch Indirect on Not Zero Half-word
    bihnz i, j ; branch to j.w[0] if i.h[1] != 0

    Note: These instructions can enable/disable interrupts as well.




BR: Interrupt & Misc


    Interrupt RETurn
    iret   i    ; Return from interrupt
    Interrupt RETurn, Disable
    iretd  i    ; Return from interrupt, disable interrupts
    Interrupt RETurn, Enable
    irete  i    ; Return from interrupt, enable interrupts
    Branch Indirect and Set Link if External Data
    bisled i, j ; gosub j if channel 0 is non-zero




Hint Branch Class (HB) [Odd:15]




    If you know the most likely (or only) outcome for a branch, you can make sure
    the branch is penalty free as long as the hint occurs at least 15 cycles before
    the branch is taken. If the hint occurs later, there still may be a benefit, since
    the penalty is lowered. However, if the hint arrives less than 4 cycles before the
    branch, there is no benefit.

    Please note that there is also a hardware bug w.r.t. the hbr
    instructions: one cannot hint a branch whose target lies forward of the
    branch and within the same 64-byte block as the branch.
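
    Both rules are easy to encode as checks, for instance in a build-time
    script. The Python sketch below is hypothetical tooling of our own, and it
    assumes that "same 64-byte block" means the same 64-byte-aligned block of
    local store.

```python
BLOCK = 64  # bytes, the block size quoted in the text

def hint_is_unsafe(branch_addr, target_addr, block=BLOCK):
    """True if hinting this branch would hit the hbr hardware bug.

    Assumes byte addresses and 64-byte-aligned blocks (our reading of the rule).
    """
    forwards = target_addr > branch_addr
    same_block = (branch_addr // block) == (target_addr // block)
    return forwards and same_block

def hint_benefit(cycles_before_branch):
    """Classify a hint by how many cycles ahead of the branch it is issued."""
    if cycles_before_branch >= 15:
        return "penalty free"
    if cycles_before_branch < 4:
        return "no benefit"
    return "possibly reduced penalty"

short_forward = hint_is_unsafe(0x100, 0x120)   # forward, same block
next_block = hint_is_unsafe(0x100, 0x140)      # forward, next block
loop_back = hint_is_unsafe(0x120, 0x100)       # backward branch
```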




Hint Branch Instructions


    Hint Branch (Immediate)
    hbr    brFrom, j    ; branch hint for any BIxxx type branch
    Hint Branch Absolute
    hbra   brFrom, brTo ; branch hint for any BRAxxx type branch
    Hint Branch Relative
    hbrr   brFrom, brTo ; branch hint for any BRxxx type branch
    Hint Branch Prefetch
    hbrp                ; inline prefetch code (*)

    (*) allows 15 LS instructions in a row without any instruction fetch stall.




CH: DMA Channel Ops

   We will explain these in further talks, but for completeness we’ve included
   them here. They are all odd instructions with a latency of 6. Note that the
   latency may actually be much higher if channels are not ready.

   Read from Channel
   rdch    i, chn ; read i from channel chn
   Write to Channel
   wrch    chn, i ; write i into channel chn
   Read Channel Count
   rdchcnt i, chn ; read channel count for channel chn into i




DP: Double Precision




    DP instructions have a latency of 13 and are even. However, they will stall
    the pipeline for 6 cycles (that is, all currently executing instructions are
    halted) while the instruction is executed. Therefore, we do not recommend using
    double precision at all!




Questions?




                                          That’s all folks!





SPU Optimizations-part 1

  • 1. Introduction to SPU Optimizations Part 1: Assembly Instructions P˚l-Kristian Engstad a pal engstad@naughtydog.com March 5, 2010 P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 2. Introduction These slides are used internally at Naughty Dog to introduce new programmers to our SPU programming methods. Due to popular interest, we are now making these public. Note that some of the tools that we are using are not released to the public, but there exists many other alternatives out there that do similar things. The first set of slides introduce most of the SPU assembly instructions. Please read these carefully before reading the second set. Those slides go through a made-up example showing how one can improve performance drastically, by knowing the hardware as well as employing a technique called software pipe-lining. P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 3. SPU programming is Cool In these slides, we will go through all of the assembly instructions that exist on the SPU, giving you a quick introduction to the power of the SPUs. Each SPU has 256 kB of local memory. This local memory can be thought of as 1 cycle memory. Programs and data exist in the same local memory space. There are no memory protections in local memory! The only way to access external memory is through DMA. There is a significant delay between when a DMA request is queued until it finishes. P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 4. SPU Execution Environment The SPU has 128 general purpose 128-bit wide registers. You can think of these as 2 doubles (64-bit floating-point values), 4 floats (32-bit floating-point values), 4 words (32-bit integer values), 8 half-words (16-bit integer values), or 16 bytes (8-bit integer values). An SPU executes an even and an odd instruction each cycle. Even instructions are mostly arithmetic instructions, whereas the odd ones are load/store instructions, shuffles, branches and other special instructions. P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 5. Instruction Classes The instruction set can be put in classes, where the instructions in the same class have the same arity (i.e. whether they are even or odd) and latency (how long it takes for the result to be ready): (SP) Single Precision {e6} (FX) FiXed {e2} (WS) Word Shift {e4} (LS) Load/Store {o6} (SH) SHuffle {o4} (FI) Fp Integer {e7} (BO) Byte Operations {e4} (BR) BRanch {o-} (HB) Hint Branch {o15} (CH) CHannel Operations {o6} (DP) Double Precision {e13} P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 6. Single Precision Floating Point Class (SP) [Even:6] The SP class of instructions have latency of 6 cycles and a throughput of 1 cycle. These are all even instructions. fa a, b, c ; a.f[n] = b.f[n] + c.f[n] fs a, b, c ; a.f[n] = b.f[n] - c.f[n] fm a, b, c ; a.f[n] = b.f[n] * c.f[n] fma a, b, c, d ; a.f[n] = b.f[n] * c.f[n] + d.f[n] fms a, b, c, d ; a.f[n] = b.f[n] * c.f[n] - d.f[n] fnms a, b, c, d ; a.f[n] = -(b.f[n] * c.f[n] - d.f[n]) The syntax here indicates that for each of the 4 32-bit floating point values in the register, the operation in the comment is executed. P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 7. Single Precision Floating Point Class (SP) No broadcast versions. No dot-products or cross-products. No “fnma” instruction. Example: If the registers r1 and r2 contains r1 = ( 1.0, 2.0, 3.0, 4.0 ), r2 = ( 0.0, -2.0, 1.0, 4.0 ), then after fa r0, r1, r2 ; r0 = r1 + r2 then r0 contains r0 = ( 1.0, 0.0, 4.0, 8.0 ). P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 8. FiXed precision Class (FX) [Even:2] The FX class of instructions all have latency of just 2 cycles and all have a throughput of 1 cycle. These are even instructions. There’s quite a few of them, and we can further divide them down into: Integer Arithmetic Operations. Immediate Loads Operations. Comparison Operations. Select Bit Operation. Logical Bit Operations. Extensions and Misc Operations. P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 9. FX: Arithmetic Operations The integer arithmetic operations “add” and “subtract from” work on either 4 words at a time or 8 half-words at a time. ah i, j, k ; i.h[n] = j.h[n] + k.h[n] ahi i, j, s10 ; i.h[n] = j.h[n] + ext(s10) a i, j, k ; i.w[n] = j.w[n] + k.w[n] ai i, j, s10 ; i.w[n] = j.w[n] + ext(s10) sfh i, j, k ; i.h[n] = -j.h[n] + k.h[n] sfhi i, j, s10 ; i.h[n] = -j.h[n] + ext(s10) sf i, j, k ; i.w[n] = -j.w[n] + k.w[n] sfi i, j, s10 ; i.w[n] = -j.w[n] + ext(s10) P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 10. FX: Arithmetic Operations & Examples Notice the subtract from semantics. This is different from the floating point subtract (fs) semantic. We think this was mainly due to the additional power of the immediate forms. ai i, i, 1 ; i = i + 1, for each word in i ahi i, i, -1 ; i = i - 1, for each half-word in i sfi i, i, 0 ; i = (-i), for each word in i sfhi x, x, 1 ; x = 1 - x, for each half-word in i sf z, y, x ; z = x - y, for each word in i P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 11. FX: Immediate Loads The SPU has some instructions that enable us to quickly set up registers values. These immediate loads are also 2-cycle FX instructions: il i, s16 ; i.w[n] = ext(s16) ilh i, u16 ; i.h[n] = u16 ila i, u18 ; i.w[n] = u18 ilhu i, u16 ; i.w[n] = u16 << 16 iohl i, u16 ; i.w[n] |= u16 Example: ilhu ones, 0x3f80 ; ones = (1.0, 1.0, 1.0, 1.0) ila magic, 0x10203; magic = (0x00010203_00010203_00010203_00010203) P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 12. FX: Logical Bit Operations These instructions work on each of the 128 bits in the registers. and i, j, k ; i = j & k nand i, j, k ; i = ~(j & k) andc i, j, k ; i = j & ~k or i, j, k ; i = j | k nor i, j, k ; i = ~(j | k) orc i, j, k ; i = j | ~k xor i, j, k ; i = j ^ k eqv i, j, k ; i = j == k P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 13. FX: Logical Operations w/immediates andbi i, j, u8 ; i.b[n] = j.b[n] & u8 andhi i, j, s10 ; i.h[n] = j.h[n] & ext(s10) andi i, j, s10 ; i.w[n] = j.w[n] & ext(s10) orbi i, j, u8 ; i.b[n] = j.b[n] | u8 orhi i, j, s10 ; i.h[n] = j.h[n] | ext(s10) ori i, j, s10 ; i.w[n] = j.w[n] | ext(s10) xorbi i, j, u8 ; i.b[n] = j.b[n] ^ u8 xorhi i, j, s10 ; i.h[n] = j.h[n] ^ ext(s10) xori i, j, s10 ; i.w[n] = j.w[n] ^ ext(s10) P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 14. FX: Comparisons (Bytes) ceqb i, j, k ; i.b[n] = (j.b[n] == k.b[n]) ? TRUE : FALSE ceqbi i, j, su8 ; i.b[n] = (j.b[n] == su8) ? TRUE : FALSE cgtb i, j, k ; i.b[n] = (j.b[n] > k.b[n]) ? TRUE : FALSE (s) cgtbi i, j, su8 ; i.b[n] = (j.b[n] > su8) ? TRUE : FALSE clgtb i, j, k ; i.b[n] = (j.b[n] > k.b[n]) ? TRUE : FALSE (u) clgtbi i, j, su8 ; i.b[n] = (j.b[n] > su8) ? TRUE : FALSE TRUE = 0xFF FALSE = 0x00 (s) means “signed” and (u) means “unsigned” compares. P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 15. FX: Comparisons (Halves) ceqh i, j, k ; i.h[n] = (j.h[n] == k.h[n]) ? TRUE : FALSE ceqhi i, j, s10 ; i.h[n] = (j.h[n] == ext(s10)) ? TRUE : FALSE cgth i, j, k ; i.h[n] = (j.h[n] > k.h[n]) ? TRUE : FALSE (s) cgthi i, j, s10 ; i.h[n] = (j.h[n] > ext(s10)) ? TRUE : FALSE (s) clgth i, j, k ; i.h[n] = (j.h[n] > k.h[n]) ? TRUE : FALSE (u) clgthi i, j, s10 ; i.h[n] = (j.h[n] > ext(s10)) ? TRUE : FALSE (u) TRUE = 0xFFFF FALSE = 0x0000 P˚ al-Kristian Engstadpal engstad@naughtydog.com Introduction to SPU Optimizations
  • 16. FX: Comparisons (Words)
        ceq   i, j, k   ; i.w[n] = (j.w[n] == k.w[n])   ? TRUE : FALSE
        ceqi  i, j, s10 ; i.w[n] = (j.w[n] == ext(s10)) ? TRUE : FALSE
        cgt   i, j, k   ; i.w[n] = (j.w[n] > k.w[n])    ? TRUE : FALSE (s)
        cgti  i, j, s10 ; i.w[n] = (j.w[n] > ext(s10))  ? TRUE : FALSE (s)
        clgt  i, j, k   ; i.w[n] = (j.w[n] > k.w[n])    ? TRUE : FALSE (u)
        clgti i, j, s10 ; i.w[n] = (j.w[n] > ext(s10))  ? TRUE : FALSE (u)
    TRUE = 0xFFFF_FFFF
    FALSE = 0x0000_0000
  • 17. FX: Comparisons (Floats)
        fceq  i, b, c ; i.w[n] = (b[n] == c[n])           ? TRUE : FALSE
        fcmeq i, b, c ; i.w[n] = (abs(b[n]) == abs(c[n])) ? TRUE : FALSE
        fcgt  i, b, c ; i.w[n] = (b[n] > c[n])            ? TRUE : FALSE
        fcmgt i, b, c ; i.w[n] = (abs(b[n]) > abs(c[n]))  ? TRUE : FALSE
    TRUE = 0xFFFF_FFFF
    FALSE = 0x0000_0000
    Note: All zeros are equal, e.g.: 0.0 == -0.0.
  • 18. FX: Select Bits
    This very important operation selects bits from j and k depending on the bits in the l register. It fits well with the comparison instructions given previously.
        selb i, j, k, l ; for each bit: i = (l == 0) ? j : k
    Notice that if the mask bit is 0, it selects the bit from j, and if not, it selects the bit from k.
    Example: SIMD min/max
        fcgt mask, a, b      ; mask is all 1's if a > b
        selb max, b, a, mask ; select a if a > b, else b
        selb min, a, b, mask ; select b if a > b, else a
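The compare-then-select pattern can be checked with a small model. This is a minimal Python sketch, assuming a register is a list of four lanes. `fcgt` works on float lanes, `selb` on their 32-bit patterns, and the helper names `float_bits`, `bits_float` and `simd_min_max` are illustrative only.

```python
import struct

def float_bits(x):
    """Return the 32-bit big-endian bit pattern of a single-precision float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_float(bits):
    """Inverse of float_bits."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fcgt(a, b):
    """fcgt: all ones in a lane where a > b, all zeros otherwise."""
    return [0xFFFFFFFF if x > y else 0x00000000 for x, y in zip(a, b)]

def selb(j, k, l):
    """selb: per bit, take j where the mask bit is 0 and k where it is 1."""
    return [((jw & ~lw) | (kw & lw)) & 0xFFFFFFFF for jw, kw, lw in zip(j, k, l)]

def simd_min_max(a, b):
    """Branch-free lane-wise min and max, mirroring the three-instruction example."""
    mask = fcgt(a, b)
    a_bits = [float_bits(x) for x in a]
    b_bits = [float_bits(x) for x in b]
    max_bits = selb(b_bits, a_bits, mask)  # a where a > b, else b
    min_bits = selb(a_bits, b_bits, mask)  # b where a > b, else a
    return [bits_float(w) for w in min_bits], [bits_float(w) for w in max_bits]

lo, hi = simd_min_max([1.0, 5.0, -2.0, 0.5], [2.0, 3.0, -4.0, 0.5])
```

The same mask feeds both selects, so one comparison yields min and max together without any branch.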
  • 19. FX: Misc
    generate borrow bit
        bg  i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n])
                    ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
    generate borrow bit with borrow
        bgx i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
                    ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
    generate carry bit
        cg  i, j, k ; i.w[n] = (j.w[n] + k.w[n]) > 0xffffffff ? 1 : 0
    generate carry bit with carry
        cgx i, j, k ; tmp.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
                    ; i.w[n] = tmp.w[n] > 0xffffffff ? 1 : 0
  • 20. FX: Misc
    add with carry bit
        addx i, j, k ; i.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
    subtract with borrow bit
        sfx  i, j, k ; i.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
    sign-extend byte to half-word
        xsbh i, j    ; i.h[n] = ext(j.h[n] & 0xff)
    sign-extend half-word to word
        xshw i, j    ; i.w[n] = ext(j.w[n] & 0xffff)
    sign-extend word to double-word
        xswd i, j    ; i.d[n] = ext(j.d[n] & 0xffffffff)
    count leading zeros
        clz  i, j    ; i.w[n] = leadingZeroCount(j.w[n])
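The carry instructions exist so that wider arithmetic can be assembled from 32-bit lanes. Below is a minimal Python sketch of a 64-bit add built from `cg` and `addx`, assuming each register holds two big-endian 64-bit values (words 0-1 and words 2-3). Moving the carry from the low word into the high word's slot is done here with plain list indexing. On hardware that step would need a shuffle, which is an assumption of this sketch rather than something shown on the slides.

```python
MASK32 = 0xFFFFFFFF

def cg(j, k):
    """cg: 1 in each word where j + k overflows 32 bits, else 0."""
    return [1 if (x + y) > MASK32 else 0 for x, y in zip(j, k)]

def addx(i, j, k):
    """addx: j + k plus the carry held in bit 0 of the old value of i."""
    return [(x + y + (c & 1)) & MASK32 for c, x, y in zip(i, j, k)]

def add64_pairs(a, b):
    """Add two registers that each hold two 64-bit integers (high word first)."""
    carry = cg(a, b)
    # Route each low-word carry (words 1 and 3) to its high word (words 0 and 2).
    carry_in = [carry[1], 0, carry[3], 0]
    return addx(carry_in, a, b)

total = add64_pairs([0x00000001, 0xFFFFFFFF, 0x00000000, 0x80000000],
                    [0x00000000, 0x00000001, 0x00000000, 0x80000000])
```

The first pair computes 0x1_FFFFFFFF + 1 and the second 0x80000000 + 0x80000000. In both cases the low word overflows and the carry lands in the high word.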
  • 21. Word Shift Class (WS) [Even:4]
    The WS class of instructions has a latency of 4 cycles and a throughput of 1 cycle. These are all even instructions.
        shlh  i, j, k   ; i.h[n] = j.h[n] << ( k.h[n] & 0x1f )
        shlhi i, j, imm ; i.h[n] = j.h[n] << ( imm & 0x1f )
        shl   i, j, k   ; i.w[n] = j.w[n] << ( k.w[n] & 0x3f )
        shli  i, j, imm ; i.w[n] = j.w[n] << ( imm & 0x3f )
    Notice that there is an independent shift amount for each element in the register forms shlh and shl, i.e., this is truly SIMD!
  • 22. Example
        ; Assume r0 = ( 1, 2, 4, 8 )
        ;        r1 = ( 1, 2, 3, 4 )
        shl r2, r0, r1
        ; Now r2 = ( 1<<1, 2<<2, 4<<3, 8<<4 ) = ( 2, 8, 32, 128 )
  • 23. WS: Rotate left logical
        roth  i, j, k   ; i.h[n] = j.h[n] <^ ( k.h[n] & 0x0f )
        rothi i, j, imm ; i.h[n] = j.h[n] <^ ( imm & 0x0f )
        rot   i, j, k   ; i.w[n] = j.w[n] <^ ( k.w[n] & 0x1f )
        roti  i, j, imm ; i.w[n] = j.w[n] <^ ( imm & 0x1f )
    <^ is my idiosyncratic symbol for rotate.
  • 24. WS: Shift right logical
        rothm  i, j, k   ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
        rothmi i, j, imm ; i.h[n] = j.h[n] >> ( -imm & 0x1f )
        rotm   i, j, k   ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
        rotmi  i, j, imm ; i.w[n] = j.w[n] >> ( -imm & 0x3f )
    Notice that the shift amounts need to be negative in order to produce a proper shift: to shift right by 3, pass -3. This is because the operation is actually a rotate left followed by a mask.
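The negated shift count is easy to get wrong, so here is a minimal Python sketch of `shl` and `rotm` on word lanes, written directly from the formulas on the slides. Registers are modelled as lists of four unsigned 32-bit words, and a negative count is stored as its two's-complement bit pattern.

```python
MASK32 = 0xFFFFFFFF

def shl(j, k):
    """shl: independent left shift per word; counts of 32..63 give zero."""
    return [(x << (c & 0x3F)) & MASK32 for x, c in zip(j, k)]

def rotm(j, k):
    """rotm: logical right shift per word by (-count & 0x3f)."""
    return [x >> ((-c) & 0x3F) for x, c in zip(j, k)]

def as_word(value):
    """Store a (possibly negative) integer as a 32-bit two's-complement word."""
    return value & MASK32

left = shl([1, 2, 4, 8], [1, 2, 3, 4])               # per-lane shift amounts
right = rotm([0x80] * 4, [as_word(-3)] * 4)          # shift right by 3
wrong = rotm([0x80] * 4, [3] * 4)                    # count +3 means shift by 61
```

Passing a positive count to `rotm` does not shift right by that amount. In this model +3 becomes a shift by 61, which clears the word.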
  • 25. WS: Shift right arithmetic
    These match the logical right shifts, except that >> here replicates the sign bit into the vacated positions.
        rotmah  i, j, k   ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
        rotmahi i, j, imm ; i.h[n] = j.h[n] >> ( -imm & 0x1f )
        rotma   i, j, k   ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
        rotmai  i, j, imm ; i.w[n] = j.w[n] >> ( -imm & 0x3f )
  • 26. Load/Store Class (LS) [Odd:6]
    The load/store operations are odd instructions that work on the 256 kB local memory. They have a latency of 6 cycles, but the hardware has short-cuts in place so that you can read a written value immediately after the store. Do note:
    - Memory wraps around, so you can never access memory outside the local store (LS).
    - You can only load and store a whole quadword, so if you need to modify a part, you need to load the quadword, merge the modified part into it, and store the whole quadword back.
    - Addresses are in units of bytes, unlike the VUs on the PS2.
    - The load/store operations use the value in the preferred word of the address register, i.e., the first word.
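The wrap-around and whole-quadword rules can be sketched in a few lines. This minimal Python model treats the local store as a 256 kB `bytearray`. It also assumes that the low four address bits are ignored, so every access is quadword aligned, which the slides do not state explicitly. The function names are illustrative.

```python
LS_SIZE = 256 * 1024  # local store size in bytes

def effective_address(addr):
    """Wrap the address into the local store and align it to 16 bytes."""
    return addr & (LS_SIZE - 1) & ~0xF

def load_quadword(ls, addr):
    """Model of a quadword load: always returns 16 aligned bytes."""
    base = effective_address(addr)
    return bytes(ls[base:base + 16])

def store_quadword(ls, addr, quad):
    """Model of a quadword store: always writes 16 aligned bytes."""
    base = effective_address(addr)
    ls[base:base + 16] = quad

def store_byte(ls, addr, value):
    """Read-modify-write: the only way to change less than a quadword."""
    quad = bytearray(load_quadword(ls, addr))
    quad[addr & 0xF] = value & 0xFF
    store_quadword(ls, addr, quad)

local_store = bytearray(LS_SIZE)
store_byte(local_store, 0x40005, 0xAB)  # wraps around to address 0x00005
```

The address 0x40005 lies just past the end of the 256 kB local store, so the write lands at byte 5 of the first quadword.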
  • 27. LS: Loads
        lqa i, label18 ; addr = label18
                       ; range = 256 kB (or +/- 128 kB)
        lqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
                       ; qoff is 10 bit signed, addr range = +/- 8 kB.
        lqr i, label14 ; addr = ext(label14) + pc
                       ; label14 range = +/- 8 kB.
        lqx i, j, k    ; addr = j.w[0] + k.w[0]
  • 28. LS: Stores
        stqa i, label18 ; addr = label18
                        ; range = 256 kB (or +/- 128 kB)
        stqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
                        ; qoff is 10 bit signed, addr range = +/- 8 kB.
        stqr i, label14 ; addr = ext(label14) + pc
                        ; label14 range = +/- 8 kB.
        stqx i, j, k    ; addr = j.w[0] + k.w[0]
  • 29. Shuffle Class (SH) [Odd:4]
    The shuffle operations all have a 4-cycle latency and they are odd instructions. Most of the instructions in this class deal with the whole quadword. We can divide the SH class into:
    - The Shuffle Bytes Instruction.
    - Quadword left-shifts, rotates and right-shifts.
    - Creation of Shuffle Masks.
    - Form Select Instructions.
    - Gather Bit Instructions.
    - Reciprocal Estimate Instructions.
  • 30. SH: Shuffle Bytes
    The ordering of bytes, half-words and words within the quadword is shown below. Notice that this is big-endian, not little-endian:
        +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
        | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
        +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
        |   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |       0       |       1       |       2       |       3       |
        +---------------+---------------+---------------+---------------+
    The shuffle bytes instruction shufb takes three inputs: two source registers r0 and r1, and a shuffle mask msk. The output register d is found by running the following logic on each byte of the mask:
  • 31. SH: Shuffle Bytes
    Let x = msk.b[n], where n goes from 0 to 15:
    - if x in 0 .. 0x7f:
        If (x & 0x10) == 0x00, then d.b[n] = r0.b[x & 0x0f].
        If (x & 0x10) == 0x10, then d.b[n] = r1.b[x & 0x0f].
    - if x in 0x80 .. 0xbf: d.b[n] = 0x00
    - if x in 0xc0 .. 0xdf: d.b[n] = 0xff
    - if x in 0xe0 .. 0xff: d.b[n] = 0x80
    This is very powerful stuff!
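The per-byte rules above translate almost line for line into code. The following is a minimal Python sketch of `shufb`, assuming each register is a 16-byte `bytes` object in the big-endian order shown on the previous slide. It is a model of the rules as listed, not an authoritative emulator.

```python
def shufb(r0, r1, msk):
    """Build a quadword by picking each output byte according to the mask."""
    out = bytearray(16)
    for n in range(16):
        x = msk[n]
        if x < 0x80:
            # Bit 0x10 chooses the source register, the low nibble the byte.
            source = r1 if (x & 0x10) else r0
            out[n] = source[x & 0x0F]
        elif x < 0xC0:
            out[n] = 0x00
        elif x < 0xE0:
            out[n] = 0xFF
        else:
            out[n] = 0x80
    return bytes(out)

v = bytes(range(0xA0, 0xB0))              # sample input quadword
s_AAAA = bytes([0x00, 0x01, 0x02, 0x03] * 4)
xs = shufb(v, v, s_AAAA)                  # broadcast word 0 to all four words
special = shufb(v, v, bytes([0x80, 0xC0, 0xE0, 0x10] + [0x00] * 12))
```

The `special` mask exercises all four branches: a constant 0x00, a constant 0xff, a constant 0x80, and byte 0 of the second source.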
  • 32. SH: Shufb Examples
    Previously, we mentioned that the SPU has no broadcast ability, but with a single shufb instruction we can broadcast one word into all words. We can create the shuffle masks using instructions directly, or else we could simply load them using an LS class instruction.
        ila  s_AAAA, 0x10203   ; s_AAAA = 0x00_01_02_03 x 4
                               ;        = 0x00010203_00010203_00010203_00010203
        orbi s_CCCC, s_AAAA, 8 ; s_CCCC = 0x08_09_0a_0b x 4
    Using these masks, we can quickly create registers with all x's, y's, z's or w's:
        shufb xs, v, v, s_AAAA ; xs = (v.x, v.x, v.x, v.x)
        shufb zs, v, v, s_CCCC ; zs = (v.z, v.z, v.z, v.z)
  • 33. SH: .dshuf
    Because the shuffle instruction is so useful, our "frontend" tool supports quick creation of shuffle masks. The .dshuf directive creates shuffle masks according to the following rules:
    - If the length of the string is 4, we assume word-sized shuffles; if 8, half-word sized; and if 16, byte-sized shuffles.
    - Upper-case letters indicate sources from the first input; lower-case ones indicate sources from the second input.
    - '0' indicates zeros, 'X' ones (0xff bytes) and '8' 0x80's.
        .dshuf "ABC0"     ; 0x00010203_04050607_08090a0b_80808080
        .dshuf "aX08"     ; 0x10111213_c0c0c0c0_80808080_e0e0e0e0
        .dshuf "aBC0aBC0" ; 0x1011_0203_0405_8080_1011_0203_0405_8080
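The rules for the mask strings are regular enough to reimplement. The function below is a hypothetical Python reconstruction based only on the three rules and three examples on the slide. The real frontend tool is not public, so the name `dshuf` and its error handling are assumptions.

```python
def dshuf(pattern):
    """Expand a 4-, 8- or 16-character pattern into a 16-byte shufb mask."""
    if len(pattern) not in (4, 8, 16):
        raise ValueError("pattern length must be 4, 8 or 16")
    size = 16 // len(pattern)                   # bytes per element
    specials = {"0": 0x80, "X": 0xC0, "8": 0xE0}
    mask = bytearray()
    for ch in pattern:
        if ch in specials:
            mask += bytes([specials[ch]] * size)
            continue
        index = ord(ch.upper()) - ord("A")      # element number in the source
        if not ch.isalpha() or not 0 <= index < len(pattern):
            raise ValueError("bad pattern character: %r" % ch)
        base = index * size + (0x10 if ch.islower() else 0x00)
        mask += bytes(range(base, base + size))
    return bytes(mask)

word_mask = dshuf("ABC0").hex()
mixed_mask = dshuf("aX08").hex()
half_mask = dshuf("aBC0aBC0").hex()
```

The three results reproduce the hexadecimal values given in the slide's comments, which is the only check available for this reconstruction.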
  • 34. SH: Another Shufb Example
    We can create finite state-machines, piping input into one end of the quadword while spitting out the result at another (e.g., the preferred word). Here's an example of such a "delay machine":
        ; in the data section:
        m_bcdA: .dshuf "bcdA"
        ; in the init section:
        lqa s_bcdA, m_bcdA
        ; in the loop:
        shufb state, input, state, s_bcdA
        ; state.x = state.y
        ; state.y = state.z
        ; state.z = state.w
        ; state.w = input.x
  • 35. SH: Quadword Shift Left
    These instructions take the preferred byte (byte 3) or an immediate value, shifting the whole quadword to the left. There are versions that shift by a number of bytes as well as by a number of bits. For bit shifts, the shift amount is masked to the range 0 to 7.
    SHift Left Quadword by BYtes
        shlqby   i, j, k   ; i = j << ((k.b[3] & 0x1f) * 8)
    SHift Left Quadword by BYtes Immediate
        shlqbyi  i, j, imm ; i = j << ((imm & 0x1f) * 8)
    SHift Left Quadword by BYtes using BIt count
        shlqbybi i, j, k   ; i = j << (k.b[3] & 0xf8)
    SHift Left Quadword by BIts
        shlqbi   i, j, k   ; i = j << (k.b[3] & 0x07)
    SHift Left Quadword by BIts Immediate
        shlqbii  i, j, imm ; i = j << (imm & 0x07)
  • 36. SH: Quadword Rotate Left
    These follow the same pattern as left shifts:
    ROTate (left) Quadword by BYtes
        rotqby   i, j, k   ; i = j <^ ((k.b[3] & 0x0f) * 8)
    ROTate (left) Quadword by BYtes Immediate
        rotqbyi  i, j, imm ; i = j <^ ((imm & 0x0f) * 8)
    ROTate (left) Quadword by BYtes using BIt count
        rotqbybi i, j, k   ; i = j <^ (k.b[3] & 0x78)
    ROTate (left) Quadword by BIts
        rotqbi   i, j, k   ; i = j <^ (k.b[3] & 0x07)
    ROTate (left) Quadword by BIts Immediate
        rotqbii  i, j, imm ; i = j <^ (imm & 0x07)
  • 37. SH: Quadword Shift Right
    Ditto for right shifts, though as in the WS class, these are called rotates with mask and take negative shift amounts:
    ROTate and Mask Quadword by BYtes
        rotmqby   i, j, k   ; i = j >> ((-k.b[3] & 0x1f) * 8)
    ROTate and Mask Quadword by BYtes Immediate
        rotmqbyi  i, j, imm ; i = j >> ((-imm & 0x1f) * 8)
    ROTate and Mask Quadword by BYtes using BIt count
        rotmqbybi i, j, k   ; i = j >> (-(k.b[3] & 0xf8) & 0xf8) (*)
    ROTate and Mask Quadword by BIts
        rotmqbi   i, j, k   ; i = j >> (-k.b[3] & 0x07)
    ROTate and Mask Quadword by BIts Immediate
        rotmqbii  i, j, imm ; i = j >> (-imm & 0x07)
  • 38. SH: Form Select Instructions
    These instructions expand a small number of bits into whole bytes, half-words or words of ones, and they are good for use with the selb operation.
    Form Select Mask for Bytes Immediate
        fsmbi i, u16 ; i.b[n] = ( (u16 << n) & 0x8000 ) ? 0xff : 0x00
    Form Select Mask for Bytes
        fsmb  i, j   ; i.b[n] = ( (j.h[1] << n) & 0x8000 ) ? 0xff : 0x00
    Form Select Mask for Halfwords
        fsmh  i, j   ; i.h[n] = ( (j.b[3] << n) & 0x80 ) ? 0xffff : 0x0000
    Form Select Mask for Words
        fsm   i, j   ; i.w[n] = ( (j.b[3] << n) & 0x8 ) ? 0xffffffff : 0x00000000
    Example:
        fsmbi selABCd, 0x000f ; make select mask to get XYZ from first arg
  • 39. SH: Gather Bits Instructions
    These are the opposite of the form select instructions, and can be used to quickly pack results from comparison operators into compact bytes or half-words. They all gather the rightmost bit from each element of the source register and pack it into a single bit in the target.
    Gather Bits from Bytes
        gbb i, j ; i=0; for(n=0;n<16;n++){ i.w[0] = (i.w[0] << 1) | (j.b[n] & 1); }
    Gather Bits from Halfwords
        gbh i, j ; i=0; for(n=0;n< 8;n++){ i.w[0] = (i.w[0] << 1) | (j.h[n] & 1); }
    Gather Bits (from Words)
        gb  i, j ; i=0; for(n=0;n< 4;n++){ i.w[0] = (i.w[0] << 1) | (j.w[n] & 1); }
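Form select and gather bits are inverses of each other, which makes a round trip a convenient check of both formulas. The sketch below models only the byte variants in Python, with a quadword as 16 `bytes`. The function `fsmb` takes the 16-bit value directly rather than reading it from a register, a simplification made for this example.

```python
def fsmb(bits16):
    """Expand 16 bits into 16 bytes: bit 15 controls byte 0, bit 0 byte 15."""
    return bytes(0xFF if (bits16 << n) & 0x8000 else 0x00 for n in range(16))

def gbb(reg):
    """Gather the rightmost bit of every byte into a 16-bit value."""
    result = 0
    for n in range(16):
        result = (result << 1) | (reg[n] & 1)  # shift first, then insert
    return result

sel_abcd = fsmb(0x000F)         # ones only in the last word
round_trip = gbb(fsmb(0xA5C3))  # gather undoes form select
```

The loop shifts before it inserts each new bit, so byte 15 ends up in bit 0 of the result and nothing is shifted past it.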
  • 40. SH: How to generate masks for non-quadword stores.
    As seen in the section on load/store, there are no non-quadword load/store operations. A way to store a non-quadword value is to load the destination quadword, shuffle the value into the loaded quadword, and store it back to the same location. To make generating these shuffle masks easier, there are a few instructions that produce the control masks:
    Generate Controls for Byte Insertion (d-form)
        cbd i, imm(j)
    Generate Controls for Byte Insertion (x-form)
        cbx i, j, k
  • 41. SH: How to generate masks for non-quadword stores.
    Generate Controls for Halfword Insertion (d-form)
        chd i, imm(j)
    Generate Controls for Halfword Insertion (x-form)
        chx i, j, k
    Generate Controls for Word Insertion (d-form)
        cwd i, imm(j)
    Generate Controls for Word Insertion (x-form)
        cwx i, j, k
    Generate Controls for Doubleword Insertion (d-form)
        cdd i, imm(j)
    Generate Controls for Doubleword Insertion (x-form)
        cdx i, j, k
  • 42. SH: How to generate masks for non-quadword stores.
    Example: Store preferred byte into a table
        lqx   qword, table, offset
        cbx   mask, table, offset
        shufb qword, value, qword, mask
        stqx  qword, table, offset
        ai    offset, offset, 1
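The load, mask, shuffle, store sequence can be followed step by step in a small model. This Python sketch assumes that the byte-insertion control mask is the identity selection of the second shufb source (bytes 0x10 to 0x1f) with 0x03 written at the target byte position, so that the preferred byte of the first source is inserted there. The slides do not spell out that mask layout, and the function names are illustrative.

```python
def shufb(r0, r1, msk):
    """Byte shuffle restricted to the 0x00..0x1f mask range used here."""
    return bytes((r1 if x & 0x10 else r0)[x & 0x0F] for x in msk)

def byte_insert_mask(addr):
    """Model of cbd/cbx: keep the old quadword except at byte (addr & 15)."""
    mask = bytearray(range(0x10, 0x20))  # select bytes 0..15 of the second source
    mask[addr & 0xF] = 0x03              # preferred byte of the first source
    return bytes(mask)

def store_preferred_byte(ls, addr, value):
    """lqx / cbx / shufb / stqx: merge one byte into memory."""
    base = addr & ~0xF
    qword = bytes(ls[base:base + 16])          # lqx
    mask = byte_insert_mask(addr)              # cbx
    ls[base:base + 16] = shufb(value, qword, mask)  # shufb, then stqx

memory = bytearray(range(32))
value = bytes([0x00, 0x00, 0x00, 0xEE] + [0x00] * 12)  # 0xEE in the preferred byte
store_preferred_byte(memory, 0x15, value)
```

Only the byte at address 0x15 changes. The other fifteen bytes of that quadword are written back with their old contents.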
  • 43. SH: Reciprocal Estimate Instructions
    The hardware supports two fast (4-cycle) instructions that calculate the reciprocal $\mathrm{recip}(x) = 1/x$, or the reciprocal square root $\mathrm{rsqrt}(x) = 1/\sqrt{x}$. These instructions work in conjunction with the "fi" instruction that we'll later explain in detail. After the interpolation instruction, results are accurate to a precision of 12 bits, which is about half the floating-point precision of 23. In order to improve the accuracy, one must perform another Taylor- or Euler-step. Do note that:
        $\mathrm{sqrt}(x) = \sqrt{x} = \frac{\sqrt{x}\,\sqrt{x}}{\sqrt{x}} = |x| \cdot \frac{1}{\sqrt{x}} = x \cdot \mathrm{rsqrt}(x)$,
    since $x \ge 0$, so there is no need for a separate square-root function.
  • 44. Improving precision on the reciprocal function
    Assuming we have the input in the x-register, we proceed to calculate
        frest a, x
        fi    b, x, a      ; b is good to 12 bits precision
        fnms  c, b, x, one ; c = 1 - b*x
        fma   b, c, b, b   ; b = b + c*b, b is good to 24 bits precision
  • 45. Improving precision on the reciprocal square-root function
        frsqest a, x
        fi      b, x, a       ; b is good to 12 bits precision
        fm      c, b, x       ; (b and a can share register)
        fm      d, b, onehalf ; (c and x can share register)
        fnms    c, c, b, one
        fma     b, d, c, b    ; b is good to 24 bits precision
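Both refinement sequences can be checked numerically. The Python sketch below replays the arithmetic of the `fnms`/`fma` steps in double precision and starts from an estimate whose relative error is deliberately set to 2^-12, standing in for the output of `fi`. The real instructions work in single precision, so this only illustrates how the error shrinks.

```python
import math

def refine_reciprocal(x, b):
    """One refinement step for b ~ 1/x."""
    c = 1.0 - b * x      # fnms c, b, x, one
    return c * b + b     # fma  b, c, b, b

def refine_rsqrt(x, b):
    """One refinement step for b ~ 1/sqrt(x)."""
    c = b * x            # fm   c, b, x
    d = b * 0.5          # fm   d, b, onehalf
    c = 1.0 - c * b      # fnms c, c, b, one
    return d * c + b     # fma  b, d, c, b

ERR = 2.0 ** -12         # stand-in for the 12-bit estimate error

x = 3.0
recip_est = (1.0 / x) * (1.0 + ERR)
recip = refine_reciprocal(x, recip_est)

y = 2.0
rsqrt_est = (1.0 / math.sqrt(y)) * (1.0 + ERR)
rsqrt = refine_rsqrt(y, rsqrt_est)
sqrt_y = y * rsqrt       # sqrt(y) = y * rsqrt(y)
```

In both cases a relative error of about 2^-12 drops to the order of 2^-24 after one step, which matches the "12 bits" and "24 bits" annotations on the slides.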
  • 46. SH: Or Across - The Final Instruction
    The last instruction in the SH class is a new addition.
    Or Across
        orx i, j ; i.w[0] = ( j.w[0] | j.w[1] | j.w[2] | j.w[3] );
                 ; i.w[1] = i.w[2] = i.w[3] = 0
  • 47. Floating point / Integer Class (FI) [Even:7]
    The FI class of instructions has a latency of 7 cycles and a throughput of 1 cycle. These are all even instructions. There are basically three types of instructions: integer multiplies, interpolations for reciprocal calculations, and finally, fp/integer conversions.
  • 48. FI: Integer Multiplies
    multiply lower halves signed
        mpy   i, j, k   ; i.w[n] = j.h[2n+1] * k.h[2n+1]
    multiply lower halves signed immediate
        mpyi  i, j, s10 ; i.w[n] = j.h[2n+1] * ext(s10)
    multiply lower halves unsigned
        mpyu  i, j, k   ; i.w[n] = j.h[2n+1] * k.h[2n+1]
    multiply lower halves unsigned immediate (immediate sign-extends)
        mpyui i, j, s10 ; i.w[n] = j.h[2n+1] * ext(s10)
  • 49. FI: Integer Multiplies
    multiply lower halves, add word
        mpya i, j, k, l ; i.w[n] = j.h[2n+1] * k.h[2n+1] + l.w[n]
    multiply lower halves, shift result down 16 with sign extend
        mpys i, j, k    ; i.w[n] = (j.h[2n+1] * k.h[2n+1]) >> 16
    multiply upper half j by lower half k, shift up 16
        mpyh i, j, k    ; i.w[n] = (j.h[2n] * k.h[2n+1]) << 16
  • 50. FI: Integer Multiplies
    multiply upper halves signed
        mpyhh   i, j, k ; i.w[n] = j.h[2n] * k.h[2n]
    multiply upper halves unsigned
        mpyhhu  i, j, k ; i.w[n] = j.h[2n] * k.h[2n]
    multiply/accumulate upper halves
        mpyhha  i, j, k ; i.w[n] += j.h[2n] * k.h[2n]
    multiply/accumulate upper halves unsigned
        mpyhhau i, j, k ; i.w[n] += j.h[2n] * k.h[2n]
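All of the multiplies above take 16-bit operands, so a full 32-bit product (modulo 2^32) has to be pieced together from partial products. The following is a minimal Python sketch of one common way to do it, using two `mpyh` results and one `mpyu` result added together. The sequence is an illustration built from the instruction formulas above, not code taken from the slides.

```python
MASK32 = 0xFFFFFFFF

def mpyh(j, k):
    """mpyh: (upper half of j) * (lower half of k), shifted up by 16."""
    return [(((x >> 16) * (y & 0xFFFF)) << 16) & MASK32 for x, y in zip(j, k)]

def mpyu(j, k):
    """mpyu: unsigned product of the two lower halves."""
    return [(x & 0xFFFF) * (y & 0xFFFF) for x, y in zip(j, k)]

def add_words(j, k):
    """a: word-wise add with wrap-around."""
    return [(x + y) & MASK32 for x, y in zip(j, k)]

def mul32(a, b):
    """Low 32 bits of a * b per word, from three 16-bit multiplies."""
    cross = add_words(mpyh(a, b), mpyh(b, a))  # (ah*bl + bh*al) << 16
    return add_words(cross, mpyu(a, b))        # + al*bl

lhs = [0x12345678, 0xFFFFFFFF, 7, 0x00010000]
rhs = [0x9ABCDEF0, 0x00000002, 6, 0x00010000]
product = mul32(lhs, rhs)
```

The term made from the two upper halves is never computed, because it is shifted up by 32 bits and therefore vanishes from the low word.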
  • 51. FI: Conversions and FI instruction
        fi    a, b, c      ; use after frest or frsqest
        cuflt a, j, precis ; unsigned int to float
        csflt a, j, precis ; signed int to float
        cfltu i, b, precis ; float to unsigned int
        cflts i, b, precis ; float to signed int
    Here precis is the precision as an immediate, i.e., the number of fractional bits, so that e.g.
        cuflt fp, val, 8 ; converts 0x80 into 0.5
    Also, please note that these instructions saturate to the min and max values of their precision.
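The precision operand acts as a fixed-point scale of 2^precis. Here is a minimal Python sketch of the unsigned pair, assuming the float-to-integer direction truncates toward zero before saturating. That rounding behaviour is an assumption of this sketch, and the model uses double precision where the hardware uses single precision.

```python
def cuflt(value, precis):
    """Unsigned fixed-point word with `precis` fractional bits to float."""
    return (value & 0xFFFFFFFF) / float(1 << precis)

def cfltu(x, precis):
    """Float to unsigned fixed-point word, saturating at 0 and 0xffffffff."""
    scaled = int(x * (1 << precis))   # int() truncates toward zero
    return max(0, min(0xFFFFFFFF, scaled))

half = cuflt(0x80, 8)        # 128 / 256
back = cfltu(half, 8)        # back to 0x80
too_small = cfltu(-1.0, 0)   # saturates at the minimum
too_large = cfltu(1e20, 0)   # saturates at the maximum
```

With `precis` set to 8, the integer 0x80 and the float 0.5 describe the same fixed-point quantity, as in the slide's example.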
  • 52. Byte Operations (BO) [Even:4]
    There are a couple of interesting instructions that help with multi-media and streaming logic.
    Count Ones in Bytes
        cntb  i, j    ; i.b[n] = numOneBits( j.b[n] )
    Average Bytes
        avgb  i, j, k ; i.b[n] = ( j.b[n] + k.b[n] + 1 ) / 2
    Absolute Difference in Bytes
        absdb i, j, k ; i.b[n] = abs( j.b[n] - k.b[n] )
    Sum Bytes into Half-words
        sumb  i, j, k ; i.h[0] = k.b[0] + k.b[1] + k.b[2] + k.b[3];
                      ; i.h[1] = j.b[0] + j.b[1] + j.b[2] + j.b[3];
                      ; :
                      ; i.h[6] = k.b[12] + k.b[13] + k.b[14] + k.b[15];
                      ; i.h[7] = j.b[12] + j.b[13] + j.b[14] + j.b[15];
  • 53. Branch Class (BR) [Odd:-]
    Branches on the SPU are costly. If a branch is taken and it has not been predicted, there is an 18-cycle penalty so that the chip can restart the pipe. There is no penalty for falling through a non-predicted branch. However, if you have predicted a branch and it is not taken, there is also an 18-cycle penalty. Branches and branch hints are all odd instructions.
    Note: Even a static branch needs to be predicted.
    Note: This is one of the reasons why diverging control-paths are so difficult to optimize for.
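These rules make it straightforward to estimate whether a hint is worth it. The function below is a back-of-the-envelope Python sketch of the expected penalty per branch, using only the 18-cycle figure quoted above and assuming the hint, when present, arrives early enough to be fully effective. The function name and the probability parameter are illustrative.

```python
MISS_PENALTY = 18  # cycles, as quoted on the slide

def expected_penalty(p_taken, hinted):
    """Average penalty in cycles for a branch taken with probability p_taken."""
    if hinted:
        # A hinted branch only costs cycles when it falls through.
        return (1.0 - p_taken) * MISS_PENALTY
    # An unhinted branch only costs cycles when it is taken.
    return p_taken * MISS_PENALTY

loop_unhinted = expected_penalty(0.9, hinted=False)  # typical loop back-edge
loop_hinted = expected_penalty(0.9, hinted=True)
coin_flip = expected_penalty(0.5, hinted=True)
```

For a branch that goes either way with equal probability, the estimate is 9 cycles whether or not it is hinted, which is one way to see why diverging control paths are hard to optimize.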
  • 54. BR: Unconditional Branches
    Branch Relative
        br    brTo    ; goto label address
    Branch Relative and Set Link
        brsl  i, brTo ; gosub label address, i.w[0] = return address (*)
    Branch Indirect
        bi    i       ; goto i.w[0]
    Branch Indirect and Set Link
        bisl  i, j    ; gosub j.w[0], i.w[0] = return address (*)
    BRanch Absolute
        bra   brTo    ; goto brTo
    BRanch Absolute and Set Link
        brasl i, brTo ; gosub label address, i.w[0] = return address (*)
    (*): These instructions have a 4 cycle latency for the return register.
    Note: The bi instructions have enable/disable interrupt versions, e.g.: bie, bid, bisle, bisld.
  • 55. BR: Conditional Branches (Relative)
    Branch on Zero
        brz   i, brTo ; branch if i.w[0] == 0
    Branch on Not Zero
        brnz  i, brTo ; branch if i.w[0] != 0
    Branch on Zero (Halfword)
        brhz  i, brTo ; branch if i.h[1] == 0
    Branch on Not Zero (Halfword)
        brhnz i, brTo ; branch if i.h[1] != 0
  • 56. BR: Conditional Branches (Indirect)
    Branch Indirect on Zero
        biz   i, j ; branch to j.w[0] if i.w[0] == 0
    Branch Indirect on Not Zero
        binz  i, j ; branch to j.w[0] if i.w[0] != 0
    Branch Indirect on Zero (Halfword)
        bihz  i, j ; branch to j.w[0] if i.h[1] == 0
    Branch Indirect on Not Zero (Halfword)
        bihnz i, j ; branch to j.w[0] if i.h[1] != 0
    Note: These instructions can enable/disable interrupts as well.
  • 57. BR: Interrupt & Misc
    Interrupt RETurn
        iret   i    ; Return from interrupt
    Interrupt RETurn (Disable)
        iretd  i    ; Return from interrupt, disable interrupts
    Interrupt RETurn (Enable)
        irete  i    ; Return from interrupt, enable interrupts
    Branch Indirect and Set Link if External Data
        bisled i, j ; gosub j if channel 0 is non-zero
  • 58. Hints Branch Class (HB) [Odd:15]
    If you know the most likely (or only) outcome for a branch, you can make sure the branch is penalty free as long as the hint occurs at least 15 cycles before the branch is taken. If the hint occurs later, there may still be a benefit, since the penalty is lowered. However, if the hint arrives less than 4 cycles before the branch, there is no benefit.
    Please note that there is also a hardware bug w.r.t. the hbr instructions: one cannot hint a branch whose target is forwards and within the same 64-byte block as the branch.
  • 59. Hints Branch Instructions
    Hint Branch (Immediate)
        hbr  brFrom, j    ; branch hint for any BIxxx type branch
    Hint Branch Absolute
        hbra brFrom, brTo ; branch hint for any BRAxxx type branch
    Hint Branch Relative
        hbrr brFrom, brTo ; branch hint for any BRxxx type branch
    Hint Branch Prefetch
        hbrp              ; inline prefetch code (*)
    (*) allows 15 LS instructions in a row without any instruction fetch stall.
  • 60. CH: DMA Channel Ops
    We will explain these in further talks, but for completeness we've included them here. They are all odd instructions with a latency of 6. Note that the latency may actually be much higher if channels are not ready.
    Read from Channel
        rdch    i, chn ; read i from channel chn
    Write to Channel
        wrch    chn, i ; write i into channel chn
    Read Channel Count
        rdchcnt i, chn ; read channel count for channel chn into i
  • 61. DP: Double Precision
    DP instructions have a latency of 13 and are even. However, they stall the pipeline for 6 cycles (that is, all currently executing instructions are halted) while the instruction is executed. Therefore, we do not recommend using double precision at all!
  • 62. Questions?
    That's all folks!