SPU Optimizations-part 1
1. Introduction to SPU Optimizations
Part 1: Assembly Instructions
Pål-Kristian Engstad
pal engstad@naughtydog.com
March 5, 2010
Pål-Kristian Engstad, pal engstad@naughtydog.com, Introduction to SPU Optimizations
2. Introduction
These slides are used internally at Naughty Dog to introduce new programmers
to our SPU programming methods. Due to popular interest, we are now
making them public. Note that some of the tools we use have not been
released to the public, but there exist many alternatives that
do similar things.
The first set of slides introduces most of the SPU assembly instructions. Please
read these carefully before reading the second set. Those slides go through a
made-up example showing how one can improve performance drastically by
knowing the hardware and by employing a technique called software
pipelining.
3. SPU programming is Cool
In these slides, we will go through all of the assembly instructions that exist on
the SPU, giving you a quick introduction to the power of the SPUs.
Each SPU has 256 kB of local memory.
This local memory can be thought of as 1 cycle memory.
Programs and data exist in the same local memory space.
There are no memory protections in local memory!
The only way to access external memory is through DMA.
There is a significant delay between when a DMA request is queued and when
it finishes.
4. SPU Execution Environment
The SPU has 128 general purpose 128-bit wide registers.
You can think of these as
2 doubles (64-bit floating-point values),
4 floats (32-bit floating-point values),
4 words (32-bit integer values),
8 half-words (16-bit integer values), or
16 bytes (8-bit integer values).
An SPU executes an even and an odd instruction each cycle.
Even instructions are mostly arithmetic instructions, whereas
the odd ones are load/store instructions, shuffles, branches and other
special instructions.
5. Instruction Classes
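The instruction set can be divided into classes, where the instructions in the same
class have the same parity (i.e. whether they are even or odd) and latency (how
long it takes for the result to be ready). In the list below, {e6} means an even
instruction with a latency of 6 cycles, and {o4} an odd instruction with a latency
of 4 cycles: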
(SP) Single Precision {e6}
(FX) FiXed {e2}
(WS) Word Shift {e4}
(LS) Load/Store {o6}
(SH) SHuffle {o4}
(FI) Fp Integer {e7}
(BO) Byte Operations {e4}
(BR) BRanch {o-}
(HB) Hint Branch {o15}
(CH) CHannel Operations {o6}
(DP) Double Precision {e13}
6. Single Precision Floating Point Class (SP) [Even:6]
The SP class of instructions has a latency of 6 cycles and a throughput of one
instruction per cycle. These are all even instructions.
fa a, b, c ; a.f[n] = b.f[n] + c.f[n]
fs a, b, c ; a.f[n] = b.f[n] - c.f[n]
fm a, b, c ; a.f[n] = b.f[n] * c.f[n]
fma a, b, c, d ; a.f[n] = b.f[n] * c.f[n] + d.f[n]
fms a, b, c, d ; a.f[n] = b.f[n] * c.f[n] - d.f[n]
fnms a, b, c, d ; a.f[n] = -(b.f[n] * c.f[n] - d.f[n])
The syntax here indicates that the operation in the comment is executed for each
of the four 32-bit floating-point values in the register (n = 0..3).
7. Single Precision Floating Point Class (SP)
No broadcast versions.
No dot-products or cross-products.
No “fnma” instruction.
Example:
If the registers r1 and r2 contain
r1 = ( 1.0, 2.0, 3.0, 4.0 ),
r2 = ( 0.0, -2.0, 1.0, 4.0 ),
then after
fa r0, r1, r2 ; r0 = r1 + r2
r0 contains
r0 = ( 1.0, 0.0, 4.0, 8.0 ).
8. FiXed precision Class (FX) [Even:2]
The FX class of instructions all have a latency of just 2 cycles and a
throughput of one instruction per cycle. These are even instructions.
There are quite a few of them, and we can further divide them into:
Integer Arithmetic Operations.
Immediate Load Operations.
Comparison Operations.
Select Bit Operation.
Logical Bit Operations.
Extensions and Misc Operations.
9. FX: Arithmetic Operations
The integer arithmetic operations “add” and “subtract from” work on either 4
words at a time or 8 half-words at a time.
ah i, j, k ; i.h[n] = j.h[n] + k.h[n]
ahi i, j, s10 ; i.h[n] = j.h[n] + ext(s10)
a i, j, k ; i.w[n] = j.w[n] + k.w[n]
ai i, j, s10 ; i.w[n] = j.w[n] + ext(s10)
sfh i, j, k ; i.h[n] = -j.h[n] + k.h[n]
sfhi i, j, s10 ; i.h[n] = -j.h[n] + ext(s10)
sf i, j, k ; i.w[n] = -j.w[n] + k.w[n]
sfi i, j, s10 ; i.w[n] = -j.w[n] + ext(s10)
10. FX: Arithmetic Operations & Examples
Notice the subtract-from semantics: sf i, j, k computes k - j, not j - k. This is
different from the floating-point subtract (fs) semantics. We think this was mainly
due to the additional power of the immediate forms.
ai i, i, 1 ; i = i + 1, for each word in i
ahi i, i, -1 ; i = i - 1, for each half-word in i
sfi i, i, 0 ; i = (-i), for each word in i
sfhi x, x, 1 ; x = 1 - x, for each half-word in x
sf z, y, x ; z = x - y, for each word in x and y
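The subtract-from convention is easy to get backwards, so here is a minimal Python sketch of the word forms. It is only a reference model, not SPU code: registers are modelled as lists of four unsigned 32-bit words, and the function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF

def sf(j, k):
    """Model of 'sf i, j, k': per word, i = k - j (j is subtracted FROM k)."""
    return [(kw - jw) & MASK32 for jw, kw in zip(j, k)]

def sfi(j, s10):
    """Model of 'sfi i, j, s10': per word, i = s10 - j, s10 a signed 10-bit immediate."""
    assert -512 <= s10 <= 511, "immediate must fit in 10 signed bits"
    return [(s10 - jw) & MASK32 for jw in j]

# 'sfi i, i, 0' negates each word (two's complement, wrapping to 32 bits).
negated = sfi([1, 2, 3, 4], 0)
# 'sf z, y, x' gives z = x - y.
difference = sf([1, 1, 1, 1], [5, 6, 7, 8])
```

The immediate forms are what make the reversed operand order useful: sfi with 0 negates, and sfi with any small constant c computes c - x in a single instruction.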
11. FX: Immediate Loads
The SPU has some instructions that enable us to quickly set up register
values. These immediate loads are also 2-cycle FX instructions:
il i, s16 ; i.w[n] = ext(s16)
ilh i, u16 ; i.h[n] = u16
ila i, u18 ; i.w[n] = u18
ilhu i, u16 ; i.w[n] = u16 << 16
iohl i, u16 ; i.w[n] |= u16
Example:
ilhu ones, 0x3f80 ; ones = (1.0, 1.0, 1.0, 1.0)
ila magic, 0x10203 ; magic = (0x00010203_00010203_00010203_00010203)
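An arbitrary 32-bit constant can be built by pairing ilhu (upper half of each word) with iohl (OR in the lower half). The Python sketch below models just that pairing; registers are lists of four 32-bit words, and the helper as_float is an assumption added only to display a bit pattern as a single-precision float.

```python
import struct

def ilhu(u16):
    """Model of 'ilhu i, u16': every word = u16 << 16."""
    return [(u16 & 0xFFFF) << 16] * 4

def iohl(reg, u16):
    """Model of 'iohl i, u16': OR u16 into the lower half of every word."""
    return [w | (u16 & 0xFFFF) for w in reg]

def as_float(word):
    """Reinterpret a 32-bit pattern as an IEEE single-precision float."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

ones = ilhu(0x3F80)                  # 0x3f800000 in every word, i.e. 1.0
pis = iohl(ilhu(0x4049), 0x0FDB)     # 0x40490fdb in every word, i.e. pi as a float
```

Constants whose lower 16 bits are zero (such as 1.0) need only the single ilhu.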
12. FX: Logical Bit Operations
These instructions work on each of the 128 bits in the registers.
and i, j, k ; i = j & k
nand i, j, k ; i = ~(j & k)
andc i, j, k ; i = j & ~k
or i, j, k ; i = j | k
nor i, j, k ; i = ~(j | k)
orc i, j, k ; i = j | ~k
xor i, j, k ; i = j ^ k
eqv i, j, k ; i = ~(j ^ k) (bitwise equivalence)
13. FX: Logical Operations w/immediates
andbi i, j, u8 ; i.b[n] = j.b[n] & u8
andhi i, j, s10 ; i.h[n] = j.h[n] & ext(s10)
andi i, j, s10 ; i.w[n] = j.w[n] & ext(s10)
orbi i, j, u8 ; i.b[n] = j.b[n] | u8
orhi i, j, s10 ; i.h[n] = j.h[n] | ext(s10)
ori i, j, s10 ; i.w[n] = j.w[n] | ext(s10)
xorbi i, j, u8 ; i.b[n] = j.b[n] ^ u8
xorhi i, j, s10 ; i.h[n] = j.h[n] ^ ext(s10)
xori i, j, s10 ; i.w[n] = j.w[n] ^ ext(s10)
14. FX: Comparisons (Bytes)
ceqb i, j, k ; i.b[n] = (j.b[n] == k.b[n]) ? TRUE : FALSE
ceqbi i, j, su8 ; i.b[n] = (j.b[n] == su8) ? TRUE : FALSE
cgtb i, j, k ; i.b[n] = (j.b[n] > k.b[n]) ? TRUE : FALSE (s)
cgtbi i, j, su8 ; i.b[n] = (j.b[n] > su8) ? TRUE : FALSE (s)
clgtb i, j, k ; i.b[n] = (j.b[n] > k.b[n]) ? TRUE : FALSE (u)
clgtbi i, j, su8 ; i.b[n] = (j.b[n] > su8) ? TRUE : FALSE (u)
TRUE = 0xFF
FALSE = 0x00
(s) means “signed” and (u) means “unsigned” compares.
15. FX: Comparisons (Halves)
ceqh i, j, k ; i.h[n] = (j.h[n] == k.h[n]) ? TRUE : FALSE
ceqhi i, j, s10 ; i.h[n] = (j.h[n] == ext(s10)) ? TRUE : FALSE
cgth i, j, k ; i.h[n] = (j.h[n] > k.h[n]) ? TRUE : FALSE (s)
cgthi i, j, s10 ; i.h[n] = (j.h[n] > ext(s10)) ? TRUE : FALSE (s)
clgth i, j, k ; i.h[n] = (j.h[n] > k.h[n]) ? TRUE : FALSE (u)
clgthi i, j, s10 ; i.h[n] = (j.h[n] > ext(s10)) ? TRUE : FALSE (u)
TRUE = 0xFFFF
FALSE = 0x0000
16. FX: Comparisons (Words)
ceq i, j, k ; i.w[n] = (j.w[n] == k.w[n]) ? TRUE : FALSE
ceqi i, j, s10 ; i.w[n] = (j.w[n] == ext(s10)) ? TRUE : FALSE
cgt i, j, k ; i.w[n] = (j.w[n] > k.w[n]) ? TRUE : FALSE (s)
cgti i, j, s10 ; i.w[n] = (j.w[n] > ext(s10)) ? TRUE : FALSE (s)
clgt i, j, k ; i.w[n] = (j.w[n] > k.w[n]) ? TRUE : FALSE (u)
clgti i, j, s10 ; i.w[n] = (j.w[n] > ext(s10)) ? TRUE : FALSE (u)
TRUE = 0xFFFF_FFFF
FALSE = 0x0000_0000
17. FX: Comparisons (Floats)
fceq i, b, c ; i.w[n] = (b.f[n] == c.f[n]) ? TRUE : FALSE
fcmeq i, b, c ; i.w[n] = (abs(b.f[n]) == abs(c.f[n])) ? TRUE : FALSE
fcgt i, b, c ; i.w[n] = (b.f[n] > c.f[n]) ? TRUE : FALSE
fcmgt i, b, c ; i.w[n] = (abs(b.f[n]) > abs(c.f[n])) ? TRUE : FALSE
TRUE = 0xFFFF_FFFF
FALSE = 0x0000_0000
Note: All zeros are equal, e.g.: 0.0 == -0.0.
18. FX: Select Bits
This very important operation selects bits from j and k depending on the bits
in the l register. It fits well with the comparison instructions given previously.
selb i, j, k, l ; i = (j & ~l) | (k & l), for each of the 128 bits
Notice that where a bit in l is 0, the result takes the bit from j, and where it
is 1, the result takes the bit from k.
Example: SIMD min/max
fcgt mask, a, b ; mask is all 1’s if a > b
selb max, b, a, mask ; select a if a > b
selb min, a, b, mask ; select b if a > b
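To check the operand order of the min/max idiom, here is a small Python model of fcgt and selb. It is a sketch under simplifying assumptions: registers are lists of four 32-bit words, floats are stored as their IEEE bit patterns, and the helpers f2w and w2f are made up for the conversion.

```python
import struct

def f2w(x):
    """Float -> 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def w2f(w):
    """32-bit pattern -> float."""
    return struct.unpack(">f", struct.pack(">I", w))[0]

def fcgt(b, c):
    """Model of 'fcgt i, b, c': all ones where b > c, else all zeros."""
    return [0xFFFFFFFF if w2f(x) > w2f(y) else 0 for x, y in zip(b, c)]

def selb(j, k, l):
    """Model of 'selb i, j, k, l': bit from j where l is 0, bit from k where l is 1."""
    return [((jw & ~lw) | (kw & lw)) & 0xFFFFFFFF for jw, kw, lw in zip(j, k, l)]

a = [f2w(x) for x in (1.0, 5.0, -3.0, 4.0)]
b = [f2w(x) for x in (2.0, -1.0, -3.0, 0.5)]

mask = fcgt(a, b)                                # all 1's where a > b
vmax = [w2f(w) for w in selb(b, a, mask)]        # select a where a > b
vmin = [w2f(w) for w in selb(a, b, mask)]        # select b where a > b
```

All four lanes are handled without a single branch, which is the main reason this pairing of compare and select is used so heavily.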
19. FX: Misc
generate borrow bit
bg i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n])
i.w[n] = tmp.w[n] < 0 ? 0 : 1
generate borrow bit with borrow
bgx i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n] + (i.w[n]&1) - 1)
i.w[n] = tmp.w[n] < 0 ? 0 : 1
generate carry bit
cg i, j, k ; i.w[n] = (j.w[n] + k.w[n]) > 0xffffffff ? 1 : 0
generate carry bit with carry
cgx i, j, k ; tmp.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
i.w[n] = tmp.w[n] > 0xffffffff ? 1 : 0
20. FX: Misc
add with carry bit
addx i, j, k ; i.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
subtract with borrow bit
sfx i, j, k ; i.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
sign-extend byte to half-word
xsbh i, j ; i.h[n] = ext(j.h[n] & 0xff)
sign-extend half-word to word
xshw i, j ; i.w[n] = ext(j.w[n] & 0xffff)
sign-extend word to double-word
xswd i, j ; i.d[n] = ext(j.d[n] & 0xffffffff)
count leading zeros
clz i, j ; i.w[n] = leadingZeroCount(j.w[n])
21. Word Shift Class (WS) [Even:4]
The WS class of instructions has a latency of 4 cycles and a throughput of one
instruction per cycle. These are all even instructions.
shlh i, j, k ; i.h[n] = j.h[n] << ( k.h[n] & 0x1f )
shlhi i, j, imm ; i.h[n] = j.h[n] << ( imm & 0x1f )
shl i, j, k ; i.w[n] = j.w[n] << ( k.w[n] & 0x3f )
shli i, j, imm ; i.w[n] = j.w[n] << ( imm & 0x3f )
Notice that in the shlh and shl versions each element has its own independent
shift amount, i.e., this is truly SIMD!
23. WS: Rotate left logical
roth i, j, k ; i.h[n] = j.h[n] <^ ( k.h[n] & 0x0f )
rothi i, j, imm ; i.h[n] = j.h[n] <^ ( imm & 0x0f )
rot i, j, k ; i.w[n] = j.w[n] <^ ( k.w[n] & 0x1f )
roti i, j, imm ; i.w[n] = j.w[n] <^ ( imm & 0x1f )
<^ is my idiosyncratic symbol for rotate.
24. WS: Shift right logical
rothm i, j, k ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
rothmi i, j, imm ; i.h[n] = j.h[n] >> ( -imm & 0x1f )
rotm i, j, k ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
rotmi i, j, imm ; i.w[n] = j.w[n] >> ( -imm & 0x3f )
Notice here that the shift amounts need to be negative in order to produce a
proper shift. This is because this is actually a rotate left and then mask
operation.
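The negated shift count is a common source of bugs, so here is a minimal Python model of the word form, written directly from the formula above. It models one 32-bit lane only; the function name rotm_word is just an illustrative choice.

```python
def rotm_word(j, k):
    """Model of one lane of 'rotm i, j, k' (also rotmi with k = imm).

    The effective count is (-k) & 0x3f; counts of 32 or more clear the word.
    """
    count = (-k) & 0x3F
    if count >= 32:
        return 0
    return (j & 0xFFFFFFFF) >> count

shifted = rotm_word(0x80000000, -4)   # logical shift right by 4
oops = rotm_word(0x000000F0, 4)       # positive 4 means a count of 60: result is 0
```

Passing a positive amount by mistake does not shift left; it yields a large effective count and therefore zero.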
25. WS: Shift right arithmetic
rotmah i, j, k ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
rotmahi i, j, imm ; i.h[n] = j.h[n] >> ( -imm & 0x1f )
rotma i, j, k ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
rotmai i, j, imm ; i.w[n] = j.w[n] >> ( -imm & 0x3f )
26. Load/Store Class (LS) [Odd:6]
The load/store operations are odd instructions that work on the 256 kB local
memory. They have a latency of 6 cycles, but the hardware has short-cuts in
place so that you can read a written value immediately after the store. Do note:
Memory wraps around, so you can never access memory outside the local
store (LS).
You can only load and store a whole quadword, so if you need to modify a
part, you need to load the quadword value, merge in the modified part
into the value and store the whole quadword back.
Addresses are in units of bytes, unlike the VU’s on the PS2.
The load/store operations will use the value in the preferred word of the
address register, i.e.: the first word.
27. LS: Loads
lqa i, label18 ; addr = label18
; range = 256 kB (or +/- 128 kB)
lqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
; qoff is 10-bit signed, addr range = +/- 8 kB.
lqr i, label18 ; addr = ext(label18) + pc
; label18 range = +/- 128 kB (16-bit word offset).
lqx i, j, k ; addr = j.w[0] + k.w[0]
28. LS: Stores
stqa i, label18 ; addr = label18
; range = 256 kB (or +/- 128 kB)
stqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
; qoff is 10-bit signed, addr range = +/- 8 kB.
stqr i, label18 ; addr = ext(label18) + pc
; label18 range = +/- 128 kB (16-bit word offset).
stqx i, j, k ; addr = j.w[0] + k.w[0]
29. Shuffle Class (SH) [Odd:4]
The shuffle operations all have a 4-cycle latency and they are odd instructions.
Most of the instructions in this class deal with the whole quadword.
We can divide the SH class into:
The Shuffle Bytes Instruction.
Quadword left-shifts, rotates and right-shifts.
Creation of Shuffle Masks.
Form Select Instructions.
Gather Bit Instructions.
Reciprocal Estimate Instructions.
30. SH: Shuffle Bytes
The ordering of bytes, half-words and words within the quadword is shown
below. Notice that this is big-endian, not little-endian:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       0       |       1       |       2       |       3       |
+---------------+---------------+---------------+---------------+
The shuffle bytes instruction shufb takes three inputs: two source registers r0
and r1, and a shuffle mask msk. The output register d is found by running the
following logic on each byte of the mask:
31. SH: Shuffle Bytes
Let x = msk.b[n], where n goes from 0 to 15:
if x in 0 .. 0x7f:
If (x & 0x10) == 0x00, then d.b[n] = r0.b[x & 0x0f].
If (x & 0x10) == 0x10, then d.b[n] = r1.b[x & 0x0f].
if x in 0x80 .. 0xbf: d.b[n] = 0x00
if x in 0xc0 .. 0xdf: d.b[n] = 0xff
if x in 0xe0 .. 0xff: d.b[n] = 0x80
This is very powerful stuff!
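The selection rules above translate almost line for line into code. The following Python sketch is a reference model only (registers are 16-byte bytes objects, byte 0 first, matching the big-endian layout shown earlier):

```python
def shufb(r0, r1, msk):
    """Model of 'shufb d, r0, r1, msk' on 16-byte registers."""
    out = bytearray(16)
    for n, x in enumerate(msk):
        if x >= 0xE0:
            out[n] = 0x80              # 0xe0 .. 0xff
        elif x >= 0xC0:
            out[n] = 0xFF              # 0xc0 .. 0xdf
        elif x >= 0x80:
            out[n] = 0x00              # 0x80 .. 0xbf
        elif x & 0x10:
            out[n] = r1[x & 0x0F]      # byte from the second source
        else:
            out[n] = r0[x & 0x0F]      # byte from the first source
    return bytes(out)

v = bytes(range(0xA0, 0xB0))
w = bytes(range(0x50, 0x60))
broadcast_x = shufb(v, v, bytes([0, 1, 2, 3] * 4))             # word 0 into all words
mixed = shufb(v, w, bytes([0x80, 0xC0, 0xE0, 0x10] + [0] * 12))
```

Because each output byte is chosen independently, one instruction covers permutes, broadcasts, merges of two registers, and insertion of the constants 0x00, 0xff and 0x80.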
32. SH: Shufb Examples
Previously, we mentioned that the SPU has no broadcast ability, but with a
single shufb instruction we can broadcast one word into all words. We can
create the shuffle masks using instructions directly, or else we could simply load
them using an LS class instruction.
ila s_AAAA, 0x10203 ; s_AAAA = 0x00_01_02_03 x 4
; = 0x00010203_00010203_00010203_00010203
orbi s_CCCC, s_AAAA, 8 ; s_CCCC = 0x08_09_0a_0b x 4
Using these masks, we can quickly create registers with all x’s, y’s, z’s or w’s:
shufb xs, v, v, s_AAAA ; xs = (v.x, v.x, v.x, v.x)
shufb zs, v, v, s_CCCC ; zs = (v.z, v.z, v.z, v.z)
33. SH: .dshuf
Because the shuffle instruction is so useful, our “frontend” tool supports quick
creation of shuffle masks. The .dshuf directive creates shuffle masks
according to the following rules:
If the string has length 4, the shuffle is word-sized; if 8, it is
half-word-sized; and if 16, it is byte-sized.
Upper-case letters select elements from the first input, lower-case ones
select elements from the second input.
’0’ produces zero bytes (0x00), ’X’ all-ones bytes (0xff) and ’8’ 0x80 bytes.
.dshuf "ABC0" ; 0x00010203_04050607_08090a0b_80808080
.dshuf "aX08" ; 0x10111213_c0c0c0c0_80808080_e0e0e0e0
.dshuf "aBC0aBC0" ; 0x1011_0203_0405_8080_1011_0203_0405_8080
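The frontend tool itself is not public, so the function below is only a guess at what .dshuf computes, reconstructed from the rules and the three examples above. The name dshuf and its Python form are assumptions made for illustration.

```python
def dshuf(pattern):
    """Hypothetical re-implementation of the .dshuf directive: pattern -> 16 mask bytes."""
    if len(pattern) not in (4, 8, 16):
        raise ValueError("pattern must have 4, 8 or 16 characters")
    size = 16 // len(pattern)                 # bytes per element: 4, 2 or 1
    special = {"0": 0x80, "X": 0xC0, "8": 0xE0}
    mask = bytearray()
    for ch in pattern:
        if ch in special:                     # constants: zeros, ones, 0x80
            mask += bytes([special[ch]] * size)
        elif ch.isupper():                    # element of the first input
            first = (ord(ch) - ord("A")) * size
            mask += bytes(range(first, first + size))
        elif ch.islower():                    # element of the second input
            first = 0x10 + (ord(ch) - ord("a")) * size
            mask += bytes(range(first, first + size))
        else:
            raise ValueError("unexpected character: " + ch)
    return bytes(mask)

word_mask = dshuf("ABC0").hex()
```

The special characters are checked before the upper-case test because 'X' would otherwise be read as an element name.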
34. SH: Another Shufb Example
We can create finite state machines, piping input into one end of the
quadword while reading the result out of another (e.g. the preferred
word). Here’s an example of such a “delay machine”:
; in the data section:
m_bcdA: .dshufb "bcdA"
; in the init section:
lqa s_bcdA, 0(m_bcdA)
; in the loop:
shufb state, input, state, s_bcdA ; state.x = state.y
; state.y = state.z
; state.z = state.w
; state.w = input.x
35. SH: Quadword Shift Left
These instructions take the shift amount from the preferred byte (byte 3) of a
register or from an immediate value, and shift the whole quadword to the left.
There are versions that shift by a number of bytes as well as by a number of
bits. For bit shifts, the shift amount is masked to the range 0 to 7.
SHift Left Quadword by BYtes
shlqby i, j, k ; i = j << ((k.b[3] & 0x1f) * 8)
SHift Left Quadword by BYtes Immediate
shlqbyi i, j, imm ; i = j << ((imm & 0x1f) * 8)
SHift Left Quadword by BYtes using BIt count
shlqbybi i, j, k ; i = j << (k.b[3] & 0xf8)
SHift Left Quadword by BIts
shlqbi i, j, k ; i = j << (k.b[3] & 0x07)
SHift Left Quadword by BIts Immediate
shlqbii i, j, imm ; i = j << (imm & 0x07)
36. SH: Quadword Rotate Left
These follow the same pattern as left shifts:
ROTate (left) Quadword by BYtes
rotqby i, j, k ; i = j <^ ((k.b[3] & 0x0f) * 8)
ROTate (left) Quadword by BYtes Immediate
rotqbyi i, j, imm ; i = j <^ ((imm & 0x0f) * 8)
ROTate (left) Quadword by BYtes using BIt count
rotqbybi i, j, k ; i = j <^ (k.b[3] & 0x78)
ROTate (left) Quadword by BIts
rotqbi i, j, k ; i = j <^ (k.b[3] & 0x07)
ROTate (left) Quadword by BIts Immediate
rotqbii i, j, imm ; i = j <^ (imm & 0x07)
37. SH: Quadword Shift Right
Ditto for right shifts, though as in the WS class, these are called rotates with
mask and take negative shift amounts:
ROTate and Mask Quadword by BYtes
rotmqby i, j, k ; i = j >> ((-k.b[3] & 0x1f) * 8)
ROTate and Mask Quadword by BYtes Immediate
rotmqbyi i, j, imm ; i = j >> ((-imm & 0x1f) * 8)
ROTate and Mask Quadword by BYtes using BIt count
rotmqbybi i, j, k ; i = j >> (-(k.b[3] & 0xf8) & 0xf8) (*)
ROTate and Mask Quadword by BIts
rotmqbi i, j, k ; i = j >> (-k.b[3] & 0x07)
ROTate and Mask Quadword by BIts Immediate
rotmqbii i, j, imm ; i = j >> (-imm & 0x07)
38. SH: Form Select Instructions
These instructions are designed to expand a small number of bits into many bits of ones,
and they are good for use with the selb operation.
Form Select Mask for Bytes Immediate
fsmbi i, u16 ; i.b[n] = ( (u16 << n) & 0x8000 ) ? 0xff : 0x00
Form Select Mask for Bytes
fsmb i, j ; i.b[n] = ( (j.h[1] << n) & 0x8000 ) ? 0xff : 0x00
Form Select Mask for Halfwords
fsmh i, j ; i.h[n] = ( (j.b[3] << n) & 0x80 ) ? 0xffff : 0x00
Form Select Mask for Words
fsm i, j ; i.w[n] = ( (j.b[3] << n) & 0x8 ) ? 0xffffffff : 0x00
Example:
fsmbi selABCd, 0x000f; make select mask to get XYZ from first arg
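As a quick check of the example, the Python sketch below models fsmbi and a byte-wise selb. Registers are 16-byte bytes objects; this is an illustrative model, not SPU code.

```python
def fsmbi(u16):
    """Model of 'fsmbi i, u16': bit n of u16 (counted from the top) expands to byte n."""
    return bytes(0xFF if (u16 << n) & 0x8000 else 0x00 for n in range(16))

def selb(j, k, l):
    """Model of 'selb i, j, k, l' on 16-byte registers."""
    return bytes((a & ~m & 0xFF) | (b & m) for a, b, m in zip(j, k, l))

sel_ABCd = fsmbi(0x000F)              # ones only in the last word
first = bytes([0x11] * 16)
second = bytes([0x22] * 16)
merged = selb(first, second, sel_ABCd)
```

With the mask 0x000f the first three words (x, y, z) come from the first argument and the last word (w) from the second.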
39. SH: Gather Bits Instructions
These are the opposite of the form select instructions, and can be used to
quickly pack results from comparison operators into compact bytes or
half-words. They all gather the rightmost bit from each element of the source
register and pack it into a single bit in the target.
Gather Bits from Bytes
gbb i, j ; i=0;for(n=0;n<16;n++){i.w[0]=(i.w[0]<<1)|(j.b[n]&1);}
Gather Bits from Halfwords
gbh i, j ; i=0;for(n=0;n< 8;n++){i.w[0]=(i.w[0]<<1)|(j.h[n]&1);}
Gather Bits (from Words)
gb i, j ; i=0;for(n=0;n< 4;n++){i.w[0]=(i.w[0]<<1)|(j.w[n]&1);}
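The three gather instructions differ only in element size, so one small Python function can model all of them. This is a sketch: the source register is given as a list of elements (16 bytes, 8 half-words or 4 words), and the return value is what ends up in the preferred word. It shifts before OR-ing, so element 0 lands in the most significant gathered bit and the last element in bit 0.

```python
def gather_bits(elements):
    """Model of gbb / gbh / gb: pack the lowest bit of each element, element 0 first."""
    value = 0
    for e in elements:
        value = (value << 1) | (e & 1)
    return value

# Typical use: compress a comparison result (all ones / all zeros per word).
compare_result = [0xFFFFFFFF, 0x00000000, 0x00000000, 0xFFFFFFFF]
packed = gather_bits(compare_result)   # one bit per word
```

A packed value of zero then means that no lane passed the comparison, which can be tested with a single branch on the preferred word.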
40. SH: How to generate masks for non-quadword stores.
As seen in the section on load/store, there are no non-quadword load/store
operations. A way to store a non-quadword value is to load the destination
quadword, shuffle the value with the loaded quadword, and store it back to the
same location. To simplify generating the shuffle masks for this,
there are a few instructions that generate these control masks:
Generate Controls for Byte Insertion (d-form)
cbd i, imm(j)
Generate Controls for Byte Insertion (x-form)
cbx i, j, k
41. SH: How to generate masks for non-quadword stores.
Generate Controls for Halfword Insertion (d-form)
chd i, imm(j)
Generate Controls for Halfword Insertion (x-form)
chx i, j, k
Generate Controls for Word Insertion (d-form)
cwd i, imm(j)
Generate Controls for Word Insertion (x-form)
cwx i, j, k
Generate Controls for Doubleword Insertion (d-form)
cdd i, imm(j)
Generate Controls for Doubleword Insertion (x-form)
cdx i, j, k
42. SH: How to generate masks for non-quadword stores.
Example: Store preferred byte into a table
lqx qword, table, offset
cbx mask, table, offset
shufb qword, value, qword, mask
stqx qword, table, offset
ai offset, offset, 1
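The read-modify-write sequence above can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: local store is a bytearray, registers are 16-byte bytes objects, and shufb handles only the selector range 0x00..0x1f that the insertion masks use. The mask layout in cbx (bytes 0x10..0x1f with 0x03 at the target slot) follows the byte-insertion control pattern.

```python
def cbx(base, offset):
    """Control mask for inserting the preferred byte at address base + offset."""
    mask = bytearray(range(0x10, 0x20))     # keep every byte of the second source...
    mask[(base + offset) & 0x0F] = 0x03     # ...except one slot: byte 3 of the first
    return bytes(mask)

def shufb(r0, r1, msk):
    """Reduced shufb model: selectors 0x00..0x1f only."""
    return bytes((r1 if x & 0x10 else r0)[x & 0x0F] for x in msk)

def store_byte(local_store, table, offset, value):
    addr = (table + offset) & ~0x0F                 # quadword loads/stores drop the low 4 bits
    qword = bytes(local_store[addr:addr + 16])      # lqx   qword, table, offset
    mask = cbx(table, offset)                       # cbx   mask, table, offset
    qword = shufb(value, qword, mask)               # shufb qword, value, qword, mask
    local_store[addr:addr + 16] = qword             # stqx  qword, table, offset

ls = bytearray(64)
store_byte(ls, 32, 5, bytes([0, 0, 0, 0xAB] + [0] * 12))
store_byte(ls, 32, 6, bytes([0, 0, 0, 0xCD] + [0] * 12))
```

Only the addressed byte changes; the other 15 bytes of the quadword are written back unmodified.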
43. SH: Reciprocal Estimate Instructions
The hardware supports two fast (4-cycle) estimate instructions that calculate the
reciprocal recip(x) = 1/x, or the reciprocal square root rsqrt(x) = 1/√x. These
instructions work in conjunction with the “fi” instruction that we’ll later explain
in detail. After the interpolation instruction, results are accurate to a precision
of 12 bits, which is about half the floating-point mantissa precision of 23 bits.
In order to improve the accuracy, one must perform another Taylor- or Euler-step.
Do note that:
sqrt(x) = √x = (√x · √x) / √x = |x| · (1/√x) = x · rsqrt(x),
since x ≥ 0, so there is no need for a separate square-root function.
44. Improving precision on the reciprocal function
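Assuming we have the input in the x register and the constant 1.0 in the one
register, we proceed to calculate
frest a, x ; a = table estimate for 1/x
fi b, x, a ; b is good to 12 bits precision
fnms c, b, x, one ; c = 1 - b * x (the remaining error)
fma b, c, b, b ; b = b + c * b, good to 24 bits precision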
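The refinement step can be tried out numerically. The Python sketch below is not a model of the hardware: the helper coarse is a made-up stand-in for frest + fi that simply rounds the exact answer to about 12 bits, and the arithmetic runs in double precision rather than SPU single precision. It only illustrates that the fnms/fma pair roughly doubles the number of correct bits.

```python
import math

def coarse(value, bits=12):
    """Stand-in for frest + fi: round value to about `bits` significant bits."""
    mantissa, exponent = math.frexp(value)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def refine_recip(x, b):
    """One refinement step for b ~ 1/x, mirroring the two instructions."""
    c = 1.0 - b * x        # fnms c, b, x, one
    return c * b + b       # fma  b, c, b, b

x = 3.0
b0 = coarse(1.0 / x)                 # about 12 good bits
b1 = refine_recip(x, b0)             # about 24 good bits
error_before = abs(b0 * x - 1.0)
error_after = abs(b1 * x - 1.0)
```

If the estimate has relative error e, then b * x = 1 + e and the step produces (1 + e)(1 - e) = 1 - e * e, so the error is squared.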
45. Improving precision on the reciprocal square-root function
frsqest a, x ; a = table estimate for 1/sqrt(x)
fi b, x, a ; b is good to 12 bits precision
fm c, b, x ; c = b * x (b and a can share register)
fm d, b, onehalf ; d = b / 2 (c and x can share register)
fnms c, c, b, one ; c = 1 - x * b * b
fma b, d, c, b ; b = b + d * c, good to 24 bits precision
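The same kind of numerical check works for the reciprocal square root. As before, this Python sketch is only an illustration: coarse is a made-up stand-in for frsqest + fi that rounds the exact value to about 12 bits, and the arithmetic is double precision, not SPU single precision.

```python
import math

def coarse(value, bits=12):
    """Stand-in for frsqest + fi: round value to about `bits` significant bits."""
    mantissa, exponent = math.frexp(value)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def refine_rsqrt(x, b):
    """One refinement step for b ~ 1/sqrt(x), mirroring the four instructions."""
    c = b * x              # fm   c, b, x
    d = b * 0.5            # fm   d, b, onehalf
    c = 1.0 - c * b        # fnms c, c, b, one
    return d * c + b       # fma  b, d, c, b

x = 2.0
b0 = coarse(1.0 / math.sqrt(x))
b1 = refine_rsqrt(x, b0)
sqrt_x = x * b1                      # sqrt(x) = x * rsqrt(x)
```

The last line uses the identity sqrt(x) = x · rsqrt(x), so a square root costs one extra multiply on top of the refined estimate.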
46. SH: Or Across - The Final Instruction
The last instruction in the SH class is a new addition.
Or Across
orx i, j ; i.w[0] = ( j.w[0] | j.w[1] | j.w[2] | j.w[3] );
i.w[1] = i.w[2] = i.w[3] = 0
47. Floating point / Integer Class (FI) [Even:7]
The FI class of instructions has a latency of 7 cycles and a throughput of one
instruction per cycle. These are all even instructions. There are basically three
types of instructions: integer multiplies, interpolations for reciprocal
calculations, and finally, fp/integer conversions.
48. FI: Integer Multiplies
multiply lower halves signed
mpy i, j, k ; i.w[n] = j.h[2n+1] * k.h[2n+1]
multiply lower halves signed immediate
mpyi i, j, s10 ; i.w[n] = j.h[2n+1] * ext(s10)
multiply lower halves unsigned
mpyu i, j, k ; i.w[n] = j.h[2n+1] * k.h[2n+1] (u)
multiply lower halves unsigned immediate (immediate sign-extends)
mpyui i, j, s10 ; i.w[n] = j.h[2n+1] * ext(s10)
49. FI: Integer Multiplies
multiply lower halves, add word
mpya i, j, k, l ; i.w[n] = j.h[2n+1] * k.h[2n+1] + l.w[n]
multiply lower halves, shift result down 16 with sign extend
mpys i, j, k ; i.w[n] = j.h[2n+1] * k.h[2n+1] >> 16
multiply upper half j by lower half k, shift up 16
mpyh i, j, k ; i.w[n] = j.h[2n] * k.h[2n+1] << 16
50. FI: Integer Multiplies
multiply upper halves signed
mpyhh i, j, k ; i.w[n] = j.h[2n] * k.h[2n]
multiply upper halves unsigned
mpyhhu i, j, k ; i.w[n] = j.h[2n] * k.h[2n] (u)
multiply/accumulate upper halves
mpyhha i, j, k ; i.w[n] += j.h[2n] * k.h[2n]
multiply/accumulate upper halves unsigned
mpyhhau i, j, k ; i.w[n] += j.h[2n] * k.h[2n] (u)
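None of these instructions multiplies two full 32-bit words, but a 32-bit product (the low 32 bits) can be assembled from three 16-bit multiplies and two adds. The Python sketch below models one lane of that idiom; the name mul32 is an illustrative choice, and mpyh is modelled as unsigned because its signedness does not affect the result once it is shifted up by 16 and wrapped to 32 bits.

```python
MASK32 = 0xFFFFFFFF

def mpyu(j, k):
    """One lane of 'mpyu': unsigned product of the two lower half-words."""
    return (j & 0xFFFF) * (k & 0xFFFF)

def mpyh(j, k):
    """One lane of 'mpyh': (upper half of j) * (lower half of k), shifted up 16."""
    return ((((j >> 16) & 0xFFFF) * (k & 0xFFFF)) << 16) & MASK32

def mul32(a, b):
    """Low 32 bits of a * b, using mpyh, mpyh, mpyu and two adds."""
    return (mpyh(a, b) + mpyh(b, a) + mpyu(a, b)) & MASK32

product = mul32(0x12345678, 0x9ABCDEF0)
```

Writing a = ah·2^16 + al and b = bh·2^16 + bl, the term ah·bh·2^32 falls outside the low 32 bits, which is why only the two cross terms and the low product are needed.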
51. FI: Conversions and FI instruction
fi a, b, c ; use after frest or frsqest
cuflt a, j, precis ; unsigned int to float
csflt a, j, precis ; signed int to float
cfltu i, b, precis ; float to unsigned int
cflts i, b, precis ; float to signed int
Here precis is the precision as an immediate, i.e. the number of fractional bits
of the integer value, so that e.g.
cuflt fp, val, 8 ; converts 0x80 into 0.5
Also, please note that these instructions saturate to the min and max values of
their precision.
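The role of the precision immediate is easiest to see as fixed-point scaling. The Python sketch below models one lane of the unsigned conversions under stated assumptions: it uses Python floats (double precision) instead of SPU single precision, and it assumes the float-to-integer direction truncates toward zero before saturating.

```python
def cuflt(word, precis):
    """One lane of 'cuflt a, j, precis': unsigned integer with precis fractional bits -> float."""
    return (word & 0xFFFFFFFF) / (1 << precis)

def cfltu(value, precis):
    """One lane of 'cfltu i, b, precis': float -> unsigned fixed point, saturating."""
    scaled = int(value * (1 << precis))       # int() truncates toward zero (assumed here)
    return max(0, min(0xFFFFFFFF, scaled))

half = cuflt(0x80, 8)                 # 0x80 / 256
back = cfltu(half, 8)                 # round trip
clamped_low = cfltu(-1.0, 0)          # negative input saturates to 0
clamped_high = cfltu(1e20, 0)         # too large saturates to 0xffffffff
```

With a precision of 0 the conversions are plain integer/float casts; a non-zero precision folds a multiply or divide by a power of two into the same instruction.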
52. Byte Operations (BO) [Even: 4]
There are a few interesting instructions that help with multimedia and
streaming logic.
Count Ones in Bytes
cntb i, j ; i.b[n] = numOneBits( j.b[n] )
Average Bytes
avgb i, j, k ; i.b[n] = ( j.b[n] + k.b[n] + 1 ) / 2
Absolute Difference in Bytes
absdb i, j, k ; i.b[n] = abs( j.b[n] - k.b[n] )
Sum Bytes into Half-words
sumb i, j, k ; i.h[0] = k.b[0] + k.b[1] + k.b[2] + k.b[3];
i.h[1] = j.b[0] + j.b[1] + j.b[2] + j.b[3];
:
i.h[6] = k.b[12] + k.b[13] + k.b[14] + k.b[15];
i.h[7] = j.b[12] + j.b[13] + j.b[14] + j.b[15];
53. Branch Class (BR) [Odd:-]
Branches on the SPU are costly. If a branch is taken and it has not been
predicted, there is an 18-cycle penalty so that the chip can restart the pipeline.
There is no penalty for falling through a non-predicted branch. However, if you
have predicted a branch and it is not taken, then there is also an 18-cycle
penalty. Branches and branch hints are all odd instructions.
Note: Even a static branch needs to be predicted.
Note: This is one of the reasons why diverging control-paths are so difficult to
optimize for.
54. BR: Unconditional Branches
Branch Relative
br brTo ; goto label address
Branch Relative and Set Link
brsl i, brTo ; gosub label address, i.w[0] = return address, (*)
Branch Indirect
bi i ; goto i.w[0]
Branch Indirect and Set Link
bisl i, j ; gosub j.w[0], i.w[0] = return address, (*)
BRanch Absolute
bra brTo ; goto brTo
BRanch Absolute and Set Link
brasl i, brTo ; gosub label address, i.w[0] = return address (*)
(*): These instructions have a 4 cycle latency for the return register. Note:
The bi instructions have enable/disable interrupt versions, e.g.: bie, bid,
bisle, bisld.
55. BR: Conditional Branches (Relative)
Branch on Zero
brz i, brTo ; branch if i.w[0] == 0
Branch on Not Zero
brnz i, brTo ; branch if i.w[0] != 0
Branch on Zero Halfword
brhz i, brTo ; branch if i.h[1] == 0
Branch on Not Zero Halfword
brhnz i, brTo ; branch if i.h[1] != 0
56. BR: Conditional Branches (Indirect)
Branch Indirect on Zero
biz i, j ; branch to j.w[0] if i.w[0] == 0
Branch Indirect on Not Zero
binz i, j ; branch to j.w[0] if i.w[0] != 0
Branch Indirect on Zero Halfword
bihz i, j ; branch to j.w[0] if i.h[1] == 0
Branch Indirect on Not Zero Halfword
bihnz i, j ; branch to j.w[0] if i.h[1] != 0
Note: These instructions can enable/disable interrupts as well.
57. BR: Interrupt & Misc
Interrupt RETurn
iret i ; Return from interrupt
Interrupt RETurn and Disable
iretd i ; Return from interrupt, disable interrupts
Interrupt RETurn and Enable
irete i ; Return from interrupt, enable interrupts
Branch Indirect and Set Link if External Data
bisled i, j ; gosub j if channel 0 is non-zero
58. Hint Branch Class (HB) [Odd:15]
If you know the most likely (or only) outcome for a branch, you can make sure
the branch is penalty free as long as the hint occurs at least 15 cycles before
the branch is taken. If the hint occurs later, there still may be a benefit, since
the penalty is lowered. However, if the hint arrives less than 4 cycles before the
branch, there is no benefit.
Please note that it also turns out that there is a hardware bug w.r.t. the hbr
instructions. One cannot hint a branch where the branch targets forwards and
is also within the same 64-byte block as the branch.
59. Hint Branch Instructions
Hint Branch (Immediate)
hbr brFrom, j ; branch hint for any BIxxx type branch
Hint Branch Absolute
hbra brFrom, brTo ; branch hint for any BRAxxx type branch
Hint Branch Relative
hbrr brFrom, brTo ; branch hint for any BRxxx type branch
Hint Branch Prefetch
hbrp ; inline prefetch code (*)
(*) allows 15 LS instructions in a row without any instruction fetch stall.
60. CH: DMA Channel Ops
We will explain these in further talks, but for completeness we’ve included
them here. They are all odd instructions with a latency of 6 cycles. Note that
the latency may actually be much higher if channels are not ready.
Read from Channel
rdch i, chn ; read i from channel chn
Write to Channel
wrch chn, i ; write i into channel chn
Read Channel Count
rdchcnt i, chn; read channel count for channel chn into i
61. DP: Double Precision
DP instructions have a latency of 13 cycles and are even. However, each one stalls
the pipeline for 6 cycles (that is, all currently executing instructions are
halted) while it is executed. Therefore, we do not recommend using
double precision at all!
62. Questions?
That’s all folks!