2. Outline
• Sorting in CG
• Quick radix sort refresher
• Issues with radix sort
– Incoherent memory access during parts of it
– Originally only for integers
• Two-phase sort
– Cache-aware stream splitting
– Cache-friendly merge using a Loser Tree
– Several times faster than STL sort
3. Sort in CG
• Depth-sort for transparency [Patney 2010]
• Better Z-cull
• Collision detection [Lin 2000]
• Minimizing state-changes
• Ray coherency [Garanzha & Loop 2010]
• HPC to handle irregular workloads
• PBGI ?
4. Inspirations
• Out-of-core sorts, e.g. AlphaSort [Nyberg 1995]
• GPU-based stream processing
• Cache-aware algorithms
• Came out of my work on fast kd-tree builder
5. Importance of Memory
• GPUs and CPU cores keep getting faster
• Tons of cores and more are coming
• For GFLOPS Moore’s Law still holds
• NOT for bandwidth to memory
– While GFLOPS doubles or triples every 18 months
– Bandwidth barely moves (~15%)
• Bandwidth costs power: it takes energy to push electrons around
6. Real-time Rendering
• Have been focusing on cache and memory patterns for a while
• CG researchers like Ingo Wald et al. have tackled that in ray tracing
7. STL Sort
• Quicksort based
– Memory access pattern less than ideal
– Not sequential and lots of branching
• Will not dwell too much on this
8. Radix Sort
• The only practical O(dN) sort algorithm
– d is the # of radix digits; e.g. for a 32-bit word and 1 bit per pass, d = 32
• (Almost) no branching, at least for integers
9. Counting Sort – Pass 1
• For radix = 2 we allocate two counters
• Each pass we go through the input and count how many keys have a 0 or a 1 in the current digit
• Extract the digit (1 bit) and use it as the index to increment the right counter – no branching
• d is a key design parameter
10. Pass 2 - Scatter
• At the end of the pass, the counter for the 0s gives us the offset at which to insert the 1s
• We go through the input again, using the counters to guide where to scatter into the output buffer
11. Number of Passes
• In the original radix sort, each radix digit requires 1 counting pass through the input and 1 scatter pass
• Swap input and output; repeat d times
• Each of the passes is a stable-sort
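Slides 9–11 can be condensed into a minimal sketch of the radix-2 counting sort; `radixPass` and `radixSort` are illustrative names of mine, not from the talk:

```cpp
#include <cstdint>
#include <cstddef>
#include <cassert>
#include <vector>

// One pass per bit: count the 0s and 1s of digit 'bit' (pass 1),
// then scatter into 'out' using the counts as offsets (pass 2).
static void radixPass(const std::vector<uint32_t>& in,
                      std::vector<uint32_t>& out, int bit)
{
    size_t count[2] = {0, 0};
    for (uint32_t v : in)                 // pass 1: sequential read, no branching
        ++count[(v >> bit) & 1];
    size_t offset[2] = {0, count[0]};     // 0s start at 0, 1s after all the 0s
    for (uint32_t v : in)                 // pass 2: scatter to the output buffer
        out[offset[(v >> bit) & 1]++] = v;
}

std::vector<uint32_t> radixSort(std::vector<uint32_t> a)
{
    std::vector<uint32_t> b(a.size());
    for (int bit = 0; bit < 32; ++bit) {  // d = 32 passes for radix-2
        radixPass(a, b, bit);
        a.swap(b);                        // output becomes next pass's input
    }
    return a;                             // after an even # of swaps, 'a' holds the result
}
```

Each `radixPass` is stable, which is what lets the later digits refine the order established by the earlier ones.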
12. Prefix-Sum
• Radix-2 is simple; in general we have to compute the prefix-sum of the counters
• Key building block for GPU computing
• A big topic on its own
• Our array is only 256 entries long, so we didn't use a fancy SIMD method
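For the radix-256 case, the prefix-sum over the 256 counters is just a scalar loop; `prefixSum` is an illustrative name of mine:

```cpp
#include <cstddef>
#include <cassert>

// Turn 256 digit counts into starting offsets via an exclusive prefix-sum.
// For an array this small, a plain scalar loop is fine; no SIMD needed.
void prefixSum(const size_t count[256], size_t offset[256])
{
    size_t sum = 0;
    for (int i = 0; i < 256; ++i) {
        offset[i] = sum;   // bucket i starts after all smaller digits
        sum += count[i];
    }
}
```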
14. Access Patterns
• Pass 1 – pure sequential read. Good
– Very parallelizable too.
• Pass 2 – random scatter. Not so good
• Each pass requires one complete round trip from and to memory
15. Random Scatter
• Idea: utilize the cache
• Split the input into sub-streams
• Sub-streams are sized by the cache / fast-memory size
16. Cache Resident Passes
• When we swap input and outputs
– Output from previous pass still in cache
17. Stream Merging
• Sorted sub-streams will be merged
• Merge is streaming friendly:
– Inputs are read sequentially
– Output is generated sequentially
• This is where the fun is
• We will get back to this. I promise.
18. Cache-Aware Hybrid Sort
• Cache-aware because we use the actual cache size of the machine to split the input
• Hybrid: radix sort sub-streams then merge
19. Cache sizing
• cpuid instruction
• Code in the book 'Game Engine Gems II', AK Peters, 2011
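The talk queries the cache via the cpuid instruction; as a rough portable alternative, glibc also exposes the cache size through sysconf. This is my sketch, and the 32 KB fallback is my assumption, not from the talk:

```cpp
#include <unistd.h>
#include <cstddef>
#include <cassert>

// Query the L1 data cache size; fall back to a conservative 32 KB guess
// on platforms that do not report it. (The talk's code uses cpuid directly.)
size_t l1DataCacheSize()
{
#ifdef _SC_LEVEL1_DCACHE_SIZE
    long sz = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    if (sz > 0)
        return static_cast<size_t>(sz);
#endif
    return 32 * 1024;  // common L1d size; a safe lower bound
}
```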
20. Stream Splitting
• Depends on # of threads
• General strategy is to keep the output of each
scatter pass completely within the cache
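One way to derive the substream count from the cache size and thread count might look like the following; this is my sketch, and the exact splitting strategy in the talk may differ:

```cpp
#include <algorithm>
#include <cstddef>
#include <cassert>

// Pick a substream count so that each scatter pass's output stays within the
// cache, then round up to a multiple of the thread count so every core has
// work. numSubstreams is an illustrative helper name.
size_t numSubstreams(size_t inputBytes, size_t cacheBytes, size_t numThreads)
{
    size_t byCache = (inputBytes + cacheBytes - 1) / cacheBytes;        // ceil
    size_t rounded = ((byCache + numThreads - 1) / numThreads) * numThreads;
    return std::max(rounded, numThreads);
}
```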
21. Substream Sorting
• Each byte is a digit
• Radix-256 sort – allocate 256 counters
– 1 KB (32-bit counters) or 2 KB (64-bit); fits in the L1 cache
– Actually we allocate 4 sets of counters
• d is logically 4, but we do all the counting in 1 pass
• Form the 4 sets of prefix-sums
• 4 scatter-passes
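Building all four radix-256 histograms in the single counting pass can be sketched as below; `countAllDigits` is my name for it:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <cassert>

// Build all four radix-256 histograms (one per byte digit) in a single
// sequential pass over the input, instead of four separate counting passes.
void countAllDigits(const uint32_t* in, size_t n, size_t count[4][256])
{
    std::memset(count, 0, sizeof(size_t) * 4 * 256);
    for (size_t i = 0; i < n; ++i) {
        uint32_t v = in[i];
        ++count[0][v & 0xff];            // digit 0: low byte
        ++count[1][(v >> 8) & 0xff];
        ++count[2][(v >> 16) & 0xff];
        ++count[3][v >> 24];             // digit 3: high byte
    }
}
```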
22. Floats
• Radix sort was originally designed for ints
• What if we treat a float as an int? Just cast?
• Almost works – if all the floats are positive
• IEEE floats are laid out sign-exponent-mantissa
• The sign bit makes every negative number appear larger than the positive ones
23. Float example
2.0 is 0x40000000
-2.0 is 0xc0000000
-4.0 is 0xc0800000
Which implies -4.0 > -2.0 > 2.0,
just the opposite of what we want
24. Terdiman’s Solution
• The usual solution [Terdiman 2000] treats the high byte specially and uses a test in the inner loop
• Modern CPUs do not like branching
• GPUs like it even less
25. Herf’s Hack
1. Always invert the sign bit
2. If the sign bit was set, also invert the exponent and mantissa
2.0 is 0x40000000 -> 0xc0000000
-2.0 is 0xc0000000 -> 0x3fffffff
-4.0 is 0xc0800000 -> 0x3f7fffff
We get the correct ordering
27. My Version
int32 mask = (int32(u) >> 31) | 0x80000000;
u ^= mask;
Here u holds the float's bit pattern. Utilize sign extension when shifting a signed number: for a negative float the shift yields all 1s and the whole word gets flipped; for a positive one it yields 0 and only the sign bit flips. Generates better (branch-free) code.
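Put together, a branch-free float-to-sortable-uint mapping along these lines can be sketched as follows; `floatFlip` is my name, and `memcpy` is used for the bit reinterpretation:

```cpp
#include <cstdint>
#include <cstring>
#include <cassert>

// Map IEEE-754 float bits to a uint32 whose unsigned order matches the float
// order. The arithmetic shift sign-extends, so the mask is all 1s for
// negative floats and just the sign bit for positive ones.
uint32_t floatFlip(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);  // reinterpret the bits, don't convert
    uint32_t mask = uint32_t(int32_t(u) >> 31) | 0x80000000u;
    return u ^ mask;
}
```

On the slide-23 values this yields 0x3f7fffff for -4.0, 0x3fffffff for -2.0 and 0xc0000000 for 2.0, so -4.0 < -2.0 < 2.0 as desired.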
28. Parallel Sorting
• Each substream can be sorted in parallel
• We allocate 1 core per substream
• We size the substreams so that each fits into a core's L2 or L1 cache (or GPU shared memory)
• At the end of substream sort phase we have
read the input from memory (disk) twice
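A sketch of the one-thread-per-substream phase, with `std::sort` standing in for the cache-resident radix sort (names are mine, not from the talk):

```cpp
#include <algorithm>
#include <thread>
#include <vector>
#include <cassert>

// One thread per substream, as in the talk; each worker sorts its own
// substream independently, so no synchronization is needed until the join.
void sortSubstreamsInParallel(std::vector<std::vector<float>>& streams)
{
    std::vector<std::thread> workers;
    for (auto& s : streams)
        workers.emplace_back([&s] { std::sort(s.begin(), s.end()); });
    for (auto& t : workers)
        t.join();   // wait for every substream before the merge phase
}
```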
29. /*!
-- RadixSorter: a builder class to aid with the use of the radix sorter.
-- It splits the input stream into substreams that fit into cache.
-- Mostly it holds the indices and temporaries for reuse.
-- It currently only supports sorting of <key,index> pairs. The caller can either
-- request the sorted indices or request the original values to be moved.
*/
30. class RadixSorter {
    typedef size_t* Indices;
    static const size_t kStreams = 4;
public:
    static const size_t kNumThreads = 4; //!< # of threads
    RadixSorter( int count );
    ~RadixSorter();
    /// reallocate internal storage to prepare for a stream of length 'count':
    void Resize( int count );
    /// deallocate all storage:
    void Clear();
    /// initialize the sorter for 'values':
    void SortInit( float* values, int count );
    /// sort 'values':
    void Sort( float* values, int count );
    /// sort sub-stream 's':
    void SortStream( int s );
    void MergeStreams();
31. public:
    size_t m_blockSizes[kStreams]; //!< size of each sub-stream
    float* m_streams[kStreams];    //!< our sub-streams of work
    float* m_temp[kNumThreads];    //!< working buffers carved from the output buffer
    float* m_outbuf;
    int    m_count;                //!< max size of the input sequence
    bool   m_inited;
};
32. Stream Merging
• Usually performed using a priority queue, most likely a heap-based PQ
• I tried to find the best PQ implementation
• Disappointing: the gain from the radix sort was almost negated by the merge phase
33. Loser-Tree
• Comes to the rescue
• Thanks Knuth
– The Art of Computer Programming Vol. 3
• Almost forgotten and I am a Knuth fan
• It is a kind of tournament-tree
34. Tournament Tree
• Single elimination
• A loser tree is a tournament tree where the loser of each round is kept in the node
• Winner moves on (in a register)
35. Our Tree
• Each node consists of a float key and a stream_id payload
• Linearized binary-tree, no pointers
– Navigation up and down is by shifts and adds
• Initialized by inserting the head of each substream into the tree
– size of tree = 2 × # of substreams
• Let the play begin!
36. Winner
• Winner rises to the top
– We remove the winner and output the key
• We use the winner's stream_id to pull the next item from that stream
• Key idea: the new winner can only come from players that faced the previous winner – i.e. the nodes on the path from the winner's original position up to the root
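A minimal linearized loser tree along these lines is sketched below, under my naming; the talk's implementation details may differ. Exhausted streams feed +infinity so the merge simply runs for the total element count:

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>
#include <cassert>

// tree[1..k-1] keep each round's loser, tree[0] the current winner;
// leaf s's parent is (s + k) / 2, so navigation is shifts and adds.
class LoserTree {
    struct Node { float key; int stream; };
    std::vector<Node> tree;               // linearized binary tree, no pointers
    std::vector<const float*> pos, end;

    float pull(int s) {                   // next key of stream s, +inf when exhausted
        return pos[s] != end[s] ? *pos[s]++
                                : std::numeric_limits<float>::infinity();
    }
    void replay(int s, float key) {       // play from leaf s up to the root
        Node cur = {key, s};
        int k = int(tree.size());
        for (int t = (s + k) / 2; t > 0; t /= 2)
            if (tree[t].key < cur.key)
                std::swap(tree[t], cur);  // smaller key wins and keeps rising
        tree[0] = cur;                    // survivor is the new winner
    }
public:
    explicit LoserTree(const std::vector<std::vector<float>>& streams) {
        int k = int(streams.size());
        tree.assign(k, Node{-std::numeric_limits<float>::infinity(), -1});
        for (const auto& s : streams) {
            pos.push_back(s.data());
            end.push_back(s.data() + s.size());
        }
        for (int s = 0; s < k; ++s)       // initial rounds displace the -inf dummies
            replay(s, pull(s));
    }
    float pop() {                         // winner out, its successor played in
        Node w = tree[0];
        replay(w.stream, pull(w.stream)); // only the winner's root path is replayed
        return w.key;
    }
};

std::vector<float> mergeStreams(const std::vector<std::vector<float>>& streams) {
    size_t total = 0;
    for (const auto& s : streams) total += s.size();
    LoserTree t(streams);
    std::vector<float> out;
    out.reserve(total);
    for (size_t i = 0; i < total; ++i)
        out.push_back(t.pop());           // sequential write
    return out;
}
```

Note that `pop` replays only the winner's leaf-to-root path, which is exactly the "new winner can only come from players that faced the previous winner" observation above.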
38. Access Pattern of Merge
• Each substream is accessed sequentially
• Output is written sequentially
• Modern CPUs and GPUs like these patterns, thanks to their prefetch, write-coalescing and caching logic
• The tree is small and fits into the L1 cache or even the register file
39. Performance (1 core)
• Serially sort all substreams
• Merge using Loser-Tree on same thread
• Small data sets: 2.1-3.5 times faster than STL
– Quicksort's poor access pattern matters less when everything fits into cache
41. Multi-Core Performances
• One million entries (Q6600):
– STL: 76 ms
– radix sort: 28 ms
– 4 cores: 16.4 ms, of which 5-6 ms is the merge
• One million entries (Core i7):
– STL: 58 ms
– hybrid: 9.6 ms
• 6 times faster than STL
42. Threading Overhead
• Sorting 1 stream on a thread vs. serially: 5.12 vs. 5 (ms)
– So only 0.12 ms of threading overhead