Cache-Aware Hybrid Sorter

         Manny Ko
Outline
• Sorting in CG
• Quick radix sort refresher
• Issues with radix sort
  – Incoherent memory access during parts of it
  – Originally only for integers
• Two-phase sort
  – Cache-aware stream splitting
  – Cache friendly merge using Loser Tree
  – Several times faster than STL sort
Sort in CG
•   Depth-sort for transparency Patney [2010]
•   Better Z-cull
•   Collision detection [Lin 2000]
•   Minimizing state-changes
•   Ray coherency Garanzha & Loop [2010]
•   HPC to handle irregular workloads
•   PBGI ?
Inspirations
•   Out-of-core sorts, e.g. AlphaSort Nyberg[95]
•   GPU based stream processing
•   Cache-aware algorithms
•   Came out of my work on fast kd-tree builder
Importance of Memory
•   GPUs and CPU cores keep getting faster
•   Tons of cores and more are coming
•   For GFLOPS Moore’s Law still holds
•   NOT for bandwidth to memory
    – While GFLOPS doubles or triples every 18 months
    – Bandwidth barely moves (~15%)
• Bandwidth equals power; pushing electrons
Real-time Rendering
• Have been focusing on cache and memory
  patterns for a while
• CG researchers like Ingo Wald et al. have
  tackled that in ray-tracing
STL Sort
• Quicksort based
  – Memory access pattern less than ideal
  – Not sequential and lots of branching
• Will not dwell too much on this
Radix Sort
• The only practical O(dN) sort algorithm
  – d is the # of radix digits, e.g. for a 32-bit
    word and 1 bit per pass, d is 32
• No branching (almost) at least for integers
Counting Sort – Pass 1
• For radix = 2 we allocate two counters
• Each pass we go through the input and count
  the # of inputs with a 0 or a 1 in that digit
• Extract digit (1 bit) and use that as index to
  increment the right counter – no branching
• d is a key design parameter
Pass 2 - Scatter
• At the end of the pass the count of 0s gives
  us the offset where the 1s begin
• We go through the input using the counters to
  guide where each element scatters into the output buffer
Number of Passes
• In the original radix sort each radix digit
  requires 1 counting pass and 1 scatter pass
• Swap input and output; repeat d times
• Each of the passes is a stable-sort
Prefix-Sum
• Radix-2 is simple; in general we have to
  compute the prefix-sum for the counters
• Key building block for GPU computing
• A big topic on its own
• Our array is only 256 entries long, so we didn’t
  use a fancy SIMD method
Access Patterns
• Pass 1 – pure sequential read. Good 
  – Very parallelizable too.
• Pass 2 – random scatter. Not so good 
• Each pass requires one complete round trip
  from and to memory
Random Scatter
• Idea: utilize the cache
• Split the input into sub-streams
• Sub-streams defined by cache size/fast
  memory
Cache Resident Passes
• When we swap input and outputs
  – Output from previous pass still in cache 
Stream Merging
• Sorted sub-streams will be merged
• Merge is streaming friendly:
  – Inputs are read sequentially
  – Output is generated sequentially
• This is where the fun is
• We will get back to this. I promise.
Cache-Aware Hybrid Sort
• Cache-aware because we use the actual cache
  size of the machine to split the input
• Hybrid: radix sort sub-streams then merge
Cache sizing
• cpuid instruction
• code in the book ‘Game Engine Gems II’, AK
  Peters 2011.
Stream Splitting
• Depends on # of threads
• General strategy is to keep the output of each
  scatter pass completely within the cache
Substream Sorting
• Each byte is a digit
• Radix-256 sort – allocate 256 counters
   – 1 KB, or 2 KB with 64-bit counters; fits in L1 cache
   – Actually we allocate 4 sets of counters
• d is logically 4, but we fill all 4 counter sets in 1 read pass
• form the 4 sets of prefix-sums
• 4 scatter-passes
Floats
•   Radix-sort was originally designed for ints
•   What if we treat a float as an int? Casting?
•   Almost works, if all the floats are positive
•   IEEE floats are sign-exponent-mantissa
•   The sign bit makes all negative numbers
    appear larger than the positive ones
Float example
 2.0 is 0x40000000
-2.0 is 0xc0000000
-4.0 is 0xc0800000

Which implies -4.0 > -2.0 > 2.0,
just the opposite of what we want 
Terdiman’s Solution
• The usual solution [Terdiman 2000] treats the high
  byte specially and uses a test in the inner loop
• Modern CPUs do not like branching
• GPUs like it even less
Herf’s Hack
1. always invert the sign bit
2. If the sign bit was set, then invert the
  exponent and mantissa

    2.0 is 0x40000000 -> 0xc0000000
   -2.0 is 0xc0000000 -> 0x3fffffff
   -4.0 is 0xc0800000 -> 0x3f7fffff


   We get the correct ordering 
Herf’s FloatFlip
U32 FloatFlip(U32 f)
{
  U32 mask = -int32(f >> 31) | 0x80000000;
  return (f ^ mask);
}
My Version
int32 mask = (int32(f) >> 31) | 0x80000000;


 Utilizes the sign extension of arithmetic right
 shifts on signed numbers. Generates better code.
Parallel Sorting
• Each substream can be sorted in parallel
• We allocate 1 core per substream
• We size the substream so that it fits into each
  core’s L2 or L1 cache (or GPU shared memory)
• At the end of substream sort phase we have
  read the input from memory (disk) twice
/*!
 -- RadixSorter: a builder class to aid with the use of radix-sorter.
 --       It splits the input stream into substreams that fit into cache.
 --       Mostly it holds the indices and temporaries for reuse.
 --       It currently only supports sorting of <key,index> pairs. Caller can either
 --       request the sorted indices or have the original values moved.
*/
class RadixSorter {
          typedef size_t* Indices;
          static const size_t kStreams    = 4;
public:
          static const size_t kNumThreads = 4;               // # of threads

         RadixSorter( int count );
         ~RadixSorter();

         /// reallocate internal storage to prepare for a stream of length 'count':
         void       Resize( int count );
         /// deallocate all storage:
         void       Clear();

         /// initialize the sorter for 'values' :
         void        SortInit( float* values, int count );
         /// sort 'values' :
         void        Sort( float* values, int count );

          /// sort sub-stream 's':
         void       SortStream( int s );
         void       MergeStreams();
public:
          size_t   m_blockSizes[kStreams];   //!< size of each sub-stream
          float*   m_streams[kStreams];      //!< our sub-streams of work
          float*   m_temp[kNumThreads];      //!< working buffer carved from output buffer
          float*   m_outbuf;
          int      m_count;                  //!< max size of the input sequence
          bool     m_inited;
};
Stream Merging
• Usually performed using a priority-queue,
  most likely a heap-based PQ
• I tried to find the best PQ implementation
• Disappointingly, the gain from radix-sort was
  almost negated by the merge phase
Loser-Tree
• Comes to the rescue
• Thanks Knuth 
  – The Art of Computer Programming Vol. 3
• Almost forgotten and I am a Knuth fan
• It is a kind of tournament-tree
Tournament Tree
• Single elimination
• Loser-tree is a tournament tree where the
  loser is kept in each round
• Winner moves on (in a register)
Our Tree
• Each node consists of a float key and a
  stream_id payload
• Linearized binary-tree, no pointers
  – Navigation up and down is by shifts and adds
• Initialized by inserting the head of each
  substream into the tree
  – Size_of_tree = 2 x # of substreams
• Let the play begin!
Winner
• Winner rises to the top
  – We remove the winner and output the key
• We use the winner’s stream_id to pull the
  next item from the stream
• Key idea: new winner can only come from
  players that had faced the previous winner –
  i.e. the path from the root to the original
  position of the winner
Repeat Matches
• Repeat those matches, a new winner emerges
Access Pattern of Merge
• Each substream is accessed sequentially
• Output is written sequentially
• Modern CPUs and GPUs like these sorts of
  patterns due to their pre-fetch, write-coalesce
  and caching logic
• Tree is small and fits into the L1 cache or even
  register file
Performance (1 core)
• Serially sort all substreams
• Merge using Loser-Tree on same thread
• Small data set: 2.1 to 3.5 times faster than STL
  – The poor access pattern of quicksort is less
    problematic when everything fits into cache
Scalability (4-cores)
              Threaded   Serial     Threaded   Serial
              (Q6600)    (Q6600)    (i7)       (i7)

1 stream      5.12       5          3.89       3.62
2 streams     6.90       10.04      4.20       7.1
3 streams     8.08       15.07      4.56       10.69
4 streams     10.97      20.55      4.86       14.2
4 + merge     16.4       26.01      9.61       19.0

(times in ms)
Multi-Core Performances
• One million entries: Q6600
  – STL took 76 ms
  – radix-sort 28 ms
  – 4-core: 16.4 ms, of which 5 to 6 ms is the merge
• One million entries: i7
  – STL: 58 ms
  – hybrid: 9.6 ms
• 6 times faster than STL
Threading Overhead
• The 1-stream vs. serial time is 5.12 vs. 5
  – So only 0.12 ms of threading overhead
Related Work
• Funnel-Sort, Brodal [2008]
• GPU radix-sort, Satish [2009]
