Cache-Aware Hybrid Sorter

         Manny Ko
Outline
• Sorting in CG
• Quick radix sort refresher
• Issues with radix sort
  – Incoherent memory access during parts of it
  – Originally only for integers
• Two-phase sort
  – Cache-aware stream splitting
  – Cache friendly merge using Loser Tree
  – Several times faster than STL sort
Sort in CG
•   Depth-sort for transparency Patney [2010]
•   Better Z-cull
•   Collision detection [Lin 2000]
•   Minimizing state-changes
•   Ray coherency Garanzha & Loop [2010]
•   HPC to handle irregular workloads
•   PBGI ?
Inspirations
•   Out-of-core sorts, e.g. AlphaSort Nyberg[95]
•   GPU based stream processing
•   Cache-aware algorithms
•   Came out of my work on fast kd-tree builder
Importance of Memory
•   GPUs and CPU cores keep getting faster
•   Tons of cores and more are coming
•   For GFLOPS Moore’s Law still holds
•   NOT for bandwidth to memory
    – While GFLOPS doubles or triples every 18 months
    – Bandwidth barely moves (~15%)
• Bandwidth equals power; pushing electrons
Real-time Rendering
• Have been focusing on cache and memory
  patterns for a while
• CG researchers like Ingo Wald et al. have
  tackled that in ray-tracing
STL Sort
• Quicksort based
  – Memory access pattern less than ideal
  – Not sequential and lots of branching
• Will not dwell too much on this
Radix Sort
• The only practical O(dN) sort algorithm
  – d is the # of radix digits, e.g. for a 32-bit
    word and 1 bit per pass, d is 32
• No branching (almost) at least for integers
Counting Sort – Pass 1
• For radix = 2 we allocate two counters
• Each pass we go through the input and count
  the # of inputs with a 0 or a 1 in that digit
• Extract digit (1 bit) and use that as index to
  increment the right counter – no branching
• d is a key design parameter
Pass 2 - Scatter
• At the end of the pass the count of 0s gives
  us the offset where the 1s begin
• We go through the input using the counters to
  guide where each element scatters into the output buffer
Number of Passes
• In the original radix sort each radix digit
  requires 1 counting pass and 1 scatter pass
• Swap input and output; repeat d times
• Each of the passes is a stable-sort
Prefix-Sum
• Radix-2 is simple; in general we have to
  compute the prefix-sum for the counters
• Key building block for GPU computing
• A big topic on its own
• Our array is only 256 entries long, so we didn’t
  use a fancy SIMD method
Access Patterns
• Pass 1 – pure sequential read. Good 
  – Very parallelizable too.
• Pass 2 – random scatter. Not so good 
• Each pass requires one complete round trip
  from and to memory
Random Scatter
• Idea: utilize the cache
• Split the input into sub-streams
• Sub-streams defined by cache size/fast
  memory
Cache Resident Passes
• When we swap input and outputs
  – Output from previous pass still in cache 
Stream Merging
• Sorted sub-streams will be merged
• Merge is streaming friendly:
  – Inputs are read sequentially
  – Output is generated sequentially
• This is where the fun is
• We will get back to this. I promise.
Cache-Aware Hybrid Sort
• Cache-aware because we use the actual cache
  size of the machine to split the input
• Hybrid: radix sort sub-streams then merge
Cache sizing
• cpuid instruction
• code in the book ‘Game Engine Gems II’, AK
  Peters 2011.
Stream Splitting
• Depends on # of threads
• General strategy is to keep the output of each
  scatter pass completely within the cache
Substream Sorting
• Each byte is a digit
• Radix-256 sort – allocate 256 counters
   – 1 KB, or 2 KB with 64-bit counters; fits in L1 cache
   – Actually we allocate 4 sets of counters
• d is logically 4, but we fill all 4 counter sets in 1 read pass
• form the 4 sets of prefix-sums
• 4 scatter-passes
Floats
•   Radix-sort was originally designed for ints
•   What if we treat a float as an int? Casting?
•   Almost works, if all the floats are positive
•   IEEE floats are sign-exponent-mantissa
•   The sign bit makes all negative numbers
    appear larger than the positive ones
Float example
 2.0 is 0x40000000
-2.0 is 0xc0000000
-4.0 is 0xc0800000

Which implies -4.0 > -2.0 > 2.0,
just the opposite of what we want 
Terdiman’s Solution
• The usual solution [Terdiman 2000] treats the high
  byte specially and uses a test in the inner loop
• Modern CPUs do not like branching
• GPUs like it even less
Herf’s Hack
1. always invert the sign bit
2. If the sign bit was set, then invert the
  exponent and mantissa

    2.0 is 0x40000000 -> 0xc0000000
   -2.0 is 0xc0000000 -> 0x3fffffff
   -4.0 is 0xc0800000 -> 0x3f7fffff


   We get the correct ordering 
Herf’s FloatFlip
U32 FloatFlip(U32 f)
{
  U32 mask = -int32(f >> 31) | 0x80000000;
  return (f ^ mask);
}
My Version
int32 mask = (int32(f) >> 31) | 0x80000000;


 Utilizes the sign extension of arithmetic right
 shifts on signed numbers. Generates better code.
Parallel Sorting
• Each substream can be sorted in parallel
• We allocate 1 core per substream
• We size the substream so that it fits into each
  core’s L2 or L1 cache (or GPU shared memory)
• At the end of substream sort phase we have
  read the input from memory (disk) twice
/*!
 -- RadixSorter: a builder class to aid with the use of radix-sorter.
 --       It splits the input stream into substreams that fit into cache.
 --       Mostly it holds the indices and temporaries for reuse.
 --       It currently only supports sorting of <key,index> pairs. Caller can either
 --       request the sorted indices or have the original values moved.
*/
class RadixSorter {
          typedef size_t* Indices;
          static const size_t kStreams    = 4;
public:
          static const size_t kNumThreads = 4;               // # of threads

         RadixSorter( int count );
         ~RadixSorter();

         /// reallocate internal storage to prepare for a stream of length 'count':
         void       Resize( int count );
         /// deallocate all storage:
         void       Clear();

         /// initialize the sorter for 'values' :
         void        SortInit( float* values, int count );
         /// sort 'values' :
         void        Sort( float* values, int count );

          /// sort sub-stream 's':
         void       SortStream( int s );
         void       MergeStreams();
public:
          size_t   m_blockSizes[kStreams];   //!< size of each sub-stream
          float*   m_streams[kStreams];      //!< our sub-streams of work
          float*   m_temp[kNumThreads];      //!< working buffer carved from output buffer
          float*   m_outbuf;
          int      m_count;                  //!< max size of the input sequence
          bool     m_inited;
};
Stream Merging
• Usually performed using a priority-queue,
  most likely a heap-based PQ
• I tried to find the best PQ implementation
• Disappointingly, the gain from radix-sort was
  almost negated by the merge phase
Loser-Tree
• Comes to the rescue
• Thanks Knuth 
  – The Art of Computer Programming Vol. 3
• Almost forgotten and I am a Knuth fan
• It is a kind of tournament-tree
Tournament Tree
• Single elimination
• Loser-tree is a tournament tree where the
  loser is kept in each round
• Winner moves on (in a register)
Our Tree
• Each node consists of a float key and a
  stream_id payload
• Linearized binary-tree, no pointers
  – Navigation up and down is by shifts and adds
• Initialized by inserting the head of each
  substream into the tree
  – Size_of_tree = 2 x # of substreams
• Let the play begin!
Winner
• Winner rises to the top
  – We remove the winner and output the key
• We use the winner’s stream_id to pull the
  next item from the stream
• Key idea: new winner can only come from
  players that had faced the previous winner –
  i.e. the path from the root to the original
  position of the winner
Repeat Matches
• Repeat those matches, a new winner emerges
Access Pattern of Merge
• Each substream is accessed sequentially
• Output is written sequentially
• Modern CPUs and GPUs like these sorts of
  patterns due to their pre-fetch, write-coalesce
  and caching logic
• Tree is small and fits into the L1 cache or even
  register file
Performance (1 core)
• Serially sort all substreams
• Merge using Loser-Tree on same thread
• Small data set: 2.1 to 3.5 times faster than STL
  – The poor access pattern of quicksort is less
    problematic when everything fits into cache
Scalability (4-cores)
              Threaded   Serial     Threaded   Serial
              (Q6600)    (Q6600)    (i7)       (i7)

1 stream      5.12       5          3.89       3.62
2 streams     6.90       10.04      4.20       7.1
3 streams     8.08       15.07      4.56       10.69
4 streams     10.97      20.55      4.86       14.2
4 + merge     16.4       26.01      9.61       19.0

(times in ms)
Multi-Core Performances
• One million entries: Q6600
  – STL took 76 ms
  – radix-sort 28 ms
  – 4-core: 16.4 ms, of which 5 to 6 ms is the merge
• One million entries: i7
  – STL: 58 ms
  – hybrid: 9.6 ms
• 6 times faster than STL
Threading Overhead
• The 1-stream vs. serial time is 5.12 vs. 5
  – So only 0.12 ms of threading overhead
Related Work
• Funnel-Sort, Brodal [2008]
• GPU radix-sort, Satish [2009]
