2. Optimization - a form of balance
[Diagram: Optimization balanced among Device/Platform (Features), Runtime/Toolchain, and Problem (Algorithm)]
Optimization is not just a greedy search in a single direction. It is more like finding a good balance point between the device, the toolchain, and the problem.
3. Device - Computation
● device type
○ CPU - strong single-thread performance
○ GPU - many threads, high total throughput
● ISA design
○ scalar-based
○ vector-based
● number of compute units/processing elements
● estimated impact of branch divergence and barriers
● support for asynchronous data transfer
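To make "impact of divergence" concrete, here is a minimal sketch, assuming a lock-step SIMD group (warp/wavefront) in which any lane taking a branch path forces the whole group to step through it with the other lanes masked off. The group width of 32 and the per-path costs are illustrative assumptions, not measurements of any real device.

```python
def branch_cost(conditions, then_cost, else_cost):
    """Cost for one SIMD group to run an if/else, assuming lock-step lanes.

    conditions: one bool per lane (the branch predicate of each work-item).
    If any lane takes a path, the whole group pays for that path.
    """
    cost = 0
    if any(conditions):
        cost += then_cost
    if not all(conditions):
        cost += else_cost
    return cost


def kernel_branch_cost(predicates, group_width, then_cost, else_cost):
    """Sum the branch cost over consecutive SIMD groups of work-items."""
    total = 0
    for start in range(0, len(predicates), group_width):
        group = predicates[start:start + group_width]
        total += branch_cost(group, then_cost, else_cost)
    return total


if __name__ == "__main__":
    n, width = 256, 32  # assumed: 256 work-items, 32-lane SIMD groups
    # Branching on work-item parity: every group contains both outcomes.
    divergent = [i % 2 == 0 for i in range(n)]
    # Branching on the group index: each group is uniform.
    uniform = [(i // width) % 2 == 0 for i in range(n)]
    print(kernel_branch_cost(divergent, width, 10, 10))  # 8 groups * 20 = 160
    print(kernel_branch_cost(uniform, width, 10, 10))    # 8 groups * 10 = 80
```

Under this model the same amount of "useful" work costs twice as much when the predicate varies within a group, which is why it pays to arrange branches so that they are uniform per group where the algorithm allows it.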
4. Device - Memory
● get basic memory characteristics:
○ size
○ latency
○ throughput
○ coalescing effect
○ addressing mode
● global memory - unified with host memory or not
● local memory - dedicated on-chip memory, or emulated in global memory
● penalty when data exceeds the available size
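The coalescing effect can be estimated with a simplified model: assume the hardware merges all lane accesses of one SIMD group that fall into the same aligned segment into a single transaction. The 32-lane group, 128-byte segment, and 4-byte element sizes below are assumptions; real coalescing rules differ between devices and generations.

```python
def transactions(addresses, segment_bytes=128):
    """Count distinct aligned segments touched by one SIMD group (simplified model)."""
    return len({addr // segment_bytes for addr in addresses})


def strided_addresses(group_width, stride_elems, elem_bytes=4, base=0):
    """Byte address read by each lane when lane i reads element i * stride."""
    return [base + lane * stride_elems * elem_bytes for lane in range(group_width)]


if __name__ == "__main__":
    for stride in (1, 2, 4, 32):
        addrs = strided_addresses(32, stride)
        print("stride", stride, "->", transactions(addrs), "transactions")
    # A contiguous access that starts off a segment boundary spills into a second segment.
    print("misaligned ->", transactions(strided_addresses(32, 1, base=4)), "transactions")
```

In this model a unit stride needs 1 transaction, strides of 2 and 4 need 2 and 4, and a stride of 32 elements needs 32 (one per lane), while a misaligned contiguous read needs 2. That is the kind of size/addressing characteristic worth measuring on the actual device.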
5. Toolchain/Runtime
● documentation, tutorials, and guides for debugging, profiling, and optimization
● there is no perfect runtime/toolchain
● profiling/debugging tools
● it is not always a good idea to debug or optimize on a platform other than the target
● automatic optimization MAY NOT HELP you reason about what to optimize
● hardware-specific forms of computation/memory operations:
○ MAD (multiply-add) operations
○ memory access modes
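One reason multiply-add forms are tricky is rounding: a fused operation rounds a*b + c once, while a separate multiply and add round twice, so results can differ. OpenCL C, for instance, offers both `mad()` (intended where speed matters more than accuracy) and `fma()` (single rounding). The sketch below only emulates the numeric difference on the host; it assumes nothing about how any particular device implements either built-in.

```python
from fractions import Fraction


def mul_then_add(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded to double before the add."""
    return a * b + c


def fused_mul_add(a: float, b: float, c: float) -> float:
    """One rounding: a*b + c is computed exactly, then rounded once to double.

    float(Fraction) uses correctly rounded integer division, so this
    emulates the result a true fused multiply-add returns.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


if __name__ == "__main__":
    a = 1.0 + 2.0**-30
    b = 1.0 - 2.0**-30
    c = -1.0
    # Exact a*b is 1 - 2**-60, which rounds to 1.0 in double precision.
    print(mul_then_add(a, b, c))                 # 0.0: the small term is lost
    print(fused_mul_add(a, b, c) == -2.0**-60)   # True: the small term survives
```

The example inputs were chosen so the difference is visible; for most inputs the two results agree, which is exactly why such discrepancies are easy to miss when switching between the two forms.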
6. Problem/Algorithms
● DATA PARALLEL!
● multi-stage processing is not always bad
○ doing everything in one stage uses more memory resources per work-item
● vectorization is not always a good idea
● use an appropriate work-group size; a poor choice can lead to:
○ bad memory access patterns and less coalescing
○ a lower cache hit rate
○ less local memory for each work-item
○ possibly less private memory for each work-item
● try different forms of implementation
● do the optimization work manually
○ DO NOT rely on automatic features
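Choosing a work-group size by hand can start from a simple filter, sketched below under stated assumptions: the helper names and all numbers are hypothetical. On an OpenCL device the limits would come from queries such as `CL_DEVICE_LOCAL_MEM_SIZE`, `CL_DEVICE_MAX_WORK_GROUP_SIZE`, and `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`, and the surviving candidates would still need to be timed on the target.

```python
def local_bytes_per_item(local_mem_bytes, work_group_size):
    """Local memory available to each work-item if the group shares it evenly."""
    return local_mem_bytes // work_group_size


def candidate_group_sizes(global_size, simd_width, max_group_size,
                          local_mem_bytes, local_bytes_needed_per_item):
    """Work-group sizes that avoid a few obvious problems (hypothetical heuristic):

    - divide the global size evenly (no partial groups),
    - are a multiple of the SIMD width (no idle lanes),
    - leave each work-item the local memory it needs.
    """
    sizes = []
    for size in range(simd_width, max_group_size + 1, simd_width):
        if global_size % size:
            continue
        if size * local_bytes_needed_per_item > local_mem_bytes:
            continue
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # Assumed device: 32 KiB local memory, 32-lane SIMD, max group size 256.
    print(local_bytes_per_item(32768, 64))   # 512 bytes per work-item
    print(local_bytes_per_item(32768, 256))  # 128 bytes per work-item
    # Assumed kernel: 4096 work-items, each needing 192 bytes of local memory.
    print(candidate_group_sizes(4096, 32, 256, 32768, 192))  # [32, 64, 128]
```

Larger groups shrink the local memory left for each work-item, so a kernel that fuses many stages (and needs more per-item memory) is pushed toward smaller groups; splitting it into stages is one way to relax that constraint.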