[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)
1. The R-Stream High-Level Program Transformation Tool
N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin
4. Computation choreography
•• Expressing it
• Annotations and pragma dialects for C
• Explicitly (e.g., new languages like CUDA and OpenCL)
•• But before expressing it, how can programmers find it?
• Manually: constructive procedures, art, sweat, time
– Artisans get complete control over every detail
• Automatically (our focus)
– An operations research problem, like scheduling trucks to save fuel
– Model, solve, implement
– Faster, and sometimes better, than a human
5. How to do automatic scheduling?
•• Naïve approach
• Model
– Tasks, dependences
– Resource use, latencies
– Machine model with connectivity and resource capacities
• Solve with ILP (sketched below)
– Minimize overall task length
– Subject to dependences and resource use
• Problems
– Complexity: the task graph is huge!
– Dynamics: loop lengths are unknown.
•• So we do something much cooler.
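For concreteness, a minimal sketch of that naïve ILP (my notation, not from the slides): give each task i a start variable s_i and latency d_i, and minimize the overall length T:

$$\min T \quad \text{s.t.} \quad s_j \ge s_i + d_i \;\; \forall\, (i,j) \in \text{dependences}, \qquad T \ge s_i + d_i \;\; \forall\, i,$$

plus constraints capping concurrent resource use. On a realistic program the task graph — and hence the number of variables and constraints — explodes, which is exactly the complexity problem above.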
6. Program Transformations Specification
[Figure: the iteration space of a statement S(i,j), mapped by a schedule Θ : Z² → Z² from iteration coordinates (i, j) to time coordinates (t1, t2)]
•• Schedule: maps iterations to multi-dimensional time
• A feasible schedule preserves dependences (see the example below)
•• Placement: maps iterations to multi-dimensional space
• UHPC in progress, partially done
•• Layout: maps data elements to multi-dimensional space
• UHPC in progress
•• Hierarchical by design; tiling serves separation of concerns
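As a concrete illustration (my example, not from the slides), the skewing schedule

$$\Theta(i, j) = (i + j,\; j)$$

maps iteration (i, j) to time (t1, t2) = (i + j, j). It is feasible for a dependence from (i, j) to (i + 1, j) because the source always executes at a lexicographically earlier time: Θ(i+1, j) − Θ(i, j) = (1, 0) ≻ 0.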
9. Enabling technology is new compiler math
• Uniform Recurrence Equations [Karp et al. 1970]
• Loop Transformations and Parallelization [1970-]: mostly linear-algebraic
– Vectorization, SMP, locality optimizations
– Dependence summary: direction/distance vectors
– Unimodular transformations
– Many contributors: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
• Systolic Array Mapping
• Polyhedral Model [1980-]: general affine transformations
– Exact dependence analysis
– Loop synthesis via polyhedral scanning
– Many contributors: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
• New computational techniques based on polyhedral representations
10. R-Stream model: polyhedra
n = f();
for (i = 5; i <= n; i += 2) {
  A[i][i] = A[i][i] / B[i];
  for (j = 0; j <= i; j++) {
    if (j <= 10) {
      ... A[i+2*j+n][i+3] ...
    }
  }
}
The statement's iteration domain is the parametric polyhedron

$$\{\, (i, j) \in \mathbb{Z}^2 \mid \exists k \in \mathbb{Z} :\; 5 \le i \le n;\; 0 \le j \le i;\; j \le 10;\; i = 2k + 1 \,\}$$

and the access A[i+2*j+n][i+3] is the affine function

$$\begin{pmatrix} A_0 \\ A_1 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 1 & 0 \\ 1 & 0 & 0 & 3 \end{pmatrix} \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} = \begin{pmatrix} i + 2j + n \\ i + 3 \end{pmatrix}$$

• Affine and non-affine transformations: order and place of operations and data
• Loop code is represented (exactly or conservatively) with polyhedra
• High-level, mathematical view of a mapping
• But targets concrete properties: parallelism, locality, memory footprint
11. Polyhedral slogans
•• Parametric imperfect loop nests
•• Subsumes classical transformations
•• Compacts the transformation search space
•• Parallelization, locality optimization (communication avoiding)
•• Preserves semantics
•• Analytic joint formulations of optimizations
•• Not just for affine static control programs
12. Polyhedral model – challenges in building a compiler
•• Killer math
•• Scalability of optimizations / code generation
•• Mostly confined to dependence-preserving transformations
•• Code can be radically transformed – outputs can look wildly different
•• Modeling indirections, pointers, non-affine code
•• Many of these challenges are solved
13. R-Stream blueprint
[Figure: R-Stream blueprint — EDG C Front End → Scalar Representation → Raising → Polyhedral Mapper (driven by the Machine Model) → Lowering → Scalar Representation → Pretty Printer]
15. Inside the polyhedral mapper
Optimization modules are engineered to expose "knobs" that can be used by an auto-tuner, all operating on the GDG representation:
• Tactics Module
• Parallelization
• Locality Optimization
• Tiling
• Placement
• Communication Generation
• Memory Promotion
• Sync Generation
• Layout Optimization
• Polyhedral Scanning (Jolylib, ...)
• ...
16. Driving the mapping: the machine model
•• Target machine characteristics that have an influence on how the mapping should be done:
• Local memory / cache sizes
• Communication facilities: DMA, cache(s)
• Synchronization capabilities
• Symmetrical or not
• SIMD width
• Bandwidths
•• Currently: a two-level model (Host and Accelerators)
•• XML schema and graphical rendering (a hypothetical sketch of the parameters follows)
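Purely as an illustration of what such a model records — the real R-Stream machine model is an XML schema, and these field names are hypothetical, not the actual schema — the two-level structure might look like:

  struct mm_level {               /* one level of the hierarchy     */
    long   local_mem_bytes;       /* local memory / cache size      */
    int    has_dma;               /* DMA engine available?          */
    int    symmetric;             /* identical processing elements? */
    int    simd_width;            /* SIMD lanes                     */
    double bandwidth_gb_s;        /* bandwidth to the level above   */
  };

  struct machine_model {          /* currently two levels           */
    struct mm_level host;
    struct mm_level accelerator;
    int num_accelerators;
  };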
17. Machine model example: multi-Tesla
[Figure: machine model for a multi-Tesla target, specified in an XML file and rendered graphically — a Host (OpenMP morph) running 1 thread per GPU, each driving a CUDA morph]
18. Mapping process
Driven by the dependences:
1. Scheduling: parallelism, locality, tilability
2. Task formation:
– Coarse-grain atomic tasks
– Master/slave side operations
3. Placement: assign tasks to blocks/threads
Followed by:
– Local / global data layout optimization
– Multi-buffering (explicitly managed)
– Synchronization (barriers)
– Bulk communications
– Thread generation -> master/slave
– CUDA-specific optimizations
19. Program Transformations Specification
(Recap of slide 6: schedule, placement, and layout map iterations and data elements to multi-dimensional time and space; hierarchical by design, with tiling serving separation of concerns.)
20. Model for scheduling trades 3 objectives jointly
[Figure: loop fusion drives toward fewer global memory accesses and more locality; loop fission drives toward more parallelism and sufficient occupancy; either, combined with successive-thread contiguity, enables memory coalescing and hence better effective bandwidth. Patent pending.]
A minimal illustration of the fusion/fission endpoints follows.
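A minimal C illustration of the two endpoints (my sketch, not R-Stream output; a and b are distinct float arrays of length n):

  float endpoints(const float *a, float *b, int n) {
    /* Fused: a[i] is loaded once and its result consumed
       immediately (fewer global memory accesses, more locality),
       but the running sum makes the single loop sequential.     */
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
      b[i] = 2.0f * a[i];
      s += b[i];
    }

    /* Fissioned: the first loop becomes a doall (more parallelism,
       better occupancy); only the reduction stays sequential, at
       the cost of traversing b twice.                           */
    for (int i = 0; i < n; i++)      /* doall */
      b[i] = 2.0f * a[i];
    s = 0.0f;
    for (int i = 0; i < n; i++)      /* sequential (or tree reduction) */
      s += b[i];
    return s;
  }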
21. Optimization with BLAS vs. global optimization
Optimization with BLAS (outer loop(s) remain sequential; numerous cache misses):

  for loop {
    ...
    Retrieve data Z from disk
    BLAS call 1
    ...
    Store data Z back to disk
    Retrieve data Z from disk !!!
    BLAS call 2
    ...
    BLAS call n
    ...
  }

vs. global optimization (can parallelize the outer loop(s); loop fusion can improve locality):

  doall loop {
    ...
    for loop {
      ...
      [read from Z]
      ...
      [write to Z]
      ...
      [read from Z]
      ...
    }
    ...
  }

Global optimization can expose better parallelism and locality.
22. Tradeoffs between parallelism and locality
• Significant parallelism is needed to fully utilize all resources
• Locality is also critical to minimize communication
• Parallelism can come at the expense of locality
[Figure: high on-chip parallelism vs. limited bandwidth at the chip border; reusing data once it is loaded on chip = locality]
•• Our approach: the R-Stream compiler exposes parallelism via affine scheduling that simultaneously augments locality using loop fusion
23. Parallelism/locality tradeoff example
Original code (simplified CSLC LMS):

  for (k=0; k<400; k++) {
    for (i=0; i<3997; i++) {
      z[i] = 0;
      for (j=0; j<4000; j++)
        z[i] = z[i] + B[i][j]*x[k][j];
    }
    for (i=0; i<3997; i++)
      w[i] = w[i] + z[i];
  }

Maximum parallelism (no fusion): array z gets expanded to z_e to introduce another level of parallelism, but maximum distribution destroys locality:

  doall (i=0; i<400; i++)
    doall (j=0; j<3997; j++)
      z_e[j][i] = 0;
  doall (i=0; i<400; i++)
    doall (j=0; j<3997; j++)
      for (k=0; k<4000; k++)
        z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
  doall (i=0; i<3997; i++)
    for (j=0; j<400; j++)
      w[i] = w[i] + z_e[i][j];
  doall (i=0; i<3997; i++)      /* data accumulation */
    z[i] = z_e[i][399];

2 levels of parallelism, but poor data reuse (on array z_e)
24. Parallelism/locality tradeoff example (cont.)
Maximum fusion (original code as above): aggressive loop fusion destroys parallelism, i.e., only 1 degree of parallelism remains:

  doall (i=0; i<3997; i++)
    for (j=0; j<400; j++) {
      z[i] = 0;
      for (k=0; k<4000; k++)
        z[i] = z[i] + B[i][k]*x[j][k];
      w[i] = w[i] + z[i];
    }

Very good data reuse (on array z), but only 1 level of parallelism
25. Parallelism/locality tradeoff example (cont.)
Parallelism with partial fusion (original code as above): array z is still expanded to z_e, but partial fusion doesn't decrease parallelism:

  doall (i=0; i<3997; i++) {
    doall (j=0; j<400; j++) {
      z_e[i][j] = 0;
      for (k=0; k<4000; k++)
        z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
    }
    for (j=0; j<400; j++)
      w[i] = w[i] + z_e[i][j];
  }
  doall (i=0; i<3997; i++)      /* data accumulation */
    z[i] = z_e[i][399];

2 levels of parallelism with good data reuse (on array z_e)
26. Parallelism/locality tradeoffs: performance numbers
[Figure: performance comparison of the three mappings]
Code with a good balance between parallelism and fusion performs best. On explicitly managed memory/scratchpad architectures this is even more true.
27. R-Stream: affine scheduling and fusion
•• R-Stream uses a heuristic based on an objective function with several cost coefficients:
• w_l: the slowdown in execution if loop l is executed sequentially rather than in parallel
• u_e: the cost in performance if the two loops joined by edge e remain unfused rather than fused

$$\min \sum_{l \,\in\, \text{loops}} w_l\, p_l \;+ \sum_{e \,\in\, \text{loop edges}} u_e\, f_e$$

where p_l = 1 if loop l runs sequentially (0 if parallel) and f_e = 1 if the loops of edge e remain unfused (0 if fused).
•• These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
•• Fine-grained parallelism, such as SIMD, can also be modeled using a similar formulation
Patent Pending
28. Parallelism + locality + spatial locality
Hypothesis: auto-tuning should adjust these parameters.

$$\min \sum_{l \,\in\, \text{loops}} w_l\, p_l \;+ \sum_{e \,\in\, \text{loop edges}} u_e\, f_e$$

The w_l p_l terms capture the benefits of parallel execution; the u_e f_e terms capture the benefits of improved locality. A new algorithm (unpublished) balances contiguity to enhance coalescing for GPUs and SIMDization, modulo data-layout transformations.
30. What R-Stream does for you – in a nutshell
•• Input
• Sequential
– Short and simple textbook C code
– Just add a “#pragma map” and R-Stream figures out the rest
31. What R-Stream does for you – in a nutshell
•• Input
• Sequential
– Short and simple textbook C code
– Just add a “#pragma map” and R-Stream figures out the rest
• Gauss-Seidel 9-point stencil
– Used in iterative PDE solvers for scientific modeling (heat, fluid flow, waves, etc.)
– Building block for faster iterative solvers like Multigrid or AMR
•• Output
• OpenMP + CUDA code
– Hundreds of lines of tightly optimized GPU-side CUDA code
– Few lines of host-side OpenMP C code
• Very difficult to hand-optimize
• Not available in any standard library
A sketch of what such an input might look like follows.
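For concreteness, a sketch of the kind of input described above (my reconstruction; the stencil weights and everything beyond "#pragma map" are illustrative assumptions, not taken from the talk):

  #define N 1024

  void gauss_seidel_9pt(double A[N][N], int iters) {
  #pragma map
    for (int t = 0; t < iters; t++)
      for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
          /* in-place 9-point update: reads neighbors already
             updated in this sweep (Gauss-Seidel style), which is
             what makes the nest hard to parallelize by hand */
          A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1]
                   + A[i][j-1]   + A[i][j]   + A[i][j+1]
                   + A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / 9.0;
  }

From textbook code of this shape, R-Stream produces the host-side OpenMP C and GPU-side CUDA described above.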
32. What R-Stream does for you – in a nutshell
•• Input and output as on the previous slide
•• Achieving up to
– 20 GFLOPS on a GTX 285
– 25 GFLOPS on a GTX 480
•• Will be illustrated in the next few slides
33. Finding and utilizing available parallelism
[Figure: excerpt of automatically generated code next to a GPU block diagram — SMs containing shared memory, registers, an instruction unit, and SPs; constant and texture caches; off-chip device memory (global, constant, texture)]
R-Stream AUTOMATICALLY finds and forms parallelism: extracting and mapping parallel loops.
34. Memory compaction on GPU scratchpad
[Figure: excerpt of automatically generated code next to the GPU block diagram, highlighting the per-SM shared memory]
R-Stream AUTOMATICALLY manages the local scratchpad; a sketch of the staging pattern follows.
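A minimal CUDA-style sketch of the explicitly managed scratchpad pattern (my illustration, not actual R-Stream output; the tile size is hypothetical, and the kernel assumes 256 threads per block with n a multiple of 256):

  __global__ void scale_tiled(float *g, int n) {
    __shared__ float tile[256];                 /* on-chip scratchpad */
    int base = blockIdx.x * 256;
    tile[threadIdx.x] = g[base + threadIdx.x];  /* stage in           */
    __syncthreads();
    tile[threadIdx.x] *= 2.0f;                  /* compute on-chip    */
    __syncthreads();
    g[base + threadIdx.x] = tile[threadIdx.x];  /* stage out          */
  }

The compaction the slide refers to is keeping only the data a block actually touches in the scratchpad, so the small shared memory is not wasted.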
35. GPU DRAM to scratchpad coalesced communication
[Figure: excerpt of automatically generated code next to the GPU block diagram, highlighting off-chip device memory]
R-Stream AUTOMATICALLY chooses parallelism to favor coalescing: coalesced GPU DRAM accesses. A sketch of the effect follows.
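An illustrative kernel showing what that choice buys (my sketch, not R-Stream output):

  __global__ void copy(const float *a, float *b, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n)
      b[t] = a[t];   /* coalesced: consecutive threads touch
                        consecutive addresses, so each warp's loads
                        combine into a few DRAM transactions */
    /* b[t] = a[32 * t];  a strided access like this would break
                          coalescing, costing roughly one
                          transaction per thread               */
  }

Which loop dimension becomes threadIdx.x determines whether accesses coalesce; that is the choice R-Stream automates.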
36. Host-to-GPU communication
[Figure: excerpt of automatically generated code next to a diagram of the CPU and host memory connected over PCI Express to the GPU and its off-chip device memory]
R-Stream AUTOMATICALLY chooses the partition and sets up host-to-GPU communication.
37. Multi-GPU mapping
[Figure: excerpt of automatically generated code next to two CPUs with host memories, each driving its own GPUs and GPU memories; the mapping spans all GPUs]
R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs.
38. Multi-GPU mapping
[Figure: the same multi-GPU system, with host-GPU communication multi-streamed]
R-Stream AUTOMATICALLY creates n-way software pipelines for communications; a sketch of the pattern follows.
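A minimal CUDA-streams sketch of such a pipeline (my illustration of a 2-way pipeline; buffer names and chunking are hypothetical, and the host buffers are assumed pinned via cudaMallocHost):

  cudaStream_t s[2];
  for (int i = 0; i < 2; i++)
    cudaStreamCreate(&s[i]);
  for (int c = 0; c < nchunks; c++) {
    int i = c % 2;                         /* double buffering */
    cudaMemcpyAsync(d_in[i], h_in + c * CHUNK,
                    CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);
    process<<<grid, block, 0, s[i]>>>(d_in[i], d_out[i]);
    cudaMemcpyAsync(h_out + c * CHUNK, d_out[i],
                    CHUNK * sizeof(float),
                    cudaMemcpyDeviceToHost, s[i]);
  }

Transfers for chunk c+1 overlap with compute for chunk c, hiding PCI Express latency behind kernel execution.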
39. Future capabilities – mapping to CPU-GPU clusters
[Figure: future mapping target — CPU+GPU nodes connected by a high-speed interconnect (e.g., InfiniBand); on each node the program runs as an MPI process whose OpenMP process launches CUDA on the node's GPUs, with DRAM at the host and at each GPU]
A hypothetical skeleton of this hierarchy follows.
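A hypothetical skeleton of that hierarchy (this mapping was future work at the time of the talk; NUM_GPUS and the elided kernel launches are illustrative):

  #include <mpi.h>
  #include <omp.h>
  #include <cuda_runtime.h>
  #define NUM_GPUS 3

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                 /* one MPI process per node */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel num_threads(NUM_GPUS)
    {
      cudaSetDevice(omp_get_thread_num());  /* one host thread per GPU  */
      /* ... stage this node's share of the data, launch CUDA kernels,
         exchange halos with neighboring ranks over MPI ... */
    }
    MPI_Finalize();
    return 0;
  }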
48. 3D Discretized wave equation input code (RTM)
Not so naïve ...
[Figure: annotated excerpt of the generated RTM code, pointing out communication autotuning knobs and noting that threadIdx.x divergence is expensive; a sketch of why divergence hurts follows]
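Why threadIdx.x divergence is expensive (my illustrative kernel, not the generated RTM code): the threads of a warp execute in lockstep, so a branch that splits a warp serializes both paths.

  __global__ void diverge(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)   /* splits every warp: both sides run */
      y[i] = 2.0f * x[i];
    else
      y[i] = x[i] + 1.0f;
  }

Branching on a warp-uniform value (e.g., blockIdx.x) avoids the serialization.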
50. Current status
•• Ongoing development also supported by DOE and Reservoir
• Improvements in scope, stability, performance
•• Installations/evaluations at US government laboratories
•• Forward collaboration with Georgia Tech on Keeneland
• HP SL390: 3 Fermi GPUs and 2 Westmere CPUs per node
•• Basis of the compiler for the DARPA UHPC Intel Corporation team
51. Availability
•• Per-developer-seat licensing model
•• Support (releases, bug fixes, services)
•• Available with commercial grade external solvers
•• Government has limited rights / SBIR data rights
•• Academic source licenses with collaborators
•• Professional team, continuity, software engineering
52. R-Stream Gives More Insight About Programs
•• Teaches you about parallelism and the polyhedral model:
• Generates correct code for any transformation
– The transformation itself may be incorrect if specified by the user
• Imperative code generation was the bottleneck until 2005
– 3 theses written on the topic, and it is still not completely covered
– It is good to have an intuition of how these things work
•• R-Stream has meaningful metrics to represent your program:
• Maximal amount of parallelism given minimal expansion
• Tradeoffs between coarse-grained and fine-grained parallelism
• Loop types (doall, red, perm, seq)
•• Helps with algorithm selection
•• Tiling of imperfectly nested loops
•• Generates code for explicitly managed memory
53. How to Use R-Stream Successfully
•• R-Stream is a great transformation tool; it is also a great learning tool:
• Takes “simple C” input code
• Applies multiple transformations and lists them explicitly
• Can dump code at any step in the process
• It's the tool I wish I had during my PhD
– To be fair, I already had a great set of tools
•• R-Stream can be used in multiple modes:
• Fully automatic + compile-flag options
• Autotuning mode (more than just tile size and unrolling ...)
• Scripting / programmatic mode (BeanShell + interfaces)
• A mix of these modes + manual post-processing
54. Use Case: Fully Automatic Mode + Compile Flag Options
•• Akin to traditional gcc / icc compilation with flags:
• Predefined transformations can be parameterized
– Except with fewer or no phase-ordering issues
• Except you can see what each transformation does at each step
• Except you can generate compilable and executable code at (almost) any step in the process
• Except you can control code generation for compactness or performance
55. Use Case: Autotuning Mode
•• More advanced than traditional approaches:
• Knobs go far beyond loop unrolling + unroll-and-jam + tiling
• Knobs are based on well-understood models
• Knobs target high-level properties of the program
– Amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations ...
• Knobs depend on the target machine, the program, and the state of the mapping process:
– Our tool has introspection
•• We are really building a hierarchical autotuning transformation tool
56. Use Case: “Power user” interactive interface
•• BeanShell access to optimizations
•• Can direct and review the process of compilation
• Automatic tactics (affine scheduling)
•• Can direct code generation
•• Access to “tuning parameters”
•• All options / commands available on the command-line interface
57. Conclusion
•• R-Stream simplifies software development and maintenance
•• Porting: reduces expense and delivery delays
•• Does this by automatically parallelizing loop code
• While optimizing for data locality, coalescing, etc.
•• Addresses
• Dense loop-intensive computations
•• Extensions
• Data-parallel programming idioms
• Sparse representations
• Dynamic runtime execution
58. Contact us
•• Per-developer seat, floating, cloud-based licensing
•• Discounts for academic users
•• Research collaborations with academic partners
•• For more information:
• Call us at 212-780-0527, or
• See Rich, Ann
• E-mail us at
– {sales,lethin,johnson}@reservoir.com