[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)
1. The R-Stream High-Level Program Transformation Tool
N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin
4. Computation choreography
•• Expressing it
• Annotations and pragma dialects for C
• Explicitly (e.g., new languages like CUDA and OpenCL)
•• But before expressing it, how can programmers find it?
• Manually: constructive procedures, art, sweat, time
– Artisans get complete control over every detail
• Automatically (our focus)
– An operations research problem, like scheduling trucks to save fuel
– Model, solve, implement
– Faster, and sometimes better, than a human
5. How to do automatic scheduling?
•• Naïve approach
• Model
– Tasks, dependences
– Resource use, latencies
– Machine model with connectivity and resource capacities
• Solve with ILP (sketched below)
– Minimize overall task length
– Subject to dependences and resource use
• Problems
– Complexity: the task graph is huge!
– Dynamics: loop lengths are unknown.
•• So we do something much cooler.
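For concreteness, a minimal sketch of that naïve ILP (my notation, not from the slides): give each task i a start variable s_i and latency d_i, and minimize the overall length T:

$$\min T \quad \text{s.t.} \quad s_j \ge s_i + d_i \;\; \forall\, (i,j) \in \text{dependences}, \qquad T \ge s_i + d_i \;\; \forall\, i,$$

plus constraints capping concurrent resource use. On a realistic program the task graph — and hence the number of variables and constraints — explodes, which is exactly the complexity problem above.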
6. Program Transformations Specification
[Figure: the iteration space of a statement S(i,j), mapped by a schedule Θ : Z² → Z² from iteration coordinates (i, j) to time coordinates (t1, t2)]
•• Schedule: maps iterations to multi-dimensional time
• A feasible schedule preserves dependences (see the example below)
•• Placement: maps iterations to multi-dimensional space
• UHPC in progress, partially done
•• Layout: maps data elements to multi-dimensional space
• UHPC in progress
•• Hierarchical by design; tiling serves separation of concerns
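As a concrete illustration (my example, not from the slides), the skewing schedule

$$\Theta(i, j) = (i + j,\; j)$$

maps iteration (i, j) to time (t1, t2) = (i + j, j). It is feasible for a dependence from (i, j) to (i + 1, j) because the source always executes at a lexicographically earlier time: Θ(i+1, j) − Θ(i, j) = (1, 0) ≻ 0.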
9. Enabling technology is new compiler math
• Uniform Recurrence Equations [Karp et al. 1970]
• Loop Transformations and Parallelization [1970-]: mostly linear-algebraic
– Vectorization, SMP, locality optimizations
– Dependence summary: direction/distance vectors
– Unimodular transformations
– Many contributors: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
• Systolic Array Mapping
• Polyhedral Model [1980-]: general affine transformations
– Exact dependence analysis
– Loop synthesis via polyhedral scanning
– Many contributors: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
• New computational techniques based on polyhedral representations
10. R-Stream model: polyhedra
n = f();
for (i = 5; i <= n; i += 2) {
  A[i][i] = A[i][i] / B[i];
  for (j = 0; j <= i; j++) {
    if (j <= 10) {
      ... A[i+2*j+n][i+3] ...
    }
  }
}
The statement's iteration domain is the parametric polyhedron

$$\{\, (i, j) \in \mathbb{Z}^2 \mid \exists k \in \mathbb{Z} :\; 5 \le i \le n;\; 0 \le j \le i;\; j \le 10;\; i = 2k + 1 \,\}$$

and the access A[i+2*j+n][i+3] is the affine function

$$\begin{pmatrix} A_0 \\ A_1 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 1 & 0 \\ 1 & 0 & 0 & 3 \end{pmatrix} \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} = \begin{pmatrix} i + 2j + n \\ i + 3 \end{pmatrix}$$

• Affine and non-affine transformations: order and place of operations and data
• Loop code is represented (exactly or conservatively) with polyhedra
• High-level, mathematical view of a mapping
• But targets concrete properties: parallelism, locality, memory footprint
11. Polyhedral slogans
•• Parametric imperfect loop nests
•• Subsumes classical transformations
•• Compacts the transformation search space
•• Parallelization, locality optimization (communication avoiding)
•• Preserves semantics
•• Analytic joint formulations of optimizations
•• Not just for affine static control programs
12. Polyhedral model – challenges in building a compiler
•• Killer math
•• Scalability of optimizations / code generation
•• Mostly confined to dependence-preserving transformations
•• Code can be radically transformed – outputs can look wildly different
•• Modeling indirections, pointers, non-affine code
•• Many of these challenges are solved
13. R-Stream blueprint
[Figure: R-Stream blueprint — EDG C Front End → Scalar Representation → Raising → Polyhedral Mapper (driven by the Machine Model) → Lowering → Scalar Representation → Pretty Printer]
15. Inside the polyhedral mapper
Optimization modules are engineered to expose "knobs" that can be used by an auto-tuner, all operating on the GDG representation:
• Tactics Module
• Parallelization
• Locality Optimization
• Tiling
• Placement
• Communication Generation
• Memory Promotion
• Sync Generation
• Layout Optimization
• Polyhedral Scanning (Jolylib, ...)
• ...
16. Driving the mapping: the machine model
•• Target machine characteristics that have an influence on how the mapping should be done:
• Local memory / cache sizes
• Communication facilities: DMA, cache(s)
• Synchronization capabilities
• Symmetrical or not
• SIMD width
• Bandwidths
•• Currently: a two-level model (Host and Accelerators)
•• XML schema and graphical rendering (a hypothetical sketch of the parameters follows)
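Purely as an illustration of what such a model records — the real R-Stream machine model is an XML schema, and these field names are hypothetical, not the actual schema — the two-level structure might look like:

  struct mm_level {               /* one level of the hierarchy     */
    long   local_mem_bytes;       /* local memory / cache size      */
    int    has_dma;               /* DMA engine available?          */
    int    symmetric;             /* identical processing elements? */
    int    simd_width;            /* SIMD lanes                     */
    double bandwidth_gb_s;        /* bandwidth to the level above   */
  };

  struct machine_model {          /* currently two levels           */
    struct mm_level host;
    struct mm_level accelerator;
    int num_accelerators;
  };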
17. Machine model example: multi-Tesla
[Figure: machine model for a multi-Tesla target, specified in an XML file and rendered graphically — a Host (OpenMP morph) running 1 thread per GPU, each driving a CUDA morph]
18. Mapping process
Driven by the dependences:
1. Scheduling: parallelism, locality, tilability
2. Task formation:
– Coarse-grain atomic tasks
– Master/slave side operations
3. Placement: assign tasks to blocks/threads
Followed by:
– Local / global data layout optimization
– Multi-buffering (explicitly managed)
– Synchronization (barriers)
– Bulk communications
– Thread generation -> master/slave
– CUDA-specific optimizations
19. Program Transformations Specification
(Recap of slide 6: schedule, placement, and layout map iterations and data elements to multi-dimensional time and space; hierarchical by design, with tiling serving separation of concerns.)
20. Model for scheduling trades 3 objectives jointly
[Figure: loop fusion drives toward fewer global memory accesses and more locality; loop fission drives toward more parallelism and sufficient occupancy; either, combined with successive-thread contiguity, enables memory coalescing and hence better effective bandwidth. Patent pending.]
A minimal illustration of the fusion/fission endpoints follows.
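A minimal C illustration of the two endpoints (my sketch, not R-Stream output; a and b are distinct float arrays of length n):

  float endpoints(const float *a, float *b, int n) {
    /* Fused: a[i] is loaded once and its result consumed
       immediately (fewer global memory accesses, more locality),
       but the running sum makes the single loop sequential.     */
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
      b[i] = 2.0f * a[i];
      s += b[i];
    }

    /* Fissioned: the first loop becomes a doall (more parallelism,
       better occupancy); only the reduction stays sequential, at
       the cost of traversing b twice.                           */
    for (int i = 0; i < n; i++)      /* doall */
      b[i] = 2.0f * a[i];
    s = 0.0f;
    for (int i = 0; i < n; i++)      /* sequential (or tree reduction) */
      s += b[i];
    return s;
  }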
21. Optimization with BLAS vs. global optimization
Optimization with BLAS (outer loop(s) remain sequential; numerous cache misses):

  for loop {
    ...
    Retrieve data Z from disk
    BLAS call 1
    ...
    Store data Z back to disk
    Retrieve data Z from disk !!!
    BLAS call 2
    ...
    BLAS call n
    ...
  }

vs. global optimization (can parallelize the outer loop(s); loop fusion can improve locality):

  doall loop {
    ...
    for loop {
      ...
      [read from Z]
      ...
      [write to Z]
      ...
      [read from Z]
      ...
    }
    ...
  }

Global optimization can expose better parallelism and locality.
22. Tradeoffs between parallelism and locality
• Significant parallelism is needed to fully utilize all resources
• Locality is also critical to minimize communication
• Parallelism can come at the expense of locality
[Figure: high on-chip parallelism vs. limited bandwidth at the chip border; reusing data once it is loaded on chip = locality]
•• Our approach: the R-Stream compiler exposes parallelism via affine scheduling that simultaneously augments locality using loop fusion
23. Parallelism/locality tradeoff example
Original code (simplified CSLC LMS):

  for (k=0; k<400; k++) {
    for (i=0; i<3997; i++) {
      z[i] = 0;
      for (j=0; j<4000; j++)
        z[i] = z[i] + B[i][j]*x[k][j];
    }
    for (i=0; i<3997; i++)
      w[i] = w[i] + z[i];
  }

Maximum parallelism (no fusion): array z gets expanded to z_e to introduce another level of parallelism, but maximum distribution destroys locality:

  doall (i=0; i<400; i++)
    doall (j=0; j<3997; j++)
      z_e[j][i] = 0;
  doall (i=0; i<400; i++)
    doall (j=0; j<3997; j++)
      for (k=0; k<4000; k++)
        z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
  doall (i=0; i<3997; i++)
    for (j=0; j<400; j++)
      w[i] = w[i] + z_e[i][j];
  doall (i=0; i<3997; i++)      /* data accumulation */
    z[i] = z_e[i][399];

2 levels of parallelism, but poor data reuse (on array z_e)
24. Parallelism/locality tradeoff example (cont.)
Maximum fusion (original code as above): aggressive loop fusion destroys parallelism, i.e., only 1 degree of parallelism remains:

  doall (i=0; i<3997; i++)
    for (j=0; j<400; j++) {
      z[i] = 0;
      for (k=0; k<4000; k++)
        z[i] = z[i] + B[i][k]*x[j][k];
      w[i] = w[i] + z[i];
    }

Very good data reuse (on array z), but only 1 level of parallelism
25. Parallelism/locality tradeoff example (cont.)
Parallelism with partial fusion (original code as above): array z is still expanded to z_e, but partial fusion doesn't decrease parallelism:

  doall (i=0; i<3997; i++) {
    doall (j=0; j<400; j++) {
      z_e[i][j] = 0;
      for (k=0; k<4000; k++)
        z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
    }
    for (j=0; j<400; j++)
      w[i] = w[i] + z_e[i][j];
  }
  doall (i=0; i<3997; i++)      /* data accumulation */
    z[i] = z_e[i][399];

2 levels of parallelism with good data reuse (on array z_e)
26. Parallelism/locality tradeoffs: performance numbers
[Figure: performance comparison of the three mappings]
Code with a good balance between parallelism and fusion performs best. On explicitly managed memory/scratchpad architectures this is even more true.
27. R-Stream: affine scheduling and fusion
•• R-Stream uses a heuristic based on an objective function with several cost coefficients:
• w_l: the slowdown in execution if loop l is executed sequentially rather than in parallel
• u_e: the cost in performance if the two loops joined by edge e remain unfused rather than fused

$$\min \sum_{l \,\in\, \text{loops}} w_l\, p_l \;+ \sum_{e \,\in\, \text{loop edges}} u_e\, f_e$$

where p_l = 1 if loop l runs sequentially (0 if parallel) and f_e = 1 if the loops of edge e remain unfused (0 if fused).
•• These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
•• Fine-grained parallelism, such as SIMD, can also be modeled using a similar formulation
Patent Pending
28. Parallelism + locality + spatial locality
Hypothesis: auto-tuning should adjust these parameters.

$$\min \sum_{l \,\in\, \text{loops}} w_l\, p_l \;+ \sum_{e \,\in\, \text{loop edges}} u_e\, f_e$$

The w_l p_l terms capture the benefits of parallel execution; the u_e f_e terms capture the benefits of improved locality. A new algorithm (unpublished) balances contiguity to enhance coalescing for GPUs and SIMDization, modulo data-layout transformations.
30. What R-Stream does for you – in a nutshell
•• Input
• Sequential
– Short and simple textbook C code
– Just add a “#pragma map” and R-Stream figures out the rest
31. What R-Stream does for you – in a nutshell
•• Input
• Sequential
– Short and simple textbook C code
– Just add a “#pragma map” and R-Stream figures out the rest
• Gauss-Seidel 9-point stencil
– Used in iterative PDE solvers for scientific modeling (heat, fluid flow, waves, etc.)
– Building block for faster iterative solvers like Multigrid or AMR
•• Output
• OpenMP + CUDA code
– Hundreds of lines of tightly optimized GPU-side CUDA code
– Few lines of host-side OpenMP C code
• Very difficult to hand-optimize
• Not available in any standard library
A sketch of what such an input might look like follows.
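For concreteness, a sketch of the kind of input described above (my reconstruction; the stencil weights and everything beyond "#pragma map" are illustrative assumptions, not taken from the talk):

  #define N 1024

  void gauss_seidel_9pt(double A[N][N], int iters) {
  #pragma map
    for (int t = 0; t < iters; t++)
      for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
          /* in-place 9-point update: reads neighbors already
             updated in this sweep (Gauss-Seidel style), which is
             what makes the nest hard to parallelize by hand */
          A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1]
                   + A[i][j-1]   + A[i][j]   + A[i][j+1]
                   + A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / 9.0;
  }

From textbook code of this shape, R-Stream produces the host-side OpenMP C and GPU-side CUDA described above.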
32. What R-Stream does for you – in a nutshell
•• Input and output as on the previous slide
•• Achieving up to
– 20 GFLOPS on a GTX 285
– 25 GFLOPS on a GTX 480
•• Will be illustrated in the next few slides
33. Finding and utilizing available parallelism
[Figure: excerpt of automatically generated code next to a GPU block diagram — SMs containing shared memory, registers, an instruction unit, and SPs; constant and texture caches; off-chip device memory (global, constant, texture)]
R-Stream AUTOMATICALLY finds and forms parallelism: extracting and mapping parallel loops.
34. Memory compaction on GPU scratchpad
[Figure: excerpt of automatically generated code next to the GPU block diagram, highlighting the per-SM shared memory]
R-Stream AUTOMATICALLY manages the local scratchpad; a sketch of the staging pattern follows.
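A minimal CUDA-style sketch of the explicitly managed scratchpad pattern (my illustration, not actual R-Stream output; the tile size is hypothetical, and the kernel assumes 256 threads per block with n a multiple of 256):

  __global__ void scale_tiled(float *g, int n) {
    __shared__ float tile[256];                 /* on-chip scratchpad */
    int base = blockIdx.x * 256;
    tile[threadIdx.x] = g[base + threadIdx.x];  /* stage in           */
    __syncthreads();
    tile[threadIdx.x] *= 2.0f;                  /* compute on-chip    */
    __syncthreads();
    g[base + threadIdx.x] = tile[threadIdx.x];  /* stage out          */
  }

The compaction the slide refers to is keeping only the data a block actually touches in the scratchpad, so the small shared memory is not wasted.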
35. GPU DRAM to scratchpad coalesced communication
[Figure: excerpt of automatically generated code next to the GPU block diagram, highlighting off-chip device memory]
R-Stream AUTOMATICALLY chooses parallelism to favor coalescing: coalesced GPU DRAM accesses. A sketch of the effect follows.
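An illustrative kernel showing what that choice buys (my sketch, not R-Stream output):

  __global__ void copy(const float *a, float *b, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n)
      b[t] = a[t];   /* coalesced: consecutive threads touch
                        consecutive addresses, so each warp's loads
                        combine into a few DRAM transactions */
    /* b[t] = a[32 * t];  a strided access like this would break
                          coalescing, costing roughly one
                          transaction per thread               */
  }

Which loop dimension becomes threadIdx.x determines whether accesses coalesce; that is the choice R-Stream automates.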
36. Host-to-GPU communication
[Figure: excerpt of automatically generated code next to a diagram of the CPU and host memory connected over PCI Express to the GPU and its off-chip device memory]
R-Stream AUTOMATICALLY chooses the partition and sets up host-to-GPU communication.
37. Multi-GPU mapping
[Figure: excerpt of automatically generated code next to two CPUs with host memories, each driving its own GPUs and GPU memories; the mapping spans all GPUs]
R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs.
38. Multi-GPU mapping
[Figure: the same multi-GPU system, with host-GPU communication multi-streamed]
R-Stream AUTOMATICALLY creates n-way software pipelines for communications; a sketch of the pattern follows.
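A minimal CUDA-streams sketch of such a pipeline (my illustration of a 2-way pipeline; buffer names and chunking are hypothetical, and the host buffers are assumed pinned via cudaMallocHost):

  cudaStream_t s[2];
  for (int i = 0; i < 2; i++)
    cudaStreamCreate(&s[i]);
  for (int c = 0; c < nchunks; c++) {
    int i = c % 2;                         /* double buffering */
    cudaMemcpyAsync(d_in[i], h_in + c * CHUNK,
                    CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);
    process<<<grid, block, 0, s[i]>>>(d_in[i], d_out[i]);
    cudaMemcpyAsync(h_out + c * CHUNK, d_out[i],
                    CHUNK * sizeof(float),
                    cudaMemcpyDeviceToHost, s[i]);
  }

Transfers for chunk c+1 overlap with compute for chunk c, hiding PCI Express latency behind kernel execution.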
39. Future capabilities – mapping to CPU-GPU clusters
[Figure: future mapping target — CPU+GPU nodes connected by a high-speed interconnect (e.g., InfiniBand); on each node the program runs as an MPI process whose OpenMP process launches CUDA on the node's GPUs, with DRAM at the host and at each GPU]
A hypothetical skeleton of this hierarchy follows.
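A hypothetical skeleton of that hierarchy (this mapping was future work at the time of the talk; NUM_GPUS and the elided kernel launches are illustrative):

  #include <mpi.h>
  #include <omp.h>
  #include <cuda_runtime.h>
  #define NUM_GPUS 3

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                 /* one MPI process per node */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel num_threads(NUM_GPUS)
    {
      cudaSetDevice(omp_get_thread_num());  /* one host thread per GPU  */
      /* ... stage this node's share of the data, launch CUDA kernels,
         exchange halos with neighboring ranks over MPI ... */
    }
    MPI_Finalize();
    return 0;
  }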
48. 3D Discretized wave equation input code (RTM)
Not so naïve ...
[Figure: annotated excerpt of the generated RTM code, pointing out communication autotuning knobs and noting that threadIdx.x divergence is expensive; a sketch of why divergence hurts follows]
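Why threadIdx.x divergence is expensive (my illustrative kernel, not the generated RTM code): the threads of a warp execute in lockstep, so a branch that splits a warp serializes both paths.

  __global__ void diverge(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)   /* splits every warp: both sides run */
      y[i] = 2.0f * x[i];
    else
      y[i] = x[i] + 1.0f;
  }

Branching on a warp-uniform value (e.g., blockIdx.x) avoids the serialization.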
50. Current status
•• Ongoing development also supported by DOE and Reservoir
• Improvements in scope, stability, performance
•• Installations/evaluations at US government laboratories
•• Forward collaboration with Georgia Tech on Keeneland
• HP SL390: 3 Fermi GPUs and 2 Westmere CPUs per node
•• Basis of the compiler for the DARPA UHPC Intel Corporation team
51. Availability
•• Per-developer-seat licensing model
•• Support (releases, bug fixes, services)
•• Available with commercial grade external solvers
•• Government has limited rights / SBIR data rights
•• Academic source licenses with collaborators
•• Professional team, continuity, software engineering
52. R-Stream Gives More Insight About Programs
•• Teaches you about parallelism and the polyhedral model:
• Generates correct code for any transformation
– The transformation itself may be incorrect if specified by the user
• Imperative code generation was the bottleneck until 2005
– 3 theses written on the topic, and it is still not completely covered
– It is good to have an intuition of how these things work
•• R-Stream has meaningful metrics to represent your program:
• Maximal amount of parallelism given minimal expansion
• Tradeoffs between coarse-grained and fine-grained parallelism
• Loop types (doall, red, perm, seq)
•• Helps with algorithm selection
•• Tiling of imperfectly nested loops
•• Generates code for explicitly managed memory
53. How to Use R-Stream Successfully
•• R-Stream is a great transformation tool; it is also a great learning tool:
• Takes “simple C” input code
• Applies multiple transformations and lists them explicitly
• Can dump code at any step in the process
• It's the tool I wish I had during my PhD
– To be fair, I already had a great set of tools
•• R-Stream can be used in multiple modes:
• Fully automatic + compile-flag options
• Autotuning mode (more than just tile size and unrolling ...)
• Scripting / programmatic mode (BeanShell + interfaces)
• A mix of these modes + manual post-processing
54. Use Case: Fully Automatic Mode + Compile Flag Options
•• Akin to traditional gcc / icc compilation with flags:
• Predefined transformations can be parameterized
– Except with fewer or no phase-ordering issues
• Except you can see what each transformation does at each step
• Except you can generate compilable and executable code at (almost) any step in the process
• Except you can control code generation for compactness or performance
55. Use Case: Autotuning Mode
•• More advanced than traditional approaches:
• Knobs go far beyond loop unrolling + unroll-and-jam + tiling
• Knobs are based on well-understood models
• Knobs target high-level properties of the program
– Amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations ...
• Knobs depend on the target machine, the program, and the state of the mapping process:
– Our tool has introspection
•• We are really building a hierarchical autotuning transformation tool
56. Use Case: “Power user” interactive interface
•• BeanShell access to optimizations
•• Can direct and review the process of compilation
• Automatic tactics (affine scheduling)
•• Can direct code generation
•• Access to “tuning parameters”
•• All options / commands available on the command-line interface
57. Conclusion
•• R-Stream simplifies software development and maintenance
•• Porting: reduces expense and delivery delays
•• Does this by automatically parallelizing loop code
• While optimizing for data locality, coalescing, etc.
•• Addresses
• Dense loop-intensive computations
•• Extensions
• Data-parallel programming idioms
• Sparse representations
• Dynamic runtime execution
58. Contact us
•• Per-developer seat, floating, cloud-based licensing
•• Discounts for academic users
•• Research collaborations with academic partners
•• For more information:
• Call us at 212-780-0527, or
• See Rich, Ann
• E-mail us at
– {sales,lethin,johnson}@reservoir.com