This document is a transcript of Andreas Klöckner's presentation on GPU programming in Python using PyOpenCL and PyCUDA. The presentation covers an introduction to OpenCL, programming with PyOpenCL, run-time code generation (RTCG), and perspectives on GPU programming in Python. OpenCL provides a common framework for heterogeneous parallel programming across CPUs, GPUs, and other processors; PyOpenCL and PyCUDA make GPU programming available from Python.
GPU Programming in Python with PyOpenCL and PyCUDA
Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA
Andreas Klöckner
Courant Institute of Mathematical Sciences
New York University
March 31, 2011
Thanks
Jan Hesthaven (Brown)
Tim Warburton (Rice)
Leslie Greengard (NYU)
PyOpenCL, PyCUDA contributors
Nvidia Corp., AMD Corp.
Outline

1 Introduction
    A Common Theme
    Intro to OpenCL
2 Programming with PyOpenCL
    First Contact
    About PyOpenCL
3 Run-Time Code Generation
    The Idea
    RTCG in Action
4 Perspectives
    PyCUDA
    DG-FEM on the GPU
    “Automatic” GPU Programming
    Conclusions
How are High-Performance Codes constructed?

“Traditional” construction of high-performance codes:
    C/C++/Fortran
    Libraries

“Alternative” construction of high-performance codes:
    Scripting for ‘brains’
    GPUs for ‘inner loops’

Play to the strengths of each programming environment.
What is OpenCL?

“OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming across CPUs, GPUs and other processors.” [OpenCL 1.1 spec]

    Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
    Vendor-neutral
    Comes with RTCG
    Defines:
        Host-side programming interface (library)
        Device-side programming language (!)

Big deal? Big deal!
CL vs CUDA side-by-side

CUDA source code:

    __global__ void transpose(
        float *A_t, float *A,
        int a_width, int a_height)
    {
        int base_idx_a =
            blockIdx.x * BLK_SIZE + blockIdx.y * A_BLOCK_STRIDE;
        int base_idx_a_t =
            blockIdx.y * BLK_SIZE + blockIdx.x * A_T_BLOCK_STRIDE;
        int glob_idx_a =
            base_idx_a + threadIdx.x + a_width * threadIdx.y;
        int glob_idx_a_t =
            base_idx_a_t + threadIdx.x + a_height * threadIdx.y;

        __shared__ float A_shared[BLK_SIZE][BLK_SIZE+1];
        A_shared[threadIdx.y][threadIdx.x] = A[glob_idx_a];
        __syncthreads();
        A_t[glob_idx_a_t] = A_shared[threadIdx.x][threadIdx.y];
    }

OpenCL source code:

    __kernel void transpose(
        __global float *a_t, __global float *a,
        unsigned a_width, unsigned a_height)
    {
        int base_idx_a =
            get_group_id(0) * BLK_SIZE + get_group_id(1) * A_BLOCK_STRIDE;
        int base_idx_a_t =
            get_group_id(1) * BLK_SIZE + get_group_id(0) * A_T_BLOCK_STRIDE;
        int glob_idx_a =
            base_idx_a + get_local_id(0) + a_width * get_local_id(1);
        int glob_idx_a_t =
            base_idx_a_t + get_local_id(0) + a_height * get_local_id(1);

        __local float a_local[BLK_SIZE][BLK_SIZE+1];
        a_local[get_local_id(1)][get_local_id(0)] = a[glob_idx_a];
        barrier(CLK_LOCAL_MEM_FENCE);
        a_t[glob_idx_a_t] = a_local[get_local_id(0)][get_local_id(1)];
    }
OpenCL ↔ CUDA: A dictionary

    OpenCL                          CUDA
    Grid                            Grid
    Work Group                      Block
    Work Item                       Thread
    __kernel                        __global__
    __global                        __device__
    __local                         __shared__
    __private                       __local__
    image2d_t / image3d_t           texture<type, n, ...>
    barrier(CLK_LOCAL_MEM_FENCE)    __syncthreads()
    get_local_id(0/1/2)             threadIdx.x/y/z
    get_group_id(0/1/2)             blockIdx.x/y/z
    get_global_id(0/1/2)            – (reimplement)
OpenCL: Execution Model

[Figure: an nD grid of work groups; each work group is itself a 2D array of work items.]

Two-tiered parallelism:
    Grid = Nx × Ny × Nz work groups
    Work group = Sx × Sy × Sz work items
    Total: ∏_{i ∈ {x,y,z}} S_i N_i work items
    Communication/synchronization only within a work group
    A work group maps to a compute unit
    Grid/group ≈ outer loops in an algorithm

Device language:
    get_{global,group,local}_{id,size}(axis)
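To make the grid/work-group structure concrete, here is a small PyOpenCL sketch (not from the slides; the kernel name fill_ids and the sizes are made up for illustration):

    import numpy
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    group_size = (4, 4)     # S_x, S_y: work items per group
    n_groups = (3, 2)       # N_x, N_y: groups in the grid
    global_size = (n_groups[0]*group_size[0], n_groups[1]*group_size[1])

    prg = cl.Program(ctx, """
        __kernel void fill_ids(__global int *out)
        {
            // get_global_id(a) == get_group_id(a)*get_local_size(a) + get_local_id(a)
            out[get_global_id(0) + get_global_size(0)*get_global_id(1)]
                = get_group_id(0);
        }
        """).build()

    out = numpy.empty(global_size[0]*global_size[1], dtype=numpy.int32)
    out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, size=out.nbytes)
    prg.fill_ids(queue, global_size, group_size, out_buf)
    cl.enqueue_read_buffer(queue, out_buf, out).wait()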
OpenCL: Computing as a Service

[Figure, built up over a sequence of slides: a host connected to a hierarchy of compute resources.]

    Host (CPU), with its own memory, talks to one or more platforms.
    Platform (e.g. Platform 0: CPUs, Platform 1: GPUs): one vendor's
        OpenCL implementation.
    Compute Device (Device 0, 1, ... per platform): think “chip”,
        has a memory interface and its own memory.
    Compute Unit (several per device): think “processor”,
        has instruction fetch.
    Processing Element (several per compute unit): think “SIMD lane”.

Python drives the host-side API; the device language is ∼ C99.
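From Python, the same hierarchy can be inspected directly; a minimal sketch using PyOpenCL's platform/device queries:

    import pyopencl as cl

    for platform in cl.get_platforms():
        print platform.name
        for dev in platform.get_devices():
            print "  %s: %d compute units" % (dev.name, dev.max_compute_units)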
OpenCL Object Diagram

[Figure 2.1 of the OpenCL spec: the OpenCL UML class diagram. Credit: Khronos Group]
Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:
    Highly parallel
    Very architecture-sensitive
    Built for maximum FP/memory throughput

→ The two complement each other:
    The CPU is largely restricted to control tasks (∼1000/sec),
    and scripting is fast enough for those.

Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL
Dive into PyOpenCL

    import pyopencl as cl, numpy

    a = numpy.random.rand(256**3).astype(numpy.float32)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
    cl.enqueue_write_buffer(queue, a_dev, a)

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a)   // compute kernel
        { a[get_global_id(0)] *= 2; }
        """).build()

    prg.twice(queue, a.shape, (1,), a_dev)
Dive into PyOpenCL (continued)

The same kernel, now run with 256-item work groups, plus readback and verification:

    a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
    cl.enqueue_write_buffer(queue, a_dev, a)

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a)
        { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
        """).build()

    prg.twice(queue, a.shape, (256,), a_dev)

    result = numpy.empty_like(a)
    cl.enqueue_read_buffer(queue, a_dev, result).wait()

    import numpy.linalg as la
    assert la.norm(result - 2*a) == 0
PyOpenCL: Completeness
PyOpenCL exposes all of OpenCL.
For example:
Every GetInfo() query
Images and Samplers
Memory Maps
Profiling and Synchronization
GL Interop
PyOpenCL: Completeness
PyOpenCL supports (nearly) every OS that has an OpenCL implementation:
Linux
OS X
Windows
Automatic Cleanup
Reachable objects (memory, streams, ...) are never destroyed.
Once unreachable, they are released at an unspecified future time.
Scarce resources (memory) can be explicitly freed: obj.release()
Correctly deals with multiple contexts and dependencies
(based on OpenCL’s reference counting).
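A minimal sketch of explicit freeing (assuming a context ctx as above; the buffer size is illustrative):

    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=2**20)
    # ... use buf ...
    buf.release()   # free the device memory now, instead of waiting for GC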
PyOpenCL: Documentation

[Screenshot: the PyOpenCL reference documentation]
PyOpenCL Philosophy
Provide complete access
Automatically manage resources
Provide abstractions
Allow interactive use
Check for and report errors automatically
Integrate tightly with numpy
PyOpenCL, PyCUDA: Vital Information

    http://mathema.tician.de/software/pyopencl (or /pycuda)
    Complete documentation
    X Consortium License (no warranty, free for all use)
    Convenient abstractions: Arrays, Elementwise op., Reduction, Scan
    Require: numpy, Python 2.4+ (Win/OS X/Linux)
    Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
Capturing Dependencies

Example computation:

    B = f(A)
    C = g(B)
    E = f(C)
    F = h(C)
    G = g(E, F)
    P = p(B)
    Q = q(B)
    R = r(G, P, Q)

[Figure: the corresponding dependency DAG from A down to R.]

    Switch the queue to out-of-order mode!
    Specify dependencies as a list of events, using the optional
        wait_for= keyword of enqueue_XXX.
    Can also enqueue a barrier.
    Common use case: transmit/receive from other MPI ranks.
    Possible on Nv Fermi: submit parallel work to increase machine use.
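A sketch of what this looks like in PyOpenCL (the kernels f and g and the buffers are hypothetical; the queue property and wait_for= keyword are PyOpenCL's):

    queue = cl.CommandQueue(ctx, properties=
        cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

    evt_b = prg.f(queue, shape, None, a_dev, b_dev)
    # g(B) may only start once f(A) has completed:
    evt_c = prg.g(queue, shape, None, b_dev, c_dev, wait_for=[evt_b])
    # or hold back everything enqueued so far:
    cl.enqueue_barrier(queue)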
Metaprogramming

Idea: In GPU scripting, GPU code does not need to be a compile-time constant. (Key: Code is data; it wants to be reasoned about at run time.)

[Figure, built up over several slides: the RTCG pipeline.]

    Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result

    The Python code may come from a human or be machine-generated;
    Python is good for code generation.
    PyCUDA and PyOpenCL take care of everything from GPU source code
    to result.
Machine-generated Code
Why machine-generate code?
Automated Tuning
(cf. ATLAS, FFTW)
Data types
Specialize code for given problem
Constants faster than variables
(→ register pressure)
Loop Unrolling
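As a sketch of the "constants faster than variables" point: baking a problem size into the kernel source as a literal (the kernel name scale and size n are illustrative, not from the slides):

    n = 1024
    prg = cl.Program(ctx, """
        __kernel void scale(__global float *a)
        {
            // n appears as a compile-time literal, reducing register pressure
            for (unsigned i = get_global_id(0); i < %(n)d;
                    i += get_global_size(0))
                a[i] *= 2;
        }
        """ % {"n": n}).build()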
PyOpenCL: Support for Metaprogramming

Three (main) ways of generating code:
    Simple %-operator substitution
        (combine with the C preprocessor: simple, often sufficient)
    Use a templating engine (Mako works very well)
    codepy: build C syntax trees from Python;
        generates readable, indented C

Many ways of evaluating the generated code. The most important one:
exact device timing via events.
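A sketch of event-based timing (PROFILING_ENABLE and the event's profile attribute are PyOpenCL's; the kernel is the twice example from earlier):

    queue = cl.CommandQueue(ctx,
        properties=cl.command_queue_properties.PROFILING_ENABLE)

    evt = prg.twice(queue, a.shape, (256,), a_dev)
    evt.wait()
    # profiling timestamps are in nanoseconds
    print "kernel ran in %g s" % (1e-9*(evt.profile.end - evt.profile.start))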
PyOpenCL Arrays: General Usage

Remember your first PyOpenCL program? Abstraction is good:

    import numpy
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
    a_doubled = (2*a_gpu).get()
    print a_doubled
    print a_gpu
pyopencl.array: Simple Linear Algebra

pyopencl.array.Array:
    Meant to look and feel just like numpy.
    p.a.to_device(ctx, queue, numpy_array)
    numpy_array = ary.get()
    +, -, *, /, fill, sin, arange, exp, rand, ...
    Mixed types (int32 + float32 = float64)
    print cl_array for debugging.
    Allows access to raw bits
    Use as kernel arguments, memory maps
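A two-line sketch of the numpy-like feel (to_device's (ctx, queue, ...) signature follows the slides' PyOpenCL version):

    a_gpu = cl_array.to_device(ctx, queue,
        numpy.random.randn(5).astype(numpy.float32))
    print (2*a_gpu + 1).get()   # arithmetic runs on the device; .get() returns numpy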
pyopencl.elementwise: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

    n = 10000
    a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(n).astype(numpy.float32))
    b_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(n).astype(numpy.float32))

    from pyopencl.elementwise import ElementwiseKernel
    lin_comb = ElementwiseKernel(ctx,
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")

    c_gpu = cl_array.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

    import numpy.linalg as la
    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
RTCG via Substitution

    source = ("""
        __kernel void %(name)s(%(arguments)s)
        {
            unsigned lid = get_local_id(0);
            unsigned gsize = get_global_size(0);
            unsigned work_item_start = get_local_size(0)*get_group_id(0);
            for (unsigned i = work_item_start + lid; i < n; i += gsize)
            {
                %(operation)s;
            }
        }
        """ % {
            "arguments": ", ".join(arg.declarator() for arg in arguments),
            "operation": operation,
            "name": name,
            "loop_prep": loop_prep,
            })

    prg = cl.Program(ctx, source).build()
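For concreteness, one way the inputs above might be filled in (a sketch; ScalarArg/VectorArg with their declarator() method live in pyopencl.tools, while the kernel name and expression are hypothetical):

    from pyopencl.tools import ScalarArg, VectorArg

    name = "lin_comb"
    operation = "z[i] = a*x[i] + b*y[i]"
    loop_prep = ""
    arguments = [
        ScalarArg(numpy.float32, "a"), VectorArg(numpy.float32, "x"),
        ScalarArg(numpy.float32, "b"), VectorArg(numpy.float32, "y"),
        VectorArg(numpy.float32, "z"), ScalarArg(numpy.uint32, "n"),
        ]   # declarator() renders e.g. "__global float *x"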
RTCG via Templates

    from mako.template import Template

    tpl = Template("""
        __kernel void add(
            __global ${ type_name } *tgt,
            __global const ${ type_name } *op1,
            __global const ${ type_name } *op2)
        {
            int idx = get_local_id(0)
                + ${ local_size } * ${ thread_strides }
                * get_group_id(0);

            % for i in range(thread_strides):
                <% offset = i*local_size %>
                tgt[idx + ${ offset }] =
                    op1[idx + ${ offset }]
                    + op2[idx + ${ offset }];
            % endfor
        }""")

    rendered_tpl = tpl.render(type_name="float",
        local_size=local_size, thread_strides=thread_strides)

    knl = cl.Program(ctx, str(rendered_tpl)).build().add
pyopencl.reduction: Reduction made easy

Example: A dot product calculation

    from pyopencl.reduction import ReductionKernel
    dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
            reduce_expr="a+b", map_expr="x[i]*y[i]",
            arguments="__global const float *x, __global const float *y")

    import pyopencl.clrandom as cl_rand
    x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
    y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
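One might finish with a sanity check along these lines (the tolerance is a guess; float32 accumulation order differs between device and host):

    assert abs(x_dot_y - x_dot_y_cpu) < 1e-4 * abs(x_dot_y_cpu)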
pyopencl.scan: Scan made easy

Example: A cumulative sum computation

    from pyopencl.scan import InclusiveScanKernel
    knl = InclusiveScanKernel(ctx, np.int32, "a+b")

    n = 2**20 - 2**18 + 5
    host_data = np.random.randint(0, 10, n).astype(np.int32)
    dev_data = cl_array.to_device(queue, host_data)

    knl(dev_data)
    assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
Whetting your appetite

    import pycuda.driver as cuda
    import pycuda.autoinit, pycuda.compiler
    import numpy

    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]
Whetting your appetite (continued)

    mod = pycuda.compiler.SourceModule("""
        __global__ void twice(float *a)   // compute kernel
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

    func = mod.get_function("twice")
    func(a_gpu, block=(4,4,1))

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print a_doubled
    print a
PyOpenCL ↔ PyCUDA: A (rough) dictionary

    PyOpenCL                       PyCUDA
    Context                        Context
    CommandQueue                   Stream
    Buffer                         mem_alloc / DeviceAllocation
    Program                        SourceModule
    Kernel                         Function
    Event (e.g. enqueue_marker)    Event
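Side by side, building and launching a kernel (a sketch; src and src_cuda stand for kernel sources like the twice examples above):

    # PyOpenCL
    knl = cl.Program(ctx, src).build().twice
    knl(queue, (256,), None, a_dev)

    # PyCUDA
    func = pycuda.compiler.SourceModule(src_cuda).get_function("twice")
    func(a_gpu, block=(256,1,1))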
Discontinuous Galerkin Method

Let Ω := ⋃_i D^k ⊂ ℝ^d.

Goal: Solve a conservation law on Ω:

    u_t + ∇·F(u) = 0

Example: Maxwell's equations. EM field E(x, t), H(x, t) on Ω, governed by

    ∂_t E − (1/ε) ∇×H = −j/ε,    ∂_t H + (1/µ) ∇×E = 0,
    ∇·E = ρ/ε,                   ∇·H = 0.
Metaprogramming DG: Flux Terms

    0 = ∫_{D^k} [u_t φ + (∇·F(u)) φ] dx − ∮_{∂D^k} [n̂·F − (n̂·F)*] φ dS_x

where the surface integral is the flux term.

Flux terms:
    vary by problem
    expression specified by user
    evaluated pointwise
Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell's equations

    n̂·(F − F*)_E := (1/2) n̂ × (⟦H⟧ − α n̂ × ⟦E⟧)

(⟦·⟧ denotes the jump across the element face, i.e. interior minus exterior value.)

User writes: a vectorial statement in mathematical notation

    flux = 1/2*cross(normal, h.int-h.ext
        -alpha*cross(normal, e.int-e.ext))

We generate: a scalar evaluator in C (6×)

    a_flux += (
        (((val_a_field5 - val_b_field5)*fpair->normal[2]
            - (val_a_field4 - val_b_field4)*fpair->normal[0])
            + val_a_field0 - val_b_field0)*fpair->normal[0]
        - (((val_a_field4 - val_b_field4)*fpair->normal[1]
            - (val_a_field1 - val_b_field1)*fpair->normal[2])
            + val_a_field3 - val_b_field3)*fpair->normal[1]
        )*value_type(0.5);
Loop Slicing for element-local parts of GPU DG

Per block: K_L element-local matrix multiplications + matrix load (preparation).

Question: How should one assign work to threads?

    w_s: in sequence (amortizes preparation)
    w_i: “inline-parallel” (exploits register space)
    w_p: in parallel

[Figure: three thread-vs-time layouts illustrating w_s, w_i and w_p.]
Loop Slicing for Differentiation

[Plot: execution time in ms for local differentiation (matrix-in-shared, order 4, with microblocking) as a function of w_p and w_s; point size denotes w_i ∈ {1, ..., 4}.]
Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400

[Plot: GFlops/s vs. polynomial order N; the GPU curve lies far above the single CPU core across all orders.]
Memory Bandwidth on a GTX 280

[Plot: global memory bandwidth in GB/s vs. polynomial order N for the Gather, Lift, Diff and Assembly kernels, compared against the card's peak bandwidth.]
GPU DG Showcase

    Electromagnetism
    Poisson
    CFD
Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone.
Obvious idea: let the computer do it.

One way: smart compilers
    GPU programming requires complex tradeoffs
    Tradeoffs require heuristics
    Heuristics are fragile

Another way: dumb enumeration
    Enumerate loop slicings
    Enumerate prefetch options
    Choose by running the resulting code on actual hardware
Loo.py Example

Empirical GPU loop optimization:

    a, b, c, i, j, k = [var(s) for s in "abcijk"]
    n = 500
    k = make_loop_kernel([
        LoopDimension("i", n),
        LoopDimension("j", n),
        LoopDimension("k", n),
        ], [
        (c[i+n*j], a[i+n*k]*b[k+n*j])
        ])

    gen_kwargs = {
        "min_threads": 128,
        "min_blocks": 32,
        }

→ Ideal case: finds a 160 GF/s kernel without human intervention.
Loo.py Status

Limited scope:
    Requires input/output separation
    Kernels must be expressible using the “loopy” model
        (i.e. indices decompose into “output” and “reduction”)
    Enough for DG, LA, FD, ...

Kernel compilation limits the trial rate.
Non-goal: peak performance.
Good results currently for dense linear algebra and (some) DG subkernels.
Where to from here?

PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/

GPU RTCG:
    AK, N. Pinto et al., “PyCUDA: GPU Run-Time Code Generation for
    High-Performance Computing”, submitted.

GPU-DG article:
    AK, T. Warburton, J. Bridge, J. S. Hesthaven, “Nodal Discontinuous
    Galerkin Methods on Graphics Processors”, J. Comp. Phys. 228 (21),
    7863-7882.

Also: an intro in GPU Computing Gems Vol. 2.
Conclusions

    GPUs: an architecture choice that is now widely available
    Fun time to be in computational science
    GPUs and scripting work surprisingly well together:
        exploit a natural task decomposition in computational codes
    RTCG: a crucial tool
    GPU scripting is great for prototyping
        ... and just as suitable for production code
Questions?

Thank you for your attention!
http://www.cims.nyu.edu/~kloeckner/
Image Credits
Dictionary: sxc.hu/topfer
C870 GPU: Nvidia Corp.
OpenCL Logo: Apple Corp./Ars Technica
OS Platforms: flickr.com/aOliN.Tk
Old Books: flickr.com/ppdigital
Floppy disk: flickr.com/ethanhein
Machine: flickr.com/13521837@N00
Adding Machine: flickr.com/thomashawk
Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

[Plot: flop rates (GFlops/s) vs. polynomial order N, comparing 16 GPUs against 64 CPU cores.]
OpenCL Implementations
The Nvidia CL implementation

Targets only GPUs.

Notes:
    Nearly identical to CUDA
    No native C-level JIT in CUDA (→ PyCUDA)
    Page-locked memory: use CL_MEM_ALLOC_HOST_PTR.
        Careful: double meaning.
        Genuinely overlapped transfers need page-locked memory.
    No linear memory texturing
    CUDA device emulation mode is deprecated
        → use the AMD CPU CL implementation (faster, too!)
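A sketch of requesting page-locked (“pinned”) staging memory via that flag from PyOpenCL (the buffer size is illustrative):

    mf = cl.mem_flags
    staging = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=2**20)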
The Apple CL implementation

Targets CPUs and GPUs.

General notes:
    Different header name: OpenCL/cl.h instead of CL/cl.h
    Use -framework OpenCL for C access
    Beware of the imperfect compiler-cache implementation
        (it ignores include files)

CPU notes:
    One work item per processor

GPU notes:
    Similar to the hardware vendor's implementation
    (New: Intel with Sandy Bridge)
The AMD CL implementation

Targets CPUs and GPUs (from both AMD and Nvidia).

GPU notes:
    Wide SIMD groups (64)
    Native 4/5-wide vectors
    But: a very flop-heavy machine; vectors may be ignored for
        memory-bound workloads
    → Both implicit and explicit SIMD

CPU notes:
    Many work items per processor (emulated)

General:
    cl_amd_printf