This document is a transcript of Andreas Klöckner's presentation on GPU programming in Python using PyOpenCL and PyCUDA. The presentation covers an introduction to OpenCL, programming with PyOpenCL, run-time code generation (RTCG), and perspectives on GPU programming in Python. OpenCL provides a common framework for heterogeneous parallel programming across CPUs, GPUs, and other processors; PyOpenCL and PyCUDA make GPU programming available from Python.
GPU Programming in Python with PyOpenCL and PyCUDA
Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA
Andreas Klöckner
Courant Institute of Mathematical Sciences
New York University
March 31, 2011
Thanks
Jan Hesthaven (Brown)
Tim Warburton (Rice)
Leslie Greengard (NYU)
PyOpenCL, PyCUDA contributors
Nvidia Corp., AMD Corp.
Outline

1 Introduction
    A Common Theme
    Intro to OpenCL
2 Programming with PyOpenCL
    First Contact
    About PyOpenCL
3 Run-Time Code Generation
    The Idea
    RTCG in Action
4 Perspectives
    PyCUDA
    DG-FEM on the GPU
    “Automatic” GPU Programming
    Conclusions
How are High-Performance Codes constructed?

“Traditional” construction of high-performance codes:
    C/C++/Fortran
    Libraries

“Alternative” construction of high-performance codes:
    Scripting for ‘brains’
    GPUs for ‘inner loops’

Play to the strengths of each programming environment.
What is OpenCL?

“OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming across CPUs, GPUs and other processors.” [OpenCL 1.1 spec]

    Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
    Vendor-neutral
    Comes with RTCG
    Defines:
        Host-side programming interface (library)
        Device-side programming language (!)

Big deal? Big deal!
CL vs CUDA side-by-side

CUDA source code:

    __global__ void transpose(
        float *A_t, float *A,
        int a_width, int a_height)
    {
        int base_idx_a =
            blockIdx.x * BLK_SIZE + blockIdx.y * A_BLOCK_STRIDE;
        int base_idx_a_t =
            blockIdx.y * BLK_SIZE + blockIdx.x * A_T_BLOCK_STRIDE;
        int glob_idx_a =
            base_idx_a + threadIdx.x + a_width * threadIdx.y;
        int glob_idx_a_t =
            base_idx_a_t + threadIdx.x + a_height * threadIdx.y;

        __shared__ float A_shared[BLK_SIZE][BLK_SIZE+1];
        A_shared[threadIdx.y][threadIdx.x] = A[glob_idx_a];
        __syncthreads();
        A_t[glob_idx_a_t] = A_shared[threadIdx.x][threadIdx.y];
    }

OpenCL source code:

    __kernel void transpose(
        __global float *a_t, __global float *a,
        unsigned a_width, unsigned a_height)
    {
        int base_idx_a =
            get_group_id(0) * BLK_SIZE + get_group_id(1) * A_BLOCK_STRIDE;
        int base_idx_a_t =
            get_group_id(1) * BLK_SIZE + get_group_id(0) * A_T_BLOCK_STRIDE;
        int glob_idx_a =
            base_idx_a + get_local_id(0) + a_width * get_local_id(1);
        int glob_idx_a_t =
            base_idx_a_t + get_local_id(0) + a_height * get_local_id(1);

        __local float a_local[BLK_SIZE][BLK_SIZE+1];
        a_local[get_local_id(1)][get_local_id(0)] = a[glob_idx_a];
        barrier(CLK_LOCAL_MEM_FENCE);
        a_t[glob_idx_a_t] = a_local[get_local_id(0)][get_local_id(1)];
    }
OpenCL ↔ CUDA: A dictionary

    OpenCL                          CUDA
    Grid                            Grid
    Work Group                      Block
    Work Item                       Thread
    __kernel                        __global__
    __global                        __device__
    __local                         __shared__
    __private                       __local__
    image2d_t / image3d_t           texture<type, n, ...>
    barrier(CLK_LOCAL_MEM_FENCE)    __syncthreads()
    get_local_id(0/1/2)             threadIdx.x/y/z
    get_group_id(0/1/2)             blockIdx.x/y/z
    get_global_id(0/1/2)            – (reimplement)
OpenCL: Execution Model

[Figure: an nD grid of work groups; each work group is itself a 2D array of work items.]

Two-tiered parallelism:
    Grid = Nx × Ny × Nz work groups
    Work group = Sx × Sy × Sz work items
    Total: ∏_{i ∈ {x,y,z}} S_i N_i work items
    Communication/synchronization only within a work group
    A work group maps to a compute unit
    Grid/group ≈ outer loops in an algorithm

Device language:
    get_{global,group,local}_{id,size}(axis)
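To make the grid/work-group structure concrete, here is a small PyOpenCL sketch (not from the slides; the kernel name fill_ids and the sizes are made up for illustration):

    import numpy
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    group_size = (4, 4)     # S_x, S_y: work items per group
    n_groups = (3, 2)       # N_x, N_y: groups in the grid
    global_size = (n_groups[0]*group_size[0], n_groups[1]*group_size[1])

    prg = cl.Program(ctx, """
        __kernel void fill_ids(__global int *out)
        {
            // get_global_id(a) == get_group_id(a)*get_local_size(a) + get_local_id(a)
            out[get_global_id(0) + get_global_size(0)*get_global_id(1)]
                = get_group_id(0);
        }
        """).build()

    out = numpy.empty(global_size[0]*global_size[1], dtype=numpy.int32)
    out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, size=out.nbytes)
    prg.fill_ids(queue, global_size, group_size, out_buf)
    cl.enqueue_read_buffer(queue, out_buf, out).wait()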
OpenCL: Computing as a Service

[Figure, built up over a sequence of slides: a host connected to a hierarchy of compute resources.]

    Host (CPU), with its own memory, talks to one or more platforms.
    Platform (e.g. Platform 0: CPUs, Platform 1: GPUs): one vendor's
        OpenCL implementation.
    Compute Device (Device 0, 1, ... per platform): think “chip”,
        has a memory interface and its own memory.
    Compute Unit (several per device): think “processor”,
        has instruction fetch.
    Processing Element (several per compute unit): think “SIMD lane”.

Python drives the host-side API; the device language is ∼ C99.
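From Python, the same hierarchy can be inspected directly; a minimal sketch using PyOpenCL's platform/device queries:

    import pyopencl as cl

    for platform in cl.get_platforms():
        print platform.name
        for dev in platform.get_devices():
            print "  %s: %d compute units" % (dev.name, dev.max_compute_units)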
OpenCL Object Diagram

[Figure 2.1 of the OpenCL spec: the OpenCL UML class diagram. Credit: Khronos Group]
Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:
    Highly parallel
    Very architecture-sensitive
    Built for maximum FP/memory throughput

→ The two complement each other:
    The CPU is largely restricted to control tasks (∼1000/sec),
    and scripting is fast enough for those.

Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL
Dive into PyOpenCL

    import pyopencl as cl, numpy

    a = numpy.random.rand(256**3).astype(numpy.float32)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
    cl.enqueue_write_buffer(queue, a_dev, a)

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a)   // compute kernel
        { a[get_global_id(0)] *= 2; }
        """).build()

    prg.twice(queue, a.shape, (1,), a_dev)
Dive into PyOpenCL (continued)

The same kernel, now run with 256-item work groups, plus readback and verification:

    a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
    cl.enqueue_write_buffer(queue, a_dev, a)

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a)
        { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
        """).build()

    prg.twice(queue, a.shape, (256,), a_dev)

    result = numpy.empty_like(a)
    cl.enqueue_read_buffer(queue, a_dev, result).wait()

    import numpy.linalg as la
    assert la.norm(result - 2*a) == 0
PyOpenCL: Completeness
PyOpenCL exposes all of OpenCL.
For example:
Every GetInfo() query
Images and Samplers
Memory Maps
Profiling and Synchronization
GL Interop
PyOpenCL: Completeness
PyOpenCL supports (nearly) every OS that has an OpenCL implementation:
Linux
OS X
Windows
Automatic Cleanup
Reachable objects (memory, streams, ...) are never destroyed.
Once unreachable, they are released at an unspecified future time.
Scarce resources (memory) can be explicitly freed: obj.release()
Correctly deals with multiple contexts and dependencies
(based on OpenCL’s reference counting).
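A minimal sketch of explicit freeing (assuming a context ctx as above; the buffer size is illustrative):

    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=2**20)
    # ... use buf ...
    buf.release()   # free the device memory now, instead of waiting for GC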
PyOpenCL: Documentation

[Screenshot: the PyOpenCL reference documentation]
PyOpenCL Philosophy
Provide complete access
Automatically manage resources
Provide abstractions
Allow interactive use
Check for and report errors automatically
Integrate tightly with numpy
PyOpenCL, PyCUDA: Vital Information

    http://mathema.tician.de/software/pyopencl (or /pycuda)
    Complete documentation
    X Consortium License (no warranty, free for all use)
    Convenient abstractions: Arrays, Elementwise op., Reduction, Scan
    Require: numpy, Python 2.4+ (Win/OS X/Linux)
    Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
Capturing Dependencies

Example computation:

    B = f(A)
    C = g(B)
    E = f(C)
    F = h(C)
    G = g(E, F)
    P = p(B)
    Q = q(B)
    R = r(G, P, Q)

[Figure: the corresponding dependency DAG from A down to R.]

    Switch the queue to out-of-order mode!
    Specify dependencies as a list of events, using the optional
        wait_for= keyword of enqueue_XXX.
    Can also enqueue a barrier.
    Common use case: transmit/receive from other MPI ranks.
    Possible on Nv Fermi: submit parallel work to increase machine use.
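A sketch of what this looks like in PyOpenCL (the kernels f and g and the buffers are hypothetical; the queue property and wait_for= keyword are PyOpenCL's):

    queue = cl.CommandQueue(ctx, properties=
        cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

    evt_b = prg.f(queue, shape, None, a_dev, b_dev)
    # g(B) may only start once f(A) has completed:
    evt_c = prg.g(queue, shape, None, b_dev, c_dev, wait_for=[evt_b])
    # or hold back everything enqueued so far:
    cl.enqueue_barrier(queue)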
Metaprogramming

Idea: In GPU scripting, GPU code does not need to be a compile-time constant. (Key: Code is data; it wants to be reasoned about at run time.)

[Figure, built up over several slides: the RTCG pipeline.]

    Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result

    The Python code may come from a human or be machine-generated;
    Python is good for code generation.
    PyCUDA and PyOpenCL take care of everything from GPU source code
    to result.
Machine-generated Code
Why machine-generate code?
Automated Tuning
(cf. ATLAS, FFTW)
Data types
Specialize code for given problem
Constants faster than variables
(→ register pressure)
Loop Unrolling
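As a sketch of the "constants faster than variables" point: baking a problem size into the kernel source as a literal (the kernel name scale and size n are illustrative, not from the slides):

    n = 1024
    prg = cl.Program(ctx, """
        __kernel void scale(__global float *a)
        {
            // n appears as a compile-time literal, reducing register pressure
            for (unsigned i = get_global_id(0); i < %(n)d;
                    i += get_global_size(0))
                a[i] *= 2;
        }
        """ % {"n": n}).build()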
PyOpenCL: Support for Metaprogramming

Three (main) ways of generating code:
    Simple %-operator substitution
        (combine with the C preprocessor: simple, often sufficient)
    Use a templating engine (Mako works very well)
    codepy: build C syntax trees from Python;
        generates readable, indented C

Many ways of evaluating the generated code. The most important one:
exact device timing via events.
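A sketch of event-based timing (PROFILING_ENABLE and the event's profile attribute are PyOpenCL's; the kernel is the twice example from earlier):

    queue = cl.CommandQueue(ctx,
        properties=cl.command_queue_properties.PROFILING_ENABLE)

    evt = prg.twice(queue, a.shape, (256,), a_dev)
    evt.wait()
    # profiling timestamps are in nanoseconds
    print "kernel ran in %g s" % (1e-9*(evt.profile.end - evt.profile.start))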
PyOpenCL Arrays: General Usage

Remember your first PyOpenCL program? Abstraction is good:

    import numpy
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
    a_doubled = (2*a_gpu).get()
    print a_doubled
    print a_gpu
pyopencl.array: Simple Linear Algebra

pyopencl.array.Array:
    Meant to look and feel just like numpy.
    p.a.to_device(ctx, queue, numpy_array)
    numpy_array = ary.get()
    +, -, *, /, fill, sin, arange, exp, rand, ...
    Mixed types (int32 + float32 = float64)
    print cl_array for debugging.
    Allows access to raw bits
    Use as kernel arguments, memory maps
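A two-line sketch of the numpy-like feel (to_device's (ctx, queue, ...) signature follows the slides' PyOpenCL version):

    a_gpu = cl_array.to_device(ctx, queue,
        numpy.random.randn(5).astype(numpy.float32))
    print (2*a_gpu + 1).get()   # arithmetic runs on the device; .get() returns numpy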
pyopencl.elementwise: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

    n = 10000
    a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(n).astype(numpy.float32))
    b_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(n).astype(numpy.float32))

    from pyopencl.elementwise import ElementwiseKernel
    lin_comb = ElementwiseKernel(ctx,
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")

    c_gpu = cl_array.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

    import numpy.linalg as la
    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
RTCG via Substitution

    source = ("""
        __kernel void %(name)s(%(arguments)s)
        {
            unsigned lid = get_local_id(0);
            unsigned gsize = get_global_size(0);
            unsigned work_item_start = get_local_size(0)*get_group_id(0);
            for (unsigned i = work_item_start + lid; i < n; i += gsize)
            {
                %(operation)s;
            }
        }
        """ % {
            "arguments": ", ".join(arg.declarator() for arg in arguments),
            "operation": operation,
            "name": name,
            "loop_prep": loop_prep,
            })

    prg = cl.Program(ctx, source).build()
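For concreteness, one way the inputs above might be filled in (a sketch; ScalarArg/VectorArg with their declarator() method live in pyopencl.tools, while the kernel name and expression are hypothetical):

    from pyopencl.tools import ScalarArg, VectorArg

    name = "lin_comb"
    operation = "z[i] = a*x[i] + b*y[i]"
    loop_prep = ""
    arguments = [
        ScalarArg(numpy.float32, "a"), VectorArg(numpy.float32, "x"),
        ScalarArg(numpy.float32, "b"), VectorArg(numpy.float32, "y"),
        VectorArg(numpy.float32, "z"), ScalarArg(numpy.uint32, "n"),
        ]   # declarator() renders e.g. "__global float *x"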
RTCG via Templates

    from mako.template import Template

    tpl = Template("""
        __kernel void add(
            __global ${ type_name } *tgt,
            __global const ${ type_name } *op1,
            __global const ${ type_name } *op2)
        {
            int idx = get_local_id(0)
                + ${ local_size } * ${ thread_strides }
                * get_group_id(0);

            % for i in range(thread_strides):
                <% offset = i*local_size %>
                tgt[idx + ${ offset }] =
                    op1[idx + ${ offset }]
                    + op2[idx + ${ offset }];
            % endfor
        }""")

    rendered_tpl = tpl.render(type_name="float",
        local_size=local_size, thread_strides=thread_strides)

    knl = cl.Program(ctx, str(rendered_tpl)).build().add
pyopencl.reduction: Reduction made easy

Example: A dot product calculation

    from pyopencl.reduction import ReductionKernel
    dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
            reduce_expr="a+b", map_expr="x[i]*y[i]",
            arguments="__global const float *x, __global const float *y")

    import pyopencl.clrandom as cl_rand
    x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
    y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
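One might finish with a sanity check along these lines (the tolerance is a guess; float32 accumulation order differs between device and host):

    assert abs(x_dot_y - x_dot_y_cpu) < 1e-4 * abs(x_dot_y_cpu)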
pyopencl.scan: Scan made easy

Example: A cumulative sum computation

    from pyopencl.scan import InclusiveScanKernel
    knl = InclusiveScanKernel(ctx, np.int32, "a+b")

    n = 2**20 - 2**18 + 5
    host_data = np.random.randint(0, 10, n).astype(np.int32)
    dev_data = cl_array.to_device(queue, host_data)

    knl(dev_data)
    assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
Whetting your appetite

    import pycuda.driver as cuda
    import pycuda.autoinit, pycuda.compiler
    import numpy

    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]
Whetting your appetite (continued)

    mod = pycuda.compiler.SourceModule("""
        __global__ void twice(float *a)   // compute kernel
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

    func = mod.get_function("twice")
    func(a_gpu, block=(4,4,1))

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print a_doubled
    print a
PyOpenCL ↔ PyCUDA: A (rough) dictionary

    PyOpenCL                       PyCUDA
    Context                        Context
    CommandQueue                   Stream
    Buffer                         mem_alloc / DeviceAllocation
    Program                        SourceModule
    Kernel                         Function
    Event (e.g. enqueue_marker)    Event
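Side by side, building and launching a kernel (a sketch; src and src_cuda stand for kernel sources like the twice examples above):

    # PyOpenCL
    knl = cl.Program(ctx, src).build().twice
    knl(queue, (256,), None, a_dev)

    # PyCUDA
    func = pycuda.compiler.SourceModule(src_cuda).get_function("twice")
    func(a_gpu, block=(256,1,1))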
Discontinuous Galerkin Method

Let Ω := ⋃_i D^k ⊂ ℝ^d.

Goal: Solve a conservation law on Ω:

    u_t + ∇·F(u) = 0

Example: Maxwell's equations. EM field E(x, t), H(x, t) on Ω, governed by

    ∂_t E − (1/ε) ∇×H = −j/ε,    ∂_t H + (1/µ) ∇×E = 0,
    ∇·E = ρ/ε,                   ∇·H = 0.
Metaprogramming DG: Flux Terms

    0 = ∫_{D^k} [u_t φ + (∇·F(u)) φ] dx − ∮_{∂D^k} [n̂·F − (n̂·F)*] φ dS_x

where the surface integral is the flux term.

Flux terms:
    vary by problem
    expression specified by user
    evaluated pointwise
Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell's equations

    n̂·(F − F*)_E := (1/2) n̂ × (⟦H⟧ − α n̂ × ⟦E⟧)

(⟦·⟧ denotes the jump across the element face, i.e. interior minus exterior value.)

User writes: a vectorial statement in mathematical notation

    flux = 1/2*cross(normal, h.int-h.ext
        -alpha*cross(normal, e.int-e.ext))

We generate: a scalar evaluator in C (6×)

    a_flux += (
        (((val_a_field5 - val_b_field5)*fpair->normal[2]
            - (val_a_field4 - val_b_field4)*fpair->normal[0])
            + val_a_field0 - val_b_field0)*fpair->normal[0]
        - (((val_a_field4 - val_b_field4)*fpair->normal[1]
            - (val_a_field1 - val_b_field1)*fpair->normal[2])
            + val_a_field3 - val_b_field3)*fpair->normal[1]
        )*value_type(0.5);
Loop Slicing for element-local parts of GPU DG

Per block: K_L element-local matrix multiplications + matrix load (preparation).

Question: How should one assign work to threads?

    w_s: in sequence (amortizes preparation)
    w_i: “inline-parallel” (exploits register space)
    w_p: in parallel

[Figure: three thread-vs-time layouts illustrating w_s, w_i and w_p.]
Loop Slicing for Differentiation

[Plot: execution time in ms for local differentiation (matrix-in-shared, order 4, with microblocking) as a function of w_p and w_s; point size denotes w_i ∈ {1, ..., 4}.]
Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400

[Plot: GFlops/s vs. polynomial order N; the GPU curve lies far above the single CPU core across all orders.]
Memory Bandwidth on a GTX 280

[Plot: global memory bandwidth in GB/s vs. polynomial order N for the Gather, Lift, Diff and Assembly kernels, compared against the card's peak bandwidth.]
GPU DG Showcase

    Electromagnetism
    Poisson
    CFD
Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone.
Obvious idea: let the computer do it.

One way: smart compilers
    GPU programming requires complex tradeoffs
    Tradeoffs require heuristics
    Heuristics are fragile

Another way: dumb enumeration
    Enumerate loop slicings
    Enumerate prefetch options
    Choose by running the resulting code on actual hardware
Loo.py Example

Empirical GPU loop optimization:

    a, b, c, i, j, k = [var(s) for s in "abcijk"]
    n = 500
    k = make_loop_kernel([
        LoopDimension("i", n),
        LoopDimension("j", n),
        LoopDimension("k", n),
        ], [
        (c[i+n*j], a[i+n*k]*b[k+n*j])
        ])

    gen_kwargs = {
        "min_threads": 128,
        "min_blocks": 32,
        }

→ Ideal case: finds a 160 GF/s kernel without human intervention.
Loo.py Status

Limited scope:
    Requires input/output separation
    Kernels must be expressible using the “loopy” model
        (i.e. indices decompose into “output” and “reduction”)
    Enough for DG, LA, FD, ...

Kernel compilation limits the trial rate.
Non-goal: peak performance.
Good results currently for dense linear algebra and (some) DG subkernels.
Where to from here?

PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/

GPU RTCG:
    AK, N. Pinto et al., “PyCUDA: GPU Run-Time Code Generation for
    High-Performance Computing”, submitted.

GPU-DG article:
    AK, T. Warburton, J. Bridge, J. S. Hesthaven, “Nodal Discontinuous
    Galerkin Methods on Graphics Processors”, J. Comp. Phys. 228 (21),
    7863-7882.

Also: an intro in GPU Computing Gems Vol. 2.
Conclusions

    GPUs: an architecture choice that is now widely available
    Fun time to be in computational science
    GPUs and scripting work surprisingly well together:
        exploit a natural task decomposition in computational codes
    RTCG: a crucial tool
    GPU scripting is great for prototyping
        ... and just as suitable for production code
Questions?

Thank you for your attention!
http://www.cims.nyu.edu/~kloeckner/
Image Credits
Dictionary: sxc.hu/topfer
C870 GPU: Nvidia Corp.
OpenCL Logo: Apple Corp./Ars Technica
OS Platforms: flickr.com/aOliN.Tk
Old Books: flickr.com/ppdigital
Floppy disk: flickr.com/ethanhein
Machine: flickr.com/13521837@N00
Adding Machine: flickr.com/thomashawk
Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

[Plot: flop rates (GFlops/s) vs. polynomial order N, comparing 16 GPUs against 64 CPU cores.]
OpenCL Implementations
The Nvidia CL implementation

Targets only GPUs.

Notes:
    Nearly identical to CUDA
    No native C-level JIT in CUDA (→ PyCUDA)
    Page-locked memory: use CL_MEM_ALLOC_HOST_PTR.
        Careful: double meaning.
        Genuinely overlapped transfers need page-locked memory.
    No linear memory texturing
    CUDA device emulation mode is deprecated
        → use the AMD CPU CL implementation (faster, too!)
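A sketch of requesting page-locked (“pinned”) staging memory via that flag from PyOpenCL (the buffer size is illustrative):

    mf = cl.mem_flags
    staging = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=2**20)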
The Apple CL implementation

Targets CPUs and GPUs.

General notes:
    Different header name: OpenCL/cl.h instead of CL/cl.h
    Use -framework OpenCL for C access
    Beware of the imperfect compiler-cache implementation
        (it ignores include files)

CPU notes:
    One work item per processor

GPU notes:
    Similar to the hardware vendor's implementation
    (New: Intel with Sandy Bridge)
The AMD CL implementation

Targets CPUs and GPUs (from both AMD and Nvidia).

GPU notes:
    Wide SIMD groups (64)
    Native 4/5-wide vectors
    But: a very flop-heavy machine; vectors may be ignored for
        memory-bound workloads
    → Both implicit and explicit SIMD

CPU notes:
    Many work items per processor (emulated)

General:
    cl_amd_printf