



          Easy, Effective, Efficient:
       GPU Programming in Python
       with PyOpenCL and PyCUDA

                       Andreas Klöckner

           Courant Institute of Mathematical Sciences
                      New York University


                       March 31, 2011






Thanks




     Jan Hesthaven (Brown)
     Tim Warburton (Rice)
     Leslie Greengard (NYU)
     PyOpenCL, PyCUDA contributors
     Nvidia Corp., AMD Corp.






Outline



   1 Introduction
          A Common Theme
          Intro to OpenCL

   2 Programming with PyOpenCL
          First Contact
          About PyOpenCL

   3 Run-Time Code Generation

   4 Perspectives






How are High-Performance Codes constructed?



     “Traditional” Construction of
     High-Performance Codes:
         C/C++/Fortran
         Libraries
     “Alternative” Construction of
     High-Performance Codes:
         Scripting for ‘brains’
         GPUs for ‘inner loops’
     Play to the strengths of each
     programming environment.








What is OpenCL?

 OpenCL (Open Computing Language) is an
 open, royalty-free standard for general purpose
 parallel programming across CPUs, GPUs and
 other processors.              [OpenCL 1.1 spec]

                                 Big deal!
      Device-neutral (Nvidia GPU, AMD GPU,
      Intel/AMD CPU)
      Vendor-neutral
      Comes with RTCG
  Defines:
      Host-side programming interface (library)
      Device-side programming language (!)



Who?
OpenCL Working Group
• Diverse industry participation
   - Processor vendors, system OEMs, middleware vendors, application developers
• Many industry-leading experts involved in OpenCL’s design
   - A healthy diversity of industry perspectives
• Apple made initial proposal and is very active in the working group
   - Serving as specification editor




                                                                             © Copyright Khronos Group, 2010 - Page 4




   Credit: Khronos Group



When?
 OpenCL Timeline
 • Six months from proposal to released OpenCL 1.0 specification
    - Due to a strong initial proposal and a shared commercial incentive
 • Multiple conformant implementations shipping
    - Apple’s Mac OS X Snow Leopard now ships with OpenCL
 • 18-month cadence between OpenCL 1.0 and OpenCL 1.1
    - Backwards compatibility protects software investment

 Timeline:
    - Jun08: Apple proposes OpenCL working group and contributes
      draft specification to Khronos
    - Dec08: Khronos publicly releases OpenCL 1.0 as a
      royalty-free specification
    - May09: Khronos releases OpenCL 1.0 conformance tests to
      ensure high-quality implementations
    - 2H09: Multiple conformant implementations ship across
      diverse OS and platforms
    - Jun10: OpenCL 1.1 specification released and first
      implementations ship




    Credit: Khronos Group



Why?
 Processor Parallelism

     CPUs: multiple cores driving performance increases;
     multi-processor programming, e.g. OpenMP
     GPUs: increasingly general-purpose data-parallel computing;
     graphics APIs and shading languages
     Emerging intersection: heterogeneous computing

        OpenCL is a programming framework for heterogeneous compute resources




  Credit: Khronos Group



CL vs CUDA side-by-side
  CUDA source code:

  __global__ void transpose(
      float *A_t, float *A,
      int a_width, int a_height)
  {
      int base_idx_a =
          blockIdx.x * BLK_SIZE +
          blockIdx.y * A_BLOCK_STRIDE;
      int base_idx_a_t =
          blockIdx.y * BLK_SIZE +
          blockIdx.x * A_T_BLOCK_STRIDE;

      int glob_idx_a =
          base_idx_a + threadIdx.x
          + a_width * threadIdx.y;
      int glob_idx_a_t =
          base_idx_a_t + threadIdx.x
          + a_height * threadIdx.y;

      __shared__ float A_shared[BLK_SIZE][BLK_SIZE+1];

      A_shared[threadIdx.y][threadIdx.x] =
          A[glob_idx_a];

      __syncthreads();

      A_t[glob_idx_a_t] =
          A_shared[threadIdx.x][threadIdx.y];
  }

  OpenCL source code:

  __kernel void transpose(
      __global float *a_t, __global float *a,
      unsigned a_width, unsigned a_height)
  {
      int base_idx_a =
          get_group_id(0) * BLK_SIZE +
          get_group_id(1) * A_BLOCK_STRIDE;
      int base_idx_a_t =
          get_group_id(1) * BLK_SIZE +
          get_group_id(0) * A_T_BLOCK_STRIDE;

      int glob_idx_a =
          base_idx_a + get_local_id(0)
          + a_width * get_local_id(1);
      int glob_idx_a_t =
          base_idx_a_t + get_local_id(0)
          + a_height * get_local_id(1);

      __local float a_local[BLK_SIZE][BLK_SIZE+1];

      a_local[get_local_id(1)][get_local_id(0)] =
          a[glob_idx_a];

      barrier(CLK_LOCAL_MEM_FENCE);

      a_t[glob_idx_a_t] =
          a_local[get_local_id(0)][get_local_id(1)];
  }
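The index arithmetic in these kernels can be checked without a GPU. Below is a pure-Python sketch of the same tiled transpose, with the group/item loops standing in for the device launch. `BLK_SIZE` and the two stride macros are assumptions, since the slide does not define them; the values chosen here make the arithmetic come out row-major.

```python
# Pure-Python stand-in for the tiled-transpose kernels above.
# BLK_SIZE, A_BLOCK_STRIDE and A_T_BLOCK_STRIDE are assumed values;
# the slide leaves them undefined.
BLK_SIZE = 4

def transpose(a, a_width, a_height):
    """a: flat row-major (a_height x a_width) list;
    returns flat row-major (a_width x a_height) transpose."""
    A_BLOCK_STRIDE = BLK_SIZE * a_width
    A_T_BLOCK_STRIDE = BLK_SIZE * a_height
    a_t = [0.0] * (a_width * a_height)
    for gx in range(a_width // BLK_SIZE):        # get_group_id(0) / blockIdx.x
        for gy in range(a_height // BLK_SIZE):   # get_group_id(1) / blockIdx.y
            base_idx_a = gx * BLK_SIZE + gy * A_BLOCK_STRIDE
            base_idx_a_t = gy * BLK_SIZE + gx * A_T_BLOCK_STRIDE
            # The __local / __shared__ tile, filled by the whole work group:
            tile = [[0.0] * BLK_SIZE for _ in range(BLK_SIZE)]
            for ly in range(BLK_SIZE):           # get_local_id(1) / threadIdx.y
                for lx in range(BLK_SIZE):       # get_local_id(0) / threadIdx.x
                    tile[ly][lx] = a[base_idx_a + lx + a_width * ly]
            # barrier(CLK_LOCAL_MEM_FENCE) / __syncthreads() happens here:
            # all loads into the tile are complete before any stores read it.
            for ly in range(BLK_SIZE):
                for lx in range(BLK_SIZE):
                    a_t[base_idx_a_t + lx + a_height * ly] = tile[lx][ly]
    return a_t
```

The padded second tile dimension (`BLK_SIZE+1` on the device) exists only to avoid shared-memory bank conflicts; the serial sketch omits it because it does not affect correctness.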




OpenCL ↔ CUDA: A dictionary


               OpenCL                         CUDA
               Grid                           Grid
               Work Group                     Block
               Work Item                      Thread
               __kernel                       __global__
               __global                       __device__
               __local                        __shared__
               __private                      local (per-thread)
               image2d_t, image3d_t           texture<type, n, ...>
               barrier(CLK_LOCAL_MEM_FENCE)   __syncthreads()
               get_local_id(0/1/2)            threadIdx.x/y/z
               get_group_id(0/1/2)            blockIdx.x/y/z
               get_global_id(0/1/2)           – (reimplement)
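As a toy illustration of how mechanical this dictionary is, here is a hypothetical helper (not part of PyOpenCL or CUDA) that translates the indexing built-ins according to the table:

```python
import re

def cuda_equivalent(call):
    """Map get_local_id(n)/get_group_id(n) to their CUDA spellings,
    per the OpenCL <-> CUDA dictionary above."""
    m = re.fullmatch(r"get_(local|group)_id\(([012])\)", call)
    if m is None:
        raise ValueError("only the indexing built-ins are handled: %r" % call)
    struct = {"local": "threadIdx", "group": "blockIdx"}[m.group(1)]
    return struct + "." + "xyz"[int(m.group(2))]
```

Note that `get_global_id` has no one-token CUDA counterpart; per the last table row it must be reimplemented, e.g. as `blockIdx.x * blockDim.x + threadIdx.x` for axis 0.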



OpenCL: Execution Model

  [Figure: an nD grid of work groups; each work group is a 2D array of work items]

  Two-tiered Parallelism
      Grid = Nx × Ny × Nz work groups
      Work group = Sx × Sy × Sz work items
      Total: ∏i∈{x,y,z} Si Ni work items
      Comm/Sync only within work group
      Work group maps to compute unit
  Grid/Group ≈ outer loops in an algorithm

  Device Language:
  get_{global,group,local}_{id,size}(axis)
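The identities behind these built-ins can be sketched in pure Python, no OpenCL required: along each axis, get_global_id(a) = get_group_id(a) · get_local_size(a) + get_local_id(a), and the launch comprises ∏ Si Ni work items in total. (The grid and group sizes below are illustrative, not from the slide.)

```python
from itertools import product

def launch(grid, group):
    """Enumerate the per-axis global ids of an nD launch, mimicking
    get_global_id(a) == get_group_id(a)*get_local_size(a) + get_local_id(a)."""
    ids = []
    for group_id in product(*(range(n) for n in grid)):       # get_group_id
        for local_id in product(*(range(s) for s in group)):  # get_local_id
            ids.append(tuple(g * s + l
                             for g, s, l in zip(group_id, group, local_id)))
    return ids

# A grid of Nx x Ny = 3 x 2 work groups, each Sx x Sy = 4 x 3 work items:
ids = launch(grid=(3, 2), group=(4, 3))
# Total = (3*4) * (2*3) = 72 work items, each with a distinct global id.
```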






OpenCL: Computing as a Service


       Host (CPU), with its own Memory, drives one or more Platforms:
           Platform 0 (e.g. CPUs), Platform 1 (e.g. GPUs), ...
       Each Platform contains Compute Devices 0, 1, ...
           (think “chip”, has memory interface), each with its own Memory
       Each Compute Device consists of Compute Units
           (think “processor”, has insn. fetch)
       Each Compute Unit consists of Processing Elements
           (think “SIMD lane”)

       The host side is scripted from Python.
       Device Language: ∼ C99



OpenCL Object Diagram




                            Figure 2.1 - OpenCL UML Class Diagram


   Credit: Khronos Group



Why do Scripting for GPUs?


     GPUs are everything that scripting
     languages are not.
         Highly parallel
         Very architecture-sensitive
         Built for maximum FP/memory
         throughput
     → complement each other
     CPU: largely restricted to control
     tasks (∼1000/sec)
         Scripting fast enough
     Python + CUDA = PyCUDA
     Python + OpenCL = PyOpenCL




Dive into PyOpenCL

 1   import pyopencl as cl, numpy
 2
 3   a = numpy.random.rand(256**3).astype(numpy.float32)
 4
 5   ctx = cl.create_some_context()
 6   queue = cl.CommandQueue(ctx)
 7
 8   a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
 9   cl.enqueue_write_buffer(queue, a_dev, a)
10
11   prg = cl.Program(ctx, """
12       __kernel void twice(__global float *a)
13       { a[get_global_id(0)] *= 2; }
14       """).build()
15
16   prg.twice(queue, a.shape, (1,), a_dev)

                               Andreas Kl¨ckner
                                         o        GPU-Python with PyOpenCL and PyCUDA
Intro PyOpenCL RTCG Perspectives    First Contact About PyOpenCL


Dive into PyOpenCL

import pyopencl as cl
import numpy

a = numpy.random.rand(256**3).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)      /* Compute kernel */
    { a[get_global_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (1,), a_dev)



Dive into PyOpenCL

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (256,), a_dev)

result = numpy.empty_like(a)
cl.enqueue_read_buffer(queue, a_dev, result).wait()

import numpy.linalg as la
assert la.norm(result - 2*a) == 0







PyOpenCL: Completeness



                          PyOpenCL exposes all of OpenCL.

                          For example:
                                 Every GetInfo() query
                                 Images and Samplers
                                 Memory Maps
                                 Profiling and Synchronization
                                 GL Interop






PyOpenCL: Completeness



 PyOpenCL supports (nearly)
 every OS that has an OpenCL
 implementation.

     Linux
     OS X
     Windows






Automatic Cleanup



       Reachable objects (memory, streams, ...) are never destroyed.
       Once unreachable, they are released at an unspecified future
       time.
       Scarce resources (memory) can be explicitly freed
       (obj.release()).
       Correctly deals with multiple contexts and dependencies
       (based on OpenCL's reference counting).






PyOpenCL: Documentation






PyOpenCL Philosophy



                                            Provide complete access
                                            Automatically manage resources
                                            Provide abstractions
                                            Allow interactive use
                                            Check for and report errors
                                            automatically
                                            Integrate tightly with numpy






PyOpenCL, PyCUDA: Vital Information


     http://mathema.tician.de/software/pyopencl (or /pycuda)
     Complete documentation
     X Consortium License (no warranty, free for all use)
     Convenient abstractions:
     arrays, elementwise operations, reduction, scan
     Require: numpy, Python 2.4+ (Win/OS X/Linux)
     Community: mailing list, wiki, add-on packages
     (FFT, scikits.cuda, ...)




Capturing Dependencies

   B = f(A)
   C = g(B)
   E = f(C)
   F = h(C)
   G = g(E,F)
   P = p(B)
   Q = q(B)
   R = r(G,P,Q)

   [Dependency graph: A → B; B → C, P, Q; C → E, F;
    E, F → G; G, P, Q → R]


Capturing Dependencies

        Switch the queue to out-of-order mode!
        Specify dependencies as a list of events using the optional
        wait_for keyword argument to enqueue_XXX.
        Can also enqueue a barrier.
        Common use case: transmit/receive from other MPI ranks.
        Possible on Nvidia Fermi: submit parallel work to increase
        machine use.
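These rules can be sketched in pure Python (no OpenCL needed; f, g, h, p, q, r are stand-in arithmetic functions, not part of PyOpenCL): each task declares the events it waits for, and an out-of-order scheduler may run any task whose dependencies have completed.

```python
# Toy model of an out-of-order queue: tasks declare wait-for
# dependencies, and any task whose inputs are done may run.
def run_out_of_order(tasks):
    done = {}
    while len(done) < len(tasks):
        for name, (func, deps) in tasks.items():
            if name not in done and all(d in done for d in deps):
                done[name] = func(*[done[d] for d in deps])
    return done

# The slide's graph, with stand-in arithmetic for f, g, h, p, q, r.
tasks = {
    "A": (lambda: 1, []),
    "B": (lambda a: a + 1, ["A"]),          # B = f(A)
    "C": (lambda b: 2 * b, ["B"]),          # C = g(B)
    "E": (lambda c: c + 1, ["C"]),          # E = f(C)
    "F": (lambda c: c - 1, ["C"]),          # F = h(C)
    "G": (lambda e, f: e * f, ["E", "F"]),  # G = g(E, F)
    "P": (lambda b: b + 10, ["B"]),         # P = p(B)
    "Q": (lambda b: b + 20, ["B"]),         # Q = q(B)
    "R": (lambda g, p, q: g + p + q, ["G", "P", "Q"]),
}

results = run_out_of_order(tasks)
```

Any execution order respecting the wait lists yields the same results; that is what lets the runtime overlap independent work.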


Outline


   1 Introduction


   2 Programming with PyOpenCL


   3 Run-Time Code Generation
          The Idea
          RTCG in Action

   4 Perspectives






Metaprogramming


      Idea

      In GPU scripting (PyCUDA, PyOpenCL),
      GPU code does not need to be
      a compile-time constant.

      Pipeline: a human writes Python code; the Python code generates
      GPU code (Python is good for this code generation step); the
      GPU compiler turns it into a GPU binary, which the machine runs
      on the GPU to produce the result.

      (Key: Code is data; it wants to be reasoned about at run time.)




Machine-generated Code



  Why machine-generate code?
      Automated tuning (cf. ATLAS, FFTW)
      Data types
      Specialize code for the given problem
      Constants are faster than variables
      (reduces register pressure)
      Loop unrolling
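A minimal pure-Python sketch of the specialization point: the problem size is baked into the generated source as a literal rather than passed at run time, so the GPU compiler can fold and unroll against it. (The axpy kernel here is illustrative, not taken from PyOpenCL.)

```python
# Generate a kernel specialized for a fixed problem size n:
# n becomes a compile-time literal instead of a runtime argument.
def specialize_axpy(n):
    return """
__kernel void axpy(__global float *x, __global float *y, float a)
{
    int i = get_global_id(0);
    if (i < %(n)d)
        y[i] += a * x[i];
}
""" % {"n": n}

src = specialize_axpy(1024)
```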






PyOpenCL: Support for Metaprogramming


  Three (main) ways of generating code:
      Simple %-operator substitution
           Combine with the C preprocessor: simple, often sufficient
      Use a templating engine (Mako works very well)
      codepy:
           Build C syntax trees from Python
           Generates readable, indented C
  Many ways of evaluating code; the most important one:
      Exact device timing via events
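A sketch of the first option (pure Python; the kernel body is made up for illustration): %-substitution chooses the operation, while a #define feeds a tuning parameter through the C preprocessor.

```python
# Combine %-substitution (choose the operation) with a
# preprocessor #define (choose a tuning parameter).
template = """
#define BLOCK_SIZE %(block_size)d
__kernel void apply(__global float *a)
{
    int i = get_global_id(0);
    %(operation)s;
}
"""
src = template % {"block_size": 128, "operation": "a[i] *= 2"}
```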








PyOpenCL Arrays: General Usage

     Remember your first PyOpenCL program?
     Abstraction is good:
import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu




pyopencl.array: Simple Linear Algebra


  pyopencl.array.Array:
     Meant to look and feel just like numpy.
           pyopencl.array.to_device(ctx, queue, numpy_array)
           numpy_array = ary.get()
     +, -, *, /, fill, sin, arange, exp, rand, ...
     Mixed types (int32 + float32 = float64)
     print cl_array for debugging.
     Allows access to raw bits
          Use as kernel arguments, memory maps






pyopencl.elementwise: Elementwise expressions
  Avoiding extra store-fetch cycles for elementwise math:
  n = 10000
  a_gpu = cl_array.to_device(
         ctx, queue, numpy.random.randn(n).astype(numpy.float32))
  b_gpu = cl_array.to_device(
         ctx, queue, numpy.random.randn(n).astype(numpy.float32))

  from pyopencl.elementwise import ElementwiseKernel
  lin_comb = ElementwiseKernel(ctx,
          "float a, float *x, float b, float *y, float *z",
          "z[i] = a*x[i] + b*y[i]")

  c_gpu = cl_array.empty_like(a_gpu)
  lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

  import numpy.linalg as la
  assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5



RTCG via Substitution

   source = ("""
         __kernel void %(name)s(%(arguments)s)
       {
         unsigned lid = get_local_id(0);
         unsigned gsize = get_global_size(0);
         unsigned work_item_start = get_local_size(0)*get_group_id(0);
         %(loop_prep)s;

         for (unsigned i = work_item_start + lid; i < n; i += gsize)
         {
           %(operation)s;
         }
       }
       """ % {
           "arguments": ", ".join(arg.declarator() for arg in arguments),
           "operation": operation,
           "name": name,
           "loop_prep": loop_prep,
           })

   prg = cl.Program(ctx, source).build()
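To make this concrete, here is a self-contained rendering with a stand-in Argument class (this declarator() is a simplification made up for the sketch, not PyOpenCL's actual argument descriptors):

```python
# Stand-in for an argument descriptor: declarator() returns the
# C parameter declaration for the generated kernel signature.
class Argument:
    def __init__(self, decl):
        self.decl = decl

    def declarator(self):
        return self.decl

arguments = [Argument("__global float *y"), Argument("unsigned n")]

source = """
__kernel void %(name)s(%(arguments)s)
{
    unsigned gid = get_global_id(0);
    if (gid < n) { %(operation)s; }
}
""" % {
    "name": "double_it",
    "arguments": ", ".join(arg.declarator() for arg in arguments),
    "operation": "y[gid] *= 2",
}
```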




RTCG via Templates
  from mako.template import Template

  tpl = Template("""
        __kernel void add(
                  __global ${ type_name } *tgt,
                  __global const ${ type_name } *op1,
                  __global const ${ type_name } *op2)
      {
        int idx = get_local_id(0)
          + ${ local_size } * ${ thread_strides }
          * get_group_id(0);

        % for i in range(thread_strides):
            <% offset = i*local_size %>
            tgt[idx + ${ offset }] =
              op1[idx + ${ offset }]
              + op2[idx + ${ offset }];
        % endfor
      }""")

  rendered_tpl = tpl.render(type_name="float",
      local_size=local_size, thread_strides=thread_strides)
  knl = cl.Program(ctx, str(rendered_tpl)).build().add
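Where Mako is unavailable, the same unrolled body can be produced with plain string formatting (a sketch mirroring the template above; names are chosen to match it):

```python
# Emit the unrolled additions that the Mako loop above generates:
# one assignment per thread stride, with the offset folded in.
def render_add(type_name, local_size, thread_strides):
    lines = []
    for i in range(thread_strides):
        offset = i * local_size
        lines.append("tgt[idx + %d] = op1[idx + %d] + op2[idx + %d];"
                     % (offset, offset, offset))
    body = "\n  ".join(lines)
    return ("__kernel void add(__global %s *tgt, __global const %s *op1,"
            " __global const %s *op2)\n{\n  int idx = get_local_id(0)"
            " + %d * %d * get_group_id(0);\n  %s\n}"
            % (type_name, type_name, type_name,
               local_size, thread_strides, body))

src = render_add("float", 256, 4)
```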



pyopencl.reduction: Reduction made easy


  Example: A dot product calculation
  from pyopencl.reduction import ReductionKernel
  dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
      reduce_expr="a+b", map_expr="x[i]*y[i]",
      arguments="__global const float *x, __global const float *y")

  import pyopencl.clrandom as cl_rand
  x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
  y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

  x_dot_y = dot(x, y).get()
  x_dot_y_cpu = numpy.dot(x.get(), y.get())
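The map_expr/reduce_expr split can be mimicked in pure Python (no OpenCL required; eval merely keeps the sketch short and is not how ReductionKernel works internally):

```python
# Mirror ReductionKernel's interface: a map expression applied per
# index, folded with a reduce expression starting from a neutral value.
def reduction(neutral, reduce_expr, map_expr, x, y):
    acc = eval(neutral)
    for i in range(len(x)):
        mapped = eval(map_expr, {"x": x, "y": y, "i": i})
        acc = eval(reduce_expr, {"a": acc, "b": mapped})
    return acc

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
dot = reduction("0", "a+b", "x[i]*y[i]", x, y)
```

On the GPU, the fold happens tree-wise in parallel rather than sequentially, which is why reduce_expr must be associative.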






pyopencl.scan: Scan made easy


  Example: A cumulative sum computation
  from pyopencl.scan import InclusiveScanKernel
  knl = InclusiveScanKernel(ctx, np.int32, "a+b")

  n = 2**20 - 2**18 + 5
  host_data = np.random.randint(0, 10, n).astype(np.int32)
  dev_data = cl_array.to_device(queue, host_data)

  knl(dev_data)
  assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
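For reference, the semantics of an inclusive scan in pure Python (what the parallel kernel computes, not how it computes it):

```python
# Reference inclusive scan: out[i] = a[0] op a[1] op ... op a[i].
def inclusive_scan(data, op):
    result = []
    acc = None
    for item in data:
        acc = item if acc is None else op(acc, item)
        result.append(acc)
    return result

sums = inclusive_scan([3, 1, 4, 1, 5], lambda a, b: a + b)
```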






Outline


   1 Introduction


   2 Programming with PyOpenCL


   3 Run-Time Code Generation


   4 Perspectives
          PyCUDA
          DG-FEM on the GPU
          “Automatic” GPU Programming
          Conclusions





Whetting your appetite



import pycuda.driver as cuda
import pycuda.autoinit, pycuda.compiler
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)


    [This is examples/demo.py in the PyCUDA distribution.]






Whetting your appetite

mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)      /* Compute kernel */
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a




PyOpenCL ↔ PyCUDA: A (rough) dictionary




                      PyOpenCL                PyCUDA
                       Context                Context
                  CommandQueue                Stream
                        Buffer                mem_alloc / DeviceAllocation
                       Program                SourceModule
                        Kernel                Function
   Event (e.g. enqueue_marker)                Event








Discontinuous Galerkin Method


   Let Ω := ∪_i D_k ⊂ R^d.

   Goal
   Solve a conservation law on Ω:        u_t + ∇ · F(u) = 0

   Example
   Maxwell's equations: EM field E(x, t), H(x, t) on Ω governed by

           ∂_t E − (1/ε) ∇ × H = −j/ε,       ∂_t H + (1/µ) ∇ × E = 0,

           ∇ · E = ρ/ε,                      ∇ · H = 0.






Metaprogramming DG: Flux Terms



      0 = ∫_{D_k} u_t φ + [∇ · F(u)] φ dx − ∫_{∂D_k} [n̂ · F − (n̂ · F)*] φ dS_x

   where the boundary integral is the flux term.

   Flux terms:
       vary by problem
       expression specified by user
       evaluated pointwise






Metaprogramming DG: Flux Terms Example
   Example: fluxes for Maxwell's equations

               n̂ · (F − F*)_E := (1/2) [n̂ × (⟦H⟧ − α n̂ × ⟦E⟧)]

   User writes: vectorial statement in mathematical notation

   flux = 1/2*cross(normal, h.int - h.ext
       - alpha*cross(normal, e.int - e.ext))






Metaprogramming DG: Flux Terms Example
   Example: fluxes for Maxwell's equations

               n̂ · (F − F*)_E := (1/2) [n̂ × (⟦H⟧ − α n̂ × ⟦E⟧)]

   We generate: scalar evaluator in C (6×)

   a_flux += (
       (((val_a_field5 - val_b_field5)*fpair->normal[2]
         - (val_a_field4 - val_b_field4)*fpair->normal[0])
        + val_a_field0 - val_b_field0)*fpair->normal[0]
      - (((val_a_field4 - val_b_field4)*fpair->normal[1]
           - (val_a_field1 - val_b_field1)*fpair->normal[2])
         + val_a_field3 - val_b_field3)*fpair->normal[1]
      )*value_type(0.5);
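How such scalar code can arise: a small pure-Python generator that expands a symbolic cross product into per-component C expressions (a sketch with made-up variable names, not the actual generator behind the code above):

```python
# Expand cross(a, b) into its three scalar C expressions,
# component i being a[i+1]*b[i+2] - a[i+2]*b[i+1] (indices mod 3).
def cross_components(a, b):
    comps = []
    for i in range(3):
        j, k = (i + 1) % 3, (i + 2) % 3
        comps.append("%s[%d]*%s[%d] - %s[%d]*%s[%d]"
                     % (a, j, b, k, a, k, b, j))
    return comps

# e.g. the normal crossed with the jump of H across the face
flux = cross_components("normal", "h_jump")
```

Nesting such expansions (cross inside cross, jumps expanded to int/ext differences) yields exactly the kind of flat scalar code shown above.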







Loop Slicing for element-local parts of GPU DG

   Per block: K_L element-local matrix multiplications + matrix load
   (preparation)

   Question: How should one assign work to threads?

    w_s: in sequence             w_i: "inline-parallel"       w_p: in parallel
    (amortize preparation)       (exploit register space)

   [Diagram: for each strategy, threads along the horizontal axis
    and time t along the vertical axis]



Loop Slicing for Differentiation
   [Plot: execution time [ms] vs. w_p for local differentiation,
    matrix-in-shared, order 4, with microblocking; point size
    denotes w_i ∈ {1, ..., 4}; color denotes w_s]



Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400

   [Figure: GFlops/s vs. polynomial order N — GPU vs. CPU.]



Memory Bandwidth on a GTX 280
   [Figure: global memory bandwidth [GB/s] vs. polynomial order N,
   for Gather, Lift, Diff, Assy., and Peak.]





GPU DG Showcase




   Electromagnetism
                                                         Poisson
               CFD





Outline


   1 Introduction


   2 Programming with PyOpenCL


   3 Run-Time Code Generation


   4 Perspectives
          PyCUDA
          DG-FEM on the GPU
          “Automatic” GPU Programming
          Conclusions







Automating GPU Programming


   GPU programming can be time-consuming, unintuitive, and
   error-prone.

     Obvious idea: Let the computer do it.
     One way: Smart compilers
          GPU programming requires complex tradeoffs
          Tradeoffs require heuristics
          Heuristics are fragile
     Another way: Dumb enumeration
          Enumerate loop slicings
          Enumerate prefetch options
          Choose by running resulting code on actual hardware
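The enumerate-and-measure strategy above can be sketched in plain Python. The timing function below is a stand-in: in a real autotuner it would render a kernel with the given loop slicing, compile it, and run it on the actual hardware; the parameter names wp and wi follow the slides, and the candidate values are illustrative.

```python
import itertools
import timeit

def time_candidate(wp, wi):
    """Stand-in for generating, compiling, and timing one kernel variant.

    A real autotuner would measure the generated kernel on the GPU;
    here a toy workload substitutes for that measurement.
    """
    return timeit.timeit(lambda: sum(x * x for x in range(100 * wp * wi)),
                         number=3)

# Enumerate loop slicings: wp = "in parallel" width, wi = "inline-parallel" width.
candidates = list(itertools.product([16, 32, 64], [1, 2, 4]))

# Choose by running the resulting code (here: the stand-in) and keeping the best.
timings = {c: time_candidate(*c) for c in candidates}
best = min(timings, key=timings.get)
```

No heuristics are involved: the fastest empirically measured variant wins, which is exactly what makes the approach robust across hardware generations.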





Loo.py Example

 Empirical GPU loop optimization:
 a, b, c, i, j, k = [var(s) for s in "abcijk"]
 n = 500
 knl = make_loop_kernel([
     LoopDimension("i", n),
     LoopDimension("j", n),
     LoopDimension("k", n),
     ], [
     (c[i+n*j], a[i+n*k]*b[k+n*j])
     ])

 gen_kwargs = {
         "min_threads": 128,
         "min_blocks": 32,
         }

 → Ideal case: Finds 160 GF/s kernel
 without human intervention.





Loo.py Status


     Limited scope:
          Require input/output separation
          Kernels must be expressible using
          “loopy” model
          (i.e. indices decompose into “output”
          and “reduction”)
          Enough for DG, LA, FD, . . .
     Kernel compilation limits trial rate
     Non-Goal: Peak performance
     Good results currently for dense linear
     algebra and (some) DG subkernels
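The “loopy” model’s split of indices into output and reduction dimensions can be illustrated with a plain matrix product; the index names below are illustrative, not Loo.py API:

```python
# Matrix product in the "loopy" index model:
# i, j are "output" indices (each pair writes one entry of c),
# k is the "reduction" index (summed away).
n = 3
a = [[1, 2, 0], [0, 1, 0], [4, 0, 1]]
b = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # identity matrix

c = [[0] * n for _ in range(n)]
for i in range(n):              # output dimension
    for j in range(n):          # output dimension
        for k in range(n):      # reduction dimension
            c[i][j] += a[i][k] * b[k][j]
# Multiplying by the identity leaves a unchanged, so c == a.
```

Any kernel whose indices decompose this way fits the model, which is why dense linear algebra, DG, and finite differences are all in scope.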





Outline


   1 Introduction


   2 Programming with PyOpenCL


   3 Run-Time Code Generation


   4 Perspectives
          PyCUDA
          DG-FEM on the GPU
          “Automatic” GPU Programming
          Conclusions





Where to from here?

   PyCUDA, PyOpenCL, hedge
   → http://www.cims.nyu.edu/~kloeckner/

   GPU RTCG
   AK, N. Pinto et al. PyCUDA: GPU Run-Time Code Generation for
   High-Performance Computing, submitted.

   GPU-DG Article
   AK, T. Warburton, J. Bridge, J.S. Hesthaven, “Nodal
   Discontinuous Galerkin Methods on Graphics Processors”,
   J. Comp. Phys., 228 (21), 7863–7882.

   Also: Intro in GPU Computing Gems Vol 2



Conclusions



      GPUs to me: architecture choice now widely available
      Fun time to be in computational science
      GPUs and scripting work surprisingly well together
          Exploit a natural task decomposition in computational codes
          RTCG: Crucial tool
      GPU Scripting great for prototyping
          . . . and just as suitable for production code






Questions?



                                             ?
                      Thank you for your attention!


             http://www.cims.nyu.edu/~kloeckner/


                                       image credits






Image Credits

      Dictionary: sxc.hu/topfer
      C870 GPU: Nvidia Corp.
      OpenCL Logo: Apple Corp./Ars Technica
      OS Platforms: flickr.com/aOliN.Tk
      Old Books: flickr.com/ppdigital
      Floppy disk: flickr.com/ethanhein
      Machine: flickr.com/13521837@N00
      Adding Machine: flickr.com/thomashawk




Implementations


Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

   [Figure: Flop Rates: 16 GPUs vs 64 CPU cores — GFlops/s vs. polynomial order N.]


Outline




   5 OpenCL implementations






The Nvidia CL implementation

                      Targets only GPUs
                      Notes:
                          Nearly identical to CUDA
                                     No native C-level JIT in CUDA (→
                                     PyCUDA)
                              Page-locked memory:
                              Use CL_MEM_ALLOC_HOST_PTR.
                                     Careful: double meaning
                                     Need page-locked memory for genuinely
                                     overlapped transfers.
                              No linear memory texturing
                              CUDA device emulation mode deprecated
                              → Use AMD CPU CL (faster, too!)



The Apple CL implementation
    Targets CPUs and GPUs
    General notes:
        Different header name
        OpenCL/cl.h instead of CL/cl.h
        Use -framework OpenCL for C
        access.
        Beware of imperfect compiler cache
        implementation
        (ignores include files)
    CPU notes:
        One work item per processor
    GPU similar to hardware vendor
    implementation.
    (New: Intel w/ Sandy Bridge)


The AMD CL implementation

         Targets CPUs and GPUs (from both AMD and Nvidia)
         GPU notes:
             Wide SIMD groups (64)
             Native 4/5-wide vectors
                    But: very flop-heavy machine, may ignore vectors
                    for memory-bound workloads
             → Both implicit and explicit SIMD
         CPU notes:
             Many work items per processor (emulated)
         General:
              cl_amd_printf



"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
 
OpenCL Overview Japan Virtual Open House Feb 2021
OpenCL Overview Japan Virtual Open House Feb 2021OpenCL Overview Japan Virtual Open House Feb 2021
OpenCL Overview Japan Virtual Open House Feb 2021
 
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...
“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentati...
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
OpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberOpenACC Monthly Highlights- December
OpenACC Monthly Highlights- December
 
"An Update on Open Standard APIs for Vision Processing," a Presentation from ...
"An Update on Open Standard APIs for Vision Processing," a Presentation from ..."An Update on Open Standard APIs for Vision Processing," a Presentation from ...
"An Update on Open Standard APIs for Vision Processing," a Presentation from ...
 
OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020
 
OpenNebulaConf2019 - Welcome and Project Update - Ignacio M. Llorente, Rubén ...
OpenNebulaConf2019 - Welcome and Project Update - Ignacio M. Llorente, Rubén ...OpenNebulaConf2019 - Welcome and Project Update - Ignacio M. Llorente, Rubén ...
OpenNebulaConf2019 - Welcome and Project Update - Ignacio M. Llorente, Rubén ...
 
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptxOpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
 
OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020
 
HKG15-110: ODP Project Update
HKG15-110: ODP Project UpdateHKG15-110: ODP Project Update
HKG15-110: ODP Project Update
 
"New Standards for Embedded Vision and Neural Networks," a Presentation from ...
"New Standards for Embedded Vision and Neural Networks," a Presentation from ..."New Standards for Embedded Vision and Neural Networks," a Presentation from ...
"New Standards for Embedded Vision and Neural Networks," a Presentation from ...
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
OpenStack-and-OpenDaylight-Integrated-IaaS-for-SDN-and-NFV.pdf
OpenStack-and-OpenDaylight-Integrated-IaaS-for-SDN-and-NFV.pdfOpenStack-and-OpenDaylight-Integrated-IaaS-for-SDN-and-NFV.pdf
OpenStack-and-OpenDaylight-Integrated-IaaS-for-SDN-and-NFV.pdf
 
Os Lamothe
Os LamotheOs Lamothe
Os Lamothe
 
Open cl programming using python syntax
Open cl programming using python syntaxOpen cl programming using python syntax
Open cl programming using python syntax
 
OpenCL programming using Python syntax
OpenCL programming using Python syntax OpenCL programming using Python syntax
OpenCL programming using Python syntax
 
Pgopencl
PgopenclPgopencl
Pgopencl
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 

More from npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)npinto
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...npinto
 

More from npinto (13)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
 

Recently uploaded

Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 

Recently uploaded (20)

LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 

GPU Programming in Python with PyOpenCL and PyCUDA

  • 1. Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA. Andreas Klöckner, Courant Institute of Mathematical Sciences, New York University. March 31, 2011.
  • 2. Thanks: Jan Hesthaven (Brown), Tim Warburton (Rice), Leslie Greengard (NYU), the PyOpenCL and PyCUDA contributors, Nvidia Corp., and AMD Corp.
  • 3. Outline: 1 Introduction; 2 Programming with PyOpenCL; 3 Run-Time Code Generation; 4 Perspectives.
  • 4-5. Outline (identical section-divider build slides): 1 Introduction (A Common Theme; Intro to OpenCL); 2 Programming with PyOpenCL; 3 Run-Time Code Generation; 4 Perspectives.
  • 6. How are high-performance codes constructed? "Traditional" construction of high-performance codes: C/C++/Fortran plus libraries. "Alternative" construction: scripting for the 'brains', GPUs for the 'inner loops'. Play to the strengths of each programming environment.
  • 7. Outline (section divider, repeated): 1 Introduction (A Common Theme; Intro to OpenCL); 2 Programming with PyOpenCL; 3 Run-Time Code Generation; 4 Perspectives.
  • 8-10. What is OpenCL? (incremental build slides) "OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors." [OpenCL 1.1 spec] Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU); vendor-neutral; comes with run-time code generation (RTCG). Defines both a host-side programming interface (a library) and a device-side programming language. Big deal? Big deal!
  • 11. Who? The OpenCL working group: diverse industry participation — processor vendors, system OEMs, middleware vendors, application developers; many industry-leading experts involved in OpenCL's design, giving a healthy diversity of industry perspectives; Apple made the initial proposal, is very active in the working group, and serves as specification editor. (Credit: Khronos Group)
  • 12. When? OpenCL timeline: Jun 2008, Apple proposes the OpenCL working group and contributes a draft specification to Khronos; Dec 2008, Khronos publicly releases OpenCL 1.0 as a royalty-free specification, only six months after the proposal, thanks to a strong initial proposal and a shared commercial incentive; May 2009, Khronos releases the OpenCL 1.0 conformance tests to ensure high-quality implementations; 2H 2009, multiple conformant implementations ship across diverse OSes and platforms (Apple's Mac OS X Snow Leopard ships with OpenCL); Jun 2010, the OpenCL 1.1 specification is released and first implementations ship. The 18-month cadence between 1.0 and 1.1, with backwards compatibility, protects the software investment. (Credit: Khronos Group)
  • 13. Why? Processor parallelism: CPUs (multiple cores driving performance increases; multi-processor programming, e.g. OpenMP) and GPUs (emerging, increasingly general-purpose data-parallel computing; graphics APIs and shading languages) are converging on an intersection: heterogeneous computing. OpenCL is a programming framework for heterogeneous compute resources. (Credit: Khronos Group)
  • 14. CL vs CUDA side-by-side: the same matrix-transpose kernel in both dialects.

    CUDA source code:

        __global__ void transpose(
            float *A_t, float *A,
            int a_width, int a_height)
        {
          int base_idx_a =
              blockIdx.x * BLK_SIZE +
              blockIdx.y * A_BLOCK_STRIDE;
          int base_idx_a_t =
              blockIdx.y * BLK_SIZE +
              blockIdx.x * A_T_BLOCK_STRIDE;
          int glob_idx_a =
              base_idx_a + threadIdx.x + a_width * threadIdx.y;
          int glob_idx_a_t =
              base_idx_a_t + threadIdx.x + a_height * threadIdx.y;

          __shared__ float A_shared[BLK_SIZE][BLK_SIZE+1];
          A_shared[threadIdx.y][threadIdx.x] = A[glob_idx_a];
          __syncthreads();
          A_t[glob_idx_a_t] = A_shared[threadIdx.x][threadIdx.y];
        }

    OpenCL source code:

        __kernel void transpose(
            __global float *a_t, __global float *a,
            unsigned a_width, unsigned a_height)
        {
          int base_idx_a =
              get_group_id(0) * BLK_SIZE +
              get_group_id(1) * A_BLOCK_STRIDE;
          int base_idx_a_t =
              get_group_id(1) * BLK_SIZE +
              get_group_id(0) * A_T_BLOCK_STRIDE;
          int glob_idx_a =
              base_idx_a + get_local_id(0) + a_width * get_local_id(1);
          int glob_idx_a_t =
              base_idx_a_t + get_local_id(0) + a_height * get_local_id(1);

          __local float a_local[BLK_SIZE][BLK_SIZE+1];
          a_local[get_local_id(1)][get_local_id(0)] = a[glob_idx_a];
          barrier(CLK_LOCAL_MEM_FENCE);
          a_t[glob_idx_a_t] = a_local[get_local_id(0)][get_local_id(1)];
        }
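    The index arithmetic the two kernels share can be checked serially on the host. The following pure-Python sketch performs the same blocked transpose; BLK_SIZE and the matrix shape are illustrative values I chose, not from the slide.

```python
# Serial model of the blocked-transpose index arithmetic from the
# kernels above; loop variables mirror get_group_id / get_local_id.
BLK_SIZE = 2
A_WIDTH, A_HEIGHT = 4, 6                  # must be multiples of BLK_SIZE
A_BLOCK_STRIDE = BLK_SIZE * A_WIDTH       # step between block rows in a
A_T_BLOCK_STRIDE = BLK_SIZE * A_HEIGHT    # step between block rows in a_t

def transpose(a):
    a_t = [0.0] * (A_WIDTH * A_HEIGHT)
    for gy in range(A_HEIGHT // BLK_SIZE):      # get_group_id(1)
        for gx in range(A_WIDTH // BLK_SIZE):   # get_group_id(0)
            base_idx_a = gx * BLK_SIZE + gy * A_BLOCK_STRIDE
            base_idx_a_t = gy * BLK_SIZE + gx * A_T_BLOCK_STRIDE
            for ly in range(BLK_SIZE):          # get_local_id(1)
                for lx in range(BLK_SIZE):      # get_local_id(0)
                    # On the device, the local-memory tile swaps lx and
                    # ly between the load and the store; serially that
                    # collapses to one transposed copy per element.
                    a_t[base_idx_a_t + lx + A_HEIGHT * ly] = \
                        a[base_idx_a + ly + A_WIDTH * lx]
    return a_t
```

    Each iteration of the two outer loops plays the role of one work group; the device merely runs all of them, and all BLK_SIZE x BLK_SIZE inner iterations, concurrently.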
  • 15. OpenCL ↔ CUDA: a dictionary.

        OpenCL                          CUDA
        Grid                            Grid
        Work Group                      Block
        Work Item                       Thread
        __kernel                        __global__
        __global                        __device__
        __local                         __shared__
        __private                       __local__
        image2d_t / image3d_t           texture<type, n, ...>
        barrier(CLK_LOCAL_MEM_FENCE)    __syncthreads()
        get_local_id(0/1/2)             threadIdx.x/y/z
        get_group_id(0/1/2)             blockIdx.x/y/z
        get_global_id(0/1/2)            - (reimplement)
  • 16. OpenCL execution model: an nD grid of work groups. Two-tiered parallelism: the grid consists of Nx × Ny × Nz work groups; each work group consists of Sx × Sy × Sz work items; in total, Sx·Nx × Sy·Ny × Sz·Nz work items. Communication and synchronization happen only within a work group; a work group maps onto a compute unit. Grid/group correspond roughly to the outer loops in an algorithm. Device language: get_{global,group,local}_{id,size}(axis).
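    The two-tier indexing above can be spelled out in a few lines of host-side Python; the grid and work-group sizes below are illustrative:

```python
# Work-item bookkeeping from the execution model: along each axis,
#   global_id = group_id * local_size + local_id
# and the total work-item count is the product of the S_i * N_i.
num_groups = (3, 2)     # N_x, N_y   (illustrative 2D grid)
local_size = (4, 4)     # S_x, S_y   (illustrative work-group shape)

def get_global_id(group_id, local_id, axis):
    # What an OpenCL work item would see as get_global_id(axis).
    return group_id[axis] * local_size[axis] + local_id[axis]

def total_work_items():
    total = 1
    for n, s in zip(num_groups, local_size):
        total *= n * s
    return total
```

    This is also the formula one reimplements on the CUDA side, where (per the dictionary slide) get_global_id has no direct builtin.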
  • 17-30. OpenCL: Computing as a Service (incremental build slides of one diagram). The host (CPU, with its own memory) drives one or more compute devices, grouped into platforms: e.g. Platform 0 might be the CPUs, Platform 1 the GPUs. A compute device (think "chip": it has a memory interface) has its own memory and consists of compute units (think "processor": each has instruction fetch), which in turn consist of processing elements (think "SIMD lane"). In this tutorial the host side is driven from Python; the device language is approximately C99.
OpenCL Object Diagram

Figure 2.1 - OpenCL UML Class Diagram
Credit: Khronos Group
Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:
   Highly parallel
   Very architecture-sensitive
   Built for maximum FP/memory throughput
→ they complement each other

CPU: largely restricted to control tasks (~1000/sec); scripting is fast enough for that.

Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL
Outline

   1 Introduction
   2 Programming with PyOpenCL
        First Contact
        About PyOpenCL
   3 Run-Time Code Generation
   4 Perspectives
Dive into PyOpenCL

import pyopencl as cl, numpy

a = numpy.random.rand(256**3).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)     // the compute kernel
    { a[get_global_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (1,), a_dev)

The same kernel with an explicit work-group size of 256, plus readback and verification:

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (256,), a_dev)

result = numpy.empty_like(a)
cl.enqueue_read_buffer(queue, a_dev, result).wait()
import numpy.linalg as la
assert la.norm(result - 2*a) == 0
Outline

   1 Introduction
   2 Programming with PyOpenCL
        First Contact
        About PyOpenCL
   3 Run-Time Code Generation
   4 Perspectives
PyOpenCL: Completeness

PyOpenCL exposes all of OpenCL. For example:
   Every GetInfo() query
   Images and Samplers
   Memory Maps
   Profiling and Synchronization
   GL Interop
PyOpenCL: Completeness

PyOpenCL supports (nearly) every OS that has an OpenCL implementation:
   Linux
   OS X
   Windows
Automatic Cleanup

Reachable objects (memory, streams, ...) are never destroyed.
Once unreachable, they are released at an unspecified future time.
Scarce resources (memory) can be explicitly freed (obj.release()).
Correctly deals with multiple contexts and dependencies (based on OpenCL's reference counting).
PyOpenCL: Documentation
PyOpenCL Philosophy

   Provide complete access
   Automatically manage resources
   Provide abstractions
   Allow interactive use
   Check for and report errors automatically
   Integrate tightly with numpy
PyOpenCL, PyCUDA: Vital Information

   http://mathema.tician.de/software/pyopencl (or /pycuda)
   Complete documentation
   X Consortium License (no warranty, free for all use)
   Convenient abstractions: Arrays, elementwise operations, Reduction, Scan
   Requires: numpy, Python 2.4+ (Win/OS X/Linux)
   Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
Capturing Dependencies

Dependency graph (shown over two slides):
   B = f(A)
   C = g(B)
   E = f(C)
   F = h(C)
   G = g(E, F)
   P = p(B)
   Q = q(B)
   R = r(G, P, Q)

Switch the queue to out-of-order mode!
Specify dependencies as a list of events using the optional wait_for keyword argument to enqueue_XXX calls.
Can also enqueue a barrier.
Common use case: transmit/receive from other MPI ranks.
Possible on Nvidia Fermi: submit parallel work to increase machine use.
Outline

   1 Introduction
   2 Programming with PyOpenCL
   3 Run-Time Code Generation
        The Idea
        RTCG in Action
   4 Perspectives
Metaprogramming

Idea: In GPU scripting, GPU code does not need to be a compile-time constant.
(Key: Code is data. It wants to be reasoned about at run time.)

Pipeline: Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result

The GPU code fed into this pipeline can come from a human or from a machine; PyCUDA and PyOpenCL are good for machine code generation.
Machine-generated Code

Why machine-generate code?
   Automated tuning (cf. ATLAS, FFTW)
   Data types
   Specialize code for the given problem
   Constants are faster than variables (→ register pressure)
   Loop unrolling
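The constant-folding and loop-unrolling points above are easy to demonstrate with plain string generation, no GPU required. A minimal sketch, not from the talk; the kernel, its name, and the unroll factor are made up for illustration:

```python
# Bake a runtime constant into the kernel source and unroll a loop
# at code-generation time. Illustrative only.
def make_unrolled_source(n, unroll=4):
    # each unrolled iteration gets the constant n folded into the source
    body = "\n".join(
        "    a[base + %d] *= %d;" % (u, n)
        for u in range(unroll))
    return ("__kernel void scale(__global float *a)\n"
            "{\n"
            "  int base = %d * get_global_id(0);\n"
            "%s\n"
            "}\n" % (unroll, body))

src = make_unrolled_source(n=7, unroll=4)
```

The resulting source string would then be handed to cl.Program(ctx, src).build(); the GPU compiler sees only literals, so no registers are spent on the scale factor or the loop counter.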
PyOpenCL: Support for Metaprogramming

Three (main) ways of generating code:
   Simple %-operator substitution
      (combined with the C preprocessor: simple, often sufficient)
   Use a templating engine (Mako works very well)
   codepy: build C syntax trees from Python; generates readable, indented C

Many ways of evaluating the generated code. The most important one: exact device timing via events.
Outline

   1 Introduction
   2 Programming with PyOpenCL
   3 Run-Time Code Generation
        The Idea
        RTCG in Action
   4 Perspectives
PyOpenCL Arrays: General Usage

Remember your first PyOpenCL program? Abstraction is good:

import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(4, 4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu
pyopencl.array: Simple Linear Algebra

pyopencl.array.Array: meant to look and feel just like numpy.
   pyopencl.array.to_device(ctx, queue, numpy_array)
   numpy_array = ary.get()
   +, -, *, /, fill, sin, arange, exp, rand, ...
   Mixed types (int32 + float32 = float64)
   print cl_array for debugging
   Allows access to the raw bits
   Use as kernel arguments, memory maps
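The mixed-type rule quoted above (int32 + float32 = float64) mirrors numpy's dtype promotion, which can be checked without a GPU. This uses plain numpy, not pyopencl.array:

```python
import numpy

a = numpy.arange(4, dtype=numpy.int32)
b = numpy.ones(4, dtype=numpy.float32)

# float32 cannot represent every int32 exactly,
# so numpy promotes the result to float64
c = a + b
```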
pyopencl.elementwise: Elementwise Expressions

Avoiding extra store-fetch cycles for elementwise math:

n = 10000
a_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(n).astype(numpy.float32))
b_gpu = cl_array.to_device(
    ctx, queue, numpy.random.randn(n).astype(numpy.float32))

from pyopencl.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(ctx,
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = cl_array.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
RTCG via Substitution

source = ("""
    __kernel void %(name)s(%(arguments)s)
    {
      unsigned lid = get_local_id(0);
      unsigned gsize = get_global_size(0);
      unsigned work_item_start = get_local_size(0)*get_group_id(0);
      for (unsigned i = work_item_start + lid; i < n; i += gsize)
      {
        %(operation)s;
      }
    }
    """ % {
        "arguments": ", ".join(arg.declarator() for arg in arguments),
        "operation": operation,
        "name": name,
        "loop_prep": loop_prep,
        })

prg = cl.Program(ctx, source).build()
RTCG via Templates

from mako.template import Template

tpl = Template("""
    __kernel void add(
            __global ${ type_name } *tgt,
            __global const ${ type_name } *op1,
            __global const ${ type_name } *op2)
    {
      int idx = get_local_id(0)
        + ${ local_size } * ${ thread_strides } * get_group_id(0);

      % for i in range(thread_strides):
          <% offset = i*local_size %>
          tgt[idx + ${ offset }] = op1[idx + ${ offset }] + op2[idx + ${ offset }];
      % endfor
    }""")

rendered_tpl = tpl.render(type_name="float",
    local_size=local_size, thread_strides=thread_strides)

knl = cl.Program(ctx, str(rendered_tpl)).build().add
pyopencl.reduction: Reduction Made Easy

Example: a dot product calculation

from pyopencl.reduction import ReductionKernel
dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global const float *x, __global const float *y")

import pyopencl.clrandom as cl_rand
x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
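The semantics of the neutral/reduce_expr/map_expr triple can be mimicked in pure Python as a reference implementation. This is illustrative only (it is not PyOpenCL code, and the function name is made up); it shows what the generated kernel computes: map each element pair, then fold with the binary expression, starting from the neutral element.

```python
# Pure-Python reference for ReductionKernel's semantics.
# Defaults mirror the dot-product example: map x[i]*y[i], reduce a+b.
def reduce_ref(xs, ys, neutral=0.0,
               reduce_op=lambda a, b: a + b,
               map_op=lambda x, y: x * y):
    acc = neutral
    for x, y in zip(xs, ys):
        acc = reduce_op(acc, map_op(x, y))
    return acc

dot = reduce_ref([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # 1*4 + 2*5 + 3*6
```

The neutral element is what makes an empty (or padded) input well-defined, which matters on the GPU where the tree reduction pads the input up to a power-of-two work size.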
pyopencl.scan: Scan Made Easy

Example: a cumulative sum computation

from pyopencl.scan import InclusiveScanKernel
knl = InclusiveScanKernel(ctx, np.int32, "a+b")

n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = cl_array.to_device(queue, host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
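As with reduction, the contract of an inclusive scan with operator "a+b" is easy to pin down in a few lines of pure Python (a reference sketch, not the GPU kernel): out[i] = x[0] + ... + x[i], so the last element equals the full reduction.

```python
# Pure-Python reference for an inclusive scan with a binary operator.
def inclusive_scan(xs, op=lambda a, b: a + b):
    out = []
    acc = None
    for x in xs:
        # first element passes through; later ones fold in the accumulator
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

result = inclusive_scan([3, 1, 4, 1, 5])
```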
Outline

   1 Introduction
   2 Programming with PyOpenCL
   3 Run-Time Code Generation
   4 Perspectives
        PyCUDA
        DG-FEM on the GPU
        "Automatic" GPU Programming
        Conclusions
Whetting Your Appetite

import pycuda.driver as cuda
import pycuda.autoinit, pycuda.compiler
import numpy

a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]
Whetting Your Appetite

mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)     // the compute kernel
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
PyOpenCL ↔ PyCUDA: A (Rough) Dictionary

   PyOpenCL                       PyCUDA
   Context                        Context
   CommandQueue                   Stream
   Buffer                         mem_alloc / DeviceAllocation
   Program                        SourceModule
   Kernel                         Function
   Event (e.g. enqueue_marker)    Event
Outline

   1 Introduction
   2 Programming with PyOpenCL
   3 Run-Time Code Generation
   4 Perspectives
        PyCUDA
        DG-FEM on the GPU
        "Automatic" GPU Programming
        Conclusions
Discontinuous Galerkin Method

Let Ω := ⋃_i D_k ⊂ ℝ^d.

Goal: Solve a conservation law on Ω:

   u_t + ∇ · F(u) = 0

Example: Maxwell's Equations. EM field E(x, t), H(x, t) on Ω governed by

   ∂_t E − (1/ε) ∇ × H = −j/ε,      ∂_t H + (1/µ) ∇ × E = 0,
   ∇ · E = ρ/ε,                     ∇ · H = 0.
Metaprogramming DG: Flux Terms

   0 = ∫_{D_k} u_t φ + [∇ · F(u)] φ dx − ∫_{∂D_k} [n̂ · F − (n̂ · F)*] φ dS_x

The surface integral is the flux term. Flux terms:
   vary by problem
   expression specified by the user
   evaluated pointwise
Metaprogramming DG: Flux Terms Example

Example: fluxes for Maxwell's Equations

   n̂ · (F − F*)_E := (1/2) [n̂ × (⟦H⟧ − α n̂ × ⟦E⟧)]

The user writes a vectorial statement in mathematical notation:

flux = 1/2*cross(normal, h.int - h.ext - alpha*cross(normal, e.int - e.ext))

We generate a scalar evaluator in C (6×):

a_flux += (
    (((val_a_field5 - val_b_field5)*fpair->normal[2]
      - (val_a_field4 - val_b_field4)*fpair->normal[0])
     + val_a_field0 - val_b_field0)*fpair->normal[0]
    - (((val_a_field4 - val_b_field4)*fpair->normal[1]
        - (val_a_field1 - val_b_field1)*fpair->normal[2])
       + val_a_field3 - val_b_field3)*fpair->normal[1]
    )*value_type(0.5);
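The core of this lowering, turning a vectorial cross(normal, v) into per-component scalar C expressions, can be sketched in a few lines. This is an illustrative toy, not hedge's actual generator, and the operand names are made up:

```python
# Lower cross(normal, v) into three scalar C expressions using the
# cyclic pattern (n x v)_i = n[i+1]*v[i+2] - n[i+2]*v[i+1] (indices mod 3).
def gen_cross(normal, v):
    return ["(%s[%d]*%s[%d] - %s[%d]*%s[%d])" % (
                normal, (i+1) % 3, v, (i+2) % 3,
                normal, (i+2) % 3, v, (i+1) % 3)
            for i in range(3)]

code = gen_cross("fpair->normal", "dv")
```

A real generator walks the user's whole flux expression tree the same way, substituting field differences like val_a_field0 - val_b_field0 for the jump terms.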
Loop Slicing for Element-Local Parts of GPU DG

Per block: K_L element-local matrix multiplications + matrix load (preparation).

Question: How should one assign work to threads?
   w_s: in sequence (amortize preparation)
   w_i: "inline-parallel" (exploit register space)
   w_p: in parallel
Loop Slicing for Differentiation

[Figure: Local differentiation, matrix-in-shared, order 4, with microblocking. Execution time [ms] vs. w_p for varying w_s; point size denotes w_i ∈ {1, ..., 4}.]
[Figure: Nvidia GTX280 vs. a single core of an Intel Core 2 Duo E8400. GFlops/s vs. polynomial order N for GPU and CPU.]
[Figure: Memory bandwidth on a GTX 280. Global memory bandwidth [GB/s] vs. polynomial order N for Gather, Lift, Diff, and Assembly, compared against peak.]
GPU DG Showcase

   Electromagnetism
   Poisson
   CFD
Outline

   1 Introduction
   2 Programming with PyOpenCL
   3 Run-Time Code Generation
   4 Perspectives
        PyCUDA
        DG-FEM on the GPU
        "Automatic" GPU Programming
        Conclusions
Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone.
Obvious idea: let the computer do it.

One way: smart compilers
   GPU programming requires complex tradeoffs
   Tradeoffs require heuristics
   Heuristics are fragile

Another way: dumb enumeration
   Enumerate loop slicings
   Enumerate prefetch options
   Choose by running the resulting code on actual hardware
Loo.py Example

Empirical GPU loop optimization:

a, b, c, i, j, k = [var(s) for s in "abcijk"]
n = 500

knl = make_loop_kernel([
    LoopDimension("i", n),
    LoopDimension("j", n),
    LoopDimension("k", n),
    ], [
    (c[i+n*j], a[i+n*k]*b[k+n*j])
    ])

gen_kwargs = {
    "min_threads": 128,
    "min_blocks": 32,
    }

→ Ideal case: finds a 160 GF/s kernel without human intervention.
Loo.py Status

Limited scope:
   Requires input/output separation
   Kernels must be expressible using the "loopy" model
      (i.e. indices decompose into "output" and "reduction")
   Enough for DG, LA, FD, ...

Kernel compilation limits the trial rate.
Non-goal: peak performance.
Good results currently for dense linear algebra and (some) DG subkernels.
Outline

   1 Introduction
   2 Programming with PyOpenCL
   3 Run-Time Code Generation
   4 Perspectives
        PyCUDA
        DG-FEM on the GPU
        "Automatic" GPU Programming
        Conclusions
Where to from here?

PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/

GPU RTCG: AK, N. Pinto et al., "PyCUDA: GPU Run-Time Code Generation for High-Performance Computing", submitted.
GPU-DG article: AK, T. Warburton, J. Bridge, J.S. Hesthaven, "Nodal Discontinuous Galerkin Methods on Graphics Processors", J. Comp. Phys., 228 (21), 7863–7882.
Also: intro in GPU Computing Gems Vol. 2.
Conclusions

   GPUs to me: an architecture choice that is now widely available
   Fun time to be in computational science
   GPUs and scripting work surprisingly well together: exploit a natural task decomposition in computational codes
   RTCG: a crucial tool
   GPU scripting is great for prototyping... and just as suitable for production code
Questions?

Thank you for your attention!
http://www.cims.nyu.edu/~kloeckner/
Image Credits

   Dictionary: sxc.hu/topfer
   C870 GPU: Nvidia Corp.
   OpenCL Logo: Apple Corp./Ars Technica
   OS Platforms: flickr.com/aOliN.Tk
   Old Books: flickr.com/ppdigital
   Floppy disk: flickr.com/ethanhein
   Machine: flickr.com/13521837@N00
   Adding Machine: flickr.com/thomashawk
Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

[Figure: Flop rates for 16 GPUs vs. 64 CPU cores; GFlops/s vs. polynomial order N.]
Outline

   5 OpenCL Implementations
The Nvidia CL Implementation

Targets only GPUs. Notes:
   Nearly identical to CUDA
   No native C-level JIT in CUDA (→ PyCUDA)
   Page-locked memory: use CL_MEM_ALLOC_HOST_PTR
      Careful: double meaning. Need page-locked memory for genuinely overlapped transfers.
   No linear memory texturing
   CUDA device emulation mode is deprecated → use AMD's CPU CL (faster, too!)
The Apple CL Implementation

Targets CPUs and GPUs. General notes:
   Different header name: OpenCL/cl.h instead of CL/cl.h
   Use -framework OpenCL for C access
   Beware of the imperfect compiler cache implementation (it ignores include files)

CPU notes: one work item per processor.
GPU: similar to the hardware vendor's implementation.
(New: Intel with Sandy Bridge)
The AMD CL Implementation

Targets CPUs and GPUs (from both AMD and Nvidia). GPU notes:
   Wide SIMD groups (64)
   Native 4/5-wide vectors
   But: a very flop-heavy machine; may ignore vectors for memory-bound workloads
   → Both implicit and explicit SIMD

CPU notes: many work items per processor (emulated).
General: cl_amd_printf