SlideShare a Scribd company logo
1 of 44
Download to read offline
PG-Strom
GPU Accelerated Asynchronous
  Query Execution Module

               NEC Europe, Ltd
        SAP Global Competence Center
 KaiGai Kohei <kohei.kaigai@emea.nec.com>
Homogeneous vs Heterogeneous Computing

                                                                                ▌KPIs
                                                                                        Computing Performance
                 Homogeneous                                                            Power Consumption
                 Scale-Up                                                               System Cost
                                                                                        Variety of Applications
                                                                                        Vendor Support
                                      Heterogeneous                                     Software Development
                                      Scale-Up
                                                                                             :
                                                                                             :



                                                                        +
     Scale-out
     (not a topic of
      today’s talk)


 2            PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Characteristics of GPU (1/2)
                                                    Nvidia                       AMD               Intel
                                                    Kepler                       GCN               SandyBridge
                           Model                    GTX 680 (*)                  FirePro S9000     Xeon E5-2690
                                                    (Q1/2012)                    (Q3/2012)         (Q1/2012)
                           Number of                3.54billion                  4.3billion        2.26billion
                           Transistors
                           Number of                1536                         1792              16
                           Cores                    Simple                       Simple            Functional
                           Core clock               1006MHz                      925MHz            2.9GHz
                           Peak FLOPS               3.01Tflops                   3.23TFlops        185.6GFlops

                           Memory                   2GB, GDDR5                   6GB, GDDR5        up to 768GB,
                           Size / TYPE                                                             DDR3
                           Memory                   ~192GB/s                     ~264GB/s          ~51.2GB/s
                           Bandwidth
                           Power                    ~195W                        ~225W             ~135W
                           Consumption
                                                        (*) Nvidia shall release high-end model (Kepler K20) at Q4/2012



 3    PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Characteristics of GPU (2/2)

     Example)

     Zi = Xi + Yi             (0 <= i <= n)


     X0     X1           X2                Xn

     +      +            +                 +
     Y0     Y1           Y2                Yn
     

           

                        



                                          




     Z0     Z1           Z2                Zn



                 Assign a particular “core”


                                                              Nvidia’s GeForce GTX 680 Block Diagram (1536 Cuda cores)
 4              PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (1/2)
Example) Parallel Execution of “sqrt(Xi^2 + Yi^2) < Zi”
GPU Code
__kernel void
sample_func(bool result[], float x[], float y[], float z[]) {
    int i = get_global_id(0);

        result[i] = (bool)(sqrt(x[i]^2 + y[i]^2) < z[i]);
}

Host Code
#define N   (1<<20)
size_t g_itemsz = N / 1024;
size_t l_itemsz = 1024;
/* Acquire device memory and data transfer (host -> device) */
X = clCreateBuffer(cxt, CL_MEM_READ_WRITE, sizeof(float)*N, NULL, &r);
clEnqueueWriteBuffer(cmdq, X, CL_TRUE, sizeof(float)*N, ...);
/* Set argument of the kernel code */
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&X);
/* Invoke device kernel */
clEnqueueNDRangeKernel(cmdq, kernel, 1, &g_itemsz, &l_itemsz, ...);

    5       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1. Build & Load GPU Kernel




                Source                       OpenCL
                 Code                        Compiler
                                                                                     L2 Cache

                     X, Y, Z
     result          buffer                                                                 GPU Kernel
     buffer                                 Command
        Host Memory                          Queue                                  Device DRAM

 6       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1. Build & Load GPU Kernel
2. Allocate Device Memory




                                                                                         L2 Cache

                     X, Y, Z
     result          buffer                                                         Device      GPU Kernel
     buffer                                 Command                                 Memory
        Host Memory                          Queue                                     Device DRAM

 7       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1. Build & Load GPU Kernel
2. Allocate Device Memory
3. Enqueue DMA Transfer (host  device)




                                                                                         L2 Cache

                     X, Y, Z
     result          buffer                                                         Device      GPU Kernel
     buffer                                 Command                                 Memory
        Host Memory                          Queue                                     Device DRAM

 8       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1.   Build & Load GPU Kernel
2.   Allocate Device Memory
3.   Enqueue DMA Transfer (host  device)
4.   Setup Kernel Arguments




                                                                                           L2 Cache

                       X, Y, Z
       result          buffer                                                         Device      GPU Kernel
       buffer                                 Command                                 Memory
          Host Memory                          Queue                                     Device DRAM

 9         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1.    Build & Load GPU Kernel
2.    Allocate Device Memory
3.    Enqueue DMA Transfer (host  device)
4.    Setup Kernel Arguments
5.    Enqueue Execution of GPU Kernel




                                                                                            L2 Cache

                        X, Y, Z
        result          buffer                                                         Device      GPU Kernel
        buffer                                 Command                                 Memory
           Host Memory                          Queue                                     Device DRAM

 10         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1.    Build & Load GPU Kernel
2.    Allocate Device Memory
3.    Enqueue DMA Transfer (host  device)
4.    Setup Kernel Arguments
5.    Enqueue Execution of GPU Kernel
6.    Enqueue DMA Transfer (device  host)




                                                                                            L2 Cache

                        X, Y, Z
        result          buffer                                                         Device      GPU Kernel
        buffer                                 Command                                 Memory
           Host Memory                          Queue                                     Device DRAM

 11         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1.    Build & Load GPU Kernel
2.    Allocate Device Memory
3.    Enqueue DMA Transfer (host  device)
4.    Setup Kernel Arguments
5.    Enqueue Execution of GPU Kernel
6.    Enqueue DMA Transfer (device  host)
7.    Synchronize the command queue
       DMA Transfer (host  device)



                                                                                            L2 Cache

                        X, Y, Z
        result          buffer                                                         Device      GPU Kernel
        buffer                                 Command                                 Memory
           Host Memory                          Queue                                     Device DRAM

 12         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1.    Build & Load GPU Kernel
2.    Allocate Device Memory
3.    Enqueue DMA Transfer (host  device)
4.    Setup Kernel Arguments
5.    Enqueue Execution of GPU Kernel
6.    Enqueue DMA Transfer (device  host)                                               Super
                                                                                        Parallel
7.    Synchronize the command queue                                                    Execution
       DMA Transfer (host  device)
       Execution of GPU Kernel


                                                                                              L2 Cache

                        X, Y, Z
        result          buffer                                                          Device       GPU Kernel
        buffer                                 Command                                  Memory
           Host Memory                          Queue                                       Device DRAM

 13         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1.    Build & Load GPU Kernel
2.    Allocate Device Memory
3.    Enqueue DMA Transfer (host  device)
4.    Setup Kernel Arguments
5.    Enqueue Execution of GPU Kernel
6.    Enqueue DMA Transfer (device  host)
7.    Synchronize the command queue
       DMA Transfer (host  device)
       Execution of GPU Kernel
       DMA Transfer (device  host)
8. Release Device Memory                                                                    L2 Cache

                        X, Y, Z
        result          buffer                                                         Device      GPU Kernel
        buffer                                 Command                                 Memory
           Host Memory                          Queue                                     Device DRAM

 14         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Programming with GPU (2/2)

1.    Build & Load GPU Kernel
2.    Allocate Device Memory
3.    Enqueue DMA Transfer (host  device)
4.    Setup Kernel Arguments
5.    Enqueue Execution of GPU Kernel
6.    Enqueue DMA Transfer (device  host)
7.    Synchronize the command queue
       DMA Transfer (host  device)
       Execution of GPU Kernel
       DMA Transfer (device  host)
8. Release Device Memory                                                                L2 Cache

                        X, Y, Z
        result          buffer                                                                 GPU Kernel
        buffer                                 Command
           Host Memory                          Queue                                  Device DRAM

 15         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Basic idea to utilize GPU

 Simultaneous (Asynchronous) execution of CPU and GPU
 Minimization of data transfer between host and device
                                                                                          GPGPU
      Memory                                       CPU                                    (non-integrated)



                       DDR3-1600                                                          Super
      on-host          (51.2GB/s)                                                        Parallel
       buffer                                                                           Execution
                     DMA Transfer
                                                                                                    DDR5
                                                                                                    192.2GB/s
                                                IO HUB
                                                                                     on-device
                                                                                      buffer
                                             HBA                                                    device DRAM


                        SAS 2.0 (600MB/s)                      PCI-E 3.0 x16 (32.0GB/s)

 16       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Back to the PostgreSQL world




                                                                       Don’t I forget I’m talking
                                                                       at PGconf.EU 2012?


 17   PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Re-definition of SQL/MED

▌SQL/MED (Management of External Data)
  External data source performing as if regular tables
  Not only “management”, but external computing resources also

                                                                                                   Exec
                                                                      Regular                                    Exec
                                                                                      storage
                                           Query Executor              Table
                           Query Planner
            Query Parser




                                                                      Foreign             MySQL
                                                                       Table               FDW

  SQL
 Query                                                                Foreign             Oracle
                                                                                          FDW                    Exec
                                                                       Table

                                                                      Foreign        PG-Strom
                                                                       Table           FDW

                                                                                                          Exec
                                                                                Regular
                                                            storage
                                                                                 Table


 18      PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Introduction of PG-Strom

▌PG-Strom is ...
  A FDW extension of PostgreSQL, released under the GPL v3.
   https://github.com/kaigai/pg_strom
  Not a stable module, please don’t use in production system yet.
  Designed to utilize GPU devices for CPU off-load according to
   their characteristics.


▌Key features of PG-Strom
  Just-in-time pseudo code generation for GPU execution
  Column-oriented internal data structure
  Asynchronous query execution
 Reduction of response-time dramatically!




 19      PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Asynchronous Execution using CPU/GPU (1/2)

▌CPU characteristics
  Complex Instruction, less parallelism
  Expensive & much power consumption per core
  I/O capability

▌GPU characteristics
  Simple Instruction, much parallelism
  Cheap & less power consumption per core
  Device memory access only (except for integrated GPU)

▌“Best Mix” strategy of PG-Stom
  CPU focus on I/O and control stuff.
  GPU focus on calculation stuff.


 20      PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Asynchronous Execution using CPU/GPU (2/2)
    vanilla PostgreSQL                                                    PostgreSQL with PG-Strom

           CPU                                                           CPU                      GPU

                                                                                         Asynchronous memory
                                                                                         transfer and execution


                                Iteration of scan tuples and
                                   evaluation of qualifiers




                                                                  Synchronization


                                                               Larger “chunk” to scan
                                                                                             Earlier than
                                                                the database at once
                                                                                           “Only CPU” scan


           : Scan tuples on shared-buffers
           : Execution of the qualifiers

 Page 21   PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
So what, How fast is it?
  postgres=# SELECT COUNT(*) FROM rtbl
             WHERE sqrt((x-256)^2 + (y-128)^2) < 40;
   count
  --------
   100467
  (1 row)
  Time: 7668.684 ms
  postgres=# SELECT COUNT(*) FROM ftbl
             WHERE sqrt((x-256)^2 + (y-128)^2) < 40;
   count
  --------
   100467
  (1 row)                 Accelerated!
  Time: 857.298 ms

 CPU: Xeon E5-2670 (2.60GHz), GPU: NVidia GeForce GT640, RAM: 384GB
 Both of regular rtbl and PG-Strom ftbl contain 20milion rows with same value

 22       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Architecture of PG-Strom
                                                               World of CPU
         Regular                  Shadow
         Tables                    Tables
                                                                       A chunk contains
                                                                       both of data & code
                 shared buffer                         chunk
                                      Data
                                                                                                 World of GPU
                                                    Pseudo                            Async
                                                     codes                             DMA
         SeqScan,           PG-Strom                                                 Transfer       GPU Device
           etc...                                                 Event
                                                                 Monitor                             Memory
                          ForeignScan
                                                                                                      Super
Result                                                                                               Parallel
               Query Executor                                 PG-Strom                              Execution
                                                             GPU Control
                                                                                                    GPU Kernel
           PostgreSQL Backend
            PostgreSQL Backend                                 Server
             PostgreSQL Backend                                                      Preload         Function


                                     Postmaster                            An extra            Works according to
                                                                           daemon              given pseudo code


 23       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Pseudo code generation (1/2)

         SELECT * FROM ftbl WHERE
         c like ‘%xyz%’ AND sqrt((x-256)^2+(y-100)^2) < 10;

      contains unsupported
      operators / functions

                              Translation to                xreg10 = $(ftbl.x)
                              pseudo code
                                                            xreg12 = 256.000000::double
                 Super                                      xreg8 = (xreg10 - xreg12)
                Parallel                                    xreg10 = 2.000000::double
               Execution
                                                            xreg6 = pow(xreg8, xreg10)
          GPU Kernel
           Function                                         xreg12 = $(ftbl.y)
                                                            xreg14 = 128.000000::double
                                                                          :


 24         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Pseudo code generation (2/2)

                                               __global__
Regularly, we should avoid                     void kernel_qual(const int commands[],...)
                                               {
branch operations on GPU                         const int *cmd = commands;
code                                               :
                                                 while (*cmd != GPUCMD_TERMINAL_COMMAND)
result = 0;                                      {
                                                   switch (*cmd)
if (condition)                                     {
                                                     case GPUCMD_CONREF_INT4:
{
                                                       regs[*(cmd+1)] = *(cmd + 2);
  result = a + b;                                      cmd += 3;
}                                                      break;
                                                     case GPUCMD_VARREF_INT4:
else                                                   VARREF_TEMPLATE(cmd, uint);
{                                                      break;
  result = a - b;                                    case GPUCMD_OPER_INT4_PL:
                                                       OPER_ADD_TEMPLATE(cmd, int);
}                                                      break;
                                                          :
return 2 * result;
                                                          :

 25      PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Pseudo code generation (2/2)

                                               __global__
Regularly, we should avoid                     void kernel_qual(const int commands[],...)
                                               {
branch operations on GPU                         const int *cmd = commands;
code                                               :
                                                 while (*cmd != GPUCMD_TERMINAL_COMMAND)
result = 0;                                      {
                                                   switch (*cmd)
if (condition)                                     {
                                                     case GPUCMD_CONREF_INT4:
{
                                                       regs[*(cmd+1)] = *(cmd + 2);
  result = a + b;                                      cmd += 3;
}                                                      break;
                                                     case GPUCMD_VARREF_INT4:
else                                                   VARREF_TEMPLATE(cmd, uint);
{                                                      break;
  result = a - b;                                    case GPUCMD_OPER_INT4_PL:
                                                       OPER_ADD_TEMPLATE(cmd, int);
}                                                      break;
                                                          :
return 2 * result;
                                                          :

 26      PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Pseudo code generation (2/2)

                                               __global__
Regularly, we should avoid                     void kernel_qual(const int commands[],...)
                                               {
branch operations on GPU                         const int *cmd = commands;
code                                               :
                                                 while (*cmd != GPUCMD_TERMINAL_COMMAND)
result = 0;                                      {
                                                   switch (*cmd)
if (condition)                                     {
                                                     case GPUCMD_CONREF_INT4:
{
                                                       regs[*(cmd+1)] = *(cmd + 2);
  result = a + b;                                      cmd += 3;
}                                                      break;
                                                     case GPUCMD_VARREF_INT4:
else                                                   VARREF_TEMPLATE(cmd, uint);
{                                                      break;
  result = a - b;                                    case GPUCMD_OPER_INT4_PL:
                                                       OPER_ADD_TEMPLATE(cmd, int);
}                                                      break;
                                                          :
return 2 * result;
                                                          :

 27      PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Pseudo code generation (2/2)

                                               __global__
Regularly, we should avoid                     void kernel_qual(const int commands[],...)
                                               {
branch operations on GPU                         const int *cmd = commands;
code                                               :
                                                 while (*cmd != GPUCMD_TERMINAL_COMMAND)
result = 0;                                      {
                                                   switch (*cmd)
if (condition)                                     {
                                                     case GPUCMD_CONREF_INT4:
{
                                                       regs[*(cmd+1)] = *(cmd + 2);
  result = a + b;                                      cmd += 3;
}                                                      break;
                                                     case GPUCMD_VARREF_INT4:
else                                                   VARREF_TEMPLATE(cmd, uint);
{                                                      break;
  result = a - b;                                    case GPUCMD_OPER_INT4_PL:
                                                       OPER_ADD_TEMPLATE(cmd, int);
}                                                      break;
                                                          :
return 2 * result;
                                                          :

 28      PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Pseudo code generation (2/2)

                                               __global__
Regularly, we should avoid                     void kernel_qual(const int commands[],...)
                                               {
branch operations on GPU                         const int *cmd = commands;
code                                               :
                                                 while (*cmd != GPUCMD_TERMINAL_COMMAND)
result = 0;                                      {
                                                   switch (*cmd)
if (condition)                                     {
                                                     case GPUCMD_CONREF_INT4:
{
                                                       regs[*(cmd+1)] = *(cmd + 2);
  result = a + b;                                      cmd += 3;
}                                                      break;
                                                     case GPUCMD_VARREF_INT4:
else                                                   VARREF_TEMPLATE(cmd, uint);
{                                                      break;
  result = a - b;                                    case GPUCMD_OPER_INT4_PL:
                                                       OPER_ADD_TEMPLATE(cmd, int);
}                                                      break;
                                                          :
return 2 * result;
                                                          :

 29      PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
OT: Why “pseudo”, not native code
                                                                                           Initial design at Jan-2012

   SQL Query




                                         PG-Strom module
                                                                                        GPU

                       PostgreSQL Core
                                                                   Qualifier
                                                                                       Source

                                                                           Run-time                Pre-Compiled
 Query Parser                                                              GPU Code                Binary Cache
                                                                                                                  Compile
                                                                           Generator                               Time
                                                                                                            GPU             GPU
                                                                                                           Source          Binary
                                                              PG-Strom
 Query Planner
                                                               Planner                      Init                  nvcc

                                                                                                   Async Memcpy          Async
                                                              PG-Strom
Query Executor                                                                             Load                          Execute
                                                              Executor
                                                                                                                  GPU
                                                           Columns used                                             Num of
                                                           to Qualifiers                   Scan                     kernels
                                                                                                   Async Memcpy     to load
    Regular                                                    pg_strom          Columns used
   Databases                                                    schema           to Target-List


 PGcon2012, PG-Strom -A GPU Optimized Asynchronous Executor Module of FDW-
              30
Save the bandwidth of PCI-Express bus
   E.g) SELECT name, tel, email, address FROM address_book
          WHERE sqrt((pos_x – 24.5)^2 + (pos_y – 52.3)^2) < 10;
    No sense to fetch columns being not in use

           CPU                              GPU                              CPU                 GPU




   Synchronization                                                      Synchronization

     : Scan tuples on the shared-buffers
     : Execution of the qualifiers                                     Reduction of data-size to be
     : Columns being not used the qualifiers                           transferred via PCI-E

 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
              31
Data density & Column-oriented structure (1/3)

                                                   (shadow) TABLE “public.ft.rowid”
                                                    rowid nitems              isnull
      FOREIGN TABLE ft                              4000   2000         {0,0,0,1,0,0,…}
    int    float   text
                                                    6000   2000         {0,0,0,0,0,0,…}
     X       Y      Z
                                                       :     :                   :
                                       (shadow)14000 “public.ft.z.cs”
                                                     TABLE 400          {0,0,1,0,0,0,…}
                                rowid nitems       isnull             values
                                4000     15       {0,0,…}    { ‘hello’, ‘world’, … }
                                4015     20       {0,0,…} { ‘aaa’, ‘bbb’, ‘ccc’, … }
                                 (shadow) TABLE “public.ft.y.cs”
                                  :       :           :                  :
                          rowid nitems     isnull            values
                                14275    25       {0,0,…}    {‘xxx’, ‘yyy’, ‘zzz’, …}
                          4000     250   {0,0,…} { 1.38, 6.45, 2.15, … }
                 (shadow) TABLE “public.ft.a.cs” { 4.32, 5.46, 3.14, … }
                          4250     250   {0,1,…}
           rowid nitems     :
                         isnull     :        :
                                          values                 :
            4000   500  {0,0,…} {10, 20, {0,0,…} 50, …} 29, 39, 49, 59, …}
                         14200     100    30, 40, {19,
               4500         500      {0,1,…}        {11, 0, 31, 41, 51, …}
                 :           :          :                      :
               14200        200      {0,0,…}       {19, 29, 39, 49, 59, …}

 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
              32
Data density & Column-oriented structure (2/3)

postgres=# CREATE FOREIGN TABLE example
             (a int, b text) SERVER pg_strom;
CREATE FOREIGN TABLE
postgres=# SELECT * FROM pgstrom_shadow_relations;
  oid |        relname        | relkind | relsize
-------+----------------------+---------+-----------
 16446 | public.example.rowid | r       |          0
 16449 | public.example.idx   | i       |      8192
 16450 | public.example.a.cs | r        |          0
 16453 | public.example.a.idx | i       |      8192
 16454 | public.example.b.cs | r        |          0
 16457 | public.example.b.idx | i       |      8192
 16462 | public.example.seq   | S       |      8192
(9 rows)
postgres=# SELECT * FROM pg_strom."public.example.a.cs" ;
 rowid | nitems | isnull | values
-------+--------+--------+--------
(0 rows)

 33     PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Data density & Column-oriented structure (2/3)

② Calculation                                                       opcode                Pseudo Code




                                               PgStromChunkBuffer
                   ① Transfer                                       rowmap

                                                                    value   a[]     <not used>

                                                                    value   b[]
                  ③ Write-Back
                                                                    value   c[]

                                                                    value   d[]     <not used>

      Less bandwidth
                                                                       Table: my_schema.ft1.b.cs
      consumption
                                                                    10100       {2.4, 5.6, 4.95, … }
                                                                    10300     {10.23, 7.54, 5.43, … }

                                                                                  Table: my_schema.ft1.c.cs
            Also, suitable for                                                10100        {‘2010-10-21’, …}
            data compression                                                  10200        {‘2011-01-23’, …}
                                                                              10300        {‘2011-08-17’, …}


 34       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Demonstration
Key features towards upcoming v9.3 (1/2)

▌Extra Daemon
  It enables extension to manage background worker processes.
  Pre-requisites to implement PG-Strom’s GPU control server
  Alvaro submitted this patch on CommitFest:Nov.

                 Shared Resources (DB cluster, shared mem, IPC, ...)


                                                    Built-in                         Extra
                                                  background                        daemon
               PostgreSQL
                PostgreSQL                         daemon
                 PostgreSQL
                  PostgreSQL
               Backend
                Backend
                 Backend
                  Backend                        (autovacuum,
                                                  bgwriter...)               (GPU controller)

                                                                                           manage
                                                                                   Extension
                                              postmaster


 36     PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Key features towards upcoming v9.3 (2/2)

▌Writable Foreign Table
  It enables to use usual INSERT, UPDATE or DELETE to modify foreign
   tables managed by PG-Strom.
  KaiGai submitted a proof-of-concept patch to CommitFest:Sep.
  In-core postgresql_fdw is needed for working example.

                                                                                    SELECT rowid, *
                           create_                                                  FROM ... WHERE ...   Remote
                           foreignscan_plan                                         FOR UPDATE;          Data


                                                             Foreign Data Wrapper
      Planner
                                                                                                         Source
                                  ExecQual

                                                                                    FETCH
                           ExecForeignScan
      Executor                                                                      UPDATE ... WHERE
                                                                                    rowid = xxx;
                           ExecModifyTable

 37          PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
More Rapidness (1/2) – Parallel Data Load
                                                                     World of CPU
      Regular                Shadow
      Tables                  Tables


            shared buffer                                      chunk
                                                                                            Async World of GPU
                                                                                             DMA
                                                PG-Strom                                   Transfer
      SeqScan        PG-Strom                    PG-Strom
                                                  PG-Strom
                                                  Data                                              GPU Device
       etc...                                      Data
                                                    Data
                                                 Loader                                              Memory
                   ForeignScan                    Loader
                                                   Loader                     Event
                                                                             Monitor                  Super
                                                 chunk                                               Parallel
         Query Executor                                                   PG-Strom                  Execution
                                                 to be
                                                                         GPU Control
       PostgreSQL Backend                       loaded                     Server
                                                                                                     GPU Kernel
        PostgreSQL Backend
         PostgreSQL Backend                                                          Preload          Function


                                                                                               Works according to
                                      Postmaster
                                                                                               given pseudo code


 38             PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
More Rapidness (2/2) – TargetList Push-down

                          SELECT ((a + b) * (c – d))^2 FROM ftbl;


                                   SELECT pseudo_col FROM ftbl;


        a                 b                c               d            pseudo_col
                                                                                        Computed
        1                 2                3               4                      9       during
        3                 1                4               1                    144    ForeignScan
        2                 4                1               4                    324
        2                 2                3               6                    144
        :                  :               :               :                      :

 Pseudo column hold “computed” result, to be just referenced
 Performs as if extra columns exist in addition to table definition

 39         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
We need you getting involved

      ▌Project was launched from my personal curiousness,
      ▌So, it is uncertain how does PG-Strom fit “real-life” workload.
      ▌We definitely have to find out attractive usage of PG-Strom


  Which region?
                                                                                     Which problem?




         How to solve?

 40       PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Summary

▌Characteristics of GPU device
  Inflexible instructions, but much higher parallelism
  Cheap & small power consumption per computing capability
▌PG-Strom
  Utilization of GPU device for CPU off-load and rapid response
  Just-in-time pseudo code generation according to the given query
  Column-oriented data structure for data density on PCI-Express bus
 In the result, dramatic shorter response time
▌Upcoming development
  Upstream
      • Extra daemons, Writable Foreign Tables
  Extension
      • Move to OpenCL rather than CUDA

▌Your involvement can lead future evolution of PG-Strom

 41         PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Any Questions?




 42   PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Thank you


                      ありがとうございました
                        THANK YOU
                         DĚKUJEME
                           DANKE
                           MERCI
                          GRAZIE
                          GRACIAS

 43   PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
Page 44   PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module

More Related Content

What's hot

20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...Kohei KaiGai
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwJan Holčapek
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdwKohei KaiGai
 
20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGISKohei KaiGai
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersKazuaki Ishizaki
 
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_IndexKohei KaiGai
 
PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)Kohei KaiGai
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - EnglishKohei KaiGai
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_ProcessingKohei KaiGai
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Ural-PDC
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...Equnix Business Solutions
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningRenaldas Zioma
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdwKohei KaiGai
 
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsAkihiro Hayashi
 

What's hot (20)

20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdw
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw
 
20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
 
PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
 
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
 

Viewers also liked

Accelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsAccelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsIBM
 
GTC 2012: GPU-Accelerated Path Rendering
GTC 2012: GPU-Accelerated Path RenderingGTC 2012: GPU-Accelerated Path Rendering
GTC 2012: GPU-Accelerated Path Rendering Mark Kilgard
 
Computational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in RComputational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in Rherbps10
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014StampedeCon
 
Deep learning on spark
Deep learning on sparkDeep learning on spark
Deep learning on sparkSatyendra Rana
 
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web RenderingSIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web RenderingMark Kilgard
 
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...
Enabling Graph Analytics at Scale:  The Opportunity for GPU-Acceleration of D...Enabling Graph Analytics at Scale:  The Opportunity for GPU-Acceleration of D...
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...odsc
 
Heterogeneous System Architecture Overview
Heterogeneous System Architecture OverviewHeterogeneous System Architecture Overview
Heterogeneous System Architecture Overviewinside-BigData.com
 
PyData Amsterdam - Name Matching at Scale
PyData Amsterdam - Name Matching at ScalePyData Amsterdam - Name Matching at Scale
PyData Amsterdam - Name Matching at ScaleGoDataDriven
 
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...Spark Summit
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillotsparktc
 
Containerizing GPU Applications with Docker for Scaling to the Cloud
Containerizing GPU Applications with Docker for Scaling to the CloudContainerizing GPU Applications with Docker for Scaling to the Cloud
Containerizing GPU Applications with Docker for Scaling to the CloudSubbu Rama
 
How to Solve Real-Time Data Problems
How to Solve Real-Time Data ProblemsHow to Solve Real-Time Data Problems
How to Solve Real-Time Data ProblemsIBM Power Systems
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Chris Fregly
 
The Potential of GPU-driven High Performance Data Analytics in Spark
The Potential of GPU-driven High Performance Data Analytics in SparkThe Potential of GPU-driven High Performance Data Analytics in Spark
The Potential of GPU-driven High Performance Data Analytics in SparkSpark Summit
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scalesparktc
 

Viewers also liked (20)

Accelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUsAccelerating Machine Learning Applications on Spark Using GPUs
Accelerating Machine Learning Applications on Spark Using GPUs
 
GTC 2012: GPU-Accelerated Path Rendering
GTC 2012: GPU-Accelerated Path RenderingGTC 2012: GPU-Accelerated Path Rendering
GTC 2012: GPU-Accelerated Path Rendering
 
GPU Ecosystem
GPU EcosystemGPU Ecosystem
GPU Ecosystem
 
Computational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in RComputational Techniques for the Statistical Analysis of Big Data in R
Computational Techniques for the Statistical Analysis of Big Data in R
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014
 
Deep learning on spark
Deep learning on sparkDeep learning on spark
Deep learning on spark
 
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web RenderingSIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
 
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...
Enabling Graph Analytics at Scale:  The Opportunity for GPU-Acceleration of D...Enabling Graph Analytics at Scale:  The Opportunity for GPU-Acceleration of D...
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...
 
Heterogeneous System Architecture Overview
Heterogeneous System Architecture OverviewHeterogeneous System Architecture Overview
Heterogeneous System Architecture Overview
 
PyData Amsterdam - Name Matching at Scale
PyData Amsterdam - Name Matching at ScalePyData Amsterdam - Name Matching at Scale
PyData Amsterdam - Name Matching at Scale
 
Hadoop + GPU
Hadoop + GPUHadoop + GPU
Hadoop + GPU
 
Deep Learning on Hadoop
Deep Learning on HadoopDeep Learning on Hadoop
Deep Learning on Hadoop
 
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
From Machine Learning to Learning Machines: Creating an End-to-End Cognitive ...
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
 
Containerizing GPU Applications with Docker for Scaling to the Cloud
Containerizing GPU Applications with Docker for Scaling to the CloudContainerizing GPU Applications with Docker for Scaling to the Cloud
Containerizing GPU Applications with Docker for Scaling to the Cloud
 
How to Solve Real-Time Data Problems
How to Solve Real-Time Data ProblemsHow to Solve Real-Time Data Problems
How to Solve Real-Time Data Problems
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
 
The Potential of GPU-driven High Performance Data Analytics in Spark
The Potential of GPU-driven High Performance Data Analytics in SparkThe Potential of GPU-driven High Performance Data Analytics in Spark
The Potential of GPU-driven High Performance Data Analytics in Spark
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim Hunter
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
 

Similar to PG-Strom - GPU Accelerated Asyncr

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
iMinds The Conference: Jan Lemeire
iMinds The Conference: Jan LemeireiMinds The Conference: Jan Lemeire
iMinds The Conference: Jan Lemeireimec
 
Congatec_Global Vendor for Innovative Embedded Solutions_Istanbul
Congatec_Global Vendor for Innovative Embedded Solutions_IstanbulCongatec_Global Vendor for Innovative Embedded Solutions_Istanbul
Congatec_Global Vendor for Innovative Embedded Solutions_IstanbulIşınsu Akçetin
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
Congatec_Global Vendor for Innovative Embedded Solutions_Ankara
Congatec_Global Vendor for Innovative Embedded Solutions_AnkaraCongatec_Global Vendor for Innovative Embedded Solutions_Ankara
Congatec_Global Vendor for Innovative Embedded Solutions_AnkaraIşınsu Akçetin
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
How To Train Your Calxeda EnergyCore
How To Train Your  Calxeda EnergyCoreHow To Train Your  Calxeda EnergyCore
How To Train Your Calxeda EnergyCoreNaoto MATSUMOTO
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)Fatima Qayyum
 
N A G P A R I S280101
N A G P A R I S280101N A G P A R I S280101
N A G P A R I S280101John Holden
 
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU ArchitecturesGPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architecturesinside-BigData.com
 
Stream Processing
Stream ProcessingStream Processing
Stream Processingarnamoy10
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Stefano Di Carlo
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)Naoto MATSUMOTO
 

Similar to PG-Strom - GPU Accelerated Asyncr (20)

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11
 
iMinds The Conference: Jan Lemeire
iMinds The Conference: Jan LemeireiMinds The Conference: Jan Lemeire
iMinds The Conference: Jan Lemeire
 
Pgopencl
PgopenclPgopencl
Pgopencl
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
Congatec_Global Vendor for Innovative Embedded Solutions_Istanbul
Congatec_Global Vendor for Innovative Embedded Solutions_IstanbulCongatec_Global Vendor for Innovative Embedded Solutions_Istanbul
Congatec_Global Vendor for Innovative Embedded Solutions_Istanbul
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
Congatec_Global Vendor for Innovative Embedded Solutions_Ankara
Congatec_Global Vendor for Innovative Embedded Solutions_AnkaraCongatec_Global Vendor for Innovative Embedded Solutions_Ankara
Congatec_Global Vendor for Innovative Embedded Solutions_Ankara
 
GPU Programming with Java
GPU Programming with JavaGPU Programming with Java
GPU Programming with Java
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
How To Train Your Calxeda EnergyCore
How To Train Your  Calxeda EnergyCoreHow To Train Your  Calxeda EnergyCore
How To Train Your Calxeda EnergyCore
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
 
N A G P A R I S280101
N A G P A R I S280101N A G P A R I S280101
N A G P A R I S280101
 
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU ArchitecturesGPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architectures
 
Stream Processing
Stream ProcessingStream Processing
Stream Processing
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)
 

More from Kohei KaiGai

20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_History20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_HistoryKohei KaiGai
 
20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_API20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_APIKohei KaiGai
 
20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrow20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrowKohei KaiGai
 
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count20210928_pgunconf_hll_count
20210928_pgunconf_hll_countKohei KaiGai
 
20210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.020210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.0Kohei KaiGai
 
20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCache20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCacheKohei KaiGai
 
20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGIS20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGISKohei KaiGai
 
20200828_OSCKyoto_Online
20200828_OSCKyoto_Online20200828_OSCKyoto_Online
20200828_OSCKyoto_OnlineKohei KaiGai
 
20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdw20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdwKohei KaiGai
 
20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_Fdw20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_FdwKohei KaiGai
 
20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_Tokyo20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_TokyoKohei KaiGai
 
20191115-PGconf.Japan
20191115-PGconf.Japan20191115-PGconf.Japan
20191115-PGconf.JapanKohei KaiGai
 
20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_Beta20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_BetaKohei KaiGai
 
20190925_DBTS_PGStrom
20190925_DBTS_PGStrom20190925_DBTS_PGStrom
20190925_DBTS_PGStromKohei KaiGai
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGaiKohei KaiGai
 
20190516_DLC10_PGStrom
20190516_DLC10_PGStrom20190516_DLC10_PGStrom
20190516_DLC10_PGStromKohei KaiGai
 
20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdw20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdwKohei KaiGai
 
20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_Fdw20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_FdwKohei KaiGai
 
20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LT20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LTKohei KaiGai
 
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigDataKohei KaiGai
 

More from Kohei KaiGai (20)

20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_History20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_History
 
20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_API20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_API
 
20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrow20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrow
 
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
 
20210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.020210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.0
 
20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCache20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCache
 
20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGIS20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGIS
 
20200828_OSCKyoto_Online
20200828_OSCKyoto_Online20200828_OSCKyoto_Online
20200828_OSCKyoto_Online
 
20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdw20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdw
 
20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_Fdw20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_Fdw
 
20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_Tokyo20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_Tokyo
 
20191115-PGconf.Japan
20191115-PGconf.Japan20191115-PGconf.Japan
20191115-PGconf.Japan
 
20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_Beta20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_Beta
 
20190925_DBTS_PGStrom
20190925_DBTS_PGStrom20190925_DBTS_PGStrom
20190925_DBTS_PGStrom
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
20190516_DLC10_PGStrom
20190516_DLC10_PGStrom20190516_DLC10_PGStrom
20190516_DLC10_PGStrom
 
20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdw20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdw
 
20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_Fdw20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_Fdw
 
20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LT20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LT
 
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
 

PG-Strom - GPU Accelerated Asyncr

  • 1. PG-Strom GPU Accelerated Asynchronous Query Execution Module NEC Europe, Ltd SAP Global Competence Center KaiGai Kohei <kohei.kaigai@emea.nec.com>
  • 2. Homogeneous vs Heterogeneous Computing ▌KPIs  Computing Performance Homogeneous  Power Consumption Scale-Up  System Cost  Variety of Applications  Vendor Support Heterogeneous  Software Development Scale-Up : : + Scale-out (not a topic of today’s talk) 2 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 3. Characteristics of GPU (1/2) Nvidia AMD Intel Kepler GCN SandyBridge Model GTX 680 (*) FirePro S9000 Xeon E5-2690 (Q1/2012) (Q3/2012) (Q1/2012) Number of 3.54billion 4.3billion 2.26billion Transistors Number of 1536 1792 16 Cores Simple Simple Functional Core clock 1006MHz 925MHz 2.9GHz Peak FLOPS 3.01Tflops 3.23TFlops 185.6GFlops Memory 2GB, GDDR5 6GB, GDDR5 up to 768GB, Size / TYPE DDR3 Memory ~192GB/s ~264GB/s ~51.2GB/s Bandwidth Power ~195W ~225W ~135W Consumption (*) Nvidia shall release high-end model (Kepler K20) at Q4/2012 3 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 4. Characteristics of GPU (2/2) Example) Zi = Xi + Yi (0 <= i <= n) X0 X1 X2 Xn + + + + Y0 Y1 Y2 Yn     Z0 Z1 Z2 Zn Assign a particular “core” Nvidia’s GeForce GTX 680 Block Diagram (1536 Cuda cores) 4 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 5. Programming with GPU (1/2) Example) Parallel Execution of “sqrt(Xi^2 + Yi^2) < Zi” GPU Code __kernel void sample_func(bool result[], float x[], float y[], float z[]) { int i = get_global_id(0); result[i] = (bool)(sqrt(x[i]^2 + y[i]^2) < z[i]); } Host Code #define N (1<<20) size_t g_itemsz = N / 1024; size_t l_itemsz = 1024; /* Acquire device memory and data transfer (host -> device) */ X = clCreateBuffer(cxt, CL_MEM_READ_WRITE, sizeof(float)*N, NULL, &r); clEnqueueWriteBuffer(cmdq, X, CL_TRUE, sizeof(float)*N, ...); /* Set argument of the kernel code */ clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&X); /* Invoke device kernel */ clEnqueueNDRangeKernel(cmdq, kernel, 1, &g_itemsz, &l_itemsz, ...); 5 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 6. Programming with GPU (2/2) 1. Build & Load GPU Kernel Source OpenCL Code Compiler L2 Cache X, Y, Z result buffer GPU Kernel buffer Command Host Memory Queue Device DRAM 6 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 7. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory L2 Cache X, Y, Z result buffer Device GPU Kernel buffer Command Memory Host Memory Queue Device DRAM 7 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 8. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory 3. Enqueue DMA Transfer (host  device) L2 Cache X, Y, Z result buffer Device GPU Kernel buffer Command Memory Host Memory Queue Device DRAM 8 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 9. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory 3. Enqueue DMA Transfer (host  device) 4. Setup Kernel Arguments L2 Cache X, Y, Z result buffer Device GPU Kernel buffer Command Memory Host Memory Queue Device DRAM 9 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 10. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory 3. Enqueue DMA Transfer (host  device) 4. Setup Kernel Arguments 5. Enqueue Execution of GPU Kernel L2 Cache X, Y, Z result buffer Device GPU Kernel buffer Command Memory Host Memory Queue Device DRAM 10 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 11. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory 3. Enqueue DMA Transfer (host  device) 4. Setup Kernel Arguments 5. Enqueue Execution of GPU Kernel 6. Enqueue DMA Transfer (device  host) L2 Cache X, Y, Z result buffer Device GPU Kernel buffer Command Memory Host Memory Queue Device DRAM 11 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 12. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory 3. Enqueue DMA Transfer (host  device) 4. Setup Kernel Arguments 5. Enqueue Execution of GPU Kernel 6. Enqueue DMA Transfer (device  host) 7. Synchronize the command queue  DMA Transfer (host  device) L2 Cache X, Y, Z result buffer Device GPU Kernel buffer Command Memory Host Memory Queue Device DRAM 12 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 13. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory 3. Enqueue DMA Transfer (host  device) 4. Setup Kernel Arguments 5. Enqueue Execution of GPU Kernel 6. Enqueue DMA Transfer (device  host) Super Parallel 7. Synchronize the command queue Execution  DMA Transfer (host  device)  Execution of GPU Kernel L2 Cache X, Y, Z result buffer Device GPU Kernel buffer Command Memory Host Memory Queue Device DRAM 13 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 14. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory 3. Enqueue DMA Transfer (host  device) 4. Setup Kernel Arguments 5. Enqueue Execution of GPU Kernel 6. Enqueue DMA Transfer (device  host) 7. Synchronize the command queue  DMA Transfer (host  device)  Execution of GPU Kernel  DMA Transfer (device  host) 8. Release Device Memory L2 Cache X, Y, Z result buffer Device GPU Kernel buffer Command Memory Host Memory Queue Device DRAM 14 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 15. Programming with GPU (2/2) 1. Build & Load GPU Kernel 2. Allocate Device Memory 3. Enqueue DMA Transfer (host  device) 4. Setup Kernel Arguments 5. Enqueue Execution of GPU Kernel 6. Enqueue DMA Transfer (device  host) 7. Synchronize the command queue  DMA Transfer (host  device)  Execution of GPU Kernel  DMA Transfer (device  host) 8. Release Device Memory L2 Cache X, Y, Z result buffer GPU Kernel buffer Command Host Memory Queue Device DRAM 15 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 16. Basic idea to utilize GPU  Simultaneous (Asynchronous) execution of CPU and GPU  Minimization of data transfer between host and device GPGPU Memory CPU (non-integrated) DDR3-1600 Super on-host (51.2GB/s) Parallel buffer Execution DMA Transfer DDR5 192.2GB/s IO HUB on-device buffer HBA device DRAM SAS 2.0 (600MB/s) PCI-E 3.0 x16 (32.0GB/s) 16 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 17. Back to the PostgreSQL world Don’t I forget I’m talking at PGconf.EU 2012? 17 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 18. Re-definition of SQL/MED ▌SQL/MED (Management of External Data)  External data source performing as if regular tables  Not only “management”, but external computing resources also Exec Regular Exec storage Query Executor Table Query Planner Query Parser Foreign MySQL Table FDW SQL Query Foreign Oracle FDW Exec Table Foreign PG-Strom Table FDW Exec Regular storage Table 18 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 19. Introduction of PG-Strom ▌PG-Strom is ...  A FDW extension of PostgreSQL, released under the GPL v3. https://github.com/kaigai/pg_strom  Not a stable module, please don’t use in production system yet.  Designed to utilize GPU devices for CPU off-load according to their characteristics. ▌Key features of PG-Strom  Just-in-time pseudo code generation for GPU execution  Column-oriented internal data structure  Asynchronous query execution Reduction of response-time dramatically! 19 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 20. Asynchronous Execution using CPU/GPU (1/2) ▌CPU characteristics  Complex Instruction, less parallelism  Expensive & much power consumption per core  I/O capability ▌GPU characteristics  Simple Instruction, much parallelism  Cheap & less power consumption per core  Device memory access only (except for integrated GPU) ▌“Best Mix” strategy of PG-Stom  CPU focus on I/O and control stuff.  GPU focus on calculation stuff. 20 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 21. Asynchronous Execution using CPU/GPU (2/2) vanilla PostgreSQL PostgreSQL with PG-Strom CPU CPU GPU Asynchronous memory transfer and execution Iteration of scan tuples and evaluation of qualifiers Synchronization Larger “chunk” to scan Earlier than the database at once “Only CPU” scan : Scan tuples on shared-buffers : Execution of the qualifiers Page 21 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 22. So what, How fast is it? postgres=# SELECT COUNT(*) FROM rtbl WHERE sqrt((x-256)^2 + (y-128)^2) < 40; count -------- 100467 (1 row) Time: 7668.684 ms postgres=# SELECT COUNT(*) FROM ftbl WHERE sqrt((x-256)^2 + (y-128)^2) < 40; count -------- 100467 (1 row) Accelerated! Time: 857.298 ms  CPU: Xeon E5-2670 (2.60GHz), GPU: NVidia GeForce GT640, RAM: 384GB  Both of regular rtbl and PG-Strom ftbl contain 20milion rows with same value 22 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 23. Architecture of PG-Strom World of CPU Regular Shadow Tables Tables A chunk contains both of data & code shared buffer chunk Data World of GPU Pseudo Async codes DMA SeqScan, PG-Strom Transfer GPU Device etc... Event Monitor Memory ForeignScan Super Result Parallel Query Executor PG-Strom Execution GPU Control GPU Kernel PostgreSQL Backend PostgreSQL Backend Server PostgreSQL Backend Preload Function Postmaster An extra Works according to daemon given pseudo code 23 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 24. Pseudo code generation (1/2) SELECT * FROM ftbl WHERE c like ‘%xyz%’ AND sqrt((x-256)^2+(y-100)^2) < 10; contains unsupported operators / functions Translation to xreg10 = $(ftbl.x) pseudo code xreg12 = 256.000000::double Super xreg8 = (xreg10 - xreg12) Parallel xreg10 = 2.000000::double Execution xreg6 = pow(xreg8, xreg10) GPU Kernel Function xreg12 = $(ftbl.y) xreg14 = 128.000000::double : 24 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 25. Pseudo code generation (2/2) __global__ Regularly, we should avoid void kernel_qual(const int commands[],...) { branch operations on GPU const int *cmd = commands; code : while (*cmd != GPUCMD_TERMINAL_COMMAND) result = 0; { switch (*cmd) if (condition) { case GPUCMD_CONREF_INT4: { regs[*(cmd+1)] = *(cmd + 2); result = a + b; cmd += 3; } break; case GPUCMD_VARREF_INT4: else VARREF_TEMPLATE(cmd, uint); { break; result = a - b; case GPUCMD_OPER_INT4_PL: OPER_ADD_TEMPLATE(cmd, int); } break; : return 2 * result; : 25 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 26. Pseudo code generation (2/2) __global__ Regularly, we should avoid void kernel_qual(const int commands[],...) { branch operations on GPU const int *cmd = commands; code : while (*cmd != GPUCMD_TERMINAL_COMMAND) result = 0; { switch (*cmd) if (condition) { case GPUCMD_CONREF_INT4: { regs[*(cmd+1)] = *(cmd + 2); result = a + b; cmd += 3; } break; case GPUCMD_VARREF_INT4: else VARREF_TEMPLATE(cmd, uint); { break; result = a - b; case GPUCMD_OPER_INT4_PL: OPER_ADD_TEMPLATE(cmd, int); } break; : return 2 * result; : 26 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 27. Pseudo code generation (2/2) __global__ Regularly, we should avoid void kernel_qual(const int commands[],...) { branch operations on GPU const int *cmd = commands; code : while (*cmd != GPUCMD_TERMINAL_COMMAND) result = 0; { switch (*cmd) if (condition) { case GPUCMD_CONREF_INT4: { regs[*(cmd+1)] = *(cmd + 2); result = a + b; cmd += 3; } break; case GPUCMD_VARREF_INT4: else VARREF_TEMPLATE(cmd, uint); { break; result = a - b; case GPUCMD_OPER_INT4_PL: OPER_ADD_TEMPLATE(cmd, int); } break; : return 2 * result; : 27 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 28. Pseudo code generation (2/2) __global__ Regularly, we should avoid void kernel_qual(const int commands[],...) { branch operations on GPU const int *cmd = commands; code : while (*cmd != GPUCMD_TERMINAL_COMMAND) result = 0; { switch (*cmd) if (condition) { case GPUCMD_CONREF_INT4: { regs[*(cmd+1)] = *(cmd + 2); result = a + b; cmd += 3; } break; case GPUCMD_VARREF_INT4: else VARREF_TEMPLATE(cmd, uint); { break; result = a - b; case GPUCMD_OPER_INT4_PL: OPER_ADD_TEMPLATE(cmd, int); } break; : return 2 * result; : 28 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 29. Pseudo code generation (2/2) __global__ Regularly, we should avoid void kernel_qual(const int commands[],...) { branch operations on GPU const int *cmd = commands; code : while (*cmd != GPUCMD_TERMINAL_COMMAND) result = 0; { switch (*cmd) if (condition) { case GPUCMD_CONREF_INT4: { regs[*(cmd+1)] = *(cmd + 2); result = a + b; cmd += 3; } break; case GPUCMD_VARREF_INT4: else VARREF_TEMPLATE(cmd, uint); { break; result = a - b; case GPUCMD_OPER_INT4_PL: OPER_ADD_TEMPLATE(cmd, int); } break; : return 2 * result; : 29 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 30. OT: Why “pseudo”, not native code Initial design at Jan-2012 SQL Query PG-Strom module GPU PostgreSQL Core Qualifier Source Run-time Pre-Compiled Query Parser GPU Code Binary Cache Compile Generator Time GPU GPU Source Binary PG-Strom Query Planner Planner Init nvcc Async Memcpy Async PG-Strom Query Executor Load Execute Executor GPU Columns used Num of to Qualifiers Scan kernels Async Memcpy to load Regular pg_strom Columns used Databases schema to Target-List PGcon2012, PG-Strom -A GPU Optimized Asynchronous Executor Module of FDW- 30
  • 31. Save the bandwidth of PCI-Express bus E.g) SELECT name, tel, email, address FROM address_book WHERE sqrt((pos_x – 24.5)^2 + (pos_y – 52.3)^2) < 10;  No sense to fetch columns being not in use CPU GPU CPU GPU Synchronization Synchronization : Scan tuples on the shared-buffers : Execution of the qualifiers Reduction of data-size to be : Columns being not used the qualifiers transferred via PCI-E PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module 31
  • 32. Data density & Column-oriented structure (1/3) (shadow) TABLE “public.ft.rowid” rowid nitems isnull FOREIGN TABLE ft 4000 2000 {0,0,0,1,0,0,…} int float text 6000 2000 {0,0,0,0,0,0,…} X Y Z : : : (shadow)14000 “public.ft.z.cs” TABLE 400 {0,0,1,0,0,0,…} rowid nitems isnull values 4000 15 {0,0,…} { ‘hello’, ‘world’, … } 4015 20 {0,0,…} { ‘aaa’, ‘bbb’, ‘ccc’, … } (shadow) TABLE “public.ft.y.cs” : : : : rowid nitems isnull values 14275 25 {0,0,…} {‘xxx’, ‘yyy’, ‘zzz’, …} 4000 250 {0,0,…} { 1.38, 6.45, 2.15, … } (shadow) TABLE “public.ft.a.cs” { 4.32, 5.46, 3.14, … } 4250 250 {0,1,…} rowid nitems : isnull : : values : 4000 500 {0,0,…} {10, 20, {0,0,…} 50, …} 29, 39, 49, 59, …} 14200 100 30, 40, {19, 4500 500 {0,1,…} {11, 0, 31, 41, 51, …} : : : : 14200 200 {0,0,…} {19, 29, 39, 49, 59, …} PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module 32
  • 33. Data density & Column-oriented structure (2/3) postgres=# CREATE FOREIGN TABLE example (a int, b text) SERVER pg_strom; CREATE FOREIGN TABLE postgres=# SELECT * FROM pgstrom_shadow_relations; oid | relname | relkind | relsize -------+----------------------+---------+----------- 16446 | public.example.rowid | r | 0 16449 | public.example.idx | i | 8192 16450 | public.example.a.cs | r | 0 16453 | public.example.a.idx | i | 8192 16454 | public.example.b.cs | r | 0 16457 | public.example.b.idx | i | 8192 16462 | public.example.seq | S | 8192 (9 rows) postgres=# SELECT * FROM pg_strom."public.example.a.cs" ; rowid | nitems | isnull | values -------+--------+--------+-------- (0 rows) 33 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 34. Data density & Column-oriented structure (2/3) ② Calculation opcode Pseudo Code PgStromChunkBuffer ① Transfer rowmap value a[] <not used> value b[] ③ Write-Back value c[] value d[] <not used> Less bandwidth Table: my_schema.ft1.b.cs consumption 10100 {2.4, 5.6, 4.95, … } 10300 {10.23, 7.54, 5.43, … } Table: my_schema.ft1.c.cs Also, suitable for 10100 {‘2010-10-21’, …} data compression 10200 {‘2011-01-23’, …} 10300 {‘2011-08-17’, …} 34 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 36. Key features towards upcoming v9.3 (1/2) ▌Extra Daemon  It enables extension to manage background worker processes.  Pre-requisites to implement PG-Strom’s GPU control server  Alvaro submitted this patch on CommitFest:Nov. Shared Resources (DB cluster, shared mem, IPC, ...) Built-in Extra background daemon PostgreSQL PostgreSQL daemon PostgreSQL PostgreSQL Backend Backend Backend Backend (autovacuum, bgwriter...) (GPU controller) manage Extension postmaster 36 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 37. Key features towards upcoming v9.3 (2/2) ▌Writable Foreign Table  It enables to use usual INSERT, UPDATE or DELETE to modify foreign tables managed by PG-Strom.  KaiGai submitted a proof-of-concept patch to CommitFest:Sep.  In-core postgresql_fdw is needed for working example. SELECT rowid, * create_ FROM ... WHERE ... Remote foreignscan_plan FOR UPDATE; Data Foreign Data Wrapper Planner Source ExecQual FETCH ExecForeignScan Executor UPDATE ... WHERE rowid = xxx; ExecModifyTable 37 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 38. More Rapidness (1/2) – Parallel Data Load World of CPU Regular Shadow Tables Tables shared buffer chunk Async World of GPU DMA PG-Strom Transfer SeqScan PG-Strom PG-Strom PG-Strom Data GPU Device etc... Data Data Loader Memory ForeignScan Loader Loader Event Monitor Super chunk Parallel Query Executor PG-Strom Execution to be GPU Control PostgreSQL Backend loaded Server GPU Kernel PostgreSQL Backend PostgreSQL Backend Preload Function Works according to Postmaster given pseudo code 38 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 39. More Rapidness (2/2) – TargetList Push-down SELECT ((a + b) * (c – d))^2 FROM ftbl; SELECT pseudo_col FROM ftbl; a b c d pseudo_col Computed 1 2 3 4 9 during 3 1 4 1 144 ForeignScan 2 4 1 4 324 2 2 3 6 144 : : : : :  Pseudo column hold “computed” result, to be just referenced  Performs as if extra columns exist in addition to table definition 39 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 40. We need you getting involved ▌Project was launched from my personal curiousness, ▌So, it is uncertain how does PG-Strom fit “real-life” workload. ▌We definitely have to find out attractive usage of PG-Strom Which region? Which problem? How to solve? 40 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 41. Summary ▌Characteristics of GPU device  Inflexible instructions, but much higher parallelism  Cheap & small power consumption per computing capability ▌PG-Strom  Utilization of GPU device for CPU off-load and rapid response  Just-in-time pseudo code generation according to the given query  Column-oriented data structure for data density on PCI-Express bus In the result, dramatic shorter response time ▌Upcoming development  Upstream • Extra daemons, Writable Foreign Tables  Extension • Move to OpenCL rather than CUDA ▌Your involvement can lead future evolution of PG-Strom 41 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 42. Any Questions? 42 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 43. Thank you ありがとうございました THANK YOU DĚKUJEME DANKE MERCI GRAZIE GRACIAS 43 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module
  • 44. Page 44 PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module