Escuela Técnica Superior de Ingenieros Industriales. Universidad Politécnica de Madrid – 12.02.2010




         GPGPU in Scientific Applications




                              www.SDART.co.uk





                        Plan of presentation
  Parallel computing

  GPGPU

  GPGPU technologies

  Scientific applications







                       Computational limits
  Resources

  Speed:
        Faster hardware

        Optimized software

        Parallelism!






                            Flynn’s taxonomy
                                        Single instruction      Multiple instructions

                  Single data           SISD                    MISD

                  Multiple data         SIMD (GPU)              MIMD

                    We will focus on data parallelism
                     (as opposed to task parallelism).



                       Computational limits
  Task parallelism (CPU):
        Normal on current OSs running on multi-core processors
        Independent processes (limited communication)

  Data parallelism (GPU):
        The same data (vector) to process
        The elements of the vector can be divided and do not
         depend on one another
        Uses multiple processing units
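
  A minimal Python sketch of the data-parallel idea: the vector is split into independent chunks and the same operation is applied to each chunk by a separate worker. The function names and the doubling operation are illustrative assumptions, not part of the original slides, and the threads only show the decomposition (in CPython they do not give a real speedup for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor


def scale_chunk(chunk):
    """Apply the same operation (here: doubling) to every element of one chunk."""
    return [2.0 * x for x in chunk]


def data_parallel_scale(vector, units=4):
    """Split the vector into chunks and process them on several workers."""
    # Ceiling division, with a minimum of 1 so an empty vector is handled.
    size = max(1, (len(vector) + units - 1) // units)
    chunks = [vector[i:i + size] for i in range(0, len(vector), size)]
    with ThreadPoolExecutor(max_workers=units) as pool:
        # map() returns the results in the same order as the chunks.
        parts = list(pool.map(scale_chunk, chunks))
    return [x for part in parts for x in part]


if __name__ == "__main__":
    print(data_parallel_scale([1.0, 2.0, 3.0, 4.0, 5.0]))
```

  Because no element depends on another, the chunks can be processed in any order and on any number of units.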






                          Parallel computing
  [Figure: execution timeline of processing units PU1 to PU5, showing parallel sections and sequential sections]




                          Parallel computing
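  Amdahl’s law: the speedup of a programme run on
  multiple processing units is limited by the time of
  the programme’s sequential fraction.

  \text{Speedup} = \frac{1}{(1 - \text{parallel}) + \frac{\text{parallel}}{\text{processors}}}

  Here "parallel" is the fraction of the programme that can run in
  parallel and "processors" is the number of processing units.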
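
  As a rough illustration of Amdahl's law as stated on this slide, the Python sketch below evaluates the speedup and the corresponding efficiency (speedup divided by the number of processors). The function names are chosen here for illustration and do not come from the original slides.

```python
def amdahl_speedup(parallel, processors):
    """Speedup predicted by Amdahl's law.

    parallel   -- fraction of the programme that can run in parallel (0..1)
    processors -- number of processing units (>= 1)
    """
    return 1.0 / ((1.0 - parallel) + parallel / processors)


def efficiency(parallel, processors):
    """Speedup per processing unit (1.0 is the ideal case)."""
    return amdahl_speedup(parallel, processors) / processors


if __name__ == "__main__":
    counts = (1, 2, 4, 8, 16, 32, 64)
    for fraction in (0.20, 0.50, 0.75, 0.95, 0.99):
        row = ["%6.2f" % amdahl_speedup(fraction, n) for n in counts]
        print("%3.0f%%" % (fraction * 100), " ".join(row))
```

  With a parallel fraction of 0.95 the speedup stays below 20 however many processors are added, because the sequential 5% always remains.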





                  Parallel computing - speedup
  [Chart: speedup (0 to 70) against the number of processors (1, 2, 4, 8, 16, 32, 64); curves labelled Ideal, 99%, 95%, 75%, 50% and 20%]




                 Parallel computing - efficiency
  [Chart: efficiency (0 to 1) against the number of processors (1, 2, 4, 8, 16, 32, 64); curves labelled Ideal, 99%, 95%, 75%, 50% and 20%]




                          Parallel computing
  There exists a bound B such that the time TN taken
  on N processors satisfies TN ≥ B, regardless of the
  value of N.

  As B depends on the sequential (non-parallel) part
  of the programme, we look for ways to reduce this part.

  As a result: speedup ↑ efficiency ↑
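
  A small Python sketch of this bound, assuming the run time follows Amdahl's model (the sequential part takes the same time on any number of processors, the parallel part is divided evenly). The names and the sample values (100 s, 95% parallel) are illustrative assumptions.

```python
def run_time(t1, parallel, processors):
    """Run time on `processors` units, given the single-unit time t1."""
    return t1 * ((1.0 - parallel) + parallel / processors)


def lower_bound(t1, parallel):
    """Bound B: the time of the sequential part, which adding units cannot remove."""
    return t1 * (1.0 - parallel)


if __name__ == "__main__":
    t1, parallel = 100.0, 0.95  # illustrative values only
    print("B =", lower_bound(t1, parallel))
    for n in (1, 8, 64, 1024, 10**6):
        print(n, run_time(t1, parallel, n))
```

  The printed times approach B as N grows but never fall below it, which is why shrinking the sequential part is the only way to lower the bound.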



                          Parallel computing
  Summarising, we could compute faster by:
  •designing better algorithms (not always possible)
  •using faster machines
  •doing tasks on several processing units (cores/graphical
  processors)

  The speed of PUs is currently limited more by
  quantum physics than by engineering.






                                             GPGPU







                  GPU history (very brief)
                               Simple memory operations e.g. Amiga’s
                                     Blitter (part of Agnus chip)



                                 Programmable vertex and fragment
                                 shaders (programmable 3D pipeline)



                                 Fully programmable processing units




                                                 ??       ?



                                Transformation

     Graphical Processing Unit (GPU)   →   General Purpose Graphical Processing Unit (GPGPU)





                     GPU Hardware - Cores
                                Card                                   No. processors
                         GeForce 9800 GTX                                       128

                         GeForce 9300M GS                                        8

                        GeForce 9800M GTX                                       112

                      GeForce GTX 295 (2x285)                                  2x240

               Tesla S1075 (GPU Computing Server)                               960

                          Radeon HD 4890                                         80

                     Mobility Radeon HD 5870                                    160

                          Radeon HD 5970                                       2x320







                    Reasons to use GPGPU
  •Power (more info on the next slide)

  •Costs (equipment)

  •Costs (energy, relevant for supercomputers; e.g. Radeon HD
  5870 has 14.47 GFLOPS/W and Core 2 Extreme
  QX9775 has 0.34 GFLOPS/W)

  •Memory access (approx. an order of magnitude faster for GPU)



                                            GFLOPS




     Comparison from NVIDIA CUDA Programming Guide



                          Power comparison
                  Processing unit (or device)                           GFLOPs FP64
                  Intel Core 2 Duo E8600                                      26.64
                  Intel Core 2 Duo P7350 (mobile)                               16
                  Intel Core 2 Quad Q9650                                       48
                  Intel Core 2 Extreme QX9775                                 51.20
                  NVIDIA GeForce 9300M GS (mobile)                              34
                  NVIDIA GeForce GTX 380                                870 (~1700 FP32)
                  AMD Athlon X2 7750 BE (dual-core)                             17
                  AMD Phenom II X4 940 (quad-core)                              44
                  AMD Radeon HD 5970                                    928 (~2700 FP32)
                  Microsoft Xbox 360                                   115 CPU/240 GPU
                  Sony PS3                                             204 CPU / 900 GPU
                  SPARC64 VIIIfx Venus                                         128
              Various sources, e.g. official materials, articles, benchmarks submitted by users.
                                  This is only an approximate comparison.



                     GPU Hardware - NVIDIA
GeForce GTX 295           GeForce GTS 150             GeForce 8800 GTX            Quadro FX 1800M    GeForce G105M
Tesla S1070               Quadro FX 570               Quadro Plex 2100 S4         GeForce 305M       GeForce G102M
Quadro FX 5800            GeForce GT 130              GeForce 8800 GTS            Quadro FX 1600M    GeForce 9800M GTX
GeForce GTX 285           Quadro FX 470               Quadro Plex 1000 Model IV   GeForce GTX 285M   GeForce 9800M GT
Tesla C1060               GeForce GT 120              GeForce 8800 GT             Quadro FX 880M     GeForce 9800M GTS
Quadro FX 5600            Quadro FX 380               GeForce 8800 GS             GeForce GTX 280M   GeForce 9800M GS
GeForce GTX 285 for Mac   GeForce G100                GeForce 8600 GTS            Quadro FX 770M     GeForce 9700M GTS
Tesla C870                Quadro FX 380 LP            GeForce 8600 GT             GeForce GTX 260M   GeForce 9700M GT
Quadro FX 4800            GeForce 9800 GX2            GeForce 8500 GT             Quadro FX 570M     GeForce 9650M GS
GeForce GTX 280           Quadro FX 370               GeForce 8400 GS             GeForce GTS 260M   GeForce 9600M GT
Tesla D870                GeForce 9800 GTX+           GeForce 9400 mGPU           Quadro FX 380M     GeForce 9600M GS
Quadro FX 4800 for Mac    Quadro FX 370 Low Profile   GeForce 9300 mGPU           GeForce GTS 250M   GeForce 9500M GS
GeForce GTX 275           GeForce 9800 GTX            GeForce 8300 mGPU           Quadro FX 370M     GeForce 9500M G
Tesla S870                Quadro CX                   GeForce 8200 mGPU           GeForce GTS 160M   GeForce 9400M G
Quadro FX 4700 X2         GeForce 9800 GT             GeForce 8100 mGPU           Quadro FX 360M     GeForce 9300M GS
GeForce GTX 260           Quadro NVS 450              GeForce GTS 360M            GeForce GTS 150M   GeForce 9300M G
Quadro FX 4600            GeForce 9600 GSO            Quadro FX 3800M             Quadro NVS 320M    GeForce 9200M GS
GeForce GTS 250           Quadro NVS 420              GeForce GTS 350M            GeForce GT 240M    GeForce 9100M G
Quadro FX 3800            GeForce 9600 GT             Quadro FX 3700M             Quadro NVS 160M    GeForce 8800M GTS
GeForce GTS 240           Quadro NVS 295              GeForce GT 335M             GeForce GT 230M    GeForce 8700M GT
Quadro FX 3700            GeForce 9500 GT             Quadro FX 3600M             Quadro NVS 150M    GeForce 8600M GT
GeForce GT 240            Quadro NVS 290              GeForce GT 330M             GeForce GT 130M    GeForce 8600M GS
Quadro FX 1800            GeForce 9400 GT             ION                         Quadro NVS 140M    GeForce 8400M GT
GeForce GT 220            Quadro Plex 2100 D4         Quadro FX 2800M             GeForce G210M      GeForce 8400M GS
Quadro FX 1700            GeForce 8800 Ultra          GeForce GT 325M             Quadro NVS 135M
GeForce 210               Quadro Plex 2200 D2         Quadro FX 2700M             GeForce G110M
Quadro FX 580                                         GeForce 310M                Quadro NVS 130M





                         GPU Hardware - ATI
 ATI Radeon™ HD                    ATI FirePro™                        ATI Mobility Radeon™ HD
 5970                              V8750                               4870
 5870                              V8700                               4860
 5850                              V7750                               4850 X2
 5770                              V5700                               4850
 5750                              V3750                               4830
 4890                                                                  4670
 4870 X2                           AMD FireStream™                     4650
 4870                              9270                                4500 Series
 4850 X2                           9250                                4300 Series
 4850
 4830
 4770                              ATI Mobility FirePro™
 4670                              M7740
 4650
 4550                              ATI Radeon™ Embedded
 4350                              E4690 Discrete GPU





                         CPU/GPU tendency
  [Figure: relative roles of GPU and CPU over time, in three stages: "Add-on", "Support", "Blended"]




                        GPGPU Technologies







                                      Technologies
  Let’s take a look at:
        •   AMD/ATI Stream
        •   DirectX11 DirectCompute
        •   NVIDIA CUDA
        •   OpenCL


  Currently CUDA and Stream support various
  technologies, so we will limit their analysis
  to the original ones.



                                    AMD Stream
  Released by ATI in December 2007 as Close-to-the-Metal
  (beta version).

  Uses the Brook+ language: Brook optimized for AMD
  hardware.

  Later versions were upgraded to use OpenCL and
  DirectCompute.



                                    AMD Stream
  [Diagram: AMD Stream software stack, top to bottom]

     Brook+ (high-level language)   |   Tools, Libraries, Middleware (ACML, RapidMind ..)

                AMD/ATI Stream Compute Abstraction Layer (CAL)

                               AMD/ATI GPU Hardware






                     AMD Stream – Brook+
  Brook was an extension to the C language
  designed for stream programming (developed at
  Stanford University).

  Brook+ was implemented by AMD, built on top of
  AMD’s Compute Abstraction Layer and enhanced to
  use AMD’s hardware.



                     AMD Stream – Brook+
  // Kernels: operate on stream elements
  kernel void sum(float a<>, float b<>, out float c<>)
  {
        c = a + b;
  }

  int main(int argc, char** argv)
  {
         int i, j;
         // Streams: collections of data (vectors) to be operated on in parallel
         float a<10, 10>;
         float b<10, 10>;
         float c<10, 10>;
         float input_a[10][10];
         float input_b[10][10];
         float input_c[10][10];
         for(i=0; i<10; i++) {
                for(j=0; j<10; j++) {
                      input_a[i][j] = (float) i;
                      input_b[i][j] = (float) j;
                }
         }
         // Brook+ access functions
         streamRead(a, input_a);
         streamRead(b, input_b);
         sum(a, b, c);
         streamWrite(c, input_c);
         ...
  }
  Example comes from AMD-Brookplus Manual





                          AMD Stream – Tools
  [Diagram: AMD Stream tool chain]

      Input:     Integrated Stream Kernel and CPU Programme

      Compiler:  CPU, Stream Code Splitter; Kernel Compiler
                 Outputs: CPU Code (C), CPU Emulation Code (C++),
                 AMD Stream Processor Device Code (IL)

      Run-time:  Stream Runtime
                 CPU Backend; GPU Backend (CAL)





               AMD Stream – Pros&Cons
  + OpenCL was supported by AMD before NVIDIA

  - Non-portable technology

  - Before AMD introduced OpenCL, it was a CUDA
  follower

  - Less popular than CUDA





               DirectX 11 DirectCompute
  Microsoft’s answer to GPGPU.

  Still in development (drivers, learning
  resources, tools ...).

  A part of the DirectX technology, which is very popular in
  the computer games industry.





               DirectX 11 DirectCompute
  DirectX is the winner in PC gaming but not in
  graphics software (OpenGL).

  It is a step back: software written with it is not
  portable (similar to the OpenGL/DirectX struggle).

  Limited to Windows: not a good choice for
  engineers or scientists.



                    DirectX 11 DirectCompute
  //--------------------------------------------------------------------------------------
  // File: BasicCompute11.hlsl
  //
  // This file contains the Compute Shader to perform array A + array B
  //
  // Copyright (c) Microsoft Corporation. All rights reserved.
  //--------------------------------------------------------------------------------------

  struct BufType
  {
     int i;
     float f;
  };

  StructuredBuffer<BufType> Buffer0 : register(t0);
  StructuredBuffer<BufType> Buffer1 : register(t1);
  RWStructuredBuffer<BufType> BufferOut : register(u0);

  [numthreads(1, 1, 1)]
  void CSMain( uint3 DTid : SV_DispatchThreadID )
  {
    BufferOut[DTid.x].i = Buffer0[DTid.x].i + Buffer1[DTid.x].i;
    BufferOut[DTid.x].f = Buffer0[DTid.x].f + Buffer1[DTid.x].f;
  }



                                CUDA - History
  The public announcement was made in November
  2006, the beta was released in February 2007 and the full
  version in June 2007.

  The CUDA Toolkit is currently at version 3 (beta).







                        CUDA - Architecture




                                  Source: NVIDIA CUDA Architecture



                                      C for CUDA
  C for CUDA is an extension of the C language that
  lets the programmer target portions of the
  source code for execution on the device.

  It also has restrictions, e.g. __global__ and __device__
  functions cannot be recursive (as is possible in standard C).





                              CUDA - Example
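  // Kernel definition: each thread adds one pair of elements
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x;
      C[i] = A[i] + B[i];
  }

  int main()
  {
      // A, B and C must point to device memory holding N floats
      // (allocation and copying are omitted on this slide)
      // Kernel invocation: one block of N threads
      vecAdd<<<1, N>>>(A, B, C);
  }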

  Example comes from NVIDIA CUDA Programming Guide







                              CUDA - Example
  __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
  {
      int i = threadIdx.x;
      int j = threadIdx.y;
      C[i][j] = A[i][j] + B[i][j];
  }

  int main()
  {
      // Kernel invocation
      dim3 dimBlock(N, N);
      matAdd<<<1, dimBlock>>>(A, B, C);
  }

  Example comes from NVIDIA CUDA Programming Guide







         CUDA – Programming schema
  [Figure: a grid of blocks (indexed 0 to 2 in each direction); each block contains a grid of threads (indexed 0 to 2 in each direction)]

  Threads from one block can cooperate by:
  •Synchronization
  •Sharing data (low latency shared memory)



                              CUDA - Example
  __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      if (i < N && j < N)
          C[i][j] = A[i][j] + B[i][j];
  }

  int main()
  {
      // Kernel invocation
      dim3 dimBlock(16, 16);
      dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                   (N + dimBlock.y - 1) / dimBlock.y);
      matAdd<<<dimGrid, dimBlock>>>(A, B, C);
  }

  Example comes from NVIDIA CUDA Programming Guide
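
  The index arithmetic of this example can be checked on the CPU. The Python sketch below emulates the launch sequentially: it walks over every block and every thread, computes the same global indices and applies the same bounds check. It is only an illustration of the mapping, not CUDA code; the function names are assumptions made here.

```python
def grid_size(n, block):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block - 1) // block


def launch_mat_add(a, b, n, block=16):
    """Sequentially emulate matAdd over a 2D grid of 2D blocks."""
    c = [[0.0] * n for _ in range(n)]
    blocks = grid_size(n, block)
    for block_x in range(blocks):
        for block_y in range(blocks):
            for thread_x in range(block):
                for thread_y in range(block):
                    # Same formula as blockIdx * blockDim + threadIdx.
                    i = block_x * block + thread_x
                    j = block_y * block + thread_y
                    if i < n and j < n:  # guard threads past the matrix edge
                        c[i][j] = a[i][j] + b[i][j]
    return c


if __name__ == "__main__":
    n = 20  # deliberately not a multiple of the block size
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    print(grid_size(n, 16), launch_mat_add(a, b, n)[n - 1][n - 1])
```

  Rounding the grid size up means some threads fall outside the matrix, which is why the kernel needs the `i < N && j < N` check.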




                          CUDA – Pros&Cons
  + Uses a variant of the C language

  + Very popular (the privilege of being the first solution)

  - Mostly single precision (FP32): only cards > GTX 260
  have FP64

  - Available only on NVIDIA cards





                                            OpenCL
  OpenCL is a framework managed by the Khronos Group
  consortium. Its development was started by
  Apple. The consortium includes AMD, IBM, NVIDIA
  and Intel.

  The technical specification was publicly released
  on 8th December 2008.



                                            OpenCL
  The OpenCL specification has been implemented on various
  OSs (Windows, Mac OS, Linux) as well as on
  various hardware, e.g. AMD and NVIDIA.

  It is a portable solution (similar to OpenGL or
  OpenAL). Vendors must provide toolkits
  supporting OpenCL on a particular hardware and OS.



                                            OpenCL
  This technology is based on the C language (C99
  variant). Both leaders of the GPU market, i.e. NVIDIA
  and AMD, currently offer OpenCL support in their
  GPU toolkits.

  It follows the IEEE 754-2008 floating point
  requirements (CUDA has some differences).



                                            OpenCL
  Its idea is to use all computational resources:
  CPUs, GPUs and other processors.

  It has desktop and handheld profiles.

  Processing elements execute code as SIMD
  and SPMD (Single Process/Program Multiple Data)
  elements.



                     OpenCL - Architecture
                                              Application
                                                           OpenCL kernels


                                   OpenCL API              OpenCL C Lang.

                                       OpenCL framework

                                         OpenCL runtime

                                                  Driver


                                           GPU hardware




               OpenCL – Memory Model
   Private Memory – per work-item



                  Work-item


   Local Memory – shared by group
              (16 KB)


       Global/Constant Memory



     Host Memory – CPU memory
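
  To make the levels of this memory model more concrete, here is a small Python emulation (not OpenCL code). It assumes a hypothetical task: every work-group sums its own slice of the data. The input list stands for global memory, a per-group list stands for local memory, and a plain variable stands for private memory; all names are illustrative.

```python
def group_sums(global_memory, group_size):
    """Sum each work-group's slice of global memory using a local buffer."""
    partial_sums = []  # written back to global memory, one value per group
    for start in range(0, len(global_memory), group_size):
        end = min(start + group_size, len(global_memory))
        # Local memory: shared by the work-items of this group only.
        local_memory = [0] * (end - start)
        for local_id in range(end - start):
            # Private memory: a value owned by a single work-item.
            private_value = global_memory[start + local_id]
            local_memory[local_id] = private_value
        partial_sums.append(sum(local_memory))
    return partial_sums


if __name__ == "__main__":
    host_memory = [1, 2, 3, 4, 5]  # data prepared on the CPU side
    print(group_sums(host_memory, group_size=2))
```

  On a real device the point of the local buffer is speed: work-items of one group reuse data from the small, fast local memory instead of reading global memory repeatedly.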





          OpenCL – PyOpenCL Example
  from numpy import *
  import opencl

  # CPU vectors allocation and initialization - Classic NumPy
  host_vec_1 = array([37,50,54,50,56,12,37,45,77,81,92,56,-22,-4], dtype='int8')
  host_vec_2 = array([35,51,54,58,55,32,-5,42,34,33,16,44, 55,14], dtype='int8')
  host_vec_out = ndarray(host_vec_1.shape, dtype='int8')

  # OpenCL C source
  opencl_source = '''
  __kernel void
  vector_add (__global char *c, __global char *a, __global char *b)
  {
    // Index of the elements to add
    unsigned int n = get_global_id(0);
    // Sum the nth element of vectors a and b and store in c
    c[n] = a[n] + b[n];
  }
  '''

  # Compile the code and exec the kernel
  prog = opencl.Program(opencl_source)
  prog.vector_add(host_vec_out, host_vec_1, host_vec_2)

  # Display the result (each summed value is decoded as a character code)
  print(''.join([chr(c) for c in host_vec_out]))




          OpenCL – PyOpenCL Example
  # Allocate GPU memory
  gpu_vec_1 = opencl.Buffer(host_vec_1)
  gpu_vec_2 = opencl.Buffer(host_vec_2)
  gpu_vec_out = opencl.Buffer(host_vec_out)

  # Exec the kernel
  prog.vector_add(gpu_vec_out, gpu_vec_1, gpu_vec_2)

  # Fetch back results into the host array allocated earlier
  gpu_vec_out.read(host_vec_out)
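
  As a rough cross-check that needs no OpenCL device, the vector_add kernel from
  the example can be emulated on the host with plain NumPy. This is only a sketch:
  `emulate_vector_add` is a hypothetical helper name, and the Python loop stands in
  for the work-items that a real device would execute in parallel.

```python
import numpy as np


def emulate_vector_add(a, b):
    """Host-side stand-in for the vector_add kernel (hypothetical helper)."""
    c = np.empty_like(a)
    # Each iteration plays the role of one work-item;
    # n corresponds to get_global_id(0) in the kernel.
    for n in range(a.shape[0]):
        c[n] = a[n] + b[n]
    return c


# Same input vectors as in the slide example
host_vec_1 = np.array([37, 50, 54, 50, 56, 12, 37, 45, 77, 81, 92, 56, -22, -4], dtype='int8')
host_vec_2 = np.array([35, 51, 54, 58, 55, 32, -5, 42, 34, 33, 16, 44, 55, 14], dtype='int8')

host_vec_out = emulate_vector_add(host_vec_1, host_vec_2)

# Decode every summed value as a character code
message = ''.join(chr(int(c)) for c in host_vec_out)
print(repr(message))  # 'Hello, World!\n'
```

  The element-wise sums are independent of each other, which is why the same body
  can be handed to one work-item per element on the device.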







                       OpenCL – Pros&Cons
  + Portable

  + Access to all compute resources

  + Rich set of built-in functions

  + Open standard

  -(?) Newer than CUDA/Stream





                  Programming languages
  Programming a GPU does not force you to write
  every part of an application in C/C++.

  C is a must for the GPU part, but the CPU part
  can be developed in many other languages, e.g.
  Python, Fortran, Java, .NET …





               Stream/CUDA/DC/OpenCL
  Stream, CUDA and DirectCompute are not portable

  OpenCL:
  •vendor independent (if a toolkit is provided),
  •the programmer can focus on solving the general
  problem rather than on a specific implementation,
  •reduces implementation costs,
  •reduces implementation time.




                        GPGPU-enabled OSs
  The first GPGPU-enabled OS was Mac OS X Snow
  Leopard (10.6). It uses GPGPU to manipulate
  media, e.g. in video de/compression.

  We should expect future OSs to manage the
  processing units (both GPU and CPU cores) more
  efficiently. Labels such as “2, 4 or X cores” are mostly marketing tags.



                       GPGPU performance
            Input                           Processing                                 Output
             CPU                               GPU                                      CPU



                                  Total time of execution
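
  The picture above can be read as a simple additive cost model. The sketch below
  assumes the three stages run strictly one after another; the function names and
  all timings are made up for illustration and are not measurements.

```python
def total_time(t_input, t_processing, t_output):
    """Total execution time when input, processing and output do not overlap."""
    return t_input + t_processing + t_output


def overall_speedup(cpu_only_time, t_input, t_processing, t_output):
    """Speedup of the whole job, transfers included, against a CPU-only run."""
    return cpu_only_time / total_time(t_input, t_processing, t_output)


# Hypothetical timings in seconds
cpu_only = 10.0   # the whole job done on the CPU
t_in = 0.5        # CPU prepares the input and sends it to the GPU
t_gpu = 1.0       # GPU processing (10x faster than the CPU in this example)
t_out = 0.5       # results are fetched back to the CPU

speedup = overall_speedup(cpu_only, t_in, t_gpu, t_out)
print(speedup)  # 5.0
```

  In this made-up case the processing stage alone is ten times faster, yet the job
  as a whole is only five times faster, because the input and output stages count
  towards the total time of execution.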







                       GPGPU performance
  Use many threads

  Do as much as possible in each thread – do not
  move work to the next thread

  Watch out for costly memory operations

  Use large data sets





                             GPGPU Software
  According to information given by AMD,
  Adobe uses GPGPU in:
  •Acrobat Reader (when working with graphically rich, high-
  resolution PDFs)
  •Photoshop CS4 Extended (accelerated image and 3D model previewing and
  manipulation)

  while Microsoft uses GPGPU in:
  •PowerPoint 2007 (acceleration of slideshow playback)
  •Silverlight






                                 What about …
  … multi-core processors?

  Currently CPUs have 2 (or 4) cores; Intel's Teraflops
  Research Chip set a record with 80 cores.

  Multi-core CPUs are less powerful but more
  flexible. Do not expect GPUs to replace CPUs.





       AMD Fusion – CPU/GPU Hybrid
  AMD plans to introduce the next generation of
  microprocessors (called Fusion) in 2011.

                                             CPU                       GPU


                                          L2 Cache                 Buffers
      Memory




                                                 Crossbar Switch
                                                                                                      North
                                                                                                      Bridge
                                          Mem. Contr           HyperTransport





                              GPGPU Speedup







                                CUDA Portfolio







                     Bloomberg Case study
  Bloomberg calculates pricing for 1.3 million securities.
  They parallelized their models over x86 Linux machines. In
  2005 they released a more accurate model, which was more
  computationally expensive. In 2008 the task size forced
  them to scale up the hardware to calculate prices on time,
  from 800 cores to 8000 cores (approx. 1000 servers).





                     Bloomberg Case study
  But they came up with an idea: use GPGPU computing.
  Using NVIDIA Tesla GPUs they managed to build a sufficient
  system with only 48 machines (each with a Tesla card).

  The x86 servers gather the data and prepare problems to
  be parallelised, while 90% of the work is done by the GPUs.

  According to Bloomberg they have achieved an 800%
  performance increase.




                      Scientific applications







                     Scientific applications
  The potential scientific applications of GPGPU include
  (but are not limited to):
  •Optimisation
  •Attributes processing
  •Filtering
  •Classification
  •Monte Carlo





                     Scientific applications
  Optimisation:

  •Genetic algorithms

  •Ant colonies

  •Particle Swarm Optimisation







                     Scientific applications
                                       Particle Swarm Optimisation




                                             Genetic Algorithms







                     Scientific applications
  Optimal filtering:
  •Lots of matrix calculations

  •Complexity is O(N^3) for the newer (but more accurate) filters
  or O(N^2) for the EKF or SR versions

  •In Particle Filters many particles are passed through the system’s
  model
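
  The particle-filter point can be sketched as one vectorised prediction step. The
  system model used here (x_next = 0.9 * x + noise) and the names `propagate` and
  `noise_std` are assumptions chosen only for illustration; they do not come from
  any particular filter in this talk.

```python
import numpy as np


def propagate(particles, rng, noise_std=0.1):
    """Pass every particle through a hypothetical system model.

    Assumed model: x_next = 0.9 * x + process noise.
    """
    noise = rng.normal(0.0, noise_std, size=particles.shape)
    # The same expression is applied to each particle independently,
    # so the update maps naturally onto one thread per particle.
    return 0.9 * particles + noise


rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=10_000)  # initial particle cloud
predicted = propagate(particles, rng)
print(predicted.shape)
```

  No particle depends on another one during this step, which is the data
  parallelism a GPU can exploit.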





                     Scientific applications
  Neural computing:

  •Each neuron is a simple processing mechanism

  •The net contains tens/hundreds/thousands of
  them

  •Their calculations are independent (in each layer)
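
  The independence of neurons within a layer can be illustrated with a small NumPy
  sketch. The layer size, the tanh activation and the name `layer_forward` are
  assumptions made for this example only.

```python
import numpy as np


def layer_forward(weights, bias, inputs):
    """Forward pass of one fully connected layer (tanh chosen arbitrarily)."""
    # Row j of `weights` belongs to neuron j only, so each output
    # value can be computed without looking at any other neuron.
    return np.tanh(weights @ inputs + bias)


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))   # 4 neurons, 3 inputs each
bias = np.zeros(4)
inputs = np.array([0.5, -1.0, 2.0])

outputs = layer_forward(weights, bias, inputs)

# Computing a single neuron on its own gives the same value
single = np.tanh(weights[2] @ inputs + bias[2])
print(np.isclose(outputs[2], single))
```

  Because neuron 2 can be evaluated in isolation, every neuron of the layer could
  be assigned to its own thread.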





                     Scientific applications
 Neural networks → a lot of simple processing
 elements = “That is what Tiggers (GPUs) do best”.
 Learning in most networks contains
 synchronised parts.

                    Simulation           Simulation          Simulation           Simulation



                      Learning             Learning            Learning             Learning







                                                  CNL
  We have started experiments on GPU-based
  neural networks – CUDA Neural Library
  http://code.google.com/p/cnl/
  The project was started when CUDA was the only
  technology offered by NVIDIA. It is more a test
  suite than a production-ready library.

  OpenCL is waiting …




                     Scientific applications
  Attributes processing:

  •Selection

  •Error removal

  •Gap removal

  •Transformation (e.g. normalisation)
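
  As an illustration of the transformation step, the sketch below applies min-max
  normalisation to every attribute (column) of a small data set. The function name
  `min_max_normalise` and the sample values are hypothetical.

```python
import numpy as np


def min_max_normalise(data):
    """Scale every attribute (column) to the range [0, 1]."""
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    # Use a span of 1 for constant columns to avoid division by zero
    span = np.where(hi > lo, hi - lo, 1.0)
    return (data - lo) / span


# Three samples with two attributes on very different scales
samples = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 300.0]])

normalised = min_max_normalise(samples)
print(normalised)
```

  Each element is transformed independently once the column minima and maxima are
  known, so the final division is a natural candidate for a data-parallel kernel.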





                       Scientific applications
  [Figure: rows of input arrows (→) and blocks labelled ATTRIBUTES, SIGNIFICANCE and ATTRIBUTES]




                     Scientific applications
  Monte Carlo method:
  1. Define a domain of inputs
  2. Randomly generate inputs using a probability
        distribution
  3. Perform a computation using the inputs
  4. Aggregate the individual computations
        into the final result
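
  The four steps can be sketched with a classic textbook case, estimating π by
  sampling points in the unit square. The example and the name `estimate_pi` are
  not taken from the talk; they are only one possible instance of the method.

```python
import numpy as np


def estimate_pi(n_samples, seed=0):
    """Monte Carlo estimate of pi using points in the unit square."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: the domain is the unit square; inputs are drawn uniformly
    x = rng.random(n_samples)
    y = rng.random(n_samples)
    # Step 3: an independent computation per sample
    # (does the point fall inside the quarter circle?)
    inside = x * x + y * y <= 1.0
    # Step 4: aggregate the individual results into the final result
    return 4.0 * inside.mean()


pi_estimate = estimate_pi(1_000_000)
print(pi_estimate)
```

  Step 3 is the part that parallelises well: every sample is evaluated
  independently, and only the final aggregation needs a reduction.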



                       If you want to look further …
  •GPGPU.org - http://www.gpgpu.org

  •Microsoft DirectX 11 Samples - http://microsoftpdc.com/Sessions/P09-16

  •CUDA Zone - http://www.NVIDIA.com/object/cuda_home.html

  •OpenCL – http://www.khronos.org/opencl/

  •AMD Stream – http://www.amd.com/stream

  •PyOpenCL - http://pypi.python.org/pypi/pyopencl








                                  Thank you!
                                 Any questions?




GPGPU in scientifc applications

  • 1. Escuela Técnica Superior de Ingenieros Industriales. Universidad Politécnica de Madrid – 12.02.2010 GPGPU in Scientific Applications www.SDART.co.uk www.SDART.co.uk GPGPU in Scientific Applications
  • 2. Plan of presentation: Parallel computing; GPGPU; GPGPU technologies; Scientific applications.
  • 3. Computational limits. Resources. Speed: faster hardware; optimized software; parallelism!
  • 4. Flynn’s taxonomy. Single instruction, single data: SISD. Multiple instructions, single data: MISD. Single instruction, multiple data: SIMD (GPU). Multiple instructions, multiple data: MIMD. We will focus on data parallelism (as opposed to task parallelism).
  • 5. Computational limits. Task parallelism (CPU): normal on current OSs running on multi-core processors; independent processes (limited communication). Data parallelism (GPU): the same data (vector) to process; the elements of the vector can be divided and do not depend on one another; uses multiple processing units.
  • 6. Parallel computing. [Diagram: processing units PU1 to PU5, showing a parallel section and a sequential section.]
  • 7. Parallel computing. Amdahl’s law: the speedup of a programme on multiple processing units is limited by the time taken by the programme’s sequential fraction. $\text{Speedup} = \dfrac{1}{(1 - \text{parallel}) + \dfrac{\text{parallel}}{\text{processors}}}$, where "parallel" is the parallel fraction of the programme and "processors" is the number of processing units.
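The formula on this slide can be evaluated directly. The following is a minimal Python sketch, not part of the original talk; the function name `amdahl_speedup` and its argument names are chosen here for illustration only.

```python
def amdahl_speedup(parallel, processors):
    """Speedup predicted by Amdahl's law.

    parallel   -- fraction of the programme that can run in parallel (0..1)
    processors -- number of processing units (>= 1)
    """
    if not 0.0 <= parallel <= 1.0:
        raise ValueError("parallel fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("at least one processor is required")
    # The sequential part (1 - parallel) is unaffected by adding processors.
    return 1.0 / ((1.0 - parallel) + parallel / processors)


if __name__ == "__main__":
    # Same processor counts as the axis of the speedup chart.
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

Even with 95% of the programme parallelised, the speedup stays well below the processor count, because the sequential 5% dominates as the number of processors grows.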
  • 8. Parallel computing - speedup. [Chart: speedup versus number of processors (1, 2, 4, 8, 16, 32, 64) for the ideal case and for parallel fractions of 99%, 95%, 75%, 50% and 20%.]
  • 9. Parallel computing - efficiency. [Chart: efficiency (0 to 1) versus number of processors (1, 2, 4, 8, 16, 32, 64) for the ideal case and for parallel fractions of 99%, 95%, 75%, 50% and 20%.]
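The slides do not state how efficiency is defined. A common convention, assumed here, is speedup divided by the number of processors. The sketch below uses that assumption together with Amdahl's law; the function names are illustrative.

```python
def speedup(parallel, processors):
    """Amdahl's law: speedup for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel) + parallel / processors)


def efficiency(parallel, processors):
    """Assumed definition: speedup per processing unit (1.0 is ideal)."""
    return speedup(parallel, processors) / processors


if __name__ == "__main__":
    # Parallel fractions and processor counts as on the two charts.
    for fraction in (0.99, 0.95, 0.75, 0.50, 0.20):
        row = [round(efficiency(fraction, n), 2) for n in (1, 2, 4, 8, 16, 32, 64)]
        print(fraction, row)
```

Under this definition efficiency equals 1 only when the whole programme is parallel, and it falls as processors are added to a programme with a fixed sequential part.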
  • 10. Parallel computing. There exists a bound B such that the time T_N taken on N processors satisfies T_N ≥ B, regardless of the value of N. As B depends on the sequential (non-parallel) part of the programme, we look for ways to reduce this part. As a result: speedup ↑, efficiency ↑.
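One way to make the bound B concrete is to assume Amdahl's model, in which the run time on N processors is the single-processor time scaled by the sequential fraction plus the parallel fraction divided by N. Under that assumption B is simply the time of the sequential part. The names `run_time` and `lower_bound` below are hypothetical.

```python
def run_time(t1, parallel, processors):
    """Run time on `processors` units, assuming Amdahl's model.

    t1 is the run time on a single processing unit.
    """
    return t1 * ((1.0 - parallel) + parallel / processors)


def lower_bound(t1, parallel):
    """Bound B: the time of the sequential part, reached only as N grows without limit."""
    return t1 * (1.0 - parallel)


if __name__ == "__main__":
    t1, parallel = 100.0, 0.75
    print("B =", lower_bound(t1, parallel))
    for n in (1, 4, 64, 4096):
        print(n, run_time(t1, parallel, n))
```

Reducing the sequential fraction lowers B itself, which is why the slide suggests limiting that part of the programme.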
  • 11. Parallel computing. Summarising, we could compute faster by: designing better algorithms (not always possible); using faster machines; doing tasks on several processing units (cores/graphical processors). The speed of PUs is currently bounded more by quantum physics than by engineering.
  • 12. GPGPU
  • 13. GPU history (very brief). Simple memory operations, e.g. Amiga’s Blitter (part of the Agnus chip). Programmable vertex and fragment shaders (programmable 3D pipeline). Fully programmable processing units. ???
  • 14. Transformation: Graphical Processing Unit → General Purpose Graphical Processing Unit.
  • 15. GPU Hardware - Cores. Number of processors per card:
      GeForce 9800 GTX: 128
      GeForce 9300M GS: 8
      GeForce 9800M GTX: 112
      GeForce GTX 295 (2x285): 2x240
      Tesla S1075 (GPU Computing Server): 960
      Radeon HD 4890: 80
      Mobility Radeon HD 5870: 160
      Radeon HD 5970: 2x320
  • 16. Reasons to use GPGPU: power (more information on the following slides); costs (equipment); costs (energy, relevant for supercomputers, e.g. Radeon HD 5870 has 14.47 GFLOPS/W and Core 2 Extreme QX9775 has 0.34 GFLOPS/W); memory access (approx. an order of magnitude faster for GPU).
  • 17. GFLOPS. [Chart: GFLOPS comparison from the NVIDIA CUDA Programming Guide.]
  • 18. Power comparison. Processing unit (or device) and GFLOPS (FP64):
      Intel Core 2 Duo E8600: 26.64
      Intel Core 2 Duo P7350 (mobile): 16
      Intel Core 2 Quad Q9650: 48
      Intel Core 2 Extreme QX9775: 51.20
      NVIDIA GeForce 9300M GS (mobile): 34
      NVIDIA GeForce GTX 380: 870 (~1700 FP32)
      AMD Athlon X2 7750 BE (dual-core): 17
      AMD Phenom II X4 940 (quad-core): 44
      AMD Radeon HD 5970: 928 (~2700 FP32)
      Microsoft Xbox 360: 115 CPU / 240 GPU
      Sony PS3: 204 CPU / 900 GPU
      SPARC64 VIIIfx Venus: 128
      Figures come from various sources, i.e. official materials, articles and benchmarks submitted by users. This is only an approximate comparison.
  • 19. GPU Hardware - NVIDIA.
      Tesla: S1070, C1060, C870, D870, S870.
      GeForce (desktop): GTX 295, GTX 285, GTX 285 for Mac, GTX 280, GTX 275, GTX 260, GTS 250, GTS 240, GT 240, GT 220, 210, GTS 150, GT 130, GT 120, G100, 9800 GX2, 9800 GTX+, 9800 GTX, 9800 GT, 9600 GSO, 9600 GT, 9500 GT, 9400GT, 9400 mGPU, 9300 mGPU, 8800 Ultra, 8800 GTX, 8800 GTS, 8800 GT, 8800 GS, 8600 GTS, 8600 GT, 8500 GT, 8400 GS, 8300 mGPU, 8200 mGPU, 8100 mGPU, ION.
      GeForce (mobile): GTX 285M, GTX 280M, GTX 260M, GTS 360M, GTS 350M, GTS 260M, GTS 250M, GTS 160M, GTS 150M, GT 335M, GT 330M, GT 325M, GT 240M, GT 230M, GT 130M, G210M, G110M, G105M, G102M, 310M, 305M, 9800M GTX, 9800M GTS, 9800M GT, 9800M GS, 9700M GTS, 9700M GT, 9650M GS, 9600M GT, 9600M GS, 9500M GS, 9500M G, 9400M G, 9300M GS, 9300M G, 9200M GS, 9100M G, 8800M GTS, 8700M GT, 8600M GT, 8600M GS, 8400M GT, 8400M GS.
      Quadro (desktop): FX 5800, FX 5600, FX 4800, FX 4800 for Mac, FX 4700 X2, FX 4600, FX 3800, FX 3700, FX 1800, FX 1700, FX 580, FX 570, FX 470, FX 380, FX 380 LP, FX 370, FX 370 Low Profile, CX, NVS 450, NVS 420, NVS 295, NVS 290, Plex 2200 D2, Plex 2100 D4, Plex 2100 S4, Plex 1000 Model IV.
      Quadro (mobile): FX 3800M, FX 3700M, FX 3600M, FX 2800M, FX 2700M, FX 1800M, FX 1600M, FX 880M, FX 770M, FX 570M, FX 380M, FX 370M, FX 360M, NVS 320M, NVS 160M, NVS 150M, NVS 140M, NVS 135M, NVS 130M.
  • 20. GPU Hardware - ATI.
      ATI Radeon™ HD: 5970, 5870, 5850, 5770, 5750, 4890, 4870 X2, 4870, 4850 X2, 4850, 4830, 4770, 4670, 4650, 4550, 4350.
      ATI FirePro™: V8750, V8700, V7750, V5700, V3750.
      AMD FireStream™: 9270, 9250.
      ATI Mobility Radeon™ HD: 4870, 4860, 4850X2, 4850, 4830, 4670, 4650, 4500 Series, 4300 Series.
      ATI Mobility FirePro™: M7740.
      ATI Radeon™ Embedded: E4690 Discrete GPU.
      On the slide, every model except the Radeon HD 5970, 5870, 5850, 5770 and 5750 carries a footnote marker "2"; the footnote text itself is not reproduced in the transcript.
  • 21. CPU/GPU tendency. [Diagram: three stages of the CPU/GPU relationship over time: „Add-on”, „Support”, „Blended”.]
  • 22. GPGPU Technologies
  • 23. Technologies. Let’s take a look at: AMD/ATI Stream; DirectX11 DirectCompute; NVIDIA CUDA; OpenCL. Currently CUDA and Stream support various technologies, thus we will limit their analysis to the original ones.
  • 24. AMD Stream. Released by ATI in December 2007 as Close-to-the-Metal (beta version). Uses the Brook+ language, i.e. Brook optimized for AMD hardware. Later versions were upgraded to use OpenCL and DirectCompute.
  • 25. AMD Stream. [Diagram of the software stack, top to bottom: tools, libraries and middleware (ACML, RapidMind ..) alongside the Brook+ high-level language; AMD/ATI Stream Compute Abstraction Layer (CAL); AMD/ATI GPU hardware.]
  • 26. AMD Stream – Brook+. Brook was an extension to the C language, designed at Stanford University for stream programming. Brook+ was implemented by AMD, built over AMD’s compute abstraction layer and enhanced to use AMD’s hardware.
  • 27. AMD Stream – Brook+. Kernels operate on stream elements. Streams are collections of data (vectors) to be operated on in parallel. streamRead and streamWrite are Brook+ access functions. Example comes from the AMD-Brookplus Manual:
      // Kernel: operates on stream elements
      kernel void sum(float a<>, float b<>, out float c<>)
      {
          c = a + b;
      }

      int main(int argc, char** argv)
      {
          int i, j;
          // Streams: collections of data to be operated on in parallel
          float a<10, 10>;
          float b<10, 10>;
          float c<10, 10>;
          float input_a[10][10];
          float input_b[10][10];
          float input_c[10][10];
          for(i=0; i<10; i++) {
              for(j=0; j<10; j++) {
                  input_a[i][j] = (float) i;
                  input_b[i][j] = (float) j;
              }
          }
          // Brook+ access functions
          streamRead(a, input_a);
          streamRead(b, input_b);
          sum(a, b, c);
          streamWrite(c, input_c);
          ...
      }
  • 28. AMD Stream – Tools. [Diagram of the tool chain. Labels on the slide: Integrated Stream Kernel and CPU Programme; Compiler (CPU, Stream Code Splitter; Kernel Compiler); CPU Code (C); CPU Emulation Code (C++); AMD Stream Processor Device Code (IL); Run-time (Stream Runtime; CPU Backend; GPU Backend (CAL)).]
  • 29. AMD Stream – Pros&Cons. + OpenCL was supported by AMD before NVIDIA. - Non-portable technology. - Before AMD introduced OpenCL, Stream was a follower of CUDA. - Less popular than CUDA.
  • 30. DirectX 11 DirectCompute. Microsoft’s answer to GPGPU. Still in development (drivers, learning resources, tools ….). A part of the DirectX technology that is very popular in the computer games industry.
  • 31. DirectX 11 DirectCompute. DirectX is the winner in PC gaming but not in graphical software (where OpenGL dominates). It is a step back: it causes software not to be portable (similar to the OpenGL/DirectX struggle). Limited to the Windows OS, so not a good choice for engineers or scientists.
  • 32. DirectX 11 DirectCompute.
      //--------------------------------------------------------------------------------------
      // File: BasicCompute11.hlsl
      //
      // This file contains the Compute Shader to perform array A + array B
      //
      // Copyright (c) Microsoft Corporation. All rights reserved.
      //--------------------------------------------------------------------------------------
      struct BufType
      {
          int i;
          float f;
      };

      StructuredBuffer<BufType> Buffer0 : register(t0);
      StructuredBuffer<BufType> Buffer1 : register(t1);
      RWStructuredBuffer<BufType> BufferOut : register(u0);

      [numthreads(1, 1, 1)]
      void CSMain( uint3 DTid : SV_DispatchThreadID )
      {
          BufferOut[DTid.x].i = Buffer0[DTid.x].i + Buffer1[DTid.x].i;
          BufferOut[DTid.x].f = Buffer0[DTid.x].f + Buffer1[DTid.x].f;
      }
  • 33. CUDA - History. The public announcement was made in November 2006, the beta was released in February 2007 and the full version in June 2007. Currently the CUDA Toolkit is at version 3 (beta).
  • 34. CUDA - Architecture. [Diagram. Source: NVIDIA CUDA Architecture.]
  • 35. C for CUDA. C for CUDA is an extension of the C language that allows the programmer to target portions of the source code for execution on the device. It has restrictions, e.g. __global__ and __device__ functions cannot be recursive (as is possible in standard C).
  • 36. CUDA - Example. Example comes from the NVIDIA CUDA Programming Guide:
      __global__ void vecAdd(float* A, float* B, float* C)
      {
          int i = threadIdx.x;
          C[i] = A[i] + B[i];
      }

      int main()
      {
          // Kernel invocation
          vecAdd<<<1, N>>>(A, B, C);
      }
  • 37. CUDA - Example. Example comes from the NVIDIA CUDA Programming Guide:
      __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
      {
          int i = threadIdx.x;
          int j = threadIdx.y;
          C[i][j] = A[i][j] + B[i][j];
      }

      int main()
      {
          // Kernel invocation
          dim3 dimBlock(N, N);
          matAdd<<<1, dimBlock>>>(A, B, C);
      }
  • 38. CUDA – Programming schema. [Diagram: a 3×3 grid of blocks (indices 0 to 2 in each dimension); each block contains a 3×3 grid of threads (indices 0 to 2 in each dimension).] Threads from one block can cooperate by: synchronization; sharing data (low-latency shared memory).
  • 39. CUDA - Example. Example comes from the NVIDIA CUDA Programming Guide:
      __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          int j = blockIdx.y * blockDim.y + threadIdx.y;
          if (i < N && j < N)
              C[i][j] = A[i][j] + B[i][j];
      }

      int main()
      {
          // Kernel invocation
          dim3 dimBlock(16, 16);
          dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                       (N + dimBlock.y - 1) / dimBlock.y);
          matAdd<<<dimGrid, dimBlock>>>(A, B, C);
      }
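The index arithmetic in this kernel can be checked on the CPU. The following Python sketch is an emulation for illustration only, reduced to one dimension; it is not CUDA code, and the helper names `grid_size` and `covered_indices` are hypothetical. It shows why the grid size is rounded up and why the `i < N` guard is needed.

```python
def grid_size(n, block_dim):
    """Number of blocks needed to cover n elements (integer ceiling division)."""
    return (n + block_dim - 1) // block_dim


def covered_indices(n, block_dim):
    """Emulate i = blockIdx.x * blockDim.x + threadIdx.x with the guard i < n."""
    hits = []
    for block_idx in range(grid_size(n, block_dim)):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            if i < n:  # threads in the last block may fall outside the data
                hits.append(i)
    return hits


if __name__ == "__main__":
    n, block_dim = 100, 16
    print("blocks:", grid_size(n, block_dim))
    print("every element visited once:", covered_indices(n, block_dim) == list(range(n)))
```

With 100 elements and 16 threads per block, the grid launches 112 threads in total; the guard discards the 12 surplus threads in the last block.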
  • 40. CUDA – Pros&Cons. + Uses a variant of the C language. + Very popular (a privilege of the first solution). - Single precision FP32; only cards > GTX 260 have FP64. - Available only on NVIDIA cards.
  • 41. OpenCL. OpenCL is a framework managed by the Khronos Group consortium. Its development was started by Apple. The consortium includes AMD, IBM, NVIDIA and Intel. The technical specification was publicly released on 8th December 2008.
  • 42. OpenCL. The OpenCL specification has been implemented on various OSs (Windows, MacOS, Linux) as well as on various hardware, i.e. AMD and NVIDIA. It is a portable solution (similarly to OpenGL or OpenAL). Vendors must provide Toolkits supporting OpenCL on the particular hardware and OS.
  • 43. OpenCL. This technology is based on the C language (C99 variant). Both leaders of the GPU market, i.e. NVIDIA and AMD, currently offer OpenCL support in their GPU toolkits. It obeys IEEE 754-2008 floating point requirements (CUDA has some differences).
  • 44. OpenCL. Its idea is to use all computational resources: CPUs, GPUs and other processors. It has desktop and handheld profiles. Processing elements execute code as SIMD and SPMD (Single Process/Program Multiple Data) elements.
  • 45. OpenCL - Architecture. [Diagram. Labels, top to bottom: Application; OpenCL kernels; OpenCL API; OpenCL C language; OpenCL framework; OpenCL runtime; Driver; GPU hardware.]
  • 46. OpenCL – Memory Model. Private memory: per work-item. Local memory: shared by a work-group (16 KB). Global/constant memory. Host memory: CPU memory.
  • 47. OpenCL – PyOpenCL Example.
      from numpy import *
      import opencl

      # CPU vectors allocation and initialization - Classic NumPy
      host_vec_1 = array([37,50,54,50,56,12,37,45,77,81,92,56,-22,-4], dtype='int8')
      host_vec_2 = array([35,51,54,58,55,32,-5,42,34,33,16,44, 55,14], dtype='int8')
      host_vec_out = ndarray(host_vec_1.shape, dtype='int8')

      # OpenCL C source
      opencl_source = '''
      __kernel void vector_add (__global char *c, __global char *a, __global char *b)
      {
          // Index of the elements to add
          unsigned int n = get_global_id(0);
          // Sum the nth element of vectors a and b and store in c
          c[n] = a[n] + b[n];
      }
      '''

      # Compile the code and exec the kernel
      prog = opencl.Program(opencl_source)
      prog.vector_add(host_vec_out, host_vec_1, host_vec_2)

      # Display the result
      print(''.join([chr(c) for c in host_vec_out]))
  • 48. OpenCL – PyOpenCL Example.
      # Allocate GPU memory
      gpu_vec_1 = opencl.Buffer(host_vec_1)
      gpu_vec_2 = opencl.Buffer(host_vec_2)
      gpu_vec_out = opencl.Buffer(host_vec_out)

      # Exec the kernel
      prog.vector_add(gpu_vec_out, gpu_vec_1, gpu_vec_2)

      # Fetch back results
      gpu_vec_out.read(host_vec_out)
  • 49. OpenCL – Pros&Cons. + Portable. + Access to all compute resources. + Rich set of built-in functions. + Open standard. -(?) Newer than CUDA/Stream.
  • 50. Programming languages. Programming the GPU does not force you to write all parts of an application in C/C++. C is a must for the GPU part, but the CPU part can be developed in many other languages, e.g. Python, Fortran, Java, .NET ….
  • 51. Stream/CUDA/DC/OpenCL. Stream, CUDA and DirectCompute are non-portable. OpenCL: is vendor independent (if a Toolkit is provided); lets the programmer focus on solving the general problem rather than on a specific implementation; reduces implementation costs; reduces implementation time.
  • 52. GPGPU-enabled OSs. The first GPGPU-enabled OS was MacOS X Snow Leopard (10.6). It uses GPGPU to manipulate media, e.g. in video de/compression. We should expect future OSs to manage the processing units (both GPU and CPU cores) more efficiently. Labels such as 2, 4 or X cores are more of a marketing tag.
  • 53. GPGPU performance. [Diagram: Input (CPU) → Processing (GPU) → Output (CPU); together these make up the total time of execution.]
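The diagram implies that the time spent preparing input and collecting output on the CPU counts towards the total, not only the GPU processing time. A minimal Python sketch of that bookkeeping follows; the function names and all timing figures are hypothetical, chosen only to illustrate the arithmetic.

```python
def gpu_total_time(input_s, processing_s, output_s):
    """Total execution time: CPU input stage + GPU processing + CPU output stage."""
    return input_s + processing_s + output_s


def offload_pays_off(cpu_only_s, input_s, processing_s, output_s):
    """True when the whole GPU pipeline beats doing the work on the CPU alone."""
    return gpu_total_time(input_s, processing_s, output_s) < cpu_only_s


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measurements.
    print(gpu_total_time(0.25, 0.5, 0.25))
    print(offload_pays_off(2.0, 0.25, 0.5, 0.25))   # large job: worth offloading
    print(offload_pays_off(0.75, 0.25, 0.5, 0.25))  # small job: overhead dominates
```

A fast kernel can still lose to the CPU when the input and output stages are large relative to the computation, which is one reason large data sets with substantial work per element suit the GPU best.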
  • 54. GPGPU performance. Use multiple threads. Do as much as possible in each thread; do not move work to the next thread. Watch out for costly memory operations. Use large data sets.
  • 55. GPGPU Software. According to the information given by AMD, Adobe uses GPGPU in: Acrobat Reader (when working with graphically rich high resolution PDFs); Photoshop CS4 Extended (accelerated image and 3D model previewing and manipulations); while Microsoft uses GPGPU in: PowerPoint 2007 (acceleration of slideshow playback); Silverlight.
  • 56. What about … multi-core processors? Currently CPUs have 2 (or 4) cores; the Teraflops Research Chip set a record with 80 cores. Multi-core CPUs are less powerful but more flexible. Do not expect GPUs to replace CPUs.
  • 57. AMD Fusion – CPU/GPU Hybrid. AMD plans to introduce the next generation of microprocessors (called Fusion) in 2011. [Block diagram. Labels: CPU; GPU; L2 Cache; Buffers; Memory Crossbar Switch; North Bridge; Memory Controller; HyperTransport.]
  • 58. GPGPU Speedup
  • 59. CUDA Portfolio
  • 60. CUDA Portfolio
  • 61. CUDA Portfolio
  • 62. Bloomberg case study. Bloomberg calculates pricing for 1.3 million securities. They parallelized their models over x86 Linux machines. In 2005 they released a more accurate model which was more computationally expensive. In 2008 the task size forced them to rescale the hardware to calculate prices on time: from 800 cores to 8000 cores (approx. 1000 servers).
  • 63. Bloomberg case study. But they came up with an idea: use GPGPU computing. Using NVIDIA Tesla GPUs they managed to build a sufficient system with only 48 machines (each with a Tesla card). The x86 servers gather the data and prepare the problems to be parallelised, while 90% of the work is done by the GPUs. According to Bloomberg they have achieved an 800% performance increase.
  • 64. Scientific applications
  • 65. Scientific applications. The potential scientific applications of GPGPU include (but are not limited to): optimisation; attributes processing; filtering; classification; Monte Carlo.
  • 66. Scientific applications. Optimisation: genetic algorithms; ant colonies; Particle Swarm Optimisation.
  • 67. Scientific applications. [Figures: Particle Swarm Optimisation; Genetic Algorithms.]
  • 68. Scientific applications. Optimal filtering: lots of matrix calculations; complexity is O(N^3) for new filters (which are more accurate) or O(N^2) for EKF or SR-versions; in Particle Filters many particles are passed through the system’s model.
  • 69. Scientific applications. Neural computing: each neuron is a simple processing mechanism; the net contains tens/hundreds/thousands of them; their calculations are independent (within each layer).
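The independence within a layer can be illustrated with a short CPU-side Python sketch. It is an illustration under stated assumptions: a fully connected layer with a sigmoid activation, neither of which is specified on the slide, and the function names are hypothetical.

```python
import math


def neuron_output(weights, bias, inputs):
    """One neuron: weighted sum of the inputs followed by a sigmoid (assumed activation)."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))


def layer_output(weight_rows, biases, inputs):
    """Every neuron reads the same inputs and writes only its own output.

    No neuron depends on another neuron of the same layer, so each iteration
    of this loop could be handed to a separate GPU thread.
    """
    return [neuron_output(w, b, inputs) for w, b in zip(weight_rows, biases)]


if __name__ == "__main__":
    weights = [[0.5, -0.5], [1.0, 1.0], [0.0, 0.25]]
    biases = [0.0, -1.0, 0.5]
    print(layer_output(weights, biases, [1.0, 2.0]))
```

Layers themselves still have to be evaluated in order, since each layer consumes the outputs of the previous one.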
  • 70. Scientific applications. Neural networks → a lot of simple processing elements = „That is what GPUs do best” (on the slide, "Tiggers" is replaced by "GPUs"). The learning for most networks contains synchronised parts. [Diagram: four Simulation boxes and four Learning boxes.]
  • 71. CNL. We have started experiments with GPU-based neural networks: CUDA Neural Library, http://code.google.com/p/cnl/. The project was started when CUDA was the only technology offered by NVIDIA. It is more a test suite than a production-ready library. OpenCL is waiting ….
  • 72. Scientific applications. Attributes processing: selection; error removal; gap removal; transformation (e.g. normalisation).
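As an illustration of the transformation step, the sketch below applies min-max normalisation to one attribute. The slide names normalisation only as an example and does not say which variant is meant, so the min-max form and the function name are assumptions.

```python
def min_max_normalise(values):
    """Rescale one attribute to the range [0, 1] (min-max normalisation).

    Each output depends only on its own input and on the two shared constants
    (minimum and span), so the per-element step is data parallel.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant attribute carries no spread; map it to zeros.
        return [0.0 for _ in values]
    span = highest - lowest
    return [(v - lowest) / span for v in values]


if __name__ == "__main__":
    print(min_max_normalise([2.0, 4.0, 6.0]))
```

Finding the minimum and maximum is a reduction over the whole vector, while the rescaling itself is independent for every element.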
  • 73. Scientific applications. [Diagram: ATTRIBUTES → SIGNIFICANCE → ATTRIBUTES.]
  • 74. Scientific applications. Monte Carlo method: 1. Define a domain of inputs. 2. Randomly generate inputs using a probability distribution. 3. Perform a computation using the inputs. 4. Aggregate the results of the individual computations into the final result.
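The four steps can be followed in a classic toy problem, estimating π from random points in the unit square. This Python sketch is an illustration only and does not come from the talk; the function name `estimate_pi` and the fixed seed are arbitrary choices.

```python
import random


def estimate_pi(samples, seed=0):
    """Monte Carlo estimate of pi following the four steps of the method."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    inside = 0
    for _ in range(samples):
        # Steps 1 and 2: the domain is the unit square; draw a uniform point in it.
        x, y = rng.random(), rng.random()
        # Step 3: the individual computation, i.e. is the point inside the quarter circle?
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate; the hit ratio approximates the area pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_pi(100000))
```

Each sample is independent of the others, so steps 2 and 3 map naturally onto GPU threads, leaving only the final aggregation as a shared step.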
  • 75. If you look further …
      GPGPU.org - http://www.gpgpu.org
      Microsoft DirectX 11 Samples - http://microsoftpdc.com/Sessions/P09-16
      CUDA Zone - http://www.NVIDIA.com/object/cuda_home.html
      OpenCL - http://www.khronos.org/opencl/
      AMD Stream - http://www.amd.com/stream
      PyOpenCL - http://pypi.python.org/pypi/pyopencl
  • 76. Thank you! Any questions?