High-Performance Computing Needs
Machine Learning... And Vice Versa
(was “GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks”)




Nicolas Pinto
NIPS “Big Learning” | December 16th, 2011




The Rowland Institute at Harvard
                                                                      HARVARD UNIVERSITY
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Motivation...
The Problem:
Visual Object Recognition
Why?
it seems easy, right?
44 years ago...
The Problem:
Visual Object Recognition

                fast
                accurate
                effortless
                critical to survival

                tolerant to variations!
hard?


// the world is 3D but the retina is 2D
// the curse of dimensionality

// considerable image variation
~50% of the brain is for vision!
you may have learned it...
Background
The Approach
Reverse and Forward Engineering the Brain

     REVERSE                 FORWARD
       Study                   Build
   Natural System         Artificial System
Reverse Engineering
The Ventral Visual Stream

Images by DiCarlo JJ & Cox DD
Animation by Li N
Reverse Engineering
The Ventral Visual Stream

brain = 20 petaflops ?!
The Approach
Reverse and Forward Engineering the Brain




     REVERSE                 FORWARD
       Study                       Build
    Natural System            Artificial System
Forward Engineering
The Ventral Visual Stream

all about learning ???
[Model schematic: stacked layers (L1, L2, ...), each exposing parameters such as kernel size, number of filters, threshold/saturation, normalization strength and neighborhood, and learning parameters (rate, trace, “temp. adv.”, “auto-reset”, ...)]
How are things done normally?

  Usual Formula:

  1) One grad student
  2) One Model (size limited by runtime)
  3) Performance numbers on a few
  standard test sets
  4) yay. we. rock.
  5) One Ph.D.
What do you call this?

  “This is graduate student descent”
  - David McAllester
What’s better than this?




“Conjugate graduate student descent?”
- Nicolas Poilvert
Doing things a little bit differently

  1) One grad student
  2) One → Hundreds of Thousands of BIG Models
  3) Performance numbers on a few
  standard test sets
  4) yay. we. rock.
  5) Hundreds of Thousands → One PhD ?
“If you want to have good ideas you must have many ideas.”
“Most of them will be wrong, and what you have to learn is which ones to throw away.”

Linus Pauling
(double Nobel Prize winner)
High-throughput Screening
Read-out

[Model schematic: a read-out layer on top of L1–L3; each layer parameterized by kernel size, number of filters, threshold/saturation, normalization strength and neighborhood, and learning parameters (rate, trace, “temp. adv.”, “auto-reset”, ...)]

a very inclusive, large family of brain-inspired models:
52 parameters
more than 10^25 possible unique combinations!

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
The curse of speed


  thousands of big models

  large amounts of unsupervised
  learning experience
The curse of speed
...and the blessing of massively parallel computing

  No off-the-shelf solution? DIY!
  Engineering (Hardware/SysAdmin/Software)   Science


  Leverage non-scientific high-tech
  markets and their $billions of R&D...
  Gaming: Graphics Cards (GPUs), PlayStation 3
  Web 2.0: Cloud Computing (Amazon, Google)
Build your own!
The blessing of GPUs
Computational power: DIY GPU pr0n (since 2006), Sony PlayStation 3s (since 2007)

[Chart: peak GFLOP/s over time, GPUs vs. CPUs]
speed
(in billion floating point operations per second)

Q9450 (Matlab/C) [2008]:        0.3
Q9450 (C/SSE) [2008]:           9.0
7900GTX (OpenGL/Cg) [2006]:     68.2
PS3/Cell (C/ASM) [2007]:        111.4
8800GTX (CUDA1.x) [2007]:       192.7
GTX280 (CUDA2.x) [2008]:        339.3
GTX480 (CUDA3.x, Fermi) [2010]: 974.3

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
>1000X speedup is game-changing...
High-throughput Screening
Skimming off the best models

[Histogram: performance (%) of N=2500 models, counts on the y-axis; chance and a “stupid baseline” marked near the low end]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
High-throughput Screening
Validate on other tasks

[Bar chart: V1-like baseline and state-of-the-art from the literature (“HMAX 2.1”, ~80%) vs. the top five high-throughput models (~90%)]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
High-throughput Screening
Validate on faces

[Bar chart: V1-like baseline and state-of-the-art from the literature (SIFT, GB, PHOG, PHOW, “HMAX 2.1”) vs. the top five high-throughput models and their blend]

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Human vs. Machine
8-way object categorization

chance:     12.5%
baseline:   31.3%
best model: 64%
best human: 99.1%
What does it all mean?
what have we learned?

briefly...
What does it all mean?
what have we learned?

[Pipeline: grayscale input → normalize → L1 → L2 → L3 → linear SVM (simple classifier); each layer: filter (Φ1 ... Φk) → threshold & saturate → pool → normalize]

➡ dimensionality: more filters is better
➡ learning is difficult
➡ non-linearities are important
➡ normalization is very important
  missed in previous modeling efforts
  now confirmed by LeCun et al., Poggio et al., Ng et al.
What are these models not good for?

low-level objects
backgrounds
faces
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
one more thing
Real-world apps?
testing the generality and scalability of the approach
Facebook
Really Real World Problem

enormous scale:
billions of photos
3TB+ uploaded every day
dense, collaborative face labels




collab. with Zak Stone & Todd Zickler @ Harvard
Relevance to Social Networking




                         slide courtesy of David Cox
High-throughput Screening
High-Throughput Screening
Labeled Faces in the Wild (LFW) View 1
> 30,000 large-scale models (1 to 3 layers) screened in only 3 days

[Chart: HT L3s (3 layers); top 5 models; LFW View 1 performance]

No Unsupervised Learning!

Pinto, Cox (FG 2011)          Pinto, Stone, Zickler, Cox (CVPR 2011)
Generalization
Performance on LFW View 2 (hold out)

Face Verification Performance (% correct):
V1-like:                           79.4
Kumar et al. ICCV 2009:            85.3
Wolf et al. ACCV 2009 (face.com):  86.8
Ours (HT):                         88.1

Pinto, Cox (FG 2011)
“Facebook100”
typical social network size?




collab. with Zak Stone & Todd Zickler @ Harvard
                                    Pinto, Stone, Zickler, Cox (CVPR 2011)
Auto-tagging
a network of 100 Facebook friends



                             > 86%
                             accurate
                             (w/ 90 training examples)



collab. with Zak Stone & Todd Zickler @ Harvard
                                     Pinto, Stone, Zickler, Cox (CVPR 2011)
vs face.com
comparison with a heavily-specialized commercial system

[Chart: performance (% correct) vs. training example(s) per friend;
L3 (hardware-accelerated brute-force random model)
vs. face.com (best technology around)
vs. V1-like (one layer)]

Pinto, Stone, Zickler, Cox (CVPR 2011)
Conclusion?
Hardware Matters !


       Yann LeCun’s Mac




              picture courtesy of Koray Kavukcuoglu
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Two conflicting requirements

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run
➡ need FAST

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore
➡ need FLEXIBLE

  How to optimize?
What’s the bottleneck?

3D Filter bank Convolutions!
Our answer?

Meta-programming!

Meta-programming?
What?
Meta-programming !

 Leave the grunt-programming to the
 computer (i.e. auto-tuning like ATLAS or FFTW):
 •   Dynamically compile specialized versions
     of the same kernel for different conditions
 •   Empirical run-time tuning
 •   For free: smooth away syntactic ugliness
     (unroll loops, index un-indexable registers, etc.)
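
For instance, a minimal Python sketch of “dynamically compile specialized versions” (illustrative names and a made-up toy kernel, not the chapter’s actual code): substitute compile-time constants into a kernel-source template, one variant per condition.

KERNEL_TEMPLATE = """
__global__ void box_filter_w%(filter_w)d(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + %(filter_w)d <= n) {
        float sum = 0.0f;
        // bounds are compile-time constants here, so the compiler
        // can fully unroll this loop and keep everything in registers
        for (int k = 0; k < %(filter_w)d; ++k)
            sum += in[i + k];
        out[i] = sum;
    }
}
"""

def make_kernel_source(filter_w):
    """Render one specialized variant of the same kernel."""
    return KERNEL_TEMPLATE % {"filter_w": filter_w}

for w in (3, 5, 7):                      # one source string per condition
    print(make_kernel_source(w))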
Meta-programming !

“Instrument” your solutions:
•   Block size
•   Work size
•   Loop unrolling
•   Pre-fetching
•   Spilling
•   etc.
                     ... and let the computer generate
                     and find the optimal code
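
A minimal sketch of that empirical search loop, assuming a compile_and_time() routine that builds one variant and returns a measured runtime (replaced below by a stand-in cost so the sketch runs anywhere):

import itertools

def compile_and_time(block_w, unroll):
    # stand-in cost model; in practice: render the template with these
    # parameters, compile it, and time a few kernel launches on the GPU
    return abs(block_w - 128) * 0.001 + abs(unroll - 4) * 0.01

search_space = itertools.product(
    (64, 128, 256),   # thread-block widths
    (1, 2, 4, 8),     # loop-unroll factors
)

timings = {cfg: compile_and_time(*cfg) for cfg in search_space}
best_cfg = min(timings, key=timings.get)
print("best (block_w, unroll):", best_cfg)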
How?
Always use the right tool !

Templating (here, Cheetah directives inside CUDA source):

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)

extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
        shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
      }
#end for
...
Compilation?
(with Python-based solutions)

PyCUDA/PyOpenCL (by Andreas Klöckner)

Klöckner, Pinto, Lee, Catanzaro, Ivanov, Fasih (ParCo 2011)
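
A minimal end-to-end PyCUDA sketch (assumes a CUDA-capable GPU with pycuda installed; the tiny kernel is an illustrative toy, not the convolution template above):

import numpy as np
import pycuda.autoinit                  # create a context on the default GPU
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

N_ELEMS = 1024
SCALE = 3.0

# specialize the source at run time, then hand it to nvcc via SourceModule
src = """
__global__ void scale(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < %(n)d)
        x[i] *= %(scale)ff;
}
""" % {"n": N_ELEMS, "scale": SCALE}

mod = SourceModule(src)                 # dynamic compilation at run time
scale = mod.get_function("scale")

x = gpuarray.to_gpu(np.ones(N_ELEMS, dtype=np.float32))
scale(x.gpudata, block=(128, 1, 1), grid=(N_ELEMS // 128, 1))
assert np.allclose(x.get(), SCALE)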
Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
conv_kernel_template.cu → conv_kernel_4x4x4.cu

The template (same as above) is rendered into a fully specialized, fully unrolled instantiation, excerpt:

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

  __global__ void convolve_beta_j0(float4 *input, float4 *output)
  {
    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
      shared_in[threadIdx.x+128*0][0] = input_v4.x;
      shared_in[threadIdx.x+128*0][1] = input_v4.y;
      shared_in[threadIdx.x+128*0][2] = input_v4.z;
      shared_in[threadIdx.x+128*0][3] = input_v4.w;
    }
    if((threadIdx.x+128*1)<131)
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
      shared_in[threadIdx.x+128*1][0] = input_v4.x;
      shared_in[threadIdx.x+128*1][1] = input_v4.y;
      shared_in[threadIdx.x+128*1][2] = input_v4.z;
      shared_in[threadIdx.x+128*1][3] = input_v4.w;
    }
    __syncthreads();

    // -- compute dot products
    float v, w;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;

    v = shared_in[threadIdx.x+0][0];
    w = constant[0][0][0];
    sum0 += v*w;
    w = constant[0][0][1];
    sum1 += v*w;
    w = constant[0][0][2];
    sum2 += v*w;
    w = constant[0][0][3];
    sum3 += v*w;
    v = shared_in[threadIdx.x+1][0];
    ...
conv_kernel_template.cu
(the same template as above, rendered with different parameters)

→ conv_kernel_4x4x4.cu: 20 kB of generated source
→ conv_kernel_8x8x4.cu: 64 kB of generated source
Benefits?

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:
• fine-controlled loop unrolling / jamming
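
For instance, the fully unrolled, “jammed” listing that follows can be emitted by a small generator; a minimal Python sketch (illustrative, not the chapter’s exact code):

def emit_unrolled_dots(filter_w, n_filters, d=0):
    """Emit an unrolled dot-product body: the window loop (i) is
    unrolled and the filter loop (n) is jammed into it."""
    lines = []
    for i in range(filter_w):
        lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
        for n in range(n_filters):
            lines.append("w = constant[%d][%d][%d];" % (d, i, n))
            lines.append("sum%d += v*w;" % n)
    return "\n".join(lines)

print(emit_unrolled_dots(filter_w=4, n_filters=4))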

  v = shared_in[threadIdx.x+0][0];
  w = constant[0][0][0];
  sum0 += v*w;
  w = constant[0][0][1];
  sum1 += v*w;
  w = constant[0][0][2];
  sum2 += v*w;
  w = constant[0][0][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+1][0];
  w = constant[0][1][0];
  sum0 += v*w;
  w = constant[0][1][1];
  sum1 += v*w;
  w = constant[0][1][2];
  sum2 += v*w;
  w = constant[0][1][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+2][0];
  w = constant[0][2][0];
  sum0 += v*w;
  w = constant[0][2][1];
  sum1 += v*w;
  w = constant[0][2][2];
  sum2 += v*w;
  w = constant[0][2][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+3][0];
  w = constant[0][3][0];
  sum0 += v*w;
  w = constant[0][3][1];
  sum1 += v*w;
  w = constant[0][3][2];
  sum2 += v*w;
  w = constant[0][3][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+0][1];
  w = constant[1][0][0];
  sum0 += v*w;
  w = constant[1][0][1];
  sum1 += v*w;
  w = constant[1][0][2];
  sum2 += v*w;
  w = constant[1][0][3];
  sum3 += v*w;
How about #pragma unroll ?
(why don’t you trust the compiler?)

we are not alone...

[Embedded slide: “Using GPUs for Signal Correlation” at the Murchison Widefield Array – Daniel A. Mitchell, Michael Clark, Paul La Plante, Lincoln Greenhill (IICS 2011): “don’t trust compilers”. Compare these “identical” fragments:

a += b*c + d*c + e*f + g*h;

vs.

a += b*c;
a += d*c;
a += e*f;
a += g*h;

one form reached ~770 GFLOPS, the other only ~20 GFLOPS.]
Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:
• variable-length argument lists
• index un-indexable resources (e.g. registers)
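
Both can be handled by generation; a minimal sketch with illustrative names:

def emit_kernel(n_inputs, n_sums):
    # variable-length argument list: one pointer parameter per input
    args = ", ".join("const float *in%d" % i for i in range(n_inputs))
    # "indexing" registers: mint one named scalar per accumulator,
    # something CUDA C cannot do with a runtime index
    decls = "\n    ".join("float sum%d = 0.0f;" % i for i in range(n_sums))
    return ("__global__ void kernel(%s, float *out)\n"
            "{\n    %s\n    // ...\n}" % (args, decls))

print(emit_kernel(n_inputs=3, n_sums=4))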
Explore design decision
  space more freely
Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
... too many optimizations?

bank conflicts, mixed precision, coalescing, caching, partition camping, loop unrolling, clamping, broadcasting, streams, zero-copy, ...

can’t decide?

keep them all !
Exploring design decision space more freely

  Meta-programming:

  • enables efficient learning of the GPU hardware/software

  • allows full exploitation of the GPU architecture
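
One way to “keep them all” is to make every optimization a template flag and let the tuner sweep the cross-product; a minimal sketch with made-up flag names:

import itertools

def emit_loads(use_tex, n_loads):
    # texture path vs. plain global load, selected by a template flag
    fetch = ("v%d = tex1Dfetch(tex_float4, idx+%d);" if use_tex
             else "v%d = input[idx+%d];")
    return "\n".join(fetch % (i, i) for i in range(n_loads))

# the auto-tuner times every combination and keeps the best
for use_tex, n_loads in itertools.product((False, True), (1, 4)):
    print("// use_tex=%s, n_loads=%d" % (use_tex, n_loads))
    print(emit_loads(use_tex, n_loads))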
conv_kernel_beta_template.cu
(the same template as above), compiled two ways and disassembled (e.g. with decuda):

version A:
...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B:
...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

2x faster... Why ?
Results
speed
(in billion floating point operations per second)

Q9450 (Matlab/C) [2008]:        0.3
Q9450 (C/SSE) [2008]:           9.0
7900GTX (OpenGL/Cg) [2006]:     68.2
PS3/Cell (C/ASM) [2007]:        111.4
8800GTX (CUDA1.x) [2007]:       192.7
GTX280 (CUDA2.x) [2008]:        339.3
GTX480 (CUDA3.x, Fermi) [2010]: 974.3

>1000X speedup is game-changing...

Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
    Analysis



    ➡ Different hardware ?
Table 33.2 Performance of Auto-Tuned Implementations on Two Hardware
Platforms, Including Performance Tuned on One Platform and Run on the Other

                    Optimized for:
Run on:         9400M         GTX480        Tuning Speedup
9400M           0.32s         2.52s         675%
GTX480          0.016s        0.011s        52%

Performance gains are observed for the auto-tuned meta-kernels as compared
to the default kernel, which was hand-picked to allow correct execution of
all input ranges without running up against hardware limitations.
Analysis

➡ Different input configurations

Table 33.3 Performance of Auto-Tuned Implementations on Two Input
Configurations, Including Performance Tuned for One Configuration
and Run with the Other

                    Optimized for:
Run on:         Config1       Config2       Tuning Speedup
config1         11.1ms        15.7ms        41%
config2         fails         10.8ms        not comparable

In Table 33.3 we show the effect of tuning on one input configuration and
running with the other. Again, significant speedups are obtained using
kernels tailored to a specific input configuration.
Summary

 Meta-programming:

 • can assist exploration and manual
   optimization
 • can de-clutter highly-optimized code
 • is easy and flexible with the right tools
   (e.g. Python, PyCUDA/CL, Cheetah, decuda)


 ➡ helps get drastic speed-ups !
 ➡ facilitates “auto-tuning” !
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Intelligent and fast
Auto-Tuning with Machine Learning

with James Bergstra and David Cox
Auto-tuning: two approaches

• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak

• Empirical optimization:
  - pros: auto-tuned code close to peak (dominant in specialized libraries,
    e.g. ATLAS, FFTW), easier to build
  - cons: very slow “inference” (for new inputs, etc.)
Empirical Auto-Tuning

The goal is to empirically optimize execution time given both

• the environment
  - hardware (GPU, CPU, memory, mobo, etc.)
  - software (SDK, compiler suite, etc.)

• the data (input dimensions, repetitions, etc.)
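In code, the empirical approach is just a measure-and-pick loop over the configuration space. A minimal sketch, where build_kernel and run_kernel are hypothetical helpers standing in for the template-and-compile machinery above:

import itertools
import time

def empirical_autotune(build_kernel, run_kernel, n_reps=5):
    # Candidate configuration space (illustrative axes only).
    space = itertools.product([4, 8, 16, 32],    # block width
                              [1, 2, 4],         # load iterations per thread
                              [False, True])     # e.g. an unrolling toggle
    best_cfg, best_t = None, float("inf")
    for cfg in space:
        try:
            kernel = build_kernel(cfg)           # render template + compile
        except Exception:
            continue                             # config hits a hardware limit
        t = min(timed_run(run_kernel, kernel) for _ in range(n_reps))
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t                      # slow: compiles and times all

def timed_run(run_kernel, kernel):
    t0 = time.time()
    run_kernel(kernel)                           # launch + synchronize inside
    return time.time() - t0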
Empirical Auto-Tuning with Meta-programming

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
Intelligent and fast
Auto-Tuning with Machine Learning
Auto-tuning: best of both approaches ?

• Empirically-learned model-based optimization:
  - pros: auto-tuned code close to peak*, easier to build (?),
    fast “inference” (for new inputs, hardware, etc.)
  - cons: unexplored !

* could be dominant in specialized libraries (e.g. machine learning!)
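A minimal sketch of this idea, using scikit-learn's boosted regression trees as the non-linear regressor; the feature vectors and runtimes below are placeholders for data that would be gathered once by empirical timing:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X: one feature vector per (kernel config, input shape, hardware) triple;
# y: measured runtimes in seconds. Both are placeholders here.
X = np.random.rand(500, 8)
y = np.random.rand(500)

timing_model = GradientBoostingRegressor(n_estimators=200).fit(X, y)

def predictive_autotune(candidate_features):
    # Rank candidates by *predicted* runtime: fast "inference", no timing runs.
    preds = timing_model.predict(np.asarray(candidate_features))
    return int(np.argmin(preds))                 # index of the predicted-fastest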
Fast Machine Learning-based
Runtime Auto-Tuning

Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees
James Bergstra, Nicolas Pinto, David Cox [submitted]

ABSTRACT
The rapidly evolving landscape of multicore architectures makes the
construction of efficient libraries a daunting task. A family of methods
known collectively as “auto-tuning” has emerged to address this challenge.
Two major approaches to auto-tuning are empirical and model-based: empirical
auto-tuning is a generic but slow approach that works by measuring runtimes
of candidate implementations; model-based auto-tuning predicts those runtimes
using simplified abstractions designed by hand. We show that machine learning
methods for non-linear regression can be used to estimate timing models from
data, capturing the best of both approaches. A statistically-derived model
offers the speed of a model-based approach, with the generality and
simplicity of empirical auto-tuning. We validate our approach using the
filterbank correlation kernel described in Pinto and Cox [2012], where we
find that 0.1 seconds of hill climbing on the regression model (“predictive
auto-tuning”) can achieve an average of 95% of the speed-up brought by
minutes of empirical auto-tuning. Our approach is not specific to filterbank
correlation, nor even to GPU kernel auto-tuning, and can be applied to almost
any templated-code optimization problem, spanning a wide variety of problem
types, kernel types, and platforms.
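The “0.1 seconds of hill climbing on the regression model” above amounts to a short local search minimizing predicted runtime. A minimal sketch, assuming discrete per-parameter choices and a hypothetical encode() that maps a configuration to the model's feature space:

import random

def hill_climb(timing_model, encode, choices, n_restarts=10, n_steps=50):
    # Greedy random-restart search over the learned timing model (illustrative).
    predict = lambda cfg: timing_model.predict([encode(cfg)])[0]
    best_cfg, best_pred = None, float("inf")
    for _ in range(n_restarts):
        cfg = [random.choice(axis) for axis in choices]   # random start
        for _ in range(n_steps):
            i = random.randrange(len(cfg))                # mutate one parameter
            cand = list(cfg)
            cand[i] = random.choice(choices[i])
            if predict(cand) < predict(cfg):
                cfg = cand                                # keep the improvement
        if predict(cfg) < best_pred:
            best_cfg, best_pred = cfg, predict(cfg)
    return best_cfg                                       # only this one gets run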
3D Filterbank Convolutions!
Preview: NVIDIA GTX 580 (Fermi)

[Scatter plot: GFLOP/s of predictive auto-tuning (y-axis) vs. GFLOP/s of
empirical auto-tuning (x-axis), both from 0 to 1400, with “2x faster”,
“equality”, and “2x slower” reference lines; auto-tuned and reference means
marked.]

ML-based: < 0.1 sec (old way: minutes!)
> 1.1 TERAFLOP/s !
What else ?
What else could we do for HPC ?



• Minimize failures (exascale supercomputers)
• Minimize mixed-precision errors
• Help better understand hardware features and
  their complex interactions
• Help design better architectures ?
• $$$
• etc.
It would be a win-win-win situation!




(The Office Season 2, Episode 27: Conflict Resolution)
Outline
1. HPC-aware ML
2. GPU Meta-programming
3. ML-aware HPC
Acknowledgements

DiCarlo Lab @ MIT
Jim DiCarlo
David Cox
SOC 101 Demonstration of Learning Presentation
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)

  • 1. High-Performance Computing Needs Machine Learning... And Vice Versa (was “GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks”) Nicolas Pinto, NIPS “Big Learning” | December 16th, 2011, The Rowland Institute at Harvard University
  • 2. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 3. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 11. The Problem: Visual Object Recognition fast
  • 12. The Problem: Visual Object Recognition fast accurate
  • 13. The Problem: Visual Object Recognition fast accurate effortless
  • 14. The Problem: Visual Object Recognition fast accurate effortless critical to survival
  • 15. The Problem: Visual Object Recognition fast accurate effortless critical to survival tolerant to variations!
  • 16. hard?
  • 17. hard? // the world is 3D but the retina is 2D
  • 18. hard? // the world is 3D but the retina is 2D // the curse of dimensionality
  • 19. hard? // the world is 3D but the retina is 2D // the curse of dimensionality // considerable image variation
  • 20. ~50% of the brain is for vision!
  • 21. you may have learned it...
  • 23. The Approach Reverse and Forward Engineering the Brain
  • 24. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 25. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 26. Reverse Engineering Images by DiCarlo JJ & Cox DD Animation by Li N The Ventral Visual Stream
  • 27. Reverse Engineering Images by DiCarlo JJ & Cox DD Animation by Li N The Ventral Visual Stream
  • 28. Reverse Engineering The Ventral Visual Stream. brain = 20 petaflops?!
  • 29. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 30. Forward Engineering The Ventral Visual Stream. all about learning???
  • 31. [Model parameter diagram: each layer (L1, L2, ...) exposes many hyperparameters - number of filters, kernel size, threshold/saturation, normalization strength and neighborhood, learning rate, trace, “Temp. Adv.”, “Auto-reset”, ...]
  • 32. How are things done normally?
  • 33. How are things done normally? Usual Formula:
  • 34. How are things done normally? Usual Formula: 1) One grad student
  • 35. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime)
  • 36. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets
  • 37. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  • 38. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) One Ph.D.
  • 39. What do you call this? “This is graduate student descent” - David McAllester
  • 40. What do you call this? “This is graduate student descent” - David McAllester
  • 41. What’s better than this? “Conjugate graduate student descent?” - Nicolas Poilvert
  • 42. Doing things a little bit differently
  • 43. Doing things a little bit differently 1) One grad student
  • 44. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models
  • 45. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets
  • 46. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets
  • 47. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  • 48. Doing things a little bit differently 1) One grad student 2) One → Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) Hundreds of Thousands → One PhD ?
  • 49. “If you want to have good ideas you must have many ideas. Most of them will be wrong, and what you have to learn is which ones to throw away.” - Linus Pauling (double Nobel Prize winner)
  • 50.
  • 51.
  • 52. High-throughput Screening
  • 53. [Diagram: a large family of brain-inspired models with read-out; layers L1-L3 each expose hyperparameters (number of filters, input kernel size, threshold/saturation, normalization strength and neighborhood, learning rate, trace, “Temp. Adv.”, “Auto-reset”, ...) - 52 parameters in total, more than 10^25 possible unique combinations, very inclusive!] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 54. The curse of speed
  • 55. The curse of speed thousands of big models
  • 56. The curse of speed thousands of big models large amounts of unsupervised learning experience
  • 57. The curse of speed ...and the blessing of massively parallel computing No off-the-shelf solution? DIY! Engineering (Hardware/SysAdmin/Software) Science
  • 58. The curse of speed ...and the blessing of massively parallel computing No off-the-shelf solution? DIY! Engineering (Hardware/SysAdmin/Software) Science Leverage non-scientific high-tech markets and their $billions of R&D... Gaming: Graphics Cards (GPUs), PlayStation 3 Web 2.0: Cloud Computing (Amazon, Google)
  • 59. Build your own!
  • 60. The blessing of GPUs: computational power. DIY GPU pr0n (since 2006), Sony PlayStation 3s (since 2007). [Chart: peak GFLOP/s over time, GPUs vs. CPUs]
  • 61. speed (in billion floating point operations per second):
        Q9450 (Matlab/C)        [2008]     0.3
        Q9450 (C/SSE)           [2008]     9.0
        7900GTX (OpenGL/Cg)     [2006]    68.2
        PS3/Cell (C/ASM)        [2007]   111.4
        8800GTX (CUDA1.x)       [2007]   192.7
        GTX280 (CUDA2.x)        [2008]   339.3
        GTX480 (CUDA3.x, Fermi) [2010]   974.3
        Pinto, Doukhan, DiCarlo, Cox PLoS 2009; Pinto, Cox GPU Comp. Gems 2011
  • 62. [same benchmark table as slide 61] >1000X speedup is game changing... Pinto, Doukhan, DiCarlo, Cox PLoS 2009; Pinto, Cox GPU Comp. Gems 2011
  • 63. High-throughput Screening: skimming off the best models. [Histogram: count of N=2500 models vs. performance (%), 50-100; chance baseline marked] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 64. High-throughput Screening: skimming off the best models. [Histogram: count of N=2500 models vs. performance (%), 50-100; chance baseline marked] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 65. High-throughput Screening: skimming off the best models. [Histogram: count of N=2500 models vs. performance (%), 50-100; chance baseline marked] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 66. High-throughput Screening: validate on other tasks. [Bar chart: high-throughput models ~90% vs. “HMAX 2.1” (~80%, state-of-the-art from literature) and V1-like (baseline)] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 67. High-throughput Screening: validate on other tasks. [Bar chart: high-throughput models ~90% vs. “HMAX 2.1” (~80%, state-of-the-art from literature) and V1-like (baseline)] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 68. High-throughput Screening: validate on other tasks. [Bar chart: high-throughput models ~90% vs. “HMAX 2.1” (~80%, state-of-the-art from literature) and V1-like (baseline)] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 69. High-throughput Screening: validate on other tasks. [Bar chart: high-throughput models ~90% vs. “HMAX 2.1” (~80%, state-of-the-art from literature) and V1-like (baseline)] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 70. High-throughput Screening: validate on faces. [Bar chart: high-throughput models vs. HMAX 2.1, PHOG, GB, PHOW, SIFT, blend (state-of-the-art from literature); V1-like as baseline] Pinto, Doukhan, DiCarlo, Cox PLoS 2009
  • 71. Human vs. Machine: 8-way object categorization. [Bar chart: baseline 31.3%, best model 64%, best human 99.1%; chance = 12.5%]
  • 72. What does it all mean? what have we learned? briefly...
  • 73. What does it all mean? what have we learned? [Pipeline: Grayscale Input → Normalize → L1 → L2 → L3 → Linear SVM (simple classifier); each layer: Filter (Φ1, Φ2, ... Φk) → Threshold & Saturate → Pool → Normalize] ➡ dimensionality: more filters is better
  • 74. [same pipeline] ➡ learning is difficult
  • 75. [same pipeline] ➡ non-linearities are important
  • 76. [same pipeline] ➡ normalization is very important; missed in previous modeling efforts, now confirmed by LeCun et al., Poggio et al., Ng et al.
  • 77. What are these models not good for? [image labels: objects, low levels, background, faces]
  • 78. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 80. Real-world apps? testing the generality and scalability of the approach
  • 81. Facebook: Really Real World Problem. Enormous scale: billions of photos, 3TB+ uploaded every day; dense, collaborative face labels. collab. with Zak Stone & Todd Zickler @ Harvard
  • 82. Relevance to Social Networking slide courtesy of David Cox
  • 83. Relevance to Social Networking
  • 84.
  • 85. High-throughput Screening
  • 86. High-Throughput Screening: Labeled Faces in the Wild (LFW) View 1. >30,000 large-scale models (1 to 3 layers) screened in only 3 days. [Chart: LFW view 1 performance of the top 5 HT L3 models (3 layers)] No Unsupervised Learning! Pinto, Cox (FG 2011); Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 87. Generalization: performance on LFW View 2 (hold out). [Bar chart, face verification performance (% correct): Ours (HT) 88.1, Wolf et al. ACCV 2009 86.8, Kumar et al. ICCV 2009 85.3, V1-like 79.4; face.com also shown] Pinto, Cox (FG 2011)
  • 88. “Facebook100” typical social network size? collab. with Zak Stone & Todd Zickler @ Harvard Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 89. Auto-tagging a network of 100 Facebook friends > 86% accurate (w/ 90 training examples) collab. with Zak Stone & Todd Zickler @ Harvard Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 90.
  • 91. vs. face.com: comparison with a heavily-specialized commercial system. [Chart: performance (% correct) vs. training example(s) / friend, for L3 (hardware-accelerated brute-force random model), face.com (best technology around), and V1-like (one layer)] Pinto, Stone, Zickler, Cox (CVPR 2011)
  • 93. Hardware Matters ! [photo: Yann LeCun’s Mac] picture courtesy of Koray Kavukcuoglu
  • 94. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 95. Two conflicting requirements The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 96. Two conflicting requirements. The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run [FAST]. Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 97. Two conflicting requirements. The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run [FAST]. Neural data only provides weak constraints ➡ Lots of parameters – hard to explore [FLEXIBLE]
  • 98. Two conflicting requirements. The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run [FAST]. Neural data only provides weak constraints ➡ Lots of parameters – hard to explore [FLEXIBLE]. How to optimize?
  • 99.
  • 101. 3D Filterbank Convolutions!
  • 105. Meta-programming ! Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW) • Dynamically compile specialized versions of the same kernel for different conditions • Empirical run-time tuning • For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
  • 106. Meta-programming ! “Instrument” your solutions: • Block size • Work size • Loop unrolling • Pre-fetching • Spilling • etc. ... and let the computer generate → find the optimal code
  • 107. How?
  • 108. Always use the right tool !
  • 109.
  • 110. Templating:

        texture<float4, 1, cudaReadModeElementType> tex_float4;
        __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];
        #define IMUL(a, b) __mul24(a, b)
        extern "C" {

        #for j in xrange($FILTER_H)
        __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
        {
        #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
          __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

          // -- input/output offsets
          const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
          const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
          float4 input_v4;

          // -- load input to shared memory
        #for i in xrange($LOAD_ITERATIONS)
        #if $i==($LOAD_ITERATIONS-1)
          if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
        #end if
          {
            input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
            shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
            shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
            shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
            ...
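To make the templating step concrete, here is a minimal sketch, assuming Cheetah (the engine named later in this deck, Python 2 era) and the file names from slide 114, of how such a template might be rendered into specialized CUDA source. The parameter values are hypothetical examples, not tuned settings.

    # A minimal sketch, assuming Cheetah; parameter values are hypothetical.
    from Cheetah.Template import Template

    params = {
        'FILTER_H': 4, 'FILTER_W': 4, 'FILTER_D': 4, 'N_FILTERS': 4,
        'BLOCK_W': 128, 'LOAD_ITERATIONS': 2,
    }

    # Expand the $placeholders and #for/#if directives in the template above.
    template_src = open('conv_kernel_template.cu').read()
    kernel_src = str(Template(template_src, searchList=[params]))
    open('conv_kernel_4x4x4.cu', 'w').write(kernel_src)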
  • 111. Compilation? (with Python-based solutions)
  • 112. PyCUDA/PyOpenCL (by Andreas Klöckner) Klöckner, Pinto, Lee, Catanzaro, Ivanov, Fasih (ParCo 2011)
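For readers new to PyCUDA, a minimal sketch of runtime compilation follows; the kernel, sizes, and launch geometry are illustrative, not the deck's convolution setup.

    import numpy as np
    import pycuda.autoinit                  # creates a CUDA context
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # CUDA C lives in a Python string and is compiled by nvcc at runtime.
    mod = SourceModule(r'''
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }
    ''')
    scale = mod.get_function("scale")

    x = np.random.randn(1 << 20).astype(np.float32)
    x_gpu = drv.to_device(x)
    scale(x_gpu, np.float32(2.0), np.int32(x.size),
          block=(256, 1, 1), grid=((x.size + 255) // 256, 1))
    result = drv.from_device(x_gpu, x.shape, x.dtype)

Because the source is just a string, the same mechanism accepts templated source, which is what makes the meta-programming loop cheap.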
  • 113. Basic GPU Meta-programming System. GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision [GPU Computing Gems], Pinto N, Cox DD
  • 114. [Side by side: conv_kernel_template.cu (the Cheetah-templated CUDA source shown on slide 110) and conv_kernel_4x4x4.cu (the specialized kernel it generates, with template directives expanded and the dot-product loops fully unrolled)]
  • 115. [One template, many kernels: conv_kernel_template.cu generates conv_kernel_4x4x4.cu (20 kB) and conv_kernel_8x8x4.cu (64 kB) of specialized CUDA source]
  • 118. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • loop unrolling (possibly fine-controlled)
  • 119. Smooth syntactic ugliness. Manipulations that are not easily accessible in CUDA C code: • fine-controlled loop unrolling / jamming [generated code: long runs of v = shared_in[threadIdx.x+i][0]; w = constant[0][i][j]; sumj += v*w;]
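A toy sketch of how such unrolled multiply-accumulate sequences can be emitted programmatically instead of written by hand; the bounds and array names mirror the excerpt above but are illustrative.

    # Emit an unrolled dot-product body as text; bounds are hypothetical.
    def unrolled_dots(filter_w=4, n_filters=4, d=0):
        lines = []
        for i in range(filter_w):
            lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
            for j in range(n_filters):
                lines.append("w = constant[%d][%d][%d];" % (d, i, j))
                lines.append("sum%d += v*w;" % j)
        return "\n".join(lines)

    print(unrolled_dots())  # render into the kernel source

Changing filter_w or n_filters regenerates the whole sequence, which is the “fine control” a compiler pragma cannot give.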
  • 120. How about #pragma unroll ? (why don’t you trust the compiler?)
  • 121. we are not alone... [scanned excerpt: “Using GPUs for Signal Correlation” / “Don’t trust compilers”, Daniel A. Mitchell, Lincoln Greenhill, Paul La Plante et al., The Murchison Widefield Array: comparing nominally “identical” code fragments (a += b*c; ... a += g*h;) that reach very different GFLOPS]
  • 122. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • variable-length argument lists
  • 123. Smooth syntactic ugliness. Manipulations that are not easily accessible in CUDA C code: • index un-indexable resources (e.g. regs)
  • 124. Explore design decision space more freely
  • 125. Basic GPU Meta-programming System. GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision [GPU Computing Gems], Pinto N, Cox DD
  • 126. ... too many optimizations? [word cloud: bank conflicts, coalescing, mixed precision, caching, partition camping, loop unrolling, clamping, broadcasting, zero-copy, streams, ...]
  • 127. can’t decide? keep them all !
  • 128. Exploring design decision space more freely Meta-programming: • enables efficient learning of the GPU hardware/software • allows full exploitation of the GPU architecture
  • 129. [Side by side: the same conv_kernel_beta_template.cu rendered two ways, version A and version B, disassembled into mad.rn.f32 instruction streams; one is 2x faster... Why?]
  • 131. [same benchmark table as slide 61] >1000X speedup is game changing... Pinto, Doukhan, DiCarlo, Cox PLoS 2009; Pinto, Cox GPU Comp. Gems 2011
  • 132. Analysis ➡ Different hardware?
        Table 33.2: Performance of auto-tuned implementations on two hardware platforms,
        including performance tuned on one platform and run on the other.
                          Optimized for:
        Run on:       9400M       GTX480      Tuning Speedup
        9400M         0.32s       2.52s       675%
        GTX480        0.016s      0.011s      52%
        Significant performance gains are observed for the auto-tuned meta-kernels as
        compared to the reference kernel, which was hand-picked to allow correct execution
        of all input ranges without running up against hardware limitations.
  • 133. Analysis ➡ Different input configurations?
        Table 33.3: Performance of auto-tuned implementations on two input configurations,
        including performance tuned for one configuration and run with the other.
                          Optimized for:
        Run on:       config1     config2     Tuning Speedup
        config1       11.1ms      15.7ms      41%
        config2       fails       10.8ms      not comparable
        In Table 33.3 we show the effect of tuning on one input configuration and running
        with the other; again, significant speedups are obtained using kernels tailored to
        a specific input configuration.
  • 136. Summary Meta-programming: • can assist exploration and manual optimization
  • 137. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code
  • 138. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda)
  • 139. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda) ➡ helps get drastic speed-ups !
  • 140. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter highly-optimized code • is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda) ➡ helps get drastic speed-ups ! ➡ facilitates “auto-tuning” !
  • 141. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 142. Intelligent and fast Auto-Tuning with Machine Learning with James Bergstra and David Cox
  • 143. Intelligent and fast Auto-Tuning with Machine Learning
  • 145. Auto-tuning: two approaches • Analytical model-based optimization:
  • 146. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast “inference”
  • 147. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast “inference” - cons: hard to build, domain expertise required, auto- tuned code far from peak
  • 148. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast “inference” - cons: hard to build, domain expertise required, auto- tuned code far from peak • Empirical optimization:
  • 149. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast “inference” - cons: hard to build, domain expertise required, auto- tuned code far from peak • Empirical optimization: - pros: auto-tuned code close to peak (dominant in specialized libraries e.g. ATLAS, FFTW), easier to build
  • 150. Auto-tuning: two approaches • Analytical model-based optimization: - pros: very generic (dominant in compilers), fast “inference” - cons: hard to build, domain expertise required, auto- tuned code far from peak • Empirical optimization: - pros: auto-tuned code close to peak (dominant in specialized libraries e.g. ATLAS, FFTW), easier to build - cons: very slow “inference” (for new inputs, etc.)
  • 151. Empirical Auto-Tuning The goal is to empirically optimize execution time given both • the environment - hardware (GPU, CPU, Memory, Mobo, etc.) - software (SDK, Compiler suite, etc.) • the data (input dimensions, repetitions, etc.)
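As a minimal sketch of that empirical loop (toy kernel; the design space, sizes, and timing protocol are illustrative): enumerate candidate configurations, compile each specialization, time it on the device, keep the fastest.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # Toy templated kernel; %(BLOCK_W)d is the only tunable here.
    tmpl = '''
    __global__ void saxpy(float *y, const float *x, float a, int n)
    {
        int i = blockIdx.x * %(BLOCK_W)d + threadIdx.x;
        if (i < n) y[i] += a * x[i];
    }
    '''

    n = 1 << 22
    x = drv.to_device(np.random.randn(n).astype(np.float32))
    y = drv.to_device(np.zeros(n, dtype=np.float32))

    best = None
    for block_w in (64, 128, 256, 512):          # the "design space"
        fn = SourceModule(tmpl % {'BLOCK_W': block_w}).get_function("saxpy")
        grid = ((n + block_w - 1) // block_w, 1)
        args = (y, x, np.float32(2.0), np.int32(n))
        fn(*args, block=(block_w, 1, 1), grid=grid)   # warm-up launch
        start, stop = drv.Event(), drv.Event()
        start.record()
        for _ in range(10):
            fn(*args, block=(block_w, 1, 1), grid=grid)
        stop.record()
        stop.synchronize()
        ms = start.time_till(stop) / 10.0             # mean kernel time (ms)
        if best is None or ms < best[0]:
            best = (ms, block_w)

The slowness of "inference" is plain to see: every point in the space costs a compile plus timed runs, per environment and per input configuration.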
  • 152. Empirical Auto-Tuning with Meta-programming. GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision [GPU Computing Gems], Pinto N, Cox DD
  • 153. Intelligent and fast Auto-Tuning with Machine Learning
  • 154. Auto-tuning: best of both approaches ?
  • 155. Auto-tuning: best of both approaches ? • Empirically-learned model-based optimization:
  • 156. Auto-tuning: best of both approaches ? • Empirically-learned model-based optimization: - pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)
  • 157. Auto-tuning: best of both approaches ? • Empirically-learned model-based optimization: - pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.) - cons: unexplored !
  • 158. Auto-tuning: best of both approaches ? • Empirically-learned model-based optimization: - pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.) - cons: unexplored ! * could be dominant in specialized libraries (e.g. machine learning!)
  • 159. Fast Machine Learning-based Runtime Auto-Tuning
  • 160. [Scanned paper: “Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees”, James Bergstra, Nicolas Pinto, David Cox (submitted). Abstract, in brief: empirical auto-tuning is generic but slow, since it measures runtimes of candidate implementations; model-based auto-tuning predicts those runtimes using simplified abstractions designed by hand and is fast but hard to build. Machine learning methods for non-linear regression can estimate timing models from data, capturing the best of both approaches: on the filterbank correlation kernel, 0.1 seconds of hill climbing on the regression model (“predictive auto-tuning”) achieves on average 95% of the speed-up brought by minutes of empirical auto-tuning. The approach is not specific to filterbank correlation, nor even to GPU kernel auto-tuning, and applies to almost any templated-code optimization problem.]
  • 161. 3D Filterbank Convolutions!
  • 162. Preview: NVIDIA GTX 580 (Fermi). [Scatter plot: GFLOP/s of predictive auto-tuning vs. GFLOP/s of empirical auto-tuning (0-1400), with equality, 2x faster, and 2x slower lines; auto-tuned mean and reference mean marked] ML-based: < 0.1 sec per problem; old way: minutes (for training)!
  • 163. [Same scatter plot as slide 162] > 1 TERAFLOP/s! ML-based: < 0.1 sec per problem; old way: minutes (for training)!
  • 165. What else could we do for HPC ?
  • 166. What else could we do for HPC ? • Minimize failures (exascale supercomputers)
  • 167. What else could we do for HPC ? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors
  • 168. What else could we do for HPC ? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors • Help better understand hardware features and their complex interactions
  • 169. What else could we do for HPC ? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors • Help better understand hardware features and their complex interactions • Help design better architectures ?
  • 170. What else could we do for HPC ? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors • Help better understand hardware features and their complex interactions • Help design better architectures ? • $$$
  • 171. What else could we do for HPC ? • Minimize failures (exascale supercomputers) • Minimize mixed-precision errors • Help better understand hardware features and their complex interactions • Help design better architectures ? • $$$ • etc.
  • 172. It would be a win-win-win situation! (The Office, Season 2, Episode 21: “Conflict Resolution”)
  • 173. Outline 1. HPC-aware ML 2. GPU Meta-programming 3. ML-aware HPC
  • 174. Acknowledgements. DiCarlo Lab @ MIT: Jim DiCarlo, David Cox
  • 175. Acknowledgements
  • 176. COME