Al-Khawarizmi Institute of Computer Science
 University of Engineering and Technology, Lahore, Pakistan




        LAB WORKBOOK




  Parallel Programming With CUDA
               Summer Short Course



                           August 2009
            © Copyright 2009 Al-Khawarizmi Institute of Computer Science
                 University of Engineering and Technology, Lahore.
Al-Khawarizmi Institute of Computer Science – CUDA LABWORK BOOK

                                                    University of Engineering & Technology, Lahore.




                                                           TABLE OF CONTENTS

1 INTRODUCTION
1.1         GENERAL PURPOSE GRAPHIC PROCESSING UNIT (GPGPU)
1.2         COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
1.3         MAIN OBJECTIVES
2 SETTING UP CUDA DEVELOPMENT ENVIRONMENT
2.1         VERIFYING THAT YOU HAVE A CUDA-CAPABLE SYSTEM
2.2         DOWNLOADING CUDA DEVELOPMENT COMPONENTS
2.3         INSTALLING CUDA SOFTWARE COMPONENTS
2.4         VERIFYING CUDA INSTALLATIONS
2.5         GENERAL PROCEDURE OF PROGRAMMING IN CUDA
3 PROGRAMMING IN CUDA
3.1         PROGRAMMING EXERCISE 1 (HELLO WORLD)
3.2         PROGRAMMING EXERCISE 2 (MATRIX MULTIPLICATION)
3.3         PROGRAMMING EXERCISE 3 (NUMERICAL CALCULATION OF VALUE OF PI (π))
3.4         PROGRAMMING EXERCISE 4 (PARALLEL SORT)





                     LAB WORKBOOK
This workbook is written to assist the students of the Summer Short Course on
“Parallel Programming With CUDA” at the Al-Khawarizmi Institute of Computer Science
(KICS).

This edition was prepared over a short period of two months and was finalized in July
2009. The contents of this document have been compiled from various academic
resources to expose the students to General Purpose Graphic Processing Units
(GPGPU) and Nvidia’s Compute Unified Device Architecture (CUDA) in a hands-on
fashion.

For further information, please contact KICS at UET, Lahore:

                   Telephone: (042) 992 50450
                   Fax:        (042) 992 50246
                   Email:      ghulam.mustafa@kics.edu.pk





1     Introduction
Multicore and many-core systems provide within-the-box parallel processing capabilities. Computing tasks that
once ran on supercomputers can now run on desktops, provided that we know the capabilities of the
available hardware and the software techniques that exploit those resources.

1.1       General Purpose Graphic Processing Unit (GPGPU)
The Graphic Processing Unit (GPU) found on commodity video adapters has evolved into a highly parallel,
multithreaded, many-core processor, driven largely by the gaming industry. These GPUs offer huge computational
power as well as very high memory bandwidth, both of which can be exploited by general-purpose high-performance
applications. Such programmable GPUs are also known as general purpose graphic processing units
(GPGPU; from now on we will use the term GPU). A GPU is specialized for compute-intensive, highly
parallel computation, which is exactly what graphics rendering requires. It is based on the SIMD architectural
model and is programmed using the data-parallel programming model.
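
To make the data-parallel idea concrete, here is a minimal host-only C sketch (no GPU required). The function names add_one_element and vector_add are illustrative assumptions, not part of CUDA. The same operation is applied to every element, and no iteration depends on another; on a GPU the loop would disappear and each index would be handled by its own thread.

```c
#include <stddef.h>

/* The work done for ONE data element: the body a single GPU thread would run.
 * (Hypothetical helper name, chosen for illustration.) */
static void add_one_element(const float *a, const float *b, float *out, size_t i)
{
    out[i] = a[i] + b[i];
}

/* Sequential stand-in for a data-parallel launch: the same instruction
 * sequence is applied to every index, and no iteration depends on another. */
void vector_add(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        add_one_element(a, b, out, i);
}
```

Because the iterations are independent, they can be executed in any order or all at once, which is what makes the problem a good fit for a GPU.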

1.2 Compute Unified Device Architecture (CUDA)
Nvidia Corporation, the leader in the GPU market, introduced a general-purpose parallel computing
architecture in November 2006 to harness the computing capabilities of its high-end GPUs. Compute
Unified Device Architecture (CUDA) is based on a new parallel programming model and instruction set
architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex
computational problems more efficiently than on a CPU. CUDA comes with a software environment
that allows developers to use C as a high-level programming language. Other languages and application
programming interfaces, such as FORTRAN, C++, OpenCL, and DirectX Compute, are to be supported in the future.

1.3 Main Objectives
The objective of this lab is to become familiar with parallel programming using CUDA. It shows how to
run CUDA programs on systems with and without a CUDA-capable GPU. The programming
exercises will enable you to decompose a complex problem into portions that can run in parallel
using the data-parallel programming model.

      The following activities are intended to be carried out in this lab:
      •    Verification of CUDA-capable system
      •    Installation and verification of CUDA software components
      •    Programming exercises
               o Hello world
               o Matrix Multiplication
               o Numerical calculation of the value of π
               o Parallel Sort

At the end of this lab, you should be able to:

      •    Set up a CUDA development environment
      •    Write, compile and run CUDA programs on an Nvidia device as well as on x86 multicore systems in
           device emulation mode
      •    Use the data-parallel programming model

2       Setting up CUDA development environment
To use CUDA on your system, you will need a supported version of Linux with a gcc compiler and toolchain,
the CUDA software (freely available at http://www.nvidia.com/cuda) and a CUDA-capable GPU. If you do not
have a CUDA-capable GPU, you can still use CUDA in device emulation mode. Device emulation mode is
intended for debugging and does not offer the performance of a CUDA-capable GPU, so it should not be
used for release versions or performance tuning.

After installing the CUDA software, test the CUDA build environment by compiling and running
one or more of the sample programs available in the CUDA SDK. This validates that the hardware and software are
running and communicating correctly.

2.1 Verifying that you have a CUDA-Capable System
Before installing the CUDA software components, verify that you have a
supported version of Linux with a gcc compiler and toolchain and, optionally, a CUDA-capable Nvidia GPU.

2.1.1     Verify Nvidia video adapter

Enter the following command to verify the Nvidia video adapter,

            Note: Skip this section if your system is not equipped with a CUDA-capable Nvidia GPU.

 [root@gm gm]# lspci |grep -i nVidia
01:00.0 VGA compatible controller: nVidia Corporation GeForce 9600M GT (rev a1)
[root@gm gm]#

If you do not see anything, either you do not have an Nvidia graphics adapter or you have to update the PCI
hardware database maintained by Linux, using the following command. If your network connection is working,
the output should look like the listing below.

[root@gm gm]# update-pciids
    % Total      % Received % Xferd          Average Speed   Time    Time     Time Current
                                             Dload Upload    Total   Spent    Left Speed
100     148k   100   148k        0      0    6241k      0 --:--:-- --:--:-- --:--:-- 6767k
Done.
[root@gm gm]#

2.1.2     Verify supported version of Linux

The current version (2.2) of the CUDA software components requires an x86-based Linux distribution. The following
command checks the distribution and release number of the running system,

[root@gm gm]# uname -i && cat /etc/*release
i386
Fedora release 10 (Cambridge)
Fedora release 10 (Cambridge)
Fedora release 10 (Cambridge)
[root@gm gm]#





The output shows that the running system is 32-bit (i386) Fedora version 10. On a 64-bit system running in 64-bit
mode, the typical output will be x86_64. Version 2.2 of the CUDA development tools supports only the following
distributions:
      Red Hat Enterprise Linux 4.3-4.7, 5.0-5.3
      SUSE Enterprise Desktop 10-SP2
      Open SUSE 11.0 or 11.1
      Fedora 9 or 10
      Ubuntu 8.04 or 8.10

Visit the CUDA download page frequently for updates, because support for other distributions is promised
for later releases.

2.1.3   Verifying gcc

The current CUDA development tools support gcc versions 3.4 and 4.x. You can check the version of the currently
installed gcc by issuing the following command:

[root@gm gm]# gcc --version
gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[root@gm gm]#

2.2 Downloading CUDA development components
You can get CUDA software components from http://www.nvidia.com/object/cuda_get.html.
Read the instructions given on this page carefully and download the necessary files. The Nvidia CUDA driver is not
needed if you do not have an Nvidia GPU and want to run CUDA programs in device emulation mode.

2.3 Installing CUDA software components
Uninstall any previously installed versions of the CUDA SDK and toolkit by deleting the directories
containing these packages. The default directories for the toolkit and the SDK are /usr/local/cuda/ and
~/NVIDIA_CUDA_SDK/ respectively. If you want to keep older versions, just rename these directories.

2.3.1   Installing CUDA driver


          Note: You do not have to install the CUDA driver if you do not have a CUDA-capable
          Nvidia GPU. If you try, you will see an error like "You do not appear to have an NVIDIA GPU
          supported by the 185.18.14 NVIDIA Linux graphics driver installed in this system."


You need to shut down the X server before installing the driver (the best way is to change id:5:initdefault:
to id:3:initdefault: in the /etc/inittab file and reboot). You will then get a console only (no graphics).
Secondly, you must have the source code of the running kernel (if needed), which can be installed by issuing the following
command:

[root@gm gm]# yum install kernel-devel



More information about driver installation is available at
http://us.download.nvidia.com/XFree86/Linux-x86/1.0-9755/README/index.html.

To install the driver, first exit the GUI (Ctrl+Alt+Backspace). On the command line, issue the following
commands to become superuser, turn off X Windows, install the driver and restart the GUI environment, respectively.

[root@gm gm]#       su
password:
[root@gm gm]#       /sbin/init 3
[root@gm gm]#       cd <directory containing downloaded .run files>
[root@gm gm]#       ./NVIDIA-Linux-x86-185.18.14-pkg1.run
[root@gm gm]#       /sbin/init 5

You can also issue the following command to start the GUI environment,

[root@gm gm]# startx

Make sure your internet connection is working. Follow the instructions displayed on your screen.


           Note: You can verify the driver release by running the following command,
           [root@gm gm]# /usr/bin/nvidia-settings



2.3.2   Installing CUDA toolkit

Issue the following commands,

[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./cudatoolkit_2.2_linux_32_fedora10.run
(Output omitted for the sake of brevity)

2.3.3   Setting environment variables

Issue the following commands,

[root@gm gm]# export PATH=/usr/local/cuda/bin/:$PATH
[root@gm gm]# export LD_LIBRARY_PATH=/usr/local/cuda/lib/:$LD_LIBRARY_PATH

You can make these settings permanent by adding the above commands to ~/.bashrc.

2.3.4   Configuring CUDA libraries

Add the directory /usr/local/cuda/lib on a line of its own to /etc/ld.so.conf (this file lists
library directories, not variable assignments) and issue the following command,

[root@gm gm]# ldconfig

2.3.5   Installing CUDA SDK
[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./cudasdk_2.21_linux.run
(Output omitted for the sake of brevity)

2.3.6   Installing CUDA Debugger
[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./cudagdb_2.2_linux_32_rhel5.3.run
(Output omitted for the sake of brevity)

2.4 Verifying CUDA installations
After installation, it is good practice to validate the installed packages and the environment settings.

2.4.1   Verifying CUDA environment
[root@gm gm]# env

ORBIT_SOCKETDIR=/tmp/orbit-gm
HOSTNAME=gm.kics-uet
TERM=xterm
SHELL=/bin/bash
XDG_SESSION_COOKIE=871a3cd51587ff750aec3a5049a408c9-1247661191.484531-1772071398
HISTSIZE=1000
GTK_RC_FILES=/etc/gtk/gtkrc:/home/gm/.gtkrc-1.2-gnome2
WINDOWID=31457334
QTDIR=/usr/lib/qt-3.3
QTINC=/usr/lib/qt-3.3/include
http_proxy=http://10.11.20.20:8888/
USER=gm
LD_LIBRARY_PATH=/usr/local/cuda/lib/:
LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:do=00;35:bd=40;33;01:cd=40;3
3;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:
ex=00;32:*.tar=00;31:*.tgz=00;31:*.svgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.lz
ma=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.dz=00;31:*.gz=00;31:*.bz2=00;31:*.tbz2=00;3
1:*.bz=00;31:*.tz=00;31:*.deb=00;31:*.rpm=00;31:*.jar=00;31:*.rar=00;31:*.ace=00;31:*.
zoo=00;31:*.cpio=00;31:*.7z=00;31:*.rz=00;31:*.jpg=00;35:*.jpeg=00;35:*.gif=00;35:*.bm
p=00;35:*.pbm=00;35:*.pgm=00;35:*.ppm=00;35:*.tga=00;35:*.xbm=00;35:*.xpm=00;35:*.tif=
00;35:*.tiff=00;35:*.png=00;35:*.mng=00;35:*.pcx=00;35:*.mov=00;35:*.mpg=00;35:*.mpeg=
00;35:*.m2v=00;35:*.mkv=00;35:*.ogm=00;35:*.mp4=00;35:*.m4v=00;35:*.mp4v=00;35:*.vob=0
0;35:*.qt=00;35:*.nuv=00;35:*.wmv=00;35:*.asf=00;35:*.rm=00;35:*.rmvb=00;35:*.flc=00;3
5:*.avi=00;35:*.fli=00;35:*.gl=00;35:*.dl=00;35:*.xcf=00;35:*.xwd=00;35:*.yuv=00;35:*.
svg=00;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.m
p3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:
SSH_AUTH_SOCK=/tmp/keyring-qpkd1F/ssh
GNOME_KEYRING_SOCKET=/tmp/keyring-qpkd1F/socket
USERNAME=gm
SESSION_MANAGER=local/unix:@/tmp/.ICE-unix/2747,unix/unix:/tmp/.ICE-unix/2747
DESKTOP_SESSION=gnome
PATH=/usr/local/cuda/bin/:/usr/kerberos/sbin:/usr/lib/qt-
3.3/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin
:/home/gm/bin
MAIL=/var/spool/mail/gm
PWD=/home/gm/Desktop
XMODIFIERS=@im=imsettings
GNOME_KEYRING_PID=2745
LANG=en_US.UTF-8
GDM_LANG=en_US.UTF-8
GDMSESSION=gnome
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
HOME=/root
SHLVL=3
no_proxy=localhost,127.0.0.0/8
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
LOGNAME=gm
QTLIB=/usr/lib/qt-3.3/lib
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-
E9ZoYtPeZC,guid=4328bc8674e6eb0b12d4ef874a5dcc87
LESSOPEN=|/usr/bin/lesspipe.sh %s
DISPLAY=:0.0
G_BROKEN_FILENAMES=1
XAUTHORITY=/root/.xauth5fdjoq
COLORTERM=gnome-terminal
_=/usr/bin/env
OLDPWD=/home/gm

2.4.2   Verify CUDA compiler

nvcc is the compiler driver for CUDA programs. It calls the gcc compiler for C code and the NVIDIA PTX compiler
for CUDA code. To verify it, enter one of the following commands:

[root@gm gm]# which nvcc

/usr/local/cuda/bin/nvcc

[root@gm ~]# nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2009 NVIDIA
Corporation Built on Thu_Apr__9_07:37:20_PDT_2009 Cuda compilation tools,
release 2.2, V0.2.1221
[root@gm ~]#

2.4.3   Compiling Sample Projects
[root@gm gm]# cd <SDK directory>
[root@gm gm]# make

The resulting binaries will be in NVIDIA_CUDA_SDK/bin/linux/release

2.4.4   Compiling Sample Projects in emulation mode
[root@gm gm]# cd <SDK directory>
[root@gm gm]# make emu=1

The resulting binaries will be in NVIDIA_CUDA_SDK/bin/linux/emurelease.

2.4.5   Running deviceQuery and bandwidthTest


          Note: You do not need to run deviceQuery and bandwidthTest if you do not have a
          CUDA-capable Nvidia GPU. In this case, you can try some other executable from the
          NVIDIA_CUDA_SDK/bin/linux/emurelease directory.


Run ./deviceQuery in <NVIDIA_CUDA_SDK>/bin/linux/release. On SELinux-enabled systems, you may need
to disable this security feature with the setenforce command before deviceQuery will run.

[root@gm gm]# setenforce 0

[root@gm gm]# cd <NVIDIA_CUDA_SDK>/bin/linux/release

[root@gm release]# ./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "GeForce 9600M GT"
  CUDA Capability Major revision number:                                1
  CUDA Capability Minor revision number:                                1
  Total amount of global memory:                                        536150016 bytes
  Number of multiprocessors:                                            4
  Number of cores:                                                      32
  Total amount of constant memory:                                      65536 bytes
  Total amount of shared memory per block:                              16384 bytes
  Total number of registers available per block:                        8192
  Warp size:                                                            32
  Maximum number of threads per block:                                  512
  Maximum sizes of each dimension of a block:                           512 x 512 x 64
  Maximum sizes of each dimension of a grid:                            65535 x 65535 x 1
  Maximum memory pitch:                                                 262144 bytes
  Texture alignment:                                                    256 bytes
  Clock rate:                                                           1.25 GHz
  Concurrent copy and execution:                                        Yes
  Run time limit on kernels:                                            Yes
  Integrated:                                                           No
  Support host page-locked memory mapping:                              No
  Compute mode:                                                         Default (multiple host
threads can use this device simultaneously)
Test PASSED
Press ENTER to exit...


To test that the system and the CUDA-capable device communicate correctly, run the following:

[root@gm release]# ./bandwidthTest
Running on......
       device 0:GeForce 9600M GT
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)     Bandwidth(MB/s)
  33554432         1756.6

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)    Bandwidth(MB/s)
  33554432        1168.8

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
  33554432        10762.2
&&&& Test PASSED

Press ENTER to exit...

You can now start using CUDA to build your own high-performance applications. The NVIDIA CUDA Programming Guide,
located in /usr/local/cuda/doc/, is your next step in this course.

2.5        General procedure of programming in CUDA
You can use any text editor to write the CUDA source code for your program. Save it with the .cu extension. Then
issue the following commands (assuming the environment variables are properly set, as described above):

[root@gm <dir>]# nvcc -o <executable_name> -deviceemu <program_name>.cu

[root@gm <dir>]# ./<executable_name>

Replace the contents of “< >” with actual names. “-deviceemu” compiles code that is expected to
run on the CPU only.

3       Programming in CUDA

CUDA comes with a software environment that allows developers to use C as a high-level programming
language. This section is composed of programming exercises for hands-on practice. Partitioning a problem in
terms of threads and thread blocks, and organizing the thread blocks into one or more grids, is the main
challenge faced by CUDA programmers. The following programming exercises are designed to build an understanding of this
concept of problem orchestration. Complicated details of CUDA, such as the compilation steps, generated files,
different file formats, and the precise and efficient use of the memory hierarchy, are outside the scope of
this activity. You will learn these concepts gradually. The most important thing is to tackle problem orchestration and to
get output from your simple programs.

3.1 Programming Exercise 1 (Hello World)
This is a well-known warm-up program in which every thread prints Hello World!

3.1.1       Lab Objectives

Objectives of this lab experiment include:

      1.    Learning about the general structure of a CUDA program
      2.    Learning the concept of kernel, kernel invocation, hierarchical thread grouping.
      3.    Learning the concept of threadIdx, blockIdx and blockDim.
      4.    Compiling and running CUDA code in device emulation mode

3.1.2       Setup
Make sure that the environment variables are properly set up. If they are not, first set them as described in
section 2.3.3.


/*
 * File:   Hello_World.cu
 * Author: Ghulam Mustafa
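 */

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernel: every thread prints its global ID. Calling printf() inside a
 * kernel works here because the code is compiled with -deviceemu. */
__global__ void printhello()
{
    int thid = blockIdx.x * blockDim.x + threadIdx.x;

    printf("Thread%d: Hello World!\n", thid);
}

int main()
{
    /* Launch 5 blocks of 10 threads each (50 threads in total). */
    printhello<<<5, 10>>>();
    /* Wait for the kernel to finish before the program exits. */
    cudaThreadSynchronize();
    return 0;
}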


3.1.3    Procedure
Write this simple program in any text editor and save it with the .cu extension (if a soft copy is not available). Compile
and run it as shown below. Experiment with the kernel invocation statement by changing the values of dimGrid and
dimBlock, where the general kernel invocation statement is “kernel<<<dimGrid, dimBlock>>>()”. Try to figure out
how the ID of a thread changes as dimBlock and dimGrid change.
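
As an aid for this experiment, the ID arithmetic used in the kernel can be reproduced on the host. The sketch below is plain C; the function names global_thread_id and total_threads are hypothetical helpers introduced here for illustration and are not CUDA calls. It assumes a one-dimensional grid of one-dimensional blocks, as in the Hello World listing.

```c
/* Mirrors the kernel expression blockIdx.x * blockDim.x + threadIdx.x
 * for a one-dimensional grid. The function name is hypothetical. */
int global_thread_id(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Total number of threads created by kernel<<<dim_grid, dim_block>>>(). */
int total_threads(int dim_grid, int dim_block)
{
    return dim_grid * dim_block;
}
```

For the launch printhello<<<5,10>>>(), the IDs run from 0 to 49. Changing the launch to <<<10,5>>> still creates 50 threads, but thread 3 of block 2 then has ID 13 instead of 23.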

To Compile & Run:

[root@gm gm]# nvcc -o hello -deviceemu Hello_World.cu

[root@gm gm]# ./hello

3.1.4    Conclusions

List your conclusions with respect to the objectives of this experiment.




3.1.5    Lab Instructor’s Evaluation

Lab instructor’s remark whether the student finished the work to meet the lab objectives.


3.2 Programming Exercise 2 (Matrix Multiplication)
Parallel matrix multiplication is representative of the problems that are well suited to a CUDA
implementation: each element of the resulting matrix is calculated in parallel.
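
The independence of the output elements can be seen in a sequential reference version. The following host-only C sketch is an illustration, not the workbook's GPU code; the names matmul_entry and matmul_reference are assumptions made here. It uses the same row-major layout as the exercise, and such a reference is also handy for checking the GPU result.

```c
/* Computes one entry C(row, col) of C = A x B. All matrices are row-major;
 * A is a_rows x a_cols and B is a_cols x b_cols. Names are illustrative. */
float matmul_entry(const float *A, const float *B,
                   int a_cols, int b_cols, int row, int col)
{
    float sum = 0.0f;
    for (int i = 0; i < a_cols; i++)
        sum += A[row * a_cols + i] * B[i * b_cols + col];
    return sum;
}

/* Sequential reference: on the GPU each (row, col) pair maps to one thread. */
void matmul_reference(const float *A, const float *B, float *C,
                      int a_rows, int a_cols, int b_cols)
{
    for (int row = 0; row < a_rows; row++)
        for (int col = 0; col < b_cols; col++)
            C[row * b_cols + col] = matmul_entry(A, B, a_cols, b_cols, row, col);
}
```

Since matmul_entry reads only A and B and writes a single entry of C, the two loops can be replaced by one thread per (row, col) pair without any synchronization.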

3.2.1      Lab Objectives

Objectives of this lab experiment include:

     1.    Learning the application of CUDA to linear algebra problems
     2.    Learning how to partition a large problem into subproblems
     3.    Learning how to exploit the thread and block IDs for useful calculations
     4.    Learning how to offload the parallel portion of code to the device
     5.    Learning how to use device memory
     6.    Understanding heterogeneous programming

3.2.2      Setup
Make sure that the environment variables are properly set up. If they are not, first set them as described in
section 2.3.3.


/*
 *   File:   matrix_mul.cu
 *   Author: Ghulam Mustafa
 *   Created on July 31,2009, 7:30 PM
 *   Code is adapted from Nvidia CUDA Programming Guide ver 2.2.1
 *   Matrices are stored in row-major order: M(row, col) = M.elements[row * M.c + col]
*/

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SZ 2
#define DBG 1

//Order      of    Matrix X = (Xr x Xc)
#define      Xc    (2 * BLOCK_SZ)
#define      Xr    (3 * BLOCK_SZ)
//Order      of    Matrix Y = (Yr x Yc)

#define    Yc   (2 * BLOCK_SZ)
#define    Yr   Xc
//Order    of   Matrix Z = (Zr x Zc)
#define    Zc   Yc
#define    Zr   Xr

#define N (Zr*Zc)

typedef struct Matrix{
    int r,c;
    float* elements;
} matrix;

void populate_matrix(matrix*);
void print_matrix(matrix);

__global__ void matrix_mul_krnl(matrix A, matrix B, matrix C)
{
    float C_entry = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i;
    for (i = 0; i < A.c; i++)
        C_entry += A.elements[row * A.c + i] * B.elements[i * B.c + col];
    C.elements[row * C.c + col] = C_entry;
}

int main()
{
    matrix X, Y, Z;
    X.r = Xr;    Y.r = Yr;    Z.r = Zr;
    X.c = Xc;    Y.c = Yc;    Z.c = Zc;

    if (DBG)
        printf("C(%d,%d) = A(%d,%d) x B(%d,%d)\n-----------------------\n",
               Z.r, Z.c, X.r, X.c, Y.r, Y.c);

    // Allocate the result matrix on the host
    size_t size_Z = Z.c * Z.r * sizeof(float);
    Z.elements = (float*) malloc(size_Z);

    populate_matrix(&X);
    populate_matrix(&Y);
    printf("Matrix A(%d,%d)\n", X.r, X.c);
    print_matrix(X);
    printf("Matrix B(%d,%d)\n", Y.r, Y.c);
    print_matrix(Y);

    // Allocate A in device memory and copy X into it
    matrix d_A;
    d_A.c = X.c;
    d_A.r = X.r;
    size_t size_A = X.c * X.r * sizeof(float);

    cudaMalloc((void**)&d_A.elements, size_A);
    cudaMemcpy(d_A.elements, X.elements, size_A, cudaMemcpyHostToDevice);

    // Allocate B in device memory and copy Y into it
    matrix d_B;
    d_B.c = Y.c;
    d_B.r = Y.r;
    size_t size_B = Y.c * Y.r * sizeof(float);

    cudaMalloc((void**)&d_B.elements, size_B);
    cudaMemcpy(d_B.elements, Y.elements, size_B, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    matrix d_C;
    d_C.c = Z.c;
    d_C.r = Z.r;
    size_t size_C = Z.c * Z.r * sizeof(float);
    cudaMalloc((void**)&d_C.elements, size_C);

    // One thread per entry of C; the matrix dimensions are multiples of BLOCK_SZ
    dim3 dimBlock(BLOCK_SZ, BLOCK_SZ);
    dim3 dimGrid(Y.c / dimBlock.x, X.r / dimBlock.y);
    matrix_mul_krnl<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(Z.elements, d_C.elements, size_C, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
    printf("Matrix C(%d,%d)\n", Z.r, Z.c);
    print_matrix(Z);

    // Free host memory
    free(X.elements);
    free(Y.elements);
    free(Z.elements);

    return 0;
}

void populate_matrix(matrix* mat)
{
    int dim = mat -> c * mat -> r;
    size_t sz = dim * sizeof(float);
    mat -> elements = (float*) malloc(sz);
    int i;
    for (i = 0; i < dim; i++)
        mat->elements[i] = (float)(rand()%1000);
}

void print_matrix(matrix mat)
{
    int i, n = 0, dim;
    dim = mat.c * mat.r;

    for (i = 0; i < dim; i++)
    {
        // Start a new line at the beginning of every row
        if (i == mat.c * n)
        {
            printf("\n");
            n++;
        }
        printf("%0.2f\t", mat.elements[i]);
    }
    printf("\n============================================================\n");

}



3.2.3    Procedure
Write this program in any text editor and save it with the .cu extension (if a soft copy is not available). Compile and run it as
shown below. Experiment with matrices of different sizes as well as with different block sizes. Try to
understand the concepts of threadIdx, blockDim and blockIdx and how they are used in this context.
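
One caveat when changing sizes: the listing computes dimGrid with integer division (for example Y.c / dimBlock.x), which is exact only because the matrix dimensions are defined as multiples of BLOCK_SZ. If you try dimensions that are not multiples of the block size, a common remedy is to round the block count up and let each thread check that it is inside the matrix. The host-only C sketch below illustrates that arithmetic; blocks_needed and in_bounds are hypothetical helper names, and this variation is a suggestion rather than part of the original exercise.

```c
/* Number of blocks needed to cover n elements with blocks of block_sz
 * threads, rounding up so a partial last block is still launched. */
int blocks_needed(int n, int block_sz)
{
    return (n + block_sz - 1) / block_sz;
}

/* Guard a kernel would need once the grid over-covers the matrix:
 * returns 1 if the thread at (row, col) maps to a real entry. */
int in_bounds(int row, int col, int rows, int cols)
{
    return row < rows && col < cols;
}
```

With 5 columns and a block size of 2, plain division gives 2 blocks (covering only 4 columns), while blocks_needed gives 3; the sixth thread position is then filtered out by the bounds check.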

To Compile & Run:

[root@gm gm]# nvcc -o matrix -deviceemu matrix_mul.cu

[root@gm gm]# ./matrix

3.2.4    Conclusions

List your conclusions with respect to the objectives of this experiment.




3.2.5    Lab Instructor’s Evaluation

Lab instructor’s remark whether the student finished the work to meet the lab objectives.





3.3 Programming Exercise 3 (Numerical calculation of value of pi (π))
Parallel programming is extensively used in scientific computing. Numerical calculation of the value of π involves
a loop over many small intervals. This programming exercise divides that loop among a specified number of
threads in such a way that each thread is assigned an equal portion of the specified number of intervals.

3.3.1    Lab Objectives

Objectives of this lab experiment include:

    11. Learning the application of CUDA to scientific (numerical) computing
    12. Learning how to use thread IDs in situations where the sequence of execution is important
    13. Learning how to attack loops for parallelism

3.3.2    Setup
Make sure that the environment variables are properly set up. If not, first set the environment variables as
mentioned in section 2.3.3.

/*
 * File:   pi.cu
 * Author: Ghulam Mustafa
 * Created on July 31,2009, 7:30 PM
 */

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

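typedef struct PI_data {
    int n;          // total number of intervals
    int PerThrItr;  // intervals handled by each thread
    int nThr;       // number of threads doing work
} data;

// Each thread integrates 4/(1+x^2) over its own block of PerThrItr
// consecutive intervals and stores the partial sum in s[i].
__global__ void calculate_PI(data d, float* s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
    float w = 1.0f / (float)d.n;                     // interval width, same for every thread

    if (i < d.nThr)
    {
        float sum = 0.0f;
        for (int j = i * d.PerThrItr; j < (i + 1) * d.PerThrItr; j++)
        {
            float x = w * (j + 0.5f);                // midpoint of interval j
            sum += 4.0f / (1.0f + x * x);
        }
        s[i] = sum * w;
    }
}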
// Host code
int main(int argc, char** argv)
{
    if (argc < 3)
    {
        printf("Usage: ./<progname> #intervals #threads\n");
        exit(1);
    }

    data pi_data;
    float PI = 0;

    pi_data.n    = atoi(argv[1]);
    pi_data.nThr = atoi(argv[2]);
    if (pi_data.n <= 0 || pi_data.nThr <= 0)
    {
        printf("#intervals and #threads must be positive integers\n");
        exit(1);
    }

    // Integer division: if n is not a multiple of nThr, the leftover
    // intervals are not processed, so choose n as a multiple of nThr.
    pi_data.PerThrItr = pi_data.n / pi_data.nThr;

    float *d_sum;
    float *h_sum;
    size_t size = pi_data.nThr * sizeof(float);

    // Allocate the partial-sum vector in device memory
    cudaMalloc((void**)&d_sum, size);

    // Allocate the matching vector on the host
    h_sum = (float*) malloc(size);

    int threads_per_block = 4;
    // Round up so that every requested thread is covered
    int blocks_per_grid = (pi_data.nThr + threads_per_block - 1) / threads_per_block;

    calculate_PI<<<blocks_per_grid, threads_per_block>>>(pi_data, d_sum);
    cudaMemcpy(h_sum, d_sum, size, cudaMemcpyDeviceToHost);

    // Add up the partial sums on the host
    for (int i = 0; i < pi_data.nThr; i++)
        PI += h_sum[i];

    printf("Using %d iterations, value of PI is %f\n", pi_data.n, PI);

    // Free device and host memory
    cudaFree(d_sum);
    free(h_sum);
    return 0;
}
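The host code rounds the number of blocks up, so the grid may contain a few more threads than were requested; the index guard in the kernel keeps those extra threads idle. The arithmetic can be checked on its own with the following plain C sketch, where `blocks_for` and `idle_threads` are hypothetical helper names that are not part of pi.cu.

```c
#include <assert.h>

/* Number of blocks needed so that blocks * threads_per_block >= n_threads. */
int blocks_for(int n_threads, int threads_per_block)
{
    assert(n_threads >= 0 && threads_per_block > 0);
    return (n_threads + threads_per_block - 1) / threads_per_block;
}

/* Threads launched beyond n_threads; an index guard in the kernel must skip these. */
int idle_threads(int n_threads, int threads_per_block)
{
    return blocks_for(n_threads, threads_per_block) * threads_per_block - n_threads;
}
```

With 25 requested threads and 4 threads per block, 7 blocks are launched and 3 threads do no work.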


3.3.3     Procedure

To compute π we use numerical integration: the integral below is approximated by a sum over N equal intervals.




$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+\left(\frac{i+0.5}{N}\right)^{2}} \times \frac{1}{N}$$
Because the terms of this sum are independent of one another, the sum can be split into partial sums that are
calculated in parallel, one per thread. If a soft copy is not available, write this program in any text editor and
save it with a .cu extension. Compile and run as mentioned below. Experiment with different numbers of intervals
and threads. Try to understand how threadIdx, blockDim and blockIdx are used here to give each thread its own
consecutive range of intervals.
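The decomposition into partial sums can be tried on the host before moving to the GPU. The following is a minimal sketch in plain C under stated assumptions: it evaluates each interval at its midpoint, uses double precision, and requires the interval count to be a multiple of the thread count. The names `partial_pi` and `estimate_pi` are hypothetical and do not appear in pi.cu.

```c
#include <assert.h>

/* Partial sum that one emulated thread would compute: `per_thread` consecutive
 * terms starting at interval tid * per_thread, out of n intervals overall. */
double partial_pi(int tid, int per_thread, int n)
{
    double w = 1.0 / (double)n;          /* interval width */
    double sum = 0.0;
    for (int j = tid * per_thread; j < (tid + 1) * per_thread; j++) {
        double x = w * (j + 0.5);        /* midpoint of interval j */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * w;
}

/* Host-side reduction: add up the partial sums of all emulated threads.
 * Assumes n is a multiple of n_threads so that no interval is left out. */
double estimate_pi(int n, int n_threads)
{
    assert(n > 0 && n_threads > 0 && n % n_threads == 0);
    double pi = 0.0;
    for (int tid = 0; tid < n_threads; tid++)
        pi += partial_pi(tid, n / n_threads, n);
    return pi;
}
```

Calling `estimate_pi(2300, 25)` mirrors a run with 2300 intervals and 25 threads; the result should not depend on the thread count beyond rounding error, which is a useful check on the index arithmetic.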

To Compile & Run:

[root@gm gm]# nvcc -o PI -deviceemu pi.cu

[root@gm gm]# ./PI 2300 25


3.3.4   Conclusions

List your conclusions with respect to the objectives of this experiment.




3.3.5   Lab Instructor’s Evaluation

Lab instructor’s remark whether the student finished the work to meet the lab objectives.




3.4 Programming Exercise 4 (Parallel Sort)
A sorting network is a sorting algorithm in which the sequence of comparisons is not data-dependent, which
makes sorting networks suitable for parallel implementation. Bitonic sort is one of the fastest sorting networks,
consisting of Θ(n log² n) comparators. It has a simple implementation and is very efficient when sorting a
small number of elements.
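To make "not data-dependent" concrete, the sketch below runs a bitonic network sequentially on the host in plain C. It is an illustration under assumptions, not the exercise's kernel: `swap_int`, `bitonic_sort_host` and `is_sorted` are hypothetical names, and the element count must be a power of two. The loop over `tid` plays the role of the parallel threads: within one (k, j) step every compare-exchange touches a disjoint pair, so those iterations could run concurrently.

```c
#include <assert.h>

/* Exchange two ints. */
static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Sequential emulation of a bitonic sorting network on n elements.
 * The pairs that get compared depend only on k, j and tid, never on the data. */
void bitonic_sort_host(int *v, unsigned int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);                /* n must be a power of two */
    for (unsigned int k = 2; k <= n; k *= 2) {          /* length of sequences being merged */
        for (unsigned int j = k / 2; j > 0; j /= 2) {   /* distance between compared elements */
            for (unsigned int tid = 0; tid < n; tid++) {
                unsigned int ixj = tid ^ j;             /* partner index */
                if (ixj > tid) {                        /* handle each pair once */
                    int ascending = ((tid & k) == 0);
                    if (ascending ? (v[tid] > v[ixj]) : (v[tid] < v[ixj]))
                        swap_int(&v[tid], &v[ixj]);
                }
            }
        }
    }
}

/* Returns 1 when v[0..n-1] is in non-decreasing order, else 0. */
int is_sorted(const int *v, unsigned int n)
{
    for (unsigned int i = 1; i < n; i++)
        if (v[i - 1] > v[i])
            return 0;
    return 1;
}
```

On the GPU, the inner `tid` loop disappears because each thread handles one value of `tid`, and a barrier between steps takes the place of the sequential loop boundary.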

3.4.1      Lab Objectives

Objectives of this lab experiment include:

     14.   Learning Bitonic sorting algorithm
     15.   Learning how to use __shared__ construct
     16.   Learning how to use __device__ construct
     17.   Using barrier synchronization to coordinate threads working in parallel.

3.4.2      Setup
Make sure that the environment variables are properly set up. If not, first set the environment variables as
mentioned in section 2.3.3.

/*
 *   File:         parallel_sort.cu
 *   Author:       Ghulam Mustafa
 *   Created       on July 31,2009, 7:30 PM
 *   Code is       adapted from Nvidia CUDA SDK sample projects ver 2.2.1
*/

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM             32

__device__ inline void swap(int & a, int & b)
{
      int tmp = a;
            a = b;
            b = tmp;
}

__global__ static void bitonicSort(int * values)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;
    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();
    // Parallel bitonic sort
    for (unsigned int k = 2; k <= NUM; k *= 2)
    {
        // Bitonic merge:
        for (unsigned int j = k / 2; j>0; j /= 2)
        {
            unsigned int ixj = tid ^ j;

            if (ixj > tid)
            {
                if ((tid & k) == 0)
                {
                     if (shared[tid] > shared[ixj])
                     {
                         swap(shared[tid], shared[ixj]);
                     }
                }
                else
                {
                     if (shared[tid] < shared[ixj])
                     {
                         swap(shared[tid], shared[ixj]);
                     }
                }
            }
            __syncthreads();
        }
    }
    // Write result.
    values[tid] = shared[tid];
}

int main(int argc, char** argv)
{
    int values[NUM];
    printf("\nUnsorted Array\n==============\n");
    for (int i = 0; i < NUM; i++)
    {
        values[i] = rand() % 1000;
        printf("%d\t", values[i]);
    }
    printf("\n");

    int * dvalues;
    cudaMalloc((void**)&dvalues, sizeof(int) * NUM);
    cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);

    // One block of NUM threads; the third launch parameter sizes the
    // dynamically allocated shared array used inside the kernel.
    bitonicSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);

    // Check for any launch errors
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
    cudaFree(dvalues);

    bool passed = true;
    printf("\nSorted Array\n==============\n");
    for (int i = 0; i < NUM; i++)
    {
        // Every element must be >= its predecessor
        if (i > 0 && values[i-1] > values[i])
            passed = false;
        printf("%d\t", values[i]);
    }
    printf("\n");
    printf("Test %s\n", passed ? "PASSED" : "FAILED");

    return 0;
}

3.4.3       Procedure

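If a soft copy is not available, write this program in any text editor and save it with a .cu extension. Compile and
run as mentioned below. Experiment with values of NUM and check the status of the test (last line of the
output). This bitonic network expects NUM to be a power of two, and because the kernel is launched as a single
block, NUM cannot exceed the maximum number of threads per block of your device. Try to understand how
threadIdx is used in this context, and why blockIdx and blockDim are not needed when only one block is launched.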
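As a small aid when experimenting with NUM, the following plain C sketch checks whether a candidate value is a power of two and finds the next one above it. Both helper names, `is_power_of_two` and `next_power_of_two`, are hypothetical and not part of parallel_sort.cu. Padding the input up to that length with large sentinel values is one common way to sort other sizes, mentioned here only as an option.

```c
#include <assert.h>

/* Returns 1 if n is a power of two (1, 2, 4, 8, ...), else 0. */
int is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Smallest power of two that is >= n. Assumes 1 <= n <= 2^31 so the
 * doubling below cannot overflow a 32-bit unsigned int. */
unsigned int next_power_of_two(unsigned int n)
{
    assert(n >= 1 && n <= 0x80000000u);
    unsigned int p = 1;
    while (p < n)
        p *= 2;
    return p;
}
```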
To Compile & Run:

[root@gm gm]# nvcc -o ll_sort -deviceemu parallel_sort.cu

[root@gm gm]# ./ll_sort

3.4.4       Conclusions

List your conclusions with respect to the objectives of this experiment.




3.4.5       Lab Instructor’s Evaluation

Lab instructor’s remark whether the student finished the work to meet the lab objectives.




Cuda lab manual

  • 1. Al-Khawarizmi Institute of Computer Science Univeristy of Engineering and Technology, Lahore Pakistan LAB WORKBOOK Parallel Programming With CUDA Summar Short Course August 2009 © Copyright 2009 Al-Khawarizmi Institute of Computer Science University of Engineering and Technology, Lahore.
  • 2. Al-Khawarizmi Institute of Computer Science – CUDA LABWORK BOOK University of Engineering & Technology, Lahore. TABLE OF CONTENTS 1 INTRODUCTION ........................................................................................................................................... 4 1.1 GENERAL PURPOSE GRAPHIC PROCESSING UNIT (GPGPU) ..................................................................... 4 1.2 COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA) .............................................................................. 4 1.3 MAIN OBJECTIVES ..................................................................................................................................... 4 2 SETTING UP CUDA DEVELOPMENT ENVIRONMENT ..................................................................... 5 2.1 VERIFYING THAT YOU HAVE A CUDA-CAPABLE SYSTEM .......................................................................... 5 2.2 DOWNLOADING CUDA DEVELOPMENT COMPONENTS............................................................................. 6 2.3 INSTALLING CUDA SOFTWARE COMPONENTS .......................................................................................... 6 2.4 VERIFYING CUDA INSTALLATIONS ........................................................................................................... 8 2.5 GENERAL PROCEDURE OF PROGRAMMING IN CUDA .............................................................................. 11 3 PROGRAMMING IN CUDA ........................................................................................................................ 11 3.1 PROGRAMMING EXERCISE 1 (HELLO WORLD) ......................................................................................... 11 3.2 PROGRAMMING EXERCISE 2 (MATRIX MULTIPLICATION) ........................................................................ 13 3.3 PROGRAMMING EXERCISE 3 (NUMERICAL CALCULATION OF VALUE OF PI (Π)) ........................................ 
17 3.4 PROGRAMMING EXERCISE 4 (PARALLEL SORT) ........................................................................................ 20 © Copyright 2009 Al-Khawarizmi Institute of Computer Science University of Engineering and Technology, Lahore.
  • 3. Al-Khawarizmi Institute of Computer Science – CUDA LABWORK BOOK University of Engineering & Technology, Lahore. LAB WORKBOOK This workbook is written for assisting the students of Summer Short Course on “Parallel Programming With CUDA” at Al-Khawarzmi Institute of Computer Science (KICS). This edition was prepared over a short period of two months and was finalized in July 2009. The contents of this document have been compiled from various academic resources to expose the students to Genral Purpose Graphic Processing Units (GPGPU) and Nvidia’s Compute Unified Device Architecture (CUDA) in a hands-on fashion. For Further information, please contact the KICS at UET, Lahore: Telephone: (042) 992 50450 Fax: (042) 992 50246 Email: ghulam.mustafa@kics.edu.pk © Copyright 2009 Al-Khawarizmi Institute of Computer Science University of Engineering and Technology, Lahore.
  • 4. Al-Khawarizmi Institute of Computer Science – CUDA LABWORK BOOK University of Engineering & Technology, Lahore. 1 Introduction Multicore and Many-core systems provide within-the-box parallel processing capabilities. Computing task that were run on supercomputers in past are now able to run on desktops provided that we know the capabilities of available hardware, and software techniques to exploit these available resources. 1.1 General Purpose Graphic Processing Unit (GPGPU) Graphic Processing Unit (GPU) available on commodity video adapters has evolved into highly parallel, multithreaded, Many-core processor, thanks to gaming industry. These GPUs have huge computational power as well as very high memory bandwidth that can be exploited by general purpose high performance applications. These programmable GPU are also known as general purpose graphic processing units (GPGPU, from now onward we will use term GPU). GPU is specialized for compute-intensive, highly parallel computation just like graphics rendering is done. GPU is based on SIMD architectural model and utilized by data-parallel programming model. 1.2 Compute Unified Device Architecture (CUDA) Nvidia Corporation, market leader in GPU market, introduced a general purpose parallel computing architecture in November 2006, to harness the computing capabilities of their high-end GPUs. Compute Unified Device Architecture (CUDA) is based on a new parallel programming model and instruction set architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU. CUDA comes with a software environment that allows developers to use C as a high-level programming language. Other languages such as FORTRAN, C++, OpenGL, and DirectX will be supported in the future. 1.3 Main Objectives The objective of this lab is to become familiar with parallel programming using CUDA. 
It will show you how to run CUDA programs on systems with and without a CUDA-capable GPU. The programming exercises will enable you to decompose a complex problem into portions that can run in parallel using the data-parallel programming model.
The following activities are intended to be carried out in this lab:
• Verification of a CUDA-capable system
• Installation and verification of CUDA software components
• Programming exercises
  o Hello world
  o Matrix multiplication
  o Numerical calculation of the value of π
  o Parallel sort
At the end of this lab, you should be able to:
• Set up a CUDA development environment
• Write, compile and run CUDA programs on an Nvidia device as well as on x86 multicore systems in device emulation mode
• Use the data-parallel programming model
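The idea of decomposing a problem into portions that run in parallel can be illustrated without a GPU. The following plain C sketch is only an illustration under stated assumptions: the names chunk_range and vec_add_chunk are hypothetical and do not come from the CUDA toolkit. It splits n items as evenly as possible among a number of workers and lets each worker add only its own slice of two vectors.

```c
/* Computes the half-open range [*begin, *end) of items assigned to worker p
 * when n items are split as evenly as possible among `parts` workers.
 * Returns 0 on success, -1 on invalid arguments. (Hypothetical helper.) */
int chunk_range(int n, int parts, int p, int *begin, int *end)
{
    int base, extra;
    if (n < 0 || parts <= 0 || p < 0 || p >= parts)
        return -1;
    base = n / parts;   /* every worker gets at least this many items */
    extra = n % parts;  /* the first `extra` workers get one more item */
    *begin = p * base + (p < extra ? p : extra);
    *end = *begin + base + (p < extra ? 1 : 0);
    return 0;
}

/* Data-parallel vector addition: worker p touches only its own range,
 * so all workers could run at the same time without interfering. */
void vec_add_chunk(const float *a, const float *b, float *c,
                   int n, int parts, int p)
{
    int begin, end, i;
    if (chunk_range(n, parts, p, &begin, &end) != 0)
        return;
    for (i = begin; i < end; i++)
        c[i] = a[i] + b[i];
}
```

Calling vec_add_chunk once for every p from 0 to parts - 1 produces the complete sum. Because no two workers write the same element, the calls are independent of each other, which is the property the data-parallel model relies on.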
2 Setting up CUDA development environment
To use CUDA on your system, you will need a supported version of Linux with a gcc compiler and toolchain, the CUDA software (freely available at http://www.nvidia.com/cuda) and a CUDA-capable GPU. If you do not have a CUDA-capable GPU, you can still use CUDA in device emulation mode. Device emulation mode is intended for debugging and does not offer the performance of a CUDA-capable GPU, so it should not be used for release versions or performance tuning.
After installing the CUDA software, we need to test the CUDA build environment by compiling and running one or more sample programs (available in the CUDA SDK). This validates that the hardware and software are running and communicating correctly.
2.1 Verifying that you have a CUDA-Capable System
Before installing the CUDA software components, we should verify that we have a supported version of Linux with a gcc compiler and toolchain and, optionally, a CUDA-capable Nvidia GPU.
2.1.1 Verify Nvidia video adapter
Note: Skip this section if your system is not equipped with a CUDA-capable Nvidia GPU.
Enter the following command to verify the Nvidia video adapter:
    [root@gm gm]# lspci |grep -i nVidia
    01:00.0 VGA compatible controller: nVidia Corporation GeForce 9600M GT (rev a1)
    [root@gm gm]#
If you do not see any output, either you do not have an Nvidia graphics adapter or you have to update the PCI hardware database maintained by Linux, using the following command. If your network connection is working, the output should look like this:
    [root@gm gm]# update-pciids
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  148k  100  148k    0     0  6241k      0 --:--:-- --:--:-- --:--:-- 6767k
    Done.
    [root@gm gm]#
2.1.2 Verify supported version of Linux
The current version (2.2) of the CUDA software components requires an x86-based Linux distribution. The following command reports the architecture, distribution and release number of the running system:
    [root@gm gm]# uname -i && cat /etc/*release
    i386
    Fedora release 10 (Cambridge)
    Fedora release 10 (Cambridge)
    Fedora release 10 (Cambridge)
    [root@gm gm]#
The output shows that the running system is 32-bit (i386) Fedora release 10. On a 64-bit system running in 64-bit mode, the output is typically x86_64. Version 2.2 of the CUDA development tools supports only the following distributions:
    Red Hat Enterprise Linux 4.3-4.7, 5.0-5.3
    SUSE Enterprise Desktop 10-SP2
    openSUSE 11.0 or 11.1
    Fedora 9 or 10
    Ubuntu 8.04 or 8.10
Check the CUDA download page regularly for updates, because support for other distributions is promised later.
2.1.3 Verifying gcc
The current CUDA development tools support gcc versions 3.4 and 4.x. You can check the version of the currently installed gcc by issuing the following command:
    [root@gm gm]# gcc --version
    gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)
    Copyright (C) 2008 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    [root@gm gm]#
2.2 Downloading CUDA development components
You can get the CUDA software components from http://www.nvidia.com/object/cuda_get.html. Read the instructions given on this page carefully and download the necessary files. The Nvidia CUDA driver is not necessary if you do not have an Nvidia GPU and want to run CUDA programs in device emulation mode.
2.3 Installing CUDA software components
Uninstall any previously installed versions of the CUDA SDK and toolkit by deleting the directories containing these packages. The default directories for the toolkit and SDK are /usr/local/cuda/ and ~/NVIDIA_CUDA_SDK/ respectively. If you want to keep older versions, just rename these directories.
2.3.1 Installing CUDA driver
Note: You do not have to install the CUDA driver if you do not have a CUDA-capable Nvidia GPU. If you try, you will see an error like "You do not appear to have an NVIDIA GPU supported by the 185.18.14 NVIDIA Linux graphics driver installed in this system."
You need to shut down the X server before installing the driver (the best way is to change id:5:initdefault: to id:3:initdefault: in the /etc/inittab file and reboot). You will then get a console only (no graphics). Secondly, you must have the development headers of the running kernel (if needed), which can be installed by issuing the following command:
    [root@gm gm]# yum install kernel-devel
More information about driver installation is available at http://us.download.nvidia.com/XFree86/Linux-x86/1.0-9755/README/index.html.
To install the driver, first exit the GUI (Ctrl-Alt-Backspace). At the command line, issue the following commands to become superuser, turn off X Windows, install the driver and restart the GUI environment, respectively:
    [root@gm gm]# su
    password:
    [root@gm gm]# /sbin/init 3
    [root@gm gm]# cd <directory containing downloaded .run files>
    [root@gm gm]# ./NVIDIA-Linux-x86-185.18.14-pkg1.run
    [root@gm gm]# /sbin/init 5
You can also issue the following command to start the GUI environment:
    [root@gm gm]# startx
Make sure your internet connection is working. Follow the instructions displayed on your screen.
Note: You can verify the driver release by running the following command:
    [root@gm gm]# /usr/bin/nvidia-settings
2.3.2 Installing CUDA toolkit
Issue the following commands:
    [root@gm gm]# cd <directory containing downloaded .run files>
    [root@gm gm]# ./cudatoolkit_2.2_linux_32_fedora10.run
    (Output omitted for the sake of brevity)
2.3.3 Setting environment variables
Issue the following commands:
    [root@gm gm]# export PATH=/usr/local/cuda/bin/:$PATH
    [root@gm gm]# export LD_LIBRARY_PATH=/usr/local/cuda/lib/:$LD_LIBRARY_PATH
You can make these settings permanent by adding the above commands to ~/.bashrc.
2.3.4 Configuring CUDA libraries
Add the directory /usr/local/cuda/lib on a line of its own to /etc/ld.so.conf (this file lists library directories, not variable assignments) and issue the following command:
    [root@gm gm]# ldconfig
2.3.5 Installing CUDA SDK
    [root@gm gm]# cd <directory containing downloaded .run files>
    [root@gm gm]# ./cudasdk_2.21_linux.run
    (Output omitted for the sake of brevity)
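The PATH and LD_LIBRARY_PATH settings from section 2.3.3 can also be checked from a program. The following plain C sketch is an illustration only: path_list_contains is a hypothetical helper, not part of the CUDA toolkit, and it assumes the usual colon-separated list format, ignoring a trailing slash on each entry.

```c
#include <stddef.h>
#include <string.h>

/* Length of a path entry, ignoring trailing '/' characters. */
static size_t trimmed_len(const char *s, size_t len)
{
    while (len > 1 && s[len - 1] == '/')
        len--;
    return len;
}

/* Returns 1 if the colon-separated list contains the directory dir,
 * 0 otherwise (also when either argument is NULL). */
int path_list_contains(const char *list, const char *dir)
{
    size_t dir_len;
    if (list == NULL || dir == NULL)
        return 0;
    dir_len = trimmed_len(dir, strlen(dir));
    while (*list != '\0') {
        size_t entry_len = strcspn(list, ":");  /* length up to next ':' */
        if (entry_len > 0 &&
            trimmed_len(list, entry_len) == dir_len &&
            strncmp(list, dir, dir_len) == 0)
            return 1;
        list += entry_len;
        if (*list == ':')
            list++;                             /* skip the separator */
    }
    return 0;
}
```

A small program could call path_list_contains(getenv("PATH"), "/usr/local/cuda/bin") and path_list_contains(getenv("LD_LIBRARY_PATH"), "/usr/local/cuda/lib") after including <stdlib.h>. Because getenv returns NULL for an unset variable, the helper treats NULL as "not found".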
2.3.6 Installing CUDA Debugger
    [root@gm gm]# cd <directory containing downloaded .run files>
    [root@gm gm]# ./cudagdb_2.2_linux_32_rhel5.3.run
    (Output omitted for the sake of brevity)
2.4 Verifying CUDA installations
After installation, the best practice is to validate the installed packages and environment settings.
2.4.1 Verifying CUDA environment
    [root@gm gm]# env
    ORBIT_SOCKETDIR=/tmp/orbit-gm
    HOSTNAME=gm.kics-uet
    TERM=xterm
    SHELL=/bin/bash
    XDG_SESSION_COOKIE=871a3cd51587ff750aec3a5049a408c9-1247661191.484531-1772071398
    HISTSIZE=1000
    GTK_RC_FILES=/etc/gtk/gtkrc:/home/gm/.gtkrc-1.2-gnome2
    WINDOWID=31457334
    QTDIR=/usr/lib/qt-3.3
    QTINC=/usr/lib/qt-3.3/include
    http_proxy=http://10.11.20.20:8888/
    USER=gm
    LD_LIBRARY_PATH=/usr/local/cuda/lib/:
    LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:do=00;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=00;32:*.tar=00;31:*.tgz=00;31:*.svgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.lzma=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.dz=00;31:*.gz=00;31:*.bz2=00;31:*.tbz2=00;31:*.bz=00;31:*.tz=00;31:*.deb=00;31:*.rpm=00;31:*.jar=00;31:*.rar=00;31:*.ace=00;31:*.zoo=00;31:*.cpio=00;31:*.7z=00;31:*.rz=00;31:*.jpg=00;35:*.jpeg=00;35:*.gif=00;35:*.bmp=00;35:*.pbm=00;35:*.pgm=00;35:*.ppm=00;35:*.tga=00;35:*.xbm=00;35:*.xpm=00;35:*.tif=00;35:*.tiff=00;35:*.png=00;35:*.mng=00;35:*.pcx=00;35:*.mov=00;35:*.mpg=00;35:*.mpeg=00;35:*.m2v=00;35:*.mkv=00;35:*.ogm=00;35:*.mp4=00;35:*.m4v=00;35:*.mp4v=00;35:*.vob=00;35:*.qt=00;35:*.nuv=00;35:*.wmv=00;35:*.asf=00;35:*.rm=00;35:*.rmvb=00;35:*.flc=00;35:*.avi=00;35:*.fli=00;35:*.gl=00;35:*.dl=00;35:*.xcf=00;35:*.xwd=00;35:*.yuv=00;35:*.
    svg=00;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:
    SSH_AUTH_SOCK=/tmp/keyring-qpkd1F/ssh
    GNOME_KEYRING_SOCKET=/tmp/keyring-qpkd1F/socket
    USERNAME=gm
    SESSION_MANAGER=local/unix:@/tmp/.ICE-unix/2747,unix/unix:/tmp/.ICE-unix/2747
    DESKTOP_SESSION=gnome
    PATH=/usr/local/cuda/bin/:/usr/kerberos/sbin:/usr/lib/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/gm/bin
    MAIL=/var/spool/mail/gm
    PWD=/home/gm/Desktop
    XMODIFIERS=@im=imsettings
    GNOME_KEYRING_PID=2745
    LANG=en_US.UTF-8
    GDM_LANG=en_US.UTF-8
    GDMSESSION=gnome
    SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
    HOME=/root
    SHLVL=3
    no_proxy=localhost,127.0.0.0/8
    GNOME_DESKTOP_SESSION_ID=this-is-deprecated
    LOGNAME=gm
    QTLIB=/usr/lib/qt-3.3/lib
    DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E9ZoYtPeZC,guid=4328bc8674e6eb0b12d4ef874a5dcc87
    LESSOPEN=|/usr/bin/lesspipe.sh %s
    DISPLAY=:0.0
    G_BROKEN_FILENAMES=1
    XAUTHORITY=/root/.xauth5fdjoq
    COLORTERM=gnome-terminal
    _=/usr/bin/env
    OLDPWD=/home/gm
In this output, check that PATH contains /usr/local/cuda/bin/ and that LD_LIBRARY_PATH contains /usr/local/cuda/lib/.
2.4.2 Verify CUDA compiler
nvcc is the compiler driver for CUDA programs. It calls the gcc compiler for C code and the NVIDIA PTX compiler for CUDA code. To verify the installation, enter the following commands:
    [root@gm gm]# which nvcc
    /usr/local/cuda/bin/nvcc
    [root@gm ~]# nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2009 NVIDIA Corporation
    Built on Thu_Apr__9_07:37:20_PDT_2009
    Cuda compilation tools, release 2.2, V0.2.1221
    [root@gm ~]#
2.4.3 Compiling Sample Projects
    [root@gm gm]# cd <SDK directory>
    [root@gm gm]# make
The resulting binaries will be in NVIDIA_CUDA_SDK/bin/linux/release.
2.4.4 Compiling Sample Projects in emulation mode
    [root@gm gm]# cd <SDK directory>
    [root@gm gm]# make emu=1
The resulting binaries will be in NVIDIA_CUDA_SDK/bin/linux/emurelease.
2.4.5 Running deviceQuery and bandwidthTest
Note: You do not need to run deviceQuery and bandwidthTest if you do not have a CUDA-capable Nvidia GPU. In this case, you can try some other executable from the NVIDIA_CUDA_SDK/bin/linux/emurelease directory.
Run ./deviceQuery in <NVIDIA_CUDA_SDK>/bin/linux/release. On SELinux-enabled systems, you may need to disable this security feature with the setenforce command before running deviceQuery:
    [root@gm gm]# setenforce 0
    [root@gm gm]# cd <NVIDIA_CUDA_SDK>/bin/linux/release
    [root@gm release]# ./deviceQuery
    CUDA Device Query (Runtime API) version (CUDART static linking)
    There is 1 device supporting CUDA
    Device 0: "GeForce 9600M GT"
      CUDA Capability Major revision number:         1
      CUDA Capability Minor revision number:         1
      Total amount of global memory:                 536150016 bytes
      Number of multiprocessors:                     4
      Number of cores:                               32
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       16384 bytes
      Total number of registers available per block: 8192
      Warp size:                                     32
      Maximum number of threads per block:           512
      Maximum sizes of each dimension of a block:    512 x 512 x 64
      Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
      Maximum memory pitch:                          262144 bytes
      Texture alignment:                             256 bytes
      Clock rate:                                    1.25 GHz
      Concurrent copy and execution:                 Yes
      Run time limit on kernels:                     Yes
      Integrated:                                    No
      Support host page-locked memory mapping:       No
      Compute mode:                                  Default (multiple host threads can use this device simultaneously)
    Test PASSED
    Press ENTER to exit...
To test that the system and the CUDA-capable device communicate correctly, run the following:
    [root@gm release]# ./bandwidthTest
    Running on......
          device 0:GeForce 9600M GT
    Quick Mode
    Host to Device Bandwidth for Pageable memory
    .
    Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432               1756.6
    Quick Mode
    Device to Host Bandwidth for Pageable memory
    .
    Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432               1168.8
    Quick Mode
    Device to Device Bandwidth
    .
    Transfer Size (Bytes)   Bandwidth(MB/s)
     33554432               10762.2
    &&&& Test PASSED
    Press ENTER to exit...
You can now start using CUDA to build your own high performance applications. The NVIDIA CUDA Programming Guide, located in /usr/local/cuda/doc/, is your next step in this course.
2.5 General procedure of programming in CUDA
You can use any text editor to write the CUDA source code for your program. Save it with the .cu extension. Then issue the following commands (assuming the environment variables are properly set, as described above):
    [root@gm <dir>]# nvcc -o <executable_name> -deviceemu <program_name>.cu
    [root@gm <dir>]# ./<executable_name>
Replace the placeholders enclosed in “< >” with actual names. The “-deviceemu” option compiles the code for device emulation mode, so that it runs on the CPU only.
3 Programming in CUDA
CUDA comes with a software environment that allows developers to use C as a high-level programming language. This section is composed of programming exercises for hands-on practice. The main challenge faced by CUDA programmers is partitioning a problem into threads and thread blocks, and organizing the thread blocks into one or more grids. The following programming exercises are designed to help you understand this concept of problem orchestration. Complicated details of CUDA, such as compilation steps, generated files, different file formats, and precise and efficient use of the memory hierarchy, are out of the scope of this activity. You will gradually learn these concepts. The most important thing is to tackle problem orchestration and to get output from your simple programs.
3.1 Programming Exercise 1 (Hello World)
This is a well-known warm-up program that asks every thread to print Hello World!
3.1.1 Lab Objectives
Objectives of this lab experiment include:
1. Learning about the general structure of a CUDA program
2. Learning the concepts of kernel, kernel invocation and hierarchical thread grouping
3. Learning the concepts of threadIdx, blockIdx and blockDim
4.
Compiling and running CUDA code in device emulation mode
3.1.2 Setup
Make sure that the environment variables are properly set up. If not, first set them as described in section 2.3.3.
    /*
     * File: Hello_World.cu
     * Author: Ghulam Mustafa
     */
    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void printhello()
    {
        // Global thread ID: block index times block size plus index within the block
        int thid = blockIdx.x * blockDim.x + threadIdx.x;
        // printf is usable inside the kernel because the program is compiled with -deviceemu
        printf("Thread%d: Hello World!\n", thid);
    }

    int main()
    {
        // Launch 5 blocks of 10 threads each (dimGrid = 5, dimBlock = 10)
        printhello<<<5,10>>>();
        return 0;
    }
3.1.3 Procedure
Write this simple program in any text editor and save it with the .cu extension (if a soft copy is not available). Compile and run it as shown below. Experiment with the kernel invocation statement by changing the values of dimGrid and dimBlock, where the general kernel invocation statement is “kernel<<<dimGrid, dimBlock>>>( )”. Try to figure out how the ID of a thread changes when dimBlock and dimGrid change.
To Compile & Run:
    [root@gm gm]# nvcc -o hello -deviceemu Hello_World.cu
    [root@gm gm]# ./hello
3.1.4 Conclusions
List your conclusions with respect to the objectives of this experiment.
3.1.5 Lab Instructor’s Evaluation
Lab instructor’s remark on whether the student finished the work to meet the lab objectives.
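To see how the thread ID depends on dimGrid and dimBlock before running anything under nvcc, the index arithmetic of the Hello World kernel can be reproduced in plain C. This is a sketch only: global_thread_id and emulate_launch are hypothetical names, and the sequential loops say nothing about the order in which real threads execute.

```c
/* Same arithmetic as the kernel: blockIdx.x * blockDim.x + threadIdx.x. */
int global_thread_id(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Visits every (block, thread) pair of a one-dimensional launch
 * <<<grid_dim, block_dim>>> sequentially, stores each global ID in ids[],
 * and returns how many IDs were stored (at most `capacity`). */
int emulate_launch(int grid_dim, int block_dim, int *ids, int capacity)
{
    int b, t, count = 0;
    for (b = 0; b < grid_dim; b++) {         /* plays the role of blockIdx.x */
        for (t = 0; t < block_dim; t++) {    /* plays the role of threadIdx.x */
            if (count == capacity)
                return count;
            ids[count++] = global_thread_id(b, block_dim, t);
        }
    }
    return count;
}
```

For the launch printhello<<<5,10>>>() this gives 50 IDs from 0 to 49. Thread 7 of block 3, for example, gets ID 3 * 10 + 7 = 37.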
3.2 Programming Exercise 2 (Matrix Multiplication)
Parallel matrix multiplication is representative of the problems that map well onto CUDA: each element of the resulting matrix is calculated in parallel.
3.2.1 Lab Objectives
Objectives of this lab experiment include:
1. Learning the application of CUDA to linear algebra problems
2. Learning how to partition a large problem into subproblems
3. Learning how to exploit the thread and block IDs for useful calculations
4. Learning how to offload the parallel portion of the code to the device
5. Learning how to use device memory
6. Understanding heterogeneous programming
3.2.2 Setup
Make sure that the environment variables are properly set up. If not, first set them as described in section 2.3.3.
    /*
     * File: matrix_mul.cu
     * Author: Ghulam Mustafa
     * Created on July 31,2009, 7:30 PM
     * Code is adapted from Nvidia CUDA Programming Guide ver 2.2.1
     * Matrices are stored in row-major order: M(row, col) = M.elements[row * M.c + col]
     */
    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK_SZ 2
    #define DBG 1
    //Order of Matrix X = (Xr x Xc)
    #define Xc (2 * BLOCK_SZ)
    #define Xr (3 * BLOCK_SZ)
    //Order of Matrix Y = (Yr x Yc)
    #define Yc (2 * BLOCK_SZ)
    #define Yr Xc
    //Order of Matrix Z = (Zr x Zc)
    #define Zc Yc
    #define Zr Xr
    #define N (Zr*Zc)

    typedef struct Matrix{
        int r,c;
        float* elements;
    } matrix;

    void populate_matrix(matrix*);
    void print_matrix(matrix);

    // Each thread computes one entry of C = A x B
    __global__ void matrix_mul_krnl(matrix A, matrix B, matrix C)
    {
        float C_entry = 0;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int i;
        for (i = 0; i < A.c; i++)
            C_entry += A.elements[row * A.c + i] * B.elements[i * B.c + col];
        C.elements[row * C.c + col] = C_entry;
    }

    int main()
    {
        matrix X, Y, Z;
        X.r = Xr; Y.r = Yr; Z.r = Zr;
        X.c = Xc; Y.c = Yc; Z.c = Zc;
        if(DBG) printf("C(%d,%d) = A(%d,%d) x B(%d,%d)\n-----------------------\n",
                       Z.r,Z.c, X.r,X.c, Y.r,Y.c);
        size_t size_Z = Z.c * Z.r * sizeof(float);
        Z.elements = (float*) malloc(size_Z);
        populate_matrix(&X);
        populate_matrix(&Y);
        printf("Matrix A (%d,%d)\n",X.r,X.c);
        print_matrix(X);
        printf("Matrix B(%d,%d)\n",Y.r,Y.c);
        print_matrix(Y);

        // Allocate A in device memory and copy X into it
        matrix d_A;
        d_A.c = X.c;
        d_A.r = X.r;
        size_t size_A = X.c * X.r * sizeof(float);
        cudaMalloc((void**)&d_A.elements, size_A);
        cudaMemcpy(d_A.elements, X.elements, size_A, cudaMemcpyHostToDevice);

        // Allocate B in device memory and copy Y into it
        matrix d_B;
        d_B.c = Y.c;
        d_B.r = Y.r;
        size_t size_B = Y.c * Y.r * sizeof(float);
        cudaMalloc((void**)&d_B.elements, size_B);
        cudaMemcpy(d_B.elements, Y.elements, size_B, cudaMemcpyHostToDevice);

        // Allocate C in device memory
        matrix d_C;
        d_C.c = Z.c;
        d_C.r = Z.r;
        size_t size_C = Z.c * Z.r * sizeof(float);
        cudaMalloc((void**)&d_C.elements, size_C);

        // One thread per entry of C; both dimensions must be multiples of BLOCK_SZ
        dim3 dimBlock(BLOCK_SZ, BLOCK_SZ);
        dim3 dimGrid(Y.c / dimBlock.x, X.r / dimBlock.y);
        matrix_mul_krnl<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

        // Read C from device memory
        cudaMemcpy(Z.elements, d_C.elements, size_C, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A.elements);
        cudaFree(d_B.elements);
        cudaFree(d_C.elements);
        printf("Matrix C(%d,%d)\n",Z.r,Z.c);
        print_matrix(Z);

        // Free host memory
        free(X.elements);
        free(Y.elements);
        free(Z.elements);
        return 0;
    }

    // Allocates host memory for the matrix and fills it with random values below 1000
    void populate_matrix(matrix* mat)
    {
        int dim = mat->c * mat->r;
        size_t sz = dim * sizeof(float);
        mat->elements = (float*) malloc(sz);
        int i;
        for (i = 0; i < dim; i++)
            mat->elements[i] = (float)(rand()%1000);
    }

    // Prints the matrix one row per line
    void print_matrix(matrix mat)
    {
        int i, n = 0, dim;
        dim = mat.c * mat.r;
        for (i = 0; i < dim; i++)
        {
            if (i == mat.c * n)
            {
                printf("\n");
                n++;
            }
            printf("%0.2f\t", mat.elements[i]);
        }
        printf("\n============================================================\n");
    }
3.2.3 Procedure
Write this program in any text editor and save it with the .cu extension (if a soft copy is not available). Compile and run it as shown below. Experiment with matrices of different sizes as well as with different block sizes. Keep the matrix dimensions multiples of BLOCK_SZ, because the grid size is computed by integer division and the kernel does not check array bounds. Try to understand the concept of threadIdx, blockDim and blockIdx and how they are used in this context.
To Compile & Run:
    [root@gm gm]# nvcc -o matrix -deviceemu matrix_mul.cu
    [root@gm gm]# ./matrix
3.2.4 Conclusions
List your conclusions with respect to the objectives of this experiment.
3.2.5 Lab Instructor’s Evaluation
Lab instructor’s remark on whether the student finished the work to meet the lab objectives.
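The way the kernel derives row and col from blockIdx, blockDim and threadIdx can be traced on the CPU. The following plain C sketch is an illustration under stated assumptions: matmul_emulated is a hypothetical function that visits the same grid of blocks and threads sequentially, and it explicitly rejects sizes that are not multiples of the block size, a case the kernel listing does not handle.

```c
/* Emulates the launch matrix_mul_krnl<<<dimGrid, dimBlock>>> on the CPU.
 * A is a_rows x a_cols, B is a_cols x b_cols, C is a_rows x b_cols,
 * all stored in row-major order.
 * Returns 0 on success, -1 if the sizes are not multiples of block_sz. */
int matmul_emulated(const float *A, const float *B, float *C,
                    int a_rows, int a_cols, int b_cols, int block_sz)
{
    int bx, by, tx, ty, i;
    if (block_sz <= 0 || a_rows % block_sz != 0 || b_cols % block_sz != 0)
        return -1;
    for (by = 0; by < a_rows / block_sz; by++)           /* blockIdx.y  */
        for (bx = 0; bx < b_cols / block_sz; bx++)       /* blockIdx.x  */
            for (ty = 0; ty < block_sz; ty++)            /* threadIdx.y */
                for (tx = 0; tx < block_sz; tx++) {      /* threadIdx.x */
                    int row = by * block_sz + ty;
                    int col = bx * block_sz + tx;
                    float entry = 0.0f;
                    for (i = 0; i < a_cols; i++)
                        entry += A[row * a_cols + i] * B[i * b_cols + col];
                    C[row * b_cols + col] = entry;
                }
    return 0;
}
```

Each iteration of the two innermost loops corresponds to one thread computing one entry of C. Comparing the output of this function with the matrix printed by the CUDA program is one possible way to check a run with small inputs.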
3.3 Programming Exercise 3 (Numerical calculation of the value of π)
Parallel programming is used extensively in scientific computing. The numerical calculation of the value of π involves a loop. This programming exercise uses a specified number of threads in such a way that each thread is assigned an equal portion of the specified interval.
3.3.1 Lab Objectives
Objectives of this lab experiment include:
1. Learning the application of CUDA to scientific (numerical) computing
2. Learning how to use thread IDs in situations where the sequence of execution is important
3. Learning how to attack loops for parallelism
3.3.2 Setup
Make sure that the environment variables are properly set up. If not, first set them as described in section 2.3.3.
    /*
     * File: pi.cu
     * Author: Ghulam Mustafa
     * Created on July 31,2009, 7:30 PM
     */
    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct PI_data{
        int n;          // total number of intervals
        int PerThrItr;  // iterations (intervals) per thread
        int nThr;       // number of threads
    } data;

    __global__ void calculate_PI(data d, float* s)
    {
        float sum, x, w;
        int itr,i,j;
        itr = d.PerThrItr;
        i = blockIdx.x * blockDim.x + threadIdx.x;
        // Width of one interval; it must be the same for every thread
        w = 1.0/(float)d.n;
        sum = 0.0;
        if (i < d.nThr)
        {
            // Thread i handles intervals i*itr .. i*itr + itr - 1
            for (j = i * itr; j < (i * itr+itr); j++)
            {
                x = w * (j-0.5);
                sum+= (4.0)/(1.0 + x*x);
            }
            // Partial sum of thread i
            s[i] = sum * w;
        }
    }
    // Host code
    int main(int argc, char** argv)
    {
        // Both arguments are required: number of intervals and number of threads
        if(argc < 3)
        {
            printf("Usage: ./<progname> #intervals #Threads\n");
            exit(1);
        }
        data pi_data;
        float PI=0;
        pi_data.n = atoi(argv[1]);
        pi_data.nThr = atoi(argv[2]);
        // Integer division: choose #intervals as a multiple of #Threads
        pi_data.PerThrItr = pi_data.n/pi_data.nThr;
        float *d_sum;
        float *h_sum;
        // Allocate vectors in device memory
        size_t size = pi_data.nThr * sizeof(float);
        cudaMalloc((void**)&d_sum, size);
        //Memory allocation on host
        h_sum = (float*) malloc(size);
        // cudaMemcpy(d_sum, h_sum, size, cudaMemcpyHostToDevice);
        int threads_per_block = 4;
        int blocks_per_grid;
        // Round up so that every thread ID below nThr is covered
        blocks_per_grid = (pi_data.nThr + threads_per_block - 1)/threads_per_block;
        calculate_PI<<<blocks_per_grid, threads_per_block>>>(pi_data, d_sum);
        cudaMemcpy(h_sum, d_sum, size, cudaMemcpyDeviceToHost);
        // Add up the partial sums of all threads on the host
        int i;
        for (i = 0; i < pi_data.nThr; i++)
            PI+= h_sum[i];
        //PI = PI * pi_data.n;
        printf("Using %d iterations, Value of PI is %f \n", pi_data.n, PI);
        // Free device and host memory
        cudaFree(d_sum);
        free(h_sum);
        return 0;
    }
3.3.3 Procedure
For computing π we use numerical methods.
\[ \pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx \sum_{i=0}^{N-1} \frac{4}{1+\left(\frac{i-0.5}{N}\right)^2} \times \frac{1}{N} \]
Using this technique, each partial sum can be calculated in parallel. Write this program in any text editor and save it with the .cu extension (if a soft copy is not available). Compile and run it as shown below. Experiment with different numbers of intervals and threads. Try to understand how threadIdx, blockDim and blockIdx are exploited here to keep the sequence of the workflow.
To Compile & Run (the first argument is the number of intervals, the second the number of threads):
    [root@gm gm]# nvcc -o PI -deviceemu pi.cu
    [root@gm gm]# ./PI 2300 25
3.3.4 Conclusions
List your conclusions with respect to the objectives of this experiment.
3.3.5 Lab Instructor’s Evaluation
Lab instructor’s remark on whether the student finished the work to meet the lab objectives.
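The split of the sum into per-thread partial sums can be tried out on the CPU first. The following plain C sketch is illustrative only: pi_partial and pi_estimate are hypothetical names, and the sketch deliberately differs from pi.cu in two assumed details. It uses double precision, and it samples each interval at its midpoint (j + 0.5)/N, whereas the listing uses (j - 0.5) and float.

```c
/* Partial sum computed by "thread" tid, which owns the intervals
 * tid*per_thread .. (tid+1)*per_thread - 1 out of n intervals on [0, 1]. */
double pi_partial(int n, int per_thread, int tid)
{
    double w = 1.0 / (double)n;   /* width of one interval */
    double sum = 0.0;
    int j;
    for (j = tid * per_thread; j < (tid + 1) * per_thread; j++) {
        double x = w * (j + 0.5); /* midpoint of interval j (assumption) */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * w;
}

/* Adds the partial sums of all threads, as the host loop in main() does.
 * Intervals left over when n is not a multiple of n_threads are skipped,
 * which mirrors the integer division used for PerThrItr. */
double pi_estimate(int n, int n_threads)
{
    int per_thread = n / n_threads;
    double pi = 0.0;
    int tid;
    for (tid = 0; tid < n_threads; tid++)
        pi += pi_partial(n, per_thread, tid);
    return pi;
}
```

With 2300 intervals and 25 threads every thread gets 92 intervals and the whole range is covered. With 10 intervals and 3 threads the last interval is dropped and the estimate comes out too small, so the number of intervals is best chosen as a multiple of the number of threads.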
3.4 Programming Exercise 4 (Parallel Sort)
A sorting network is a sorting algorithm in which the sequence of comparisons is not data-dependent, which makes it suitable for parallel implementation. Bitonic sort is one of the fastest sorting networks, consisting of Θ(n log² n) comparators. It has a simple implementation and is very efficient when sorting a small number of elements.
3.4.1 Lab Objectives
Objectives of this lab experiment include:
1. Learning the bitonic sorting algorithm
2. Learning how to use the __shared__ construct
3. Learning how to use the __device__ construct
4. Using barrier synchronization to coordinate threads
3.4.2 Setup
Make sure that the environment variables are properly set up. If not, first set them as described in section 2.3.3.
    /*
     * File: parallel_sort.cu
     * Author: Ghulam Mustafa
     * Created on July 31,2009, 7:30 PM
     * Code is adapted from Nvidia CUDA SDK sample projects ver 2.2.1
     */
    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM 32

    // Device-only helper: exchanges two values
    __device__ inline void swap(int & a, int & b)
    {
        int tmp = a;
        a = b;
        b = tmp;
    }

    __global__ static void bitonicSort(int * values)
    {
        // Shared memory; its size is given by the third launch parameter
        extern __shared__ int shared[];
        const unsigned int tid = threadIdx.x;

        // Copy input to shared mem.
        shared[tid] = values[tid];
        __syncthreads();

        // Parallel bitonic sort
        for (unsigned int k = 2; k <= NUM; k *= 2)
        {
            // Bitonic merge:
            for (unsigned int j = k / 2; j>0; j /= 2)
            {
                // Partner element for this compare-and-swap step
                unsigned int ixj = tid ^ j;
                if (ixj > tid)
                {
                    if ((tid & k) == 0)
                    {
                        // Ascending part of the bitonic sequence
                        if (shared[tid] > shared[ixj])
                        {
                            swap(shared[tid], shared[ixj]);
                        }
                    }
                    else
                    {
                        // Descending part of the bitonic sequence
                        if (shared[tid] < shared[ixj])
                        {
                            swap(shared[tid], shared[ixj]);
                        }
                    }
                }
                // Barrier: all threads finish this step before the next one starts
                __syncthreads();
            }
        }

        // Write result.
        values[tid] = shared[tid];
    }

    int main(int argc, char** argv)
    {
        int values[NUM];
        printf( "\nUnsorted Array\n==============\n");
        for(int i = 0; i < NUM; i++)
        {
            values[i] = rand()%1000;
            printf("%d\t",values[i]);
        }
        printf("\n");
        int * dvalues;
        cudaMalloc((void**)&dvalues, sizeof(int) * NUM);
        cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);
        // One block of NUM threads, with NUM ints of shared memory
        bitonicSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);
        // check for any errors
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
        cudaFree(dvalues);
        bool passed = true;
        int i;
        printf( "\nSorted Array\n==============\n");
        for( i = 1; i < NUM; i++)
        {
            if (values[i-1] > values[i])
                passed = false;
            printf( "%d\t", values[i-1]);
        }
        // Print the last element; the loop above stops one element short
        printf( "%d\t\n", values[NUM-1]);
        printf( "Test %s\n", passed ? "PASSED" : "FAILED");
        return 0;
    }
3.4.3 Procedure
Write this program in any text editor and save it with the .cu extension (if a soft copy is not available). Compile and run it as shown below. Experiment with values of NUM and check the status of the test (last line of the output). The bitonic network requires NUM to be a power of two, and because NUM is also the number of threads in the single block, it cannot exceed the per-block thread limit reported by deviceQuery (512 on the device shown in section 2.4.5). Try to understand the concept of threadIdx, blockDim and blockIdx and how they are used in this context.
To Compile & Run:
    [root@gm gm]# nvcc -o ll_sort -deviceemu parallel_sort.cu
    [root@gm gm]# ./ll_sort
3.4.4 Conclusions
List your conclusions with respect to the objectives of this experiment.
3.4.5 Lab Instructor’s Evaluation
Lab instructor’s remark on whether the student finished the work to meet the lab objectives.
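The compare-and-swap pattern of the bitonic network can be followed step by step in plain C. This is a sketch under stated assumptions, not the SDK implementation: bitonic_sort_emulated and swap_int are hypothetical names, and the inner loop over tid visits one after another the work that the CUDA threads do concurrently. That is safe here because within one step every element belongs to exactly one pair, so the pairs do not interfere.

```c
/* Exchanges two integers. */
static void swap_int(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Sorts values[0..n-1] in ascending order with the bitonic network.
 * n must be a power of two. Returns 0 on success, -1 otherwise. */
int bitonic_sort_emulated(int *values, int n)
{
    int k, j, tid;
    if (n <= 0 || (n & (n - 1)) != 0)
        return -1;
    for (k = 2; k <= n; k *= 2) {          /* size of the bitonic sequences */
        for (j = k / 2; j > 0; j /= 2) {   /* distance between partners */
            /* One pass over all "threads"; reaching the end of this loop
             * corresponds to the __syncthreads() barrier in the kernel. */
            for (tid = 0; tid < n; tid++) {
                int ixj = tid ^ j;         /* partner index */
                if (ixj > tid) {
                    if ((tid & k) == 0) {  /* ascending half */
                        if (values[tid] > values[ixj])
                            swap_int(&values[tid], &values[ixj]);
                    } else {               /* descending half */
                        if (values[tid] < values[ixj])
                            swap_int(&values[tid], &values[ixj]);
                    }
                }
            }
        }
    }
    return 0;
}
```

Printing the array after each pass of the j loop shows how sorted runs of length 2, 4, 8 and so on are built up. The early return for sizes that are not powers of two makes explicit a restriction that the kernel listing leaves implicit.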