This document provides an overview of the CUDA lab including the programming environment, GPU server specifications, CUDA tools, lab assignments, and programming tips. The GPU server has two Intel Xeon CPUs and two NVIDIA K20X GPUs with 5760MB of memory each. The lab assignments involve rewriting CPU programs to CUDA kernels and optimizing parallel reduction algorithms. CUDA tools demonstrated include cuda-memcheck for error checking and nvidia-smi for querying the GPU state. Programming tips cover kernel launch configuration, thread indexing, memory transfers, synchronization, and profiling kernel execution time.
5. COMPILE & RUN CUDA
Directly compile to executable code
GPU and CPU code are compiled and linked separately
# compile the source code to an executable file
$ nvcc a.cu -o a.out
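For reference, a minimal a.cu that this command would compile might look like the following. The kernel name and contents are illustrative only, not part of the lab material:

```cuda
#include <cstdio>

// Illustrative kernel: each thread writes its own index into the array.
__global__ void fillIndex(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    const int n = 32;
    int h_out[n];
    int *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(int));
    fillIndex<<<1, n>>>(d_out);           // 1 block of n threads
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    printf("h_out[31] = %d\n", h_out[31]);
    return 0;
}
```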
6. COMPILE & RUN CUDA
The nvcc compiler will translate CUDA source code into Parallel
Thread Execution (PTX) language in the intermediate phase.
# keep all intermediate phase files
$ nvcc a.cu -keep
# or
$ nvcc a.cu -save-temps
$ nvcc a.cu -keep
$ ls
a.cpp1.ii a.cpp4.ii a.cudafe1.c a.cudafe1.stub.c a.cudafe2.stub.c a.hash a
a.cpp2.i a.cu a.cudafe1.cpp a.cudafe2.c a.fatbin a.module_id a
a.cpp3.i a.cu.cpp.ii a.cudafe1.gpu a.cudafe2.gpu a.fatbin.c a.o a
# clean all intermediate phase files
$ nvcc a.cu -keep -clean
7. USEFUL NVCC USAGE
Print code generation statistics:
$ nvcc -Xptxas -v reduce.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6reducePiS_' for 'sm_10'
ptxas info : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]
-Xptxas
--ptxas-options
Specify options directly to the PTX optimizing assembler.
register number: should be less than the number of available
registers; otherwise the excess registers will be spilled into
local memory (off-chip).
smem stands for shared memory.
cmem stands for constant memory. The bank-#1 constant
memory stores 4 bytes of constant variables.
9. CUDA-MEMCHECK
This tool checks the following memory errors in your program,
and it also reports hardware exceptions encountered by the
GPU. These errors may not crash the program, but they can
cause unexpected program behavior and memory misuse.
Table. Memcheck reported error types

Name                     | Description                                                         | Location | Precision
Memory access error      | Errors due to out-of-bounds or misaligned accesses to memory by a   | Device   | Precise
                         | global, local, shared, or global atomic access.                     |          |
Hardware exception       | Errors reported by the hardware error-reporting mechanism.          | Device   | Imprecise
Malloc/Free errors       | Errors due to incorrect use of malloc()/free() in CUDA kernels.     | Device   | Precise
CUDA API errors          | Reported when a CUDA API call in the application returns a failure. | Host     | Precise
cudaMalloc memory leaks  | Allocations of device memory using cudaMalloc() that have not been  | Host     | Precise
                         | freed by the application.                                           |          |
Device heap memory leaks | Allocations of device memory using malloc() in device code that     | Device   | Imprecise
                         | have not been freed by the application.                             |          |
10. CUDA-MEMCHECK
EXAMPLE
Program with a double-free fault:
int main(int argc, char *argv[])
{
    const int elemNum = 1024;
    int h_data[elemNum];
    int *d_data;
    initArray(h_data);
    int arraySize = elemNum * sizeof(int);
    cudaMalloc((void **)&d_data, arraySize);
    incrOneForAll<<<1, 1024>>>(d_data);
    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_data); // fault: d_data is freed twice
    printArray(h_data);
    return 0;
}
13. NVIDIA-SMI
You can query more specific information on temperature,
memory, power, etc.
$ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]
For example:
$ nvidia-smi -q -d POWER
==============NVSMI LOG==============
Timestamp :
Driver Version : 319.37
Attached GPUs : 2
GPU 0000:0B:00.0
    Power Readings
        Power Management : Supported
        Power Draw : 60.71 W
        Power Limit : 235.00 W
        Default Power Limit : 235.00 W
        Enforced Power Limit : 235.00 W
        Min Power Limit : 150.00 W
        Max Power Limit : 235.00 W
GPU 0000:85:00.0
    Power Readings
        Power Management : Supported
        Power Draw : 31.38 W
        Power Limit : 235.00 W
        Default Power Limit : 235.00 W
14. LAB ASSIGNMENTS
1. Program #1: increase each element in an array by one.
(You are required to rewrite a CPU program into a CUDA
one.)
2. Program #2: use parallel reduction to calculate the sum of all
the elements in an array.
(You are required to fill in the blanks of a template CUDA
program, and report your GPU bandwidth to the TA after you
finish each assignment.)
1. SUM CUDA programming with "multi-kernel and shared
memory"
2. SUM CUDA programming with "interleaved addressing"
3. SUM CUDA programming with "sequential addressing"
4. SUM CUDA programming with "first add during load"
0.2 points per task.
15. LABS ASSIGNMENT #1
Rewrite the following CPU function into a CUDA kernel
function and complete the main function by yourself:
// increase one for all the elements
void incrOneForAll(int *array, const int elemNum)
{
    int i;
    for (i = 0; i < elemNum; ++i)
    {
        array[i]++;
    }
}
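One possible CUDA rewrite is sketched below. This is an illustrative answer, not the only valid one; the bounds check assumes elemNum need not be a multiple of the block size:

```cuda
// Each thread handles one array element.
__global__ void incrOneForAll(int *array, const int elemNum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elemNum)   // guard against the last partial block
    {
        array[i]++;
    }
}

// Launch with one thread per element, e.g. for 256-thread blocks:
// incrOneForAll<<<(elemNum + 255) / 256, 256>>>(d_array, elemNum);
```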
16. LABS ASSIGNMENT #2
Fill in the CUDA kernel function:
Part of the main function is given; you are required to fill in the
blanks according to the comments:
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // TODO: load the content of global memory to shared memory
    // NOTE: synchronize all the threads after this step
    // TODO: sum calculation
    // NOTE: synchronize all the threads after each iteration
    // TODO: write back the result into the corresponding entry of global memory
    // NOTE: only one thread is enough to do the job
}
// parameters for the first kernel
// TODO: set grid and block size
// threadNum = ?
// blockNum = ?
int sMemSize = 1024 * sizeof(int);
reduce<<<blockNum, threadNum, sMemSize>>>(d_idata, d_odata);
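For reference, one way the blanks could be filled in is sketched below, using the sequential-addressing scheme. Treat it as an illustrative sketch rather than the required answer; the other assignment variants change the loop structure:

```cuda
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // load the content of global memory to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads(); // synchronize all the threads after this step

    // sum calculation (sequential addressing: stride halves each pass)
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads(); // synchronize all the threads after each iteration
    }

    // write back the result; only one thread is enough to do the job
    if (tid == 0)
    {
        g_odata[blockIdx.x] = sdata[0];
    }
}
```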
17. LABS ASSIGNMENT #2
Hint: for the "first add during global load" optimization (Assignment
#2-4), the third kernel is unnecessary.
Given $2^{22}$ INTs and a maximum block size of $2^{10}$
threads, how can we use 3 kernels to synchronize between iterations?
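Assuming the intended sizes are $2^{22}$ elements and $2^{10}$ threads per block, the three launches reduce $2^{22} \rightarrow 2^{12} \rightarrow 4 \rightarrow 1$. A host-side sketch (buffer names d_tmp1, d_tmp2, d_sum are illustrative):

```cuda
// Each kernel launch boundary acts as a global synchronization point,
// because blocks cannot synchronize with each other inside one kernel.
int sMemSize = 1024 * sizeof(int);
reduce<<<4096, 1024, sMemSize>>>(d_idata, d_tmp1);  // 2^22 -> 2^12 partial sums
reduce<<<4, 1024, sMemSize>>>(d_tmp1, d_tmp2);      // 2^12 -> 4 partial sums
reduce<<<1, 4, 4 * sizeof(int)>>>(d_tmp2, d_sum);   // 4 -> 1 final sum
```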
21. LABS ASSIGNMENT #2-4
Reduce the number of blocks in each kernel:
Notice:
Only 2 kernels are needed in this case, because each kernel
can now process twice the amount of data it did before.
Global memory should be accessed in a sequential
addressing way.
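The "first add during load" change only affects the load step: each thread fetches two elements and adds them while filling shared memory, so each block covers 2 * blockDim.x inputs and the block count halves. A sketch of the modified load (inside the reduce kernel):

```cuda
// Each block now processes 2 * blockDim.x elements of g_idata.
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i + blockDim.x]; // first add during load
__syncthreads();
```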
24. BUILT-IN VARIABLES FOR INDEXING IN A
KERNEL FUNCTION
blockIdx.x, blockIdx.y, blockIdx.z: block index
threadIdx.x, threadIdx.y, threadIdx.z: thread index
gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks
per grid) per dimension
blockDim.x, blockDim.y, blockDim.z: block size (number of
threads per block) per dimension
26. SYNCHRONIZATION
__syncthreads(): synchronizes all threads in a block (used inside
the kernel function).
cudaDeviceSynchronize(): blocks until the device has
completed all preceding requested tasks (used between two
kernel launches).
kernel1<<<gridSize, blockSize>>>(args);
cudaDeviceSynchronize();
kernel2<<<gridSize, blockSize>>>(args);
27. HOW TO MEASURE KERNEL EXECUTION TIME
USING CUDA GPU TIMERS
Methods:
cudaEventCreate(): init timer
cudaEventDestroy(): destroy timer
cudaEventRecord(): set timer
cudaEventSynchronize(): sync timer after each kernel call
cudaEventElapsedTime(): returns the elapsed time in
milliseconds
28. Example:
HOW TO MEASURE KERNEL EXECUTION TIME
USING CUDA GPU TIMERS
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, threads>>>(d_idata, d_odata);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);