CUDA LAB
LSALAB
OVERVIEW
Programming Environment
Compile & Run CUDA Program
CUDA Tools
Lab Tasks
CUDA Programming Tips
References
GPU SERVER
Intel E5-2670 v2 10-core CPU x 2
NVIDIA K20X GPGPU card x 2
Command to get your GPGPU HW spec:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
Device 0: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version:          5.5 / 5.5
  CUDA Capability Major/Minor version number:     3.5
  Total amount of global memory:                  5760 MBytes (6039339008 bytes)
  (14) Multiprocessors, (192) CUDA Cores/MP:      2688 CUDA Cores
  GPU Clock rate:                                 732 MHz (0.73 GHz)
  Memory Clock rate:                              2600 MHz
  Memory Bus Width:                               384-bit
  L2 Cache Size:                                  1572864 bytes
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        49152 bytes
  Total number of registers available per block:  65536
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):      (2147483647, 65535, 65535)
Theoretical memory bandwidth: $2600 \times 10^{6} \times (384/8) \times 2 \div 1024^{3} \approx 232$ GB/s (about 250 GB/s in decimal units, matching the official spec).
Official HW spec details:
http://www.nvidia.com/object/tesla-servers.html
COMPILE & RUN CUDA
Directly compile to executable code.
GPU and CPU code are compiled and linked separately.
# compile the source code to an executable file
$ nvcc a.cu -o a.out
COMPILE & RUN CUDA
The nvcc compiler translates CUDA source code into the Parallel Thread Execution (PTX) language in the intermediate phase.
# keep all intermediate phase files
$ nvcc a.cu -keep
# or
$ nvcc a.cu --save-temps
$ nvcc a.cu -keep
$ ls
a.cpp1.ii  a.cpp2.i  a.cpp3.i  a.cpp4.ii  a.cu  a.cu.cpp.ii  a.cudafe1.c
a.cudafe1.cpp  a.cudafe1.gpu  a.cudafe1.stub.c  a.cudafe2.c  a.cudafe2.gpu
a.cudafe2.stub.c  a.fatbin  a.fatbin.c  a.hash  a.module_id  a.o
# clean all intermediate phase files
$ nvcc a.cu -keep -clean
USEFUL NVCC USAGE
Print code generation statistics:
$ nvcc -Xptxas -v reduce.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6reducePiS_' for 'sm_10'
ptxas info : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]
-Xptxas
--ptxas-options
Specify options directly to the PTX optimizing assembler.
Register count: should be less than the number of available registers; otherwise the remaining registers are spilled into local memory (off-chip).
smem stands for shared memory.
cmem stands for constant memory. The bank-#1 constant memory stores 4 bytes of constant variables.
CUDA TOOLS
cuda-memcheck: functional correctness checking suite.
nvidia-smi: NVIDIA System Management Interface.
CUDA-MEMCHECK
This tool checks the following memory errors in your program, and it also reports hardware exceptions encountered by the GPU.
These errors may not crash the program, but they can cause unexpected behavior and memory misuse.
Table. Memcheck reported error types
Name                     | Description                                                                        | Location | Precision
Memory access error      | Errors due to out-of-bounds or misaligned accesses to memory by a global, local, shared, or global atomic access. | Device   | Precise
Hardware exception       | Errors that are reported by the hardware error reporting mechanism.                | Device   | Imprecise
Malloc/Free errors       | Errors that occur due to incorrect use of malloc()/free() in CUDA kernels.         | Device   | Precise
CUDA API errors          | Reported when a CUDA API call in the application returns a failure.                | Host     | Precise
cudaMalloc memory leaks  | Allocations of device memory using cudaMalloc() that have not been freed by the application. | Host     | Precise
Device heap memory leaks | Allocations of device memory using malloc() in device code that have not been freed by the application. | Device   | Imprecise
CUDA-MEMCHECK
EXAMPLE
Program with a double-free fault:
int main(int argc, char *argv[])
{
    const int elemNum = 1024;
    int h_data[elemNum];
    int *d_data;
    initArray(h_data);
    int arraySize = elemNum * sizeof(int);
    cudaMalloc((void **)&d_data, arraySize);
    incrOneForAll<<<1, 1024>>>(d_data);
    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_data); // fault: freeing the same pointer twice
    printArray(h_data);
    return 0;
}
CUDA-MEMCHECK
EXAMPLE
$ nvcc -g -G example.cu
$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Program hit error 17 on CUDA API call to cudaFree
=========   Saved host backtrace up to driver entry point at error
=========   Host Frame:/usr/lib64/libcuda.so [0x26d660]
=========   Host Frame:./a.out [0x42af6]
=========   Host Frame:./a.out [0x2a29]
=========   Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========   Host Frame:./a.out [0x2769]
=========
No error is shown if the program is run directly, but cuda-memcheck can detect the error.
NVIDIA SYSTEM MANAGEMENT INTERFACE
(NVIDIA-SMI)
Purpose: query and modify the state of GPU devices.
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm          On  | 0000:0B:00.0     Off |                    0 |
| N/A   35C    P0    60W / 235W |     84MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm          On  | 0000:85:00.0     Off |                    0 |
| N/A   39C    P0    60W / 235W |     14MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     33736  ./RS                                             69MB       |
+-----------------------------------------------------------------------------+
NVIDIA-SMI
You can query more specific information on temperature, memory, power, etc.
$ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]
For example:
$ nvidia-smi -q -d POWER
============== NVSMI LOG ==============
Timestamp :
Driver Version : 319.37
Attached GPUs : 2
GPU 0000:0B:00.0
    Power Readings
        Power Management     : Supported
        Power Draw           : 60.71 W
        Power Limit          : 235.00 W
        Default Power Limit  : 235.00 W
        Enforced Power Limit : 235.00 W
        Min Power Limit      : 150.00 W
        Max Power Limit      : 235.00 W
GPU 0000:85:00.0
    Power Readings
        Power Management     : Supported
        Power Draw           : 31.38 W
        Power Limit          : 235.00 W
        Default Power Limit  : 235.00 W
LAB ASSIGNMENTS
1. Program #1: increase each element in an array by one.
(You are required to rewrite a CPU program into a CUDA one.)
2. Program #2: use parallel reduction to calculate the sum of all the elements in an array.
(You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)
  1. SUM CUDA programming with "multi-kernel and shared memory"
  2. SUM CUDA programming with "interleaved addressing"
  3. SUM CUDA programming with "sequential addressing"
  4. SUM CUDA programming with "first add during load"
0.2 points per task.
LAB ASSIGNMENT #1
Rewrite the following CPU function into a CUDA kernel function and complete the main function by yourself:
// increase one for all the elements
void incrOneForAll(int *array, const int elemNum)
{
    int i;
    for (i = 0; i < elemNum; ++i)
    {
        array[i]++;
    }
}
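One possible shape for the kernel (a sketch, not the only valid answer): each thread handles one element, with a bounds check so the grid may safely be larger than the array.

```cuda
// Sketch: one thread per element.
__global__ void incrOneForAll(int *array, const int elemNum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < elemNum)                               // guard threads past the end
    {
        array[i]++;
    }
}
```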
LAB ASSIGNMENT #2
Fill in the CUDA kernel function.
Part of the main function is given; you are required to fill in the blanks according to the comments:
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    // TODO: load the content of global memory into shared memory
    // NOTE: synchronize all the threads after this step
    // TODO: sum calculation
    // NOTE: synchronize all the threads after each iteration
    // TODO: write back the result into the corresponding entry of global memory
    // NOTE: only one thread is enough to do the job
}
// parameters for the first kernel
// TODO: set grid and block size
// threadNum = ?
// blockNum = ?
int sMemSize = 1024 * sizeof(int);
reduce<<<blockNum, threadNum, sMemSize>>>(d_idata, d_odata);
Hint: for the "first add during global load" optimization (Assignment #2-4), the third kernel is unnecessary.
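For reference, the sequential-addressing variant of the kernel body could look roughly like this (a sketch following the template's TODO comments; your solution may differ):

```cuda
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];          // load global memory into shared memory
    __syncthreads();                  // all loads done before summing
    // Sequential addressing: the stride halves each iteration.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();              // synchronize after each iteration
    }
    if (tid == 0)                     // one thread writes the block's result
        g_odata[blockIdx.x] = sdata[0];
}
```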
LAB ASSIGNMENT #2
Given $2^{22}$ INTs, and a maximum block size of $2^{10}$ threads:
How do we use 3 kernels to synchronize between iterations?
LAB ASSIGNMENT #2-1
Implement the naïve data-parallel version as follows:
LAB ASSIGNMENT #2-2
Reduce the number of active warps in your program:
LAB ASSIGNMENT #2-3
Prevent shared memory bank conflicts:
LAB ASSIGNMENT #2-4
Reduce the number of blocks in each kernel:
Notice:
Only 2 kernels are needed in this case because each kernel can now process twice as much data as before.
Global memory should be accessed in a sequential addressing way.
CUDA PROGRAMMING TIPS
KERNEL LAUNCH
mykernel<<<gridSize, blockSize, sMemSize, streamID>>>(args);
gridSize: number of blocks per grid
blockSize: number of threads per block
sMemSize [optional]: shared memory size (in bytes)
streamID [optional]: stream ID; the default is 0
BUILT-IN VARIABLES FOR INDEXING IN A
KERNEL FUNCTION
blockIdx.x, blockIdx.y, blockIdx.z: block index
threadIdx.x, threadIdx.y, threadIdx.z: thread index
gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks per grid) per dimension
blockDim.x, blockDim.y, blockDim.z: block size (number of threads per block) per dimension
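These variables are commonly combined to give each thread a unique global index. As an illustration (the scaleArray kernel is a hypothetical example, not part of the lab), a 1-D grid-stride loop lets a fixed-size grid cover an array of any length:

```cuda
// Hypothetical example: scale every element of an array.
__global__ void scaleArray(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int stride = gridDim.x * blockDim.x;             // total threads in grid
    for (int i = idx; i < n; i += stride)            // grid-stride loop
        data[i] *= factor;
}
```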
CUDAMEMCPY
cudaError_t cudaMemcpy(void *dst,
                       const void *src,
                       size_t count,
                       enum cudaMemcpyKind kind)
Enumerator:
cudaMemcpyHostToHost: Host -> Host
cudaMemcpyHostToDevice: Host -> Device
cudaMemcpyDeviceToHost: Device -> Host
cudaMemcpyDeviceToDevice: Device -> Device
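A minimal round trip (allocate on the device, copy in, copy back; error checking omitted for brevity):

```cuda
int h_data[256];                  // host buffer
int *d_data;
size_t bytes = 256 * sizeof(int);

cudaMalloc((void **)&d_data, bytes);                       // device buffer
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // host -> device
// ... launch kernels that read/write d_data ...
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // device -> host
cudaFree(d_data);
```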
SYNCHRONIZATION
__syncthreads(): synchronizes all threads in a block (used inside the kernel function).
cudaDeviceSynchronize(): blocks until the device has completed all preceding requested tasks (used between two kernel launches).
kernel1<<<gridSize, blockSize>>>(args);
cudaDeviceSynchronize();
kernel2<<<gridSize, blockSize>>>(args);
HOW TO MEASURE KERNEL EXECUTION TIME
USING CUDA GPU TIMERS
Methods:
cudaEventCreate(): init timer
cudaEventDestroy(): destroy timer
cudaEventRecord(): set timer
cudaEventSynchronize(): sync timer after each kernel call
cudaEventElapsedTime(): returns the elapsed time in milliseconds
Example:
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, threads>>>(d_idata, d_odata);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
REFERENCES
1. NVIDIA CUDA Runtime API
2. Programming Guide :: CUDA Toolkit Documentation
3. Best Practices Guide :: CUDA Toolkit Documentation
4. NVCC :: CUDA Toolkit Documentation
5. CUDA-MEMCHECK :: CUDA Toolkit Documentation
6. nvidia-smi documentation
7. CUDA error types
THE END
ENJOY CUDA & HAPPY NEW YEAR!

办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 

CUDA LAB Overview

  • 2. OVERVIEW: Programming Environment, Compile & Run CUDA Program, CUDA Tools, Lab Tasks, CUDA Programming Tips, References
  • 3. GPU SERVER: Intel E5-2670 V2 10-core CPU x 2, NVIDIA K20X GPGPU card x 2
  • 4. Command to get your GPGPU HW spec:
    $ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
    Device 0: "Tesla K20Xm"
      CUDA Driver Version / Runtime Version:          5.5 / 5.5
      CUDA Capability Major/Minor version number:     3.5
      Total amount of global memory:                  5760 MBytes (6039339008 bytes)
      (14) Multiprocessors, (192) CUDA Cores/MP:      2688 CUDA Cores
      GPU Clock rate:                                 732 MHz (0.73 GHz)
      Memory Clock rate:                              2600 MHz
      Memory Bus Width:                               384-bit
      L2 Cache Size:                                  1572864 bytes
      Total amount of constant memory:                65536 bytes
      Total amount of shared memory per block:        49152 bytes
      Total number of registers available per block:  65536
      Warp size:                                      32
      Maximum number of threads per multiprocessor:   2048
      Maximum number of threads per block:            1024
      Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z):      (2147483647, 65535, 65535)
    Theoretical memory bandwidth: $2600 \times 10^{6} \times (384/8) \times 2 \div 1024^{3} \approx 232.5$ GB/s
    Official HW spec details: http://www.nvidia.com/object/tesla-servers.html
  • 5. COMPILE & RUN CUDA Directly compile to executable code. GPU and CPU code are compiled and linked separately.
    # compile the source code to an executable file
    $ nvcc a.cu -o a.out
  • 6. COMPILE & RUN CUDA The nvcc compiler translates CUDA source code into Parallel Thread Execution (PTX) language in the intermediate phase.
    # keep all intermediate phase files
    $ nvcc a.cu -keep
    # or
    $ nvcc a.cu -save-temps
    $ nvcc a.cu -keep
    $ ls
    a.cpp1.ii  a.cpp4.ii    a.cudafe1.c    a.cudafe1.stub.c  a.cudafe2.stub.c  a.hash       a
    a.cpp2.i   a.cu         a.cudafe1.cpp  a.cudafe2.c       a.fatbin          a.module_id  a
    a.cpp3.i   a.cu.cpp.ii  a.cudafe1.gpu  a.cudafe2.gpu     a.fatbin.c        a.o          a
    # clean all intermediate phase files
    $ nvcc a.cu -keep -clean
  • 7. USEFUL NVCC USAGE Print code generation statistics:
    $ nvcc -Xptxas -v reduce.cu
    ptxas info : 0 bytes gmem
    ptxas info : Compiling entry function '_Z6reducePiS_' for 'sm_10'
    ptxas info : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]
    -Xptxas / --ptxas-options: specify options directly to the PTX optimizing assembler.
    The register count should stay below the number of available registers; otherwise the excess registers are spilled to local memory (off-chip).
    smem stands for shared memory; cmem stands for constant memory. The bank-#1 constant memory here stores 4 bytes of constant variables.
  • 8. CUDA TOOLS cuda-memcheck: functional correctness checking suite. nvidia-smi: NVIDIA System Management Interface.
  • 9. CUDA-MEMCHECK This tool checks the following memory errors of your program, and it also reports hardware exceptions encountered by the GPU. These errors may not crash the program, but they can cause unexpected behavior and memory misuse.
    Table. Memcheck reported error types (Name: Description; Location, Precision):
    - Memory access error: errors due to out-of-bounds or misaligned accesses to memory by a global, local, shared or global atomic access. Device, Precise
    - Hardware exception: errors that are reported by the hardware error reporting mechanism. Device, Imprecise
    - Malloc/Free errors: errors that occur due to incorrect use of malloc()/free() in CUDA kernels. Device, Precise
    - CUDA API errors: reported when a CUDA API call in the application returns a failure. Host, Precise
    - cudaMalloc memory leaks: allocations of device memory using cudaMalloc() that have not been freed by the application. Host, Precise
    - Device heap memory leaks: allocations of device memory using malloc() in device code that have not been freed by the application. Device, Imprecise
  • 10. CUDA-MEMCHECK EXAMPLE Program with a double-free fault:
    int main(int argc, char *argv[])
    {
        const int elemNum = 1024;
        int h_data[elemNum];
        int *d_data;
        initArray(h_data);
        int arraySize = elemNum * sizeof(int);
        cudaMalloc((void **)&d_data, arraySize);
        incrOneForAll<<<1, 1024>>>(d_data);
        cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        cudaFree(d_data); // fault
        printArray(h_data);
        return 0;
    }
  • 11. CUDA-MEMCHECK EXAMPLE
    $ nvcc -g -G example.cu
    $ cuda-memcheck ./a.out
    ========= CUDA-MEMCHECK
    ========= Program hit error 17 on CUDA API call to cudaFree
    =========     Saved host backtrace up to driver entry point at error
    =========     Host Frame:/usr/lib64/libcuda.so [0x26d660]
    =========     Host Frame:./a.out [0x42af6]
    =========     Host Frame:./a.out [0x2a29]
    =========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
    =========     Host Frame:./a.out [0x2769]
    =========
    No error is shown if the program is run directly, but cuda-memcheck can detect the error.
  • 12. NVIDIA SYSTEM MANAGEMENT INTERFACE (NVIDIA-SMI) Purpose: query and modify GPU devices' state.
    $ nvidia-smi
    +------------------------------------------------------+
    | NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K20Xm          On  | 0000:0B:00.0     Off |                    0 |
    | N/A   35C    P0    60W / 235W |     84MB /  5759MB   |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K20Xm          On  | 0000:85:00.0     Off |                    0 |
    | N/A   39C    P0    60W / 235W |     14MB /  5759MB   |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Compute processes:                                               GPU Memory |
    |  GPU       PID  Process name                                     Usage      |
    |=============================================================================|
    |    0     33736  ./RS                                             69MB       |
    +-----------------------------------------------------------------------------+
  • 13. NVIDIA-SMI You can query more specific information on temperature, memory, power, etc.
    $ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]
    For example:
    $ nvidia-smi -q -d POWER
    ============== NVSMI LOG ==============
    Timestamp               :
    Driver Version          : 319.37
    Attached GPUs           : 2
    GPU 0000:0B:00.0
        Power Readings
            Power Management     : Supported
            Power Draw           : 60.71 W
            Power Limit          : 235.00 W
            Default Power Limit  : 235.00 W
            Enforced Power Limit : 235.00 W
            Min Power Limit      : 150.00 W
            Max Power Limit      : 235.00 W
    GPU 0000:85:00.0
        Power Readings
            Power Management     : Supported
            Power Draw           : 31.38 W
            Power Limit          : 235.00 W
            Default Power Limit  : 235.00 W
  • 14. LAB ASSIGNMENTS
    1. Program #1: increase each element in an array by one. (You are required to rewrite a CPU program into a CUDA one.)
    2. Program #2: use parallel reduction to calculate the sum of all the elements in an array. (You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)
       1. SUM CUDA programming with "multi-kernel and shared memory"
       2. SUM CUDA programming with "interleaved addressing"
       3. SUM CUDA programming with "sequential addressing"
       4. SUM CUDA programming with "first add during load"
    0.2 points per task.
  • 15. LABS ASSIGNMENT #1 Rewrite the following CPU function into a CUDA kernel function and complete the main function by yourself:
    // increase one for all the elements
    void incrOneForAll(int *array, const int elemNum)
    {
        int i;
        for (i = 0; i < elemNum; ++i) {
            array[i]++;
        }
    }
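For orientation, one possible shape of the CUDA version is sketched below. This is not the required solution; the block size and the bounds check are choices you make yourself, and the host-side launch shown in comments assumes `d_data` was allocated with cudaMalloc.

```cuda
// Each thread increments one element. The bounds check guards
// against elemNum not being a multiple of the block size.
__global__ void incrOneForAll(int *array, const int elemNum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elemNum) {
        array[i]++;
    }
}

// Possible host-side launch (blockSize of 256 is an arbitrary choice):
//   int blockSize = 256;
//   int gridSize  = (elemNum + blockSize - 1) / blockSize;
//   incrOneForAll<<<gridSize, blockSize>>>(d_data, elemNum);
```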
  • 16. LABS ASSIGNMENT #2 Fill in the CUDA kernel function. Part of the main function is given; you are required to fill in the blanks according to the comments:
    __global__ void reduce(int *g_idata, int *g_odata)
    {
        extern __shared__ int sdata[];
        // TODO: load the content of global memory to shared memory
        // NOTE: synchronize all the threads after this step
        // TODO: sum calculation
        // NOTE: synchronize all the threads after each iteration
        // TODO: write back the result into the corresponding entry of global memory
        // NOTE: only one thread is enough to do the job
    }
    // parameters for the first kernel
    // TODO: set grid and block size
    // threadNum = ?
    // blockNum = ?
    int sMemSize = 1024 * sizeof(int);
    reduce<<<blockNum, threadNum, sMemSize>>>(d_idata, d_odata);
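For reference, a minimal sketch of how the blanks could be filled, using the sequential-addressing scheme of assignment #2-3. This is one possible answer rather than the only one, and it assumes the block size is a power of two and the input length is a multiple of the block size:

```cuda
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // load one element per thread from global to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // tree-based sum: halve the number of active threads each step,
    // with each active thread adding an element s positions away
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // thread 0 writes this block's partial sum back to global memory
    if (tid == 0) {
        g_odata[blockIdx.x] = sdata[0];
    }
}
```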
  • 17. LABS ASSIGNMENT #2 Given $2^{22}$ INTs, and a maximum block size of $2^{10}$ threads: how to use 3 kernels to synchronize between iterations? Hint: with the "first add during global load" optimization (assignment #2-4), the third kernel is unnecessary.
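The kernel count follows from the sizes. A sketch of the arithmetic, assuming the figures are $2^{22}$ elements and $2^{10}$ threads per block (each kernel launch collapses every block's elements into one partial sum, and a kernel boundary is what synchronizes the grid between iterations):

```latex
2^{22} \;\xrightarrow{\text{kernel 1}}\; 2^{22}/2^{10} = 2^{12}
       \;\xrightarrow{\text{kernel 2}}\; 2^{12}/2^{10} = 2^{2}
       \;\xrightarrow{\text{kernel 3}}\; 1
```

With "first add during load", each thread adds two elements while loading, so each block covers $2^{11}$ inputs: $2^{22} \to 2^{11} \to 1$, and two kernels suffice.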
  • 18. LABS ASSIGNMENT #2-1 Implement the naïve data parallelism assignment as follows:
  • 19. LABS ASSIGNMENT #2-2 Reduce the number of active warps of your program:
  • 20. LABS ASSIGNMENT #2-3 Prevent shared memory bank conflicts:
  • 21. LABS ASSIGNMENT #2-4 Reduce the number of blocks in each kernel. Notice: only 2 kernels are needed in this case because each kernel can now process twice as much data as before. Global memory should be accessed in a sequential-addressing way.
  • 23. KERNEL LAUNCH
    mykernel<<<gridSize, blockSize, sMemSize, streamID>>>(args);
    gridSize: number of blocks per grid
    blockSize: number of threads per block
    sMemSize [optional]: shared memory size (in bytes)
    streamID [optional]: stream ID, default is 0
  • 24. BUILT-IN VARIABLES FOR INDEXING IN A KERNEL FUNCTION
    blockIdx.x, blockIdx.y, blockIdx.z: block index
    threadIdx.x, threadIdx.y, threadIdx.z: thread index
    gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks per grid) per dimension
    blockDim.x, blockDim.y, blockDim.z: block size (number of threads per block) per dimension
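A common way to combine these variables is to compute a unique global thread index. The sketch below shows the 1D case (the kernel name and the doubling operation are just placeholders); the grid-stride loop lets a fixed-size grid cover an array of any length:

```cuda
__global__ void kernel1D(int *data, int n)
{
    // unique index of this thread across the whole grid (1D launch)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // total number of threads in the grid
    int stride = gridDim.x * blockDim.x;
    // grid-stride loop: each thread handles elements idx, idx+stride, ...
    for (int i = idx; i < n; i += stride) {
        data[i] *= 2;
    }
}
```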
  • 26. SYNCHRONIZATION
    __syncthreads(): synchronizes all threads in a block (used inside the kernel function).
    cudaDeviceSynchronize(): blocks until the device has completed all preceding requested tasks (used between two kernel launches).
    kernel1<<<gridSize, blockSize>>>(args);
    cudaDeviceSynchronize();
    kernel2<<<gridSize, blockSize>>>(args);
  • 27. HOW TO MEASURE KERNEL EXECUTION TIME USING CUDA GPU TIMERS Methods:
    cudaEventCreate(): init timer
    cudaEventDestroy(): destroy timer
    cudaEventRecord(): set timer
    cudaEventSynchronize(): sync timer after each kernel call
    cudaEventElapsedTime(): returns the elapsed time in milliseconds
  • 28. HOW TO MEASURE KERNEL EXECUTION TIME USING CUDA GPU TIMERS Example:
    cudaEvent_t start, stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<grid, threads>>>(d_idata, d_odata);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
  • 29. REFERENCES
    1. NVIDIA CUDA Runtime API
    2. Programming Guide :: CUDA Toolkit Documentation
    3. Best Practices Guide :: CUDA Toolkit Documentation
    4. NVCC :: CUDA Toolkit Documentation
    5. CUDA-MEMCHECK :: CUDA Toolkit Documentation
    6. nvidia-smi documentation
    7. CUDA error types
  • 30. THE END ENJOY CUDA & HAPPY NEW YEAR!