
CUDA Debugger

CUDA basic debugging research


  1. 2020/10/22 朱玉婷 CUDA Debugger
  2. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  3. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  4. BASIC CONCEPT  [Diagram: relationships between the terms kernel, device, SM, block, warp, lane, and thread, annotated with 1 : 1, 1 : N, and N : 1 mappings]
  5. FUNCTION SPECIFIERS  Denote whether a function executes on the host or on the device, and whether it is callable from the host or from the device   __global__ void kernel ( ) : executes on the device, callable from the host   __device__ void device ( ) : executes on the device, callable from the device   __host__ void main ( ) : executes on the host, callable from the host
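The three specifiers above can be sketched in a minimal program; the function and variable names (square, kernel, d_out) are illustrative:

```cuda
#include <cstdio>

// __device__: executes on the device, callable only from device code
__device__ float square(float x) { return x * x; }

// __global__: executes on the device, callable (launched) from the host
__global__ void kernel(float *out) {
    out[threadIdx.x] = square((float)threadIdx.x);
}

// __host__ (the default when no specifier is given): executes on the host
__host__ int main() {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    kernel<<<1, 32>>>(d_out);   // the host launches the __global__ kernel
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```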
  6. COMPILING PROCESS   Separate the source code into host code and device code   NVCC continues compiling the device code (to PTX)   The host code is passed to a C++ compiler   Combine them into an executable file  [Diagram: .cu source splits into CPU code (.cpp, C++ compiler) and GPU code (.ptx); the host linker combines them into an executable]
  7. PROGRAMMING MODEL  The GPU is a coprocessor to the CPU, and each has its own memory; a CUDA program contains CPU code and GPU code   Copy data from CPU to GPU   Allocate GPU memory   Launch the kernel on the GPU   Copy data from GPU to CPU  C function vs. CUDA C function: malloc / cudaMalloc, memcpy / cudaMemcpy, memset / cudaMemset, free / cudaFree
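A minimal sketch of the four-step flow and the C / CUDA C function pairs above; the kernel and array names are illustrative:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>

__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;
    size_t bytes = n * sizeof(int);

    int *h = (int *)malloc(bytes);                    // CPU: malloc
    memset(h, 0, bytes);                              // CPU: memset
    int *d;
    cudaMalloc(&d, bytes);                            // GPU: cudaMalloc
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // data: CPU to GPU
    add_one<<<(n + 127) / 128, 128>>>(d, n);          // launch kernel on GPU
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // data: GPU to CPU
    printf("h[0] = %d\n", h[0]);
    cudaFree(d);                                      // GPU: cudaFree
    free(h);                                          // CPU: free
    return 0;
}
```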
  8. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  9. GPU HARDWARE ARCHITECTURE  [Diagram: device memory with texture cache and constant cache; each SM (SM0, SM1) contains shared memory, registers, and SPs (SP0, SP1, SP2); thread blocks map onto SMs, threads onto SPs; local memory resides in device memory]
  10. MEMORY TYPE (variable : scope : lifetime : location)   variable in kernel : thread : kernel : register   array in kernel : thread : kernel : local   __shared__ in kernel : block : kernel : shared   __device__ : grid : application : global   __constant__ : grid : application : constant
  11. EXPERIMENT GPU
  12. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  13. CUDA EXCEPTION LIST   illegal address   stack overflow   illegal instruction   out-of-range address   misaligned address   invalid address space   invalid PC   warp assert   syscall error   invalid managed memory access
  14. CASLAB SM EXCEPTION LIST   illegal address   stack overflow   illegal instruction   out-of-range address   misaligned address   invalid address space   invalid PC   warp assert   syscall error   invalid managed memory access
  15. INVALID PC   Warp: This occurs when any thread within a warp advances its PC beyond the 40-bit address space
  16. ILLEGAL INSTRUCTION   Warp: This occurs when any thread within a warp has executed an illegal instruction
  17. CASLAB SM EXCEPTION LIST
  18. CUDA EXCEPTION LIST (scope at which each exception is detected)   illegal address : Lane / Warp / Device   stack overflow : Lane / Warp / Device   illegal instruction : Warp   out-of-range address : Warp   misaligned address : Warp / Lane   invalid address space : Warp   invalid PC : Warp   warp assert : Warp   syscall error : Lane   invalid managed memory access : Host thread
  19. ILLEGAL ADDRESS   Device: This occurs when a thread accesses an illegal (out of bounds) global address   Warp: This occurs when a thread accesses an illegal (out of bounds) global/local/shared address   Lane (precise, requires memcheck on): This occurs when a thread accesses an illegal (out of bounds) global address
  20. STACK OVERFLOW   Device: This occurs when the application triggers a global hardware stack overflow; the main cause of this error is large amounts of divergence in the presence of function calls   Warp: This occurs when any thread in a warp triggers a hardware stack overflow   Lane: This occurs when a thread exceeds its stack memory limit
  21. INVALID ADDRESS SPACE   Warp: This occurs when any thread within a warp executes an instruction that accesses a memory space not permitted for that instruction
  22. MISALIGNED ADDRESS   Warp: This occurs when any thread within a warp accesses an address in the local or shared memory segments that is not correctly aligned   Lane: This occurs when a thread accesses a global address that is not correctly aligned
  23. SYSCALL ERROR   Lane: This occurs when a thread corrupts the heap by invoking free with an invalid address (i.e., trying to free the same memory region twice)
  24. INVALID MANAGED MEMORY ACCESS   Host thread: This occurs when a host thread attempts to access managed memory currently used by the GPU
  25. WARP ASSERT   Warp: This occurs when any thread in the warp hits a device-side assertion  #include <assert.h>  __global__ void kernel() { assert(threadIdx.x == 0); }
  26. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  27. CUDA DEBUGGING  1. Kernel Debugging: inspect the flow and state of kernel execution on the fly  2. Memory Debugging: focuses on discovering odd program behavior related to memory accesses
  28. CUDA DEBUGGING  1. Kernel Debugging  2. Memory Debugging
  29. KERNEL DEBUGGING   Three techniques:   cuda-gdb  $ nvcc -g -G foo.cu -o foo  $ cuda-gdb foo   printf   assert
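The printf and assert techniques listed above can be sketched as follows; the kernel and its arguments are illustrative:

```cuda
#include <cstdio>
#include <cassert>

__global__ void check_values(const int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Device-side printf: output is flushed to the host
        // when the kernel finishes
        if (data[i] < 0)
            printf("thread %d read negative value %d\n", i, data[i]);
        // Device-side assert: a failing assertion raises a warp
        // assert exception that cuda-gdb can stop on
        assert(data[i] >= 0);
    }
}
```

Compiling with nvcc -g -G (host and device debug info) lets cuda-gdb set breakpoints inside the kernel and single-step device code.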
  30. CUDA-GDB  Commands: break, print, run, continue, next, step, quit   A CUDA program contains multiple host threads and many CUDA threads   We can use cuda-gdb to report information about the current focus
  31. CUDA INFO / FOCUS  (cuda-gdb) cuda thread lane warp block sm grid device kernel  kernel 1, grid 1027, block (0,0,0), thread (64,0,0), device 0, sm 1, warp 2, lane 0  (cuda-gdb) cuda thread (2)
  32. CUDA DEBUGGING  1. Kernel Debugging  2. Memory Debugging
  33. MEMORY DEBUGGING  $ cuda-memcheck [memcheck_options] app [app_options]   Memcheck tool   Racecheck tool   Initcheck tool   Synccheck tool  Detected error categories: memory access errors, hardware exceptions, malloc/free errors, CUDA API errors, cudaMalloc memory leaks, device heap memory leaks
  34. MEMCHECK TOOL  Checks for out-of-bounds and misaligned accesses in CUDA kernels  $ cuda-memcheck [memcheck_options] app [app_options]
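A minimal out-of-bounds example of the kind memcheck flags; the file name, kernel name, and sizes are illustrative:

```cuda
// oob.cu: 128 threads write into a 100-element allocation
__global__ void write_all(int *data) {
    data[threadIdx.x] = threadIdx.x;   // BUG: no bounds check
}

int main() {
    int *d;
    cudaMalloc(&d, 100 * sizeof(int));
    write_all<<<1, 128>>>(d);   // threads 100..127 write out of bounds
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Compiling with nvcc -lineinfo and running the binary under cuda-memcheck should report an invalid __global__ write for each offending thread, with source line information.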
  35. MEMCHECK TOOL - OUT OF BOUNDS
  39. MEMCHECK TOOL - MISALIGNED
  41. RACECHECK TOOL  Detects shared memory data access hazards that can cause data races  $ cuda-memcheck [memcheck_options] app [app_options]
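A minimal shared memory hazard of the kind racecheck reports; this sketch assumes a 256-thread block, and the names are illustrative:

```cuda
__global__ void reverse_shared(int *out) {
    __shared__ int buf[256];
    buf[threadIdx.x] = threadIdx.x;
    // BUG: a __syncthreads() barrier is missing here, so a thread
    // may read a buf entry before the thread responsible for it
    // has written it (a data access hazard racecheck flags)
    out[threadIdx.x] = buf[255 - threadIdx.x];
}
```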
  42. RACECHECK TOOL - BLOCK  __syncthreads()
  43. RACECHECK TOOL - WARP  __syncwarp()
  44. INITCHECK TOOL  Detects cases where the GPU performs uninitialized accesses to global memory  $ cuda-memcheck [memcheck_options] app [app_options]
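A minimal case initcheck catches: reading device memory that was allocated but never written. The kernel and variable names are illustrative:

```cuda
__global__ void copy_values(int *dst, const int *src) {
    dst[threadIdx.x] = src[threadIdx.x];
}

int main() {
    int *a, *b;
    cudaMalloc(&a, 64 * sizeof(int));   // BUG: a is never initialized
    cudaMalloc(&b, 64 * sizeof(int));
    copy_values<<<1, 64>>>(b, a);       // reads uninitialized global memory
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```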
  45. INITCHECK TOOL  4 * 2 * 128
  46. SYNCCHECK TOOL  Detects cases where the application attempts invalid usage of synchronization primitives  $ cuda-memcheck [memcheck_options] app [app_options]
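A minimal invalid-synchronization example of the kind synccheck reports; the kernel name is illustrative:

```cuda
__global__ void divergent_sync(int *data) {
    if (threadIdx.x < 16) {
        // BUG: __syncthreads() inside a divergent branch — not all
        // threads of the block reach the barrier, which synccheck
        // reports as invalid usage of synchronization
        __syncthreads();
    }
    data[threadIdx.x] = threadIdx.x;
}
```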
  47. SYNCCHECK TOOL
  48. FUTURE WORK  Implement features: a trap handler to handle SM exceptions; rewrite existing features to match hardware behavior more closely  Resolve software compatibility issues and hardware/software communication issues  Feature extension: add new GDB debugging commands
  49. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  Appendix: CUDA Terminology
  50. TERMINOLOGY   Host: the CPU and the system memory   Device: the GPU and its memory   Kernel: a function that executes on the device, composed of several thread blocks (a grid)   SM: Streaming Multiprocessor, composed of several SPs; assigned several thread blocks   SP: Streaming Processor (= CUDA core); executes one thread
  51. TERMINOLOGY   Grid: multiple thread blocks form a grid   Block: several threads grouped together; threads in the same block can be synchronized, and they can communicate with each other via shared memory   Warp: a set of threads that execute the same instruction at the same time   Thread: a CUDA program is executed by many threads; a thread of a warp is called a lane
  52. CUDA GUARANTEES   All threads in a thread block run on the same SM at the same time   All threads in a thread block run on the same SM and may cooperate to solve a sub-problem   Threads in different thread blocks do not cooperate   All blocks in a kernel finish before any blocks from the next kernel run
  53. THANKS
