
CUDA Debugger

CUDA basic debugging research


  1. 2020/10/22 朱玉婷 CUDA Debugger
  2. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  3. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  4. BASIC CONCEPT  [Diagram: relationships between the terms kernel, device, SM, block, warp, lane, and thread, annotated with 1 : 1, 1 : N, and N : 1 mappings]
  5. FUNCTION SPECIFIERS  Denote whether a function executes on the host or on the device, and whether it is callable from the host or from the device   __global__ void kernel ( ) : executes on the device, callable from the host   __device__ void device ( ) : executes on the device, callable from the device   __host__ void main ( ) : executes on the host, callable from the host
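The three specifiers above can be sketched in a minimal program; the function and variable names (square, kernel, d_out) are illustrative:

```cuda
#include <cstdio>

// __device__: executes on the device, callable only from device code
__device__ float square(float x) { return x * x; }

// __global__: executes on the device, callable (launched) from the host
__global__ void kernel(float *out) {
    out[threadIdx.x] = square((float)threadIdx.x);
}

// __host__ (the default when no specifier is given): executes on the host
__host__ int main() {
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    kernel<<<1, 32>>>(d_out);   // the host launches the __global__ kernel
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```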
  6. COMPILING PROCESS   Separate the source code into host code and device code   NVCC continues compiling the device code (to PTX)   The host code is passed to a C++ compiler   Combine them into an executable file  [Diagram: .cu source splits into CPU code (.cpp, C++ compiler) and GPU code (.ptx); the host linker combines them into an executable]
  7. PROGRAMMING MODEL  The GPU is a coprocessor to the CPU, and each has its own memory; a CUDA program contains CPU code and GPU code   Copy data from CPU to GPU   Allocate GPU memory   Launch the kernel on the GPU   Copy data from GPU to CPU  C function vs. CUDA C function: malloc / cudaMalloc, memcpy / cudaMemcpy, memset / cudaMemset, free / cudaFree
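A minimal sketch of the four-step flow and the C / CUDA C function pairs above; the kernel and array names are illustrative:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>

__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;
    size_t bytes = n * sizeof(int);

    int *h = (int *)malloc(bytes);                    // CPU: malloc
    memset(h, 0, bytes);                              // CPU: memset
    int *d;
    cudaMalloc(&d, bytes);                            // GPU: cudaMalloc
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // data: CPU to GPU
    add_one<<<(n + 127) / 128, 128>>>(d, n);          // launch kernel on GPU
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // data: GPU to CPU
    printf("h[0] = %d\n", h[0]);
    cudaFree(d);                                      // GPU: cudaFree
    free(h);                                          // CPU: free
    return 0;
}
```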
  8. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  9. GPU HARDWARE ARCHITECTURE  [Diagram: device memory with texture cache and constant cache; each SM (SM0, SM1) contains shared memory, registers, and SPs (SP0, SP1, SP2); thread blocks map onto SMs, threads onto SPs; local memory resides in device memory]
  10. MEMORY TYPE (variable : scope : lifetime : location)   variable in kernel : thread : kernel : register   array in kernel : thread : kernel : local   __shared__ in kernel : block : kernel : shared   __device__ : grid : application : global   __constant__ : grid : application : constant
  11. EXPERIMENT GPU
  12. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  13. CUDA EXCEPTION LIST   illegal address   stack overflow   illegal instruction   out-of-range address   misaligned address   invalid address space   invalid PC   warp assert   syscall error   invalid managed memory access
  14. CASLAB SM EXCEPTION LIST   illegal address   stack overflow   illegal instruction   out-of-range address   misaligned address   invalid address space   invalid PC   warp assert   syscall error   invalid managed memory access
  15. INVALID PC   Warp: This occurs when any thread within a warp advances its PC beyond the 40-bit address space
  16. ILLEGAL INSTRUCTION   Warp: This occurs when any thread within a warp has executed an illegal instruction
  17. CASLAB SM EXCEPTION LIST
  18. CUDA EXCEPTION LIST (scope at which each exception is detected)   illegal address : Lane / Warp / Device   stack overflow : Lane / Warp / Device   illegal instruction : Warp   out-of-range address : Warp   misaligned address : Warp / Lane   invalid address space : Warp   invalid PC : Warp   warp assert : Warp   syscall error : Lane   invalid managed memory access : Host thread
  19. ILLEGAL ADDRESS   Device: This occurs when a thread accesses an illegal (out of bounds) global address   Warp: This occurs when a thread accesses an illegal (out of bounds) global/local/shared address   Lane (precise, requires memcheck on): This occurs when a thread accesses an illegal (out of bounds) global address
  20. STACK OVERFLOW   Device: This occurs when the application triggers a global hardware stack overflow; the main cause of this error is large amounts of divergence in the presence of function calls   Warp: This occurs when any thread in a warp triggers a hardware stack overflow   Lane: This occurs when a thread exceeds its stack memory limit
  21. INVALID ADDRESS SPACE   Warp: This occurs when any thread within a warp executes an instruction that accesses a memory space not permitted for that instruction
  22. MISALIGNED ADDRESS   Warp: This occurs when any thread within a warp accesses an address in the local or shared memory segments that is not correctly aligned   Lane: This occurs when a thread accesses a global address that is not correctly aligned
  23. SYSCALL ERROR   Lane: This occurs when a thread corrupts the heap by invoking free with an invalid address (i.e., trying to free the same memory region twice)
  24. INVALID MANAGED MEMORY ACCESS   Host thread: This occurs when a host thread attempts to access managed memory currently used by the GPU
  25. WARP ASSERT   Warp: This occurs when any thread in the warp hits a device-side assertion  #include <assert.h>  __global__ void kernel() { assert(threadIdx.x == 0); }
  26. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  CUDA Terminology
  27. CUDA DEBUGGING  1. Kernel Debugging: inspect the flow and state of kernel execution on the fly  2. Memory Debugging: focuses on discovering odd program behavior related to memory accesses
  28. CUDA DEBUGGING  1. Kernel Debugging  2. Memory Debugging
  29. KERNEL DEBUGGING   Three techniques:   cuda-gdb  $ nvcc -g -G foo.cu -o foo  $ cuda-gdb foo   printf   assert
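The printf and assert techniques listed above can be sketched as follows; the kernel and its arguments are illustrative:

```cuda
#include <cstdio>
#include <cassert>

__global__ void check_values(const int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Device-side printf: output is flushed to the host
        // when the kernel finishes
        if (data[i] < 0)
            printf("thread %d read negative value %d\n", i, data[i]);
        // Device-side assert: a failing assertion raises a warp
        // assert exception that cuda-gdb can stop on
        assert(data[i] >= 0);
    }
}
```

Compiling with nvcc -g -G (host and device debug info) lets cuda-gdb set breakpoints inside the kernel and single-step device code.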
  30. CUDA-GDB  Commands: break, print, run, continue, next, step, quit   A CUDA program contains multiple host threads and many CUDA threads   We can use cuda-gdb to report information about the current focus
  31. CUDA INFO / FOCUS  (cuda-gdb) cuda thread lane warp block sm grid device kernel  kernel 1, grid 1027, block (0,0,0), thread (64,0,0), device 0, sm 1, warp 2, lane 0  (cuda-gdb) cuda thread (2)
  32. CUDA DEBUGGING  1. Kernel Debugging  2. Memory Debugging
  33. MEMORY DEBUGGING  $ cuda-memcheck [memcheck_options] app [app_options]   Memcheck tool   Racecheck tool   Initcheck tool   Synccheck tool  Detected error categories: memory access errors, hardware exceptions, malloc/free errors, CUDA API errors, cudaMalloc memory leaks, device heap memory leaks
  34. MEMCHECK TOOL  Checks for out-of-bounds and misaligned accesses in CUDA kernels  $ cuda-memcheck [memcheck_options] app [app_options]
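A minimal out-of-bounds example of the kind memcheck flags; the file name, kernel name, and sizes are illustrative:

```cuda
// oob.cu: 128 threads write into a 100-element allocation
__global__ void write_all(int *data) {
    data[threadIdx.x] = threadIdx.x;   // BUG: no bounds check
}

int main() {
    int *d;
    cudaMalloc(&d, 100 * sizeof(int));
    write_all<<<1, 128>>>(d);   // threads 100..127 write out of bounds
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Compiling with nvcc -lineinfo and running the binary under cuda-memcheck should report an invalid __global__ write for each offending thread, with source line information.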
  35. MEMCHECK TOOL - OUT OF BOUNDS
  39. MEMCHECK TOOL - MISALIGNED
  41. RACECHECK TOOL  Detects shared memory data access hazards that can cause data races  $ cuda-memcheck [memcheck_options] app [app_options]
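A minimal shared memory hazard of the kind racecheck reports; this sketch assumes a 256-thread block, and the names are illustrative:

```cuda
__global__ void reverse_shared(int *out) {
    __shared__ int buf[256];
    buf[threadIdx.x] = threadIdx.x;
    // BUG: a __syncthreads() barrier is missing here, so a thread
    // may read a buf entry before the thread responsible for it
    // has written it (a data access hazard racecheck flags)
    out[threadIdx.x] = buf[255 - threadIdx.x];
}
```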
  42. RACECHECK TOOL - BLOCK  __syncthreads()
  43. RACECHECK TOOL - WARP  __syncwarp()
  44. INITCHECK TOOL  Detects cases where the GPU performs uninitialized accesses to global memory  $ cuda-memcheck [memcheck_options] app [app_options]
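A minimal case initcheck catches: reading device memory that was allocated but never written. The kernel and variable names are illustrative:

```cuda
__global__ void copy_values(int *dst, const int *src) {
    dst[threadIdx.x] = src[threadIdx.x];
}

int main() {
    int *a, *b;
    cudaMalloc(&a, 64 * sizeof(int));   // BUG: a is never initialized
    cudaMalloc(&b, 64 * sizeof(int));
    copy_values<<<1, 64>>>(b, a);       // reads uninitialized global memory
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```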
  45. INITCHECK TOOL  4 * 2 * 128
  46. SYNCCHECK TOOL  Detects cases where the application attempts invalid usage of synchronization primitives  $ cuda-memcheck [memcheck_options] app [app_options]
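A minimal invalid-synchronization example of the kind synccheck reports; the kernel name is illustrative:

```cuda
__global__ void divergent_sync(int *data) {
    if (threadIdx.x < 16) {
        // BUG: __syncthreads() inside a divergent branch — not all
        // threads of the block reach the barrier, which synccheck
        // reports as invalid usage of synchronization
        __syncthreads();
    }
    data[threadIdx.x] = threadIdx.x;
}
```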
  47. SYNCCHECK TOOL
  48. FUTURE WORK  Implement features: a trap handler to handle SM exceptions; rewrite existing features to match hardware behavior more closely  Resolve software compatibility issues and hardware/software communication issues  Feature extension: add new GDB debugging commands
  49. OUTLINE  CUDA Programming and Execution Model  CUDA Memory Architecture  CUDA Exception List  CUDA Debugging  Appendix: CUDA Terminology
  50. TERMINOLOGY   Host: the CPU and the system memory   Device: the GPU and its memory   Kernel: a function that executes on the device, composed of several thread blocks (a grid)   SM: Streaming Multiprocessor, composed of several SPs; assigned several thread blocks   SP: Streaming Processor (= CUDA core); executes one thread
  51. TERMINOLOGY   Grid: multiple thread blocks form a grid   Block: several threads grouped together; threads in the same block can be synchronized, and they can communicate with each other via shared memory   Warp: a set of threads that execute the same instruction at the same time   Thread: a CUDA program is executed by many threads; a thread of a warp is called a lane
  52. CUDA GUARANTEES   All threads in a thread block run on the same SM at the same time   All threads in a thread block run on the same SM and may cooperate to solve a sub-problem   Threads in different thread blocks do not cooperate   All blocks in a kernel finish before any blocks from the next kernel run
  53. THANKS
