Image Processing:
Gaussian smoothing
201301032
Darshan Parsana
Blurring/smoothing
• Mathematically, applying a Gaussian blur to an image is the same as convolving the image with a Gaussian function.
• The Gaussian blur is an image-blurring filter that uses a Gaussian function to compute the transformation applied to each pixel in the image.
• A Gaussian blur takes a weighted average of the pixels around each pixel, with weights that fall off with distance, while a "normal" (box) blur gives every pixel within the radius the same weight.
Gaussian function:
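For reference, the standard two-dimensional Gaussian, in the form given in the Wikipedia article this deck cites, with $\sigma$ the standard deviation and $(x, y)$ the offset from the centre pixel:

```latex
G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right)
```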
How does it work? Kernel type: Gaussian
• Complexity = O(N*r*r), where r is the blur radius and N is the total number of pixels.
• It is a widely used effect in graphics software, typically to reduce image noise and image detail.
• Ref: https://en.wikipedia.org/wiki/Gaussian_blur
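To make the kernel and its cost concrete, here is a minimal C++ sketch that samples a Gaussian on a (2r+1) x (2r+1) grid and counts the multiply-adds behind the O(N*r*r) estimate. The names `makeGaussianKernel` and `multiplyAdds`, and the choice to normalise the weights to sum to 1, are assumptions for illustration and do not come from the slides.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not from the slides): build a (2r+1) x (2r+1)
// Gaussian kernel with standard deviation sigma, normalised to sum to 1.
std::vector<std::vector<double>> makeGaussianKernel(int r, double sigma) {
    const int size = 2 * r + 1;
    std::vector<std::vector<double>> k(size, std::vector<double>(size));
    double sum = 0.0;
    for (int y = -r; y <= r; ++y) {
        for (int x = -r; x <= r; ++x) {
            // Unnormalised Gaussian weight; the constant factor cancels below.
            double w = std::exp(-(x * x + y * y) / (2.0 * sigma * sigma));
            k[y + r][x + r] = w;
            sum += w;
        }
    }
    for (auto& row : k)
        for (auto& w : row) w /= sum;  // weights now sum to 1
    return k;
}

// Multiply-adds for an image of numPixels pixels: one per kernel tap per
// pixel, which is where the O(N*r*r) cost comes from.
long long multiplyAdds(long long numPixels, int r) {
    const long long taps = static_cast<long long>(2 * r + 1) * (2 * r + 1);
    return numPixels * taps;
}
```

With r = 2 the kernel has 25 taps, so a 228*221 image needs 228 * 221 * 25 multiply-adds.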
Examples:
• Input = image, output = image
[Figure: example outputs for blur radius = 1.2, 2.5 and 5.0 pixels]
Serial code
• Complexity = O(N*r*r), where N is the total number of pixels.
• Every output pixel is computed independently, so the parallel code can simply launch one thread per output pixel (as in matrix multiplication).
// Serial 2D convolution with clamped (replicated) borders.
// Mx is the filterWidth x filterWidth integer mask; its weights sum to 273.
int half = filterWidth / 2;
for (int row = 0; row < height; row++) {
    for (int col = 0; col < width; col++) {
        int sum = 0;
        for (int i = -half; i <= half; i++) {
            for (int j = -half; j <= half; j++) {
                // Clamp the neighbour coordinates to the image bounds
                // without modifying the loop variables row and col.
                int r = min(max(0, row + i), height - 1);
                int c = min(max(0, col + j), width - 1);
                int pixel = input[r][c];
                sum += pixel * Mx[i + half][j + half];
            }
        }
        int ans = sum / 273;   // normalise by the mask sum
        if (ans > 255) ans = 255;
        if (ans < 0) ans = 0;
        output[row][col] = ans;
    }
}
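The loop above is a fragment with no declarations. A self-contained version can be sketched as follows, assuming a row-major `std::vector<int>` image and the same 5x5 mask that the deck's CUDA kernels use. The function name `blurSerial` and the flat storage layout are illustrative assumptions, not taken from the slides.

```cpp
#include <algorithm>
#include <vector>

// 5x5 integer Gaussian mask used in the deck's kernels; weights sum to 273.
static const int kMask[5][5] = {
    {1,  4,  7,  4, 1},
    {4, 16, 26, 16, 4},
    {7, 26, 41, 26, 7},
    {4, 16, 26, 16, 4},
    {1,  4,  7,  4, 1}};

// Hypothetical wrapper: blur a row-major image whose pixels are stored as ints.
std::vector<int> blurSerial(const std::vector<int>& in, int width, int height) {
    std::vector<int> out(in.size());
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            int sum = 0;
            for (int i = -2; i <= 2; ++i) {
                for (int j = -2; j <= 2; ++j) {
                    // Clamp neighbour coordinates so border pixels are replicated.
                    int r = std::min(std::max(0, row + i), height - 1);
                    int c = std::min(std::max(0, col + j), width - 1);
                    sum += in[r * width + c] * kMask[i + 2][j + 2];
                }
            }
            // Normalise by the mask sum and clamp to the 0..255 pixel range.
            out[row * width + col] = std::min(255, std::max(0, sum / 273));
        }
    }
    return out;
}
```

Because the mask weights sum to 273, a constant image passes through this function unchanged, which makes a convenient sanity check.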
Serial code: time vs. image size
size          64*64    228*221    749*912
convolution   0.45     4.92       90.027
load          1.01     6.75       70.8
Strategy & Naïve Implementation
• Each thread generates a single output pixel.
• Simple implementation: load the image, launch the kernel, compute the output.
• A block of pixels from the image is loaded into an array in shared memory.
• The filter is loaded into constant memory.
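As a rough illustration of the one-thread-per-output-pixel mapping, the sketch below emulates a launch on the CPU in plain C++ and records how often each pixel is visited. The names `divUp` and `emulateLaunch` are hypothetical, and the loops only mimic the index arithmetic a CUDA thread would do; this is not the deck's launch code.

```cpp
#include <cstddef>
#include <vector>

// Ceiling division: number of blocks of size `block` needed to cover `n` items.
int divUp(int n, int block) { return (n + block - 1) / block; }

// CPU emulation of the one-thread-per-output-pixel mapping.
// Returns how many times each pixel is visited by the emulated threads.
std::vector<int> emulateLaunch(int width, int height, int blockW, int blockH) {
    std::vector<int> visits(static_cast<std::size_t>(width) * height, 0);
    const int gridX = divUp(width, blockW);
    const int gridY = divUp(height, blockH);
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockH; ++ty)
                for (int tx = 0; tx < blockW; ++tx) {
                    // Same index arithmetic a CUDA thread would perform.
                    int row = by * blockH + ty;
                    int col = bx * blockW + tx;
                    if (row >= height || col >= width) continue;  // idle thread
                    ++visits[static_cast<std::size_t>(row) * width + col];
                }
    return visits;
}
```

Taking 228 as the width of the 228*221 test image (an assumption), 16*16 blocks give a grid of 15 x 14 blocks, and every pixel is visited exactly once. The threads that fall outside the image are the ones a kernel has to guard against.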
Parallel code:
(without shared)
Here, block size = 16*16.
__global__ void image(const int *in, int *out, int width, int height)
{
    // 5x5 integer Gaussian mask; the weights sum to 273
    const int Mx[5][5] = {
        { 1,  4,  7,  4, 1 },
        { 4, 16, 26, 16, 4 },
        { 7, 26, 41, 26, 7 },
        { 4, 16, 26, 16, 4 },
        { 1,  4,  7,  4, 1 } };
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads that fall outside the image do nothing
    if (row >= height || col >= width) return;
    // The mask reaches 2 pixels in every direction, so the
    // 2-pixel border of the output is set to 0
    if (row < 2 || row >= height - 2 || col < 2 || col >= width - 2)
    {
        out[row * width + col] = 0;
    }
    else
    {
        int sumX = 0;
        for (int i = -2; i < 3; i++)
        {
            for (int j = -2; j < 3; j++)
            {
                int pixel = in[(row + i) * width + (col + j)];
                sumX += pixel * Mx[i + 2][j + 2];
            }
        }
        // No __syncthreads() is needed: threads share no data here
        int ans = sumX / 273;
        // If the value falls outside the valid pixel range, clamp it
        if (ans > 255) ans = 255;
        if (ans < 0) ans = 0;
        // Save the convolved pixel to the out array
        out[row * width + col] = ans;
    }
}
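Parallel code:
(shared)
• Uses constant memory for the mask and shared memory for the image tile.
• Shared-memory tile = block size = 16*16.
// R is the mask radius (2 for the 5x5 mask); BLOCK_W and BLOCK_H are the block dimensions.
// TILE_W and TILE_H are the output tile dimensions. Only the inner threads write output,
// so TILE_W = BLOCK_W - 2*R and TILE_H = BLOCK_H - 2*R are needed for the tiles to cover
// the image without gaps (an assumption: the slide does not show these definitions).
// The mask lives in constant memory, which must be declared at file scope.
__constant__ int Mx[5][5] = {
    { 1,  4,  7,  4, 1 },
    { 4, 16, 26, 16, 4 },
    { 7, 26, 41, 26, 7 },
    { 4, 16, 26, 16, 4 },
    { 1,  4,  7,  4, 1 } };
//kernel
__global__ void image(const int *in, int *out, int width, int height)
{
    __shared__ int smem[BLOCK_W * BLOCK_H];
    // Each thread loads one pixel of the tile, including the R-pixel halo
    int x = blockIdx.x * TILE_W + threadIdx.x - R;
    int y = blockIdx.y * TILE_H + threadIdx.y - R;
    // Remember whether this thread maps to a real pixel before clamping
    bool inside = (x >= 0) && (x < width) && (y >= 0) && (y < height);
    x = min(max(0, x), width - 1);
    y = min(max(0, y), height - 1);
    int index = y * width + x;
    int bindex = threadIdx.y * BLOCK_W + threadIdx.x;
    smem[bindex] = in[index];
    __syncthreads();
    // Only the interior threads of the block produce an output pixel
    if (inside && (threadIdx.x >= R) && (threadIdx.x < (BLOCK_W - R)) &&
        (threadIdx.y >= R) && (threadIdx.y < (BLOCK_H - R)))
    {
        int sum = 0;
        for (int dy = -R; dy <= R; dy++) {
            for (int dx = -R; dx <= R; dx++) {
                int i = smem[bindex + (dy * BLOCK_W) + dx];
                sum += Mx[dy + R][dx + R] * i;
            }
        }
        out[index] = sum / 273;
    }
}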
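The halo scheme in this kernel means a block loads more pixels than it writes. The following host-side C++ sketch works out that geometry under stated assumptions: square blocks, an output tile of (block side - 2*r) pixels per side, and a block side larger than 2*r. The struct and function names are hypothetical and do not appear in the slides.

```cpp
// Geometry of a halo tile: a blockSide x blockSide block loads blockSide^2
// pixels, but only the inner (blockSide - 2*r)^2 threads write an output pixel.
struct TileGeometry {
    int tile;       // output pixels per block side
    int blocksX;    // blocks needed across the image width
    int blocksY;    // blocks needed down the image height
    double useful;  // fraction of threads that compute an output pixel
};

// Hypothetical helper; assumes square blocks and blockSide > 2*r.
TileGeometry tileGeometry(int width, int height, int blockSide, int r) {
    TileGeometry g;
    g.tile = blockSide - 2 * r;
    g.blocksX = (width + g.tile - 1) / g.tile;   // ceiling division
    g.blocksY = (height + g.tile - 1) / g.tile;
    g.useful = static_cast<double>(g.tile) * g.tile / (blockSide * blockSide);
    return g;
}
```

With 16*16 blocks and r = 2, each block produces a 12*12 output tile, so 144 of its 256 threads (56.25%) compute a pixel while the rest only load halo data.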
Comparison: effect of block size / tile size on time
Fixed input size: 228*221
block size       4*4     8*8     16*16    32*32
without shared   0.28    0.16    0.14     0.167
shared           0.08    0.07    0.064    0.081
Time vs. image size, without shared memory
size          64*64     228*221    749*912
load          0.03      0.176      1.89
convolution   0.0649    0.1453     1.93

Time vs. image size, with shared memory
size          64*64    228*221    749*912
load          0.03     0.181      1.88
convolution   0.05     0.064      0.2658
Speed up:
size             64*64    228*221    749*912
without shared   15       33.86      46.65
shared           15       76.875     338.7
The graph shows that using shared memory improves performance, and that the gain grows with image size.
Conclusion
• Using shared and constant memory gives a much larger speedup than the naïve kernel (here up to ~7x faster convolution, on the largest image, 749*912).
Thank you
