i'm trying to copy for each block of threads a patch of image and relative apron to shared memory.
After my data are copyied(i used a matrix) to shared memory, i want a relations that map the center of the mask in shared memory that i consider for convolution and the center of the mask in the image buffer.
I want that because if i try to do convolution of image seems that the center of the mask in shared memory doesn't correspond to the center in the image buffer stored in global memory.
In the code below i write an example of simple image black and white erosion algorithm , when i put the result of a convolution to the output image seems that the center not corresponds.
i write my sample using a 512x512px image
my settings are:
//block and grid size
dim3 block(16,16);
dim3 grid(512/(block.x),512/(block.y),1);
this is my kernel:
#define STREL_SIZE 5
#define TILE_W 16
#define TILE_H 16
#define R (STREL_SIZE/2)
//size of tile image + apron
#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))
__global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
unsigned char * buffer_out,
int w,int h ){
// Data cache: threadIdx.x , threadIdx.y
__shared__ unsigned char data[TILE_W +STREL_SIZE ][TILE_H +STREL_SIZE ];
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
// global mem address of this thread
int gLoc = row*w +col;
int x, y; // image based coordinate
if((col<w)&&(row<h)) {
data[threadIdx.x][threadIdx.y]=buffer_in[gLoc];
if (threadIdx.y > (h-STREL_SIZE))
data[threadIdx.x][threadIdx.y + STREL_SIZE]=buffer_in[gLoc + STREL_SIZE];
if (threadIdx.x >(w-STREL_SIZE))
data[threadIdx.x + STREL_SIZE][threadIdx.y]=buffer_in[gLoc+STREL_SIZE];
if ((threadIdx.x >(w-STREL_SIZE)) && (threadIdx.y > (h-STREL_SIZE)))
data[threadIdx.x+STREL_SIZE][threadIdx.y+STREL_SIZE] = buffer_in[gLoc+2*STREL_SIZE];
//wait for all threads to finish read
__syncthreads();
unsigned char min_value = 255;
for(x=0;x<STREL_SIZE;x++){
for(y=0;y<STREL_SIZE;y++){
min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
}
}
buffer_out[gLoc]= min_value;
}
}
my input image:
my the output of the kernel is:
where w is the width of image,and is equal 512,
where h is the height of image,and is equal 512.
i call the kernel with:
erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);
the dimage_src is the input image an array buffer not a matrix, and dimage_dst is the output image a buffer.
each buffer have the size of nElem * nImg * sizeof(unsigned char) where nElem=512*512 is the size of the buffer and nImg is the number of image that i want processing in my case is equal to 1.
where i'm wrong?
CODE UPDATE:
__global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
unsigned char * buffer_out,
int w,int h ){
// Data cache: threadIdx.x , threadIdx.y
__shared__ unsigned char data[TILE_W + STREL_SIZE-1 ][TILE_H + STREL_SIZE-1 ];
// global mem address of this thread
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int gLoc = row*w +col;
// each threads loads four values from global memory into shared mem
int x, y; // image based coordinate
if((col<w)&&(row<h)) {
data[threadIdx.x][threadIdx.y] = buffer_in[gLoc];
if (threadIdx.y > (TILE_H-STREL_SIZE+1))
data[threadIdx.x][threadIdx.y + STREL_SIZE-1]=buffer_in[(row + STREL_SIZE-1)*w + col];
if (threadIdx.x > (TILE_W-STREL_SIZE+1))
data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = buffer_in[row*w+col + STREL_SIZE-1];
if ((threadIdx.x > (TILE_W-STREL_SIZE+1)) && (threadIdx.y > (TILE_H-STREL_SIZE+1)))
data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1];
//wait for all threads to finish read
__syncthreads();
unsigned char min_value = 255;
for(x=0;x<STREL_SIZE;x++){
for(y=0;y<STREL_SIZE;y++){
min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
}
}
buffer_out[gLoc]= min_value;
}
}
my output now is:
UPDATED 2(version 2 -working-):
i have implemented another version of algorithm.To do that i follow that slide that i found very useful and well explained,in particular the part in wich the author talk about convolution slide 27.
i change the block and grid settings to :
dim3 block(20,20);
dim3 grid(512/(block.x)+ block.x,512/(block.y)+block.y);
the kernel call instead ramain the same:
erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);
where the argument of the kernel are:
- dimage_src: buffer of unsigned char with size height x width that contain input image.
- dimage_dst:**buffer of unsigned char with size **height x width that contain output image, that my kernel produced.
- 512: the third argument is the width of the image.
- 512: the fourth argument is the height of the image.
remember my image sample are black and white but this version of erosion can work with grayscale too.
here my working Kernel:
#define STREL_W 5
#define STREL_H 5
#define STREL_SIZE 5
#define TILE_W 16
#define TILE_H 16
#define R (STREL_SIZE/2)
#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))
__global__ void erode_multiple_img_working(unsigned char * buffer_in,
unsigned char * buffer_out,
int w,int h ){
__shared__ unsigned char fast_acc_mat[BLOCK_w][BLOCK_H];
int ty = threadIdx.y;
int tx = threadIdx.x;
int row_o = blockIdx.y * TILE_W + ty;
int col_o = blockIdx.x * TILE_H + tx;
int row_i = row_o - R;
int col_i = col_o - R;
//in of img size
if((row_i >= 0) && (row_i < h) && (col_i >= 0) && (col_i < w) ){
fast_acc_mat[ty][tx] = buffer_in[ row_i * w + col_i];
}
else{
fast_acc_mat[ty][tx] = 0;
}
__syncthreads();
if( ty < TILE_H && tx < TILE_W ){
unsigned char min_val=255;
for(int i = 0; i < STREL_SIZE; i++) {
for(int j = 0; j < STREL_SIZE; j++) {
min_val = min( fast_acc_mat[i+ty][j+tx] , min_val );
}
}
if(row_o < h && col_o < w)
buffer_out[row_o * w + col_o] = min_val;
}
}
and this is my eroded image(output):
I realized a scheme that show how the part of the algorithm described by Eric load pixel of a TILE in shared memory :
See Question&Answers more detail:
os