c++ - cuda convolution mapping

Question

Welcome To Ask or Share your Answers For Others

c++ - cuda convolution mapping

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

c++ - cuda convolution mapping

i'm trying to copy for each block of threads a patch of image and relative apron to shared memory.

After my data are copyied(i used a matrix) to shared memory, i want a relations that map the center of the mask in shared memory that i consider for convolution and the center of the mask in the image buffer.

I want that because if i try to do convolution of image seems that the center of the mask in shared memory doesn't correspond to the center in the image buffer stored in global memory.

In the code below i write an example of simple image black and white erosion algorithm , when i put the result of a convolution to the output image seems that the center not corresponds.

i write my sample using a 512x512px image

my settings are:

//block and grid size
dim3 block(16,16);
dim3 grid(512/(block.x),512/(block.y),1);

this is my kernel:

#define STREL_SIZE 5

#define TILE_W 16
#define TILE_H 16

#define R (STREL_SIZE/2)

//size of tile image + apron
#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))


 __global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){

    // Data cache: threadIdx.x , threadIdx.y
    __shared__ unsigned char data[TILE_W +STREL_SIZE  ][TILE_H +STREL_SIZE ];


     int col = blockIdx.x * blockDim.x + threadIdx.x;
     int row = blockIdx.y * blockDim.y + threadIdx.y;

     // global mem address of this thread
     int gLoc =  row*w +col;


     int x, y;  // image based coordinate



     if((col<w)&&(row<h)) {
         data[threadIdx.x][threadIdx.y]=buffer_in[gLoc];

     if (threadIdx.y > (h-STREL_SIZE))
          data[threadIdx.x][threadIdx.y + STREL_SIZE]=buffer_in[gLoc + STREL_SIZE];

     if (threadIdx.x >(w-STREL_SIZE))
          data[threadIdx.x + STREL_SIZE][threadIdx.y]=buffer_in[gLoc+STREL_SIZE];

     if ((threadIdx.x >(w-STREL_SIZE)) && (threadIdx.y > (h-STREL_SIZE)))
          data[threadIdx.x+STREL_SIZE][threadIdx.y+STREL_SIZE] =     buffer_in[gLoc+2*STREL_SIZE];

     //wait for all threads to finish read
     __syncthreads();

      unsigned char min_value = 255;
      for(x=0;x<STREL_SIZE;x++){
          for(y=0;y<STREL_SIZE;y++){
              min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
              }

          }
      buffer_out[gLoc]= min_value;
      }
}

my input image:

my the output of the kernel is:

where w is the width of image,and is equal 512, where h is the height of image,and is equal 512.

i call the kernel with:

 erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);

the dimage_src is the input image an array buffer not a matrix, and dimage_dst is the output image a buffer.

each buffer have the size of nElem * nImg * sizeof(unsigned char) where nElem=512*512 is the size of the buffer and nImg is the number of image that i want processing in my case is equal to 1. where i'm wrong?

CODE UPDATE:

__global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){

// Data cache: threadIdx.x , threadIdx.y
__shared__ unsigned char data[TILE_W + STREL_SIZE-1 ][TILE_H + STREL_SIZE-1 ];

// global mem address of this thread
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;



int gLoc =  row*w +col;



// each threads loads four values from global memory into shared mem
int x, y;   // image based coordinate



if((col<w)&&(row<h)) {

    data[threadIdx.x][threadIdx.y] = buffer_in[gLoc];

     if (threadIdx.y > (TILE_H-STREL_SIZE+1))
          data[threadIdx.x][threadIdx.y + STREL_SIZE-1]=buffer_in[(row + STREL_SIZE-1)*w + col];

     if (threadIdx.x > (TILE_W-STREL_SIZE+1))
           data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = buffer_in[row*w+col + STREL_SIZE-1];

     if ((threadIdx.x > (TILE_W-STREL_SIZE+1)) && (threadIdx.y > (TILE_H-STREL_SIZE+1)))
           data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1];

    //wait for all threads to finish read
     __syncthreads();



      unsigned char min_value = 255;
      for(x=0;x<STREL_SIZE;x++){
          for(y=0;y<STREL_SIZE;y++){
              min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
              }

          }
      buffer_out[gLoc]= min_value;
      }

    }

my output now is:

UPDATED 2(version 2 -working-):

i have implemented another version of algorithm.To do that i follow that slide that i found very useful and well explained,in particular the part in wich the author talk about convolution slide 27.

i change the block and grid settings to :

dim3 block(20,20);
dim3 grid(512/(block.x)+ block.x,512/(block.y)+block.y);

the kernel call instead ramain the same:

erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);

where the argument of the kernel are:

dimage_src: buffer of unsigned char with size height x width that contain input image.
dimage_dst:**buffer of unsigned char with size **height x width that contain output image, that my kernel produced.
512: the third argument is the width of the image.
512: the fourth argument is the height of the image.

remember my image sample are black and white but this version of erosion can work with grayscale too.

here my working Kernel:

#define STREL_W 5
#define STREL_H 5

#define STREL_SIZE 5


#define TILE_W 16
#define TILE_H 16

#define R (STREL_SIZE/2)


#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))

__global__ void erode_multiple_img_working(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){


    __shared__ unsigned char fast_acc_mat[BLOCK_w][BLOCK_H];

    int ty = threadIdx.y;
    int tx = threadIdx.x;


    int row_o = blockIdx.y * TILE_W + ty;
    int col_o = blockIdx.x * TILE_H + tx;


    int row_i = row_o - R;
    int col_i = col_o - R;

    //in of img size
    if((row_i >= 0) && (row_i < h) && (col_i >= 0) && (col_i < w) ){

        fast_acc_mat[ty][tx] = buffer_in[ row_i * w + col_i];

    }
    else{

        fast_acc_mat[ty][tx] = 0;

    }


    __syncthreads();





    if( ty < TILE_H && tx < TILE_W ){

        unsigned char min_val=255;
        for(int i = 0; i < STREL_SIZE; i++) {
            for(int j = 0; j < STREL_SIZE; j++) {

                min_val = min( fast_acc_mat[i+ty][j+tx] , min_val );

            }
        }
        if(row_o < h && col_o < w)
                buffer_out[row_o * w + col_o] = min_val;

        }

     }

and this is my eroded image(output):

I realized a scheme that show how the part of the algorithm described by Eric load pixel of a TILE in shared memory :

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:21:33+0000

You need only [20][20] shared mem, rather than [21][21]. It should be changed to

__shared__ unsigned char data[TILE_W + STREL_SIZE-1][TILE_H + STREL_SIZE-1];

Another problem is the data loading. The correct way is to read (16+4) x (16+4) pixels from input to share memory, using (16 x 16) threads collaboratively. This can be divided into 4 parts:

1)first part: thread(0:15, 0:15) load pixels (0:15,0:15)

2)second part: thread(0:15,12:15) load pixels (0:15, 16:19)

3)third part: thread(12:15,0:15) load pixels (16:19,0:15)

4)fourth part: thread(12:15,12:15) load pixels (16:19,16:19)

But in your code you are messing up the indexing. For part 2~4, only some of the threads in the thread block will be working, and additional boundary checking is also required.

For the 2nd part, you should use thread(0:15, 12:15) to read pixel(0:15, 16:19) as

 if (threadIdx.y > (TILE_H-STREL_SIZE))
      data[threadIdx.x][threadIdx.y + STREL_SIZE-1] = row + STREL_SIZE-1<h ? buffer_in[(row + STREL_SIZE-1)*w + col] : 0;

The 3rd and the 4th part require similar modifications as

 if (threadIdx.x > (TILE_W-STREL_SIZE))
      data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = col + STREL_SIZE-1<w ? buffer_in[row*w+col + STREL_SIZE-1] : 0;

 if ((threadIdx.x > (TILE_W-STREL_SIZE)) && (threadIdx.y > (TILE_H-STREL_SIZE)))
      data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = (row + STREL_SIZE-1<h && col + STREL_SIZE-1<w) ? buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1] : 0;

Then you should be able to get the correct result image, although there will be 2x2 pixel shift, because you do the convolution on (0...4, 0...4) rather than (-2. .2, -2...2).

For more details, you could read

http://igm.univ-mlv.fr/~biri/Enseignement/MII2/Donnees/convolutionSeparable.pdf

https://www.evl.uic.edu/sjames/cs525/final.html

Categories

c++ - cuda convolution mapping

c++ - cuda convolution mapping

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags