
c++ - CUDA convolution mapping

I'm trying to copy, for each block of threads, a patch of the image and its apron to shared memory.

After my data are copied to shared memory (I used a matrix), I want a relation that maps the center of the mask I use for the convolution in shared memory to the center of the mask in the image buffer.

I want that because when I try to do the convolution of the image, the center of the mask in shared memory does not seem to correspond to the center in the image buffer stored in global memory.

In the code below I write an example of a simple black-and-white image erosion algorithm; when I write the result of the convolution to the output image, the centers do not seem to correspond.

I wrote my sample using a 512x512 px image.

My settings are:

//block and grid size
dim3 block(16,16);
dim3 grid(512/(block.x),512/(block.y),1);
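This exact division works because 512 is a multiple of 16. As a generalization (an addition for illustration, not part of the original code), the same configuration written with ceiling division would also cover image sizes that are not multiples of the block size:

// Host-side launch configuration with ceiling division, so the grid also
// covers images whose sides are not exact multiples of the block size.
// (Only an illustration; the original 512x512 case does not need it.)
int width  = 512;
int height = 512;
dim3 block(16, 16);
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y,
          1);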

This is my kernel:

#define STREL_SIZE 5

#define TILE_W 16
#define TILE_H 16

#define R (STREL_SIZE/2)

//size of tile image + apron
#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))


 __global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){

    // Data cache: threadIdx.x , threadIdx.y
    __shared__ unsigned char data[TILE_W +STREL_SIZE  ][TILE_H +STREL_SIZE ];


     int col = blockIdx.x * blockDim.x + threadIdx.x;
     int row = blockIdx.y * blockDim.y + threadIdx.y;

     // global mem address of this thread
     int gLoc =  row*w +col;


     int x, y;  // image based coordinate



     if((col<w)&&(row<h)) {
         data[threadIdx.x][threadIdx.y]=buffer_in[gLoc];

     if (threadIdx.y > (h-STREL_SIZE))
          data[threadIdx.x][threadIdx.y + STREL_SIZE]=buffer_in[gLoc + STREL_SIZE];

     if (threadIdx.x >(w-STREL_SIZE))
          data[threadIdx.x + STREL_SIZE][threadIdx.y]=buffer_in[gLoc+STREL_SIZE];

     if ((threadIdx.x >(w-STREL_SIZE)) && (threadIdx.y > (h-STREL_SIZE)))
          data[threadIdx.x+STREL_SIZE][threadIdx.y+STREL_SIZE] =     buffer_in[gLoc+2*STREL_SIZE];

     //wait for all threads to finish read
     __syncthreads();

      unsigned char min_value = 255;
      for(x=0;x<STREL_SIZE;x++){
          for(y=0;y<STREL_SIZE;y++){
              min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
              }

          }
      buffer_out[gLoc]= min_value;
      }
}

My input image:

[input image]

The output of the kernel is:

[output image]

where w is the width of the image and is equal to 512, and h is the height of the image and is equal to 512.

I call the kernel with:

 erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);

dimage_src is the input image, an array buffer (not a matrix), and dimage_dst is the output image, also a buffer.

Each buffer has a size of nElem * nImg * sizeof(unsigned char), where nElem = 512*512 is the size of one image in pixels and nImg is the number of images I want to process, which in my case is equal to 1. Where am I wrong?
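For completeness, a minimal host-side sketch of how these buffers could be allocated and the kernel launched; the host pointer names, the calloc/malloc calls and the absence of error checking are assumptions for illustration, while the sizes and the launch configuration follow the description above:

#include <cuda_runtime.h>
#include <stdlib.h>

// forward declaration of the v2 kernel shown above
__global__ void erode_multiple_img_SM_v2(unsigned char *buffer_in,
                                         unsigned char *buffer_out,
                                         int w, int h);

int main() {
    const int w = 512, h = 512;
    const int nElem = w * h;              // pixels per image
    const int nImg  = 1;                  // number of images to process
    const size_t bytes = (size_t)nElem * nImg * sizeof(unsigned char);

    // hypothetical host buffers; in practice they hold the image data
    unsigned char *h_src = (unsigned char *)calloc(bytes, 1);
    unsigned char *h_dst = (unsigned char *)malloc(bytes);

    unsigned char *dimage_src = NULL, *dimage_dst = NULL;
    cudaMalloc(&dimage_src, bytes);
    cudaMalloc(&dimage_dst, bytes);
    cudaMemcpy(dimage_src, h_src, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid(w / block.x, h / block.y, 1);
    erode_multiple_img_SM_v2<<<grid, block>>>(dimage_src, dimage_dst, w, h);

    cudaMemcpy(h_dst, dimage_dst, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dimage_src);
    cudaFree(dimage_dst);
    free(h_src);
    free(h_dst);
    return 0;
}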

CODE UPDATE:

__global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){

// Data cache: threadIdx.x , threadIdx.y
__shared__ unsigned char data[TILE_W + STREL_SIZE-1 ][TILE_H + STREL_SIZE-1 ];

// global mem address of this thread
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;



int gLoc =  row*w +col;



// each threads loads four values from global memory into shared mem
int x, y;   // image based coordinate



if((col<w)&&(row<h)) {

    data[threadIdx.x][threadIdx.y] = buffer_in[gLoc];

     if (threadIdx.y > (TILE_H-STREL_SIZE+1))
          data[threadIdx.x][threadIdx.y + STREL_SIZE-1]=buffer_in[(row + STREL_SIZE-1)*w + col];

     if (threadIdx.x > (TILE_W-STREL_SIZE+1))
           data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = buffer_in[row*w+col + STREL_SIZE-1];

     if ((threadIdx.x > (TILE_W-STREL_SIZE+1)) && (threadIdx.y > (TILE_H-STREL_SIZE+1)))
           data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1];

    //wait for all threads to finish read
     __syncthreads();



      unsigned char min_value = 255;
      for(x=0;x<STREL_SIZE;x++){
          for(y=0;y<STREL_SIZE;y++){
              min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
              }

          }
      buffer_out[gLoc]= min_value;
      }

    }

My output now is:

[new output image]

UPDATE 2 (version 2, working):

I have implemented another version of the algorithm. To do that I followed that slide deck, which I found very useful and well explained, in particular the part in which the author talks about convolution (slide 27).

I changed the block and grid settings to:

dim3 block(20,20);
dim3 grid(512/(block.x)+ block.x,512/(block.y)+block.y);

The kernel call instead remains the same:

erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);

where the arguments of the kernel are:

  1. dimage_src: a buffer of unsigned char with size height x width that contains the input image.
  2. dimage_dst: a buffer of unsigned char with size height x width that contains the output image that my kernel produces.
  3. 512: the third argument is the width of the image.
  4. 512: the fourth argument is the height of the image.

Remember that my sample images are black and white, but this version of the erosion can work with grayscale too.
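As a sanity check, a minimal CPU reference for a 5x5 grayscale erosion might look like the sketch below; the function name is hypothetical, and pixels outside the image are treated as 0, which is intended to match the zero padding the kernel applies to its apron:

#include <algorithm>

// CPU reference for a 5x5 erosion (minimum filter). Out-of-image pixels
// are treated as 0, matching the zero padding of the shared-memory apron.
void erode_cpu_reference(const unsigned char *in, unsigned char *out,
                         int w, int h) {
    const int strel = 5;
    const int r = strel / 2;
    for (int row = 0; row < h; ++row) {
        for (int col = 0; col < w; ++col) {
            unsigned char min_value = 255;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    const int y = row + dy;
                    const int x = col + dx;
                    const unsigned char v =
                        (y >= 0 && y < h && x >= 0 && x < w) ? in[y * w + x] : 0;
                    min_value = std::min(min_value, v);
                }
            }
            out[row * w + col] = min_value;
        }
    }
}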

Here is my working kernel:

#define STREL_W 5
#define STREL_H 5

#define STREL_SIZE 5


#define TILE_W 16
#define TILE_H 16

#define R (STREL_SIZE/2)


#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))

__global__ void erode_multiple_img_working(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){


    __shared__ unsigned char fast_acc_mat[BLOCK_W][BLOCK_H];

    int ty = threadIdx.y;
    int tx = threadIdx.x;


    int row_o = blockIdx.y * TILE_H + ty;
    int col_o = blockIdx.x * TILE_W + tx;


    int row_i = row_o - R;
    int col_i = col_o - R;

    //in of img size
    if((row_i >= 0) && (row_i < h) && (col_i >= 0) && (col_i < w) ){

        fast_acc_mat[ty][tx] = buffer_in[ row_i * w + col_i];

    }
    else{

        fast_acc_mat[ty][tx] = 0;

    }


    __syncthreads();





    if( ty < TILE_H && tx < TILE_W ){

        unsigned char min_val=255;
        for(int i = 0; i < STREL_SIZE; i++) {
            for(int j = 0; j < STREL_SIZE; j++) {

                min_val = min( fast_acc_mat[i+ty][j+tx] , min_val );

            }
        }
        if(row_o < h && col_o < w)
                buffer_out[row_o * w + col_o] = min_val;

        }

     }

And this is my eroded image (output):

[eroded output image]

I made a diagram that shows how the part of the algorithm described by Eric loads the pixels of a tile into shared memory:

[tile loading scheme diagram]
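In code-comment form, the mapping between the shared-memory tile and the global image buffer that this scheme implements (a restatement of the working kernel above, nothing new) is:

// Index mapping implemented by erode_multiple_img_working,
// with R = STREL_SIZE/2 = 2:
//
//   threads per block : BLOCK_W x BLOCK_H = 20 x 20
//   output tile       : TILE_W  x TILE_H  = 16 x 16 pixels
//
//   output pixel      : row_o = blockIdx.y * 16 + ty
//                       col_o = blockIdx.x * 16 + tx
//   loaded pixel      : row_i = row_o - R,  col_i = col_o - R
//   shared cell       : fast_acc_mat[ty][tx]  <-  buffer_in[row_i * w + col_i]
//                       (or 0 when (row_i, col_i) falls outside the image)
//
//   Only threads with ty < TILE_H and tx < TILE_W write an output pixel,
//   and for them the 5x5 window fast_acc_mat[ty .. ty+4][tx .. tx+4]
//   covers global rows row_o - R .. row_o + R and columns
//   col_o - R .. col_o + R, i.e. the mask is centred on the output pixel.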


1 Reply


You need only [20][20] shared memory, rather than [21][21]. It should be changed to:

__shared__ unsigned char data[TILE_W + STREL_SIZE-1][TILE_H + STREL_SIZE-1];

Another problem is the data loading. The correct way is to read (16+4) x (16+4) pixels from the input into shared memory, using (16 x 16) threads collaboratively. This can be divided into 4 parts:

1) First part: threads (0:15, 0:15) load pixels (0:15, 0:15)

2) Second part: threads (0:15, 12:15) load pixels (0:15, 16:19)

3) Third part: threads (12:15, 0:15) load pixels (16:19, 0:15)

4) Fourth part: threads (12:15, 12:15) load pixels (16:19, 16:19)

But in your code you are messing up the indexing. For parts 2-4, only some of the threads in the thread block will be working, and additional boundary checking is also required.

For the 2nd part, you should use threads (0:15, 12:15) to read pixels (0:15, 16:19) as:

 if (threadIdx.y > (TILE_H-STREL_SIZE))
      data[threadIdx.x][threadIdx.y + STREL_SIZE-1] = row + STREL_SIZE-1<h ? buffer_in[(row + STREL_SIZE-1)*w + col] : 0;

The 3rd and 4th parts require similar modifications:

 if (threadIdx.x > (TILE_W-STREL_SIZE))
      data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = col + STREL_SIZE-1<w ? buffer_in[row*w+col + STREL_SIZE-1] : 0;

 if ((threadIdx.x > (TILE_W-STREL_SIZE)) && (threadIdx.y > (TILE_H-STREL_SIZE)))
      data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = (row + STREL_SIZE-1<h && col + STREL_SIZE-1<w) ? buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1] : 0;

Then you should be able to get the correct result image, although there will be a 2x2 pixel shift, because you do the convolution on (0...4, 0...4) rather than (-2...2, -2...2).
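One way to remove that shift, while keeping the (16+4) x (16+4) loading scheme with (16 x 16) threads, is to offset every global read by -R so that the 5x5 window ends up centred on the output pixel. The sketch below only illustrates that idea and is not code from the question or the answer; the read_or_zero helper and the zero padding at the image borders are assumptions. It would be launched with the same grid(512/16, 512/16) and block(16,16) configuration as the original kernel.

#define STREL_SIZE 5
#define TILE_W 16
#define TILE_H 16
#define R (STREL_SIZE/2)

// Returns the pixel at (r, c), or 0 when (r, c) falls outside the image.
__device__ unsigned char read_or_zero(const unsigned char *img,
                                      int r, int c, int w, int h)
{
    return (r >= 0 && r < h && c >= 0 && c < w) ? img[r * w + c] : 0;
}

__global__ void erode_centered_sketch(const unsigned char *buffer_in,
                                      unsigned char *buffer_out,
                                      int w, int h)
{
    __shared__ unsigned char data[TILE_W + STREL_SIZE - 1][TILE_H + STREL_SIZE - 1];

    int col = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // output row

    // Every global read is shifted by -R, so data[tx][ty] holds the pixel
    // R rows/columns up and to the left of (row, col).
    data[threadIdx.x][threadIdx.y] =
        read_or_zero(buffer_in, row - R, col - R, w, h);

    if (threadIdx.y > TILE_H - STREL_SIZE)             // bottom apron rows
        data[threadIdx.x][threadIdx.y + STREL_SIZE - 1] =
            read_or_zero(buffer_in, row - R + STREL_SIZE - 1, col - R, w, h);

    if (threadIdx.x > TILE_W - STREL_SIZE)             // right apron columns
        data[threadIdx.x + STREL_SIZE - 1][threadIdx.y] =
            read_or_zero(buffer_in, row - R, col - R + STREL_SIZE - 1, w, h);

    if (threadIdx.x > TILE_W - STREL_SIZE && threadIdx.y > TILE_H - STREL_SIZE)
        data[threadIdx.x + STREL_SIZE - 1][threadIdx.y + STREL_SIZE - 1] =
            read_or_zero(buffer_in, row - R + STREL_SIZE - 1,
                         col - R + STREL_SIZE - 1, w, h);

    __syncthreads();

    if (col < w && row < h) {
        unsigned char min_value = 255;
        for (int x = 0; x < STREL_SIZE; x++)
            for (int y = 0; y < STREL_SIZE; y++)
                min_value = min(data[threadIdx.x + x][threadIdx.y + y],
                                min_value);
        // The window now covers global rows row-2 .. row+2 and columns
        // col-2 .. col+2, i.e. it is centred on the output pixel.
        buffer_out[row * w + col] = min_value;
    }
}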

For more details, you could read

http://igm.univ-mlv.fr/~biri/Enseignement/MII2/Donnees/convolutionSeparable.pdf

https://www.evl.uic.edu/sjames/cs525/final.html

