image processing - What's the advantage of the local memory in OpenCL?

I'm wondering what the advantage of local memory is in OpenCL. Since global memory can be accessed by each work-item separately and freely, can't we just use global memory?

For example, say we have a 1000*1000 image and we want to add 1 to every pixel value. Can't we just use a 1000*1000 buffer in global memory?

Would it be faster if we used local memory and split the 1000*1000 image into 100 parts of 100*100 each?

I would really appreciate it if you could show me some simple code that uses local memory.



1 Reply


Can't we just use global memory?

Of course you can. First write actual working code, then optimize.
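For the "add 1 to every pixel" example in the question, a plain global-memory kernel is really all you need, since each pixel is read and written exactly once and there is nothing for local memory to re-use. A minimal sketch (assuming the image is stored as a flat uchar buffer and launched with one work-item per pixel):

__kernel void addOne(__global uchar *img)
{
    int i = get_global_id(0);   // launch with a global size of 1000*1000
    img[i] = img[i] + 1;        // each pixel is touched exactly once
}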

Since global memory can be accessed by each work-item separately and freely

I'm not sure whether all architectures have a broadcasting ability, but I am sure that if memory is accessed randomly by all threads, it gets too slow. Ray tracing is an example: each pixel's ray refracts/reflects to different distances and different memory areas, and that is a performance hit. If every thread accessed global memory in a uniform way, it would be much faster.
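To illustrate that difference, here is a sketch of my own (not part of the original answer): the first kernel reads consecutive addresses from consecutive work-items (coalesced), while the second follows data-dependent indices the way ray hits do, which breaks coalescing and wastes bandwidth.

__kernel void copyCoalesced(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];              // neighbouring work-items read neighbouring addresses
}

__kernel void copyScattered(__global const float *in, __global float *out,
                            __global const int *idx)
{
    int i = get_global_id(0);
    out[i] = in[idx[i]];         // data-dependent addresses, like ray-hit positions
}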

Can't we just use a 1000*1000 buffer in global memory?

Yes, as long as it fits within the device's limits. The OpenCL spec guarantees a minimum for the maximum single-buffer size (CL_DEVICE_MAX_MEM_ALLOC_SIZE): at least 128 MB or 1/4 of the device memory, whichever is larger. The combined size of all buffers varies with the platform/device and can reach several GB.
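If you want to check the limits of your own device, here is a small host-side sketch in plain C (error handling omitted, first device of the first platform assumed):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform; cl_device_id device;
    cl_ulong maxAlloc, globalMem;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(maxAlloc), &maxAlloc, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(globalMem), &globalMem, NULL);

    printf("max single buffer : %llu MB\n", (unsigned long long)(maxAlloc  >> 20));
    printf("total device mem  : %llu MB\n", (unsigned long long)(globalMem >> 20));
    return 0;
}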

Would it be faster if we used local memory and split the 1000*1000 image into 100 parts of 100*100 each?

That depends on the data re-use ratio and on how coalesced the access pattern is. Random (non-coalesced) access to local memory is much faster than random (non-coalesced) access to global memory. But if you use too much local memory or too large a private register file, it can even get slower, because higher local memory consumption means lower occupancy, less memory-latency hiding and more register spilling to global memory. Try to balance it by keeping some data in private registers too, or use a compression technique to fit more data into local memory.

If you re-use each piece of data, let's say, 256 times, then local memory access will be around 10-20x faster than global memory access.

Here is a very simple 2D nbody code for force calculations:

// Global memory is read only 257 times per item: 1 read for the private copy
// of the item's own position plus 256 reads for the global-to-local chunk copies.
// An unoptimized (all-global) version would access it 65537 times per item.
__kernel void nBodyF(__global float *x,  __global float *y,
                     __global float *vx, __global float *vy, // unused here, kept for a full n-body step
                     __global float *fx, __global float *fy)
{
    const int N  = 65536;       // total number of masses in this example
    const int LN = 256;         // length of each chunk in local memory,
                                // i.e. 256 masses per work-group
    int i = get_global_id(0);   // global thread id: 0 ... 65535
    int L = get_local_id(0);    // local thread id: 0 ... 255 within each group
    float2 Fi = (float2)(0, 0); // accumulated force, kept in private registers
    float xi = x[i];            // this item's own position,
    float yi = y[i];            // re-used 65536 times from private memory
    __local float xL[256];      // local memory arrays with constant length
    __local float yL[256];

    for(int k = 0; k < N / LN; k++)             // chunks to fetch from global to local
    {
        barrier(CLK_LOCAL_MEM_FENCE);           // wait until the previous chunk is consumed
        xL[L] = x[k * LN + L];                  // each work-item copies one element,
        yL[L] = y[k * LN + L];                  // so the group stages a 256-element chunk
        barrier(CLK_LOCAL_MEM_FENCE);           // wait until the whole chunk is in place
        for(int j = 0; j < LN; j++)             // process local/private variables only
        {
            float2 r1 = (float2)(xi, yi);       // private vector
            float2 r2 = (float2)(xL[j], yL[j]); // read r2 from local memory
            float2 dr = r1 - r2;                // private displacement
            float2 F  = dr / (0.01f + dot(dr, dr)); // softened force contribution
            Fi -= F;                            // accumulate in private registers
        }
    }
    fx[i] = Fi.x;                               // write the result to global
    fy[i] = Fi.y;                               // memory only once per item
}
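How you launch it matters: the work-group size must be 256 so that LN elements are staged per group. A hedged host-side sketch of the launch (the kernel, queue and the six buffers are assumed to have been created and filled elsewhere):

#include <CL/cl.h>

cl_int launchNBodyF(cl_command_queue queue, cl_kernel kernel,
                    cl_mem xBuf, cl_mem yBuf, cl_mem vxBuf, cl_mem vyBuf,
                    cl_mem fxBuf, cl_mem fyBuf)
{
    const size_t globalSize = 65536;   // N masses, one work-item each
    const size_t localSize  = 256;     // LN: one 256-mass chunk per work-group

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &xBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &yBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &vxBuf);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &vyBuf);
    clSetKernelArg(kernel, 4, sizeof(cl_mem), &fxBuf);
    clSetKernelArg(kernel, 5, sizeof(cl_mem), &fyBuf);

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &globalSize, &localSize, 0, NULL, NULL);
}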

The example above is poor in terms of the local memory re-use ratio, but half of the variables (the item's own position) live in private memory and are re-used 64k times.

Worst case scenario:

  1) A big portion of the items cannot fit into the GPU cache.
  2) Only global memory accesses are done.
  3) Data is not re-used.
  4) Memory is accessed in a very non-uniform way.

  This will make it very slow. When the data doesn't fit into cache and isn't
  re-used, you should use __read_only for the necessary buffers (__write_only
  for buffers that are only written).

If you write a convolution (or some anti-aliasing, or edge detection), the data re-use ratio will be between 4 and 20, and a local memory optimization gives at least a 3-4x performance gain.
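Here is a sketch of that tiling idea for a 3x3 box filter; the 16x16 work-group size, the uchar pixel format and the names are my own assumptions, not fixed requirements. Each group loads an 18x18 tile (with a 1-pixel halo) into local memory once, so every loaded pixel is re-used up to 9 times:

#define TILE 16

__kernel void box3x3(__global const uchar *in, __global uchar *out,
                     const int W, const int H)
{
    __local uchar tile[TILE + 2][TILE + 2];

    const int gx = get_global_id(0), gy = get_global_id(1);
    const int lx = get_local_id(0),  ly = get_local_id(1);

    // cooperative load of the (TILE+2)^2 tile, including the 1-pixel halo
    for(int y = ly; y < TILE + 2; y += TILE)
        for(int x = lx; x < TILE + 2; x += TILE)
        {
            int sx = clamp((int)(get_group_id(0) * TILE) + x - 1, 0, W - 1);
            int sy = clamp((int)(get_group_id(1) * TILE) + y - 1, 0, H - 1);
            tile[y][x] = in[sy * W + sx];
        }
    barrier(CLK_LOCAL_MEM_FENCE);

    if(gx < W && gy < H)                     // padded launches skip out-of-range items
    {
        int sum = 0;
        for(int dy = -1; dy <= 1; dy++)
            for(int dx = -1; dx <= 1; dx++)
                sum += tile[ly + 1 + dy][lx + 1 + dx];
        out[gy * W + gx] = (uchar)(sum / 9);
    }
}

Launch it as a 2D NDRange with a local size of 16x16 and the global size rounded up to the next multiple of 16 in each dimension.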

If your GPU has 300 GB/s of global memory bandwidth, its local memory bandwidth will be around 3-4 TB/s. You can optimize for private registers too, which can reach 15-20 TB/s, but that approach has fewer usage areas.

Edit: if you are reading single bytes and those bytes differ from one another by only a small value (e.g. at most 16), you can pack multiple deltas into a single byte and unpack them in local memory. Example:

  Global memory (copied to local mem):
  Reference_byte   byte0    byte1    byte2    byte3
  128              +3,-5    +24,+50  -25,-63  0,+2

  Unpacking in local memory:
  Reference_byte   byte0  byte1  byte2  byte3  byte4  byte5  byte6  byte7
  128              131    126    150    200    175    112    112    114

  Computing results on the array:
  Reference_byte   byte0  byte1  byte2  byte3  byte4  byte5  byte6  byte7
  128              120    130    140    150    150    150    100    110

  Packing results in local memory:
  Reference_byte   byte0    byte1    byte2    byte3
  128              -8,+10   +10,+10  0,0      -50,+10

  Global memory (copied from local mem):
  Reference_byte   byte0    byte1    byte2    byte3
  128              -8,+10   +10,+10  0,0      -50,+10

  // Maybe a coordinate compression for voxel rendering.
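A minimal sketch of that unpacking idea, assuming cumulative deltas that each fit into a signed 4-bit nibble (-8..+7), a work-group size of 256 and illustrative names throughout. For simplicity one work-item per group unpacks its 128-byte chunk serially into local memory (a real implementation would parallelize this, e.g. with a scan); the whole group then computes on the unpacked copy, and re-packing would work the same way in reverse:

__kernel void unpackAndAdd(__global const uchar *packed,   // two 4-bit deltas per byte
                           __global uchar *out,
                           const uchar referenceByte)
{
    __local uchar vals[256];                  // unpacked values for this work-group
    const int L   = get_local_id(0);          // 0 ... 255
    const int grp = get_group_id(0);

    if(L == 0)                                // one work-item does the serial unpack
    {
        uchar v = referenceByte;
        for(int i = 0; i < 128; i++)          // 128 packed bytes -> 256 values
        {
            uchar p  = packed[grp * 128 + i];
            char dHi = (char)(p >> 4) - 8;    // high nibble: first delta
            char dLo = (char)(p & 0x0F) - 8;  // low nibble: second delta
            v = (uchar)(v + dHi); vals[2 * i]     = v;
            v = (uchar)(v + dLo); vals[2 * i + 1] = v;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);             // make the unpacked chunk visible to all

    out[grp * 256 + L] = vals[L] + 1;         // stand-in for the real computation
}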

Use a profiler that gives you cache line usage info.



...