Can't we just use global memory?
Of course you can. First write actual working code, then optimize.
Global memory lets each work-item fetch its data independently and freely.
I'm not sure whether all architectures have a broadcasting ability,
but I am sure that if memory is accessed randomly by all threads, it gets too slow.
Ray tracing is an example: each pixel's ray refracts/reflects to a different distance and a different memory area, and that is a performance hit. If every thread accessed global memory in a uniform way, it would be much faster.
We can use a 1000*1000 buffer in global memory, right?
There is a minimum value for the maximum buffer size, and it can be around 128 MB or 1/4 of device memory. The combined size of all buffers can vary with platforms/devices, in the range of several GBs.
Will it be faster for us if we use local memory and split the 1000*1000
image into 100 parts of 100*100?
That depends on the data re-use ratio and on how coalesced the access pattern is. Random (non-coalesced) access to local memory is much faster than random (non-coalesced) access to global memory. But if you use too much local memory or too much of the private register file, it can even get slower: higher local memory consumption leads to lower occupancy, less memory-latency hiding, and more register spilling to global memory. Try to balance it by using private registers too, or use a compression technique to fit more data into local memory.
If you re-use each piece of data, let's say, 256 times, then local memory access will be around 10-20x faster than global memory access.
Here is a very simple 2D nbody code for force calculations:
// Global memory is accessed only 257 times per item (per array):
//   1 read to save x[i] into a private register,
//   256 reads for the global-to-local chunk copies.
// The unoptimized version accesses global memory 65537 times per item.
__kernel void nBodyF(__global float *x, __global float *y,
                     __global float *vx, __global float *vy,
                     __global float *fx, __global float *fy)
{
    int N = 65536; // total number of masses in this example
    int LN = 256;  // length of each chunk in local memory,
                   // i.e. 256 masses per work-group
    int i = get_global_id(0); // global thread id: 0...65535
    int L = get_local_id(0);  // local thread id: 0...255 within each group
    float2 Fi = (float2)(0, 0);       // accumulated force, init to zero
    float xi = x[i]; float yi = y[i]; // re-used 65536 times
    __local float xL[256]; __local float yL[256]; // local mem arrays with constant length
    for (int k = 0; k < N / LN; k++) // number of chunks to fetch from global to local
    {
        barrier(CLK_LOCAL_MEM_FENCE); // wait until the previous chunk is fully used
        xL[L] = x[k*LN + L]; yL[L] = y[k*LN + L]; // copy a 256-element chunk into local mem
        barrier(CLK_LOCAL_MEM_FENCE); // wait until the whole chunk is visible
        for (int j = 0; j < LN; j++)  // process local/private variables
        {
            float2 r1 = (float2)(xi, yi);          // private position of this body
            float2 r2 = (float2)(xL[j], yL[j]);    // read the other body from local mem
            float2 dr = r1 - r2;                   // private displacement
            float2 F = dr / (0.01f + dot(dr, dr)); // private (softened) force calc.
            Fi -= F;                               // add to private accumulator
        }
    }
    fx[i] = Fi.x; fy[i] = Fi.y; // write result to global mem only once
}
The example above is poor in terms of local-memory re-use ratio, but half of the variables (xi, yi) live in private memory and are re-used 64k times.
Worst-case scenario:
1) A big portion of the items cannot fit in the GPU cache.
2) Only global memory accesses are done.
3) Data is not re-used.
4) Memory is accessed in a very non-uniform way.
This will make it very slow.
When data doesn't fit in cache and is not re-used, you should use __read_only for
the relevant buffers (__write_only for buffers you write to).
If you do a convolution (or some anti-aliasing, or edge detection), data re-use will be 4x to 20x, and local-memory optimization gives at least 3-4x performance.
If your GPU has 300 GB/s global memory bandwidth, then its local memory bandwidth would be around 3-4 TB/s. You can optimize for private registers too; then it could be 15-20 TB/s, but that approach fits fewer usage areas.
Edit: if you are reading single bytes and these bytes differ from each other only by a small value (e.g. at most 16), then you can pack multiple deltas into single bytes and unpack them in local memory. Example:
Global memory (copied to local mem):
reference_byte  byte0   byte1    byte2    byte3
128             +3,-5   +24,+50  -25,-63  0,+2
Unpacking in local memory:
reference_byte  byte0  byte1  byte2  byte3  byte4  byte5  byte6  byte7
128             131    126    150    200    175    112    112    114
Computing results on the array:
reference_byte  byte0  byte1  byte2  byte3  byte4  byte5  byte6  byte7
128             120    130    140    150    150    150    100    110
Packing results in local memory:
reference_byte  byte0   byte1    byte2  byte3
128             -8,+10  +10,+10  0,0    -50,+10
Global memory (copied from local mem):
reference_byte  byte0   byte1    byte2  byte3
128             -8,+10  +10,+10  0,0    -50,+10
// Maybe as a coordinate compression for voxel rendering.
Use a profiler that gives you cache-line usage info.