About my CUDA kernel:
I have data residing in global memory that is read by all threads; it is passed in as a kernel function argument.
Nsight identified it as a source of memory bandwidth use.
Every thread reads every value, in the same sequential order as every other thread.
To avoid 32 reads from global memory (one per thread in a warp) for every value, I decided to fetch a copy of the data into shared memory.
It looks something like this:
extern __shared__ __half aab[];   // dynamically sized shared-memory cache
...
__global__ void mykernel
(
    int numbox,
    ...
    const value_t* glob_aab,
    ...
)
{
    // Each thread copies a strided portion of the data into shared memory.
    const int copysz = numbox * 6;
    for ( int i = threadIdx.x; i < copysz; i += blockDim.x )
        aab[i] = glob_aab[i];
    __syncthreads();
    // All threads are done caching the aab geometry.
    ...
}
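For reference, because aab[] is an extern __shared__ array, the launch has to supply its size in bytes as the third launch-configuration parameter. A sketch of what I mean (the block and grid sizes below are placeholders, not my real values):

const int    numBlocks       = 1024;                          // placeholder
const int    threadsPerBlock = 256;                           // placeholder
const size_t shmemBytes      = numbox * 6 * sizeof(__half);   // bytes for aab[]
mykernel<<< numBlocks, threadsPerBlock, shmemBytes >>>( numbox, /* ..., */ glob_aab /* , ... */ );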
So far, so good. It works.
However, something peculiar shows up when I profile this kernel with the NVIDIA Nsight Compute tool: it reports 18,508 non-issued instructions due to a math-pipe throttle.
I am guessing this comes from the i += blockDim.x update in the copy loop?
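If the integer arithmetic of that loop really is the bottleneck, one idea would be to copy two values per iteration, halving the number of index increments and comparisons. A sketch, assuming value_t is __half, glob_aab is at least 4-byte aligned, and copysz (= numbox*6) is even:

// Inside the same kernel, replacing the per-__half copy loop above.
const int      copysz2 = copysz / 2;                                  // number of __half2 elements
const __half2* src2    = reinterpret_cast<const __half2*>( glob_aab );
__half2*       dst2    = reinterpret_cast<__half2*>( aab );
for ( int i = threadIdx.x; i < copysz2; i += blockDim.x )
    dst2[i] = src2[i];
__syncthreads();

I have not measured whether that actually helps, though.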
Anyway, what is the correct way to copy read-only data from global memory into a shared-memory cache?
I stride by blockDim.x so that the reads from global memory (and the writes to shared memory) are coalesced. But if that loop gets throttled by the math pipeline, should I use another method to fill my cached copy?
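For example, would the block-wide asynchronous copy from cooperative groups (added in CUDA 11, and on Ampere it can use the hardware async-copy path) be the preferred way here? A minimal sketch of what I mean, assuming value_t is __half:

#include <cuda_fp16.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void mykernel( int numbox, /* ..., */ const __half* glob_aab /* , ... */ )
{
    extern __shared__ __half aab[];
    cg::thread_block block = cg::this_thread_block();

    // Hand the whole block-wide copy to the runtime instead of a
    // hand-rolled strided loop; no per-thread index arithmetic needed.
    cg::memcpy_async( block, aab, glob_aab, numbox * 6 * sizeof(__half) );
    cg::wait( block );   // all threads wait here until the copy has landed

    // ... use aab ...
}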
OS: Ubuntu 20.10
GPU: RTX 3070
CUDA: 11.2
question from:
https://stackoverflow.com/questions/65650050/cuda-caching-global-memory-into-shared-memory