Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
276 views
in Technique[技术] by (71.8m points)

nsight - CUDA caching global memory into shared memory

About my CUDA kernel:

I have data, residing in global memory that is read by all threads, and it is passed in as a kernel function argument.

NSIGHT identified it as a source of bandwidth use.

Every thread reads every value, and in the same sequential order as all other threads.

To avoid 32 reads from global memory for every value, I decided to fetch a copy of the data into shared memory.

It looks something like this:

extern __shared__ __half aab[];
...
__global__ mykernel
(
  int numbox,
  ...
  const value_t* glob_aab,
  ...
)
{
        // Each thread does a portion of the data.
        const int copysz = numbox*6;
        for ( int i=threadIdx.x; i<copysz; i+=blockDim.x )
                aab[i] = glob_aab[i];
        __syncthreads();
        // All threads are done cacheing the aab geometry.
        ...
}

So far, so good. It works.

However, a peculiar thing is discovered when I profile this kernel with the Nvidia NSIGHT Compute tool.

profile

Those 18,508 non-issued instructions are due to a math-pipe throttle.

I am guessing the i+=blockDim.x statement?

Anyways... what is the correct way to copy read-only data from global memory into a shared (cached) block?

I use the steps of blockDim.x so that the reads (and writes to cache) are coalesced. But if it gets throttled by the math pipeline, should I use another method to fill my cached copy?

OS: Ubuntu 20.10

GPU: RTX 3070

CUDA: 11.2

question from:https://stackoverflow.com/questions/65650050/cuda-caching-global-memory-into-shared-memory

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
Waitting for answers

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...