I am unable to use more than 48K of shared memory (on V100, Cuda 10.2)
I call
cudaFuncSetAttribute(my_kernel,
cudaFuncAttributePreferredSharedMemoryCarveout,
cudaSharedmemCarveoutMaxShared);
before launching my_kernel
first time.
I use launch bounds
and dynamic shared memory inside my_kernel
:
__global__
void __launch_bounds__(768, 1)
my_kernel(...)
{
extern __shared__ float2 sh[];
...
}
Kernel is called like this:
dim3 blk(32, 24); // 768 threads as in launch_bounds.
my_kernel<<<grd, blk, 64 * 1024, my_stream>>>( ... );
cudaGetLastError()
after kernel call returns cudaErrorInvalidValue
.
If I use <= 48 K of shared memory (e.g., my_kernel<<<grd, blk, 48 * 1024, my_stream>>>
), it works.
Compilation flags are:
nvcc -std=c++14 -gencode arch=compute_70,code=sm_70 -Xptxas -v,-dlcm=cg
What am I missing?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…