I have a strange problem whose origin I cannot determine:
I have a working kernel for a special matrix-vector multiplication, which I want to speed up. The big matrix (10^6 × 10^6) is constructed from a few small matrices, so I want to put that data into shared memory. However, as soon as I try to add the shared memory, I only get the error:
pycuda._driver.LogicError: cuLaunchKernel failed: invalid value
So my working kernel is:
#define FIELD_SIZE {field}
#define BLOCK_SIZE {block}

__global__ void MatrixMulKernel(double *gpu_matrix, double *gpu_b, double *gpu_y)
{
    int tx = ... + threadIdx.x;
    if (tx < FIELD_SIZE * FIELD_SIZE * BLOCK_SIZE)
    {
        ... multiplication ...
    }
}
And with the shared-memory part added, it looks like this:
#define FIELD_SIZE {field}
#define BLOCK_SIZE {block}

__global__ void MatrixMulKernel(double *gpu_matrix_ptr, double *gpu_b, double *gpu_y)
{
    __shared__ double gpu_matrix[BLOCK_SIZE * BLOCK_SIZE * 13];

    int tx = ... + threadIdx.x;
    if (tx < BLOCK_SIZE * BLOCK_SIZE * 13)
    {
        gpu_matrix[tx] = gpu_matrix_ptr[tx];
    }
    __syncthreads();

    if (tx < FIELD_SIZE * FIELD_SIZE * BLOCK_SIZE)
    {
        ... multiplication ...
    }
}
This is the only part I changed, so it basically has to be the gpu_matrix[tx] = gpu_matrix_ptr[tx] statement, hasn't it? But I fail to see how that could be. I essentially tried to copy the tiled matrix-multiplication example from the PyCUDA examples: http://wiki.tiker.net/PyCuda/Examples/MatrixmulTiled
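For what it's worth, the static shared array itself should comfortably fit within the 48 KB per-block shared-memory limit of current devices. Here is a quick plain-Python sanity check (the BLOCK_SIZE value of 8 is only an illustrative assumption; substitute the real {block} value):

```python
# Size of the static shared array declared in the kernel:
#   __shared__ double gpu_matrix[BLOCK_SIZE * BLOCK_SIZE * 13];
BLOCK_SIZE = 8           # illustrative assumption; use the real {block} value
SIZEOF_DOUBLE = 8        # bytes per double on the GPU

shared_bytes = BLOCK_SIZE * BLOCK_SIZE * 13 * SIZEOF_DOUBLE
limit_bytes = 48 * 1024  # typical per-block static shared-memory limit

print(shared_bytes)                 # 6656 bytes for BLOCK_SIZE = 8
print(shared_bytes <= limit_bytes)  # True -> the array fits
```

So a shared-memory overflow seems unlikely to be the cause here, at least for block sizes of that order.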
The invocation is:
self.kernel.prepare([np.intp, np.intp, np.intp])
self.kernel.prepared_call(grid_shape,
                          block_shape,
                          self.matrix_gpu.gpudata,
                          b_gpu.gpudata,
                          y_gpu.gpudata)
where matrix_gpu, b_gpu, and y_gpu are pycuda.gpuarray instances.
I hope you can clear up some of my confusion.