c - Cuda Shared Memory array variable

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am trying to declare a shared-memory array for matrix multiplication as follows:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

I want the user to be able to enter the size of the matrix to calculate, but that means BLOCK_SIZE is no longer a compile-time constant. When I changed it, I got a compiler error: "error: constant value is not known". I looked into it and found it is similar to this thread, so I tried:

__shared__ int buf [];

But then I get: "error: incomplete type is not allowed"

Thanks,
Dan

Update with code (I pretty much followed this guide and the "starting out with CUDA" guide): the block size comes from asking the user for the size of the matrix. They enter x and y. The block size is only x, so right now x and y have to be the same.

__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB,size_t block_size)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * block_size * by;

    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = block_size;

    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = block_size * bx;

    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = block_size * wB;
    float Csub=0;
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        // Dynamic shared memory: every unsized extern __shared__ array
        // starts at the same address, so the tiles of A and B are packed
        // into one array. A's tile occupies smem[0 .. block_size*block_size)
        // and B's tile follows it.
        extern __shared__ float smem[];

        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        smem[ty*block_size+tx] = A[a + wA * ty + tx];
        //cuPrintf("\n\nWhat are the memory locations?\n");
        //cuPrintf("The shared memory(A) is: %.2f\n",smem[ty*block_size+tx]);
        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];
        //cuPrintf("The shared memory(B) is: %.2f\n",smem[block_size*block_size+ty*block_size+tx]);
        // Synchronize to make sure the matrices 
        // are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < block_size; ++k)
        {

            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;
            //cuPrintf("Csub is currently: %.2f\n",Csub);
        }
        //cuPrintf("\n\n\n");
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        //cuPrintf("the results are csub: %.2f\n",Csub);
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;


}

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:30:38+0000

Declare the array as extern, without a size:

extern __shared__ int buf[];

When you launch the kernel, pass the number of bytes of shared memory as the third launch-configuration parameter:

kernel<<<blocks,threads,numbytes_for_shared>>>(...);
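A minimal, self-contained sketch of this pattern, assuming a CUDA-capable device and a single block whose thread count equals the array length (so n must not exceed the device's maximum threads per block). The names `reverseKernel` and `reverse_demo` are hypothetical, chosen for illustration. The point is that the buffer size `n * sizeof(int)` is decided at run time, which a fixed `BLOCK_SIZE` could not do.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Reverses an n-element array inside one block using a shared buffer
// whose size is chosen at launch time, not at compile time.
__global__ void reverseKernel(int* d, int n)
{
    // Unsized extern declaration: its size comes from the third
    // launch-configuration parameter.
    extern __shared__ int buf[];

    int t = threadIdx.x;
    if (t < n) buf[t] = d[t];
    __syncthreads();                        // all elements are now in shared memory
    if (t < n) d[t] = buf[n - 1 - t];
}

// Hypothetical helper: reverses 0..n-1 on the device and returns the number
// of wrong elements (0 on success), or -1 if a CUDA call failed.
int reverse_demo(int n)
{
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // <<<blocks, threads, bytes of dynamic shared memory>>>
    reverseKernel<<<1, n, n * sizeof(int)>>>(d, n);
    cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
    if (err == cudaSuccess)                 // this copy also waits for the kernel
        err = cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != n - 1 - i) ++wrong;
    return wrong;
}

int main()
{
    int n = 64;                             // could come from user input
    printf("wrong elements: %d\n", reverse_demo(n));
    return 0;
}
```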

If you have multiple extern __shared__ declarations:

extern __shared__ float As[];

extern __shared__ float Bs[];

then As will point to the same address as Bs.
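A small sketch that checks this aliasing on real hardware, assuming a CUDA-capable device. The kernel records the address of each array and the host compares them; `aliasKernel` and `as_bs_alias` are hypothetical names. The addresses are copied out as integers so that the comparison is made on the actual addresses the kernel saw, rather than on two differently named arrays inside device code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two unsized extern __shared__ arrays declared in the same kernel.
__global__ void aliasKernel(unsigned long long* out)
{
    extern __shared__ float As[];
    extern __shared__ float Bs[];

    if (threadIdx.x == 0) {
        // Record the (generic) address of each array.
        out[0] = (unsigned long long)As;
        out[1] = (unsigned long long)Bs;
    }
}

// Hypothetical helper: returns 1 if As and Bs share an address,
// 0 if they do not, -1 if a CUDA call failed.
int as_bs_alias()
{
    unsigned long long h[2] = {0ULL, 1ULL};
    unsigned long long* d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return -1;

    aliasKernel<<<1, 1, 2 * sizeof(float)>>>(d);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    return h[0] == h[1] ? 1 : 0;
}

int main()
{
    printf("As and Bs alias: %d\n", as_bs_alias());
    return 0;
}
```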

Instead, keep As and Bs inside a single 1D array:

extern __shared__ float smem[];

When calling the kernel, launch it with 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) bytes of shared memory.

When indexing into As, use smem[y*BLOCK_SIZE+x]; when indexing into Bs, use smem[BLOCK_SIZE*BLOCK_SIZE+y*BLOCK_SIZE+x].
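Putting the answer together, here is a sketch of a tiled multiply whose tile width `bs` is a run-time value. It assumes square n x n matrices with n a multiple of `bs` (the same restriction the question's kernel has) and a CUDA-capable device. `matrixMulDyn` and `run_matmul` are hypothetical names; `As` and `Bs` are plain pointers into the single `smem` array, using the layout described above. Because `bs` may come from the user, the helper also rejects sizes whose `bs*bs` threads or `2*bs*bs*sizeof(float)` bytes exceed the device limits, which would otherwise make the launch fail.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Tiled C = A * B for n x n matrices, tile width bs chosen at run time.
// Assumes n % bs == 0, like the question's kernel.
__global__ void matrixMulDyn(float* C, const float* A, const float* B, int n, int bs)
{
    extern __shared__ float smem[];
    float* As = smem;              // first bs*bs floats: tile of A
    float* Bs = smem + bs * bs;    // next bs*bs floats: tile of B

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * bs + ty;
    int col = blockIdx.x * bs + tx;
    float sum = 0.0f;

    for (int t = 0; t < n / bs; ++t) {
        As[ty * bs + tx] = A[row * n + t * bs + tx];
        Bs[ty * bs + tx] = B[(t * bs + ty) * n + col];
        __syncthreads();           // both tiles are loaded
        for (int k = 0; k < bs; ++k)
            sum += As[ty * bs + k] * Bs[k * bs + tx];
        __syncthreads();           // finish reading before the next load
    }
    C[row * n + col] = sum;
}

// Hypothetical helper: multiplies two n x n matrices on the GPU with tile
// width bs and returns the max absolute difference from a CPU reference,
// or a negative value if bs is unusable or a CUDA call failed.
float run_matmul(int n, int bs)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1.0f;
    size_t smemBytes = 2 * (size_t)bs * bs * sizeof(float);
    if (n % bs != 0 || bs * bs > prop.maxThreadsPerBlock ||
        smemBytes > prop.sharedMemPerBlock)
        return -1.0f;

    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> hA(n * n), hB(n * n), hC(n * n);
    for (int i = 0; i < n * n; ++i) { hA[i] = (i % 7) * 0.5f; hB[i] = (i % 5) - 2.0f; }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(bs, bs);
    dim3 grid(n / bs, n / bs);
    matrixMulDyn<<<grid, threads, smemBytes>>>(dC, dA, dB, n, bs);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1.0f;

    float maxErr = 0.0f;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            maxErr = std::fmax(maxErr, std::fabs(ref - hC[r * n + c]));
        }
    return maxErr;
}

int main()
{
    int bs = 16;                   // e.g. read from the user
    printf("max error: %g\n", run_matmul(64, bs));
    return 0;
}
```

With `bs = 16` the launch requests 2*16*16*4 = 2048 bytes of shared memory. On current GPUs the limit of 1024 threads per block caps `bs` at 32 well before the shared-memory limit does.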
