cuda - Is starting 1 thread per element always optimal for data independent problems on the GPU?

I was writing a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it to cudaMemcpy(). It looks like this:

template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    //This copy kernel is only correct when size%sizeof(vector_type)==0
    auto numElements = size / sizeof(vector_type);

    for(auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements ; id += gridDim.x * THREADS_PER_BLOCK){
        dest2[id] = src2[id];
    }
}

I also calculated the number of blocks required to reach 100% occupancy like so:

THREADS_PER_BLOCK = 256 
Multi-Processors: 4 
Max Threads per Multi Processor: 2048 
NUM_BLOCKS = 4 * 2048 / 256 = 32
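
For reference, a minimal host-side sketch of how both block counts can be computed and the kernel launched might look like this (d_src, d_dest and size are placeholder names here, not taken from the benchmark code):

//Minimal sketch: compute both block counts and launch the copy kernel both ways.
//d_src, d_dest and size are placeholders for the device buffers and the byte count.
constexpr unsigned int THREADS_PER_BLOCK = 256;

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

//one thread per int2 element
unsigned int numElements = size / sizeof(int2);
unsigned int blocksOnePerElement = (numElements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

//just enough blocks to reach the maximum resident thread count of the device
unsigned int blocksFullOccupancy = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor / THREADS_PER_BLOCK;

copy<THREADS_PER_BLOCK><<<blocksOnePerElement, THREADS_PER_BLOCK>>>(d_src, d_dest, size);
copy<THREADS_PER_BLOCK><<<blocksFullOccupancy, THREADS_PER_BLOCK>>>(d_src, d_dest, size);

Assuming "400mb" in the timings below means 400,000,000 bytes, the first count works out to 400,000,000 / 8 = 50,000,000 elements and ceil(50,000,000 / 256) = 195,313 blocks, and the second to 4 * 2048 / 256 = 32 blocks, matching the two launches in the measurements.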

My tests, on the other hand, showed that launching enough blocks so that each thread processes only one element always outperformed the "optimal" block count. Here are the timings for 400 MB of data:

bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s

So my questions are:

Why is there a speed difference?

Are there any downsides to starting one thread per element when each element can be processed completely independently of all other elements?


1 Reply


Is starting 1 thread per element always optimal for data independent problems on the GPU?

Not always. Let's consider 3 different implementations. In each case we'll assume we're dealing with a trivially parallelizable problem that involves one element load, some "work" and one element store per thread. In your copy example there is basically no work - just loads and stores.

  1. One element per thread. Each thread is doing 1 element load, the work, and 1 store. The GPU likes to have a lot of exposed parallel-issue-capable instructions per thread available, in order to hide latency. Your example consists of one load and one store per thread, ignoring other instructions like index arithmetic, etc. In your example GPU, you have 4 SMs, and each is capable of a maximum complement of 2048 threads (true for nearly all GPUs today), so the maximum in-flight complement is 8192 threads. So at most, 8192 loads can be issued to the memory pipe, then we're going to hit machine stalls until that data comes back from memory, so that the corresponding store instructions can be issued. In addition, for this case, we have overhead associated with retiring threadblocks and launching new threadblocks, since each block only handles 256 elements.

  2. Multiple elements per thread, not known at compile time. In this case we have a loop. The compiler does not know the loop extent at compile time, so it may or may not unroll the loop. If it does not unroll the loop, then we have a load followed by a store for each loop iteration. This doesn't give the compiler a good opportunity to reorder (independent) instructions, so the net effect may be the same as case 1, except that we also have some additional overhead associated with processing the loop.

  3. Multiple elements per thread, known at compile time. You haven't really provided this example, but it is often the best scenario. In the matrix transpose example on the parallelforall blog, the writer of that kernel (which is essentially a copy kernel) chose to have each thread perform 8 elements of copy "work" (a sketch of a kernel written in this style follows after this list). The compiler then sees a loop:

      LOOP:  LD R0, in[idx];
             ST out[idx], R0;
             ...
             BRA  LOOP;
    

    which it can unroll (let's say) 8 times:

         LD R0, in[idx];
         ST out[idx], R0;
         LD R0, in[idx+1];
         ST out[idx+1], R0;
         LD R0, in[idx+2];
         ST out[idx+2], R0;
         LD R0, in[idx+3];
         ST out[idx+3], R0;
         LD R0, in[idx+4];
         ST out[idx+4], R0;
         LD R0, in[idx+5];
         ST out[idx+5], R0;
         LD R0, in[idx+6];
         ST out[idx+6], R0;
         LD R0, in[idx+7];
         ST out[idx+7], R0;
    

    and after that it can reorder the instructions, since the operations are independent:

         LD R0, in[idx];
         LD R1, in[idx+1];
         LD R2, in[idx+2];
         LD R3, in[idx+3];
         LD R4, in[idx+4];
         LD R5, in[idx+5];
         LD R6, in[idx+6];
         LD R7, in[idx+7];
         ST out[idx], R0;
         ST out[idx+1], R1;
         ST out[idx+2], R2;
         ST out[idx+3], R3;
         ST out[idx+4], R4;
         ST out[idx+5], R5;
         ST out[idx+6], R6;
         ST out[idx+7], R7;
    

    at the expense of some increased register pressure. The benefit here, as compared to the non-unrolled loop case, is that the first 8 LD instructions can all be issued - they are all independent. After issuing those, the thread will stall at the first ST instruction - until the corresponding data is actually returned from global memory. In the non-unrolled case, the machine can issue the first LD instruction, but immediately hits a dependent ST instruction, and so it may stall right there. The net of this is that in the first 2 scenarios, I was only able to have 8192 LD operations in flight to the memory subsystem, but in the 3rd case I was able to have 65536 LD instructions in flight. Does this provide a benefit? In some cases, it does. The benefit will vary depending on which GPU you are running on.
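
To make case 3 concrete, here is a minimal sketch of a copy kernel written in that style. It is not from the question; the ELEMS_PER_THREAD parameter and the indexing scheme are my assumptions, chosen so that consecutive threads still access consecutive elements:

template<unsigned int THREADS_PER_BLOCK, unsigned int ELEMS_PER_THREAD>
__global__ static
void copy_unrolled(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    //As in the original kernel, this is only correct when size%sizeof(vector_type)==0,
    //and the grid must satisfy gridDim.x * THREADS_PER_BLOCK * ELEMS_PER_THREAD >= numElements
    auto numElements = size / sizeof(vector_type);
    auto tileStart = blockIdx.x * (THREADS_PER_BLOCK * ELEMS_PER_THREAD);

    //The trip count is a compile-time constant, so the compiler can fully unroll this loop
    //and issue all ELEMS_PER_THREAD independent loads before stalling on the first store.
    #pragma unroll
    for (unsigned int i = 0; i < ELEMS_PER_THREAD; ++i) {
        auto id = tileStart + i * THREADS_PER_BLOCK + threadIdx.x; //coalesced: consecutive threads hit consecutive elements
        if (id < numElements)
            dest2[id] = src2[id];
    }
}

Launched as, for example, copy_unrolled<256, 8><<<numBlocks, 256>>>(src, dest, size) with numBlocks chosen to cover the data, each thread then exposes 8 independent loads per loop body instead of one, which is exactly the reordering benefit shown above.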

What we have done here, effectively (working in conjunction with the compiler), is increase the number of instructions that can be issued per thread before the thread hits a stall. This is also referred to as increasing the exposed parallelism, in this approach basically via ILP. Whether or not it has any benefit will vary depending on your actual code, your actual GPU, and what else is going on in the GPU at that time. But it is always a good strategy to increase exposed parallelism using techniques such as this, because the ability to issue instructions is how the GPU hides the various forms of latency that it must deal with, so we have effectively improved the GPU's ability to hide latency with this approach.

Why is there a speed difference?

This can be difficult to answer without profiling the code carefully. However it's often the case that launching just enough threads to fully satisfy the instantaneous carrying capacity of the GPU is not a good strategy, possibly due to the "tail effect" or other types of inefficiency. It may also be the case that blocks are limited by some other factor, such as registers or shared memory usage. It's usually necessary to carefully profile as well as possibly study the generated machine code to fully answer such a question. But it may be that the loop overhead measurably impacts your comparison, which is basically my case 2 vs. my case 1 above.
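
To put a rough number on the loop overhead, using the figures from the question (and assuming "400mb" means 400,000,000 bytes):

elements copied        = 400,000,000 / sizeof(int2) = 50,000,000
threads with 32 blocks = 32 * 256 = 8,192
loop iterations/thread ≈ 50,000,000 / 8,192 ≈ 6,104

So in the 32-block launch each thread executes the loop's index update, compare and branch roughly 6,100 times, while in the 195,313-block launch each thread executes the loop body essentially once; that per-iteration overhead, together with the reduced opportunity to reorder loads and stores, is a plausible source of the measured difference, but only a profile would confirm it.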

(Note: the memory indices in my "pseudo" machine-code example are not what you would expect for a well-written grid-stride copy loop; they are just for example purposes, to demonstrate unrolling and the benefit it can have via compiler instruction reordering.)

