One element per thread. Each thread does one element load, the work, and one store. The GPU likes to have a lot of exposed, parallel-issue-capable instructions available per thread, in order to hide latency. Your example consists of one load and one store per thread, ignoring other instructions like index arithmetic, etc. In your example GPU, you have 4 SMs, and each is capable of a maximum complement of 2048 threads (true for nearly all GPUs today), so the maximum in-flight complement is 8192 threads. So at most, 8192 loads can be issued to the memory pipe; then we're going to hit machine stalls until that data comes back from memory, so that the corresponding store instructions can be issued. In addition, for this case, we have overhead associated with retiring threadblocks and launching new threadblocks, since each block only handles 256 elements.
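To make this case concrete, a one-element-per-thread copy kernel of the sort being discussed might look like the following. This is a minimal sketch; the kernel name and parameters are my own illustration, not taken from your code:

__global__ void copy_one_per_thread(const float *in, float *out, int n)
{
    // one global index per thread; with 256 threads per block,
    // each block covers 256 elements
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx];   // exactly one LD, followed by a ST that depends on it
}

// hypothetical launch: one thread per element, 256 threads per block
// copy_one_per_thread<<<(n + 255) / 256, 256>>>(d_in, d_out, n);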
Multiple elements per thread, known at compile time. You haven't really provided this example, but it is often the best scenario. In the matrix transpose example on the parallelforall blog, the writer of that (essentially copy) kernel chose to have each thread perform 8 elements of copy "work". The compiler then sees a loop:
LOOP: LD R0, in[idx];
      ST out[idx], R0;
      ...                // index increment, loop-count test, etc.
      BRA LOOP;
which it can unroll (let's say) 8 times:
LD R0, in[idx];
ST out[idx], R0;
LD R0, in[idx+1];
ST out[idx+1], R0;
LD R0, in[idx+2];
ST out[idx+2], R0;
LD R0, in[idx+3];
ST out[idx+3], R0;
LD R0, in[idx+4];
ST out[idx+4], R0;
LD R0, in[idx+5];
ST out[idx+5], R0;
LD R0, in[idx+6];
ST out[idx+6], R0;
LD R0, in[idx+7];
ST out[idx+7], R0;
and after that it can rename registers and reorder the instructions, since the memory operations are all independent:
LD R0, in[idx];
LD R1, in[idx+1];
LD R2, in[idx+2];
LD R3, in[idx+3];
LD R4, in[idx+4];
LD R5, in[idx+5];
LD R6, in[idx+6];
LD R7, in[idx+7];
ST out[idx], R0;
ST out[idx+1], R1;
ST out[idx+2], R2;
ST out[idx+3], R3;
ST out[idx+4], R4;
ST out[idx+5], R5;
ST out[idx+6], R6;
ST out[idx+7], R7;
at the expense of some increased register pressure. The benefit here, as compared to the non-unrolled loop case, is that the first 8 LD instructions can all be issued, since they are all independent. After issuing those, the thread will stall at the first ST instruction, until the corresponding data is actually returned from global memory. In the non-unrolled case, the machine can issue the first LD instruction, but immediately hits a dependent ST instruction, and so it may stall right there. The net of this is that in the first two cases (one element per thread, and the non-unrolled loop), I was only able to have 8192 LD operations in flight to the memory subsystem, but in the unrolled case I was able to have 65536 LD instructions in flight (8 independent loads per thread, times 8192 threads). Does this provide a benefit? In some cases, it does. The benefit will vary depending on which GPU you are running on.
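For reference, the multiple-elements-per-thread pattern from case 2 might be written at the source level roughly as follows. This is my own illustrative sketch, not the actual parallelforall code; ELEMS_PER_THREAD and the kernel name are assumptions. Because the trip count is a compile-time constant, the compiler is free to fully unroll the loop and issue the loads back-to-back, as described above:

#define ELEMS_PER_THREAD 8   // known at compile time, so the loop is fully unrollable

__global__ void copy_eight_per_thread(const float *in, float *out, int n)
{
    // each block covers blockDim.x * ELEMS_PER_THREAD elements
    int idx = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + threadIdx.x;
    #pragma unroll
    for (int i = 0; i < ELEMS_PER_THREAD; i++) {
        int j = idx + i * blockDim.x;   // stride by blockDim.x between iterations
        if (j < n)
            out[j] = in[j];             // 8 independent LD/ST pairs after unrolling
    }
}

Striding by blockDim.x, rather than packing each thread's 8 elements contiguously, keeps adjacent threads touching adjacent addresses, so the accesses stay coalesced; either way, the unrolled body exposes 8 independent loads per thread.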