This is technically an ill-defined program.
Most, but not all (for example G80 does not), NVIDIA GPUs support early exit in this way because the hardware maintains an active thread count for each block, and this count is used for barrier synchronization rather than the initial thread count for the block.
Therefore, when the __syncthreads()
in your code is reached, the hardware will not wait on any threads that have already returned, and the program runs without deadlock.
A more common use of this style is:
__global__ void foo(int n, ...) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= n) return;
... // do some computation with remaining threads
}
Important note: barrier counts are updated per-warp (see here), not per-thread. So you may have the case where, say, only a few (or zero) threads return early. This means that the barrier count is not decremented. However, as long as at least one thread from each warp reaches the barrier, it will not deadlock.
So in general, you need to use barriers carefully. But specifically, (simple) early exit patterns like this do work.
Edit: for your specific case.
Iteration Idx==36: 2 active warps so barrier exit count is 64. All threads from warp 0 reach barrier, incrementing count from 0 to 32. 4 threads from warp 1 reach barrier, incrementing count from 32 to 64, and warps 0 and 1 are released from barrier. Read the link above to understand why this happens.
Iteration Idx==18: 1 active warp so barrier exit count is 32. 18 threads from warp 0 reach barrier, incrementing count from 0 to 32. Barrier is satisfied and warp 0 is released.
Etc...
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…