cuda - conditional syncthreads & deadlock (or not)

Question

Welcome To Ask or Share your Answers For Others

cuda - conditional syncthreads & deadlock (or not)

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

cuda - conditional syncthreads & deadlock (or not)

A follow up Q to: EarlyExit and DroppedThreads

According to the above links, the code below should dead-lock.
Please explain why this does NOT dead-lock. (Cuda 5 on a Fermi)

__device__ int add[144];
__device__ int result;

add<<<1,96>>>();  // the calling 

__global__ void add() {
 for(idx=72>>1; idx>0; idx>>=1) {
  if(thrdIdx < idx) 
   add[thrdIdx]+= add[thrdIdx+idx];
  else
   return;
  __syncthreads();
 }

 if(thrdIdx == 0)
  result= add[0];
}

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:59:01+0000

This is technically an ill-defined program.

Most, but not all (for example G80 does not), NVIDIA GPUs support early exit in this way because the hardware maintains an active thread count for each block, and this count is used for barrier synchronization rather than the initial thread count for the block.

Therefore, when the __syncthreads() in your code is reached, the hardware will not wait on any threads that have already returned, and the program runs without deadlock.

A more common use of this style is:

__global__ void foo(int n, ...) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx >= n) return;
  ... // do some computation with remaining threads
}

Important note: barrier counts are updated per-warp (see here), not per-thread. So you may have the case where, say, only a few (or zero) threads return early. This means that the barrier count is not decremented. However, as long as at least one thread from each warp reaches the barrier, it will not deadlock.

So in general, you need to use barriers carefully. But specifically, (simple) early exit patterns like this do work.

Edit: for your specific case.

Iteration Idx==36: 2 active warps so barrier exit count is 64. All threads from warp 0 reach barrier, incrementing count from 0 to 32. 4 threads from warp 1 reach barrier, incrementing count from 32 to 64, and warps 0 and 1 are released from barrier. Read the link above to understand why this happens.

Iteration Idx==18: 1 active warp so barrier exit count is 32. 18 threads from warp 0 reach barrier, incrementing count from 0 to 32. Barrier is satisfied and warp 0 is released.

Etc...

Categories

cuda - conditional syncthreads & deadlock (or not)

cuda - conditional syncthreads & deadlock (or not)

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags