
c++ - Double buffering in CUDA so the CPU can operate on data produced by a persistent kernel

I have a Monte Carlo simulation in which the state of the system is a bit string (size N) whose bits are randomly flipped. In an effort to accelerate the simulation, the code was revised to use CUDA. However, because of the large number of statistics I need to calculate from the system state (the work goes as N^2), this part needs to be done on the CPU, where there is more memory. Currently the algorithm looks like this:

loop
  CUDA kernel making 10s of Monte Carlo steps
  Copy system state back to CPU
  Calculate statistics
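
In CUDA/C++ terms the current structure looks roughly like this (the kernel and function names are placeholders for my actual code):

for (int pass = 0; pass < num_passes; pass++){
  mc_kernel<<<grid, block>>>(d_state, steps_per_pass);               // 10s of Monte Carlo steps on the GPU
  cudaMemcpy(h_state, d_state, state_bytes, cudaMemcpyDeviceToHost); // blocks until the copy completes
  calculate_statistics(h_state);                                     // O(N^2) work on the CPU while the GPU idles
}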

This is inefficient and I would like to have the kernel run persistently while the CPU occasionally queries the state of the system and calculates the statistics while the kernel continues to run.

Based on Tom's answer to this question I think the answer is double buffering, but I haven't been able to find an explanation or example of how to do this.

How does one set up the double buffering described in the third paragraph of Tom's answer for a CUDA/C++ code?


1 Reply


Here's a fully worked example of a "persistent" kernel, producer-consumer approach, with a double-buffered interface from device (producer) to host (consumer).

Persistent kernel design generally implies launching kernels with, at most, the number of blocks that can be simultaneously resident on the hardware (see item 1 on slide 16 here). For the most efficient usage of the machine, we'd generally like to maximize this, while still staying within the aforementioned limit. This involves an occupancy study for a specific kernel, and it will vary from kernel to kernel. Therefore I've chosen to take a shortcut here, and simply launch as many blocks as there are multiprocessors. Such an approach is always guaranteed to work (it could be considered a "lower bound" on the number of blocks to launch for a persistent kernel), but is (typically) not the most efficient usage of the machine. Nevertheless, I claim the occupancy study is beside the point of your question. Furthermore, it is arguable that proper "persistent kernel" design with guaranteed forward progress is actually quite tricky - requiring careful design of the CUDA thread code and placement of threadblocks (e.g. only use 1 threadblock per SM) to guarantee forward progress. However we don't need to delve to this level to address your question (I don't think) and the persistent kernel example I propose here only places 1 threadblock per SM.
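
As an aside, if you do want to size the launch by occupancy rather than simply by SM count, a minimal sketch (using the testkernel and nTPB definitions from the code below) might look like this; treat it as an illustration, not part of the worked example:

int num_sm = 0, blocks_per_sm = 0;
cudaDeviceGetAttribute(&num_sm, cudaDevAttrMultiProcessorCount, 0);
// how many blocks of this kernel, at nTPB threads each, can be simultaneously resident per SM
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, testkernel, nTPB, 0);
int nblock = blocks_per_sm * num_sm;   // upper bound on blocks for a persistent launch of this kernel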

I'm also assuming a proper UVA setup, so that I can skip the details of arranging for proper mapped memory allocations in a non-UVA setup.
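
For reference, in a non-UVA setup the "mailbox" allocations would need roughly the following (a sketch only, not used in the code below): enable mapped memory before any allocations, then obtain a separate device-side pointer to pass to the kernel.

cudaSetDeviceFlags(cudaDeviceMapHost);                 // must precede any CUDA allocations/context creation
int *h_bufrdy1 = NULL, *d_bufrdy1 = NULL;
cudaHostAlloc(&h_bufrdy1, sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_bufrdy1, h_bufrdy1, 0);    // device-side alias of the same "mailbox"
// pass d_bufrdy1 to the kernel; poll/reset h_bufrdy1 on the host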

The basic idea is that we will have 2 buffers on the device, along with 2 "mailboxes" in mapped memory, one for each buffer. The device kernel will fill a buffer with data, then set the "mailbox" to a value (2, in this case) that indicates the host may "consume" the buffer. The device then goes on to the other buffer and repeats the process in a ping-pong fashion between buffers. In order to make this work we must make sure that the device itself has not overrun the buffers (no thread is allowed to be more than one buffer ahead of any other thread) and that before a buffer is populated by the device, the host has consumed the previous contents.

On the host side, the loop simply waits for the mailbox to indicate "full", copies the buffer from device to host, resets the mailbox, and performs the "processing" on it (the validate function). It then goes on to the next buffer in a ping-pong fashion. The actual data "production" by the device is just to fill each buffer with the iteration number. The host then checks to see that the proper iteration number was received.

I've structured the code to call out the actual device "work" function (my_compute_function) which is where you would put whatever your Monte Carlo code is. If your code is nicely thread-independent, this should be straightforward. Thus the device side my_compute_function is the producer function, and the host side validate is the consumer function. If your device producer code is not simply thread independent, then you may need to restructure things slightly around the calling point to my_compute_function.

The net effect of this is that the device can "race ahead" and begin filling the next buffer, while the host is "consuming" the data in the previous buffer.

Because persistent kernel design imposes an upper bound on the number of blocks (and threads) in a kernel launch, I've chosen to implement the "work" producer function in a grid-striding loop, so that arbitrary size buffers can be handled by the given grid-width.

Here's a fully worked example:

$ cat t942.cu
#include <stdio.h>

#define ITERS 1000
#define DSIZE 65536
#define nTPB 256

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


__device__ volatile int blkcnt1 = 0;
__device__ volatile int blkcnt2 = 0;
__device__ volatile int itercnt = 0;

__device__ void my_compute_function(int *buf, int idx, int data){
  buf[idx] = data;  // put your work code here
}

__global__ void testkernel(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready,  const int buffersize, const int iterations){
  // assumption of persistent block-limited kernel launch
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int iter_count = 0;
  while (iter_count < iterations ){ // persistent until iterations complete
    int *buf = (iter_count & 1)? buffer2:buffer1; // ping pong between buffers
    volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
    volatile int *blkcnt = (iter_count & 1)?(&blkcnt2):(&blkcnt1);
    int my_idx = idx;
    while (iter_count - itercnt > 1); // don't overrun buffers on device
    while (*bufrdy == 2);  // wait for buffer to be consumed
    while (my_idx < buffersize){ // perform the "work"
      my_compute_function(buf, my_idx, iter_count);
      my_idx += gridDim.x*blockDim.x; // grid-striding loop
      }
    __syncthreads(); // wait for my block to finish
    __threadfence(); // make sure global buffer writes are "visible"
    if (!threadIdx.x) atomicAdd((int *)blkcnt, 1); // mark my block done
    if (!idx){ // am I the master block/thread?
      while (*blkcnt < gridDim.x);  // wait for all blocks to finish
      *blkcnt = 0;
      *bufrdy = 2;  // indicate that buffer is ready
      __threadfence_system(); // push it out to mapped memory
      itercnt++;
      }
    iter_count++;
    }
}

int validate(const int *data, const int dsize, const int val){

  for (int i = 0; i < dsize; i++) if (data[i] != val) {printf("mismatch at %d, was: %d, should be: %d\n", i, data[i], val); return 0;}
  return 1;
}

int main(){

  int *h_buf1, *d_buf1, *h_buf2, *d_buf2;
  volatile int *m_bufrdy1, *m_bufrdy2;
  // buffer and "mailbox" setup
  cudaHostAlloc(&h_buf1, DSIZE*sizeof(int), cudaHostAllocDefault);
  cudaHostAlloc(&h_buf2, DSIZE*sizeof(int), cudaHostAllocDefault);
  cudaHostAlloc(&m_bufrdy1, sizeof(int), cudaHostAllocMapped);
  cudaHostAlloc(&m_bufrdy2, sizeof(int), cudaHostAllocMapped);
  cudaCheckErrors("cudaHostAlloc fail");
  cudaMalloc(&d_buf1, DSIZE*sizeof(int));
  cudaMalloc(&d_buf2, DSIZE*sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  cudaStream_t streamk, streamc;
  cudaStreamCreate(&streamk);
  cudaStreamCreate(&streamc);
  cudaCheckErrors("cudaStreamCreate fail");
  *m_bufrdy1 = 0;
  *m_bufrdy2 = 0;
  cudaMemset(d_buf1, 0xFF, DSIZE*sizeof(int));
  cudaMemset(d_buf2, 0xFF, DSIZE*sizeof(int));
  cudaCheckErrors("cudaMemset fail");
  // inefficient crutch for choosing number of blocks
  int nblock = 0;
  cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
  cudaCheckErrors("get multiprocessor count fail");
  testkernel<<<nblock, nTPB, 0, streamk>>>(d_buf1, d_buf2, m_bufrdy1, m_bufrdy2, DSIZE, ITERS);
  cudaCheckErrors("kernel launch fail");
  volatile int *bufrdy;
  int *hbuf, *dbuf;
  for (int i = 0; i < ITERS; i++){
    if (i & 1){  // ping pong on the host side
      bufrdy = m_bufrdy2;
      hbuf = h_buf2;
      dbuf = d_buf2;}
    else {
      bufrdy = m_bufrdy1;
      hbuf = h_buf1;
      dbuf = d_buf1;}
    // int qq = 0; // add for failsafe - otherwise a machine failure can hang
    while ((*bufrdy)!= 2); // use this for a failsafe:  if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
    cudaMemcpyAsync(hbuf, dbuf, DSIZE*sizeof(int), cudaMemcpyDeviceToHost, streamc);
    cudaStreamSynchronize(streamc);
    cudaCheckErrors("cudaMemcpyAsync fail");
    *bufrdy = 0; // release buffer back to device
    if (!validate(hbuf, DSIZE, i)) {printf("validation failure at iter %d\n", i); exit(1);}
    }
 printf("Completed %d iterations successfully
", ITERS);
}


$ nvcc -o t942 t942.cu
$ ./t942
Completed 1000 iterations successfully
$

I've tested the above code and it seems to work well on Linux. I believe it should be OK on a Windows TCC setup. On Windows WDDM, however, I think there are issues that I am still investigating.

Note that the above kernel design attempts to do a grid-wide synchronization using a block-counting atomic strategy. CUDA now (9.0 and newer) has cooperative groups, and that is the recommended approach, rather than the above methodology, to create a grid-wide sync.
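
For reference, here is a minimal sketch of what the kernel loop could look like with a cooperative-groups grid sync replacing the block-counting atomics. It requires a compute capability 6.0+ device, compiling for that architecture, and a cooperative launch; it assumes the my_compute_function, buffer and mailbox setup from the example above, and is an illustration rather than a tested drop-in replacement:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void testkernel_cg(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready, const int buffersize, const int iterations){
  cg::grid_group grid = cg::this_grid();
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  for (int iter_count = 0; iter_count < iterations; iter_count++){
    int *buf = (iter_count & 1)? buffer2:buffer1;                        // ping pong between buffers
    volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
    while (*bufrdy == 2);                                                // wait for host to consume this buffer
    for (int i = idx; i < buffersize; i += gridDim.x*blockDim.x)
      my_compute_function(buf, i, iter_count);                           // produce into the current buffer
    grid.sync();                                                         // grid-wide barrier replaces blkcnt1/blkcnt2
    if (!idx){ *bufrdy = 2; __threadfence_system(); }                    // signal the host mailbox
    grid.sync();                                                         // nobody races ahead into the next buffer
  }
}
// must be launched cooperatively, e.g.:
//   void *args[] = {&d_buf1, &d_buf2, &m_bufrdy1, &m_bufrdy2, &dsize, &iters};
//   cudaLaunchCooperativeKernel((void *)testkernel_cg, nblock, nTPB, args, 0, streamk);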

