In cuBLAS, cublasIsamin()
gives the argmin for a single-precision array.
Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n,
const float *x, int incx, int *result)
The cuBLAS programmer guide provides this information about the cublasIsamin()
parameters:
If I use host (CPU) memory for result
, then cublasIsamin
works properly. Here's an example:
void argmin_experiment_hostOutput(){
float h_A[4] = {1, 2, 3, 4}; int N = 4;
float* d_A = 0;
CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));
int result; //host memory
CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, &result));
printf("argmin = %d, min = %f
", result, h_A[result]);
CHECK_CUBLAS(cublasDestroy(handle));
}
However, if I use device (GPU) memory for result
, then cublasIsamin
segfaults. Here's an example that segfaults:
void argmin_experiment_deviceOutput(){
float h_A[4] = {1, 2, 3, 4}; int N = 4;
float* d_A = 0;
CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));
int* d_result = 0;
CHECK_CUDART(cudaMalloc((void**)&d_result, 1 * sizeof(d_result[0]))); //just enough device memory for 1 result
CHECK_CUDART(cudaMemset(d_result, 0, 1 * sizeof(d_result[0])));
CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, d_result)); //SEGFAULT!
CHECK_CUBLAS(cublasDestroy(handle));
}
The Nvidia guide says that `cublasIsamin()` can output to device memory. What am I doing wrong?
Motivation: I want to compute the argmin() of several vectors concurrently in multiple streams. Outputting to host memory requires CPU-GPU synchronization and seems to kill the multi-kernel concurrency. So, I want to output the argmin to device memory instead.
See Question&Answers more detail:
os