Since OpenCV allocates host memory for cv::Mat, you can't use Mat and related OpenCV APIs inside a kernel the way you would in host code. So you have to write your own kernel for the matrix multiplication.
OpenCV also provides a class called cv::cuda::GpuMat, for which it allocates device memory. However, the APIs related to GpuMat are meant to be called from host code, so for matrix multiplication you have to write your own kernel anyway.
That said, I sometimes find some of the GpuMat APIs convenient, such as allocating device memory through its constructor and copying data between a host matrix and a device matrix with upload() and download(). Also, the GpuMat class keeps your matrix's attributes such as rows, cols, type(), step, etc. in a single structure. This can come in handy in some cases.
The following sample code uses GpuMat.
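#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>

using namespace cv;

// Each thread computes one element of c = a * b.
// The *_step arguments are row strides in bytes (GpuMat::step).
// The kernel is defined before main() so it is declared at the launch site.
__global__ void MatMul(const double* const a, const int a_step,
                       const double* const b, const int b_step,
                       double* const c, const int c_step,
                       const int a_cols)
{
    int c_row = threadIdx.y;
    int c_col = threadIdx.x;
    double sum = 0;
    for (int i = 0; i < a_cols; i++)
        sum += ((double*)((unsigned char*)a + c_row * a_step))[i]
             * ((double*)((unsigned char*)b + i * b_step))[c_col];
    ((double*)((unsigned char*)c + c_row * c_step))[c_col] = sum;
}

int main(void)
{
    // Parentheses rather than braces, so that the (rows, cols, type)
    // constructor is selected and not an initializer-list one.
    Mat a(10, 1, CV_64FC1);   // 10x1 matrix
    Mat b(1, 10, CV_64FC1);   // 1x10 matrix
    Mat c(10, 10, CV_64FC1);  // multiplying a and b results in a 10x10 matrix
    a.setTo(Scalar(2.2));
    b.setTo(Scalar(3.35));
    cv::cuda::GpuMat d_a(a.rows, a.cols, CV_64FC1);
    cv::cuda::GpuMat d_b(b.rows, b.cols, CV_64FC1);
    cv::cuda::GpuMat d_c(c.rows, c.cols, CV_64FC1);
    d_a.upload(a);
    d_b.upload(b);
    // A single block with one thread per element of c.
    // GpuMat::step is a size_t; it is narrowed to int for the kernel.
    MatMul<<<1, dim3(c.cols, c.rows)>>>((double*)d_a.data, (int)d_a.step,
                                        (double*)d_b.data, (int)d_b.step,
                                        (double*)d_c.data, (int)d_c.step,
                                        a.cols);
    d_c.download(c);
    return 0;
}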
Note that if the number of elements of c, the result matrix, exceeds the maximum number of threads per block (1024 for compute capability >= 2.0), this code won't work. In that case the kernel and its launch have to be designed differently.
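One common way around that limit is to launch several smaller blocks and let each thread derive its global row and column from its block and thread indices, skipping threads that fall past the matrix edge. The following is only a host-side C++ sketch that emulates that index arithmetic with loops; it is not CUDA code and not the code from this answer. Dim2, divUp and matMulGridEmulated are hypothetical names, and the matrices are assumed to be dense row-major (no step padding) to keep it short.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for CUDA's dim3, reduced to two dimensions.
struct Dim2 { int x, y; };

// Blocks needed to cover n elements with blocks of size b (ceiling division).
inline int divUp(int n, int b) { return (n + b - 1) / b; }

// CPU emulation of a grid launch computing C (m x n) = A (m x k) * B (k x n).
// The four outer loops play the role of blockIdx.y/x and threadIdx.y/x.
inline std::vector<double> matMulGridEmulated(const std::vector<double>& a,
                                              const std::vector<double>& b,
                                              int m, int k, int n, Dim2 block)
{
    std::vector<double> c(m * n, 0.0);
    const Dim2 grid{ divUp(n, block.x), divUp(m, block.y) };
    for (int by = 0; by < grid.y; ++by)
    for (int bx = 0; bx < grid.x; ++bx)
    for (int ty = 0; ty < block.y; ++ty)
    for (int tx = 0; tx < block.x; ++tx) {
        // In a kernel: blockIdx.y * blockDim.y + threadIdx.y (same for x).
        const int row = by * block.y + ty;
        const int col = bx * block.x + tx;
        if (row >= m || col >= n) continue;  // thread lies outside the matrix
        double sum = 0.0;
        for (int i = 0; i < k; ++i)
            sum += a[row * k + i] * b[i * n + col];
        c[row * n + col] = sum;
    }
    return c;
}
```

In an actual kernel the four outer loops disappear, the launch becomes something like <<<grid, block>>> with a 16x16 block, and the row accesses would go through the step-based pointer arithmetic used in the code above.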
EDIT
((double*)((unsigned char*)c + c_row * c_step))[c_col];
The above statement accesses the element in the c_row-th row and c_col-th column of matrix c. This matrix is a single-channel matrix and its element type is double. Its step is given by c_step. In OpenCV, step refers to the number of bytes allocated per row. It is larger than or equal to the total size of the actual pixels in each row, so that rows meet memory alignment requirements, which in turn makes memory access faster.
The above statement first casts c (which is of type double*) to unsigned char*, since c_step is counted in bytes. Adding c_row * c_step to (unsigned char*)c gives a pointer to the 0-th column of the c_row-th row. It then casts that pointer back to double* to access the c_col-th column with the standard array access operator [].
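The same cast-add-cast sequence can be wrapped in a small helper. This is a minimal host-side C++ sketch, not part of OpenCV: rowPtr and pitchedDemo are hypothetical names, and the padded std::vector only imitates a buffer whose rows are step bytes apart.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: pointer to column 0 of `row` in a buffer whose
// consecutive rows are `step` bytes apart.
inline double* rowPtr(double* base, std::size_t step, std::size_t row)
{
    // Go through unsigned char* because step is counted in bytes.
    return reinterpret_cast<double*>(
        reinterpret_cast<unsigned char*>(base) + row * step);
}

// Writes one element of a 3x2 matrix stored with two padding doubles per
// row, then reads it back through plain indexing into the storage.
inline double pitchedDemo()
{
    const std::size_t rows = 3;
    const std::size_t paddedCols = 4;  // 2 used columns + 2 padding
    const std::size_t step = paddedCols * sizeof(double);  // 32 bytes per row
    std::vector<double> storage(rows * paddedCols, -1.0);
    rowPtr(storage.data(), step, 2)[1] = 42.0;  // row 2, column 1
    return storage[2 * paddedCols + 1];
}
```

To use the same helper inside a kernel, it would need to be marked __device__ (or __host__ __device__).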