I'm new to cuda and thrust, what I'm doing is using cuda (specifically thrust) to implement a 1-dimensional bitmap
or bitset mapping of a billion int numbers in a set( unduplicated numbers greater than 0), mapping each number to a binary bit, the bitmap is implemented with a unsigned int array, I get the index of the array by performing /32 calculation, and get the position of the binary bits by %32 operation.
My question is, the elements in my int-number set are disordered, so random global memory accesses make my code less efficient when using the for_each function in thrust, if the array is sorted, the execution time will be more than 10 times faster. Is there any way to optimize my method or cuda's random memory accesses? Thanks in advance!
The main code is as follows, the d_set's type is device_vector, bitmap's type is device_vector and initialized by fill with zeros
struct map_functor
{
unsigned int* d_ptr;
map_functor(unsigned int * a) : d_ptr(a) {}
__device__
void operator()(int x)
{
// note that using printf in a __device__ function requires
// code compiled for a GPU with compute capability 2.0 or
// higher (nvcc --arch=sm_20)
int i = x / 32; //index
int j = x % 32; //position
atomicOr(&d_ptr[i], (1 << j));
}
};
thrust::for_each(d_set.begin(), d_set.end(), map_functor(thrust::raw_pointer_cast(bitmap.data())));//for_each 测试
Here is the comparison of execution time before sorting and after sorting.
Before:
![Before Sorting](https://i.stack.imgur.com/eTV62.png)
After:
![enter image description here](https://i.stack.imgur.com/ekdcA.png)
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…