While unaligned accesses are not allowed in CUDA, the prmt
PTX instruction has a handy mode to emulate the effect of unaligned reads within registers. This can be exposed with a bit of inline PTX assembly. If you can tolerate a read past the end of the array, the code becomes quite simple:
// WARNING! Reads past ptr!
__device__ uint32_t read_unaligned(void* ptr)
{
uint32_t result;
asm("{
"
" .reg .b64 aligned_ptr;
"
" .reg .b32 low, high, alignment;
"
" and.b64 aligned_ptr, %1, 0xfffffffffffffffc;
"
" ld.u32 low, [aligned_ptr];
"
" ld.u32 high, [aligned_ptr+4];
"
" cvt.u32.u64 alignment, %1;
"
" prmt.b32.f4e %0, low, high, alignment;
"
"}"
: "=r"(result) : "l"(ptr));
return result;
}
To ensure the access past the end of the array remains harmless, round up the number of allocated byte to a multiple of 4, and add another 4 bytes.
Above device code has the same effect as the following code on a little-endian host that tolerates unaligned accesses:
__host__ uint32_t read_unaligned_host(void* ptr)
{
return *(uint32_t*)ptr;
}
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…