I wanted to get some advice on the most efficient way to interact with my data on the GPU. The data is stored in a Python dictionary of SortedDicts. My plan was to transfer/transform the dictionary data into the following format and send it to the device:
Row_i: [key_index_0: float_0, key_index_1: float_1, ..., key_index_n: float_n]

where each key_index: float pair is stored in a struct and all key_index values are already in sorted order. Then I want to perform some calculation (for example, a dot product) on all matching key values present in both A_i and B_j. Below is an example:
Let A = [ [key_0: float_0, key_1: float_1, ..., key_3: float_3] ]
Let B = [ [key_0: float_0, key_1: float_1, ..., key_5: float_5],
          [key_0: float_0, key_1: float_1, ..., key_10: float_10] ]
Dot_product_results = [value, value] = (A_i · B_j)
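To make the operation concrete: since the keys inside each row are already sorted, the "dot product on matching keys" I have in mind is just a two-pointer merge. Here is a rough CPU reference (the struct and function names are only placeholders, not my actual code):

```cpp
// CPU reference for the matching-key dot product (placeholder names).
struct KeyVal {
    int   key;   // sorted ascending within a row
    float val;
};

// Classic two-pointer merge over two key-sorted rows.
float dot_matching(const KeyVal* a, int na, const KeyVal* b, int nb)
{
    float sum = 0.0f;
    int i = 0, j = 0;
    while (i < na && j < nb) {
        if      (a[i].key < b[j].key) ++i;
        else if (a[i].key > b[j].key) ++j;
        else { sum += a[i].val * b[j].val; ++i; ++j; }
    }
    return sum;
}
```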
I’m currently considering the following three approaches:
- Manually refactor/flatten the jagged array into a flat (1D) array. In this approach, I believe I would need one array containing the entries of all categories and one (1D) array that stores the start offset of each individual sub-array. For example, the jagged array [(1:0.5), (2:3.43), (3:0.00032)], [(2:1.0), (6:1.5)] would be transformed into data = [(1:0.5), (2:3.43), (3:0.00032), (2:1.0), (6:1.5)] and index = [0, 3]. I am hoping there is a way to use this data structure without pointer chasing, with coalesced memory access, and while somehow taking advantage of shared memory (see the first sketch after this list).
- The second approach would be to create a 2D matrix whose row length equals the largest array in my data, and then pad the shorter rows with empty structs to ensure uniformity. In theory the B matrix will be larger than the A matrix, so I would keep B in column-major order to ensure memory coalescing on the GPU. That means the A matrix would need to be transposed, so that its x-dimension contains the categories and its y-dimension the key-values. In this scenario the calculation would look more like (B_i · A_j). I think this approach would be horrible for GPU memory, as less than 0.05% of the columns would be filled with non-empty structs. To mitigate that, I could use a sparse matrix representation to reduce the memory overhead (see the second sketch after this list).
- My final approach would be the more traditional SpMV/SpMM route, using a representation such as ELLPACK (ELL). My data would no longer be stored in a struct; instead the column indices would map to the keys, the row indices would map to the category index, and the value stored in each cell would be the float (see the third sketch after this list).
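For the first approach, here is roughly the kernel I have in mind (untested, all names are placeholders). It reuses the KeyVal struct from the reference above, and I store num_rows + 1 offsets so that row r spans [off[r], off[r+1]):

```cpp
// Approach 1 sketch: flattened jagged array + CSR-style offsets (untested).
// a_data/b_data hold all rows back to back; a_off/b_off have num_rows + 1
// entries, so row r occupies [off[r], off[r+1]).
__global__ void dot_flat_kernel(const KeyVal* a_data, const int* a_off, int a_row,
                                const KeyVal* b_data, const int* b_off, int num_b_rows,
                                float* out)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per B row
    if (r >= num_b_rows) return;

    int i = a_off[a_row], i_end = a_off[a_row + 1];
    int j = b_off[r],     j_end = b_off[r + 1];

    float sum = 0.0f;
    while (i < i_end && j < j_end) {                 // merge of two sorted key lists
        int ka = a_data[i].key;
        int kb = b_data[j].key;
        if      (ka < kb) ++i;
        else if (ka > kb) ++j;
        else { sum += a_data[i].val * b_data[j].val; ++i; ++j; }
    }
    out[r] = sum;                                    // Dot_product_results[r]
}
```

One thing I am unsure about here: with one thread per B row, each thread walks its row sequentially rather than having neighbouring threads read neighbouring addresses, so I might end up splitting the struct into separate key/value arrays (SoA) or letting a whole warp cooperate on one row to get proper coalescing.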
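For the second (padded, column-major) approach, the indexing I am picturing looks like this. I split the struct into separate key/value arrays so the empty slots are easy to test via a sentinel key, and I keep A as a single flattened sorted row for simplicity; with the padded AoS struct it would be analogous (placeholder names, untested):

```cpp
// Approach 2 sketch: B padded to max_len entries per row, stored column-major.
#define EMPTY_KEY (-1)

// Slot k of row r lives at index k * num_b_rows + r, so threads with consecutive
// r read consecutive addresses for a fixed k -> coalesced loads.
__global__ void dot_padded_kernel(const int* b_keys, const float* b_vals,
                                  int num_b_rows, int max_len,
                                  const int* a_keys, const float* a_vals, int a_len,
                                  float* out)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= num_b_rows) return;

    float sum = 0.0f;
    int i = 0;                                        // cursor into the short A row
    for (int k = 0; k < max_len; ++k) {
        int kb = b_keys[k * num_b_rows + r];
        if (kb == EMPTY_KEY) break;                   // rest of this row is padding
        while (i < a_len && a_keys[i] < kb) ++i;      // both key lists are sorted
        if (i < a_len && a_keys[i] == kb)
            sum += a_vals[i] * b_vals[k * num_b_rows + r];
    }
    out[r] = sum;
}
```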
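And for the third (ELL) approach, the storage I mean is something like the following; the per-pair computation would then be essentially the same merge as in the first sketch, just reading col_idx/values instead of a struct (again, placeholder names):

```cpp
// Approach 3 sketch: ELLPACK-style storage. Rows are the "categories", the
// column indices are the keys themselves, and the values are the floats.
struct EllMatrix {
    int    num_rows;
    int    max_nnz_per_row;   // width of the padded slab
    int*   col_idx;           // num_rows * max_nnz_per_row, padded with -1
    float* values;            // num_rows * max_nnz_per_row, padded with 0.0f
};
// ELL slabs are typically laid out slot-major, i.e. entry (row r, slot k) sits
// at k * num_rows + r, which gives the same coalescing as the sketch above.
```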
So, any suggestions, alternative ideas, or any help would be greatly appreciated.
question from:
https://stackoverflow.com/questions/65865724/cuda-data-structure-for-jagged-static-and-sparse-data