As mentioned in the comments, there is no way to "vectorize" the combinations_with_replacement()
call from the itertools
library directly (with Numba CUDA). Numba CUDA doesn't work that way.
However, I believe it should be possible to generate an equivalent result dataset, using Numba CUDA, in a way that seems to run faster than the itertools
library function for certain cases. I imagine there are probably a number of ways to accomplish this, and I make no claims that the method I describe is in any way optimal. It certainly is not, and could certainly be made to run faster. However according to my testing, even this not-very-optimized approach can run a particular test case about 10x faster than python itertools
or so on a V100 GPU.
As background, I consider this and this (or equivalent material) to be essential reading.
From the above, the formula for the number of combinations of n
items with k
choices, with replacement, is given by:
(n-1+k)!
-------- (Formula 1)
(n-1)!k!
In the code below, I have encapsulated the above calculation in count_comb_with_repl
(device) and host_count_comb_with_repl
(host) functions. It turns out we can use this one basic calculation, with a cascading-smaller sequence of choices for n
and k
, to drive the entire calculation process to compute a combination given only an index into the final result array. To visualize what we are doing, it helps to have a simple example. Let's take the case of 3 items, and 3 choices. Indexing items from zero, the array of possibilities looks like this:
n = 3, k = 3
index choices first digit calculation
0 0,0,0 -----------------
1 0,0,1
2 0,0,2
3 0,1,1 equivalent to n = 3, k = 2
4 0,1,2
5 0,2,2 -----------------
6 1,1,1 -----------------
7 1,1,2 equivalent to n = 2, k = 2
8 1,2,2 -----------------
9 2,2,2 equivalent to n = 1, k = 2
The length of the above list is given by plugging the values of n
= 3 and k
= 3 into formula 1. The key observation to understanding the method I present is that to compute the first digit of the choices result given only the index, we can compute the dividing points between 0, and 1 for example by observing that considering the results where the first choice index is 0, the length of this range is given by plugging the values of n
= 3 and k
= 2 into formula 1. Therefore if our given index is less than this value (6) then we know the first digit is 0. If it is greater than this value then we know the first digit is 1 or 2, and with suitable offsetting we can recompute the next range (corresponding to first digit of 1) and see if our index falls within this range.
Once we know the first digit, we can repeat the process (with suitable list reduction and offsetting) to find the next digit, and the next digit, etc.
Here is a python code that implements the above method. As I mentioned, for a test case of n=260
and k=4
this runs in less than 3 seconds on my V100.
$ cat t2.py
from numba import cuda,jit
import numpy as np
@cuda.jit(device=True)
def get_next_count_comb_with_repl(n,k,prev):
return int(round((prev*(n))/(n+k)))
@cuda.jit(device=True)
def count_comb_with_repl(n,k):
mymax = max(n-1,k)
ans = 1.0
cnt = 1
for i in range(mymax+1, n+k):
ans = ans*i/cnt
cnt += 1
return int(round(ans))
#intended to be identical to the previous function
#I just need a version I can call from host code
def host_count_comb_with_repl(n,k):
mymax = max(n-1,k)
ans = 1.0
cnt = 1
for i in range(mymax+1, n+k):
ans = ans*i/cnt
cnt += 1
return int(round(ans))
@cuda.jit(device=True)
def find_first_digit(n,k,i):
psum = 0
count = count_comb_with_repl(n, k-1)
if (i-psum) < count:
return 0,psum
psum += count
for j in range(1,n):
count = get_next_count_comb_with_repl(n-j,k-1,count)
if (i-psum) < count:
return j,psum
psum += count
return -1,0 # error
@cuda.jit
def kernel_count_comb_with_repl(n, k, l, r):
for i in range(cuda.grid(1), l, cuda.gridsize(1)):
new_ll = n
new_cc = k
new_i = i
new_digit = 0
for j in range(k):
digit,psum = find_first_digit(new_ll, new_cc, new_i)
new_digit += digit
new_ll -= digit
new_cc -= 1
new_i -= psum
r[i+j*l] = new_digit
combo_count = 4
ll = 260
cl = host_count_comb_with_repl(ll, combo_count)
print(cl)
# bug if cl > 2G
if cl < 2**31:
my_dtype = np.uint8
if ll > 255:
my_dtype = np.uint16
r = np.empty(cl*combo_count, dtype=my_dtype)
d_r = cuda.device_array_like(r)
block = 256
grid = (cl//block)+1
#grid = 640
kernel_count_comb_with_repl[grid,block](ll, combo_count, cl, d_r)
r = d_r.copy_to_host()
print(r.reshape(combo_count,cl))
$ time python t2.py
194831715
[[ 0 0 0 ... 258 258 259]
[ 0 0 0 ... 258 259 259]
[ 0 0 0 ... 259 259 259]
[ 0 1 2 ... 259 259 259]]
real 0m2.212s
user 0m1.110s
sys 0m1.077s
$
(The above test case: n=260, k = 4, takes ~30s on my system using OP's code.)
This should be considered to be a sketch of an idea. I make no claims that it is defect free. This type of problem can quickly exhaust the memory on a GPU (for large enough choices of n
and/or k
), and your only indication of that would probably be a crude out of memory error from numba.
Yes, the above code does not produce concatenations of aa
through jz
but this is just an indexing exercise using the result. You would use the result indices to index into your array of items, as needed to convert a result like 0,0,0,1 to a result like aa
,aa
,aa
,ab
This isn't a performance win across the board. They python method is still faster for smaller test cases, and larger test cases (e.g. n = 260, k = 5) will exceed available memory on the GPU.