Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
62 views
in Technique[技术] by (71.8m points)

python - Vectorize itertools combination_with_replacement

My goal is to speed up the creation of a list of combinations by using my GPU. How can I accomplish this?

By way of example, the following code creates a list of 260 text strings ranging from "aa" through "jz". We then use itertools combinations_with_replacement() to create all possible combinations of R elements of this list. The use of timeit shows that, beyond 3 elements, extracting a list of these combinations slows exponentially. I suspect this can be done with numba cuda, but I don't know how.

import timeit
timeit.timeit('''

from itertools import combinations_with_replacement

combo_count = 2

alphabet = 'a'
alpha_list = []
item_list = []

for i in range(0,26):
    alpha_list.append(alphabet)
    alphabet = chr(ord(alphabet) + 1)

for first_letter in alpha_list[0:10]:
    for second_letter in alpha_list:
        item_list.append(first_letter+second_letter)

print("Length of item list:",len(item_list))
    
combos = combinations_with_replacement(item_list,combo_count)
cmb_lst = [bla for bla in combos]
print("Length of list of all {} combinations: {}".format(combo_count,len(cmb_lst)))
              ''', number=1)
question from:https://stackoverflow.com/questions/65909331/vectorize-itertools-combination-with-replacement

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

As mentioned in the comments, there is no way to "vectorize" the combinations_with_replacement() call from the itertools library directly (with Numba CUDA). Numba CUDA doesn't work that way.

However, I believe it should be possible to generate an equivalent result dataset, using Numba CUDA, in a way that seems to run faster than the itertools library function for certain cases. I imagine there are probably a number of ways to accomplish this, and I make no claims that the method I describe is in any way optimal. It certainly is not, and could certainly be made to run faster. However according to my testing, even this not-very-optimized approach can run a particular test case about 10x faster than python itertools or so on a V100 GPU.

As background, I consider this and this (or equivalent material) to be essential reading.

From the above, the formula for the number of combinations of n items with k choices, with replacement, is given by:

(n-1+k)!
--------   (Formula 1)
(n-1)!k!

In the code below, I have encapsulated the above calculation in count_comb_with_repl (device) and host_count_comb_with_repl (host) functions. It turns out we can use this one basic calculation, with a cascading-smaller sequence of choices for n and k, to drive the entire calculation process to compute a combination given only an index into the final result array. To visualize what we are doing, it helps to have a simple example. Let's take the case of 3 items, and 3 choices. Indexing items from zero, the array of possibilities looks like this:

n = 3, k = 3
index    choices  first digit calculation
0        0,0,0    -----------------
1        0,0,1
2        0,0,2
3        0,1,1     equivalent to n = 3, k = 2
4        0,1,2
5        0,2,2    -----------------
6        1,1,1    -----------------
7        1,1,2     equivalent to n = 2, k = 2
8        1,2,2    -----------------
9        2,2,2     equivalent to n = 1, k = 2

The length of the above list is given by plugging the values of n = 3 and k = 3 into formula 1. The key observation to understanding the method I present is that to compute the first digit of the choices result given only the index, we can compute the dividing points between 0, and 1 for example by observing that considering the results where the first choice index is 0, the length of this range is given by plugging the values of n = 3 and k = 2 into formula 1. Therefore if our given index is less than this value (6) then we know the first digit is 0. If it is greater than this value then we know the first digit is 1 or 2, and with suitable offsetting we can recompute the next range (corresponding to first digit of 1) and see if our index falls within this range.

Once we know the first digit, we can repeat the process (with suitable list reduction and offsetting) to find the next digit, and the next digit, etc.

Here is a python code that implements the above method. As I mentioned, for a test case of n=260 and k=4 this runs in less than 3 seconds on my V100.

$ cat t2.py
from numba import cuda,jit
import numpy as np

@cuda.jit(device=True)
def get_next_count_comb_with_repl(n,k,prev):
    return int(round((prev*(n))/(n+k)))

@cuda.jit(device=True)
def count_comb_with_repl(n,k):
    mymax = max(n-1,k)
    ans = 1.0
    cnt = 1
    for i in range(mymax+1, n+k):
        ans = ans*i/cnt
        cnt += 1
    return int(round(ans))
#intended to be identical to the previous function
#I just need a version I can call from host code
def host_count_comb_with_repl(n,k):
    mymax = max(n-1,k)
    ans = 1.0
    cnt = 1
    for i in range(mymax+1, n+k):
        ans = ans*i/cnt
        cnt += 1
    return int(round(ans))

@cuda.jit(device=True)
def find_first_digit(n,k,i):
    psum = 0
    count = count_comb_with_repl(n, k-1)
    if (i-psum) < count:
        return 0,psum
    psum += count
    for j in range(1,n):
        count = get_next_count_comb_with_repl(n-j,k-1,count)
        if (i-psum) < count:
            return j,psum
        psum += count
    return -1,0 # error

@cuda.jit
def kernel_count_comb_with_repl(n, k, l, r):
    for i in range(cuda.grid(1), l, cuda.gridsize(1)):
        new_ll = n
        new_cc = k
        new_i = i
        new_digit = 0
        for j in range(k):
            digit,psum = find_first_digit(new_ll, new_cc, new_i)
            new_digit += digit
            new_ll -= digit
            new_cc -= 1
            new_i  -= psum
            r[i+j*l] = new_digit

combo_count = 4
ll = 260
cl = host_count_comb_with_repl(ll, combo_count)
print(cl)
# bug if cl > 2G
if cl < 2**31:
    my_dtype = np.uint8
    if ll > 255:
        my_dtype = np.uint16
    r   = np.empty(cl*combo_count, dtype=my_dtype)
    d_r = cuda.device_array_like(r)
    block = 256
    grid = (cl//block)+1
    #grid = 640
    kernel_count_comb_with_repl[grid,block](ll, combo_count, cl, d_r)
    r = d_r.copy_to_host()
    print(r.reshape(combo_count,cl))
$ time python t2.py
194831715
[[  0   0   0 ... 258 258 259]
 [  0   0   0 ... 258 259 259]
 [  0   0   0 ... 259 259 259]
 [  0   1   2 ... 259 259 259]]

real    0m2.212s
user    0m1.110s
sys     0m1.077s
$

(The above test case: n=260, k = 4, takes ~30s on my system using OP's code.)

This should be considered to be a sketch of an idea. I make no claims that it is defect free. This type of problem can quickly exhaust the memory on a GPU (for large enough choices of n and/or k), and your only indication of that would probably be a crude out of memory error from numba.

Yes, the above code does not produce concatenations of aa through jz but this is just an indexing exercise using the result. You would use the result indices to index into your array of items, as needed to convert a result like 0,0,0,1 to a result like aa,aa,aa,ab

This isn't a performance win across the board. They python method is still faster for smaller test cases, and larger test cases (e.g. n = 260, k = 5) will exceed available memory on the GPU.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...