Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
463 views
in Technique[技术] by (71.8m points)

python - Write double (triple) sum as inner product?

Since my np.dot is accelerated by OpenBlas and Openmpi I am wondering if there was a possibility to write the double sum

for i in range(N):
     for j in range(N):
         B[k,l] += A[i,j,k,l] * X[i,j]

as an inner product. Right at the moment I am using

B = np.einsum("ijkl,ij->kl",A,X)

but unfortunately it is quite slow and only uses one processor. Any ideas?

Edit: I benchmarked the answers given until now with a simple example, seems like they are all in the same order of magnitude:

A = np.random.random([200,200,100,100])
X = np.random.random([200,200])
def B1():
    return es("ijkl,ij->kl",A,X) 
def B2():
    return np.tensordot(A, X, [[0,1], [0, 1]])
def B3():
    shp = A.shape
    return np.dot(X.ravel(),A.reshape(shp[0]*shp[1],1)).reshape(shp[2],shp[3])

%timeit B1()
%timeit B2()
%timeit B3()

1 loops, best of 3: 300 ms per loop
10 loops, best of 3: 149 ms per loop
10 loops, best of 3: 150 ms per loop

Concluding from these results I would choose np.einsum, since its syntax is still the most readable and the improvement with the other two are only a factor 2x. I guess the next step would be to externalize the code into C or fortran.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can use np.tensordot():

np.tensordot(A, X, [[0,1], [0, 1]])

which does use multiple cores.


EDIT: it is insteresting to see how np.einsum and np.tensordot scale when increasing the size of the input arrays:

In [18]: for n in range(1, 31):
   ....:     A = np.random.rand(n, n+1, n+2, n+3)
   ....:     X = np.random.rand(n, n+1)
   ....:     print(n)
   ....:     %timeit np.einsum('ijkl,ij->kl', A, X)
   ....:     %timeit np.tensordot(A, X, [[0, 1], [0, 1]])
   ....:
1
1000000 loops, best of 3: 1.55 μs per loop
100000 loops, best of 3: 8.36 μs per loop
...
11
100000 loops, best of 3: 15.9 μs per loop
100000 loops, best of 3: 17.2 μs per loop
12
10000 loops, best of 3: 23.6 μs per loop
100000 loops, best of 3: 18.9 μs per loop
...
21
10000 loops, best of 3: 153 μs per loop
10000 loops, best of 3: 44.4 μs per loop

and it becomes clear the advantage of using tensordot for larger arrays.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...