You can have the best of both worlds:
def func_dot_einsum(C, X):
Y = X.dot(C)
return np.einsum('ij,ij->i', Y, X)
On my system:
In [7]: %timeit func_dot(C, X)
10 loops, best of 3: 31.1 ms per loop
In [8]: %timeit func_einsum(C, X)
10 loops, best of 3: 105 ms per loop
In [9]: %timeit func_einsum2(C, X)
10 loops, best of 3: 43.5 ms per loop
In [10]: %timeit func_dot_einsum(C, X)
10 loops, best of 3: 21 ms per loop
When available, np.dot
uses BLAS, MKL, or whatever library you have . So the call to np.dot
is almost certainly being multithreaded. np.einsum
has its own loops, so doesn't use any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation.
Then there's the multi-input einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that the following is speculative at best, but here's what I think is going on...
When you run something like np.einsum('ij,ij->i', a, b)
, the benefit over doing np.sum(a*b, axis=1)
comes from avoiding having to instantiate the intermediate array with all the products, and looping twice over it. So at the low level what goes on is something like:
for i in range(I):
out[i] = 0
for j in range(J):
out[i] += a[i, j] * b[i, j]
Say now that you are after something like:
np.einsum('ij,jk,ik->i', a, b, c)
You could do the same operation as
np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
And what I think einsum does is to run this last code without having to instantiate the huge intermediate array, which certainly makes things faster:
In [29]: a, b, c = np.random.rand(3, 100, 100)
In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c)
100 loops, best of 3: 2.41 ms per loop
In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
100 loops, best of 3: 12.3 ms per loop
But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level:
for i in range(I):
out[i] = 0
for j in range(J):
for k in range(K):
out[i] += a[i, j] * b[j, k] * c[i, k]
But you are repeating a ton of operations! If you instead did:
for i in range(I):
out[i] = 0
for j in range(J):
temp = 0
for k in range(K):
temp += b[j, k] * c[i, k]
out[i] += a[i, j] * temp
you would be doing I * J * (K-1)
less multiplications (and I * J
extra additions), and save yourself a ton of time. My guess is that einsum is not smart enough to optimize things at this level. In the source code there is a hint that it only optimizes operations with 1 or 2 operands, not 3. In any case automating this for general inputs seems like anything but simple...