python - Why isn't broadcasting with numpy faster than a nested loop

Question


posted Jan 29, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a calculation in my code that gets carried out thousands of times, and I wanted to see if I could make it faster, as it currently uses two nested loops. I assumed that broadcasting would make it several times faster.

I've shown the two options below, which thankfully give the same results.

import numpy as np

n = 1000
x = np.random.random([n, 3])
y = np.random.random([n, 3])
func_weight = np.random.random(n)


result = np.zeros([n, 9])
result_2 = np.zeros([n, 9])

# existing
for a in range(3):
    for b in range(3):
        result[:, 3*a + b] = x[:, a] * y[:, b] * func_weight

# broadcasting - assumed this would be faster
for a in range(3):
    result_2[:, 3*a:3*(a+1)] = np.expand_dims(x[:, a], axis=-1) * y * np.expand_dims(func_weight, axis=-1)

Timings

n=100
nested loops: 24.7 μs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
broadcasting: 70.3 μs ± 1.22 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

n=1000
nested loops: 50.5 μs ± 913 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
broadcasting: 148 μs ± 372 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

n=10000
nested loops: 327 μs ± 7.99 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
broadcasting: 864 μs ± 5.57 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In my testing, broadcasting is always slower, so I'm a little confused about what is happening. My guess is that the expand_dims calls I needed to align the shapes in the second solution account for most of the overhead. Is that correct? The gap does not change much as the array size grows: the nested loop stays roughly 3 times quicker.

Is there a more optimal third solution that I haven't considered?

1 Reply

深蓝 · Answer 1 · 2021-01-29T04:27:34+0000

In [126]: %%timeit
     ...: result = np.zeros([n,9])
     ...: for a in range(3):
     ...:     for b in range(3):
     ...:         result[:, 3*a + b] = x[:, a] * y[:, b] * func_weight
141 μs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [128]: %%timeit
     ...: result_2 = np.zeros([n,9])
     ...: for a in range(3):
     ...:    result_2[:, 3*a:3*(a+1)] = np.expand_dims(x[:, a], axis=-1) * y * np.expand_dims(func_weight, axis=-1)
202 μs ± 10.8 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A fully broadcasted version:

In [130]: %%timeit
     ...: result_3 = (x[:,:,None]*y[:,None,:]*func_weight[:,None,None]).reshape(n,9)
88.8 μs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
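
As a sanity check on the fully broadcasted expression, here is a minimal sketch that compares it against the nested loop from the question. The seeded generator and the names expected, outer, and matches are assumptions added for illustration; they are not part of the original timings.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary, used only for reproducibility
n = 1000
x = rng.random((n, 3))
y = rng.random((n, 3))
func_weight = rng.random(n)

# Reference: the nested loop from the question.
expected = np.zeros((n, 9))
for a in range(3):
    for b in range(3):
        expected[:, 3 * a + b] = x[:, a] * y[:, b] * func_weight

# Fully broadcasted: (n, 3, 1) * (n, 1, 3) * (n, 1, 1) -> (n, 3, 3).
outer = x[:, :, None] * y[:, None, :] * func_weight[:, None, None]

# Row-major reshape maps element [i, a, b] to column 3*a + b,
# which is the same layout the loop fills in.
result_3 = outer.reshape(n, 9)

matches = np.allclose(expected, result_3)
print(outer.shape, result_3.shape, matches)
```

The single expression removes the Python-level loop entirely, so the per-call overhead is paid once rather than three or nine times.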

Replacing the expand_dims with np.newaxis/None expansion:

In [131]: %%timeit
     ...: result_2 = np.zeros([n,9])
     ...: for a in range(3):
     ...:    result_2[:, 3*a:3*(a+1)] = x[:, a,None] * y * func_weight[:,None]
132 μs ± 315 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So yes, expand_dims is a bit slow, I think because it tries to be general purpose, and because it adds an extra layer of function calls.

expand_dims is just a.reshape(shape), but it takes a bit of time to translate your axis parameter into the shape tuple. As an experienced user I find that the None syntax is clearer (and faster) - visually it stands out as a dimension-adding action.
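
To make the equivalence concrete, the sketch below builds the same column vector four ways and checks that they agree. The small func_weight array and the variable names are illustrative assumptions, not taken from the timings above.

```python
import numpy as np

func_weight = np.arange(5.0)  # small stand-in for the (n,) weight array

via_expand_dims = np.expand_dims(func_weight, axis=-1)
via_none = func_weight[:, None]
via_newaxis = func_weight[:, np.newaxis]
via_reshape = func_weight.reshape(-1, 1)

candidates = (via_expand_dims, via_none, via_newaxis, via_reshape)

# Every variant has a trailing axis of length 1 added.
shapes = [c.shape for c in candidates]

# Every variant holds the same values as the expand_dims result.
same_values = all(np.array_equal(via_expand_dims, c) for c in candidates)

# np.newaxis is simply an alias for None.
newaxis_is_none = np.newaxis is None

print(shapes, same_values, newaxis_is_none)
```

Since the results are identical, the choice between them comes down to readability and call overhead, which only matters when the arrays are small and the expression runs many times.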
