TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.
To understand why iterating over NumPy arrays is so slow, it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.
A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example, the integer ten = 10 could look like this:
The blue circle is the "name" you use in the Python interpreter for the variable ten, and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here, I left them out of the images).
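You can peek at some of this bookkeeping from Python itself. A minimal sketch (the exact sizes and counts are CPython implementation details):

import sys

ten = 10
print(type(ten))             # <class 'int'> - the link to its class
print(sys.getrefcount(ten))  # reference count (large here; CPython caches small ints)
print(sys.getsizeof(ten))    # 28 on 64-bit CPython - much more than 8 bytes of raw value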
A Python list is just a collection of Python objects. For example, mylist = [1, 2, 3] would be saved like this:

This time the list references the Python integers 1, 2, and 3, and the name mylist just references the list instance.
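You can verify that the list stores references rather than copies; this quick check relies only on the is operator:

x = 1000
mylist = [x, x, x]
print(mylist[0] is x)          # True - the list slot points at the same object
print(mylist[0] is mylist[1])  # True - all three slots reference one object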
But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:

The values 1, 2, and 3 are stored directly in the NumPy array instance.
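You can see this compact storage in the array's metadata (the default integer dtype is platform dependent; I assume int64 here):

import numpy as np

myarray = np.array([1, 2, 3])
print(myarray.dtype)     # int64 (assuming a 64-bit platform default)
print(myarray.itemsize)  # 8 - bytes per stored value
print(myarray.nbytes)    # 24 - just 3 * 8 bytes of raw data, no per-element objects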
With this information I can explain why iterating over an array is so much slower than iterating over a list:
Each time you access the next element in a list, the list just returns a stored object. That's very fast because the element already exists as a Python object (it just needs to increment the reference count by one). On the other hand, when you want an element of an array, it needs to create a new Python "box" for the value, with all the bookkeeping stuff, before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:

Creating these boxes is slow, and that's the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries), which store the values and their box:
import numpy as np
arr = np.arange(100000)
lst = list(range(100000))

def iterateover(obj):
    for item in obj:
        pass

%timeit iterateover(arr)
# 20.2 ms ± 155 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
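You can observe the boxing directly: two accesses to the same list index return the identical stored object, while two accesses to the same array index each create a fresh box:

import numpy as np

arr = np.array([1, 2, 3])
lst = [1, 2, 3]
print(lst[0] is lst[0])  # True - the list returns the stored object
print(arr[0] is arr[0])  # False - each access creates a new scalar box
print(type(arr[0]))      # <class 'numpy.int64'>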
The set "constructor" just does an iteration over the object.
One thing I can't answer definitively is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box", so there isn't much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():
arr = np.arange(100000)
%timeit list(arr)
# 20 ms ± 114 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Each of these has O(n) runtime complexity, but the constant factors are very different.
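One difference I can point out for sure: the two calls don't even produce equivalent lists. list(arr) keeps boxed NumPy scalars, while tolist() converts everything to plain Python integers (and does the whole conversion in C):

import numpy as np

arr = np.arange(5)
print(type(list(arr)[0]))     # <class 'numpy.int64'> - boxed NumPy scalar
print(type(arr.tolist()[0]))  # <class 'int'> - plain Python integer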
In your case you compared set() to tolist(), which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr), or set(arr.tolist()) to arr.tolist():
arr = np.random.randint(0, 1000, (10000, 3))

def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())
%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So if you want sets, you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique, but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).
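For completeness, a hedged sketch of the np.unique alternative; note that it returns a sorted array of distinct values, not a Python set:

import numpy as np

arr = np.random.randint(0, 1000, (10000, 3))
# One sorted array of distinct values per row - the result is an
# ndarray, so you'd still need a conversion if you really need sets:
uniques = [np.unique(row) for row in arr]
print(uniques[0])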
In the comments you asked about numba, and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will always be faster. I'm not sure how numba (re-)implements sets, but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:
Sets are more complicated than lists because they involve hashes and empty slots (Python uses open addressing for sets, so I assume numba will too).
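For illustration, a minimal open-addressing sketch (linear probing; CPython actually uses a perturbed probe sequence, and I'm only assuming numba does something similar):

def insert(table, value):
    # On a collision, probe the next slot until a free one is found.
    n = len(table)
    i = hash(value) % n
    while table[i] is not None and table[i] != value:
        i = (i + 1) % n  # linear probing
    table[i] = value

table = [None] * 8
for v in (3, 11, 5):  # 3 and 11 collide: both map to slot 3 in a table of size 8
    insert(table, v)
print(table)  # [None, None, None, 3, 11, 5, None, None]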
Like the NumPy array, the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, so when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:
import numba as nb

@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])

tosets_numba(arr)  # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, numba creates a Python set, including "creating the boxes" for all values in the set (that's something numba hides).
Just FYI: the same boxing/unboxing happens if you pass lists to numba nopython functions or return lists from these functions. So what's an O(1) operation in Python is an O(n) operation with numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)).
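A sketch of that difference (hedged: newer numba versions deprecate passing plain Python lists in favor of numba.typed.List, so this may warn or fail depending on your version):

import numba as nb
import numpy as np

@nb.njit
def total(seq):
    s = 0
    for x in seq:
        s += x
    return s

arr = np.arange(100000)
lst = arr.tolist()

total(arr)  # warmup/compilation
# Handing over the array is O(1): numba reads the existing buffer.
%timeit total(arr)
# Handing over the list is O(n): every element must be unboxed first.
%timeit total(lst)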
I assume that if you return these sets from the function (not really possible right now, because numba doesn't support lists of sets currently) it would be slower (because it creates a numba set and later converts it to a Python set), or only marginally faster (if the conversion numba set -> Python set is really, really fast).
Personally, I would use numba for sets only if I don't need to return them from the function, can do all operations on the set inside the function, and all those operations are supported in nopython mode. In any other case I wouldn't use numba here.
Just a note: from numpy import * should be avoided. You hide several Python built-in functions when you do that (sum, min, max, ...), and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much to type.
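A quick demonstration of the shadowing:

import builtins
from numpy import *

# sum, min, max, etc. now refer to NumPy's versions, not the built-ins:
print(sum is builtins.sum)  # False - sum is now numpy.sum
print(sum)                  # <function sum at 0x...> from the numpy module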