When I change your +=
to just =
for your 2 sparse arrays:
for row, col in zip(rows, cols):
#freqs[row,col] += 1
freqs[row,col] = 1
their respective times are cut in half. What's consuming the most time is the indexing. With +=
it is has to do both a __getitem__
and a __setitem__
.
When the docs say that dok
and lil
are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.
When I try to make a csr
matrix with your code, I get a:
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
and 30x slower speed.
So the speed claims are relative to formats like csr
, not relative to pure Python or numpy
structures.
You might want to look at the Python code for dok_matrix.__get_item__
and dok_matrix.__set_item__
to see what happens when you do freq[r,c]
.
A faster way to construct your dok
would be:
freqs = dok_matrix((1000,1000))
d = dict()
for row, col in zip(rows, cols):
d[(row, col)] = 1
freqs.update(d)
taking advantage of the fact that a dok
is a subclassed dictionary. Note that dok
matrix is not a dictionary of dictionaries. Its keys are tuples like (50,50)
.
Another fast way of constructing the same sparse array is:
freqs = sparse.coo_matrix((np.ones(1000,int),(rows,cols)))
In other words, since you already have the rows
and cols
arrays (or ranges), calculate the corresponding data
array, and THEN construct the sparse array.
But if you must perform sparse operations on your matrix between incremental growth steps, then dok
or lil
may be your best choices.
Sparse matricies were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation friendly csr
format is the ultimate goal, and the coo
format was a convenient initialization format.
Now many of the SO scipy sparse questions arise from scikit-learn
and text analysis problems. They are also used in a biological database files. But still the (data),(row,col)
definition method works best.
So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.
Here's a faster dok
iteration that takes advantage of its dictionary methods. update
seems to work as fast as on a plain dictionary. get
is about 3x faster the equivalent indexing (freq[row,col]
). Indexing probably uses get
, but must have a lot of overhead.
def fast_dok(rows, cols):
freqs = dok_matrix((1000,1000))
for row, col in zip(rows,cols):
i = freqs.get((row,col),0)
freqs.update({(row,col):i+1})
return freqs
Skipping the get
, and just doing
freqs.update({(row,col): 1)
is even faster - faster than the defaultdict of defaultdict example, and nearly as fast as simple dictionary initialization ({(r, c):1 for r,c in zip(rows, cols)}
)