I have three suggestions. My first is baldly stolen from aix. The problem is that bitarray objects are mutable, and their hashes are content-independent (i.e. for a bitarray b, hash(b) == id(b)). This can be worked around, as aix's answer shows, but in fact you don't need bitarrays at all -- you can just use tostring!
In [1]: import numpy
In [2]: a = numpy.arange(25).reshape((5, 5))
In [3]: (a > 10).tostring()
Out[3]: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01'
Now we have an immutable string of bytes, perfectly suitable for use as a dictionary key. To be clear, each of those escapes represents a single byte -- they aren't stored as literal backslash sequences -- so this is as compact as you can get without bitstring-style serialization.
In [4]: len((a > 10).tostring())
Out[4]: 25
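To spell out the dictionary use, here's a minimal sketch. (On modern NumPy, .tostring() is spelled .tobytes(); both return the same raw buffer, one byte per boolean element.)

```python
import numpy as np

a = np.arange(25).reshape(5, 5)
mask = a > 10

# .tobytes() is the modern name for .tostring(): the array's raw buffer,
# one byte per boolean element.
key = mask.tobytes()

# bytes objects are immutable and hashable, so the key works directly:
cache = {key: "result computed for this mask"}

# Recomputing the mask yields an equal key, so lookups hit:
cache[(a > 10).tobytes()]
```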
Converting back is easy and fast:
In [5]: numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
Out[5]:
array([[False, False, False, False, False],
[False, False, False, False, False],
[False, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
In [6]: %timeit numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
100000 loops, best of 3: 5.75 us per loop
Like aix, I was unable to figure out how to store dimension information in a simple way. If you must have that, then you may have to put up with longer keys. cPickle seems like a good choice though. Still, its output is 10x as big...
In [7]: import cPickle
In [8]: len(cPickle.dumps(a > 10))
Out[8]: 255
It's also slower:
In [9]: cPickle.loads(cPickle.dumps(a > 10))
Out[9]:
array([[False, False, False, False, False],
[False, False, False, False, False],
[False, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
In [10]: %timeit cPickle.loads(cPickle.dumps(a > 10))
10000 loops, best of 3: 45.8 us per loop
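If you do need the shape but want keys smaller than a pickle, one workaround (my own sketch here, not something from aix's answer) is to pair the raw bytes with the shape tuple -- a tuple of hashables is itself hashable:

```python
import numpy as np

def mask_key(mask):
    # (shape, raw bytes): hashable, carries the dimension info, and much
    # smaller than a pickle of the array.
    return (mask.shape, mask.tobytes())

def mask_from_key(key):
    shape, buf = key
    # frombuffer is the modern spelling of fromstring.
    return np.frombuffer(buf, dtype=bool).reshape(shape)

a = np.arange(25).reshape(5, 5)
k = mask_key(a > 10)
restored = mask_from_key(k)
```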
My third suggestion uses bitstrings -- specifically, bitstring.ConstBitArray. It's similar in spirit to aix's solution, but ConstBitArrays are immutable, so they do what you want, hash-wise.
In [11]: import bitstring
You have to flatten the numpy array explicitly:
In [12]: b = bitstring.ConstBitArray((a > 10).flat)
In [13]: b.bin
Out[13]: '0b0000000000011111111111111'
It's immutable so it hashes well:
In [14]: hash(b)
Out[14]: 12144
It's super-easy to convert back into an array, but again, shape information is lost.
In [15]: numpy.array(b).reshape(5, 5)
Out[15]:
array([[False, False, False, False, False],
[False, False, False, False, False],
[False, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
It's also the slowest option by far:
In [16]: %timeit numpy.array(b).reshape(5, 5)
1000 loops, best of 3: 240 us per loop
Here's some more information. I kept fiddling around and testing things and came up with the following. First, bitarray is way faster than bitstring when you use it right:
In [1]: %timeit numpy.array(bitstring.ConstBitArray(a.flat)).reshape(5, 5)
1000 loops, best of 3: 283 us per loop
In [2]: %timeit numpy.array(bitarray.bitarray(a.flat)).reshape(5, 5)
10000 loops, best of 3: 19.9 us per loop
Second, as you can see from the above, all the tostring shenanigans are unnecessary; you could also just explicitly flatten the numpy array. But actually, aix's method is faster, so that's what the now-revised numbers below are based on.
So here's a full rundown of the results. First, definitions:
small_nda = numpy.arange(25).reshape(5, 5) > 10
big_nda = numpy.arange(10000).reshape(100, 100) > 5000
small_barray = bitarray.bitarray(small_nda.flat)
big_barray = bitarray.bitarray(big_nda.flat)
small_bstr = bitstring.ConstBitArray(small_nda.flat)
big_bstr = bitstring.ConstBitArray(big_nda.flat)
keysize is the result of sys.getsizeof({small|big}_nda.tostring()), of sys.getsizeof({small|big}_barray) + sys.getsizeof({small|big}_barray.tostring()), or of sys.getsizeof({small|big}_bstr) + sys.getsizeof({small|big}_bstr.tobytes()) -- both of the latter methods return bitstrings packed into bytes, so they should be good estimates of the space taken by each.
speed is the time it takes to convert from {small|big}_nda to a key and back, plus the time it takes to convert a bitarray object into a string for hashing -- either a one-time cost if you cache the string, or a cost per dict operation if you don't.
          small_nda   big_nda   small_barray   big_barray   small_bstr   big_bstr
keysize   64          10040     148            1394         100          1346
speed     2.05 us     3.15 us   3.81 us        96.3 us      277 us       92.2 ms
                                + 161 ns       + 257 ns
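The keysize figures for the plain byte-string keys can be reproduced along these lines. (Exact values vary across Python versions, since sys.getsizeof includes a fixed per-object overhead on top of the payload; the table's numbers are from Python 2.)

```python
import sys
import numpy as np

small_nda = np.arange(25).reshape(5, 5) > 10
big_nda = np.arange(10000).reshape(100, 100) > 5000

# One byte per element, plus the bytes object's fixed header overhead.
small_key = sys.getsizeof(small_nda.tobytes())
big_key = sys.getsizeof(big_nda.tobytes())
```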
As you can see, bitarray is impressively fast, and aix's suggestion of a subclass of bitarray should work well. Certainly it's a lot faster than bitstring. Glad to see that you accepted that answer.
On the other hand, I still feel attached to the numpy.array.tostring() method. The keys it generates are, asymptotically, 8x as large, but the speedup you get for big arrays remains substantial -- about 30x on my machine for large arrays. It's a good tradeoff. Still, it's probably not enough to bother with until it becomes the bottleneck.
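To make the asymptotic 8x concrete: tostring stores one byte per element, while the bit-packed representations store one bit. (numpy.packbits is used below only to show the packed size; it wasn't part of the timings above.)

```python
import numpy as np

big = np.arange(10000).reshape(100, 100) > 5000

raw = big.tobytes()                           # one byte per element
packed = np.packbits(big.ravel()).tobytes()   # one bit per element

# 10000 bytes vs 1250 bytes -- the 8x size gap between the two key styles.
```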