Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
280 views
in Technique[技术] by (71.8m points)

python - Efficient serialization of numpy boolean arrays

I have hundreds of thousands of NumPy boolean arrays that I would like to use as keys to a dictionary. (The values of this dictionary are the number of times we've observed each of these arrays.) Since NumPy arrays are not hashable and can't be used as keys themselves. I would like to serialize these arrays as efficiently as possible.

We have two definitions for efficiency to address, here:

  1. Efficiency in memory usage; smaller is better
  2. Efficiency in computational time serializing and reconstituting the array; less time is better

I'm looking to strike a good balance between these two competing interests, however, efficient memory usage is more important to me and I'm willing to sacrifice computing time.

There are two properties that I hope will make this task easier:

  1. I can guarantee that all arrays have the same size and shape
  2. The arrays are boolean, which means that it is possible to simply represent them as a sequence of 1s and 0s, a bit sequence

Is there an efficient Python (2.7, or, if possible, 2.6) data structure that I could serialize these to (perhaps some sort of bytes structure), and could you provide an example of the conversion between an array and this structure, and from the structure back to the original array?

Note that it is not necessary to store information about whether each index was True or False; a structure that simply stored indices where the array was True would be sufficient to reconstitute the array.

A sufficient solution would work for a 1-dimensional array, but a good solution would also work for a 2-dimensional array, and a great solution would work for arrays of even higher dimensions.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I have three suggestions. My first is baldly stolen from aix. The problem is that bitarray objects are mutable, and their hashes are content-independent (i.e. for bitarray b, hash(b) == id(b)). This can be worked around, as aix's answer shows, but in fact you don't need bitarrays at all -- you can just use tostring!

In [1]: import numpy
In [2]: a = numpy.arange(25).reshape((5, 5))
In [3]: (a > 10).tostring()
Out[3]: 'x00x00x00x00x00x00x00x00x00x00x00x01x01
         x01x01x01x01x01x01x01x01x01x01x01x01'

Now we have an immutable string of bytes, perfectly suitable for use as a dictionary key. To be clear, note that those escapes aren't escaped, so this is as compact as you can get without bitstring-style serialization.

In [4]: len((a > 10).tostring())
Out[4]: 25

Converting back is easy and fast:

In [5]: numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
Out[5]: 
array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)
In [6]: %timeit numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
100000 loops, best of 3: 5.75 us per loop

Like aix, I was unable to figure out how to store dimension information in a simple way. If you must have that, then you may have to put up with longer keys. cPickle seems like a good choice though. Still, its output is 10x as big...

In [7]: import cPickle
In [8]: len(cPickle.dumps(a > 10))
Out[8]: 255

It's also slower:

In [9]: cPickle.loads(cPickle.dumps(a > 10))
Out[9]: 
array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)
In [10]: %timeit cPickle.loads(cPickle.dumps(a > 10))
10000 loops, best of 3: 45.8 us per loop

My third suggestion uses bitstrings -- specifically, bitstring.ConstBitArray. It's similar in spirit to aix's solution, but ConstBitArrays are immutable, so they do what you want, hash-wise.

In [11]: import bitstring

You have to flatten the numpy array explicitly:

In [12]: b = bitstring.ConstBitArray((a > 10).flat)
In [13]: b.bin
Out[13]: '0b0000000000011111111111111'

It's immutable so it hashes well:

In [14]: hash(b)
Out[14]: 12144

It's super-easy to convert back into an array, but again, shape information is lost.

In [15]: numpy.array(b).reshape(5, 5)
Out[15]: 
array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

It's also the slowest option by far:

In [16]: %timeit numpy.array(b).reshape(5, 5)
1000 loops, best of 3: 240 us per loop

Here's some more information. I kept fiddling around and testing things and came up with the following. First, bitarray is way faster than bitstring when you use it right:

In [1]: %timeit numpy.array(bitstring.ConstBitArray(a.flat)).reshape(5, 5)
1000 loops, best of 3: 283 us per loop

In [2]: %timeit numpy.array(bitarray.bitarray(a.flat)).reshape(5, 5)
10000 loops, best of 3: 19.9 us per loop

Second, as you can see from the above, all the tostring shenanigans are unnecessary; you could also just explicitly flatten the numpy array. But actually, aix's method is faster, so that's what the now-revised numbers below are based on.

So here's a full rundown of the results. First, definitions:

small_nda = numpy.arange(25).reshape(5, 5) > 10
big_nda = numpy.arange(10000).reshape(100, 100) > 5000
small_barray = bitarray.bitarray(small_nda.flat)
big_barray = bitarray.bitarray(big_nda.flat)
small_bstr = bitstring.ConstBitArray(small_nda.flat)
big_bstr = bitstring.ConstBitArray(big_nda.flat)

keysize is the result of sys.getsizeof({small|big}_nda.tostring()), sys.getsizeof({small|big}_barray) + sys.getsizeof({small|big}_barray.tostring()), or sys.getsizeof({small|big}_bstr) + sys.getsizeof({small|big}_bstr.tobytes()) -- both the latter methods return bitstrings packed into bytes, so they should be good estimates of the space taken by each.

speed is the time it takes to convert from {small|big}_nda to a key and back, plus the time it takes to convert a bitarray object into a string for hashing, which is either a one-time cost if you cache the string or a cost per dict operation if you don't cache it.

         small_nda   big_nda   small_barray   big_barray   small_bstr   big_bstr

keysize  64          10040     148            1394         100          1346

speed    2.05 us     3.15 us   3.81 us        96.3 us      277 us       92.2ms  
                             + 161 ns       + 257 ns 

As you can see, bitarray is impressively fast, and aix's suggestion of a subclass of bitarray should work well. Certainly it's a lot faster than bitstring. Glad to see that you accepted that answer.

On the other hand, I still feel attached to the numpy.array.tostring() method. The keys it generates are, asymptotically, 8x as large, but the speedup you get for big arrays remains substantial -- about 30x on my machine for large arrays. It's a good tradeoff. Still, it's probably not enough to bother with until it becomes the bottleneck.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...