The overhead for a dict object is quite large. It depends on your Python version and your system architecture, but on Python 3.5, 64-bit:
In [21]: sys.getsizeof({})
Out[21]: 288
So, guesstimating about 250 bytes of overhead per dict:
250*36e6*1e-9 == 9.0
So that is a lower limit on my RAM usage in gigabytes if I created that many dictionaries, not factoring in the list itself!
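For a rough cross-check (not part of the original measurement), you could size up a single sample record directly; the field names and values here are just illustrative:

import sys

# One sample record as a dict; the keys and values are placeholders.
record = {"name": "john", "age": 28}

# Count the dict itself plus its string and int values. The key strings
# are shared literals, so they aren't counted once per record.
per_record = (sys.getsizeof(record)
              + sys.getsizeof(record["name"])
              + sys.getsizeof(record["age"]))

per_record * 36e6 * 1e-9   # lands a few gigabytes above the 9 GB guess on 64-bit CPython 3.5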
Rather than use a dict as a record type, which isn't really what it's for, use a namedtuple. And to get a view of how this compares, let's set up an equivalent list of namedtuples:
In [22]: from collections import namedtuple

In [23]: Record = namedtuple("Record", "name age")
In [24]: records = [Record("john", 28) for _ in range(36000000)]
In [25]: getsizeof = sys.getsizeof
Consider:
In [31]: sum(getsizeof(record)+ getsizeof(record.name) + getsizeof(record.age) for record in records)
Out[31]: 5220000000
In [32]: _ + getsizeof(records)
Out[32]: 5517842208
In [33]: _ * 1e-9
Out[33]: 5.517842208
So 5 gigs is an upper limit, and quite a conservative one. For example, it assumes no small-int caching is going on, which for a record type holding ages will absolutely matter. On my own system, the python process is registering 2.7 gigs of memory usage (via top).

So, what is actually going on in my machine is better modeled by being conservative for strings -- assuming unique strings with an average size of 10, so no string interning -- but liberal for ints, assuming int-caching takes care of our int objects for us, so we only have to worry about the 8-byte pointers!
In [35]: sum(getsizeof("0123456789") + 8 for record in records)
Out[35]: 2412000000
In [36]: _ + getsizeof(records)
Out[36]: 2709842208
In [37]: _ * 1e-9
Out[37]: 2.709842208
Which is a good model for what I'm observing from top.
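If you would rather check that number from inside the process than eyeball top, one option (not used above) is the standard-library resource module:

import resource

# Peak resident set size of the current process.
# Note the units differ by platform: kilobytes on Linux, bytes on macOS.
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss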
If you really want efficient storage
Now, if you really want to cram data into RAM, you are going to have to lose the flexibility of Python. You could use the array module in combination with struct to get C-like memory efficiency (there's a rough sketch of that further down). An easier world to wade into might be numpy instead, which allows for similar things. For example:
In [1]: import numpy as np
In [2]: recordtype = np.dtype([('name', 'S20'),('age', np.uint8)])
In [3]: records = np.empty((36000000), dtype=recordtype)
In [4]: records.nbytes
Out[4]: 756000000
In [5]: records.nbytes*1e-9
Out[5]: 0.756
Note that we are now allowed to be quite compact. I can use 8-bit unsigned integers (i.e. a single byte) to represent age. However, I am immediately faced with some inflexibility: if I want efficient storage of strings, I must define a maximum size. I've used 'S20', which is 20 characters. These are ASCII bytes, but a field of 20 ASCII characters might very well suffice for names.
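As an aside, here is roughly what the struct-style packing mentioned earlier could look like, with a plain bytearray standing in for the array module; the 20-byte name / 1-byte age layout is just an assumption mirroring the dtype above:

import struct

# Hypothetical fixed-width record: 20 bytes of name, 1 unsigned byte of age.
record_format = struct.Struct("20sB")          # 21 bytes per record

n = 1000                                       # 36000000 for the full data set
buf = bytearray(record_format.size * n)

# Pack a couple of illustrative records in place.
record_format.pack_into(buf, 0 * record_format.size, b"john", 28)
record_format.pack_into(buf, 1 * record_format.size, b"jane", 31)

# Unpack one back out; the name comes back null-padded to 20 bytes.
name, age = record_format.unpack_from(buf, 1 * record_format.size)
name.rstrip(b"\x00"), age                      # (b'jane', 31)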
Now, numpy gives you a lot of fast methods wrapping C-compiled code. So, just to play around with it, let's fill our records with some toy data. Names will simply be strings of digits from a simple count, and age will be selected from a normal distribution with a mean of 50 and a standard deviation of 10.
In [8]: for i in range(1, 36000000+1):
...: records['name'][i - 1] = b"%08d" % i
...:
In [9]: import random
...: for i in range(36000000):
...: records['age'][i] = max(0, int(random.normalvariate(50, 10)))
...:
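As a side note, the age column could also be filled in one vectorized call instead of a Python loop, which is considerably faster at this scale (np is the numpy import from above, same toy distribution):

# Draw all the samples at once, clamp into the uint8 range, assign in one step.
ages = np.random.normal(50, 10, size=len(records))
records['age'] = np.clip(ages, 0, 255).astype(np.uint8)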
Now, we can use numpy to query our records. For example, if you want the indices of your records given some condition, use np.where:
In [10]: np.where(records['age'] > 70)
Out[10]: (array([ 58, 146, 192, ..., 35999635, 35999768, 35999927]),)
In [11]: idx = np.where(records['age'] > 70)[0]
In [12]: len(idx)
Out[12]: 643403
So there are 643403 records with an age > 70. Now, let's try 100:
In [13]: idx = np.where(records['age'] > 100)[0]
In [14]: len(idx)
Out[14]: 9
In [15]: idx
Out[15]:
array([ 2315458, 5088296, 5161049, 7079762, 15574072, 17995993,
25665975, 26724665, 28322943])
In [16]: records[idx]
Out[16]:
array([(b'02315459', 101), (b'05088297', 102), (b'05161050', 101),
(b'07079763', 104), (b'15574073', 101), (b'17995994', 102),
(b'25665976', 101), (b'26724666', 102), (b'28322944', 101)],
dtype=[('name', 'S20'), ('age', 'u1')])
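(Not shown in the session above, but you can also skip np.where and index with the boolean mask directly:)

# Boolean-mask indexing returns the matching rows in one step,
# without building the index array first.
records[records['age'] > 100]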
Of course, one major inflexibility is that numpy arrays are fixed-size; resizing operations are expensive. Now, you could maybe wrap a numpy.array in some class and it would act as an efficient backbone, but at that point you might as well use a real database. Lucky for you, Python comes with sqlite (via the sqlite3 module).
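A minimal sketch of that route, assuming the same two fields (the table and column names are just placeholders):

import sqlite3

# On-disk database; use ":memory:" to keep it in RAM instead.
conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, age INTEGER)")

# executemany streams rows in; a generator avoids building a giant list first.
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    (("%08d" % i, 50) for i in range(1, 1001)),   # toy data, small count
)
conn.commit()

# The filtering work happens inside the C-level engine.
conn.execute("SELECT COUNT(*) FROM records WHERE age > 70").fetchone()[0]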