A typical record from the file looks like the following:

```
rows[0]
{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
```
That means most of your immutable objects are strings; only the 'rides' value is an integer.
For small integers (-5 ... 256), Python 3 keeps an integer pool, so these small integers feel as if they were cached (as long as PyLong_FromLong and Co. are used).
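The pool can be observed directly (this is CPython-specific behavior; `int()` is used here so the values are created at runtime rather than folded by the compiler):

```python
# Small integers (-5 ... 256) come from CPython's integer pool.
a = int("100")
b = int("100")
print(a is b)    # True: 100 is taken from the pool, both names share one object

x = int("257")
y = int("257")
print(x is y)    # False: 257 is outside the pool, so two separate objects exist
```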
The rules are more complicated for strings - they are, as pointed out by @timgeb, interned. There is a great article about interning; even though it is about Python 2.7, not much has changed since then. In a nutshell, the most important rules are:
- all strings of length 0 or 1 are interned.
- strings with more than one character are interned if they consist of characters that can be used in identifiers and are created at compile time, either directly or through peephole optimization/constant folding (but in the second case only if the result is no longer than 20 characters; 4096 since Python 3.7).
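These rules can be observed in a CPython session; a minimal sketch (the `join()` call forces the string to be built at runtime, so it escapes compile-time interning):

```python
import sys

a = "route"                   # identifier-like, created at compile time -> interned
x = "".join(["ro", "ute"])    # same characters, but built at runtime -> not interned

print(x == a)                 # True: equal content
print(x is a)                 # False: two distinct objects
print(sys.intern(x) is a)     # True: sys.intern() returns the canonical object
```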
All of the above are implementation details, but taking them into account we get the following for the rows[0] above:
- 'route', 'date', 'daytype' and 'rides' are all interned because they are created at compile time of the function read_as_dicts and don't contain "strange" characters.
- '3' and 'U' are interned because their length is only 1.
- '01/01/2001' isn't interned: it is longer than 1 character, created at runtime, and wouldn't qualify anyway because it contains the character '/'.
- 7354 isn't from the small integer pool, because it is too large. But other entries might come from this pool.
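Simulating how such rows come out of a parser makes these differences visible (a sketch assuming a comma-separated layout; the actual file format isn't shown in the question):

```python
# Two records sharing the same date, parsed at runtime (hypothetical CSV layout):
r1 = "3,01/01/2001,U,7354".split(",")
r2 = "4,01/01/2001,U,9288".split(",")

print(r1[1] == r2[1])    # True: the date strings have equal content
print(r1[1] is r2[1])    # False: runtime-created, longer than 1, contains '/'
                         # -> not interned, so each row holds its own copy

print(int(r1[3]) is int("7354"))  # False: 7354 is outside the small-int pool
```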
This was an explanation for the current behavior, with only some objects being "cached".
But why doesn't Python cache all created strings and integers?
Let's start with integers. To check quickly whether an integer has already been created (much faster than O(n)), one has to keep an additional look-up data structure, which needs additional memory. However, there are so many integers that the probability of hitting an already existing integer again is not very high, so in most cases the memory overhead of the look-up data structure will not pay off.
Because strings need more memory, the relative (memory) cost of the look-up data structure isn't as high. But it doesn't make any sense to intern a 1000-character string, because the probability that a randomly created string has the very same characters is almost 0!
On the other hand, if for example a hash table is used as the look-up structure, the calculation of the hash will take O(n) (n being the number of characters), which probably won't pay off for large strings.
Thus, Python makes a trade-off, which works pretty well in most scenarios - but it cannot be perfect in some special cases. Yet for those special scenarios you can optimize by hand using sys.intern().
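For the file-reading scenario, interning by hand could look like this (a sketch; `read_rows` and the field layout are made up for illustration, not the question's actual code):

```python
import sys

def read_rows(lines):
    """Parse CSV-like lines, interning the repetitive text fields by hand."""
    rows = []
    for line in lines:
        route, date, daytype, rides = line.split(",")
        rows.append({
            "route": sys.intern(route),    # few distinct routes, many repeats
            "date": sys.intern(date),      # ditto for dates and day types
            "daytype": sys.intern(daytype),
            "rides": int(rides),           # integers are left as they are
        })
    return rows

rows = read_rows(["3,01/01/2001,U,7354", "4,01/01/2001,U,9288"])
print(rows[0]["date"] is rows[1]["date"])  # True: one shared string object
```

Every repeated date or route now references a single canonical string, so the per-row memory cost shrinks to the dictionary entries themselves.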
Note: having the same id doesn't mean being the same object if the lifetimes of the two objects don't overlap - so your reasoning in the question isn't entirely watertight - but this is of no consequence in this special case.
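The id-reuse caveat can be demonstrated (CPython behavior; whether an address is actually reused is not guaranteed):

```python
a, b = object(), object()
print(id(a) != id(b))    # True: overlapping lifetimes guarantee distinct ids

# Once an object dies, its id (its memory address in CPython) may be handed
# out again - the first temporary below is freed before the second is made:
print(id(object()) == id(object()))
```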