I'd like to understand the difference in RAM usage between these two methods when reading a large file in Python.
Version 1, found here on Stack Overflow:
def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open(file, 'rb')
for piece in read_in_chunks(f):
    process_data(piece)
f.close()
Version 2, which I used before I found the code above:
f = open(file, 'rb')
while True:
    piece = f.read(1024)
    if not piece:
        break          # stop at end of file instead of looping forever
    process_data(piece)
f.close()
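I'm not sure how to verify the memory behaviour myself; here is a sketch of how I imagine one could measure the peak allocations of the chunked loop with tracemalloc (the file name 'large_file.bin' is just a placeholder, and I'm not certain this is the most accurate way to measure it):

import tracemalloc

def read_in_chunks(file_object, chunk_size=1024):
    # same helper as in Version 1 above
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

tracemalloc.start()
with open('large_file.bin', 'rb') as f:      # placeholder file name
    for piece in read_in_chunks(f):
        pass                                 # stand-in for process_data(piece)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print('peak Python allocations (bytes):', peak)   # should stay near chunk_size,
                                                  # not near the size of the file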
The file is read in pieces in both versions, and each piece can be processed as it comes in. In the second example, piece gets new content on every iteration of the loop, so I assumed this would do the job without loading the whole file into memory. But I don't really understand what yield does, and I'm pretty sure I've got something wrong here. Could anyone explain that to me?
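From what I've read so far, yield seems to turn a function into a generator that produces one value at a time and pauses in between. Here is a tiny standalone example I put together to check my understanding (the names are made up by me):

def counter(n):
    # Generator function: calling it returns a generator object without
    # running the body. Each next() runs up to the next 'yield', hands
    # that value back, and pauses there until the following next().
    for i in range(n):
        print('producing', i)
        yield i

gen = counter(3)      # nothing is printed yet: the body has not started
print(next(gen))      # prints 'producing 0', then 0
print(next(gen))      # prints 'producing 1', then 1
for value in gen:     # the for loop keeps calling next() until StopIteration
    print('got', value)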
There is something else that puzzles me, besides the method used: the content of each piece is determined by the chunk size, 1 KB in the examples above. But what if I need to search the file for a string like "ThisIsTheStringILikeToFind"? Depending on where the string occurs in the file, one piece might contain the part "ThisIsTheStr" and the next piece "ingILikeToFind". With such a method the complete string would never appear in any single piece.
Is there a way to read a file in chunks, but still take care of such strings?
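For illustration, here is a rough sketch of the kind of approach I can imagine: keep the last len(needle) - 1 bytes of the previous chunk and prepend them to the next one, so a match that straddles a chunk boundary is still visible in the combined window (the file name and function name are placeholders, and I don't know whether this is the idiomatic way to do it):

def find_in_chunks(path, needle, chunk_size=1024):
    # Yields the absolute byte offset of every occurrence of 'needle',
    # including matches that span two chunks, by carrying the last
    # len(needle) - 1 bytes over into the next search window.
    overlap = len(needle) - 1
    offset = 0                    # absolute file offset of window[0]
    window = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window += chunk
            pos = window.find(needle)
            while pos != -1:
                yield offset + pos
                pos = window.find(needle, pos + 1)
            # keep only the tail that could still hold a partial match
            if len(window) > overlap:
                offset += len(window) - overlap
                window = window[-overlap:] if overlap else b''

for hit in find_in_chunks('large_file.bin', b'ThisIsTheStringILikeToFind'):
    print('found at byte offset', hit)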