Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
294 views
in Technique[技术] by (71.8m points)

python - Read file in chunks - RAM-usage, reading strings from binary files

I'd like to understand the difference in RAM-usage of this methods when reading a large file in python.

Version 1, found here on stackoverflow:

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open(file, 'rb')
for piece in read_in_chunks(f):
    process_data(piece)
f.close()

Version 2, I used this before I found the code above:

f = open(file, 'rb')
while True:
    piece = f.read(1024)
    process_data(piece)
f.close()

The file is read partially in both versions. And the current piece could be processed. In the second example, piece is getting new content on every cycle, so I thought this would do the job without loading the complete file into memory.

But I don't really understand what yield does, and I'm pretty sure I got something wrong here. Could anyone explain that to me?


There is something else that puzzles me, besides of the method used:

The content of the piece I read is defined by the chunk-size, 1KB in the examples above. But... what if I need to look for strings in the file? Something like "ThisIsTheStringILikeToFind"?

Depending on where in the file the string occurs, it could be that one piece contains the part "ThisIsTheStr" - and the next piece would contain "ingILikeToFind". Using such a method it's not possible to detect the whole string in any piece.

Is there a way to read a file in chunks - but somehow care about such strings?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

yield is the keyword in python used for generator expressions. That means that the next time the function is called (or iterated on), the execution will start back up at the exact point it left off last time you called it. The two functions behave identically; the only difference is that the first one uses a tiny bit more call stack space than the second. However, the first one is far more reusable, so from a program design standpoint, the first one is actually better.

EDIT: Also, one other difference is that the first one will stop reading once all the data has been read, the way it should, but the second one will only stop once either f.read() or process_data() throws an exception. In order to have the second one work properly, you need to modify it like so:

f = open(file, 'rb')
while True:
    piece = f.read(1024)  
    if not piece:
        break
    process_data(piece)
f.close()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...