I'm working on a program that will process files that could be 100 GB or more in size. The files contain sets of variable-length records. I have a first implementation up and running and am now looking at improving performance, particularly at doing I/O more efficiently, since the input file gets scanned many times.
Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
The mmap() code could potentially get very messy since mmap'd blocks need to lie on page-sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-sized boundaries.
How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?
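The kind of simple test I'm picturing is something like the sketch below: touch every byte of the file once through each path and time it. It assumes a POSIX system; sum_bytes, sum_with_fstream, sum_with_mmap, and seconds are hypothetical names, and summing bytes is only a stand-in for real record processing, so the numbers would be a rough indication at best.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Stand-in workload: add up every byte so the data is actually touched.
std::uint64_t sum_bytes(const unsigned char* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Reads the file in blocks of `block_size` bytes via std::ifstream.
// Returns 0 if the file cannot be opened.
std::uint64_t sum_with_fstream(const char* path, std::size_t block_size) {
    assert(block_size > 0);
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(block_size);
    std::uint64_t s = 0;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        s += sum_bytes(reinterpret_cast<const unsigned char*>(buf.data()),
                       static_cast<std::size_t>(in.gcount()));
    }
    return s;
}

// Maps the whole file read-only and walks it. Returns 0 on any failure.
std::uint64_t sum_with_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    std::uint64_t s = 0;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            s = sum_bytes(static_cast<const unsigned char*>(p), len);
            munmap(p, len);
        }
    }
    close(fd);
    return s;
}

// Wall-clock time of one call to `f`, in seconds.
template <class F>
double seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const std::chrono::duration<double> dt =
        std::chrono::steady_clock::now() - t0;
    return dt.count();
}
```

For example, seconds([&] { sum_with_mmap(path); }) versus seconds([&] { sum_with_fstream(path, 1 << 20); }), each repeated several times. The first pass over a file usually comes from disk and later passes may come from the OS page cache, so a fair comparison would probably need both cold and warm runs, on a file comparable in size to the real input.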