My context is bioinformatics, next-generation sequencing in particular, but the problem is generic, so I will use a log file as an example.
The file is very large (gigabytes even compressed, so it will not fit in memory), but it is easy to parse (each line is an entry), so we can easily write something like:
    parse :: Lazy.ByteString -> [LogEntry]
Now, I have a lot of statistics that I would like to compute from the log file. It is easiest to write separate functions such as:
    totalEntries = length
    nrBots = sum . map fromEnum . map isBotEntry
    averageTimeOfDay = histogram . map extractHour
All of these are of the form foldl' k z . map f.
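To make that shape concrete, here is a minimal sketch with two of the statistics written explicitly as foldl' k z . map f. The isBotEntry below is a hypothetical stand-in operating on plain Strings, just for illustration; the real predicate would work on LogEntry:

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for the real predicate (illustration only):
isBotEntry :: String -> Bool
isBotEntry entry = "bot" `elem` words entry

-- Each statistic written explicitly in the shape foldl' k z . map f:
totalEntries :: [a] -> Int
totalEntries = foldl' (\n _ -> n + 1) 0 . map (const ())

nrBots :: [String] -> Int
nrBots = foldl' (+) 0 . map (fromEnum . isBotEntry)
```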
The problem is that if I try to use them in the most natural way, like
    main = do
        input <- Lazy.readFile "input.txt"
        let logEntries = parse input
            totalEntries' = totalEntries logEntries
            nrBots' = nrBots logEntries
            avgTOD = averageTimeOfDay logEntries
        print totalEntries'
        print nrBots'
        print avgTOD
This retains the whole list in memory: the first fold forces the entire list while the later statistics still hold a reference to its head, so no cons cell can be garbage collected until all the folds have run. I want the folds to proceed in lockstep in a single pass, so that each cons cell can be collected as soon as every fold has consumed it. If I compute only a single statistic, that is exactly what happens.
I can write a single big function that does this, but the result is non-composable.
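For concreteness, a sketch of what that single big function looks like, again with a hypothetical String-based isBotEntry. Note how every statistic is hard-wired into one step function; adding a statistic means editing the tuple, the seed, and the step all at once:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Hypothetical stand-in for the real predicate (illustration only):
isBotEntry :: String -> Bool
isBotEntry entry = "bot" `elem` words entry

-- One traversal, strict accumulators, but non-composable:
bothStats :: [String] -> (Int, Int)  -- (total entries, bot entries)
bothStats = foldl' step (0, 0)
  where
    step (!total, !bots) e = (total + 1, bots + fromEnum (isBotEntry e))
```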
Alternatively, which is what I have been doing, I run each pass separately, but that reloads and uncompresses the file each time.
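One composable approach I am aware of (not claiming it is the only one) is to reify each fold as a first-class value and combine them with an Applicative instance, so the combined fold still makes a single pass. The sketch below hand-rolls the idea using only base; it mirrors the Fold type from the "foldl" package, which does the same thing with strict pairs to avoid thunk build-up in the accumulator:

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (foldl')

-- A fold packaged as a value: step function, seed, and a finaliser
-- over a hidden accumulator type x.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

instance Functor (Fold a) where
    fmap g (Fold step z out) = Fold step z (g . out)

-- <*> pairs up the accumulators, so combined folds still make ONE pass.
instance Applicative (Fold a) where
    pure b = Fold const () (const b)
    Fold stepL zL outL <*> Fold stepR zR outR =
        Fold step (zL, zR) out
      where
        step (xL, xR) a = (stepL xL a, stepR xR a)
        out  (xL, xR)   = outL xL (outR xR)

runFold :: Fold a b -> [a] -> b
runFold (Fold step z out) = out . foldl' step z

lengthF :: Fold a Int
lengthF = Fold (\n _ -> n + 1) 0 id

countF :: (a -> Bool) -> Fold a Int
countF p = Fold (\n a -> n + fromEnum (p a)) 0 id

-- Both statistics in a single traversal of the input:
--     runFold ((,) <$> lengthF <*> countF isBotEntry) logEntries
```

Caveat: the lazy pair accumulator here can still accumulate thunks on very long inputs; the real foldl library uses a strict pair in its Applicative instance for exactly that reason.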