I've done some research and found that the most efficient way for me to read and write multi-gig (5+ GB) files is to use something like the following code:
using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.Read))
using (BufferedStream bs = new BufferedStream(fs, 256 * 1024))
using (StreamReader sr = new StreamReader(bs, Encoding.ASCII, false, 256 * 1024))
using (StreamWriter sw = new StreamWriter(outputFile, true, Encoding.Unicode, 256 * 1024))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // Try to clean the csv, then split on commas outside quotes
        line = Regex.Replace(line, @"[\s\dA-Za-z][""][\s\dA-Za-z]", "");
        string[] fields = Regex.Split(line, @",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
        // I know there are libraries for this that I will switch out
        // when I have time to create the classes, as it seems they all
        // require a mapping class
        // Remap 90-250 properties
        object myObj = ObjectMapper(fields);
        // Write line
        bool success = ObjectWriter(myObj);
    }
}
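As a small aside on the regex itself, here's a minimal sketch (my own assumption, not benchmarked) of hoisting the two patterns into static compiled Regex instances so they aren't re-parsed on every line:

    using System.Text.RegularExpressions;

    // Same two patterns as above, compiled once instead of being passed
    // to the static Regex helpers on every line.
    static readonly Regex CleanQuotes =
        new Regex(@"[\s\dA-Za-z][""][\s\dA-Za-z]", RegexOptions.Compiled);
    static readonly Regex SplitFields =
        new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)", RegexOptions.Compiled);

    // In the loop:
    // line = CleanQuotes.Replace(line, "");
    // string[] fields = SplitFields.Split(line);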
CPU usage is averaging around 33% for each of the 3 instances on an Intel Xeon 2.67 GHz. Running 3 instances, I was able to output 2 files just under 7 GB in ~26 hrs using:
Parallel.Invoke(
() => new Worker().DoWork(args[0]),
() => new Worker().DoWork(args[1]),
() => new Worker().DoWork(args[2])
);
The third instance is generating a MUCH larger file, so far 34+ GB, and I am coming up on day 3, ~67 hrs in.
From what I've read, I think performance may improve slightly by tuning the buffer size down to a sweet spot.
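A minimal sketch of how that sweet spot could be found (the helper name and the candidate sizes below are my own guesses, nothing measured yet):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Text;

    // Hypothetical helper: time a plain read-only pass over the same file
    // at a few buffer sizes to see where throughput levels off.
    static void BenchmarkRead(string file)
    {
        foreach (int size in new[] { 4 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024 })
        {
            Stopwatch timer = Stopwatch.StartNew();
            using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.Read))
            using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, false, size))
            {
                while (sr.ReadLine() != null) { }
            }
            Console.WriteLine("{0,9} bytes: {1}", size, timer.Elapsed);
        }
    }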
My questions are:
- Based on what is stated, is this typical performance?
- Besides what I mentioned above, are there any other improvements you can see?
- Are the CSV mapping and reading libraries much faster than regex? (See the sketch below for the kind of library call I mean.)
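For the last question, this is the kind of library-based parsing I mean — a minimal sketch using TextFieldParser from Microsoft.VisualBasic.FileIO, just to illustrate the approach (I haven't benchmarked it against the regex):

    using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll

    // Illustration only: TextFieldParser handles quoted fields itself, which
    // would replace both Regex calls in the loop above.
    using (TextFieldParser parser = new TextFieldParser(file))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true;
        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();
            object myObj = ObjectMapper(fields);   // same mapping step as above
            bool success = ObjectWriter(myObj);
        }
    }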