c# - .NET Performance: Large CSV Read, Remap, Write Remapped

I've done some research and found that the most efficient way for me to read and write multi-gigabyte (5+ GB) files is to use something like the following code:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.Read))
using (BufferedStream bs = new BufferedStream(fs, 256 * 1024))
using (StreamReader sr = new StreamReader(bs, Encoding.ASCII, false, 256 * 1024))
using (StreamWriter sw = new StreamWriter(outputFile, true, Encoding.Unicode, 256 * 1024))
{
    string line;

    while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
    {
        //Try to clean csv then split
        line = Regex.Replace(line, @"[\s\dA-Za-z][""][\s\dA-Za-z]", "");
        string[] fields = Regex.Split(line, @",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
        //I know there are libraries for this that I will switch out 
        //when I have time to create the classes as it seems they all
        //require a mapping class

        //Remap 90-250 properties
        object myObj = ObjectMapper(fields);

        //Write line
        bool success = ObjectWriter(myObj);
    }
}

CPU usage averages around 33% for each of 3 instances on an Intel Xeon 2.67 GHz. Running 3 instances, I was able to output 2 files, each just under 7 GB, in ~26 hrs using:

Parallel.Invoke(
    () => new Worker().DoWork(args[0]),
    () => new Worker().DoWork(args[1]),
    () => new Worker().DoWork(args[2])
);

The third instance is generating a MUCH larger file: 34+ GB so far, and I'm coming up on day 3, ~67 hrs in.

From what I've read, I think performance may be increased slightly by lowering the buffer size to a sweet spot.
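
For what it's worth, here is a minimal sketch of how that sweet spot could be measured. The candidate sizes and the BenchmarkBufferSize helper are illustrative, not from my real code, and OS file caching warms up between runs, so each size really needs a cold-cache run to compare fairly:

using System;
using System.Diagnostics;
using System.IO;
using System.Text;

class BufferBenchmark
{
    //Times a full sequential line-by-line read at the given buffer size
    static TimeSpan BenchmarkBufferSize(string file, int bufferSize)
    {
        var timer = Stopwatch.StartNew();
        using (var fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var sr = new StreamReader(fs, Encoding.ASCII, false, bufferSize))
        {
            while (sr.ReadLine() != null) { }
        }
        timer.Stop();
        return timer.Elapsed;
    }

    static void Main(string[] args)
    {
        foreach (int size in new[] { 64 * 1024, 128 * 1024, 256 * 1024, 1024 * 1024 })
            Console.WriteLine("{0,8} bytes: {1}", size, BenchmarkBufferSize(args[0], size));
    }
}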

My questions are:

  1. Based on what is stated, is this typical performance?
  2. Besides what I mentioned above, are there any other improvements you can see?
  3. Are the CSV mapping and reading libraries much faster than regex?

1 Reply


So, first of all, you should profile your code to identify bottlenecks.

Visual Studio comes with a built-in profiler for this purpose, which can clearly identify hot-spots in your code.

Given that your process is CPU bound, this is likely to prove very effective.
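
If a full profiler isn't handy, even crude Stopwatch timing of the loop as written will show where the time goes. A minimal sketch of that idea (the phase names and totals are illustrative; ObjectMapper and ObjectWriter are your own methods, and this requires using System.Diagnostics):

var splitTime = new Stopwatch();
var mapTime = new Stopwatch();
var writeTime = new Stopwatch();

string line;
while ((line = sr.ReadLine()) != null)
{
    //Time the regex clean/split phase
    splitTime.Start();
    line = Regex.Replace(line, @"[\s\dA-Za-z][""][\s\dA-Za-z]", "");
    string[] fields = Regex.Split(line, @",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
    splitTime.Stop();

    //Time the property remapping phase
    mapTime.Start();
    object myObj = ObjectMapper(fields);
    mapTime.Stop();

    //Time the output phase
    writeTime.Start();
    bool success = ObjectWriter(myObj);
    writeTime.Stop();
}

Console.WriteLine("split: {0}  map: {1}  write: {2}",
    splitTime.Elapsed, mapTime.Elapsed, writeTime.Elapsed);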

However, if I had to guess at why it's slow, I would imagine it's because you are not re-using your regexes. A regex is relatively expensive to construct, so reusing it can yield a massive performance improvement.

var regex1 = new Regex(@"[\s\dA-Za-z][""][\s\dA-Za-z]", RegexOptions.Compiled);
var regex2 = new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)", RegexOptions.Compiled);
while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
{
    //Try to clean csv then split
    line = regex1.Replace(line, ""); 
    string[] fields = regex2.Split(line);
    //I know there are libraries for this that I will switch out 
    //when I have time to create the classes as it seems they all
    //require a mapping class

    //Remap 90-250 properties
    object myObj = ObjectMapper(fields);

    //Write line
    bool success = ObjectWriter(myObj);
}

However, I would strongly encourage you to use a library like Linq2Csv - it will likely be more performant, as it will have had several rounds of performance tuning, and it will handle edge-cases that your code doesn't.
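
For illustration, reading and writing with Linq2Csv looks roughly like the sketch below. The Record class and its two columns are hypothetical stand-ins for your 90-250 properties, and the API details should be double-checked against the library's documentation:

using System.Collections.Generic;
using LINQtoCSV;

//Hypothetical record type standing in for the real 90-250 properties
class Record
{
    [CsvColumn(FieldIndex = 1)]
    public string Id { get; set; }

    [CsvColumn(FieldIndex = 2)]
    public string Name { get; set; }
}

class Program
{
    static void Main()
    {
        var description = new CsvFileDescription
        {
            SeparatorChar = ',',
            FirstLineHasColumnNames = false
        };

        var context = new CsvContext();

        //Read<T> streams rows lazily, so a 5+ GB file is never held
        //in memory all at once
        IEnumerable<Record> rows = context.Read<Record>("input.csv", description);
        context.Write(rows, "output.csv", description);
    }
}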

