Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
529 views
in Technique[技术] by (71.8m points)

regex - Regular Expression (C#) For CSV by RFC 4180

Requires universal CSV parser by specification RFC 4180. There is the csv file, with all the problems of the specification:

Excel opens the file as it is written in the specification:

Anyone does work regex for parse it?

CSV File

"a
b
c","x
y
z",357
test;test,xxx;xxx,152
"test2,test2","xxx2,xxx2",123
"test3""test3","xxx3""xxx3",987
,qwe,13
asd,123,
,,
,123,
,,123
123,,
123,123

Expected Results

Table by EXCEL

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

NOTE: Though the solution below can likely be adapted for other regex engines, using it as-is will require that your regex engine treats multiple named capture groups using the same name as one single capture group. (.NET does this by default)


###About the pattern When one or more lines/records of a CSV file/stream (matching RFC standard 4180) are passed to the regular expression below it will return a match for each non-empty line/record. Each match will contain a capture group named Value that contains the captured values in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).

Here's the commented pattern (test it on Regexstorm.net):

(?<=
|
|^)(?!
|
|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>(?!")[^,
]+)|                 // Unquoted value -or-
    "(?<OpenValue>(?:[^"]|"")*)(?=
|
|$)|   // Open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  (?:,|(?=
|
|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:
|
|
|$)                              // Records end at EOL

Here's the raw pattern without all the comments or whitespace.
(?<=
|
|^)(?!
|
|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,
]+)|"(?<OpenValue>(?:[^"]|"")*)(?=
|
|$)|(?<Value>))(?:,|(?=
|
|$)))+?(?:(?<=,)(?<Value>))?(?:
|
|
|$)

[Here is a visualization from Debuggex.com][3] (capture groups named for clarity): ![Debuggex.com visualization][4]

###Usage examples:

Simple example for reading an entire CSV file/stream at once (test it on C# Pad):
(For better performance and less impact on system resources you should use the second example)

using System.Text.RegularExpressions;

Regex CSVParser = new Regex(
    @"(?<=
|
|^)(?!
|
|$)" +
    @"(?:" +
        @"(?:" +
            @"""(?<Value>(?:[^""]|"""")*)""|" +
            @"(?<Value>(?!"")[^,
]+)|" +
            @"""(?<OpenValue>(?:[^""]|"""")*)(?=
|
|$)|" +
            @"(?<Value>)" +
        @")" +
        @"(?:,|(?=
|
|$))" +
    @")+?" +
    @"(?:(?<=,)(?<Value>))?" +
    @"(?:
|
|
|$)",
    RegexOptions.Compiled);

String CSVSample =
    ",record1 value2,val3,"value 4","testing ""embedded double quotes"""," +
    ""testing quoted "","" character", value 7,,value 9," +
    ""testing empty """" embedded quotes"," +
    ""testing a quoted value" + Environment.NewLine +
    Environment.NewLine +
    "that includes CR/LF patterns" + Environment.NewLine +
    Environment.NewLine +
    "(which we wish would never happen - but it does)", after CR/LF" + Environment.NewLine +
    Environment.NewLine +
    ""testing an open ended quoted value" + Environment.NewLine +
    Environment.NewLine +
    ",value 2 ,value 3," + Environment.NewLine +
    ""test"";

MatchCollection CSVRecords = CSVParser.Matches(CSVSample);

for (Int32 recordIndex = 0; recordIndex < CSVRecords.Count; recordIndex++)
{
    Match Record = CSVRecords[recordIndex];

    for (Int32 valueIndex = 0; valueIndex < Record.Groups["Value"].Captures.Count; valueIndex++)
    {
        Capture c = Record.Groups["Value"].Captures[valueIndex];
        Console.Write("R" + (recordIndex + 1) + ":V" + (valueIndex + 1) + " = ");

        if (c.Length == 0 || c.Index == Record.Index || Record.Value[c.Index - Record.Index - 1] != '"')
        {
            // No need to unescape/undouble quotes if the value is empty, the value starts
            // at the beginning of the record, or the character before the value is not a
            // quote (not a quoted value)
            Console.WriteLine(c.Value);
        }
        else
        {
            // The character preceding this value is a quote
            // so we need to unescape/undouble any embedded quotes
            Console.WriteLine(c.Value.Replace("""", """));
        }
    }
    
    foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
        Console.WriteLine("ERROR - Open ended quoted value: " + OpenValue.Value);
}

Better example for reading a large CSV file/stream without reading the entire file/stream into a string (test it [on C# Pad][6]).
using System.IO;
using System.Text.RegularExpressions;

// Same regex from before shortened to one line for brevity
Regex CSVParser = new Regex(
    @"(?<=
|
|^)(?!
|
|$)(?:(?:""(?<Value>(?:[^""]|"""")*)""|(?<Value>(?!"")[^,
]+)|""(?<OpenValue>(?:[^""]|"""")*)(?=
|
|$)|(?<Value>))(?:,|(?=
|
|$)))+?(?:(?<=,)(?<Value>))?(?:
|
|
|$)",
    RegexOptions.Compiled);

String CSVSample = ",record1 value2,val3,"value 4","testing ""embedded double quotes""","testing quoted "","" character", value 7,,value 9,"testing empty """" embedded quotes","testing a quoted value," + 
    Environment.NewLine + Environment.NewLine + "that includes CR/LF patterns" + Environment.NewLine + Environment.NewLine + "(which we wish would never happen - but it does)", after CR/LF," + Environment.NewLine + Environment
    .NewLine + ""testing an open ended quoted value" + Environment.NewLine + Environment.NewLine + ",value 2 ,value 3," + Environment.NewLine + ""test"";

using (StringReader CSVReader = new StringReader(CSVSample))
{
    String CSVLine = CSVReader.ReadLine();
    StringBuilder RecordText = new StringBuilder();
    Int32 RecordNum = 0;

    while (CSVLine != null)
    {
        RecordText.AppendLine(CSVLine);
        MatchCollection RecordsRead = CSVParser.Matches(RecordText.ToString());
        Match Record = null;
        
        for (Int32 recordIndex = 0; recordIndex < RecordsRead.Count; recordIndex++)
        {
            Record = RecordsRead[recordIndex];
        
            if (Record.Groups["OpenValue"].Success && recordIndex == RecordsRead.Count - 1)
            {
                // We're still trying to find the end of a muti-line value in this record
                // and it's the last of the records from this segment of the CSV.
                // If we're not still working with the initial record we started with then
                // prep the record text for the next read and break out to the read loop.
                if (recordIndex != 0)
                    RecordText.AppendLine(Record.Value);
                
                break;
            }
            
            // Valid record found or new record started before the end could be found
            RecordText.Clear();            
            RecordNum++;
            
            for (Int32 valueIndex = 0; valueIndex < Record.Groups["Value"].Captures.Count; valueIndex++)
            {
                Capture c = Record.Groups["Value"].Captures[valueIndex];
                Console.Write("R" + RecordNum + ":V" + (valueIndex + 1) + " = ");
                if (c.Length == 0 || c.Index == Record.Index || Record.Value[c.Index - Record.Index - 1] != '"')
                    Console.WriteLine(c.Value);
                else
                    Console.WriteLine(c.Value.Replace("""", """));
            }
            
            foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
                Console.WriteLine("R" + RecordNum + ":ERROR - Open ended quoted value: " + OpenValue.Value);
        }
        
        CSVLine = CSVReader.ReadLine();
        
        if (CSVLine == null && Record != null)
        {
            RecordNum++;
            
            //End of file - still working on an open value?
            foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
                Console.WriteLine("R" + RecordNum + ":ERROR - Open ended quoted value: " + OpenValue.Value);
        }
    }
}

Both examples return the same result of:

R1:V1 =
R1:V2 = record1 value2
R1:V3 = val3
R1:V4 = value 4
R1:V5 = testing "embedded double quotes"
R1:V6 = testing quoted "," character
R1:V7 = value 7
R1:V8 =
R1:V9 = value 9
R1:V10 = testing empty "" embedded quotes
R1:V11 = testing a quoted value
<b


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...