Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
146 views
in Technique[技术] by (71.8m points)

c# - Slow Regex performance

The code below contains a regular expression designed to extract a C# string literal but the performance of the regex matching for input strings of more than a few characters is woeful.

class Program
{
   private static void StringMatch(string s)
    {
        // regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
        Match m = Regex.Match(s, ""(([^"]*)(\\.)?)*"");
        if (m.Success)
            Trace.WriteLine(m.Value);
        else
            Trace.WriteLine("no match");
    }

    public static void Main()
    {
        // this first string is unterminated (so the match fails), but it returns instantly
        StringMatch(""OK");

        // this string is terminated (the match succeeds)
        StringMatch(""This is a longer terminated string - it matches and returns instantly"");

        // this string is unterminated (so the match will fail), but it never returns
        StringMatch(""This is another unterminated string and takes FOREVER to match");
    }
}

I can refactor the regex into a different form, but can anyone offer an explanation why the performance is so bad?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You're running into catastrophic backtracking:

Let's simplify the regex a bit (without the escaped quotes and without the second optional group because, as in your comment, it's irrelevant for the tested strings):

"(([^"]*))*" 

([^"]*) matches any string except quotes or backslashes. This again is enclosed in an optional group that can repeat any number of times.

Now for the string "ABC, the regex engine needs to try the following permutations:

  • ", ABC
  • ", ABC, <empty string>
  • ", AB, C
  • ", AB, C, <empty string>
  • ", AB, <empty string>, C
  • ", AB, <empty string>, C, <empty string>
  • ", <empty string>, AB, C
  • ", <empty string>, AB, C, <empty string>
  • ", <empty string>, AB, <empty string>, C, <empty string>
  • ", <empty string>, AB, <empty string>, C
  • ", A, BC
  • ", A, BC, <empty string>
  • ", A, <empty string>, BC
  • ", <empty string>, A, BC
  • etc.
  • ", A, B, C
  • ", A, B, C, <empty string>
  • ", A, B, <empty string>, C
  • etc. etc.

each of which then fails because there is no following ".

Also, you're only testing for substrings instead of forcing the regex to match the entire string. And you usually want to use verbatim strings for regexes to cut down on the number of backslashes you need. How about this:

foundMatch = Regex.IsMatch(subjectString, 
    @"A     # Start of the string
    ""       # Match a quote
    (?:      # Either match...
     \.     # an escaped character
    |        # or
     [^""] # any character except backslash or quote
    )*       # any number of times
    ""       # Match a quote
           # End of the string", 
    RegexOptions.IgnorePatternWhitespace);

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...