Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
171 views
in Technique[技术] by (71.8m points)

c# - Improving/Fixing a Regex for C style block comments

I'm writing (in C#) a simple parser to process a scripting language that looks a lot like classic C.

On one script file I have, the regular expression that I'm using to recognize /* block comments */ is going into some kind of infinite loop, taking 100% CPU for ages.

The Regex I'm using is this:

/*([^*]|[
]|(*+([^*/]|[
])))**+/

Any suggestions on why this might get locked up?

Alternatively, what's another Regex I could use instead?

More information:

  • Working in C# 3.0 targeting .NET 3.5;
  • I'm using the Regex.Match(string,int) method to start matching at a particular index of the string;
  • I've left the program running for over an hour, but the match isn't completed;
  • Options passed to the Regex constructor are RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace;
  • The regex works correctly for 452 of my 453 test files.
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Some problems I see with your regex:

There's no need for the |[ ] sequences in your regex; a negated character class like [^*] matches everything except *, including line separators. It's only the . (dot) metacharacter that doesn't match those.

Once you're inside the comment, the only character you have to look for is an asterisk; as long as you don't see one of those, you can gobble up as many characters you want. That means it makes no sense to use [^*] when you can use [^*]+ instead. In fact, you might as well put that in an atomic group -- (?>[^*]+) -- because you'll never have any reason to give up any of those not-asterisks once you've matched them.

Filtering out extraneous junk, the final alternative inside your outermost parens is *+[^*/], which means "one or more asterisks, followed by a character that isn't an asterisk or a slash". That will always match the asterisk at the end of the comment, and it will always have to give it up again because the next character is a slash. In fact, if there are twenty asterisks leading up to the final slash, that part of your regex will match them all, then it will give them all up, one by one. Then the final part -- *+/ -- will match them for keeps.

For maximum performance, I would use this regex:

/*(?>(?:(?>[^*]+)|*(?!/))*)*/

This will match a well-formed comment very quickly, but more importantly, if it starts to match something that isn't a valid comment, it will fail as quickly as possible.


Courtesy of David, here's a version that matches nested comments with any level of nesting:

(?s)/*(?>/*(?<LEVEL>)|*/(?<-LEVEL>)|(?!/*|*/).)+(?(LEVEL)(?!))*/

It uses .NET's Balancing Groups, so it won't work in any other flavor. For the sake of completeness, here's another version (from RegexBuddy's Library) that uses the Recursive Groups syntax supported by Perl, PCRE and Oniguruma/Onigmo:

/*(?>[^*/]+|*[^/]|/[^*])*(?>(?R)(?>[^*/]+|*[^/]|/[^*])*)**/

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...