Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

computer science - What kind of formal languages can modern regex engines parse?

Here on SO people sometimes say something like "you cannot parse X with regular expressions, because X is not a regular language". As I understand it, however, modern regular expression engines can match more than just regular languages in Chomsky's sense. My question:

Given a regular expression engine that supports

  • backreferences
  • lookaround assertions of unlimited width
  • recursion, like (?R)

what kind of languages can it parse? Can it parse any context-free language, and if not, what would be the counterexample?

(To be precise, by "parse" I mean "build a single regular expression that would accept all strings generated by the grammar X and reject all other strings").

Addendum: I'm particularly interested in seeing an example of a context-free language that modern regex engines (Perl, .NET, Python's regex module) would be unable to parse.


1 Reply

I recently wrote a rather long article on this topic: The true power of regular expressions.

To summarize:

  • Regular expressions with support for recursive subpattern references can match all context-free languages (e.g. a^n b^n).
  • Regular expressions with lookaround assertions and subpattern references can match at least some context-sensitive languages (e.g. ww and a^n b^n c^n).
  • If the assertions have unlimited width (as you say), then all context-sensitive languages can be matched. However, I don't know of any regex flavor that both supports subpattern references and has no fixed-width restriction on lookbehind.
  • Matching regular expressions with backreferences is NP-complete, so any other problem in NP can be solved using regular expressions (after applying a polynomial-time transformation).
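
To make the last point concrete, here is a minimal sketch of the well-known reduction from Boolean satisfiability (CNF-SAT) to backreference matching, written for Python's standard `re` module. It is not taken from the article: the function names `cnf_to_regex` and `solve`, the DIMACS-style clause encoding, and the `x` / `;` / `,` alphabet are all assumptions chosen for illustration.

```python
import re

def cnf_to_regex(num_vars, clauses):
    """Return (pattern, subject): the pattern matches the subject
    iff the CNF formula has a satisfying assignment.

    clauses uses a DIMACS-style encoding (an assumption of this sketch):
    the integer k stands for variable k, and -k for its negation.
    """
    # Group vK captures "x" when variable K is true and "" when it is false.
    assignment = "".join(f"(?P<v{k}>x?)" for k in range(1, num_vars + 1))
    clause_parts = []
    for clause in clauses:
        alternatives = []
        for literal in clause:
            ref = f"(?P=v{abs(literal)})"
            # Each clause must consume exactly "x," from the subject.
            # Positive literal: the backreference itself supplies the "x".
            # Negative literal: the backreference is empty, then a literal "x".
            alternatives.append(ref if literal > 0 else ref + "x")
        clause_parts.append("(?:" + "|".join(alternatives) + "),")
    # "x*" swallows the x's left over by variables that were set to false.
    pattern = assignment + "x*;" + "".join(clause_parts)
    subject = "x" * num_vars + ";" + "x," * len(clauses)
    return pattern, subject

def solve(num_vars, clauses):
    """Return a satisfying assignment as {variable: bool}, or None."""
    pattern, subject = cnf_to_regex(num_vars, clauses)
    match = re.fullmatch(pattern, subject)
    if match is None:
        return None
    return {k: match.group(f"v{k}") == "x" for k in range(1, num_vars + 1)}
```

For example, `solve(2, [[1, 2], [-1, 2], [-2, 1]])` finds an assignment, while `solve(1, [[1], [-1]])` returns `None`. Both the pattern and the subject grow only linearly with the formula; the potentially exponential search happens inside the engine's backtracking.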

Some examples:

  • Matching the context-free language {a^n b^n, n>0}:

    /^(a(?1)?b)$/
    # or
    /^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x
    
  • Matching the context-sensitive language {a^n b^n c^n, n>0}:

    /^
        (?=(a(?-1)?b)c)
        a+(b(?-1)?c)
    $/x
    # or
    /^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $/x
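
The other context-sensitive language mentioned in the summary, ww (a word followed by an exact copy of itself), needs neither recursion nor lookaround: a single backreference is enough. That makes it easy to try out even in Python's standard `re` module, which does not support the recursive patterns above. A minimal sketch, assuming w is non-empty; the helper name `is_copy` is only illustrative:

```python
import re

# (.+) captures w; \1 then demands exactly the same text a second time.
COPY_LANGUAGE = re.compile(r"(.+)\1", re.DOTALL)

def is_copy(s):
    """Return True iff s has the form ww for some non-empty string w."""
    return COPY_LANGUAGE.fullmatch(s) is not None

for candidate in ["abab", "abcabc", "aba", "abba"]:
    print(candidate, is_copy(candidate))
```

Here "abab" and "abcabc" are accepted, while "aba" and "abba" are rejected.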
    
