Inspired by NullUserException's answer (which he has since deleted because it failed for one case), I think I have found a solution myself:
$regex = '~^
(?=(a(?-1)?b)c)
a+(b(?-1)?c)
$~x';
var_dump(preg_match($regex, 'aabbcc')); // 1
var_dump(preg_match($regex, 'aaabbbccc')); // 1
var_dump(preg_match($regex, 'aaabbbcc')); // 0
var_dump(preg_match($regex, 'aaaccc')); // 0
var_dump(preg_match($regex, 'aabcc')); // 0
var_dump(preg_match($regex, 'abbcc')); // 0
Try it yourself: http://codepad.viper-7.com/1erq9v
Explanation
If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:
~^a+(b(?-1)?c)$~
This does nothing more than check that there are one or more a's, followed by some number of b's and the same number of c's.
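To make that concrete, here is a small sketch that runs the lookahead-free pattern on a few inputs. The sample strings are my own picks for illustration, not part of the original test set:

```php
<?php
// The pattern from above with the lookahead assertion removed.
$withoutLookahead = '~^a+(b(?-1)?c)$~';

// Illustrative inputs (hypothetical, chosen to vary the number of a's).
$samples = ['aabbcc', 'aaabbcc', 'abbcc', 'aabbc'];

$results = [];
foreach ($samples as $s) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[$s] = preg_match($withoutLookahead, $s);
    printf("%-8s %d\n", $s, $results[$s]);
}
// aabbcc   1
// aaabbcc  1   (three a's, two b's: accepted anyway)
// abbcc    1   (one a, two b's: accepted anyway)
// aabbc    0   (b's and c's differ: rejected)
```

As the output suggests, this pattern only ties the b's to the c's; the count of a's is unconstrained.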
This doesn't yet satisfy our grammar, because the number of a's must be the same, too. We can ensure that by checking that the number of a's equals the number of b's. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The trailing c is necessary so that the lookahead cannot succeed after matching only part of the b's.
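The role of that c can be checked with a quick comparison. This is only a sketch: $withoutC is a hypothetical variant I made by dropping the c from the lookahead, and the input string is my own example:

```php
<?php
// The full pattern from the answer, written on one line.
$withC = '~^(?=(a(?-1)?b)c)a+(b(?-1)?c)$~';

// Hypothetical variant for comparison only: the same pattern with the
// trailing "c" removed from the lookahead.
$withoutC = '~^(?=(a(?-1)?b))a+(b(?-1)?c)$~';

$input = 'aabbbccc'; // two a's, three b's, three c's

var_dump(preg_match($withC, $input));    // int(0)
var_dump(preg_match($withoutC, $input)); // int(1)
```

Without the c, the lookahead is satisfied by the prefix aabb and never notices the third b, so it only guarantees that there are at least as many b's as a's. With the c, the b's must end exactly where the lookahead's match ends.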
Conclusion
I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular".