This is a known bug, first reported in 2002, that has yet to be fixed. You are not the first person to encounter this bug (or feature, as you will soon see).
From this comment in the bug report, it seems that quantifiers (*, +, {n,m}, {n,}) are designed to have an upper limit on the number of repetitions. This prevents the engine from segfaulting when the stack used for backtracking overflows, but it violates the very definition of the Kleene operator in regular expressions (repeat the pattern an arbitrary number of times) and gives the wrong answer for the query [1].
[1] In contrast, Java's regex engine (Oracle's implementation) simply allows StackOverflowError to occur in cases like this, but its quantifiers have an upper limit of 2^32 - 1, which is sufficient for most use cases. There is also a workaround for such cases, which is to use a possessive quantifier.
The same comment also prints the regex compilation debugging info, and the output clearly shows that * is translated to {0,32767}. This is also reproducible on my machine (perl v5.10.1 (*) built for x86_64-linux-thread-multi):
$ perl -Mre=debug -wce '/(A|B)*/'
Compiling REx "(A|B)*"
Final program:
1: CURLYM[1] {0,32767} (15)
5: TRIE-EXACT[AB] (13)
<A>
<B>
13: SUCCEED (0)
14: NOTHING (15)
15: END (0)
minlen 0
-e syntax OK
Freeing REx: "(A|B)*"
The following test further confirms the problem: perl does not let you specify a repetition count that exceeds the limit.
$ perl -e 'print (("a".("f"x32767)."a") =~ /a(?:[^a]|bb){0,32767}a/)'
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/a(?:[^a]|bb){ <-- HERE 0,32767}a/ at -e line 1.
Making the quantifier possessive (*+) does not solve the problem, since the limit is still there:
$ perl -Mre=debug -wce '/(A|B)*+/'
Compiling REx "(A|B)*+"
Final program:
1: SUSPEND (19)
3: CURLYM[1] {0,32767} (17)
7: TRIE-EXACT[AB] (15)
<A>
<B>
15: SUCCEED (0)
16: NOTHING (17)
17: SUCCEED (0)
18: TAIL (19)
19: END (0)
minlen 0
-e syntax OK
Freeing REx: "(A|B)*+"