Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
355 views
in Technique[技术] by (71.8m points)

java - StackOverflowError when matching large input using RegEx

I got StackOverflowError when matching the result using a RegEx pattern.

The pattern is (d*?(;(?=d))?)+. This regex is used to validate the input:

12345;4342;234*;123*;344324

The input is a string consists of values (only digits) separated by ;. Each value could include one * at the end (used as wildcard for other matching). There is no ; at the end of the string.

The problem is that this regex works fine which small number of values. But when the numbers of values is too large (over 300), it will cause StackOverflowError.

final String TEST_REGEX = "(\d\*?(;(?=\d))?)+";

// Generate string
StringBuilder builder = new StringBuilder();
int number = 123456;
for (int count = 1; count <= 300; count++) {
    builder.append(Integer.toString(number).concat(";"));
    number++;
}
builder.deleteCharAt(builder.lastIndexOf(";"))

builder.toString().matches(TEST_REGEX); //<---------- StackOverflowError

And the stacktrace:

java.lang.StackOverflowError
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    ...

I think the lookahead in the pattern cause this error since there are lots of lookup but I haven't figured out how to reduce it or work around.

I really appreciate any suggestion since I'm not experienced in RegEx.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Before fixing the problem with StackOverflowError...

  1. I would like to point out that your current regex (d*?(;(?=d))?)+ fails to validate this condition.

    Each value could include one * at the end (used as wildcard for other matching)

    It fails to reject the case 23*4*4*;34*434*34, as seen here1.

  2. Your regex will do unnecessary backtracking on an non-matching input.

  3. Java uses one stack frame for each repetition of the group (d*?(;(?=d))?) (which is repeated 1 or more time +).

A correct regex would be:

d+*?(?:;d+*?)*

Note that this will reject *, which is not too clear from your requirement whether you want to accept or reject this.

This doesn't fix the StackOverflow problem, since each repetition of the group (?:;d+*?) is also going to use up stack. To fix that, make all quantifiers possessive, since there is no need for backtracking, as the grammar is not ambiguous:

d++*?+(?:;d++*?+)*+

Putting into string literal:

"\d++\*?+(?:;\d++\*?+)*+"

I have tested the regex above with matching and non matching input, which has more than 3600 tokens (separated by ;).

Footnote

1: regex101 uses PCRE flavor, which is slightly different from Java regex flavor. However, the features used in your regex are common between them, so there should be no discrepancy.

Appendix

  • Actually, from my testing with your regex (d*?(;(?=d))?)+ (which is incorrect according to your requirement), making the outer most + possessive ++ seem to fix the StackOverflowError problem, at least in my testing with around 3600 tokens (separated by ;, the string is around 20k character long). It also doesn't seem to cause long execution time when testing against a non-matching string.

  • In my solution, make the * quantifier for the group (?:;d+*?) possessive is enough to resolve StackOverflowError.

    "\d+\*?(?:;\d+\*?)*+"
    

    However, I make everything possessive to be on the safe side.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...