Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
173 views
in Technique[技术] by (71.8m points)

Javascript regex hangs (using v8)

Im using this regex to get the contents of a tag in a file.

var regex = new RegExp("<tag:main>((?:.|\s)*)</tag:main>");

This causes the v8 engine to hang indefinitely.

Now, if I use new RegExp("<tag:main>([sS]*)</tag:main>"), all is good.

Anyone have an idea why the first one takes too long?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This catastrophically backtracks on long sequences of spaces that occur after the last closing </tag:main> tag. Consider the case where the subject string ends with 100 spaces. First it matches them all with the . on the left of the alternation. That fails because there's no closing tag, so it tries matching the last character with the s instead. That fails too, so it tries matching the second-to-last space as a s and the last space as a .. That fails (still no closing tag) so it tries the last space as a s. When that fails it matches the third-to-last space as a s and tries all 4 ways to match the last two spaces. When that fails it tries the fourth-to-last space as a s and all 8 ways on the last 3 spaces. Then 16, 32 etc. The universe ends before it gets to the 100th-to-last space.

Different VMs have different reactions to regexp matches that take forever because of catastrophic backtracking. Some will simply report 'no match'. In V8 it's like writing any other infinite or near-infinite loop.

Using non-greedy * will do what you want (you want to stop at the first </tag:main>, not the last), but will still do catastrophic backtracking for long strings of spaces where the closing sequence is missing.

Making sure the same characters in the inner bracket can't match both sides of the alternation will reduce the problem from an exponential one to one that is linear in the length of the string. Use a character class instead of an alternation or put on the right hand side of the alternation bar. is disjoint with . so if you hit a long sequence of spaces the regexp engine doesn't try all left-right-left etc. combinations before terminating.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...