Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
182 views
in Technique[技术] by (71.8m points)

javascript - Unroll Loop, when to use

I'm trying to understand unroll loops in regex. What is the big difference between:

MINISTéRIO[sS]*?PáG

and

MINISTéRIO(?:[^P]*(?:P(?!áGs:sd+/d+)[^P]*)(?:[sS]*?))PáG

In this context:

http://regexr.com/3dmlr

Why should i use the second, if the first do the SAME thing?

Thanks.

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

What is Unroll-the-loop

See this Unroll the loop technique source:

This optimisation thechnique is used to optimize repeated alternation of the form (expr1|expr2|...)*. These expression are not uncommon, and the use of another repetition inside an alternation may also leads to super-linear match. Super-linear match arise from the underterministic expression (a*)*.

The unrolling the loop technique is based on the hypothesis that in most case, you kown in a repeteated alternation, which case should be the most usual and which one is exceptional. We will called the first one, the normal case and the second one, the special case. The general syntax of the unrolling the loop technique could then be written as:

normal* ( special normal* )*

So, this is an optimization technique where alternations are turned into linearly matching atoms.

This makes these unrolled patterns very efficient since they involve less backtracking.

Current Scenario

Your MINISTéRIO[sS]*?PáG is a non-unrolled pattern while MINISTéRIO[^P]*(?:P(?!áG)[^P]*)*PáG is. See the demos (both saved with PCRE option to show the number of steps in the box above. Regex performance is different across regex engines, but this will tell you exactly the performance difference). Add more text after text: the first regex will start requiring more steps to finish, the second one will only show more steps after adding P. So, in texts where the character you used in the known part is not common, unrolled patterns are very efficient.

See the Difference between .*?, .* and [^"]*+ quantifiers section in my answer to understand how lazy matching works (your [sS]*? is the same as .*? with a DOTALL modifier in languages that allow a . to match a newline, too).

Performance Question

Is the lazy matching pattern always slow and inefficient? It is not always so. With very short strings, lazy dot matching is usually better (1-10 symbols). When we talk about long inputs, where there can be the leading delimiter, and no trailing one, this may lead to excessive backtracking leading to time out issues.

Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.

Use lazy matching when your input is controlled, you know there will always be a match, some known set log formats, or the like.

Bonus: Commonly Unrolled patterns

  1. Tempered greedy tokens

  2. Regular string literals ("Stringu0020:"text""): "[^"\]*(?:\.[^"\]*)*"

  3. Multiline comment regex (/* Comments */): /*[^*]**+(?:[^/*][^*]**+)*/

  4. @<...>@ comment regex: @<[^>]*(?:>[^@]*)*@


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...