javascript - Unroll Loop, when to use

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to understand the unroll-the-loop technique in regex. What is the big difference between:

MINISTéRIO[\s\S]*?PáG

and

MINISTéRIO(?:[^P]*(?:P(?!áG\s:\s\d+\/\d+)[^P]*)(?:[\s\S]*?))PáG

In this context:

http://regexr.com/3dmlr

Why should I use the second if the first does the SAME thing?

Thanks.

1 Reply

深蓝 · Answer 1 · 2021-10-16T22:31:13+0000

What is Unroll-the-loop

See this Unroll the loop technique source:

This optimisation technique is used to optimize repeated alternation of the form (expr1|expr2|...)*. These expressions are not uncommon, and the use of another repetition inside an alternation may also lead to a super-linear match. Super-linear matching arises from nondeterministic expressions such as (a*)*.

The unrolling-the-loop technique is based on the hypothesis that, in most cases, you know which case in a repeated alternation is the most usual and which one is exceptional. We will call the first one the normal case and the second one the special case. The general syntax of the unrolling-the-loop technique can then be written as:

normal* ( special normal* )*

So, this is an optimization technique where alternations are turned into linearly matching atoms.

This makes these unrolled patterns very efficient since they involve less backtracking.
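As a minimal sketch of the normal* ( special normal* )* shape, the snippet below builds an unrolled pattern from two fragments and compares it with the equivalent alternation. The helper name unroll, the HTML-comment delimiters, and the sample string are assumptions made for illustration only; they are not part of the question.

```javascript
// Hypothetical helper: builds the source of an unrolled loop.
// `normal` must be a single atom (for example a character class) and must
// not be able to match at a position where `special` starts.
function unroll(normal, special) {
  return `${normal}*(?:${special}${normal}*)*`;
}

// Assumed example: capture everything between "<!--" and the first "-->".
const normal = "[^-]";        // normal case: any character that is not "-"
const special = "-(?!->)";    // special case: a "-" that does not start "-->"

// Alternation form: (normal|special)*
const alternated = new RegExp(`<!--((?:${normal}|${special})*)-->`);
// Unrolled form: normal*(special normal*)*
const unrolledSource = unroll(normal, special);
const unrolled = new RegExp(`<!--(${unrolledSource})-->`);

const sample = "a <!-- x - y -- z --> b --> c";
const altMatch = alternated.exec(sample)[1];
const unrolledMatch = unrolled.exec(sample)[1];

console.log(unrolledSource);             // [^-]*(?:-(?!->)[^-]*)*
console.log(altMatch === unrolledMatch); // true
```

Both forms accept the same text; the unrolled one consumes runs of normal characters in a single step instead of choosing an alternative for every character.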

Current Scenario

Your MINISTéRIO[\s\S]*?PáG is a non-unrolled pattern, while MINISTéRIO[^P]*(?:P(?!áG)[^P]*)*PáG is an unrolled one. See the demos (both saved with the PCRE option to show the number of steps in the box above; regex performance differs across regex engines, but the step count gives a concrete measure of the difference). Add more text after text: the first regex will need more steps to finish, while the second will only need more steps once you add a P. So, in texts where the character you used in the known part is not common, unrolled patterns are very efficient.
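A small sketch in JavaScript shows that the lazy pattern MINISTéRIO[\s\S]*?PáG and the unrolled pattern MINISTéRIO[^P]*(?:P(?!áG)[^P]*)*PáG return the same match. The sample text is an assumption (the regexr demo content is not reproduced here), and JavaScript has no built-in step counter, so this only checks the result, not the cost.

```javascript
// The two patterns, with the backslashes the lazy one needs.
const lazy = /MINISTéRIO[\s\S]*?PáG/;
const unrolled = /MINISTéRIO[^P]*(?:P(?!áG)[^P]*)*PáG/;

// Assumed sample: one "P" that is not part of the closing delimiter
// (in PORTARIA), followed by two occurrences of the delimiter "PáG".
const text = "MINISTéRIO DA SAúDE\nPORTARIA No 1\nPáG : 1/2\nresto PáG";

const lazyMatch = text.match(lazy)[0];
const unrolledMatch = text.match(unrolled)[0];

// Both stop at the first "PáG" after "MINISTéRIO".
console.log(JSON.stringify(lazyMatch));
console.log(lazyMatch === unrolledMatch); // true
```

The unrolled pattern only has to make a decision at each P, which is why the step count in the demos grows with the number of P characters rather than with the overall length of the text.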

See the Difference between .*?, .* and [^"]*+ quantifiers section in my answer to understand how lazy matching works (your [\s\S]*? is the same as .*? with a DOTALL modifier in languages that allow a . to match a newline, too).

Performance Question

Is the lazy matching pattern always slow and inefficient? Not always. With very short strings (1-10 symbols), lazy dot matching is usually better. With long inputs that may contain the leading delimiter but no trailing one, lazy matching can cause excessive backtracking and lead to time-out issues.

Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.

Use lazy matching when your input is controlled and you know there will always be a match, for example with a known set of log formats or the like.
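The failing case described above (a leading delimiter with no trailing one) can be probed with a rough timing sketch like the following. The input size, the filler text, and the helper name timeMatch are assumptions; timings depend heavily on the engine and its optimizations, so no particular numbers are claimed here.

```javascript
// Hypothetical helper: runs a regex once and reports whether it matched
// and roughly how long it took in milliseconds.
function timeMatch(regex, input) {
  const start = Date.now();
  const result = regex.exec(input);
  return { matched: result !== null, ms: Date.now() - start };
}

const lazy = /MINISTéRIO[\s\S]*?PáG/;
const unrolled = /MINISTéRIO[^P]*(?:P(?!áG)[^P]*)*PáG/;

// Assumed worst case: the leading delimiter is present, the trailing
// "PáG" never appears, and the text after it is long.
const noTrailing = "MINISTéRIO " + "lorem ipsum ".repeat(50000);

const lazyRun = timeMatch(lazy, noTrailing);
const unrolledRun = timeMatch(unrolled, noTrailing);

// Neither pattern can match; compare the elapsed times on your own engine.
console.log("lazy:", lazyRun);
console.log("unrolled:", unrolledRun);
```

A single run with Date.now() is coarse; for a real decision, repeat the measurement several times on representative inputs.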

Bonus: Commonly Unrolled patterns

Tempered greedy tokens
Regular string literals ("String\u0020:\"text\""): "[^"\\]*(?:\\.[^"\\]*)*"
Multiline comment regex (/* Comments */): /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
@<...>@ comment regex: @<[^>]*(?:>[^@]*)*@
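As a sketch of how two of these patterns look as JavaScript regex literals (where / must be escaped as \/), the snippet below applies the string-literal and multiline-comment patterns to a made-up line of source code. The sample input is an assumption for illustration.

```javascript
// String literal: normal = [^"\\], special = \\. (an escaped character)
const stringLiteral = /"[^"\\]*(?:\\.[^"\\]*)*"/g;
// Multiline comment: /* ... */ with "*" runs handled as the special case
const blockComment = /\/\*[^*]*\*+(?:[^/*][^*]*\*+)*\//g;

// Assumed sample. After JavaScript unescaping, the text is:
// var s = "String\u0020:\"text\""; /* a * b ** c */ var t = "x";
const source =
  'var s = "String\\u0020:\\"text\\""; /* a * b ** c */ var t = "x";';

const strings = source.match(stringLiteral);
const comments = source.match(blockComment);

console.log(strings.length); // 2
console.log(strings[1]);     // "x"
console.log(comments[0]);    // /* a * b ** c */
```

Note that \\. uses a plain dot, so an escaped newline inside a string literal would not be matched unless the s (dotAll) flag is added.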
