Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
540 views
in Technique[技术] by (71.8m points)

regex - Should I avoid "|" in flex patterns?

I've heard that the "|" operator slows down regex matching, and it certainly seems to be true in Perl, for example.

Do I have to worry about that when I build scanners with tools like the Flex lexer generator?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Absolutely not. Flex (like the lex lexer generator on which it was based, and most other similar compiler-construction tools) compiles all of the regular expressions in the scanner into a single Deterministic Finite State Automaton (DFA). The DFA never backs up during the scan of a lexical token, and neither the complexity of the component regular expressions nor the precise operators they use make the slightest difference.

This is quite different from Perl's approach to regex matching. Perl (at least under some circumstances) will explore the possible subpatterns in an alternation expression one at a time, with the consequence that a significant performance hit can be noticed. This effect was demonstrated by a Perl benchmark constructed by @sln in this answer.

To show that this is not true with Flex-generated parsers, I constructed a very similar benchmark in Flex. The two Flex input files:

(With |):

%{
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  static int echo_match = 0;
  static const char* regex="'''([^']|['][^']|[']['][^'])*'''";
%}

%option noyywrap noinput nounput nodefault

%%

[^'
]+   { }

        { }
'[^'
]*' {
            if (echo_match) printf("%s
", yytext);
          }
'''([^']|['][^']|[']['][^'])*''' {
            if (echo_match) printf("%s
", yytext);
          }
'         { fputs("Unmatched quote
", stderr); }

Without | is identical, except for the regex pattern (in two places):

'''[^']*(?:[']{1,2}[^']+)*'''

I then used this main to test each of the scanners:

static const char* dataset[] = {
  "'''Set 1 - this
is another
multiline
string'''",
  "'''Set 2 - this
is' another
mul'tiline
st''ring'''"
};

#define COUNTOF(ARY) (sizeof(ARY)/sizeof(ARY[0]))

int main(int argc, char** argv) {
  int reps = 500000;
  printf("-----------------------
Using regex: %s
", regex);
  for (unsigned d = 0; d < COUNTOF(dataset); ++d) {
    echo_match = 1;
    yy_scan_string(dataset[d]);
    yylex();
    yy_delete_buffer(YY_CURRENT_BUFFER);
    echo_match = 0;
    struct timespec before, after;
    if (clock_gettime(CLOCK_REALTIME, &before)) perror("gettime");
    for (int r = 0; r < reps; ++r) {
      yy_scan_string(dataset[d]);
      yylex();
      yy_delete_buffer(YY_CURRENT_BUFFER);
    }
    if (clock_gettime(CLOCK_REALTIME, &after)) perror("gettime");
    unsigned long elapsed =
        (after.tv_sec - before.tv_sec) * 1000000
        + (after.tv_nsec - before.tv_nsec) / 1000;
    printf("Wall-clock: %ld microseconds
", elapsed);
  }
  return 0;
}

Random results of the two sets of benchmarks (from a set of 10 runs of each):

$ tail -n+$((1+12*(RAND/10))) threeq.log | head -n12
-----------------------
Using regex: '''([^']|['][^']|[']['][^'])*'''
'''Set 1 - this
is another
multiline
string'''
Wall-clock: 243410 microseconds
'''Set 2 - this
is' another
mul'tiline
st''ring'''
Wall-clock: 233519 microseconds

$ tail -n+$((1+12*(RAND/10))) threeq2.log | head -n12
-----------------------
Using regex: '''[^']*(?:[']{1,2}[^']+)*'''
'''Set 1 - this
is another
multiline
string'''
Wall-clock: 246842 microseconds
'''Set 2 - this
is' another
mul'tiline
st''ring'''
Wall-clock: 242191 microseconds

There are certain cases where Flex needs to rescan input after a token has been recognized, but these have nothing to do with the alternation operator |. Because Flex always tries to find the longest match, it might need to continue scanning even after a token has been recognized, in case that token were a prefix of another possible token. If the longer token proves to not be matchable, then the Flex scanner will back up to the end of the longest token matched, and the remaining characters will be rescanned in the next token.

As an example, in C, . and ... are both valid tokens, but .. is not. If .. appears in the input, then a Flex-built scanner would have to continue scanning after the first . in order to see if it could match .... However, when the third character proves not to be a ., it will have to return the . token, and then the second . will be rescanned. (That will almost always be a syntax error, so the issue is not very serious.)

Another case where Flex-built scanners will need to rescan is when the / (trailing context) operator is used. Since the trailing context is not actually part of the returned token, it must be rescanned as part of the next token.

Neither of these situations are common, and the rescanned sequences are generally very short (typically one character), so the performance impact is trivial. But if you really worry about this, you can provide the --backup option to Flex to ask it to prepare a report of backup states. If you manage to eliminate all the backup states, you will get a small performance boost, but most of the time this qualifies as premature optimization.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.8k users

...