Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
174 views
in Technique[技术] by (71.8m points)

c++ - Optimizing flex string literal parsing

I am starting writing a lexical analyzer for my programming language.

String literals in this language start with a " and end when an unescaped " is encountered. Everything inside (including newlines) is preserved, except escape sequences (the usual s, s, "s etc plus a way of escaping a character by using its ASCII code, e.g. 97 or 97).

This is the code I have written so far:

%{
#include <iostream>
#define YY_DECL extern "C" int yylex()

std::string buffstr;
%}
%x SSTATE
%%

"                   {
                         buffstr.clear();
                         BEGIN(SSTATE);
                     }
<SSTATE>\[0-9]{1,3} {
                         unsigned code = atoi(yytext + 1);
                         if (code > 255) {
                             std::cerr << "SyntaxError: decimal escape sequence larger than 255 (" << code << ')' << std::endl;
                             exit(1);
                         }
                         buffstr += code;
                     }

<SSTATE>\a          buffstr += 'a';
<SSTATE>\b          buffstr += '';
<SSTATE>\f          buffstr += 'f';
<SSTATE>
           buffstr += '
';
<SSTATE>
           buffstr += '
';
<SSTATE>           buffstr += '';
<SSTATE>v           buffstr += 'v';
<SSTATE>\\         buffstr += '\';
<SSTATE>"         buffstr += '"';
<SSTATE>\.          {
                         std::cerr << "SyntaxError: invalid escape sequence (" << yytext << ')' << std::endl;
                         exit(1);
                     }
<SSTATE>"           {
                         std::cout << "Found a string: " << buffstr << std::endl;
                         BEGIN(INITIAL);
                     }
<SSTATE>.            buffstr += yytext[0];

.                    ;

%%

int main(int argc, char** argv) {
    yylex();
}

It works perfectly, but as you can see it's not particularly optimized.

It's appending a character to a std::string once for each character in the string literal being parsed, which is not ideal.

I wonder if there's a bettere way of doing it, for an example storing a pointer and increasing a lenght and then building the string with std::string(const char* ptr, size_t lenght).

Is there one? What would be it?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It's probably the case that the code provided is sufficiently fast for all practical purposes, and that you should not worry about optimizing it until you actually observe it being a bottleneck. Lexical scans, even inefficient ones, are rarely an important contribution to compile times.

However, some optimizations are straight-forward.

The easiest one is to observe that most strings do not contain escape sequences. So applying the usual optimization technique of going for the low-lying fruit, we start by handling strings without escape sequences in one single pattern, without even passing through the separate lexical state. [Note 1]

"[^"\]*"   { yylval.str = new std::string(yytext + 1, yyleng - 2); 
                return T_STRING;
              }

(F)lex provides yyleng which is the length of the token it found, so there is never really any reason to recompute the length with strlen. In this case, we don't want the surrounding double quotes in the string, so we select yyleng - 2 characters starting at the second character.

Of course, we need to handle the escape codes; we can use a start condition similar to yours to do so. We only enter this start condition when we find an escape character inside the string literal. [Note 2] To catch this case, we rely on the maximal munch rule implemented by (f)lex, which is that the pattern with the longest match beats out any other patterns which happen to match at the same input point. [Note 3] Since we've already matched any token which starts with a " and does not include a backslash before the closing ", we can add a very similar pattern without the closing quote which will only match in case the first rule doesn't, because the match with the closing quote is one character longer.

"[^"\]*     { yylval.str = new std::string(yytext + 1, yyleng - 1);
                BEGIN(S_STRING);
                /* No return, so the scanner will continue in the new state */
              }

In the S_STRING state, we can still match sequences (not just single characters) which don't contain a backslash, thereby reducing significantly the number of action executions and string appends:

(Braced pattern lists in a start condition are a flex extension.)

<S_STRING>{
  [^"\]+       { yylval.str->append(yytext, yyleng); }
  \n           { (*yylval.str) += '
'; }
   /* Etc. Handle other escape sequences similarly */
  \.           { (*yylval.str) += yytext[1]; }
  \
          { /* A backslash at the end of the line. Do nothing */ }
  "            { BEGIN(INITIAL); return T_STRING; }
     /* See below */
}

When we eventually find an unescaped double-quote, which will match the last pattern, we first reset the lexical state, and then return the string which has been completely constructed.

The pattern \ actually matches a backslash at the very end of the line. It's common to completely ignore this backslash and the newline, in order to allow long strings to be continued over several source lines. If you don't want to provide this feature, just change the . pattern to (.| ).

And what if we don't find an unescaped double-quote? That is, what if the closing double quote was accidentally omitted? We will end up in the S_STRING start condition in this case, since the string was not terminated by a quote, and so the fallback pattern will match. In the S_STRING patterns, we need to add two more possibilities:

<S_STRING>{
    // ... As above
  <<EOF>>      |
  \           { /* Signal a lexical error */ }
}

The first of these rules catches the simple unterminated string error. The second one catches the case in which a backslash was not followed by a legitimate character, which given the other rules can only happen if a backslash is the very last character in a program with an unterminated string. Unlikely though that is, it can happen so we should catch it.


One further optimization is relatively simple, although I wouldn't recommend it because it mostly just complicates the code, and the benefit is infinitesimal. (For this very reason, I haven't included any sample code.)

In the start condition, a backslash (almost) always results in appending a single character to the string we're accumulating, which means that we might resize the string in order to do this append, even though we just resized it to append the non-escaped characters. Instead, we could add one additional character to the string in the action which matches the non-escape characters. (Because (f)lex modifies the input buffer to NUL-terminate the token, the character following the token will always be a NUL, so increasing the length of the append by one will insert this NUL and not the backslash into the string. But that's not important.)

Then the code which handles the escape character needs to replace the last character in the string rather than appending a single character to the string, thereby avoiding one append call. Of course, in the cases where we don't want to insert anything, we'll need to reduce the size of the string by one character, and if there is an escape sequence (such as unicode escapes) which add more than one byte to the string, we'll need to do some other acrobatics.

In short, I'd qualify this as a hack more than an optimization. But for what it's worth, I have done things like this in the past, so I have to plead guilty to the charge of premature optimization, too.


Notes

  1. Your code only prints out the token, which makes it hard to know what your design is for passing the string to the parser. I'm assuming here one more or less standard strategy in which the semantic value yylval is a union one of whose members is a std::string* (not a std::string). I don't address the resulting memory management issues, but a %destruct declaration will help a lot.

  2. In the original version of this answer, I suggested catching this case by using a pattern which matches a backslash as trailing context:

    "[^"\]*/\    { yylval.str = new std::string(yytext + 1, yyleng - 1);
                      BEGIN(S_STRING);
                      /* No return, so the scanner will continue in the new state */
                }
    

    But using the maximal munch rule is simpler and more general.

  3. If more than one pattern has the same longest match, the first one in the scanner description wins.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...