This is quite a common issue with using the lex/flex tools that stumps beginners (and sometime non-beginners). There are two solutions to the problem that require two different advanced features of the tools. A phrase like dog ... cat
is much the same problem as matching comments in various programming languages, such as the C comment form /* ... */
or even 'comment' ... 'tnemmoc'
. These have exactly the same characteristics as your example. Consider the following C code:
/* This is a comment */ "This is a String */"
A greedy lexical match of that would match the wrong comment terminator (and is a good test of a student lexer BTW!).
There are suggested solutions on several university compiler courses. The one that explains it well is here (at Manchester). Which cites a couple of good books which also cover the problems:
- J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.)
- M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator
The two techniques described are to use Start Conditions to explicity specify the state machine, or manual input to read characters directly.
For your cat ... dog
problem they can be programmed in the following ways:
Start Conditions
In this solution we need several states. The keyword dog
causes causes it to enter the DOG
state which continues until a letter c
is encountered. This then enters the LETTERC
state which must be followed by a letter a
, if not the DOG
state continues; a letter a
causes the CAT
state to be entered which must be followed by a letter t
which causes the entire phrase to be matched and returns to the INITIAL
state. The yymore
causes the entire dog ... cat
text to be retained for use.
%x DOG LETTERC CAT
d [dD]
o [oO]
g [gG]
c [cC]
a [aA]
t [tT]
ws [
]+
%%
<INITIAL>{d}{o}{g} {
BEGIN(DOG);
printf("DOG
");
yymore();
}
<DOG>[^cC]*{c} {
printf("C: %s
",yytext);
yymore();
BEGIN(LETTERC);
}
<LETTERC>{a} {
printf("A: %s
",yytext);
yymore();
BEGIN(CAT);
}
<LETTERC>[^aA] {
BEGIN(DOG);
yymore();
}
<CAT>{t} {
printf("CAT: %s
",yytext);
BEGIN(INITIAL);
}
<CAT>[^tT] {
BEGIN(DOG);
yymore();
}
<INITIAL>{ws} /* skip */ ;
Manual Input
The Manual input method just matches the start phrase dog
and the enters C code which swallows up input characters until the desired cat
sequence is encountered. (I did not bother with both upper and lower case letters). The problem with this solution is that it is hard to retain the input text value in yytext for later use in the parser. It discards it, which would be OK if the construct is a comment, but no so useful otherwise.
d [dD]
o [oO]
g [gG]
ws [
]+
%%
{d}{o}{g} {
register int c;
for ( ; ; )
{
/* Not dealt with upper case .. left as an exercise */
while ( (c = input()) != 'c' &&
c != EOF )
; /* eat up text of dog */
if ( c == 'c' )
{
if ( ( c = input()) == 'a' )
if ( (c = input()) == 't' )
break; /* found the end */
}
if ( c == EOF )
{
REJECT;
break;
}
}
/* because we have used input() yytext always contains "dog" */
printf("cat: %s
", yytext);
}
{ws} /* skip */ ;
(Both these solutions have been tested)
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…