Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Non-Greedy Regular Expression Matching in Flex

I have just started with Flex and can't figure out how to match the following expression:

"Dog".*"Cat"
------------------
Input :
Dog Ca Cat Cc Cat
------------------
Output:
Dog Ca Cat Cc Cat

But I want a non-greedy match, with the following output:

Output:
Dog Ca Cat

How can this be achieved in Flex?

EDIT

Tried the following :

%%
Dog.*Cat/.*Cat  printf("Matched : ||%s||", yytext);
dog.*cat        printf("Matched : ||%s||", yytext);
dOg[^c]*cAt     printf("Matched : ||%s||", yytext);
DOG.*?CAT       printf("Matched : ||%s||", yytext);
%%

Input :

Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT

Output :

Matched : ||Dog Ca Cat Cc Cat||
Matched : ||dog Ca cat Cc cat||
Matched : ||dOg Ca cAt|| Cc cAt
Matched : ||DOG CA CAT CC CAT||

Also receiving a warning :

lex4.l:2: warning, dangerous trailing context

Flex Version :

flex 2.5.35 Apple(flex-31)


1 Reply


This is a common issue with the lex/flex tools, and it stumps beginners (and sometimes non-beginners). There are two solutions, each relying on a different advanced feature of the tools. A phrase like dog ... cat is much the same problem as matching comments in various programming languages, such as the C comment form /* ... */ or even 'comment' ... 'tnemmoc'. These have exactly the same characteristics as your example. Consider the following C code:

/* This is a comment */ "This is a String */"

A greedy lexical match would end at the wrong comment terminator, the one inside the string (this is a good test of a student lexer, BTW!).
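The difference between the two choices can be sketched in plain C (this is not flex code). The helper names shortest_match and greedy_match are made up for illustration, and the sketch only looks at raw delimiters, so it ignores string literals and everything else a real lexer would need to handle.

```c
#include <stddef.h>
#include <string.h>

// Length of the shortest match: from the start of s up to and including
// the FIRST occurrence of close that follows open. Returns 0 if s does
// not begin with open or if close never appears.
size_t shortest_match(const char *s, const char *open, const char *close)
{
    size_t open_len = strlen(open);
    if (strncmp(s, open, open_len) != 0)
        return 0;
    const char *end = strstr(s + open_len, close);
    return end ? (size_t)(end - s) + strlen(close) : 0;
}

// Length of the greedy match: up to and including the LAST occurrence of
// close, which is what a pattern of the form open.*close selects on a
// single line.
size_t greedy_match(const char *s, const char *open, const char *close)
{
    size_t open_len = strlen(open);
    if (strncmp(s, open, open_len) != 0)
        return 0;
    const char *end = NULL;
    for (const char *p = strstr(s + open_len, close); p; p = strstr(p + 1, close))
        end = p;    // remember the latest terminator seen
    return end ? (size_t)(end - s) + strlen(close) : 0;
}
```

On the question's input, Dog Ca Cat Cc Cat, the shortest match covers the 10 characters of Dog Ca Cat, while the greedy match covers all 17.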

Several university compiler courses suggest solutions. One that explains it well is from Manchester, and it cites two good books that also cover the problem:

  • J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.)
  • M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator

The two techniques described are to use Start Conditions to explicitly specify the state machine, or manual input to read characters directly.

For your dog ... cat problem they can be programmed in the following ways:

Start Conditions

In this solution we need several states. The keyword dog causes the scanner to enter the DOG state, which continues until a letter c is encountered. This then enters the LETTERC state, which must be followed by a letter a; if not, the DOG state continues. A letter a causes the CAT state to be entered, which must be followed by a letter t; that matches the entire phrase and returns to the INITIAL state. The yymore() calls cause the entire dog ... cat text to be retained in yytext for use.

%x DOG LETTERC CAT
d [dD]
o [oO]
g [gG]
c [cC]
a [aA]
t [tT]
ws [ \t\r\n]+
%%

<INITIAL>{d}{o}{g} {
        BEGIN(DOG);
        printf("DOG\n");
        yymore();
        }
<DOG>[^cC]*{c} {
        printf("C: %s\n",yytext);
        yymore();
        BEGIN(LETTERC);
        }
<LETTERC>{a} {
        printf("A: %s\n",yytext);
        yymore();
        BEGIN(CAT);
        }
<LETTERC>{c} {
        /* another c, as in "ccat": stay in LETTERC */
        yymore();
        }
<LETTERC>[^aA] {
        BEGIN(DOG);
        yymore();
        }
<CAT>{t} {
        printf("CAT: %s\n",yytext);
        BEGIN(INITIAL);
        }
<CAT>{c} {
        /* a c, as in "cacat", may start a new "cat" */
        BEGIN(LETTERC);
        yymore();
        }
<CAT>[^tT] {
        BEGIN(DOG);
        yymore();
        }
<INITIAL>{ws}  /* skip */ ;
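The same state machine can be sketched as an ordinary C function, which may make the transitions easier to follow outside of flex. This is only an illustration: match_dog_cat and the enum are hypothetical names, and the sketch treats a repeated c as a possible start of cat so that input such as ccat or cacat still ends the match.

```c
#include <ctype.h>
#include <stddef.h>

enum state { DOG, LETTERC, CAT };

// Returns the length of the shortest prefix of s of the form "dog ... cat"
// (letters in either case), or 0 if s does not start with "dog" or no
// closing "cat" follows.
size_t match_dog_cat(const char *s)
{
    if (tolower((unsigned char)s[0]) != 'd' ||
        tolower((unsigned char)s[1]) != 'o' ||
        tolower((unsigned char)s[2]) != 'g')
        return 0;

    enum state st = DOG;
    for (size_t i = 3; s[i] != '\0'; i++) {
        int ch = tolower((unsigned char)s[i]);
        switch (st) {
        case DOG:                       // waiting for a 'c'
            if (ch == 'c')
                st = LETTERC;
            break;
        case LETTERC:                   // seen "c", need 'a'
            if (ch == 'a')
                st = CAT;
            else if (ch != 'c')         // another 'c' keeps us here
                st = DOG;
            break;
        case CAT:                       // seen "ca", need 't'
            if (ch == 't')
                return i + 1;           // first "cat" ends the match
            st = (ch == 'c') ? LETTERC : DOG;
            break;
        }
    }
    return 0;                           // ran out of input
}
```

Because the function returns at the first t that completes cat, it never looks at the later Cc Cat in the question's input.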

Manual Input

The manual input method just matches the start phrase dog and then enters C code which swallows input characters until the desired cat sequence is encountered. (I did not bother with both upper and lower case letters.) The problem with this solution is that it is hard to retain the input text in yytext for later use in the parser. The text is discarded, which would be OK if the construct were a comment, but is not so useful otherwise.

d [dD]
o [oO]
g [gG]
ws [ \t\r\n]+
%%
{d}{o}{g}   {
            int c;
            int seen = 0;    /* characters of "cat" matched so far */

            /* Not dealt with upper case .. left as an exercise */
            while ( seen < 3 && (c = input()) != EOF )
                {
                if ( c == "cat"[seen] )
                    seen++;                         /* next letter of "cat" */
                else
                    seen = ( c == 'c' ) ? 1 : 0;    /* a c may start a new "cat" */
                }

            if ( seen < 3 )
                {
                REJECT;    /* end of input without finding cat */
                }

            /* because we have used input() yytext always contains "dog" */
            printf("cat: %s\n", yytext);
       }
{ws}  /* skip */ ;
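One possible way around the lost text is to copy each consumed character into a buffer of your own. The following is a minimal sketch in plain C, not flex code: struct source and next_char are hypothetical stand-ins for the scanner's input() call, and collect_until_cat is an invented name. As above, only lower-case cat is recognised.

```c
#include <stddef.h>
#include <stdio.h>   // for EOF

// Hypothetical character source standing in for flex's input():
// returns the next character, or EOF when the text is exhausted.
struct source { const char *text; size_t pos; };

static int next_char(struct source *src)
{
    unsigned char ch = (unsigned char)src->text[src->pos];
    if (ch == '\0')
        return EOF;
    src->pos++;
    return ch;
}

// Reads the characters that follow an already-matched "dog" until "cat"
// has been seen, copying everything consumed into buf (capacity cap,
// always NUL-terminated when cap > 0). Returns 1 if "cat" was found,
// 0 if the input ended or the buffer filled up first.
int collect_until_cat(struct source *src, char *buf, size_t cap)
{
    static const char target[] = "cat";
    size_t len = 0, seen = 0;
    int c;

    if (cap == 0)
        return 0;
    buf[0] = '\0';
    while (seen < 3 && len + 1 < cap && (c = next_char(src)) != EOF) {
        buf[len++] = (char)c;           // keep what was consumed
        buf[len] = '\0';
        if (c == target[seen])
            seen++;                     // next letter of "cat"
        else
            seen = (c == 'c') ? 1 : 0;  // a 'c' may start a new "cat"
    }
    return seen == 3;
}
```

In a flex action the loop would call input() instead of next_char, and the parser would be handed the buffer contents (prefixed with yytext) rather than yytext alone.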

(Both these solutions have been tested)

