Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
850 views
in Technique[技术] by (71.8m points)

parsing - parse bibtex with flex+bison: revisited

For last few weeks, I am trying to write a parser for bibtex (http://www.bibtex.org/Format/) file using flex and bison.

$ cat raw.l
%{
#include "raw.tab.h" 
%}
value ["{][a-zA-Z0-9 .{} "\]*["}]
%%
[a-zA-Z]*               return(KEY);
"                          return(QUOTE);
{                          return(OBRACE);
}                          return(EBRACE);
;                           return(SEMICOLON);
[ ]+                  /* ignore whitespace */;
{value}     {
    yylval.sval = malloc(strlen(yytext));
    strncpy(yylval.sval, yytext, strlen(yytext));
    return(VALUE);
}

$ cat raw.y
%{
#include <stdio.h>
%}

//Symbols.
%union
{
 char *sval;
};
%token <sval> VALUE
%token KEY
%token OBRACE
%token EBRACE
%token QUOTE
%token SEMICOLON 

%start Entry
%%

Entry:
     '@'KEY OBRACE VALUE ',' 
     KeyVal
     EBRACE
     ;

KeyVal:
      /* empty */
      | KeyVal '=' VALUE ','
      | KeyVal '=' VALUE 
      ;
%%

int yyerror(char *s) {
  printf("yyerror : %s
",s);
}

int main(void) {
  yyparse();

}

%% A sample bibtex is:

@Book{a1,
    author = "a {"m}ook, Rudra Banerjee",
    Title="ASR",
    Publisher="oxf",
    Year="2010",
    Add="UK",
    Edition="1",
}
@Article{a2,
    Author="Rudra Banerjee",
    Title="Fe{"Ni}Mo",
    Publisher={P{"R}B},
    Issue="12",
    Page="36690",
    Year="2011",
    Add="UK",
    Edition="1",
}

When I am trying to parse it, its giving syntax error. with GDB, it shows it expect fields in KEY to be declared(probably),

Reading symbols from /home/rudra/Programs/lex/Parsing/a.out...done.
(gdb) Undefined command: "".  Try "help".
(gdb) Undefined command: "Author".  Try "help".
(gdb) Undefined command: "Editor".  Try "help".
(gdb) Undefined command: "Title".  Try "help".
.....

I will be grateful if someone kindly help me on this.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Lots of problems. First, your lexer is confused, trying to recognize quoted strings and braced things as a single VALUE as well as trying to recognize single characters like " and {. For quotes, it makes sense to have the lexer recognize the whole string, but for structural things that you want to parse (like braced lists), you need to return single tokens for the parser to parse. Second, when allocating space for a string, you aren't allocating space for a NUL-terminiator. Finally, your grammar looks odd, wanting parse things like = VALUE = VALUE as a KeyValue, which doesn't correspond to anything in a bibtex file.

So first, for the lexer. You want to recognize quoted strings and identifiers, but other things should be single characters:

[A-Za-z][A-Za-z0-9]*      { yylval.sval = strdup(yytext); return KEY; }
"([^"]|\.)*"          { yylval.sval = strdup(yytext); return VALUE; }
[ 
]                   ; /* ignore whitespace */
[{}@=,]                   { return *yytext; }
.                         { fprintf(stderr, "Unrecognized character %c in input
", *yytext); }

Now you need a parser for the entries:

Input: /* empty */ | Input Entry ;  /* input is zero or more entires */
Entry: '@' KEY '{' KEY ',' KeyVals '}' ;
KeyVals: /* empty */ | KeyVals KeyVal ; /* zero or more keyvals */
KeyVal: KEY '=' VALUE ',' ;

That should parse the example you give.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...