Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


lex - What is the regular expression for a multi-line string?

I am learning to write a compiler, and the language has rules for strings such as a single-line string:

char ch[] = "abcd";

and a multi-line string:

printf("This is\
a multi\
string");

I wrote the regular expression

STRING \"([^\"\n]|\\{NEWLINE})*\"

It works fine with single-line strings, but it doesn't work with multi-line strings where a line ends with a '\' character. What should I change?


1 Reply


A common string pattern is

\"([^"\\\n]|\\(.|\n))*\"

This will match strings which include escaped double quotes (\") and backslashes (\\). It uses \\(.|\n) to allow any character after a backslash, including a newline. Although some backslash sequences are longer than one character (\x40, for example), none of them include non-alphanumerics after the first character.
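
To make the behaviour of that pattern concrete, here is a minimal hand-written C sketch of what \"([^"\\\n]|\\(.|\n))*\" accepts. It is not what flex generates; the function name match_string_literal is made up for illustration, and the input is assumed to be a NUL-terminated buffer.

```c
#include <stddef.h>

/* Hand-written equivalent of the pattern  \"([^"\\\n]|\\(.|\n))*\"
 * (hypothetical helper, not flex output).
 * Returns the length of the string literal that starts at s, quotes
 * included, or 0 if s does not start with a complete literal. */
size_t match_string_literal(const char *s)
{
    size_t i = 0;

    if (s[i] != '"')
        return 0;                  /* must open with a double quote */
    i++;

    for (;;) {
        char c = s[i];

        if (c == '"')
            return i + 1;          /* closing quote: whole literal matched */
        if (c == '\0' || c == '\n')
            return 0;              /* end of input or unescaped newline */
        if (c == '\\') {
            if (s[i + 1] == '\0')
                return 0;          /* backslash with nothing after it */
            i += 2;                /* \\(.|\n): backslash plus any one character */
        } else {
            i++;                   /* [^"\\\n]: any ordinary character */
        }
    }
}
```

A string whose lines end in a backslash is accepted because the backslash and the newline are consumed together as one escape, while a bare newline inside the quotes makes the match fail.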

It is possible that your input includes Windows line endings (CR-LF), in which case the backslash will not be directly followed by a newline; it will be followed by a carriage return. If you want to accept that input rather than reporting an error (which might be more appropriate), you need to do so explicitly:

\"([^"\\\n]|\\(.|\r?\n))*\"

But recognising a string and understanding what the string represents are two different things. Normally a compiler will need to turn the representation of a string into a byte sequence, and that requires, for example, turning \n into the byte 10 and removing backslashed newlines altogether.

That task can easily be done in a (f)lex scanner using start conditions. (Or, of course, you can rescan the string using a different lexical scanner.)
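
As an illustration of the second option (rescanning the matched text), here is a rough C sketch of the decoding step. The function name decode_string_literal is hypothetical, only a handful of escapes are handled, and the input is assumed to be a literal that the scanner has already accepted, quotes included.

```c
#include <stddef.h>

/* Decode the text of an already-matched literal (quotes included) into out.
 * Returns the number of bytes written; out needs room for at most len bytes.
 * Hypothetical helper: only \n, \t, \" and \\ are interpreted, and any other
 * escaped character is copied as-is. */
size_t decode_string_literal(const char *lit, size_t len, char *out)
{
    size_t n = 0;

    /* Start after the opening quote and stop before the closing one. */
    for (size_t i = 1; i + 1 < len; i++) {
        char c = lit[i];

        if (c != '\\') {
            out[n++] = c;          /* ordinary character */
            continue;
        }

        c = lit[++i];              /* the character after the backslash */
        switch (c) {
        case 'n':
            out[n++] = '\n';       /* \n becomes the byte 10 */
            break;
        case 't':
            out[n++] = '\t';
            break;
        case '\n':
            break;                 /* backslashed newline: removed */
        case '\r':
            if (i + 2 < len && lit[i + 1] == '\n')
                i++;               /* backslashed CR-LF: remove both */
            else
                out[n++] = c;      /* lone carriage return: keep it */
            break;
        default:
            out[n++] = c;          /* \" and \\ and anything else */
            break;
        }
    }
    return n;
}
```

The result is a length-counted byte sequence rather than a C string, since a real decoder that also handles octal or hex escapes can produce embedded zero bytes.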

Additionally, you need to think about error handling. Once you ban strings with unescaped newlines (as C does), you open the door to the possibility of an unterminated string, where a newline is encountered before the closing quote. The same could happen at the end of the file if a string is not correctly closed.

If you have a single-character fallback rule, it will recognise the opening quote of an unterminated string. This is not desirable, because the scanner will then scan the contents of the string as program text, leading to a cascade of errors. If you are not attempting error recovery, it doesn't matter; but if you are, it is usually better to at least recognise the unterminated string as such up to the newline, using a different pattern.
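
One way to picture that different pattern is the same expression without the closing quote, \"([^"\\\n]|\\(.|\n))*, added as a second rule; because (f)lex prefers the longest match, the complete-string rule still wins whenever the closing quote is present. The C sketch below imitates the combined effect of both rules by hand. The names scan_string and string_status are invented for this example, and the input is assumed to be NUL-terminated.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum string_status { NOT_A_STRING, STRING_OK, STRING_UNTERMINATED };

/* Scan a string literal starting at s and store the number of characters
 * consumed in *len. An unterminated literal is consumed up to, but not
 * including, the offending newline (or up to the end of input). */
enum string_status scan_string(const char *s, size_t *len)
{
    size_t i = 1;

    *len = 0;
    if (s[0] != '"')
        return NOT_A_STRING;

    while (s[i] != '\0' && s[i] != '\n') {
        if (s[i] == '"') {
            *len = i + 1;          /* closing quote found */
            return STRING_OK;
        }
        if (s[i] == '\\' && s[i + 1] != '\0')
            i += 2;                /* escape, including backslash-newline */
        else
            i++;                   /* ordinary character */
    }

    *len = i;                      /* everything up to the newline or end */
    return STRING_UNTERMINATED;
}
```

With this shape the caller can report a single "unterminated string" error and resume scanning at the newline, instead of treating the string contents as program text.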



...