Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Regular Expression Compiler

I have needed regular expressions only a few times in my work. On those few occasions, however, I discovered a very powerful form of expression that let me do some extremely useful things.

The problem is that the language used for regular expressions is wrong - full stop.

It is wrong from a psychological point of view: disembodied symbols are a useful reference only to those with an eidetic memory. Although the syntactic rules are clearly laid out, in my experience and from what I have learnt from others, evolving a regular expression that works can be difficult in all but the most trivial situations. This is understandable, since it is a symbolic analog for set theory, which is a fairly complicated thing.

One of the things that can prove difficult is breaking the expression you are working on down into its discrete parts. Due to the nature of the language, one regular expression can be read in multiple ways if you don't understand its primary goal, so interpreting other people's regexes is complicated. In natural language study I believe this is called pragmatics.

The question I'd like to ask then is this - is there such a thing as a regular expression compiler? Or can one even be built?

It could be possible to consider regexes, from a metaphorical point of view, as assembly language - there are some similarities. Could a compiler be designed that could turn a more natural language - a higher language - into regular expressions? Then in my code, I could define my regexes using the higher level language in a header file and reference them where necessary using a symbolic reference. I and others could refer from my code to the header file and more easily appreciate what I am trying to achieve with my regexes.

I know it can be done from a logical point of view, otherwise computers wouldn't be possible. But if you have read this far, would you consider investing the time in realising it?


1 Reply


1) Perl permits the /x switch on regular expressions to enable comments and whitespace to be included inside the regex itself. This makes it possible to spread a complex regex over several lines, using indentation to indicate block structure.
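
For readers outside Perl, Python's re.VERBOSE flag works much like /x. The following is only a sketch under that assumption: the name QUOTED_STRING and the pattern itself (a double-quoted string with backslash escapes) are illustrative choices, not something prescribed above.

```python
import re

# Verbose-mode pattern for a double-quoted string with backslash escapes.
# QUOTED_STRING is an illustrative name chosen for this sketch.
QUOTED_STRING = re.compile(r'''
    ^ "               # opening quote at the start of the input
    (?:
        [^"\\]        # any character except a quote or a backslash
      |
        \\ .          # or a backslash followed by any one character
    )*
    "                 # closing quote
''', re.VERBOSE)

if __name__ == "__main__":
    m = QUOTED_STRING.match(r'"a \"quoted\" word" and the rest')
    print(m.group(0) if m else "no match")
```

Outside character classes, whitespace and # comments in the pattern are ignored by the regex engine, so the layout is free to follow the structure of the expression.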

2) If you don't like the line-noise-resembling symbols themselves, it's not too hard to write your own functions that build regular expressions. E.g. in Perl:

sub at_start { '^'; }
sub at_end { '$'; }
sub any { "."; }
sub zero_or_more { "(?:$_[0])*"; }
sub one_or_more { "(?:$_[0])+"; }
sub optional { "(?:$_[0])?"; }
sub remember { "($_[0])"; }
sub one_of { "(?:" . join("|", @_) . ")"; }
sub in_charset { "[$_[0]]"; }       # I know it's broken for ']'...
sub not_in_charset { "[^$_[0]]"; }   # I know it's broken for ']'...

Then e.g. a regex to match a quoted string (/^"(?:[^"]|\\.)*"/) becomes:

at_start .
'"' .
zero_or_more(
    one_of(
        not_in_charset('"'),
        '\\\\' . any            # Yuck, 2 levels of escaping required
    )
) .
'"'

This "string-building functions" strategy lends itself to expressing useful building blocks as functions (e.g. the above regex could be stored in a function called quoted_string(); you might have other functions for reliably matching any numeric value, an email address, etc.).
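
The same string-building strategy carries over to other languages. Below is a minimal Python sketch, not a translation of the Perl code above: the function names mirror the Perl ones where possible (any_char and literal are my own additions), re.escape is used so that characters such as ']' are safe inside a character class, and quoted_string() also excludes the backslash from the class, a small departure from the example above, so that an escaped quote is not consumed as an ordinary character.

```python
import re

def at_start():
    return "^"

def any_char():
    return "."

def literal(text):
    # Escape regex metacharacters so the text matches itself.
    return re.escape(text)

def zero_or_more(pattern):
    return f"(?:{pattern})*"

def one_of(*alternatives):
    return "(?:" + "|".join(alternatives) + ")"

def not_in_charset(chars):
    # Takes literal characters, not ranges: re.escape also escapes '-',
    # ']', '^' and '\', so none of them can break the class.
    return "[^" + re.escape(chars) + "]"

def quoted_string():
    # A named building block: a double-quoted string with backslash escapes.
    return (
        at_start()
        + literal('"')
        + zero_or_more(
            one_of(
                not_in_charset('"\\'),       # anything but a quote or backslash
                literal("\\") + any_char(),  # a backslash plus any character
            )
        )
        + literal('"')
    )

QUOTED = re.compile(quoted_string())

if __name__ == "__main__":
    m = QUOTED.match(r'"say \"hi\"" tail')
    print(m.group(0) if m else "no match")
```

Because each block is an ordinary function returning a string, larger patterns can be assembled from named pieces such as quoted_string() and compiled once where they are used.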

