Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
244 views
in Technique[技术] by (71.8m points)

python - Regex for managing escaped characters for items like string literals

I would like to be able to match a string literal with the option of escaped quotations. For instance, I'd like to be able to search "this is a 'test with escaped' values' ok" and have it properly recognize the backslash as an escape character. I've tried solutions like the following:

import re
regexc = re.compile(r"'(.*?)(?<!\)'")
match = regexc.search(r""" Example: 'Foo ' Bar'  End. """)
print match.groups() 
# I want ("Foo ' Bar") to be printed above

After looking at this, there is a simple problem that the escape character being used, "", can't be escaped itself. I can't figure out how to do that. I wanted a solution like the following, but negative lookbehind assertions need to be fixed length:

# ...
re.compile(r"'(.*?)(?<!\(\\)*)'")
# ...

Any regex gurus able to tackle this problem? Thanks.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

re_single_quote = r"'[^'\]*(?:\.[^'\]*)*'"

First note that MizardX's answer is 100% accurate. I'd just like to add some additional recommendations regarding efficiency. Secondly, I'd like to note that this problem was solved and optimized long ago - See: Mastering Regular Expressions (3rd Edition), (which covers this specific problem in great detail - highly recommended).

First let's look at the sub-expression to match a single quoted string which may contain escaped single quotes. If you are going to allow escaped single quotes, you had better at least allow escaped-escapes as well (which is what Douglas Leeder's answer does). But as long as you're at it, its just as easy to allow escaped-anything-else. With these requirements. MizardX is the only one who got the expression right. Here it is in both short and long format (and I've taken the liberty to write this in VERBOSE mode, with lots of descriptive comments - which you should always do for non-trivial regexes):

# MizardX's correct regex to match single quoted string:
re_sq_short = r"'((?:\.|[^\'])*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # Capture group $1: Contents.
      (?:       # Group for contents alternatives
        \.     # Either escaped anything
      | [^\']  # or one non-quote, non-escape.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

This works and correctly matches all the following string test cases:

text01 = r"out1 'escaped-escape:        \ ' out2"
test02 = r"out1 'escaped-quote:         ' ' out2"
test03 = r"out1 'escaped-anything:      X ' out2"
test04 = r"out1 'two escaped escapes: \\ ' out2"
test05 = r"out1 'escaped-quote at end:   '' out2"
test06 = r"out1 'escaped-escape at end:  \' out2"

Ok, now lets begin to improve on this. First, the order of the alternatives makes a difference and one should always put the most likely alternative first. In this case, non escaped characters are more likely than escaped ones, so reversing the order will improve the regex's efficiency slightly like so:

# Better regex to match single quoted string:
re_sq_short = r"'((?:[^\']|\.)*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # $1: Contents.
      (?:       # Group for contents alternatives
        [^\']  # Either a non-quote, non-escape,
      | \.     # or an escaped anything.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

"Unrolling-the-Loop":

This is a little better, but can be further improved (significantly) by applying Jeffrey Friedl's "unrolling-the-loop" efficiency technique (from MRE3). The above regex is not optimal because it must painstakingly apply the star quantifier to the non-capture group of two alternatives, each of which consume only one or two characters at a time. This alternation can be eliminated entirely by recognizing that a similar pattern is repeated over and over, and that an equivalent expression can be crafted to do the same thing without alternation. Here is an optimized expression to match a single quoted string and capture its contents into group $1:

# Better regex to match single quoted string:
re_sq_short = r"'([^'\]*(?:\.[^'\]*)*)'"
re_sq_long = r"""
    '            # Literal opening quote
    (            # $1: Contents.
      [^'\]*    # {normal*} Zero or more non-', non-escapes.
      (?:        # Group for {(special normal*)*} construct.
        \.      # {special} Escaped anything.
        [^'\]*  # More {normal*}.
      )*         # Finish up {(special normal*)*} construct.
    )            # End $1: Contents.
    '
    """

This expression gobbles up all non-quote, non-backslashes (the vast majority of most strings), in one "gulp", which drastically reduces the amount of work that the regex engine must perform. How much better you ask? Well, I entered each of the regexes presented from this question into RegexBuddy and measured how many steps it took the regex engine to complete a match on the following string (which all solutions correctly match):

'This is an example string which contains one 'internally quoted' string.'

Here are the benchmark results on the above test string:

r"""
AUTHOR            SINGLE-QUOTE REGEX   STEPS TO: MATCH  NON-MATCH
Evan Fosmark      '(.*?)(?<!\)'                  374     376
Douglas Leeder    '(([^\']|\'|\\)*)'          154     444
cletus/PEZ        '((?:\'|[^'])*)(?<!\)'        223     527
MizardX           '((?:\.|[^\'])*)'             221     369
MizardX(improved) '((?:[^\']|\.)*)'             153     369
Jeffrey Friedl    '([^\']*(?:\.[^\']*)*)'       13      19
"""

These steps are the number of steps required to match the test string using the RegexBuddy debugger function. The "NON-MATCH" column is the number of steps required to declare match failure when the closing quote is removed from the test string. As you can see, the difference is significant for both the matching and non-matching cases. Note also that these efficiency improvements are only applicable to a NFA engine which uses backtracking (i.e. Perl, PHP, Java, Python, Javascript, .NET, Ruby and most others.) A DFA engine will not see any performance boost by this technique (See: Regular Expression Matching Can Be Simple And Fast).

On to the complete solution:

The goal of the original question (my interpretation), is to pick out single quoted sub-strings (which may contain escaped quotes) from a larger string. If it is known that the text outside the quoted sub-strings will never contain escaped-single-quotes, the regex above will do the job. However, to correctly match single-quoted sub-strings within a sea of text swimming with escaped-quotes and escaped-escapes and escaped-anything-elses, (which is my interpretation of what the author is after), requires parsing from the beginning of the string No, (this is what I originally thought), but it doesn't - this can be achieved using MizardX's very clever (?<!\)(?:\\)* expression. Here are some test strings to exercise the various solutions:

text01 = r"out1 'escaped-escape:        \ ' out2"
test02 = r"out1 'escaped-quote:         ' ' out2"
test03 = r"out1 'escaped-anything:      X ' out2"
test04 = r"out1 'two escaped escapes: \\ ' out2"
test05 = r"out1 'escaped-quote at end:   '' out2"
test06 = r"out1 'escaped-escape at end:  \' out2"
test07 = r"out1           'str1' out2 'str2' out2"
test08 = r"out1 '        'str1' out2 'str2' out2"
test09 = r"out1 \'      'str1' out2 'str2' out2"
test10 = r"out1 \        'str1' out2 'str2' out2"
test11 = r"out1 \\      'str1' out2 'str2' out2"
test12 = r"out1         \'str1' out2 'str2' out2"
test13 = r"out1       \\'str1' out2 'str2' out2"
test14 = r"out1           'str1''str2''str3' out2"

Given this test data let's see how the various solutions fare ('p'==pass, 'XX'==fail):

r"""
AUTHOR/REGEX     01  02  03  04  05  06  07  08  09  10  11  12  13  14
Douglas Leeder    p   p  XX   p   p   p   p   p   p   p   p  XX  XX  XX
  r"(?:^|[^\])'(([^\']|\'|\\)*)'"
cletus/PEZ        p   p   p   p   p  XX   p   p   p   p   p  XX  XX  XX
  r"(?<!\)'((?:\'|[^'])*)(?<!\)'"
MizardX           p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\)(?:\\)*'((?:\.|[^\'])*)'"
ridgerunner       p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\)(?:\\)*'([^'\]*(?:\.[^'\]*)*)'"
"""

A working test script:

import re
data_list = [
    r"out1 'escaped-escape:        \ ' out2",
    r"out1 'escaped-quote:         ' ' out2",
    r"out1 'escaped-anything:      X ' out2",
    r"out1 'two escaped escapes: \\ ' out2",
    r"out1 'escaped-quote at end:   '' out2",
    r"out1 'escaped-escape at end:  \' out2",
    r"out1           'str1' out2 'str2' out2",
    r"out1 '        'str1' out2 'str2' out2",
    r"out1 \'      'str1' out2 'str2' out2",
    r"out1 \        'str1' out2 'str2' out2",
    r"out1 \\      'str1' out2 'str2' out2",
    r"out1         \'str1' out2 'str2' out2",
    r"out1       \\'str1' out2 'str2' out2",
    r"out1           'str1''str2''str3' out2",
    ]

regex = re.compile(
    r"""(?<!\)(?:\\)*'([^'\]*(?:\.[^'\]*)*)'""",
    re.DOTALL)

data_cnt = 0
for data in data_list:
    data_cnt += 1
    print ("
Data string %d" % (data_cnt))
    m_cnt = 0
    for match in regex.finditer(data):
        m_cnt += 1
        if (match.group(1)):
            print("  quoted sub-string%3d = "%s"" %
                (m_cnt, match.group(1)))

Phew!

p.s. Thanks to MizardX for the very cool (?<!\)(?:\\)* expression. Learn something new every day!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...