re_single_quote = r"'[^'\]*(?:\.[^'\]*)*'"
First note that MizardX's answer is 100% accurate. I'd just like to add some additional recommendations regarding efficiency. Secondly, I'd like to note that this problem was solved and optimized long ago - See: Mastering Regular Expressions (3rd Edition), (which covers this specific problem in great detail - highly recommended).
First let's look at the sub-expression to match a single quoted string which may contain escaped single quotes. If you are going to allow escaped single quotes, you had better at least allow escaped-escapes as well (which is what Douglas Leeder's answer does). But as long as you're at it, its just as easy to allow escaped-anything-else. With these requirements. MizardX is the only one who got the expression right. Here it is in both short and long format (and I've taken the liberty to write this in VERBOSE
mode, with lots of descriptive comments - which you should always do for non-trivial regexes):
# MizardX's correct regex to match single quoted string:
re_sq_short = r"'((?:\.|[^\'])*)'"
re_sq_long = r"""
' # Literal opening quote
( # Capture group $1: Contents.
(?: # Group for contents alternatives
\. # Either escaped anything
| [^\'] # or one non-quote, non-escape.
)* # Zero or more contents alternatives.
) # End $1: Contents.
'
"""
This works and correctly matches all the following string test cases:
text01 = r"out1 'escaped-escape: \ ' out2"
test02 = r"out1 'escaped-quote: ' ' out2"
test03 = r"out1 'escaped-anything: X ' out2"
test04 = r"out1 'two escaped escapes: \\ ' out2"
test05 = r"out1 'escaped-quote at end: '' out2"
test06 = r"out1 'escaped-escape at end: \' out2"
Ok, now lets begin to improve on this. First, the order of the alternatives makes a difference and one should always put the most likely alternative first. In this case, non escaped characters are more likely than escaped ones, so reversing the order will improve the regex's efficiency slightly like so:
# Better regex to match single quoted string:
re_sq_short = r"'((?:[^\']|\.)*)'"
re_sq_long = r"""
' # Literal opening quote
( # $1: Contents.
(?: # Group for contents alternatives
[^\'] # Either a non-quote, non-escape,
| \. # or an escaped anything.
)* # Zero or more contents alternatives.
) # End $1: Contents.
'
"""
"Unrolling-the-Loop":
This is a little better, but can be further improved (significantly) by applying Jeffrey Friedl's "unrolling-the-loop" efficiency technique (from MRE3). The above regex is not optimal because it must painstakingly apply the star quantifier to the non-capture group of two alternatives, each of which consume only one or two characters at a time. This alternation can be eliminated entirely by recognizing that a similar pattern is repeated over and over, and that an equivalent expression can be crafted to do the same thing without alternation. Here is an optimized expression to match a single quoted string and capture its contents into group $1
:
# Better regex to match single quoted string:
re_sq_short = r"'([^'\]*(?:\.[^'\]*)*)'"
re_sq_long = r"""
' # Literal opening quote
( # $1: Contents.
[^'\]* # {normal*} Zero or more non-', non-escapes.
(?: # Group for {(special normal*)*} construct.
\. # {special} Escaped anything.
[^'\]* # More {normal*}.
)* # Finish up {(special normal*)*} construct.
) # End $1: Contents.
'
"""
This expression gobbles up all non-quote, non-backslashes (the vast majority of most strings), in one "gulp", which drastically reduces the amount of work that the regex engine must perform. How much better you ask? Well, I entered each of the regexes presented from this question into RegexBuddy and measured how many steps it took the regex engine to complete a match on the following string (which all solutions correctly match):
'This is an example string which contains one 'internally quoted' string.'
Here are the benchmark results on the above test string:
r"""
AUTHOR SINGLE-QUOTE REGEX STEPS TO: MATCH NON-MATCH
Evan Fosmark '(.*?)(?<!\)' 374 376
Douglas Leeder '(([^\']|\'|\\)*)' 154 444
cletus/PEZ '((?:\'|[^'])*)(?<!\)' 223 527
MizardX '((?:\.|[^\'])*)' 221 369
MizardX(improved) '((?:[^\']|\.)*)' 153 369
Jeffrey Friedl '([^\']*(?:\.[^\']*)*)' 13 19
"""
These steps are the number of steps required to match the test string using the RegexBuddy debugger function. The "NON-MATCH" column is the number of steps required to declare match failure when the closing quote is removed from the test string. As you can see, the difference is significant for both the matching and non-matching cases. Note also that these efficiency improvements are only applicable to a NFA engine which uses backtracking (i.e. Perl, PHP, Java, Python, Javascript, .NET, Ruby and most others.) A DFA engine will not see any performance boost by this technique (See: Regular Expression Matching Can Be Simple And Fast).
On to the complete solution:
The goal of the original question (my interpretation), is to pick out single quoted sub-strings (which may contain escaped quotes) from a larger string. If it is known that the text outside the quoted sub-strings will never contain escaped-single-quotes, the regex above will do the job. However, to correctly match single-quoted sub-strings within a sea of text swimming with escaped-quotes and escaped-escapes and escaped-anything-elses, (which is my interpretation of what the author is after), requires parsing from the beginning of the string No, (this is what I originally thought), but it doesn't - this can be achieved using MizardX's very clever (?<!\)(?:\\)*
expression. Here are some test strings to exercise the various solutions:
text01 = r"out1 'escaped-escape: \ ' out2"
test02 = r"out1 'escaped-quote: ' ' out2"
test03 = r"out1 'escaped-anything: X ' out2"
test04 = r"out1 'two escaped escapes: \\ ' out2"
test05 = r"out1 'escaped-quote at end: '' out2"
test06 = r"out1 'escaped-escape at end: \' out2"
test07 = r"out1 'str1' out2 'str2' out2"
test08 = r"out1 ' 'str1' out2 'str2' out2"
test09 = r"out1 \' 'str1' out2 'str2' out2"
test10 = r"out1 \ 'str1' out2 'str2' out2"
test11 = r"out1 \\ 'str1' out2 'str2' out2"
test12 = r"out1 \'str1' out2 'str2' out2"
test13 = r"out1 \\'str1' out2 'str2' out2"
test14 = r"out1 'str1''str2''str3' out2"
Given this test data let's see how the various solutions fare ('p'==pass, 'XX'==fail):
r"""
AUTHOR/REGEX 01 02 03 04 05 06 07 08 09 10 11 12 13 14
Douglas Leeder p p XX p p p p p p p p XX XX XX
r"(?:^|[^\])'(([^\']|\'|\\)*)'"
cletus/PEZ p p p p p XX p p p p p XX XX XX
r"(?<!\)'((?:\'|[^'])*)(?<!\)'"
MizardX p p p p p p p p p p p p p p
r"(?<!\)(?:\\)*'((?:\.|[^\'])*)'"
ridgerunner p p p p p p p p p p p p p p
r"(?<!\)(?:\\)*'([^'\]*(?:\.[^'\]*)*)'"
"""
A working test script:
import re
data_list = [
r"out1 'escaped-escape: \ ' out2",
r"out1 'escaped-quote: ' ' out2",
r"out1 'escaped-anything: X ' out2",
r"out1 'two escaped escapes: \\ ' out2",
r"out1 'escaped-quote at end: '' out2",
r"out1 'escaped-escape at end: \' out2",
r"out1 'str1' out2 'str2' out2",
r"out1 ' 'str1' out2 'str2' out2",
r"out1 \' 'str1' out2 'str2' out2",
r"out1 \ 'str1' out2 'str2' out2",
r"out1 \\ 'str1' out2 'str2' out2",
r"out1 \'str1' out2 'str2' out2",
r"out1 \\'str1' out2 'str2' out2",
r"out1 'str1''str2''str3' out2",
]
regex = re.compile(
r"""(?<!\)(?:\\)*'([^'\]*(?:\.[^'\]*)*)'""",
re.DOTALL)
data_cnt = 0
for data in data_list:
data_cnt += 1
print ("
Data string %d" % (data_cnt))
m_cnt = 0
for match in regex.finditer(data):
m_cnt += 1
if (match.group(1)):
print(" quoted sub-string%3d = "%s"" %
(m_cnt, match.group(1)))
Phew!
p.s. Thanks to MizardX for the very cool (?<!\)(?:\\)*
expression. Learn something new every day!