Many regex engines match .* twice in a single-line string, e.g., when performing regex-based string replacement:
- The 1st match is - by definition - the entire (single-line) string, as expected.
- In many engines there is a 2nd match, namely the empty string; that is, even though the 1st match has consumed the entire input string, .* is matched again, which then matches the empty string at the end of the input string.
- Note: To ensure that only one match is found, use ^.* instead.
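To make the two matches visible, here is a minimal Python sketch (Python is one of the languages tested further down). It lists the span of every match that re.finditer reports; the helper name match_spans is made up for this illustration.

```python
import re

def match_spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# Unanchored: the whole string first, then an empty match at the very end.
print(match_spans(r'.*', 'a'))   # [(0, 1), (1, 1)]

# Anchored at the start: the empty match at index 1 is ruled out.
print(match_spans(r'^.*', 'a'))  # [(0, 1)]
```

The second span, (1, 1), is the empty string at the end of the input.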
My questions are:
- Is there a good reason for this behavior? Once the input string has been consumed in full, I wouldn't expect another attempt to find a match.
- Other than trial and error, can you glean from the documentation / regex dialect/standard supported which engines exhibit this behavior?
Update: revo's helpful answer explains the how of the current behavior; as for the potential why, see this related question.
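One way to picture the how is a simplified "find all matches" loop, sketched below in Python. This is only a model for illustration, not the actual implementation of any engine listed here; naive_global_matches is a hypothetical helper, and real engines treat empty matches in more nuanced ways, so their results for other patterns (lazy ones in particular) may differ from what this loop produces.

```python
import re

def naive_global_matches(pattern, text):
    """Simplified model of a 'match globally' loop (hypothetical helper)."""
    regex = re.compile(pattern)
    matches = []
    pos = 0
    # Note the '<=': the position just past the last character is still a
    # valid place to attempt a match, and that is where .* matches ''.
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        matches.append(m.group())
        # After an empty match, step forward by one to avoid looping forever.
        pos = m.end() + 1 if m.start() == m.end() else m.end()
    return matches

print(naive_global_matches(r'.*', 'a'))   # ['a', '']
print(naive_global_matches(r'^.*', 'a'))  # ['a']
```

In this model, the second attempt starts at index 1, where .* is free to match nothing, whereas ^ cannot match there because index 1 is not the start of the string.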
Languages/platforms that DO exhibit the behavior:
# .NET, via PowerShell (behavior also applies to the -replace operator)
PS> [regex]::Replace('a', '.*', '[$&]')
[a][] # !! Note the *2* matches, first the whole string, then the empty string
# Node.js
$ node -pe "'a'.replace(/.*/g, '[$&]')"
[a][]
# Ruby
$ ruby -e "puts 'a'.gsub(/.*/, '[\0]')"
[a][]
# Python 3.7+ only
$ python -c "import re; print(re.sub('.*', '[\g<0>]', 'a'))"
[a][]
# Perl 5
$ echo a | perl -ple 's/.*/[$&]/g'
[a][]
# Perl 6
$ echo 'a' | perl6 -pe 's:g/.*/[$/]/'
[a][]
# Others?
Languages/platforms that do NOT exhibit the behavior:
# Python 2.x and Python 3.x <= 3.6
$ python -c "import re; print(re.sub('.*', '[\g<0>]', 'a'))"
[a] # !! Only 1 match found.
# Others?
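The Python version split can also be checked from within a script. The sketch below assumes only what the listings above report (and what the re.sub documentation notes as a 3.7 change): from Python 3.7 on, an empty match adjacent to a previous non-empty match is replaced too. The variable names are illustrative.

```python
import re
import sys

# Same substitution as in the one-liners above, written with raw strings.
result = re.sub(r'.*', r'[\g<0>]', 'a')

# Python 3.7+ also replaces the trailing empty match; older versions do not.
expected = '[a][]' if sys.version_info >= (3, 7) else '[a]'

print(result, '(as expected)' if result == expected else '(unexpected)')
```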
bobble bubble brings up some good related points:
If you make it lazy like .*? you'd even get 3 matches in some and 2 matches in others. Same with .?? As soon as we use a start anchor, I thought we should get only one match, but interestingly it seems ^.*? gives two matches in PCRE for a, whereas ^.* should result in one match everywhere.
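To try these variants in a single engine, the following Python sketch counts the matches that re.finditer reports for each pattern against the input a. The helper count_matches is hypothetical, and the counts for the lazy patterns are deliberately not listed here, since they depend on the Python version in use.

```python
import re

def count_matches(pattern, text):
    """Count all matches, empty ones included, that re.finditer reports."""
    return sum(1 for _ in re.finditer(pattern, text))

# Greedy, lazy, and anchored variants from the comment above.
for pattern in (r'.*', r'.*?', r'.??', r'^.*?', r'^.*'):
    print('%-5s -> %d match(es)' % (pattern, count_matches(pattern, 'a')))
```

Comparing the printed counts with the output of the other engines shows where they diverge.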
Here's a PowerShell snippet for testing the behavior across languages, with multiple regexes:
Note: Assumes that Python 3.x is available as python3 and Perl 6 as perl6.
You can paste the whole snippet directly on the command line and recall it from the history to modify the inputs.
& {
param($inputStr, $regexes)
# Define the commands as script blocks.
# IMPORTANT: Make sure that $inputStr and $regex are referenced *inside "..."*
# Always use "..." as the outer quoting, to work around PS quirks.
$cmds = { [regex]::Replace("$inputStr", "$regex", '[$&]') },
{ node -pe "'$inputStr'.replace(/$regex/g, '[$&]')" },
{ ruby -e "puts '$inputStr'.gsub(/$regex/, '[\0]')" },
{ python -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
{ python3 -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
{ "$inputStr" | perl -ple "s/$regex/[$&]/g" },
{ "$inputStr" | perl6 -pe "s:g/$regex/[$/]/" }
$regexes | foreach {
$regex = $_
Write-Verbose -vb "----------- '$regex'"
$cmds | foreach {
$cmd = $_.ToString().Trim()
Write-Verbose -vb ('{0,-10}: {1}' -f (($cmd -split '\|')[-1].Trim() -split '[ :]')[0],
$cmd -replace '\$inputStr', $inputStr -replace '\$regex', $regex)
& $_ $regex
}
}
} -inputStr 'a' -regexes '.*', '^.*', '.*$', '^.*$', '.*?'
Sample output for regex ^.* which confirms bobble bubble's expectation that using the start anchor (^) yields only one match in all languages:
VERBOSE: ----------- '^.*'
VERBOSE: [regex] : [regex]::Replace("a", "^.*", '[$&]')
[a]
VERBOSE: node : node -pe "'a'.replace(/^.*/g, '[$&]')"
[a]
VERBOSE: ruby : ruby -e "puts 'a'.gsub(/^.*/, '[\0]')"
[a]
VERBOSE: python : python -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
[a]
VERBOSE: python3 : python3 -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
[a]
VERBOSE: perl : "a" | perl -ple "s/^.*/[$&]/g"
[a]
VERBOSE: perl6 : "a" | perl6 -pe "s:g/^.*/[$/]/"
[a]