It looks like the TRE regex engine (used by default in base R regex functions), which is based on the regex library originally written by Henry Spencer in 1986, finds the shortest match that ends at the end of the string when the first quantified pattern in the regular expression is lazy and the expression ends with the $ anchor.
Compare these cases:
sub(" +?on.*$", "", Data) # "Posted by ondrej" "Posted by ona'je"
sub(" +?on.*", "", Data) # "Posted bydrej on 29 Feb 2020." "Posted bya'je on 29feb 2020"
sub(" +?on(.*)", "", Data) # as expected
sub(" +on.*", "", Data) # as expected
What is going on?
In the first case, sub(" +?on.*$", "", Data), the first quantified pattern sets the greediness of all the quantifiers on the same level of the regex. So the second quantifier, *, is treated as lazy even without a ? after it, because the space before it was quantified with +?, a lazy quantifier. The $ anchor still forces the match to reach the end of the string, so the engine ends up with the shortest match that does so, the one starting at the last " on". This is a known TRE "bug", also present in some other regex engines based on Henry Spencer's regex library.
The second case, sub(" +?on.*", "", Data), matches the same way as if it were written " +?on.*?" (again, because the first pattern sets the greediness to lazy on that level). That only matches 1 or more spaces followed by on, since .*? matches nothing when it sits at the end of the pattern.
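To make the "as if it were written lazily" reading concrete, here is a small sketch that spells the lazy quantifier out and runs it with PCRE (perl = TRUE), where each quantifier keeps its own greediness. The Data vector is an assumption: it is reconstructed from the outputs quoted in the comparison above, not taken from the original question.

```r
# Hypothetical input, inferred from the outputs quoted in the comparison above.
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29feb 2020")

# Explicitly lazy trailing quantifier, run with PCRE (perl = TRUE).
# .*? has nothing after it, so it matches the empty string and only
# the first " on" is removed.
lazy_pcre <- sub(" +?on.*?", "", Data, perl = TRUE)
print(lazy_pcre)
# [1] "Posted bydrej on 29 Feb 2020." "Posted bya'je on 29feb 2020"

# The same explicitly lazy pattern with the default (TRE) engine, for comparison.
lazy_tre <- sub(" +?on.*?", "", Data)
print(lazy_tre)
```

The PCRE result is the same as the output reported above for " +?on.*" under the default engine, which fits the explanation that TRE treats the trailing .* as lazy there.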
The third case, sub(" +?on(.*)", "", Data), yields the expected results because the second quantified pattern, .*, is on a different level (one level deeper, inside the group), so its greediness is not affected by the +? on the outer level. So, (.*) matches greedily here.
The fourth case, sub(" +on.*", "", Data), yields the expected results because the first quantified pattern is greedy, so the next quantified pattern on the same level is greedy as well.
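If the lazy first quantifier has to stay in the pattern, one possible way around the inherited greediness is to switch engines with perl = TRUE, which makes sub() use PCRE instead of TRE. This is only a sketch: the Data vector is a hypothetical input reconstructed from the outputs quoted earlier, and "Posted by" is assumed to be the intended result.

```r
# Hypothetical input, inferred from the outputs quoted in the comparison above.
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29feb 2020")

# With PCRE, +? is lazy but .* stays greedy, so the match starts at the
# first " on" and runs to the end of the string.
fixed <- sub(" +?on.*$", "", Data, perl = TRUE)
print(fixed)
# [1] "Posted by" "Posted by"
```

Wrapping the second quantifier in a group, as in the third case, keeps the default engine; perl = TRUE instead keeps the pattern unchanged.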