Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Python re.sub: non-greedy mode (.*?) combined with end of string ($) behaves greedily!

Code:

import re

text = '<br><br />A<br />B'
print(re.sub(r'<br.*?>\w$', '', text))

It is expected to return <br><br />A, but it returns an empty string ''!

Any suggestions?


1 Reply


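Non-greediness works from left to right, but not the other way around. A lazy quantifier basically means "don't match any more unless the rest of the pattern failed to match". It does not make the engine look for the shortest match that ends at $: the match still begins at the leftmost position where one is possible. Here's what's going on: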

  1. The regex engine matches <br at the start of the string.
  2. .*? is skipped for now, since it is lazy and first tries to match nothing.
  3. The engine tries to match >, and succeeds.
  4. It tries to match \w and fails, because the next character is <. Now it gets interesting: the engine backtracks and sees the .*? rule. In this case, . can match the first >, so there's still hope for a match.
  5. This keeps happening until the regex reaches the slash of the second tag. Then >\w can match (>A), but $ fails. Again, the engine comes back to the lazy .*? rule and keeps extending it, until the whole pattern matches <br><br />A<br />B
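
The steps above can be checked with a small sketch. It assumes the pattern in the question is meant to be <br.*?>\w$ (with \w, a word character), and the variable names are only illustrative:

```python
import re

text = '<br><br />A<br />B'

# Lazy quantifier, but the match is still anchored at the leftmost '<br'.
lazy = re.search(r'<br.*?>\w$', text)

# .*? is forced to keep expanding until '>\w$' can finally succeed,
# which only happens at the very end of the string.
print(lazy.span())   # (0, 18)
print(lazy.group())  # <br><br />A<br />B
```

The span starts at index 0, which is why re.sub replaces the entire string with ''.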

Luckily, there's an easy solution: if you use the pattern <br[^>]*>\w$ instead, the match cannot extend outside of a single tag, so only the last occurrence is replaced.
Strictly speaking, this doesn't work well for HTML, because tag attributes can contain > characters, but I assume it's just an example.
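
As a quick check of the suggested pattern, here is a minimal sketch using the string from the question (again assuming \w was intended, with illustrative variable names):

```python
import re

text = '<br><br />A<br />B'
pattern = r'<br[^>]*>\w$'

# [^>]* cannot cross a '>', so attempts starting at the first two tags
# fail and the engine moves on to the last '<br />'.
match = re.search(pattern, text)
print(match.span())              # (11, 18)

print(re.sub(pattern, '', text))  # <br><br />A
```

Only the final <br />B is removed, which is the result the question expected.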

