Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to delete the words between two delimiters?

I have noisy data, something like:

<@ """@$ FSDF >something something <more noise>

Now I just want to extract "something something". Is there a way to delete the text between the two delimiters "<" and ">" (delimiters included)?

question from:https://stackoverflow.com/questions/8784396/how-to-delete-the-words-between-two-delimiters

1 Reply

Use regular expressions:

>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '
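Note the trailing space left in the result. As a minimal sketch, the same substitution can be wrapped in a small helper that also trims the leftover whitespace. The name strip_delimited and its parameters are hypothetical (not from the question), and the helper assumes single-character delimiters:

```python
import re

def strip_delimited(text, open_delim="<", close_delim=">"):
    """Remove every open_delim...close_delim span, then trim whitespace.

    Hypothetical helper; assumes each delimiter is a single character.
    """
    # Builds the equivalent of '<[^>]+>' for the given delimiters.
    pattern = (
        re.escape(open_delim)
        + "[^" + re.escape(close_delim) + "]+"
        + re.escape(close_delim)
    )
    return re.sub(pattern, "", text).strip()

s = '<@ """@$ FSDF >something something <more noise>'
result = strip_delimited(s)
print(repr(result))
```

Calling .strip() is a design choice for this particular input; drop it if surrounding whitespace matters.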

[Update]

If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you know it does not work.

>>> re.sub(r'<.+>', '', s)
''

Why? Because regular expression quantifiers are "greedy" by default: .+ matches as many characters as it can, so it runs past the first > and stops only at the last > in the string. Here that swallows the entire input, which is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means "any character but x" (x being >).
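To see the difference directly, here is a short sketch that prints what each pattern actually matches on the sample string from the question (the variable names greedy and negated are only for illustration):

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'

# Greedy: .+ extends to the LAST '>' in the string.
greedy = re.search(r'<.+>', s).group(0)

# Negated class: each match stops at the FIRST '>' after its '<'.
negated = re.findall(r'<[^>]+>', s)

print(greedy == s)   # the greedy match covers the whole input
print(negated)       # two separate delimited spans
```

The greedy pattern finds one match spanning the whole string, while the negated class finds the two noisy spans separately.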

Placing ? after a quantifier (as in .+?) makes it "non-greedy", so it matches as few characters as possible. This has the same effect:

>>> re.sub(r'<.+?>', '', s)
'something something '

The [^>] version is more explicit; this one is less typing. Be aware that ? has a second meaning: after an ordinary item rather than a quantifier, x? means zero or one occurrence of x.
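One caveat, sketched below with a made-up input (the multiline string is an assumption, not from the question): the two patterns only agree while the delimited text stays on one line. By default the dot does not match a newline, whereas [^>] does, so the non-greedy form needs re.DOTALL to remove a span that crosses lines.

```python
import re

# Hypothetical input where the noisy span contains a newline.
multiline = 'keep <noise\nmore noise> this'

# '.' does not match '\n' by default, so no match is found.
lazy = re.sub(r'<.+?>', '', multiline)

# [^>] matches any character except '>', including '\n'.
negated = re.sub(r'<[^>]+>', '', multiline)

# With re.DOTALL the dot also matches '\n'.
lazy_dotall = re.sub(r'<.+?>', '', multiline, flags=re.DOTALL)

print(repr(lazy))
print(repr(negated))
print(repr(lazy_dotall))
```

If the data may contain line breaks inside the delimiters, the negated character class is probably the safer default.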

