Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - My negative lookahead is not working - why?

I have a text interspersed with various strings, dates, tab characters and language codes. I want to extract the strings that follow a date+tab combination and are followed by a language code like '[en]' plus a tab character, but only when what comes next is not the string "BAD THINGS" (e.g. "2020-01-12\tSTRING WE NEED[en]\tGOOD THINGS", as opposed to "2020-01-12\tSTRING WE DON'T NEED[en]\tBAD THINGS", with tab characters written as \t).

Here is a short example of the text I'm working with (tab characters are written as \t):

2021-01-12\tThis string is not needed [it]\tBad things\tBad things 2021-01-12\tThis string is also not needed [en]\tBad things\tBad things 2021-01-11\tString 1 that is needed! [it]\tString 1 that is needed! is repeated here\tNot interesting here 2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string 2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in

I made this regex to capture all strings between a date and a language code:

(\d{4}-\d{2}-\d{2}\t)(.*?)(\[\w{2}\]\t)

This works fine (see here). However, when I add a negative lookahead to exclude the strings followed by "Bad things", the whole regex goes south:

(\d{4}-\d{2}-\d{2}\t)(.*?)(\[\w{2}\]\t)(?!Bad things)

You can see the result here. I understand that my lookahead somehow makes the regex behave greedily, but I have no idea how to avoid this; adding a ? after it doesn't work. Can you help me out here?



1 Reply


Not sure if this will cover all the cases but this seems to work:

(\d{4}-\d{2}-\d{2}\t)([^][]*)(\[\w{2}\]\t)(?!Bad things)

Demo here.
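As for why the original pattern misbehaves: the lookahead does not actually make `.*?` greedy. When the lookahead rejects a language code, the engine backtracks and lets `.*?` consume more text until it reaches a language code that is not followed by "Bad things". A minimal Python sketch of that behavior, assuming the records sit on one line with tabs where the pattern requires them (the `text` value here is a shortened, assumed reconstruction, not the asker's real data):

```python
import re

# Assumed two-record sample on a single line; tab positions follow the pattern.
text = (
    "2021-01-12\tThis string is not needed [it]\tBad things "
    "2021-01-11\tString 1 that is needed! [it]\tString 1 that is needed! is repeated here"
)

# The question's pattern: lazy group followed by the negative lookahead.
lazy = re.compile(r"(\d{4}-\d{2}-\d{2}\t)(.*?)(\[\w{2}\]\t)(?!Bad things)")

m = lazy.search(text)
# The match still starts at the first date. After the lookahead rejects
# "[it]\tBad things", the lazy group keeps growing, swallows the second
# date, and stops only at the second "[it]\t".
captured = m.group(2)
print(repr(captured))
```

So the unwanted record is not skipped; it is absorbed into the second group of a longer match.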

Explanation:

(\d{4}-\d{2}-\d{2}\t)    date and tab
([^][]*)                 any run of characters other than `[` and `]`
(\[\w{2}\]\t)            the language tag, e.g. [en], followed by a tab
(?!Bad things)           negative lookahead: not followed by "Bad things"
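The breakdown above can be tried in Python roughly as follows. This is only a sketch: the `text` value is an assumed reconstruction of the sample (one line, tabs after each date and language code and between the trailing columns), and the character class is written as `[^\]\[]`, which should be equivalent to `[^][]`. Because the second group can no longer cross a `[`, a rejected record simply fails to match instead of being absorbed into a longer match.

```python
import re

# Assumed reconstruction of the sample text from the question.
text = (
    "2021-01-12\tThis string is not needed [it]\tBad things\tBad things "
    "2021-01-12\tThis string is also not needed [en]\tBad things\tBad things "
    "2021-01-11\tString 1 that is needed! [it]\tString 1 that is needed! is repeated here\tNot interesting here "
    "2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string "
    "2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in"
)

pattern = re.compile(
    r"""
    (\d{4}-\d{2}-\d{2}\t)   # date and tab
    ([^\]\[]*)              # text that contains neither bracket character
    (\[\w{2}\]\t)           # two-character language tag, then a tab
    (?!Bad\ things)         # must not be followed by "Bad things"
    """,
    re.VERBOSE,
)

# Group 2 keeps the trailing space before the tag, so strip it.
needed = [m.group(2).strip() for m in pattern.finditer(text)]
print(needed)
```

One limitation of this approach: a wanted string that itself contains `[` or `]` would not be matched, which fits the answer's caveat that it may not cover all cases.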
