Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - regex to get all text outside of brackets

I'm trying to grab any text outside of brackets with a regex.

Example string

Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]

I'm able to get the text inside the square brackets successfully with:

addrs = re.findall(r"\[(.*?)\]", example_str)
print addrs
[u'3996 COLLEGE AVENUE, SOMETOWN, MD 21003',u'2560 OAK ST, GLENMEADE, WI 14098']    

but I'm having trouble getting anything outside of the square brackets. I've tried something like the following:

names = re.findall(r"(.*?)\[.*\]+", example_str)

but that only finds the first name:

print names
[u'Josie Smith ']

So far I've only seen a string containing one to two name [address] combos, but I'm assuming there could be any number of them in a string.


1 Reply


If there are no nested brackets, you can just do this:

re.findall(r'(.*?)\[.*?\]', example_str)
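
As a quick check against the sample string from the question, here is a minimal sketch. It assumes Python 3 and that the literal brackets are escaped as \[ and \]; the names_clean variable is only an illustrative extra, not part of the original answer.

```python
import re

example_str = ("Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
               "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]")

# Lazily capture the text that precedes each bracketed chunk.
names = re.findall(r"(.*?)\[.*?\]", example_str)
print(names)  # ['Josie Smith ', 'Mugsy Dog Smith ']

# Optional cleanup (an assumption about what you want): strip the
# space left over before each '['.
names_clean = [n.strip() for n in names]
print(names_clean)  # ['Josie Smith', 'Mugsy Dog Smith']
```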

However, you don't even really need a regex here. Just split on brackets:

(s.split(']')[-1] for s in example_str.split('['))
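
Here is roughly what the split approach yields on the question's sample string (a sketch assuming Python 3). Note the empty string at the end, which appears because the input ends with a closing bracket; filtering it out is an optional step I've added for illustration.

```python
example_str = ("Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
               "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]")

# Split on '[' first; within each piece, whatever follows the last ']'
# is text that sits outside the brackets.
parts = [s.split("]")[-1] for s in example_str.split("[")]
print(parts)  # ['Josie Smith ', 'Mugsy Dog Smith ', '']

# Optional: drop empty pieces and surrounding whitespace.
outside = [p.strip() for p in parts if p.strip()]
print(outside)  # ['Josie Smith', 'Mugsy Dog Smith']
```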

The only reason your attempt didn't work:

re.findall(r"(.*?)\[.*\]+", example_str)

… is that you were doing a greedy match within the brackets (.* rather than .*?), which means it was consuming everything from the first open bracket to the last close bracket, instead of stopping at the end of the first pair of brackets.
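
To see the difference concretely, here is a small comparison of .* and .*? inside the brackets. It is a sketch assuming Python 3 and escaped brackets; the variable names are mine.

```python
import re

example_str = ("Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
               "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]")

# Greedy '.*' inside the brackets: the first match runs from the first '['
# all the way to the last ']', so nothing is left for a second match.
greedy = re.findall(r"(.*?)\[.*\]+", example_str)
print(greedy)  # ['Josie Smith ']

# The single greedy match really does span the whole string.
whole = re.search(r"(.*?)\[.*\]+", example_str).group(0)
print(whole == example_str)  # True

# Lazy '.*?' stops at the first ']', so every name is found.
lazy = re.findall(r"(.*?)\[.*?\]", example_str)
print(lazy)  # ['Josie Smith ', 'Mugsy Dog Smith ']
```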


Also, the + on the end seems wrong. If you had 'abc [def][ghi] jkl[mno]', would you want to get back ['abc ', '', ' jkl'], or ['abc ', ' jkl']? If the former, don't add the +. If the latter, do, but then you need to put the whole bracketed pattern in a non-capturing group so the + repeats the entire bracket pair rather than just the closing bracket: r'(.*?)(?:\[.*?\])+'.
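
The two behaviours can be compared side by side on that same test string (a sketch, assuming Python 3):

```python
import re

s = "abc [def][ghi] jkl[mno]"

# Without '+': each bracket pair ends its own match, so the empty text
# between '[def]' and '[ghi]' shows up as ''.
without_plus = re.findall(r"(.*?)\[.*?\]", s)
print(without_plus)  # ['abc ', '', ' jkl']

# With '+' on a non-capturing group: adjacent bracket pairs are consumed
# together as one separator.
with_plus = re.findall(r"(.*?)(?:\[.*?\])+", s)
print(with_plus)  # ['abc ', ' jkl']
```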


If there might be additional text after the last bracket, the split method will work fine, or you could use re.split instead of re.findall… but if you want to adjust your original regex to work with that, you can.
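
For the re.split route, a minimal sketch might look like the following. It assumes Python 3, and the trailing " and friends" text is made up here purely to illustrate text after the last bracket.

```python
import re

# Hypothetical input: the question's sample plus some trailing text.
sample = ("Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
          "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098] and friends")

# Treat every bracketed chunk as a separator and keep what lies between.
pieces = re.split(r"\[.*?\]", sample)
print(pieces)  # ['Josie Smith ', 'Mugsy Dog Smith ', ' and friends']
```

If the string ends with a closing bracket instead, the last element is an empty string, which you may want to filter out.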

In English, what you want is any (non-greedy) substring before a bracket-enclosed substring or the end of the string, right?

So, you need an alternation between \[.*?\] and $. Of course you need to group that in order to write the alternation, and you don't want to capture the group. So:

re.findall(r"(.*?)(?:\[.*?\]|$)", example_str)
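
One thing to be aware of with this version: because $ can match with nothing in front of it, findall also returns an empty string for the very end of the input. A small sketch (Python 3 assumed; the short bracket contents and trailing text are placeholders of mine, not from the question):

```python
import re

# Hypothetical input with text after the last bracket.
s = "Josie Smith [addr 1]Mugsy Dog Smith [addr 2] and friends"

raw = re.findall(r"(.*?)(?:\[.*?\]|$)", s)
print(raw)  # ['Josie Smith ', 'Mugsy Dog Smith ', ' and friends', '']

# Optional cleanup: drop empty results and surrounding whitespace.
outside = [t.strip() for t in raw if t.strip()]
print(outside)  # ['Josie Smith', 'Mugsy Dog Smith', 'and friends']
```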
