Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
282 views
in Technique[技术] by (71.8m points)

python - finding all regex matches from a pandas dataframe column

i am trying to extract some data from a dataframe, however following query only extract the first match and ignores the rest of the matches, for example if the entire data is:

df['value']=
           0   123 blah blah blah, 456 blah blah blah, 129kfj blah blah
           1   237 blah blah blah, 438 blah blah blah, 365kfj blah blah 
           ...

and the regex is:

df['newCol']=df['value'].str.extract("[0-9]{3}")

i want the result to be a new column name "newCol" as:

newCol
------
123,456,129
237,438,365
...

but the actual result i get is only the first number:

newCol
------
123
237

what is wrong here? :(

thank you

UPDATE:

thanks to MaxU I found the solution, just couple of suggestions. I had Pandas 0.18.1 so extractall didn't work for me untill i updated pandas to 0.19, so remember to check your pandas version if you have issue with Extractall...second, apply(','.join) didn't work for me because I had some non string values (Null values) and it couldn't handle it so I used Lambda and it finally worked with a small modification of MaxU solution.

x['value'].str.extractall(r'(d{3})').unstack().apply(lambda x:','.join(x.dropna()), axis=1) 
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

you can use Series.str.extractall() method:

In [57]: x
Out[57]:
                                                    value
0  123 blah blah blah 456 blah blah blah 129kfj blah blah
1  237 blah blah blah 438 blah blah blah 365kfj blah blah

In [58]: x['newCol'] = x['value'].str.extractall(r'(d{3})').unstack().apply(','.join, 1)

In [59]: x
Out[59]:
                                                    value       newCol
0  123 blah blah blah 456 blah blah blah 129kfj blah blah  123,456,129
1  237 blah blah blah 438 blah blah blah 365kfj blah blah  237,438,365

UPDATE:

In [77]: x
Out[77]:
                                                      value
0  123 blah blah blah, 456 blah blah blah, 129kfj blah blah
1  237 blah blah blah, 438 blah blah blah, 365kfj blah blah

In [78]: x['value'].str.extractall(r'(d{3})').unstack().apply(','.join, 1)
Out[78]:
0    123,456,129
1    237,438,365
dtype: object

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...