Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

string - Python Pandas removing substring using another column

I've tried searching around and can't figure out an easy way to do this, so I'm hoping your expertise can help.

I have a pandas data frame with two columns

import numpy as np
import pandas as pd

pd.options.display.width = 1000
testing = pd.DataFrame({'NAME':[
    'FIRST', np.nan, 'NAME2', 'NAME3', 
    'NAME4', 'NAME5', 'NAME6'], 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']})

which gives me

          FULL_NAME   NAME
0        FIRST LAST  FIRST
1               NaN    NaN
2        FIRST LAST  NAME2
3       FIRST NAME3  NAME3
4  FIRST NAME4 LAST  NAME4
5      ANOTHER NAME  NAME5
6         LAST NAME  NAME6

what I'd like to do is take the values from the 'NAME' column and remove then from the 'FULL NAME' column if it's there. So the function would then return

          FULL_NAME   NAME           NEW
0        FIRST LAST  FIRST          LAST
1               NaN    NaN           NaN
2        FIRST LAST  NAME2    FIRST LAST
3       FIRST NAME3  NAME3         FIRST
4  FIRST NAME4 LAST  NAME4    FIRST LAST
5      ANOTHER NAME  NAME5  ANOTHER NAME
6         LAST NAME  NAME6     LAST NAME

So far, I've defined a function below and am using the apply method. This runs rather slow on my large data set though and I'm hoping there's a more efficient way to do it. Thanks!

def address_remove(x):
    try:
        newADDR1 = re.sub(x['NAME'], '', x[-1])
        newADDR1 = newADDR1.rstrip()
        newADDR1 = newADDR1.lstrip()
        return newADDR1
    except:
        return x[-1]
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Here is one solution that is quite a bit faster than your current solution, I'm not convinced that there wouldn't be something faster though

In [13]: import numpy as np
         import pandas as pd
         n = 1000
         testing  = pd.DataFrame({'NAME':[
         'FIRST', np.nan, 'NAME2', 'NAME3', 
         'NAME4', 'NAME5', 'NAME6']*n, 'FULL_NAME':['FIRST LAST', np.nan, 'FIRST  LAST', 'FIRST NAME3', 'FIRST NAME4 LAST', 'ANOTHER NAME', 'LAST NAME']*n})

This is kind of a long one liner but it should do what you need

Fasted solution I can come up with is using replace as mentioned in another answer:

In [37]: %timeit testing ['NEW2'] = [e.replace(k, '') for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 4.67 ms per loop

Original answer:

In [14]: %timeit testing ['NEW'] = [''.join(str(e).split(k)) for e, k in zip(testing.FULL_NAME.astype('str'), testing.NAME.astype('str'))]
100 loops, best of 3: 7.24 ms per loop

compared to your current solution:

In [16]: %timeit testing['NEW1'] = testing.apply(address_remove, axis=1)
10 loops, best of 3: 166 ms per loop

These get you the same answer as your current solution


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...