This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:
df['id'] = df.groupby(['LastName','FirstName']).ngroup()
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
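For anyone who wants to reproduce this, the example frame above can be rebuilt with something like the following (the column order in the original data is my assumption):

import pandas as pd

# Rebuild the example frame shown above
df = pd.DataFrame({
    'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
    'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
})

# ngroup() numbers each group consecutively, in sorted key order by default,
# which is why (Jones, Tom) gets 0, (Smith, David) gets 1, and so on
df['id'] = df.groupby(['LastName', 'FirstName']).ngroup()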
I checked timings and, for the small dataset in this example, Alexander's answer is faster:
%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
1000 loops, best of 3: 848 μs per loop
%timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
1000 loops, best of 3: 1.22 ms per loop
However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to build a dataframe of 5000 names and then concatenated its first 2000 rows back onto it, producing a dataframe of 7000 rows, 2000 of which were duplicates.
import faker
import pandas as pd

fakenames = faker.Faker()

# Build 5000 random first/last name pairs
first = [fakenames.first_name() for _ in range(5000)]
last = [fakenames.last_name() for _ in range(5000)]
df2 = pd.DataFrame({'FirstName': first, 'LastName': last})

# Append the first 2000 rows again to introduce duplicates
df2 = pd.concat([df2, df2.iloc[:2000]])
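As a quick sanity check on the benchmark data (my addition, not part of the original timing), you can confirm the row count and the duplicates:

print(len(df2))                # 7000 rows
print(df2.duplicated().sum())  # at least 2000: the appended rows, plus any coincidental repeats from faker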
Running the timing on this larger data set gives:
%timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
100 loops, best of 3: 5.22 ms per loop
%timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
100 loops, best of 3: 3.1 ms per loop
You may want to test both approaches on your data set to determine which one works best given the size of your data.
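If you are not working in IPython or Jupyter (where %timeit is available), a plain-Python version of the same comparison might look like this sketch, using the standard-library timeit module:

import timeit

stmts = {
    'cat.codes': "df2.assign(id=(df2['LastName'] + '_' + df2['FirstName'])"
                 ".astype('category').cat.codes)",
    'ngroup': "df2.assign(id=df2.groupby(['LastName', 'FirstName']).ngroup())",
}

# Run each statement 100 times per repeat and report the best of 3 repeats,
# which roughly mirrors %timeit's "best of 3" output
for label, stmt in stmts.items():
    best = min(timeit.repeat(stmt, repeat=3, number=100, globals=globals())) / 100
    print(f'{label}: {best * 1000:.2f} ms per loop')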