The fastest method is probably NumPy's np.unique with return_inverse=True
(this only works if all columns are numeric):
_, new_col = np.unique(df.to_numpy(), axis=0, return_inverse=True)
df['new_col'] = new_col
or as a one-liner:
df['new_col'] = np.unique(df.to_numpy(), axis=0, return_inverse=True)[1]
Result:
   col1  col2  col3  col4  new_col
0     0     0   -10     1        0
1     1   100     0    -1        2
2     0     0     0     1        1
3     0     0   -10     1        0
4     1   100     0    -1        2
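For anyone who wants to reproduce this, here is a minimal self-contained sketch. The dataframe is reconstructed from the output shown above, so its exact construction is an assumption, and the np.ravel call is only a precaution in case the inverse array comes back with an extra dimension.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the printed result (assumed, not from the question).
df = pd.DataFrame({
    "col1": [0, 1, 0, 0, 1],
    "col2": [0, 100, 0, 0, 100],
    "col3": [-10, 0, 0, -10, 0],
    "col4": [1, -1, 1, 1, -1],
})

# With axis=0, np.unique treats each row as one item and sorts the unique rows.
# return_inverse gives, for every original row, the index of its matching unique row.
unique_rows, inverse = np.unique(df.to_numpy(), axis=0, return_inverse=True)

# Flatten defensively so the codes are always a 1-D array of length len(df).
df["new_col"] = np.ravel(inverse)

print(df)
```

Because the unique rows are sorted, the codes follow the sort order of the rows rather than their order of first appearance.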
For the sample data, this is about 10 times faster than grouping by all columns and using the group number (ngroup)
as the category code:
df['new_col'] = df.groupby(df.columns.to_list()).ngroup()
The advantage of the groupby method is that it also works for dataframes with mixed or non-numeric column types.
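As a rough illustration of the mixed-type case, here is a small sketch. The dataframe and the column names name, value and code are made up for this example and are not part of the original data.

```python
import pandas as pd

# Hypothetical dataframe with one string column and one integer column.
mixed = pd.DataFrame({
    "name": ["a", "b", "a", "b"],
    "value": [1, 2, 1, 3],
})

# Group by every column; ngroup() numbers the groups in sorted key order,
# so identical rows receive the same integer code.
mixed["code"] = mixed.groupby(mixed.columns.to_list()).ngroup()

print(mixed)
```

Rows 0 and 2 are identical here, so they end up with the same code, while the two different "b" rows get distinct codes.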