Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Pandas: create category column based on multiple columns

What is the fastest way to create a category column whose value identifies each distinct combination of the other columns in a row?

input:

   col1  col2  col3  col4
0     0     0   -10     1
1     1   100     0    -1
2     0     0     0     1
3     0     0   -10     1
4     1   100     0    -1

output:

   col1  col2  col3  col4 new_col
0     0     0   -10     1       1
1     1   100     0    -1       2
2     0     0     0     1       3
3     0     0   -10     1       1
4     1   100     0    -1       2

Question from: https://stackoverflow.com/questions/65864251/pandas-create-category-column-based-on-multiple-columns


1 Reply

The fastest method is probably NumPy's np.unique with axis=0 (this requires all columns to be numeric):

import numpy as np
# Row-wise unique: the inverse indices serve as the category codes
_, new_col = np.unique(df.to_numpy(), axis=0, return_inverse=True)
df['new_col'] = new_col

or as a one-liner:

df['new_col'] = np.unique(df.to_numpy(), axis=0, return_inverse=True)[1]

   col1  col2  col3  col4  new_col
0     0     0   -10     1        0
1     1   100     0    -1        2
2     0     0     0     1        1
3     0     0   -10     1        0
4     1   100     0    -1        2
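
A self-contained sketch of the approach above, rebuilding the sample data from the question. The helper name `row_codes` is hypothetical and used only for illustration, and the final `reshape(-1)` is a defensive assumption in case the installed NumPy version returns the inverse with an extra dimension.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": [0, 1, 0, 0, 1],
    "col2": [0, 100, 0, 0, 100],
    "col3": [-10, 0, 0, -10, 0],
    "col4": [1, -1, 1, 1, -1],
})


def row_codes(frame):
    """Return one integer code per row; identical rows share a code.

    Hypothetical helper: assumes every column is numeric, so that
    to_numpy() yields a homogeneous numeric array.
    """
    _, inverse = np.unique(frame.to_numpy(), axis=0, return_inverse=True)
    # Flatten defensively so the result is always 1-D
    return inverse.reshape(-1)


df["new_col"] = row_codes(df)
print(df)
```

The codes follow the sorted order of the unique rows rather than the order of first appearance, which is why row 2 gets code 1 and row 1 gets code 2.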

For the sample data, this is about 10 times faster than grouping by all columns and using the group number (ngroup) as the category code:

df['new_col'] = df.groupby(df.columns.to_list()).ngroup()

The advantage of the groupby method is that it also works for DataFrames with mixed or non-numeric dtypes.
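
The question's expected output numbers the categories from 1 in order of first appearance, whereas the results shown in this answer are numbered from 0 in sorted order. Below is a minimal sketch, assuming first-appearance numbering is what is wanted: pass `sort=False` to groupby and add 1. The `mixed` frame and its `label` and `value` columns are made up here to illustrate the non-numeric case.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": [0, 1, 0, 0, 1],
    "col2": [0, 100, 0, 0, 100],
    "col3": [-10, 0, 0, -10, 0],
    "col4": [1, -1, 1, 1, -1],
})
cols = df.columns.to_list()

# Default (sort=True): codes follow the sorted order of the key tuples
sorted_codes = df.groupby(cols).ngroup()

# sort=False: codes follow the order of first appearance; +1 starts at 1
appearance_codes = df.groupby(cols, sort=False).ngroup() + 1

# Mixed dtypes also work (made-up data, for illustration only)
mixed = pd.DataFrame({"label": ["a", "b", "a", "c"], "value": [1, 2, 1, 1]})
mixed_codes = mixed.groupby(mixed.columns.to_list(), sort=False).ngroup()

print(sorted_codes.tolist())
print(appearance_codes.tolist())
print(mixed_codes.tolist())
```

Whether sorted or first-appearance numbering is preferable depends on how the codes are used afterwards; both identify the same groups.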

