Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
470 views
in Technique by (71.8m points)

python - map column values to 'miscellaneous' if value counts is below a threshold - Categorical Column - Pandas Dataframe

I have a pandas dataframe of shape ~ [200K, 40]. The dataframe has a categorical column (one of many) with over 1000 unique values. I can view the count of each unique value in that column by using:

df['column_name'].value_counts()

How do I now group together the values:

  • whose value count is below a threshold, say 100, and map them to, say, "miscellaneous"?
  • OR based on the cumulative row count %?


1 Reply

0 votes
by (71.8m points)

You can take the values you want to mask from the index of value_counts and then map them to "miscellaneous" using replace:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, (2000, 2)), columns=['A', 'B'])

frequencies = df['A'].value_counts()

condition = frequencies<200   # you can define it however you want
mask_obs = frequencies[condition].index
mask_dict = dict.fromkeys(mask_obs, 'miscellaneous')

df['A'] = df['A'].replace(mask_dict)  # or assign to a new column to leave the original data unmodified

Now value_counts reports every value whose count fell below your threshold under a single "miscellaneous" entry (the exact counts vary from run to run because the data is random):

df['A'].value_counts()

miscellaneous    947
3                226
1                221
0                204
7                201
2                201
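
The second part of the question, grouping by cumulative row count %, is not covered by the snippet above. Here is a minimal sketch of one way to do it. The function name lump_by_cumulative_share, its parameters keep_share and other_label, and the demo data are hypothetical names chosen for illustration. It assumes the column holds plain objects or strings; a column stored with the pandas category dtype may need astype(object) first.

```python
import pandas as pd


def lump_by_cumulative_share(series, keep_share=0.9, other_label="miscellaneous"):
    """Keep the most frequent values that together cover about keep_share of rows.

    Hypothetical helper for illustration; everything else becomes other_label.
    """
    # value_counts sorts by frequency (descending); normalize=True returns row shares
    shares = series.value_counts(normalize=True)
    # Share of rows already covered by the more frequent values ahead of each value
    covered_before = shares.cumsum() - shares
    # Keep a value only while the values ahead of it have not yet reached keep_share
    keep = shares.index[covered_before < keep_share]
    # where() keeps entries where the condition is True and replaces the others;
    # missing values are not in `keep`, so they are mapped to other_label as well
    return series.where(series.isin(keep), other_label)


# Small deterministic demo: 'a' x5, 'b' x3, 'c' x2, 'd' x1 (11 rows)
demo = pd.Series(list("aaaaabbbccd"))
result = lump_by_cumulative_share(demo, keep_share=0.7)
print(result.value_counts())
```

In the demo, 'a' and 'b' together cover about 73% of the rows, so they are kept, while 'c' and 'd' are mapped to "miscellaneous". The value that crosses the keep_share line is kept rather than dropped; that is a design choice you may want to flip depending on how strict the cutoff should be.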


...