Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels of a large categorical factor?
I want to achieve something similar to the question R: "Binning" categorical variables, but encoding into the top-k most frequent levels plus "Other".
The R package forcats has fct_lump() for this purpose.
library(forcats)
fct_lump(f, n)
Here f is the factor and n is the number of most common levels to preserve. The remaining levels are recoded to Other.
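For the Python side, here is a minimal pandas sketch of the same top-k-plus-"Other" recoding. It is only an illustration under assumptions: `lump_top_k` is a hypothetical helper name (not an existing library function), the input is assumed to be an object/string Series rather than a `category` dtype, and tie handling may differ from `fct_lump()`.

```python
import pandas as pd


def lump_top_k(values: pd.Series, k: int, other_label: str = "Other") -> pd.Series:
    """Keep the k most frequent levels and recode the rest to other_label.

    Hypothetical helper for illustration; assumes an object/string Series.
    """
    # value_counts() sorts by frequency (descending) and ignores missing values.
    top_levels = values.value_counts().index[:k]
    # Keep values that are in the top k, and leave missing values untouched.
    keep = values.isin(top_levels) | values.isna()
    return values.where(keep, other_label)


s = pd.Series(["a", "a", "a", "b", "b", "c", "d"])
lumped = lump_top_k(s, 2)
print(lumped.tolist())
```

With k=2, the levels "a" and "b" are kept and "c" and "d" are recoded to "Other". For a `category` dtype column, the label "Other" would first have to be added as a category (or the column cast to object) before recoding.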