Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
270 views
in Technique[技术] by (71.8m points)

python - Differencies between OneHotEncoding (sklearn) and get_dummies (pandas)

I am wondering what is the difference between pandas' get_dummies() encoding of categorical features as compared to the sklearn's OneHotEncoder().

I've seen answers that mention that get_dummies() cannot produce encoding for categories not seen in the training dataset (answers here). However, this is a result of having performed the get_dummies() separately on the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied the get_dummies() on the original dataset, before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?

My code is currently working like the one below:

def one_hot_encode(ds,feature):
    #get DF of dummy variables
    dummies = pd.get_dummies(ds[feature])
    #One dummy variable to drop (Dummy Trap)
    dummyDrop = dummies.columns[0]
    #Create a DF from the original and the dummies' DF
    #Drop the original categorical variable and the one dummy
    final =  pd.concat([ds,dummies], axis='columns').drop([feature,dummyDrop], axis='columns')
    return final

#Get data DF
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns

#Perform one-hot-encoding on the DF (See function above) on categorical features
features = ["workclass","marital_status","occupation","relationship","race","sex","native_country"]
for f in features:
    dataset = one_hot_encode(dataset,f)
#Re-order to get ouput feature in last column
dataset = dataset[[c for c in dataset.columns if c!="income_level"]+["income_level"]]
dataset.head()
question from:https://stackoverflow.com/questions/65599297/differencies-between-onehotencoding-sklearn-and-get-dummies-pandas

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you apply get_dummies() and OneHotEncoder() in the general dataset, you should obtain the same result.

If you apply get_dummies() in the general dataset, and OneHotEncoder() in the train dataset, you will probably obtain a few (very small) differences if in the test data you have a "new" category. If not, they should have the same result.

The main difference between get_dummies() and OneHotEncoder() is how they work when you are using this model in real life (or in production) and your receive a "new" class of a categorical column that you haven't faced before

Example: Imagine your category "sex" can be only: male or female, and you sold your model to a company. What will happen if now, the category "sex" receives the value: "NA" (not applicable)? (Also, you can image that "NA" is an option, but it only appear 0.001%, and casually, you don't have any of this value in your dataset)

Using get_dummies(), you will have a problem, since your model is trained for only 2 different categories of sex, and now, you have a different and new category that the model can't hand with it.

Using OneHotEncoder(), will allow you to "ignore" this new category that your model can't face, allowing you to keep the same shape between the model input, and your new sample input.

That's why people uses OneHotEncoder() in train set and not in the general dataset, they are "simulating" this type of success (having "new" class you haven't faced before in a categorical column)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...