If you apply get_dummies() and OneHotEncoder() to the whole dataset, you should obtain the same encoded columns.
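As a quick check of that claim, here is a minimal sketch on a made-up toy column (the DataFrame and its values are my own assumption, not data from the question). It encodes the same column both ways and compares the resulting matrices.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# pandas: one indicator column per category found in the data
dummies = pd.get_dummies(df["sex"], dtype=int)

# scikit-learn: fit on the same data, then convert the sparse result to dense
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["sex"]]).toarray()

print(list(dummies.columns))         # ['female', 'male']
print(list(encoder.categories_[0]))  # ['female', 'male']
print((dummies.to_numpy() == encoded).all())  # True
```

The values match; the two results still differ in type (a DataFrame versus a NumPy array) and in how the columns are labelled.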
If you apply get_dummies() to the whole dataset and fit OneHotEncoder() only on the train dataset, you will probably see a few (very small) differences when the test data contains a "new" category. If it does not, both should give the same result.
The main difference between get_dummies() and OneHotEncoder() is how they behave when you use the model in real life (or in production) and you receive a "new" class in a categorical column that you haven't seen before.
Example: imagine your "sex" column can only take the values male or female, and you have sold your model to a company. What happens if the column now receives the value "NA" (not applicable)? (You can also imagine that "NA" was always a valid option, but it appears in only 0.001% of cases and, by chance, your dataset contains none of them.)
Using get_dummies(), you will have a problem: your model was trained on only 2 categories of sex, and now you have a new, different category that the model can't handle.
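To make the problem concrete, here is a small sketch with invented data (the `train` and `new_sample` frames are assumptions for illustration). get_dummies() only looks at the data it is given, so the new sample ends up with columns that do not match the ones the model was trained on.

```python
import pandas as pd

# Hypothetical training data and a hypothetical production sample
train = pd.DataFrame({"sex": ["male", "female", "female"]})
new_sample = pd.DataFrame({"sex": ["NA"]})  # category never seen in training

# Each call builds its columns from whatever categories it sees
train_encoded = pd.get_dummies(train, dtype=int)
new_encoded = pd.get_dummies(new_sample, dtype=int)

print(list(train_encoded.columns))  # ['sex_female', 'sex_male']
print(list(new_encoded.columns))    # ['sex_NA']
```

The model expects two input columns and receives one with a different name, so the shapes no longer line up.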
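Using OneHotEncoder() with handle_unknown="ignore" lets you "ignore" this new category that your model can't handle, so the model input and your new sample keep the same shape. (With the default setting, handle_unknown="error", the encoder raises an error on an unseen category instead.)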
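Here is a minimal sketch of that behaviour, again on invented data (the `train` and `new_sample` frames are assumptions). Note that the "ignore" behaviour has to be requested through the handle_unknown parameter. The encoder is fitted on the training data only and then reused on the unseen category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and a hypothetical production sample
train = pd.DataFrame({"sex": ["male", "female", "female"]})
new_sample = pd.DataFrame({"sex": ["NA"]})  # category never seen in training

# Learn the categories from the training data only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

train_encoded = encoder.transform(train).toarray()
new_encoded = encoder.transform(new_sample).toarray()

print(train_encoded.shape)  # (3, 2)
print(new_encoded)          # [[0. 0.]]
```

The unseen category becomes a row of zeros, so the sample still has the two columns the model was trained on.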
That's why people fit OneHotEncoder() on the train set and not on the whole dataset: they are "simulating" this kind of situation (receiving a "new" class you haven't seen before in a categorical column).