Instead of using pd.get_dummies, which has the drawbacks you identified, use sklearn.preprocessing.OneHotEncoder. It learns all nominal categories from your training data during fit, and then encodes your test data according to the categories identified in that fitting step. If the test data contains categories not seen during training, they are simply encoded as all 0's.
Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
x_train = np.array([["A1","B1","C1"],["A2","B1","C2"]])
x_test = np.array([["A1","B2","C2"]]) # As you can see, "B2" is a new attribute for column B
ohe = OneHotEncoder(handle_unknown='ignore')  # 'ignore' tells the encoder to encode unseen categories as all 0's
ohe.fit(x_train)
print(ohe.transform(x_train).toarray())
>>> array([[1., 0., 1., 1., 0.],
           [0., 1., 1., 0., 1.]])
To get a summary of the categories per column in the training set, do:
print(ohe.categories_)
>>> [array(['A1', 'A2'], dtype='<U2'),
     array(['B1'], dtype='<U2'),
     array(['C1', 'C2'], dtype='<U2')]
To map the one-hot-encoded columns back to their categories, do:
print(ohe.get_feature_names_out())
>>> ['x0_A1' 'x0_A2' 'x1_B1' 'x2_C1' 'x2_C2']
(On scikit-learn versions older than 1.0, use ohe.get_feature_names() instead; that method was removed in 1.2.)
Finally, this is how the encoder works on new test data:
print(ohe.transform(x_test).toarray())
>>> [[1. 0. 0. 0. 1.]] # 1 for A1, 0 for A2, 0 for B1, 0 for C1, 1 for C2
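If you ever need to go the other way, OneHotEncoder also has an inverse_transform method. A small sketch of the example above: the all-zero block produced by the unknown "B2" comes back as None, so you can see exactly which values were unseen at training time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x_train = np.array([["A1", "B1", "C1"], ["A2", "B1", "C2"]])
x_test = np.array([["A1", "B2", "C2"]])

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(x_train)

encoded = ohe.transform(x_test).toarray()
# inverse_transform maps each one-hot block back to its category;
# an all-zero block (the unknown "B2") is returned as None.
decoded = ohe.inverse_transform(encoded)
print(decoded)  # [['A1' None 'C2']]
```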
EDIT:
You seem to be worried about losing the labels after encoding. They are actually easy to recover: just wrap the result in a DataFrame and take the column names from ohe.get_feature_names_out():
pd.DataFrame(ohe.transform(x_test).toarray(), columns=ohe.get_feature_names_out())