The basic task that I have at hand is:
a) Read some tab-separated data.
b) Do some basic preprocessing.
c) For each categorical column, use LabelEncoder to create a mapping. This is done somewhat like this:
from sklearn import preprocessing

mapper = {}
# Converting categorical data: fit one LabelEncoder per column
# and replace the column with its integer codes.
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
    df[x] = mapper[x].fit_transform(df[x])
where df is a pandas DataFrame and categorical_list is a list of column headers that need to be transformed.
d) Train a classifier and save it to disk using pickle.
e) Now in a different program, the model saved is loaded.
f) The test data is loaded and the same preprocessing is performed.
g) The LabelEncoders are used for converting the categorical data.
h) The model is used to predict.
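For concreteness, steps d) through g) might look roughly like the sketch below. It assumes the fitted encoders are pickled alongside the model and that the second program calls transform (not fit_transform) on the test data. The toy DataFrames, the "color" column, and the in-memory pickle round trip are made up for illustration and stand in for the real files.

```python
import pickle

import pandas as pd
from sklearn import preprocessing

# Toy training data; the column name "color" is hypothetical.
train_df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
categorical_list = ["color"]

# Training program: fit one encoder per categorical column.
mapper = {}
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()
    train_df[x] = mapper[x].fit_transform(train_df[x])

# Serialize the fitted encoders (in practice, written to disk next to the model).
saved = pickle.dumps(mapper)

# Prediction program: load the encoders and reuse them without refitting.
loaded_mapper = pickle.loads(saved)
test_df = pd.DataFrame({"color": ["blue", "red"]})
for x in categorical_list:
    test_df[x] = loaded_mapper[x].transform(test_df[x])

print(test_df["color"].tolist())
```

In this sketch a label that was never seen during fitting would make transform raise a ValueError, so the test data is assumed to contain only categories present in the training data.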
Now the question that I have is: will step g) work correctly?
As the documentation for LabelEncoder says:
"It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels."
So will each entry map to the exact same value every time?
If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder, or is there an altogether different approach from LabelEncoder?
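As far as I can tell, a fitted encoder keeps its sorted unique labels in the classes_ attribute, so the mapping could probably be recovered along these lines. This is only a sketch; the city names are made up for illustration.

```python
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
encoder.fit(["paris", "tokyo", "amsterdam", "tokyo"])

# classes_ holds the sorted unique labels; each label's index is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the original labels.
decoded = [str(label) for label in encoder.inverse_transform([0, 2])]
print(decoded)
```

If that is right, the codes depend only on the sorted set of labels seen during fitting, not on any per-run hash value.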