I tried two ways to train an ML model:
1- "Manually"
a/ preprocessing:
I apply OneHotEncoder(handle_unknown='ignore', sparse=False) to the cat_features columns of my training dataset X_train by calling OH_encoder.fit_transform(X_train[cat_features]). The result is called OneHotColumns.
Then I concatenate OneHotColumns with the other columns of X_train, so that the result of this preprocessing step is a dataset called X_train_encoded, which is the same as X_train but with the cat_features columns replaced by OneHotColumns.
I do the same for the validation dataset X_val, but by calling OH_encoder.transform(X_val[cat_features]) instead. The resulting dataset is called X_val_encoded.
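The preprocessing in step a/ can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy X_train and X_val frames, the make_encoder helper and the onehot_i column names are invented for the example, and the non-categorical columns are assumed to be kept first with the encoded columns appended on the right. The helper is there only because scikit-learn 1.2 renamed the sparse parameter to sparse_output.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def make_encoder():
    """Hypothetical helper: dense one-hot encoder across scikit-learn versions."""
    try:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # >= 1.2
    except TypeError:
        return OneHotEncoder(handle_unknown="ignore", sparse=False)  # older releases


# Toy data standing in for the real X_train / X_val (assumption).
X_train = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                        "size": [1.0, 2.0, 3.0, 4.0]})
X_val = pd.DataFrame({"color": ["blue", "yellow"], "size": [5.0, 6.0]})
cat_features = ["color"]

OH_encoder = make_encoder()
# Learn the categories on the training data only, then reuse them on validation.
train_array = OH_encoder.fit_transform(X_train[cat_features])
val_array = OH_encoder.transform(X_val[cat_features])

# String column names, so the encoded frames do not mix name types.
onehot_names = [f"onehot_{i}" for i in range(train_array.shape[1])]
OneHotColumns = pd.DataFrame(train_array, index=X_train.index, columns=onehot_names)
OneHotColumns_val = pd.DataFrame(val_array, index=X_val.index, columns=onehot_names)

# Replace the categorical columns by their one-hot encoding.
# Assumed layout: remaining columns first, encoded columns last.
X_train_encoded = pd.concat([X_train.drop(columns=cat_features), OneHotColumns], axis=1)
X_val_encoded = pd.concat([X_val.drop(columns=cat_features), OneHotColumns_val], axis=1)

print(X_val_encoded)
```

With handle_unknown='ignore', a category that was not seen during fit (here "yellow") is encoded as a row of zeros instead of raising an error.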
b/ model:
I define the following model: model = RandomForestClassifier(n_estimators=15, max_depth=2, random_state=0)
Then I fit this model to my encoded datasets: model.fit(X_train_encoded, Y_train)
c/ predictions:
Finally, I compute the predictions with predictions = model.predict(X_val_encoded).
The resulting MAE is mean_absolute_error(Y_valid, predictions) = MAE1.
2- Using Pipeline
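I build the following Pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns.
# Note: scikit-learn 1.2 renamed the sparse parameter to sparse_output.
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
# Encode cat_features and pass every other column through unchanged.
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_features)
    ],
    remainder='passthrough'
)
model = RandomForestClassifier(n_estimators=15, max_depth=2, random_state=0)
# Chain preprocessing and model into a single estimator.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                      ])
clf.fit(X_train, Y_train)
preds = clf.predict(X_val)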
The resulting MAE is mean_absolute_error(Y_valid, preds) = MAE2.
Now my problem is simple: why is MAE1 different from MAE2?
Why do the two methods not give the same result?
This is very strange, since I use the same model in both cases, and I think the preprocessing in the Pipeline is the same as the preprocessing done in the first method...
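One way to test the assumption that both preprocessing steps produce the same input for the model is to compare the arrays directly. The sketch below is hedged accordingly: the toy X_train and the make_encoder helper are invented for the example, and the manual layout (untouched columns first, one-hot columns appended) is only a guess at how method 1 concatenates. The ColumnTransformer documentation states that passthrough columns are added to the right of the transformer outputs, so the two layouts may contain the same values in a different column order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def make_encoder():
    """Hypothetical helper: dense one-hot encoder across scikit-learn versions."""
    try:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # >= 1.2
    except TypeError:
        return OneHotEncoder(handle_unknown="ignore", sparse=False)  # older releases


# Toy data standing in for the real X_train (assumption).
X_train = pd.DataFrame({"size": [1.0, 2.0, 3.0, 4.0],
                        "color": ["red", "blue", "red", "green"]})
cat_features = ["color"]

# Method 1, assumed layout: untouched columns first, one-hot columns appended.
onehot = make_encoder().fit_transform(X_train[cat_features])
rest = X_train.drop(columns=cat_features).to_numpy()
manual = np.hstack([rest, onehot])

# Method 2: ColumnTransformer with the remaining columns passed through.
preprocessor = ColumnTransformer(
    transformers=[("cat", make_encoder(), cat_features)],
    remainder="passthrough",
)
piped = preprocessor.fit_transform(X_train)

# Compare the pipeline output with both possible column orders.
same_as_manual = np.array_equal(manual, piped)
same_as_onehot_first = np.array_equal(np.hstack([onehot, rest]), piped)
print("identical to manual layout:      ", same_as_manual)
print("identical to one-hot-first layout:", same_as_onehot_first)
```

If the column order differs between the two methods, a RandomForestClassifier with a fixed random_state can build different trees, because the random feature selection then refers to different columns.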
In fact, I would really like to know what clf.fit does exactly in the Pipeline.
At the preprocessing step, does it do the same as in method 1, that is, something like OH_encoder.fit_transform on X_train, then OH_encoder.transform on X_val?
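As a rough sketch of the sequence of calls in question: Pipeline.fit calls fit_transform on each intermediate step and fit on the final estimator, while Pipeline.predict calls transform on the intermediate steps and predict on the final estimator. The example below runs the same steps once through the Pipeline and once by hand and compares the predictions. The toy data and the make_encoder, make_preprocessor and make_model helpers are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_encoder():
    """Hypothetical helper: dense one-hot encoder across scikit-learn versions."""
    try:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # >= 1.2
    except TypeError:
        return OneHotEncoder(handle_unknown="ignore", sparse=False)  # older releases


def make_preprocessor(cat_features):
    return ColumnTransformer(
        transformers=[("cat", make_encoder(), cat_features)],
        remainder="passthrough",
    )


def make_model():
    return RandomForestClassifier(n_estimators=15, max_depth=2, random_state=0)


# Toy data standing in for the real datasets (assumption).
X_train = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "green"],
                        "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
Y_train = np.array([0, 1, 0, 1, 1, 0])
X_val = pd.DataFrame({"color": ["blue", "yellow", "red"], "size": [2.5, 1.5, 5.5]})
cat_features = ["color"]

# Through the Pipeline.
clf = Pipeline(steps=[("preprocessor", make_preprocessor(cat_features)),
                      ("model", make_model())])
clf.fit(X_train, Y_train)
preds_pipeline = clf.predict(X_val)

# The same steps by hand.
preprocessor = make_preprocessor(cat_features)
X_train_encoded = preprocessor.fit_transform(X_train)  # fit and transform on training data
model = make_model()
model.fit(X_train_encoded, Y_train)
X_val_encoded = preprocessor.transform(X_val)          # transform only, no refit
preds_by_hand = model.predict(X_val_encoded)

print(np.array_equal(preds_pipeline, preds_by_hand))
```

Note that the validation data is never seen during clf.fit. It is only transformed later, inside clf.predict, with the categories learned from X_train.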
Thank you for helping me.
question from:
https://stackoverflow.com/questions/65623346/how-does-onehotencoder-work-when-used-through-sklearn-pipeline