python - How does OneHotEncoder work when used through Sklearn Pipeline?


posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)


I tried two ways to train an ML model:

1- "Manually"

a/ preprocessing:

I apply OneHotEncoder(handle_unknown='ignore', sparse=False) to the cat_features columns of my training dataset X_train by calling OH_encoder.fit_transform(X_train[cat_features]). The result is called OneHotColumns.

Then I concatenate OneHotColumns with the other columns of X_train. The result of this preprocessing step is a dataset called X_train_encoded, which is X_train with the cat_features columns replaced by OneHotColumns.

I do the same for the validation dataset X_val, but call OH_encoder.transform(X_val[cat_features]) instead. The resulting dataset is called X_val_encoded.
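The manual preprocessing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data, the column names "num" and "color", and the choice to put the remaining columns first and the one-hot columns last are all hypothetical, since the question does not say in which order the columns are concatenated. The encoder is left at its default sparse output and converted with .toarray(), which avoids depending on the sparse / sparse_output keyword.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names are assumptions for illustration.
X_train = pd.DataFrame({
    "num": [1.0, 2.0, 3.0, 4.0],
    "color": ["red", "blue", "red", "green"],
})
X_val = pd.DataFrame({
    "num": [5.0, 6.0],
    "color": ["blue", "purple"],  # "purple" never appears in X_train
})
cat_features = ["color"]

OH_encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform learns the categories from the training data only.
train_array = OH_encoder.fit_transform(X_train[cat_features]).toarray()
# transform reuses the learned categories; nothing is learned from X_val.
val_array = OH_encoder.transform(X_val[cat_features]).toarray()

# String column names (valid here because there is one categorical column).
ohe_names = [f"color_{c}" for c in OH_encoder.categories_[0]]
train_ohe = pd.DataFrame(train_array, index=X_train.index, columns=ohe_names)
val_ohe = pd.DataFrame(val_array, index=X_val.index, columns=ohe_names)

# Assumed layout: remaining columns first, one-hot columns appended last.
X_train_encoded = pd.concat([X_train.drop(columns=cat_features), train_ohe], axis=1)
X_val_encoded = pd.concat([X_val.drop(columns=cat_features), val_ohe], axis=1)

print(list(X_train_encoded.columns))
print(X_val_encoded)
```

With handle_unknown='ignore', the unseen category "purple" is encoded as a row of zeros rather than raising an error.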

b/ model:

I define the following model: model = RandomForestClassifier(n_estimators=15, max_depth=2, random_state=0)

Then I fit this model to my encoded datasets: model.fit(X_train_encoded, Y_train)

c/ predictions:

Finally, I compute predictions with predictions = model.predict(X_val_encoded) and obtain the MAE: mean_absolute_error(Y_valid, predictions) = MAE1

2- Using Pipeline

I build the following Pipeline:

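from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns
# (scikit-learn 1.2+ renames the sparse keyword to sparse_output)
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
# Encode cat_features and pass every other column through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_features)
    ],
    remainder='passthrough'
)
model = RandomForestClassifier(n_estimators=15, max_depth=2, random_state=0)
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])
# Fit preprocessing and model on the training data, then predict on the validation data
clf.fit(X_train, Y_train)
preds = clf.predict(X_val)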

I then obtain the MAE: mean_absolute_error(Y_valid, preds) = MAE2

Now my problem is simple: why is MAE1 different from MAE2? Why do the two methods not give the same results? It is very strange, since I use the same model in both cases, and I think the preprocessing in the Pipeline is equivalent to the preprocessing done in the first method...
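One thing worth checking here, offered as a possibility rather than a confirmed diagnosis, is whether the two approaches produce the columns in the same order. ColumnTransformer places the outputs of the listed transformers first and the remainder='passthrough' columns last, while a manual concatenation may use a different layout. A random forest with a fixed random_state picks candidate features by column position, so a different column order can lead to different trees. The sketch below uses hypothetical toy data and assumes the manual approach puts the remaining columns first.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names are assumptions for illustration.
X_train = pd.DataFrame({
    "num": [1.0, 2.0, 3.0, 4.0],
    "color": ["red", "blue", "red", "green"],
})
cat_features = ["color"]

# Manual layout (assumed): remaining columns first, one-hot columns last.
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(X_train[cat_features]).toarray()
manual = np.hstack([X_train.drop(columns=cat_features).to_numpy(), onehot])

# ColumnTransformer layout: listed transformers first, remainder last.
preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
piped = preprocessor.fit_transform(X_train)

# Same values, different column positions.
same_layout = np.array_equal(manual, piped)
# Moving the first manual column to the end reproduces the pipeline layout.
same_values = np.array_equal(np.roll(manual, -1, axis=1), piped)
print(same_layout, same_values)
```

If the layouts differ in the real data, reordering the manually encoded columns to match the pipeline output (or the reverse) is a quick way to test whether column order explains the gap between MAE1 and MAE2.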

In fact, I would really like to know what exactly clf.fit does in the Pipeline. At the preprocessing step, does it do the same as in method 1, that is, something like OH_encoder.fit_transform on X_train, then OH_encoder.transform on X_val?
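As a rough sketch of what happens: Pipeline.fit calls fit_transform on each preprocessing step using the training data only, then fits the final estimator on the transformed result. The validation data is not touched until clf.predict, which calls transform (never fit) on the preprocessing steps and then predict on the model. The example below reproduces those steps by hand on hypothetical toy data and compares the predictions; the data and variable names such as prep_by_hand are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical, deterministic toy data.
X_train = pd.DataFrame({
    "num": np.linspace(0.0, 1.0, 40),
    "color": ["red", "blue", "green", "blue"] * 10,
})
Y_train = (X_train["num"] > 0.5).astype(int)
X_val = pd.DataFrame({
    "num": np.linspace(0.05, 0.95, 10),
    "color": ["red", "purple"] * 5,  # "purple" is unseen in training
})
cat_features = ["color"]

preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
model = RandomForestClassifier(n_estimators=15, max_depth=2, random_state=0)

# Pipeline route.
clf = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
clf.fit(X_train, Y_train)
preds = clf.predict(X_val)

# The same steps by hand, on fresh unfitted copies.
prep_by_hand = clone(preprocessor)
model_by_hand = clone(model)
Xt = prep_by_hand.fit_transform(X_train)      # what clf.fit does first
model_by_hand.fit(Xt, Y_train)                # then fits the model on Xt
Xv = prep_by_hand.transform(X_val)            # clf.predict only transforms
preds_by_hand = model_by_hand.predict(Xv)

same = np.array_equal(preds, preds_by_hand)
print(same)
```

Here the hand-written steps use the same column layout as the pipeline, so the predictions match. A manual version that arranges the columns differently is not guaranteed to match, even with the same random_state.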

Thank you for helping me.

question from:https://stackoverflow.com/questions/65623346/how-does-onehotencoder-work-when-used-through-sklearn-pipeline
