Update:
Ideally, the answer below should not be used, because it leads to data leakage, as discussed in the comments. In this answer, GridSearchCV tunes the hyperparameters on data that has already been preprocessed by StandardScaler as a whole, so every validation fold has influenced the scaling statistics, which is not correct. In most cases that will not matter much, but algorithms that are very sensitive to scaling can give misleading results.
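For comparison, here is a minimal sketch of the leak-free arrangement, where the scaler stays inside the estimator being searched and is therefore re-fitted on the training part of every fold. The synthetic dataset and the variable names are assumptions made for illustration; they are not from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The pipeline is the estimator under search, so StandardScaler is fitted
# only on the training portion of each CV split (no leakage into validation).
grid = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression()),
    param_grid={"logisticregression__C": [0.1, 10.0]},
    cv=2,
)
grid.fit(X, y)
leak_free_preds = grid.predict(X)
```

The price is that the scaler is fitted once per fold and parameter combination, which is the repeated work the original answer tries to avoid.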
Essentially, GridSearchCV is also an estimator, implementing fit() and predict() methods, used by the pipeline.
So instead of:
grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
Do this:
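# The grid now searches LogisticRegression directly, so the key is 'C',
# not the pipeline-style 'logisticregression__C'.
clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))
clf.fit(X, y)      # X, y: your training data
clf.predict(X)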
This fits StandardScaler only once per call to clf.fit(), instead of multiple times as you described.
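A self-contained sketch of this arrangement on synthetic data may make the flow easier to follow. The dataset and variable names are assumptions for illustration. Note that the cross-validation inside GridSearchCV sees data that was scaled using all rows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler is the first pipeline step; the grid search is the final estimator.
clf = make_pipeline(
    StandardScaler(),
    GridSearchCV(LogisticRegression(), param_grid={"C": [0.1, 10.0]}, cv=2, refit=True),
)

clf.fit(X, y)           # StandardScaler.fit_transform runs once, then the search runs
preds = clf.predict(X)  # works because refit=True keeps a fitted best estimator

# make_pipeline names steps after the lowercased class name.
search = clf.named_steps["gridsearchcv"]
best_C = search.best_params_["C"]
```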
Edit:
Changed refit to True, since GridSearchCV is used inside a pipeline. As mentioned in the documentation:
refit : boolean, default=True
Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.
If refit=False, clf.fit() still runs the search and records the scores, but the GridSearchCV object inside the pipeline does not keep a fitted best estimator afterwards, so the pipeline cannot be used for prediction.
When refit=True, GridSearchCV is refitted with the best-scoring parameter combination on the whole data passed to fit().
So refit=False is appropriate only if you build the pipeline just to see the scores of the grid search. If you want to call clf.predict(), refit=True must be used; otherwise a not-fitted error will be raised.
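The difference between the two settings can be checked directly on a bare GridSearchCV. This is a minimal sketch on synthetic data; the dataset and variable names are assumptions for illustration. It only inspects attributes, because the exact exception raised by predict() has changed between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
param_grid = {"C": [0.1, 10.0]}

# refit=False: the search runs and scores are recorded, but no final model is kept.
no_refit = GridSearchCV(LogisticRegression(), param_grid=param_grid, cv=2, refit=False)
no_refit.fit(X, y)
scores = no_refit.cv_results_["mean_test_score"]  # one score per candidate
has_model_without_refit = hasattr(no_refit, "best_estimator_")

# refit=True: the best candidate is fitted again on all of X, y.
with_refit = GridSearchCV(LogisticRegression(), param_grid=param_grid, cv=2, refit=True)
with_refit.fit(X, y)
has_model_with_refit = hasattr(with_refit, "best_estimator_")
refit_preds = with_refit.predict(X)
```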