Is this the correct process for feature selection?
This is one of many ways to do feature selection. Recursive feature elimination (RFE) is an automated approach; others are listed in the scikit-learn documentation. Each has its own pros and cons, and feature selection usually works best when you also apply common sense and try models with different feature sets. RFE is a quick way of selecting a good set of features, but it does not necessarily give you the best possible set.
By the way, you don't need to build your StratifiedKFold separately. If you just set the cv parameter to cv=3, both RFECV and GridSearchCV will automatically use StratifiedKFold when the estimator is a classifier and the y values are binary or multiclass, which is most likely the case here since you are using LogisticRegression.
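To illustrate the point about cv=3, here is a minimal sketch on synthetic data (the dataset and its sizes are my own assumptions, not taken from the question). It uses check_cv, which as far as I know is the helper scikit-learn relies on to turn an integer cv into a splitter, and then passes the plain integer to RFECV.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, check_cv

# Synthetic binary classification data, for illustration only.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Resolve cv=3 the way a classifier-based estimator would.
splitter = check_cv(cv=3, y=y, classifier=True)
print(type(splitter).__name__)  # StratifiedKFold

# So the integer can be passed directly; no separate StratifiedKFold object.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=3)
X_new = rfecv.fit_transform(X, y)
print(X_new.shape[1], "features kept")
```

Building the StratifiedKFold yourself is still worthwhile if you want shuffling or a fixed random_state, since the integer shortcut gives you unshuffled folds.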
You can also combine
# Fit the features to the response variable
rfecv.fit(X, y)
# Put the best features into new df X_new
X_new = rfecv.transform(X)
into
X_new = rfecv.fit_transform(X, y)
Is this the correct process for hyper-parameter selection?
GridSearchCV is basically an automated way of systematically trying a whole set of combinations of model parameters and picking the best among these according to some performance metric. It's a good way of finding well-suited parameters, yes.
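As a rough sketch of what that search looks like, the snippet below tries a few values of C on synthetic data and prints the mean cross-validated score for each candidate. The data, the grid values and the accuracy metric are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Hypothetical grid: four values of the regularisation strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=3, scoring="accuracy"
)
grid.fit(X, y)

# cv_results_ holds one entry per parameter combination that was tried.
results = grid.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(score, 3))

print("best:", grid.best_params_)
```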
Is this the correct process for fitting?
Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it builds a grid of LogisticRegression estimators (one for each parameter combination being tried) and fits each of them. It keeps the one with the best performance under grid.best_estimator_, the parameters of this estimator in grid.best_params_ and its mean cross-validated score under grid.best_score_. Note that fit returns the GridSearchCV object itself, not the best estimator.
Remember that any new X values you want to predict on must first go through the transform of the fitted RFECV model. So you can actually add this step to the pipeline as well.
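One way to avoid forgetting the transform on new data is to put the selector and the classifier into a single Pipeline and search over that. This is only a sketch under my own assumptions: the step names "select" and "clf", the synthetic data and the grid values are all made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Feature selection and the classifier as consecutive pipeline steps.
pipe = Pipeline([
    ("select", RFECV(LogisticRegression(max_iter=1000), cv=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a step are addressed as <step name>__<parameter>.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
fitted = grid.fit(X_train, y_train)
print(fitted is grid)  # fit returns the GridSearchCV object itself

# predict applies the fitted RFECV transform before the classifier.
y_pred = grid.predict(X_test)
print(y_pred[:10])
```

A side effect of this layout is that feature selection is redone inside every cross-validation fold, so the reported score is not flattered by features chosen with knowledge of the validation data. It is slower than selecting once up front.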
Where can I find the fitted coefficients for the selected features?
The grid.best_estimator_ attribute is a LogisticRegression object with all this information, so grid.best_estimator_.coef_ has all the coefficients (and grid.best_estimator_.intercept_ is the intercept). Note that grid.best_estimator_ is only available if the refit parameter on GridSearchCV is set to True, but this is the default anyway.
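To see which coefficient belongs to which feature, you can combine coef_ with the selection mask that RFECV stores in support_. A minimal sketch, assuming binary y and hypothetical feature names f0 to f7 on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with made-up feature names, for illustration only.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])

# Select features, then tune C on the reduced matrix.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3)
X_new = rfecv.fit_transform(X, y)

# refit=True is the default, so best_estimator_ is available after fit.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=3
)
grid.fit(X_new, y)

# support_ is a boolean mask over the original columns.
selected = feature_names[rfecv.support_]

# For binary y, coef_ has shape (1, n_selected_features).
coefs = dict(zip(selected, grid.best_estimator_.coef_[0]))
for name, value in coefs.items():
    print(name, round(float(value), 3))
print("intercept:", round(float(grid.best_estimator_.intercept_[0]), 3))
```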