I'm trying to get the "same" metrics using RFECV and cross_val_score. I use the second method because it's really important for me to report metrics with their standard deviation (uncertainties are cool).
This is the regression model:
    from sklearn.linear_model import Lasso

    regression = Lasso(alpha=0.1,
                       selection="random",
                       max_iter=10000,
                       random_state=42)
The RFECV method:
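    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import KFold

    # Lower bound on the number of features: one per ten rows of df
    min_number_features = df.shape[0] // 10
    rfecv = RFECV(estimator=regression,
                  step=1,
                  min_features_to_select=min_number_features,
                  cv=KFold(n_splits=10,
                           shuffle=True,
                           random_state=42),
                  scoring='neg_mean_squared_error')
    rfecv.fit(X_train, target_train)
    # score() reduces X_train to the selected features and calls the
    # fitted estimator's own score method
    score = rfecv.score(X_train, target_train)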
On average, it gives an RMSE of 0.84. The cross_val_score method is the following:
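    from sklearn.model_selection import cross_val_score

    metrics_cross_val_score = [
        "neg_root_mean_squared_error",
        "neg_mean_squared_error",
        "r2",
        "explained_variance",
        "neg_mean_absolute_error",
        "max_error",
        "neg_median_absolute_error"
    ]
    # `metrics` (a dict) and `mean` (the normalization constant) are
    # assumed to be defined earlier
    for m in metrics_cross_val_score:
        score = cross_val_score(regression,
                                X_train,
                                target_train,
                                cv=KFold(n_splits=10,
                                         shuffle=True,
                                         random_state=42),
                                scoring=m)
        # Note: the minus sign is applied to every metric,
        # including r2 and explained_variance
        score = [-score.mean() / mean, score.std() / mean]
        metrics[m] = round(score[0], 2)
        dev = "std_" + m
        metrics[dev] = round(score[1], 2)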
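For reference, `cross_validate` accepts a list of scorer names and evaluates them all on the same folds in a single pass, which avoids refitting the model once per metric. The sketch below is only an illustration under assumptions: it uses synthetic data (`X_demo`, `y_demo`), a shorter metric list, and a hypothetical `summary` dict, and it flips the sign only for scorers whose name starts with `neg_`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for X_train / target_train (assumption)
X_demo, y_demo = make_regression(n_samples=120, n_features=8, n_informative=4,
                                 noise=5.0, random_state=0)

scorers = ["neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"]

cv_out = cross_validate(
    Lasso(alpha=0.1, selection="random", max_iter=10000, random_state=42),
    X_demo,
    y_demo,
    cv=KFold(n_splits=10, shuffle=True, random_state=42),
    scoring=scorers,
)

summary = {}
for name in scorers:
    fold_scores = cv_out[f"test_{name}"]  # one value per fold
    is_negated = name.startswith("neg_")
    sign = -1.0 if is_negated else 1.0    # r2 keeps its sign
    label = name[4:] if is_negated else name
    summary[label] = (sign * fold_scores.mean(), fold_scores.std())

for label, (mean_value, std_value) in summary.items():
    print(f"{label}: {mean_value:.3f} +/- {std_value:.3f}")
```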
For the second method, I normalize every metric by the mean (in an attempt to get a score between 0 and 1). The results tend not to match the first method exactly, although the RFECV RMSE falls within the cross_val_score RMSE ± one standard deviation, which is quite large and not good.
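Dividing by the mean is one of several common conventions for a normalized RMSE. As a sketch for comparing them side by side, the hypothetical helper `normalized_rmse` below computes the RMSE together with three ratios: by the mean of the target, by its range, and by its interquartile range. Note that none of these ratios is guaranteed to stay below 1, so they are relative errors rather than true 0-to-1 scores.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """Return the RMSE and three normalized variants (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    q75, q25 = np.percentile(y_true, [75, 25])
    return {
        "rmse": rmse,
        "by_mean": rmse / abs(y_true.mean()),               # undefined if the mean is 0
        "by_range": rmse / (y_true.max() - y_true.min()),   # sensitive to outliers
        "by_iqr": rmse / (q75 - q25),                       # more robust to outliers
    }


# Toy example: every prediction is off by exactly 1
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [2.0, 3.0, 4.0, 5.0, 6.0]
result = normalized_rmse(y_true, y_pred)
print(result)
```

Which variant suits a dataset depends on the target: a mean close to zero makes the first ratio unstable, and heavy outliers inflate the range.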
So, here are my questions:
1. I have read about many ways of normalizing the RMSE (by the mean, by y_max - y_min, by quantiles...), and I don't know yet which is the best approach for my data. Does anyone have a recommendation?
2. RFECV works with the selected features, and cross_val_score with all features. If I run cross_val_score on the very same columns that RFECV selects, the cross_val_score RMSE gets dramatically worse, and that really puzzles me.
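One way to put the two setups on equal footing is to pass the RFECV object itself to `cross_val_score`, so that the feature selection is refit inside every outer fold instead of being done once on the full training set. This is a sketch under assumptions (synthetic data, smaller fold counts, hypothetical names such as `make_lasso`, `outer_cv`, and `inner_cv`), and it does not by itself explain the drop described in the second question; it only gives a comparison in which both numbers come from the same outer folds.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / target_train (assumption)
X_demo, y_demo = make_regression(n_samples=120, n_features=8, n_informative=4,
                                 noise=5.0, random_state=0)


def make_lasso():
    # Same hyperparameters as the model in the post
    return Lasso(alpha=0.1, selection="random", max_iter=10000, random_state=42)


outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)

selector = RFECV(
    estimator=make_lasso(),
    step=1,
    min_features_to_select=2,
    cv=inner_cv,
    scoring="neg_root_mean_squared_error",
)

# (a) All features, no selection
rmse_all = -cross_val_score(make_lasso(), X_demo, y_demo, cv=outer_cv,
                            scoring="neg_root_mean_squared_error")

# (b) Feature selection refit inside each outer training fold
rmse_nested = -cross_val_score(selector, X_demo, y_demo, cv=outer_cv,
                               scoring="neg_root_mean_squared_error")

print(f"all features:     {rmse_all.mean():.2f} +/- {rmse_all.std():.2f}")
print(f"nested selection: {rmse_nested.mean():.2f} +/- {rmse_nested.std():.2f}")
```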
Here is a comparison between the RFECV RMSE (alg_score) and the cross_val_score metrics with their standard deviations (everything else).
I hope I made myself clear.
If you are curious, here is the dashboard with everything related to this:
https://datastudio.google.com/s/gUKsAyZfI5I