I am using sklearn.model_selection.LeavePGroupsOut to train a classifier on each of the sites in my dataset and test it on all other sites. Now I have this problem: after running the analysis I only obtain a 'global' test score pooled over all p sites that were used for testing. Instead, what I am looking for is a way to obtain a test score separately for each site.
Here's an example where I use the breast_cancer data set and create three dummy sites to which the subjects are assigned (note that I created different sample sizes for each of the groups; see the lower section for why I did this):
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import LeavePGroupsOut
from sklearn.model_selection import cross_validate
from sklearn.datasets import load_breast_cancer
# create a random number generator
rng = np.random.RandomState(42)
# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)
# for this example, only take the first 300 subjects
X = X[0:300,:]
y = y[0:300]
# define dummy sites, let's assume all subjects came from three different sites
# Let's also assume the three sites have different numbers of subjects
groups = np.concatenate((np.repeat('site_1', 150),
                         np.repeat('site_2', 100),
                         np.repeat('site_3', 50)))
# optimize classifier on one site and leave two sites out for testing
n_groups = 2
# z-standardize features
scaler = StandardScaler()
# use linear L2-regularized Logistic Regression as classifier
lr = LogisticRegression(random_state=rng)
# define parameter grid to optimize over (optimize C)
lr_c = np.linspace(start=0.015625,stop=16,num=11,endpoint=True)
p_grid = {'lr__C':lr_c}
# create pipeline
lr_pipe = Pipeline([
    ('scaler', scaler),
    ('lr', lr)
])
# define inner and outer folds (use LeavePGroupsOut)
skf_inner = StratifiedKFold(shuffle=True,random_state=rng)
lpgo_outer = LeavePGroupsOut(n_groups=n_groups)
# implement GridSearch (inner cross validation)
grid = GridSearchCV(lr_pipe,
                    param_grid=p_grid,
                    cv=skf_inner,
                    verbose=1)
# implement cross_validate (outer cross validation)
nested_cv_scores = cross_validate(grid,
                                  X,
                                  y,
                                  groups=groups,
                                  cv=lpgo_outer,
                                  return_train_score=True,
                                  return_estimator=True,
                                  verbose=1)
Now when one looks at nested_cv_scores['test_score'], one gets these three test scores: 0.915, 0.945, 0.96. Instead, I want to obtain 6 scores: each of the three sites is used once for training, and the two held-out sites should each get their own score (3 folds × 2 test sites = 6 scores; the fold structure is shown in the short sketch below).
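To make the fold structure explicit: LeavePGroupsOut(n_groups=2) with three sites yields three outer folds, each training on one site and testing on the two remaining sites pooled together, which is why cross_validate reports only three test scores. A small sketch (illustration only, reusing the objects defined above) that prints the composition of each fold:

for fold_idx, (train_index, test_index) in enumerate(lpgo_outer.split(X, y, groups)):
    # show which site is used for training and which two sites form the pooled test set
    print(f"fold {fold_idx}: train site(s): {np.unique(groups[train_index])}, "
          f"test sites: {np.unique(groups[test_index])}")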
What I already came up with:
My idea was to take the fitted pipeline from each of the three final estimators (nested_cv_scores['estimator'][idx].best_estimator_) and to run LeavePGroupsOut again, using

for train_index, test_index in lpgo_outer.split(X, y, groups):
    ...

With that, I guess one could recalculate the test scores separately for each site (by calling the predict method and then computing the score from y_pred and y_true), as in the sketch below.
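Here is a minimal sketch of that idea (illustration only; site_scores, fold_idx and best_pipe are names I made up, and it assumes that lpgo_outer.split yields the folds in the same order in which cross_validate fitted the returned estimators):

from sklearn.metrics import accuracy_score

site_scores = {}
for fold_idx, (train_index, test_index) in enumerate(lpgo_outer.split(X, y, groups)):
    # fitted pipeline from this outer fold (GridSearchCV refits it on the full training fold)
    best_pipe = nested_cv_scores['estimator'][fold_idx].best_estimator_
    # score each of the two held-out sites separately instead of pooling them
    for site in np.unique(groups[test_index]):
        site_idx = test_index[groups[test_index] == site]
        y_pred = best_pipe.predict(X[site_idx])
        score = accuracy_score(y[site_idx], y_pred)
        site_scores.setdefault(site, []).append(score)
        print(f"fold {fold_idx}, test site {site}: accuracy = {score:.3f}")

This would give the 3 × 2 = 6 per-site scores by re-using the estimators that cross_validate already returned, but it still feels like a workaround rather than a clean solution.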
I wonder, though, whether there is a more elegant way to solve this problem? Maybe I have overlooked an alternative to LeavePGroupsOut? Also note that I can't use sklearn.model_selection.cross_val_predict here: with LeavePGroupsOut(n_groups=2) every sample ends up in more than one test fold, so the splits do not form a partition of the data, and one gets a ValueError: cross_val_predict only works for partitions (a sketch of the failing call is shown below).
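For reference, this is the kind of call that fails (a sketch, assuming grid, X, y, groups and lpgo_outer are the objects defined above):

from sklearn.model_selection import cross_val_predict

# every sample appears in two of the three test folds, so the splits are not a partition
y_pred = cross_val_predict(grid, X, y, groups=groups, cv=lpgo_outer)
# raises: ValueError: cross_val_predict only works for partitions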