I have a pandas DataFrame which looks approximately as follows:
cli_id | X1 | X2 | X3 | ... | Xn | Y |
----------------------------------------
123 | 1 | A | XX | ... | 4 | 0.1 |
456 | 2 | B | XY | ... | 5 | 0.2 |
789 | 1 | B | XY | ... | 5 | 0.3 |
101 | 2 | A | XX | ... | 4 | 0.1 |
...
I have client id, few categorical attributes and Y which is probability of an event which has values from 0 to 1 by 0.1.
I need to take a stratified sample in every group (so 10 folds) of Y of size of 200
I often use this to take a stratified sample when splitting into train/test:
def stratifiedSplit(X,y,size):
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=size, random_state=0)
for train_index, test_index in sss:
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
return X_train, X_test, y_train, y_test
But I don't know how to modify it in this case.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…