python - Stratified splitting of pandas dataframe into training, validation and test set

Question

posted Oct 24, 2021 in Technique by 深蓝

The following extremely simplified DataFrame represents a much larger DataFrame containing medical diagnoses:

import pandas as pd

medicalData = pd.DataFrame({'diagnosis':['positive','positive','negative','negative','positive','negative','negative','negative','negative','negative']})
medicalData

    diagnosis
0   positive
1   positive
2   negative
3   negative
4   positive
5   negative
6   negative
7   negative
8   negative
9   negative

Problem: For machine learning, I need to randomly split this dataframe into three subframes in the following way:

trainingDF, validationDF, testDF = SplitData(medicalData, fractions=[0.6, 0.2, 0.2])

...where the split array specifies the fraction of the complete data that goes into each subframe.

The data in the subframes needs to be mutually exclusive, and the split array (fractions) needs to sum to one.
Additionally, the fraction of positive diagnoses in each subset needs to be approximately the same.
Answers to this question recommend using the pandas sample method or the train_test_split function from sklearn. But none of these solutions seems to generalize well to n splits, and none provides a stratified split.

1 Reply

深蓝 · Answer 1 · 2021-10-23T20:00:58+0000

`np.array_split`

If you want to generalise to n splits, np.array_split is your friend (it works well with DataFrames). Note that shuffling and slicing like this gives a random split, not a stratified one.

import numpy as np

fractions = np.array([0.6, 0.2, 0.2])
# shuffle your input
df = df.sample(frac=1)
# cut at the cumulative fractions (60% and 80% of the rows) to get 3 parts
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))
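
One way to keep the n-way generality and still stratify is to apply the same cumulative-fraction cuts inside each class and then combine the pieces. The following is only a sketch under that assumption: `stratified_split`, `label_col` and `seed` are names made up for illustration, not part of pandas or NumPy, and the cut points are rounded rather than truncated.

```python
import numpy as np
import pandas as pd


def stratified_split(df, label_col, fractions, seed=0):
    """Split df into len(fractions) disjoint frames, stratified by label_col.

    Hypothetical helper for illustration only.
    """
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to 1")

    rng = np.random.default_rng(seed)
    parts = [[] for _ in fractions]

    # GroupBy.indices maps each label to the integer positions of its rows,
    # so every class is shuffled and cut on its own.
    for positions in df.groupby(label_col).indices.values():
        positions = rng.permutation(positions)
        # Cut points inside this class, e.g. 60% and 80% of its rows.
        cuts = np.round(fractions[:-1].cumsum() * len(positions)).astype(int)
        for part, chunk in zip(parts, np.split(positions, cuts)):
            part.extend(chunk.tolist())

    # Shuffle inside each subset so its rows are not grouped by class.
    return [df.iloc[rng.permutation(np.array(p, dtype=int))] for p in parts]
```

Called as `train, val, test = stratified_split(medicalData, 'diagnosis', [0.6, 0.2, 0.2])`, it returns one DataFrame per fraction. Because the cut points are rounded per class, very small classes (such as the three positives in the toy frame) can only be divided approximately.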

`train_test_split`

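A more long-winded solution using train_test_split for stratified splitting: split twice, first into a training set and a held-out remainder, then split the remainder in half.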

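from sklearn.model_selection import train_test_split

# note: pop() removes the label column from df in place
y = df.pop('diagnosis').to_frame()
X = df

# 60% train, 40% held out, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.4)

# split the held-out 40% in half: 20% test, 20% validation
X_test, X_val, y_test, y_val = train_test_split(
        X_test, y_test, stratify=y_test, test_size=0.5)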

Where X is a DataFrame of your features, and y is a single-columned DataFrame of your labels.
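
If you would rather get whole DataFrames back, as in the `SplitData` call from the question, the two calls can be wrapped in a small function. This is a rough sketch rather than a definitive implementation: `train_val_test_split`, `label_col`, `val_size` and `seed` are hypothetical names, and it assumes every class has enough rows for both stratified splits (scikit-learn raises an error when a class has fewer than two members).

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_split(df, label_col, val_size=0.2, test_size=0.2, seed=0):
    """Hypothetical wrapper: two stratified train_test_split calls on a frame."""
    # First cut off the training set; `rest` holds validation + test rows.
    train, rest = train_test_split(
        df,
        test_size=val_size + test_size,
        stratify=df[label_col],
        random_state=seed,
    )
    # Then divide the remainder; this test_size is relative to `rest`.
    val, test = train_test_split(
        rest,
        test_size=test_size / (val_size + test_size),
        stratify=rest[label_col],
        random_state=seed,
    )
    return train, val, test
```

Unlike the `pop`-based version, this leaves the label column in place, so each returned frame still contains both features and labels.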
