I am a bit lost on how to build an ML classifier with imbalanced data (80:20). The dataset has 30 columns; the target is Label.
I want to predict the major class.
I am trying to reproduce the following steps:
- Split the data into train/test
- Perform CV on the training set
- Apply undersampling only on a test fold
- After the model has been chosen with the help of CV, undersample the train set and train the classifier
- Estimate the performance on the untouched test set (recall)
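For reference, the steps above could be sketched roughly as follows. This is only a minimal sketch under several assumptions: the data is a synthetic stand-in generated with make_classification (not the real df), the undersample helper is a hypothetical hand-written function, and the resampling inside CV is applied to the fitting part of each fold while the held-out part is left untouched. Adjust if the steps intend something different.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier


def undersample(X, y, rng):
    """Hypothetical helper: randomly drop rows so that every class
    ends up with as many rows as the rarest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Assumption: synthetic data standing in for df (30 features, roughly 80:20).
X, y = make_classification(n_samples=500, n_features=30,
                           weights=[0.8, 0.2], random_state=12)
y = np.where(y == 1, 1, -1)  # use the labels -1 / 1 as in the question
rng = np.random.default_rng(12)

# Step 1: stratified train/test split; the test set is not touched again
# until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12)

# Steps 2-3: CV on the training set, resampling only the fitting part
# of each fold and scoring on the untouched held-out part.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=12)
fold_recalls = []
for fit_idx, val_idx in skf.split(X_train, y_train):
    X_fit, y_fit = undersample(X_train[fit_idx], y_train[fit_idx], rng)
    model = DecisionTreeClassifier(max_depth=5, random_state=12)
    model.fit(X_fit, y_fit)
    y_val_pred = model.predict(X_train[val_idx])
    # Per-class recall, ordered as [-1, 1]
    fold_recalls.append(recall_score(y_train[val_idx], y_val_pred,
                                     labels=[-1, 1], average=None))
print("mean CV recall per class [-1, 1]:", np.mean(fold_recalls, axis=0))

# Step 4: undersample the whole training set and refit the chosen model.
X_bal, y_bal = undersample(X_train, y_train, rng)
final_model = DecisionTreeClassifier(max_depth=5, random_state=12)
final_model.fit(X_bal, y_bal)

# Step 5: recall on the untouched test set.
test_recall = recall_score(y_test, final_model.predict(X_test),
                           labels=[-1, 1], average=None)
print("test recall per class [-1, 1]:", test_recall)
```

The loop is written by hand so that it is explicit which rows are resampled. A pipeline from a resampling library would be an alternative, but that is not shown here.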
What I have done is shown below:
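from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Separate the target from the features
y = df['Label']
X = df.drop('Label', axis=1)
X.shape, y.shape
# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
X_train.shape, X_test.shape
# Fit a shallow decision tree on the training set
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
y_test_tree = tree.predict(X_test)
y_train_tree = tree.predict(X_train)
# Accuracy on the training and test sets
acc_train_tree = accuracy_score(y_train, y_train_tree)
acc_test_tree = accuracy_score(y_test, y_test_tree)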
I have some doubts about how to perform CV on the training set, apply undersampling on a test fold, and then undersample the training set and train the classifier.
Are you familiar with these steps? If you are, I would appreciate your help.
If I do as follows:
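from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Separate the target from the features
y = df['Label']
X = df.drop('Label', axis=1)
X.shape, y.shape
# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
X_train.shape, X_test.shape
# Fit a shallow decision tree on the training set
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
y_test_tree = tree.predict(X_test)
y_train_tree = tree.predict(X_train)
acc_train_tree = accuracy_score(y_train, y_train_tree)
acc_test_tree = accuracy_score(y_test, y_test_tree)
# CV on the training set only (3 folds)
scores = cross_val_score(tree, X_train, y_train, cv=3, scoring="accuracy")
ypred = cross_val_predict(tree, X_train, y_train, cv=3)
print(classification_report(y_train, ypred))
accuracy_score(y_train, ypred)
confusion_matrix(y_train, ypred)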
I get this output
              precision    recall  f1-score   support

          -1       0.73      0.99      0.84       291
           1       0.00      0.00      0.00       105

    accuracy                           0.73       396
   macro avg       0.37      0.50      0.42       396
weighted avg       0.54      0.73      0.62       396
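One way to read the report above: going only by the support column (291 rows of -1 and 105 rows of 1), a model that always predicts -1 would already reach about the same accuracy. A quick arithmetic check, using nothing but those two support counts:

```python
# Support counts copied from the classification report above
support = {-1: 291, 1: 105}
total = sum(support.values())

# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = max(support.values()) / total
print(round(majority_baseline, 3))  # 0.735
```

So the reported accuracy of 0.73 is about what a constant "-1" prediction would give, which fits the 0.00 recall for class 1.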
I guess I have missed something in the code above or am doing something wrong.
Sample of data:
Have_0  Have_1  Have_2  Have_letters  Label
     1       0       1             1      1
     0       0       0             1     -1
     1       1       1             1     -1
     0       1       0             0      1
     1       1       0             0      1
     1       0       0             1     -1
     1       0       0             0      1