I am working with the Kaggle dataset linked below.
https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction
Given how imbalanced the target is, I am running a balanced random forest classifier. However, the code below gives me 100% accuracy, recall and precision, so it is certainly incorrect.
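For reference, the imbalance can be seen by just looking at the class proportions of the Response column in train.csv (same file and path as in the code further down), roughly like this:

import pandas as pd
df_train = pd.read_csv('raw Data/train.csv')
# Proportion of each class in the target column
print(df_train['Response'].value_counts(normalize=True))

The full modelling code is: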
import pandas as pd
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
# import data
path = 'raw Data/'
df_train = pd.read_csv(path + 'train.csv')
df_train.head(3)
# Separate features from label
X = df_train.drop(columns=['Response'])
y = df_train['Response']
# Get dummy variables
X = pd.get_dummies(df_train, columns=['Gender', 'Region_Code', 'Vehicle_Age', 'Vehicle_Damage', 'Policy_Sales_Channel'], drop_first=True)
# Split data: 80% train, 20% test, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# Run model
brfc = BalancedRandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
print("F1 Score for Balanced Random Forest Classifier is ", f1_score(y_test,brfc.predict(X_test)))