I am trying to build a classifier with LightGBM on a very imbalanced dataset. Imbalance is in the ratio 97:3
, i.e.:
Class
0 0.970691
1 0.029309
Params I used and the code for training is as shown below.
lgb_params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric':'auc',
'learning_rate': 0.1,
'is_unbalance': 'true', #because training data is unbalance (replaced with scale_pos_weight)
'num_leaves': 31, # we should let it be smaller than 2^(max_depth)
'max_depth': 6, # -1 means no limit
'subsample' : 0.78
}
# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10,
verbose_eval=10, early_stopping_rounds=40)
nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)
model = lgb.train(lgb_params, dtrain, num_boost_round=nround)
preds = model.predict(test_feats)
preds = [1 if x >= 0.5 else 0 for x in preds]
I ran CV to get the best model and best round. I got 0.994 AUC on CV and similar score in Validation set.
But when I am predicting on the test set I am getting very bad results. I am sure that the train set is sampled perfectly.
What parameters are needed to be tuned.? What is the reason for the problem.? Should I resample the dataset such that the highest class is reduced.?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…