To make this specific for others reading the post, here's a small reproducible example using `pred_leaf=True`. Note that the behavior for `.predict()` and `.predict_proba()` is identical when you pass `pred_leaf=True`.
```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
clf = lgb.LGBMClassifier()
clf.fit(X, y)

# shape (n_samples, n_trees): entry [i, j] is the index of the leaf
# that row i falls into in tree j
leaf_preds = clf.predict(X, pred_leaf=True)
```
Examining the first 5 trees with `leaf_preds[:, :5]` gives output like the following:
```
array([[ 4,  2,  9,  2,  3],
       [ 4, 11,  6,  9,  9],
       [ 4, 11, 13, 12, 16],
       ...,
       [ 4,  4,  6,  9,  9],
       [ 4, 11, 14, 14, 11],
       [ 6,  6,  8,  6,  7]], dtype=int32)
```
If you pass the training data back into `.predict(pred_leaf=True)`, this output can help you understand whether training started hitting tree-level stopping criteria later in the process. For example, if you used `num_leaves=50` but the maximum of a column in the `pred_leaf` output for the training data is 25, that tells you that some iterations weren't able to find enough informative splits. You can also get at this type of information with `Booster.trees_to_dataframe()` if you'd prefer.
```python
clf.booster_.trees_to_dataframe()
```
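Here's a rough sketch of both checks, continuing from the example above. I'm assuming leaf indices are 0-based (so a maximum index of `k` means `k + 1` leaves were reached), and that leaf rows in the `trees_to_dataframe()` output show up with a null `split_feature`:

```python
# per-tree maximum leaf index reached by the training data; if these sit
# well below num_leaves, some trees stopped splitting early
max_leaf_per_tree = leaf_preds.max(axis=0)
print("smallest per-tree max leaf index:", max_leaf_per_tree.min())

# the same information from the model structure itself: count the rows
# per tree that are leaves (no split feature)
tree_df = clf.booster_.trees_to_dataframe()
leaves_per_tree = (
    tree_df[tree_df["split_feature"].isnull()]
    .groupby("tree_index")
    .size()
)
print(leaves_per_tree.describe())
```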
If you pass some evaluation data (data not seen during training) into `predict(pred_leaf=True)`, it can be used to detect differences between the training data and the evaluation data which might otherwise be hard to see. For example, if you use `predict(X_eval, pred_leaf=True)` with evaluation data that you think is representative, you can figure out how often each leaf node is matched (see the sketch below). If some leaf nodes match zero or very few of the evaluation records, that might give you confidence that a smaller model could perform just as well, which could be important if your deployment strategy is sensitive to model size.
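A minimal sketch of that leaf-matching count. `X_eval` is hypothetical here; the toy example above fit on all of `X`, so in practice you'd substitute a real held-out set:

```python
import numpy as np

# X_eval is a hypothetical held-out dataset, not part of the example above
eval_leaf_preds = clf.predict(X_eval, pred_leaf=True)

for tree_idx in range(eval_leaf_preds.shape[1]):
    # leaves reached by the training data but never by the evaluation data
    train_leaves = set(leaf_preds[:, tree_idx])
    eval_leaves = set(eval_leaf_preds[:, tree_idx])
    never_matched = train_leaves - eval_leaves
    if never_matched:
        print(f"tree {tree_idx}: {len(never_matched)} leaves matched no eval rows")

# distribution of evaluation rows across the leaves of the first tree;
# counts of 0 are leaves that were never matched
print(np.bincount(eval_leaf_preds[:, 0], minlength=leaf_preds[:, 0].max() + 1))
```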
Alternatively, finding that most of the evaluation data falls into only a subset of the leaf nodes might be a sign of drift in your data: the data you're scoring on may come from a different distribution than the data the model was trained on, in which case you might benefit from re-training on newer data.