To make this specific for others reading the post, here's a small reproducible example using `pred_leaf=True`. Note that the behavior for `.predict()` and `.predict_proba()` is identical when you pass `pred_leaf=True`.
```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
clf = lgb.LGBMClassifier()
clf.fit(X, y)

# shape (n_samples, n_trees): entry [i, j] is the index of the leaf
# that row i falls into in tree j
leaf_preds = clf.predict(X, pred_leaf=True)
```
Examining the first 5 trees with `leaf_preds[:, :5]` gives output like the following:
```
array([[ 4,  2,  9,  2,  3],
       [ 4, 11,  6,  9,  9],
       [ 4, 11, 13, 12, 16],
       ...,
       [ 4,  4,  6,  9,  9],
       [ 4, 11, 14, 14, 11],
       [ 6,  6,  8,  6,  7]], dtype=int32)
```
If you pass the training data back into `.predict(pred_leaf=True)`, this output can help you understand whether training started hitting tree-level stopping criteria later in the process. For example, if you used `num_leaves=50` but the maximum of a column in the `pred_leaf` output for the training data is 25, that tells you that some iterations weren't able to find enough informative splits. You can also get at this type of information with `Booster.trees_to_dataframe()` if you'd prefer.
```python
clf.booster_.trees_to_dataframe()
```
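Here's a rough sketch of both checks, continuing from the example above. I'm assuming leaf indices are 0-based (so a maximum index of `k` means `k + 1` leaves were reached), and that leaf rows in the `trees_to_dataframe()` output show up with a null `split_feature`:

```python
# per-tree maximum leaf index reached by the training data; if these sit
# well below num_leaves, some trees stopped splitting early
max_leaf_per_tree = leaf_preds.max(axis=0)
print("smallest per-tree max leaf index:", max_leaf_per_tree.min())

# the same information from the model structure itself: count the rows
# per tree that are leaves (no split feature)
tree_df = clf.booster_.trees_to_dataframe()
leaves_per_tree = (
    tree_df[tree_df["split_feature"].isnull()]
    .groupby("tree_index")
    .size()
)
print(leaves_per_tree.describe())
```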
If you pass some evaluation data (data not seen during training) into `predict(pred_leaf=True)`, it can be used to detect differences between the training data and the evaluation data which might otherwise be hard to see. For example, if you use `predict(X_eval, pred_leaf=True)` with evaluation data that you think is representative, you can figure out how often each leaf node is matched (see the sketch below). If some leaf nodes match zero or very few of the evaluation records, that might give you confidence that a smaller model could perform just as well, which could be important if your deployment strategy is sensitive to model size.
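A minimal sketch of that leaf-matching count. `X_eval` is hypothetical here; the toy example above fit on all of `X`, so in practice you'd substitute a real held-out set:

```python
import numpy as np

# X_eval is a hypothetical held-out dataset, not part of the example above
eval_leaf_preds = clf.predict(X_eval, pred_leaf=True)

for tree_idx in range(eval_leaf_preds.shape[1]):
    # leaves reached by the training data but never by the evaluation data
    train_leaves = set(leaf_preds[:, tree_idx])
    eval_leaves = set(eval_leaf_preds[:, tree_idx])
    never_matched = train_leaves - eval_leaves
    if never_matched:
        print(f"tree {tree_idx}: {len(never_matched)} leaves matched no eval rows")

# distribution of evaluation rows across the leaves of the first tree;
# counts of 0 are leaves that were never matched
print(np.bincount(eval_leaf_preds[:, 0], minlength=leaf_preds[:, 0].max() + 1))
```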
Alternatively, finding that most of the evaluation data falls into only a subset of the leaf nodes might be a sign of drift in your data: the data you're scoring on may come from a different distribution than the data the model was trained on, in which case you might benefit from re-training on newer data.