Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Lightgbm: intent of returning leaf_index

I am a beginner with LightGBM. LightGBM provides an input parameter, pred_leaf (False by default), which when enabled returns the indices of the leaves for all the trees built during training. So for a binary classifier with 200 trees, the predict_proba function returns an array of 200 * batch_size leaf indices (one row per sample, one column per tree). Although this does seem to provide some information about the model internals, I am not sure what to use it for. Can anyone please suggest how these leaf indices may be of help in interpreting or debugging the model?

Reference: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.predict_proba

Thanks

Question from: https://stackoverflow.com/questions/65879414/lightgbm-intent-of-returning-leaf-index


1 Reply


To make this specific for others reading the post, here's a small reproducible example using pred_leaf=True. Note that the behavior for .predict() and .predict_proba() is identical when you pass pred_leaf=True.

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
clf = lgb.LGBMClassifier()
clf.fit(X, y)

leaf_preds = clf.predict(X, pred_leaf=True)

Examining the first 5 trees with leaf_preds[:, :5] gives output like the following (one row per sample, one column per tree):

array([[ 4,  2,  9,  2,  3],
       [ 4, 11,  6,  9,  9],
       [ 4, 11, 13, 12, 16],
       ...,
       [ 4,  4,  6,  9,  9],
       [ 4, 11, 14, 14, 11],
       [ 6,  6,  8,  6,  7]], dtype=int32)

If you pass the training data back into .predict(pred_leaf=True), the output can help you understand whether training started hitting tree-level stopping criteria (such as min_child_samples or min_split_gain) in later iterations. Leaf indices are zero-based, so if you used num_leaves=50 but the maximum of a column in the pred_leaf output for the training data is 25, that tree has only 26 leaves, which tells you that iteration wasn't able to find enough informative splits to use its full leaf budget. You can also get this type of information from Booster.trees_to_dataframe() if you'd prefer.

clf.booster_.trees_to_dataframe()
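As a rough sketch of the per-tree check described above, the helpers below work directly on the pred_leaf matrix. The function names and the toy array are made up for illustration; in practice the input would be something like clf.predict(X, pred_leaf=True) on the training data. It assumes leaf indices are zero-based, so the largest index in a column plus one is taken as that tree's leaf count.

```python
import numpy as np

def leaves_used_per_tree(leaf_preds):
    """Leaf count per tree implied by pred_leaf output (rows = samples, columns = trees)."""
    leaf_preds = np.asarray(leaf_preds)
    # Leaf indices are zero-based, so the largest index seen plus one
    # is a lower bound on the number of leaves in that tree.
    return leaf_preds.max(axis=0) + 1

def trees_below_budget(leaf_preds, num_leaves):
    """Indices of trees whose implied leaf count is below num_leaves."""
    return np.flatnonzero(leaves_used_per_tree(leaf_preds) < num_leaves)

# Hand-made stand-in for clf.predict(X, pred_leaf=True): 4 samples, 3 trees.
toy_leaf_preds = np.array([
    [4, 2, 0],
    [4, 11, 1],
    [6, 6, 1],
    [0, 30, 0],
])

leaf_counts = leaves_used_per_tree(toy_leaf_preds)
short_trees = trees_below_budget(toy_leaf_preds, num_leaves=31)
print(leaf_counts.tolist())  # [7, 31, 2]
print(short_trees.tolist())  # [0, 2]
```

If the list of short trees is concentrated in the later iterations, that would be consistent with training running out of informative splits over time.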

If you pass some evaluation data (data not seen in training) into predict(pred_leaf=True), the output can reveal important differences between the training data and the evaluation data that might otherwise be hard to see. For example, if you call predict(X_eval, pred_leaf=True) on evaluation data that you think is representative, you can count how often each leaf node is matched. If some leaf nodes match zero or very few of the evaluation records, that might give you confidence that a smaller model could perform just as well, which could be important if your deployment strategy is sensitive to model size.
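One way to count how often each leaf is matched is sketched below. The helper names and toy arrays are hypothetical; the sketch assumes you take the number of leaves per tree from the training-data pred_leaf output (maximum index plus one) and then count hits from the evaluation-data output with numpy.bincount.

```python
import numpy as np

def leaf_hit_counts(leaf_preds, leaves_per_tree):
    """For each tree, count how many rows landed in each leaf."""
    leaf_preds = np.asarray(leaf_preds)
    return [
        np.bincount(leaf_preds[:, t], minlength=int(n))
        for t, n in enumerate(leaves_per_tree)
    ]

def rarely_hit_fraction(leaf_preds, leaves_per_tree, min_hits=1):
    """Fraction of all leaves (across trees) matched by fewer than min_hits rows."""
    counts = leaf_hit_counts(leaf_preds, leaves_per_tree)
    total_leaves = sum(len(c) for c in counts)
    rare_leaves = sum(int((c < min_hits).sum()) for c in counts)
    return rare_leaves / total_leaves

# Toy stand-ins for pred_leaf output on training and evaluation data (2 trees).
train_leaves = np.array([[0, 0], [1, 1], [2, 2], [3, 0]])
eval_leaves = np.array([[0, 0], [0, 1], [1, 1], [0, 0]])

# Leaf indices are zero-based, so max index + 1 gives the leaf count per tree.
leaves_per_tree = train_leaves.max(axis=0) + 1

hits = leaf_hit_counts(eval_leaves, leaves_per_tree)
unused = rarely_hit_fraction(eval_leaves, leaves_per_tree, min_hits=1)
print([h.tolist() for h in hits])  # [[3, 1, 0, 0], [2, 2, 0]]
print(unused)                      # 3 of the 7 leaves are never matched
```

A high fraction here is only a hint that the model may be larger than it needs to be; it would still be worth confirming by comparing a smaller model's metrics on the same evaluation data.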

Alternatively, that scenario might be a sign of drift in your data. If most of the evaluation data falls into only a subset of the leaf nodes, the data you're scoring may come from a different distribution than the data the model was trained on, and you might benefit from re-training on newer data.
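To put a rough number on that kind of drift, one option (my own choice of metric, not something LightGBM provides) is to compare, tree by tree, the share of rows landing in each leaf for the training data against the evaluation data, using the total variation distance. The function name and toy arrays below are hypothetical; both inputs are assumed to be pred_leaf outputs from the same model.

```python
import numpy as np

def leaf_distribution_shift(train_leaves, eval_leaves):
    """Per-tree total variation distance between train and eval leaf distributions.

    0.0 means both datasets spread over the leaves identically;
    1.0 means they never land in the same leaf.
    """
    train_leaves = np.asarray(train_leaves)
    eval_leaves = np.asarray(eval_leaves)
    n_trees = train_leaves.shape[1]
    shifts = np.empty(n_trees)
    for t in range(n_trees):
        # Zero-based leaf indices: size the histograms to cover both datasets.
        n_leaves = int(max(train_leaves[:, t].max(), eval_leaves[:, t].max())) + 1
        p = np.bincount(train_leaves[:, t], minlength=n_leaves) / len(train_leaves)
        q = np.bincount(eval_leaves[:, t], minlength=n_leaves) / len(eval_leaves)
        shifts[t] = 0.5 * np.abs(p - q).sum()
    return shifts

# Toy stand-ins for pred_leaf output (4 rows, 2 trees each).
train_leaves = np.array([[0, 0], [1, 1], [2, 0], [3, 1]])
eval_leaves = np.array([[0, 0], [0, 1], [0, 0], [1, 1]])

shifts = leaf_distribution_shift(train_leaves, eval_leaves)
print(shifts.tolist())  # [0.5, 0.0]
```

There is no universal threshold for "too much" shift; tracking these values over successive scoring batches and watching for a sustained increase is probably more informative than any single number.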

