Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Mapping the index of the feature importances to the index of columns in a dataframe

Hello, I plotted a graph using the feature importances from xgboost. However, the graph labels the bars with "f-values" (f20, f22, ...), so I cannot tell which feature each bar represents. One approach I have heard of is to map the index of each "f-value" to the index of the corresponding column in my dataframe and select the columns manually. How do I go about doing this? If there is another way of doing this, help would be truly appreciated.

Here is my code below:

# Older xgboost releases exposed this as model.booster(); current ones use get_booster()
feature_importance = pd.Series(model.get_booster().get_fscore()).sort_values(ascending=False)
feature_importance.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')

Here is the graph: [image: "Feature Importances" bar chart, bars labeled with f-values]

print(feature_importance.head())

Output: 
f20     320
f22      85
f29      67
f34      38
f81      20

1 Reply

I tried a simple example to see what is going on. Here is the code I wrote:

import pandas as pd
import xgboost as xgb
import numpy as np

import matplotlib.pyplot as plt
# In a Jupyter notebook, also run: %matplotlib inline

model = xgb.XGBRegressor()

size = 100

data = pd.DataFrame([], columns=['a','b','c','target'])
data['a'] = np.random.rand(size)
data['b'] = np.random.rand(size)
data['c'] = np.random.rand(size)

data['target'] = np.random.rand(size)*data['a'] + data['b']

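# Fit on the dataframe itself so xgboost can pick up the column names
model.fit(data.drop(columns='target'), data['target'])

# Older xgboost releases exposed this as model.booster(); current ones use get_booster()
feature_importance = pd.Series(model.get_booster().get_fscore()).sort_values(ascending=False)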
feature_importance.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')

The result is:

[image: "Feature Importances" bar chart, bars labeled with the column names]

As you can see, the labels are fine.

Now let's pass an array instead of a dataframe:

# Converting to a NumPy array discards the column names
model.fit(np.array(data.drop(columns='target')), data['target'])

feature_importance = pd.Series(model.get_booster().get_fscore()).sort_values(ascending=False)
feature_importance.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')

[image: "Feature Importances" bar chart, bars labeled f0, f1, ...]

Hence your problem: a NumPy array has no index or column names, so xgboost falls back to default feature names (f0, f1, ...), numbered in column order. Fit the model on the dataframe itself to keep the column names.
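
If the model has already been trained on an array, the default names can still be translated back by position, since the number after "f" is the column index of the training data. Below is a minimal sketch of that mapping. The helper name map_fscore_to_columns, the scores, and the column list are all made up for illustration; in practice the scores would come from get_fscore() and the columns from the dataframe used to build the training array, in the same order.

```python
def map_fscore_to_columns(fscore, columns):
    """Rename default 'f<i>' keys to columns[i], sorted by score (descending)."""
    columns = list(columns)
    renamed = {}
    for key, score in fscore.items():
        position = key[1:]
        if key.startswith("f") and position.isdigit() and int(position) < len(columns):
            renamed[columns[int(position)]] = score
        else:
            # Not a default name (or out of range): keep the key as it is.
            renamed[key] = score
    return dict(sorted(renamed.items(), key=lambda item: item[1], reverse=True))


# Hypothetical scores standing in for model.get_booster().get_fscore()
fscore = {"f0": 12, "f2": 40, "f1": 7}
# Hypothetical column list; in practice use the training dataframe's columns
columns = ["a", "b", "c"]

named_importance = map_fscore_to_columns(fscore, columns)
print(named_importance)
```

Wrapping the result in pd.Series(named_importance) gives a series that can be plotted the same way as feature_importance. The mapping is only valid if the column order matches the order of the array that was passed to fit.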

