Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
356 views
in Technique[技术] by (71.8m points)

statistics - R Random Forests Variable Importance

I am trying to use the random forests package for classification in R.

The Variable Importance Measures listed are:

  • mean raw importance score of variable x for class 0
  • mean raw importance score of variable x for class 1
  • MeanDecreaseAccuracy
  • MeanDecreaseGini

Now I know what these "mean" as in I know their definitions. What I want to know is how to use them.

What I really want to know is what these values mean in only the context of how accurate they are, what is a good value, what is a bad value, what are the maximums and minimums, etc.

If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does that mean it is important or unimportant? Also any information on raw scores could be useful too. I want to know everything there is to know about these numbers that is relevant to the application of them.

An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful then a simpler explanation that didn't involve any discussion of how random forests works.

Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful then a simpler explanation that didn't involve any discussion of how random forests works.

Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.

How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly parameters and related performance issues with Random Forests are difficult to get your head around even if you understand some technical terms.

Here's my shot at some answers:

-mean raw importance score of variable x for class 0

-mean raw importance score of variable x for class 1

Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.

-MeanDecreaseAccuracy

I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.

-MeanDecreaseGini

Gini is defined as "inequity" when used in describing a society's distribution of income, or a measure of "node impurity" in tree-based classification. A low Gini (i.e. higher descrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...