Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
334 views
in Technique[技术] by (71.8m points)

r - Difference of prediction results in random forest model

I have built an Random Forest model and I got two different prediction results when I wrote two different lines of code in order to generate the prediction. I wonder which one is the right one. Here is my example dataframe and the usedcode:

dat <- read.table(text = " cats birds    wolfs     snakes
      0        3        9         7
      1        3        8         4
      1        1        2         8
      0        1        2         3
      0        1        8         3
      1        6        1         2
      0        6        7         1
      1        6        1         5
      0        5        9         7
      1        3        8         7
      1        4        2         7
      0        1        2         3
      0        7        6         3
      1        6        1         1
      0        6        3         9
      1        6        1         1   ",header = TRUE)

I've built a random forest model:

model<-randomForest(snakes~cats+birds+wolfs,data=dat,ntree=20)
RF_pred<- data.frame(predict(model))
train<-cbind(train,RF_pred) # this gave me a predictive results named: "predict.model."

I tryed another syntax out of curiosity with this line of code:

dat$RF_pred<-predict(model,newdata=dat,type='response') # this gave me a predictive results named: "RF_pred"

to my suprise I got other predictive results:

 dat
   cats birds wolfs snakes predict.model.  RF_pred
1     0     3     9      7       3.513889 5.400675
2     1     3     8      4       5.570000 5.295417
3     1     1     2      8       3.928571 5.092917
4     0     1     2      3       4.925893 4.208452
5     0     1     8      3       4.583333 4.014008
6     1     6     1      2       3.766667 2.943750
7     0     6     7      1       5.486806 4.061508
8     1     6     1      5       3.098148 2.943750
9     0     5     9      7       4.575397 5.675675
10    1     3     8      7       4.729167 5.295417
11    1     4     2      7       4.416667 5.567917
12    0     1     2      3       4.222619 4.208452
13    0     7     6      3       6.125714 4.036508
14    1     6     1      1       3.695833 2.943750
15    0     6     3      9       4.115079 5.178175
16    1     6     1      1       3.595238 2.943750

Why Is there a diff. between the two? Which one is the correct one? Any Ideas?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The difference is in the two calls to predict:

predict(model)

and

predict(model, newdata=dat)

The first option gets the out-of-bag predictions on your training data from the random forest. This is generally what you want, when comparing predicted values to actuals.

The second treats your training data as if it was a new dataset, and runs the observations down each tree. This will result in an artificially close correlation between the predictions and the actuals, since the RF algorithm generally doesn't prune the individual trees, relying instead on the ensemble of trees to control overfitting. So don't do this if you want to get predictions on the training data.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...