r - predict.lm with newdata

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I've built an lm model without using the data= parameter:

m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + 
                            gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))

Now I'd like to predict from m1 using newdata, so I've tried naming the columns of my new data.frame to match the variables used in the lm() call above.

With newComps standing in for gc.pc$scores (both it and the gc.tA prediction were computed from the new data.frame without any issues), I've tried each of the following naming strategies in turn:

newD <- data.frame( newComps[1:100,1:6] ,
                    predict(gc.tA , newdata = mdldvlp[1:100,predKept]))


names(newD) <- names(m1$coefficients)[-1]
names(newD) <- names(m1$model)[-1]

names(newD) <- c( "gc.pc$scores[, 1]" , "gc.pc$scores[, 2]" , "gc.pc$scores[, 3]" , 
                  "gc.pc$scores[, 4]" , "gc.pc$scores[, 5]" , "gc.pc$scores[, 6]" , 
                  "predict(gc.tA)" )
names(newD) <- c( "gc.pc$scores[,1]" , "gc.pc$scores[,2]" , "gc.pc$scores[,3]" , 
                  "gc.pc$scores[,4]" , "gc.pc$scores[,5]" , "gc.pc$scores[,6]" , 
                  "predict(gc.tA)" )

Unfortunately, predict.lm does not accept the naming strategies above and returns the dreaded newdata warning along with the predictions from the original data.frame that built m1:

Warning message:
'newdata' had 100 rows but variable(s) found have 1414 rows

How should I name the newD columns to make the predict call work? Thanks.

The code below recreates the issue:

    require(rpart)

    set.seed(123)
    X <- matrix(runif(200) , 20 , 10)
    gc.pc <- princomp(X)
    y <- runif(20)
    mdldvlp.trim <- data.frame(y,X)
    names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
    predKept <- paste("x",1:10,sep="")

    gc.tA <- rpart( y ~ . , data = mdldvlp.trim)

    m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + 
                                gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))

    mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
    names(mdldvlp) <- predKept

    newComps <- predict( gc.pc , newdata=mdldvlp )

    newD <- data.frame( newComps[1:100,1:6] ,
                        predict(gc.tA , newdata = mdldvlp[1:100,predKept]))

    # enter newD naming strategy here

    predict( m1 , newdata=newD )

4/20 Follow up:

Thanks all for your answers. I understand that things would be easier if I first created a data.frame with properly named predictors. My question is: if the modeling data frame does indeed evaluate to a data frame with variables named gc.pc$scores[,1] etc., then why won't the naming 'strategies' used above work with predict.lm? In other words, does lm really build its modeling data frame with variables named gc.pc$scores[,1] and so on? If it did, wouldn't the renaming strategies above work in predict.lm?

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:30:50+0000

You are abusing the formula notation and it is this that is causing you problems. Essentially your formula:

m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + 
                            gc.pc$scores[,3] + gc.pc$scores[,4] + 
                            gc.pc$scores[,5] + gc.pc$scores[,6] + 
                            predict(gc.tA))

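will evaluate to a model frame whose columns are labelled gc.pc$scores[, 1] etc. Those labels are just the deparsed expressions from the formula. When you call predict(), each expression is re-evaluated with newdata searched first, so R looks for an object named gc.pc (and one named gc.tA), not for a column whose name is the string "gc.pc$scores[, 1]". No such object exists in newdata, so the lookup falls through to your workspace and returns the original 1414-row data, whatever the columns of newdata are called. That is why none of the renaming strategies can work.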
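A minimal sketch of why copying the column labels is not enough. The objects pc, y, bad and newD are hypothetical, smaller stand-ins for the ones in the question; only base R is assumed.

```r
# Toy stand-ins for gc.pc and mdldvlp.trim$y (hypothetical, 20 rows).
set.seed(1)
pc <- list(scores = matrix(rnorm(40), 20, 2))
y  <- rnorm(20)

# Fit without data=, in the same style as the question.
bad <- lm(y ~ pc$scores[, 1] + pc$scores[, 2])

# The model frame columns are labelled with the deparsed expressions.
frame_names <- names(model.frame(bad))
print(frame_names)

# A 5-row newdata whose column names copy those labels exactly.
newD <- data.frame(rnorm(5), rnorm(5))
names(newD) <- frame_names[-1]

# predict() re-evaluates `pc$scores[, 1]`. It needs an object called `pc`,
# finds no column of that name in newD, and falls back to the workspace copy.
# The "'newdata' had 5 rows ..." warning is suppressed here on purpose.
p_bad <- suppressWarnings(predict(bad, newdata = newD))

# One value per row of the ORIGINAL data, not per row of newD.
print(length(p_bad))
```

The returned values are simply the fitted values of the original model, which matches the behaviour described in the question.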

Ideally, you'd create a data frame containing all the variables you want in the model, with appropriate names, e.g.:

fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")

and then fit the model via:

m1 <- lm(trimY ~ ., data = fitData)

New predictions can be made from the model by providing a data frame with the same names as used to fit the model. Hence using your newD:

newD <- data.frame(newComps[1:100,1:6] ,
                   predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")

and then call predict():

predict(m1 , newdata=newD)
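Since a mismatch only produces a warning, it can help to check the names before predicting. The helper below, check_newdata, is a hypothetical name introduced for illustration and is not part of base R; the data are toy stand-ins with the same column naming as fitData above.

```r
# Hypothetical helper: stop early if newdata lacks a variable the model needs.
check_newdata <- function(model, newdata) {
  # Variables on the right-hand side of the fitted formula.
  needed <- all.vars(delete.response(terms(model)))
  absent <- setdiff(needed, names(newdata))
  if (length(absent) > 0) {
    stop("newdata is missing: ", paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data with plain column names (assumed, smaller than the original).
set.seed(1)
fitData <- data.frame(trimY = rnorm(20), scores1 = rnorm(20), preds = rnorm(20))
m <- lm(trimY ~ ., data = fitData)

newD <- data.frame(scores1 = rnorm(5), preds = rnorm(5))
check_newdata(m, newD)            # passes silently
p <- predict(m, newdata = newD)
print(length(p))                  # one prediction per row of newD
```

Run against a model fitted without data=, the same check would report workspace objects such as gc.pc rather than column names, which is a quick way to spot the problem.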

Full example

library(rpart)

# Simulated training data: 20 rows, 10 predictors
set.seed(123)
X <- matrix(runif(200), 20, 10)
gc.pc <- princomp(X)
y <- runif(20)
mdldvlp.trim <- data.frame(y, X)
names(mdldvlp.trim) <- c("y", paste("x", 1:10, sep = ""))
predKept <- paste("x", 1:10, sep = "")

gc.tA <- rpart(y ~ ., data = mdldvlp.trim)

# Collect the response and all predictors in one data frame with plain names
fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")
m1 <- lm(trimY ~ ., data = fitData)

# Simulated new data: 200 rows with the same predictor names
mdldvlp <- data.frame(matrix(runif(2000), 200, 10))
names(mdldvlp) <- predKept

# Build newdata with the same column names that were used to fit m1
newComps <- predict(gc.pc, newdata = mdldvlp)
newD <- data.frame(newComps[1:100, 1:6],
                   predict(gc.tA, newdata = mdldvlp[1:100, predKept]))
names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")
predict(m1, newdata = newD)
