r - lm and predict - agreement of data.frame names

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

Working in R to develop regression models, I have something akin to this:

c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)

and every single time, I get a mysterious warning from R:

Warning message:
'newdata' had 34 rows but variables found have 142 rows

which essentially means that R cannot find the independent column in the testset data.frame. This happens because predict looks for the exact name used on the right-hand side of the formula in lm. To work around it, I can do this:

tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)

tempset = testset
c_pred = predict(c_lm, tempset$independent)

or some similar variation, but this is really sloppy, in my opinion.

Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:24:42+0000

No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:

c_lm = lm(trainingset$dependent ~ trainingset$independent)

You repeat trainingset twice, which is a waste of fingers/time, redundant, and, not least, the cause of the problem you are hitting. When you now call predict, it looks for a variable in testset with the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict():

c_lm = lm(dependent ~ independent, data = trainingset)

Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
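To make that concrete, here is a minimal sketch of the whole fit-and-predict cycle. The data are simulated purely for illustration (142 training rows and 34 test rows, chosen to match the counts in the warning), and othertest with its column x is a hypothetical second test set, not something from the question.

```r
set.seed(1)

# Simulated stand-ins for the question's data frames (assumed, not the real data)
trainingset <- data.frame(independent = runif(142))
trainingset$dependent <- 2 + 3 * trainingset$independent + rnorm(142, sd = 0.1)
testset <- data.frame(independent = runif(34))

# Fit with bare variable names and let `data` say where they live
c_lm <- lm(dependent ~ independent, data = trainingset)

# predict() looks up `independent` inside newdata, so any data frame
# with a column of that name works
c_pred <- predict(c_lm, newdata = testset)
length(c_pred)  # 34, one prediction per row of testset

# If a test set uses a different column name (hypothetical `x` here),
# wrap it in a data frame with the name the formula expects
othertest <- data.frame(x = runif(5))
other_pred <- predict(c_lm, newdata = data.frame(independent = othertest$x))
length(other_pred)  # 5
```

The name of the data frame itself never matters to predict(); only the column names inside newdata have to match the variables in the formula.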

An additional reason to write formulas as I show them is legibility. Getting the object name out of the formula lets you see more easily what the model is.
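If you want to see for yourself what the trainingset$ style of formula does, the sketch below reproduces the situation from the question on simulated data (the data frames are made up for illustration). It inspects the term label stored in the model and shows that predict() silently falls back to the training data.

```r
set.seed(1)

# Simulated stand-ins for the question's data frames (assumed, not the real data)
trainingset <- data.frame(independent = runif(142))
trainingset$dependent <- 2 + 3 * trainingset$independent + rnorm(142, sd = 0.1)
testset <- data.frame(independent = runif(34))

# The pattern from the question: data frame name baked into the formula
bad_lm <- lm(trainingset$dependent ~ trainingset$independent)

# The model's only predictor term is the whole expression, not `independent`
bad_label <- attr(terms(bad_lm), "term.labels")
bad_label  # "trainingset$independent"

# testset has no such variable, so the expression is evaluated in the
# formula's environment and picks up the 142 training rows again.
# The warning from the question is suppressed here to keep the output clean.
bad_pred <- suppressWarnings(predict(bad_lm, newdata = testset))
length(bad_pred)  # 142, not 34
```

In other words, the values returned are just the fitted values for the training data, which is why the warning should not be ignored.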
