Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

r - How to get around error "factor has new levels" in cross-validation glm?

My goal is to use cross-validation to evaluate the performance of a linear model.

My problem is that my training and testing sets might not always have the same factor levels.

Here is a reproducible data example:

set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))

data <- data.frame(x,y,z)

summary(data)

Now let's make a glm model:

model_glm <- glm(x~., data = data)

And let's use cross-validation on this model:

library(boot)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)

And this is the kind of error output that you will get:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor z has new levels F

If you don't get this error, re-run the cross-validation; at some point you will get a similar error.

The nature of the problem here is that when you do cross-validation, the train and test subsets might not have the exact same variable levels. Here our variable z has three levels (D,E,F).

The full data set contains far more D's than E's and F's. So whenever the data is split for cross-validation, there is a very good chance that a rare level ends up entirely in the test subset, leaving the training subset with only D's (or only D's and E's). The missing levels are then dropped from the fitted model, and predicting on test rows that contain them raises the error. (This answer is helpful for understanding the problem: https://stackoverflow.com/a/51555998/10972294.)
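To make the mechanism concrete, here is a minimal sketch that assigns each row to one of K folds and lists, for every fold, the levels of z that appear in the test rows but not in the training rows. The fold assignment is my own illustration and is only assumed to resemble what cv.glm does internally; the names `fold`, `missing_in_train` and `bad_folds` are hypothetical.

```r
set.seed(1)
z <- rep(c("D", "E", "F"), times = c(997, 2, 1))

K <- 10
# random fold label (1..K) for every row, roughly equal fold sizes
fold <- sample(rep(seq_len(K), length.out = length(z)))

# for each fold: levels present in the held-out rows but absent from the rest
missing_in_train <- lapply(seq_len(K), function(k) {
  setdiff(unique(z[fold == k]), unique(z[fold != k]))
})

# folds in which prediction would meet a level the model never saw
bad_folds <- which(lengths(missing_in_train) > 0)
bad_folds
missing_in_train[bad_folds]
```

Because F occurs in a single row, whichever fold holds that row has no F left in its training part, so at least one fold is always affected with this split.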

My question is: how to avoid the drop in the first place?

If it is not possible, what are the alternatives?

(Keep in mind that this is a reproducible example; the actual data I am using has many variables like z, and I would like to avoid deleting them.)

1 Reply

To answer your question in the comment: I don't know whether an existing function does this. Most likely there is one, but I have no idea which package would contain it. For this example, the following function should work:

set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)

#optional tag row for later identification: 
#data$rowid<-1:nrow(data)

stratified <- function(df, column, percent){
  #split the data frame into groups based on column
  listdf <- split(df, df[[column]])
  testsubgroups <- lapply(listdf, function(x){
    #number of rows to sample from this group, rounded up so that
    #every level keeps at least one row
    numsamples <- ceiling(percent*nrow(x))
    #select the rows at random, without replacement
    whichones <- sample(seq_len(nrow(x)), numsamples, replace = FALSE)
    x[whichones, , drop = FALSE]
  })
  #combine the subgroups into one data frame
  do.call(rbind, testsubgroups)
}

testgroup<-stratified(data, "z", 0.8)

This just splits the initial data by column z. If you are interested in grouping by multiple columns, this could be extended by using the group_by function from the dplyr package, but that would be another question.
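As a rough sketch of how the stratified sample might be used, the block below treats the 80% sample as the training set and the remaining rows as the hold-out set, using the optional rowid column to tell them apart. The names `train`, `test`, `fit`, `pred` and `new_levels` are my own, and the compact `stratified()` here is a restatement of the same idea, not a different method.

```r
set.seed(1)
data <- data.frame(
  x = rnorm(1000),
  y = rep(c("A", "B"), times = c(500, 500)),
  z = rep(c("D", "E", "F"), times = c(997, 2, 1))
)
data$rowid <- seq_len(nrow(data))  # tag rows for later identification

# compact restatement of the stratified sampling idea
stratified <- function(df, column, percent) {
  groups <- split(df, df[[column]])
  picked <- lapply(groups, function(g) {
    n <- ceiling(percent * nrow(g))  # round up: every level keeps >= 1 row
    g[sample(seq_len(nrow(g)), n), , drop = FALSE]
  })
  do.call(rbind, picked)
}

train <- stratified(data, "z", 0.8)
test  <- data[!data$rowid %in% train$rowid, ]  # rows not sampled

# fit on the stratified sample, predict on the hold-out rows
fit  <- glm(x ~ y + z, data = train)
pred <- predict(fit, newdata = test)

# levels of z seen in test but not in train (should be none)
new_levels <- setdiff(unique(test$z), unique(train$z))
new_levels
```

Because the per-group count is rounded up, the two E rows and the single F row all land in the training set, so the hold-out set contains only D's. That avoids the error, but it also means the rare levels are never evaluated out of sample.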

Comment on the statistics: if you have just a few examples for any particular factor level, what type of fit do you expect? A poor fit with wide confidence limits.
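One way to see this, sketched on the example data: fit the model on all rows and compare the standard errors of the coefficients. The variable `se` is my own name; the column label "Std. Error" is the one used by summary() for a glm.

```r
set.seed(1)
data <- data.frame(
  x = rnorm(1000),
  y = rep(c("A", "B"), times = c(500, 500)),
  z = rep(c("D", "E", "F"), times = c(997, 2, 1))
)

fit <- glm(x ~ y + z, data = data)

# standard error of each coefficient; zE rests on 2 rows, zF on 1 row
se <- summary(fit)$coefficients[, "Std. Error"]
se
```

The standard errors for zE and zF come out far larger than the one for yB, which is estimated from hundreds of rows per level.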
