Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
446 views
in Technique[技术] by (71.8m points)

r - How to debug "contrasts can be applied only to factors with 2 or more levels" error?

Here are all the variables I'm working with:

str(ad.train)
$ Date                : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24 29 31 34 ...
 $ Team                : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Season              : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Round               : Factor w/ 28 levels "EF","GF","PF",..: 5 16 21 22 23 24 25 26 27 6 ...
 $ Score               : int  137 82 84 96 110 99 122 124 49 111 ...
 $ Margin              : int  69 18 -56 46 19 5 50 69 -26 29 ...
 $ WinLoss             : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 2 ...
 $ Opposition          : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 8 18 10 9 13 16 7 3 4 6 ...
 $ Venue               : Factor w/ 19 levels "Adelaide Oval",..: 4 7 10 7 7 13 7 6 7 15 ...
 $ Disposals           : int  406 360 304 370 359 362 365 345 324 351 ...
 $ Kicks               : int  252 215 170 225 221 218 224 230 205 215 ...
 $ Marks               : int  109 102 52 41 95 78 93 110 69 85 ...
 $ Handballs           : int  154 145 134 145 138 144 141 115 119 136 ...
 $ Goals               : int  19 11 12 13 16 15 19 19 6 17 ...
 $ Behinds             : int  19 14 9 16 11 6 7 9 12 6 ...
 $ Hitouts             : int  42 41 34 47 45 70 48 54 46 34 ...
 $ Tackles             : int  73 53 51 76 65 63 65 67 77 58 ...
 $ Rebound50s          : int  28 34 23 24 32 48 39 31 34 29 ...
 $ Inside50s           : int  73 49 49 56 61 45 47 50 49 48 ...
 $ Clearances          : int  39 33 38 52 37 43 43 48 37 52 ...
 $ Clangers            : int  47 38 44 62 49 46 32 24 31 41 ...
 $ FreesFor            : int  15 14 15 18 17 15 19 14 18 20 ...
 $ ContendedPossessions: int  152 141 149 192 138 164 148 151 160 155 ...
 $ ContestedMarks      : int  10 16 11 3 12 12 17 14 15 11 ...
 $ MarksInside50       : int  16 13 10 8 12 9 14 13 6 12 ...
 $ OnePercenters       : int  42 54 30 58 24 56 32 53 50 57 ...
 $ Bounces             : int  1 6 4 4 1 7 11 14 0 4 ...
 $ GoalAssists         : int  15 6 9 10 9 12 13 14 5 14 ...

Here's the glm I'm trying to fit:

ad.glm.all <- glm(WinLoss ~ factor(Team) + Season  + Round + Score  + Margin + Opposition + Venue + Disposals + Kicks + Marks + Handballs + Goals + Behinds + Hitouts + Tackles + Rebound50s + Inside50s+ Clearances+ Clangers+ FreesFor + ContendedPossessions + ContestedMarks + MarksInside50 + OnePercenters + Bounces+GoalAssists, 
                  data = ad.train, family = binomial(logit))

I know it's a lot of variables (plan is to reduce via forward variable selection). But even know it's a lot of variables they're either int or Factor; which as I understand things should just work with a glm. However, every time I try to fit this model I get:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Which sort of looks to me as if R isn't treating my Factor variables as Factor variables for some reason?

Even something as simple as:

ad.glm.test <- glm(WinLoss ~ factor(Team), data = ad.train, family = binomial(logit))

isn't working! (same error message)

Where as this:

ad.glm.test <- glm(WinLoss ~ Clearances, data = ad.train, family = binomial(logit))

Will work!

Anyone know what's going on here? Why can't I fit these Factor variables to my glm??

Thanks in advance!

-Troy

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Introduction

What a "contrasts error" is has been well explained: you have a factor that only has one level (or less). But in reality this simple fact can be easily obscured because the data that are actually used for model fitting can be very different from what you've passed in. This happens when you have NA in your data, you've subsetted your data, a factor has unused levels, or you've transformed your variables and get NaN somewhere. You are rarely in this ideal situation where a single-level factor can be spotted from str(your_data_frame) directly. Many questions on StackOverflow regarding this error are not reproducible, thus suggestions by people may or may not work. Therefore, although there are by now 118 posts regarding this issue, users still can't find an adaptive solution so that this question is raised again and again. This answer is my attempt, to solve this matter "once for all", or at least to provide a reasonable guide.

This answer has rich information, so let me first make a quick summary.

I defined 3 helper functions for you: debug_contr_error, debug_contr_error2, NA_preproc.

I recommend you use them in the following way.

  1. run NA_preproc to get more complete cases;
  2. run your model, and if you get a "contrasts error", use debug_contr_error2 for debugging.

Most of the answer shows you step by step how & why these functions are defined. There is probably no harm to skip those development process, but don't skip sections from "Reproducible case studies and Discussions".


Revised answer

The original answer works perfectly for OP, and has successfully helped some others. But it had failed somewhere else for lack of adaptiveness. Look at the output of str(ad.train) in the question. OP's variables are numeric or factors; there are no characters. The original answer was for this situation. If you have character variables, although they will be coerced to factors during lm and glm fitting, they won't be reported by the code since they were not provided as factors so is.factor will miss them. In this expansion I will make the original answer both more adaptive.

Let dat be your dataset passed to lm or glm. If you don't readily have such a data frame, that is, all your variables are scattered in the global environment, you need to gather them into a data frame. The following may not be the best way but it works.

## `form` is your model formula, here is an example
y <- x1 <- x2 <- x3 <- 1:4
x4 <- matrix(1:8, 4)
form <- y ~ bs(x1) + poly(x2) + I(1 / x3) + x4

## to gather variables `model.frame.default(form)` is the easiest way 
## but it does too much: it drops `NA` and transforms variables
## we want something more primitive

## first get variable names
vn <- all.vars(form)
#[1] "y"  "x1" "x2" "x3" "x4"

## `get_all_vars(form)` gets you a data frame
## but it is buggy for matrix variables so don't use it
## instead, first use `mget` to gather variables into a list
lst <- mget(vn)

## don't do `data.frame(lst)`; it is buggy with matrix variables
## need to first protect matrix variables by `I()` then do `data.frame`
lst_protect <- lapply(lst, function (x) if (is.matrix(x)) I(x) else x)
dat <- data.frame(lst_protect)
str(dat)
#'data.frame':  4 obs. of  5 variables:
# $ y : int  1 2 3 4
# $ x1: int  1 2 3 4
# $ x2: int  1 2 3 4
# $ x3: int  1 2 3 4
# $ x4: 'AsIs' int [1:4, 1:2] 1 2 3 4 5 6 7 8

## note the 'AsIs' for matrix variable `x4`
## in comparison, try the following buggy ones yourself
str(get_all_vars(form))
str(data.frame(lst))

Step 0: explicit subsetting

If you've used the subset argument of lm or glm, start by an explicit subsetting:

## `subset_vec` is what you pass to `lm` via `subset` argument
## it can either be a logical vector of length `nrow(dat)`
## or a shorter positive integer vector giving position index
## note however, `base::subset` expects logical vector for `subset` argument
## so a rigorous check is necessary here
if (mode(subset_vec) == "logical") {
  if (length(subset_vec) != nrow(dat)) {
    stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
    }
  subset_log_vec <- subset_vec
  } else if (mode(subset_vec) == "numeric") {
  ## check range
  ran <- range(subset_vec)
  if (ran[1] < 1 || ran[2] > nrow(dat)) {
    stop("'numeric' `subset_vec` provided but values are out of bound")
    } else {
    subset_log_vec <- logical(nrow(dat))
    subset_log_vec[as.integer(subset_vec)] <- TRUE
    } 
  } else {
  stop("`subset_vec` must be either 'logical' or 'numeric'")
  }
dat <- base::subset(dat, subset = subset_log_vec)

Step 1: remove incomplete cases

dat <- na.omit(dat)

You can skip this step if you've gone through step 0, since subset automatically removes incomplete cases.

Step 2: mode checking and conversion

A data frame column is usually an atomic vector, with a mode from the following: "logical", "numeric", "complex", "character", "raw". For regression, variables of different modes are handled differently.

"logical",   it depends
"numeric",   nothing to do
"complex",   not allowed by `model.matrix`, though allowed by `model.frame`
"character", converted to "numeric" with "factor" class by `model.matrix`
"raw",       not allowed by `model.matrix`, though allowed by `model.frame`

A logical variable is tricky. It can either be treated as a dummy variable (1 for TRUE; 0 for FALSE) hence a "numeric", or it can be coerced to a two-level factor. It all depends on whether model.matrix thinks a "to-factor" coercion is necessary from the specification of your model formula. For simplicity we can understand it as such: it is always coerced to a factor, but the result of applying contrasts may end up with the same model matrix as if it were handled as a dummy directly.

Some people may wonder why "integer" is not included. Because an integer vector, like 1:4, has a "numeric" mode (try mode(1:4)).

A data frame column may also be a matrix with "AsIs" class, but such a matrix must have "numeric" mode.

Our checking is to produce error when

  • a "complex" or "raw" is found;
  • a "logical" or "character" matrix variable is found;

and proceed to convert "logical" and "character" to "numeric" of "factor" class.

## get mode of all vars
var_mode <- sapply(dat, mode)

## produce error if complex or raw is found
if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")

## get class of all vars
var_class <- sapply(dat, class)

## produce error if an "AsIs" object has "logical" or "character" mode
if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
  stop("matrix variables with 'AsIs' class must be 'numeric'")
  }

## identify columns that needs be coerced to factors
ind1 <- which(var_mode %in% c("logical", "character"))

## coerce logical / character to factor with `as.factor`
dat[ind1] <- lapply(dat[ind1], as.factor)

Note that if a data frame column is already a factor variable, it will not be included in ind1, as a factor variable has "numeric" mode (try mode(factor(letters[1:4]))).

step 3: drop unused factor levels

We won't have unused factor levels for factor variables converted from step 2, i.e., those indexed by ind1. However, factor variables that come with dat might have unused levels (often as the result of step 0 and step 1). We need to drop any possible unused levels from them.

## index of factor columns
fctr <- which(sapply(dat, is.factor))

## factor variables that have skipped explicit conversion in step 2
## don't simply do `ind2 <- fctr[-ind1]`; buggy if `ind1` is `integer(0)`
ind2 <- if (length(ind1) > 0L) fctr[-ind1] else fctr

## drop unused levels
dat[ind2] <- lapply(dat[ind2], droplevels)

step 4: summarize factor variables

Now we are ready to see what and how many factor levels are actually used by lm or glm:

## export factor levels actually used by `lm` and `glm`
lev <- lapply(dat[fctr], levels)

## count number of levels
nl <- lengths(lev)

To make your life easier, I've wrapped up those steps into a function debug_contr_error.

Input:

  • dat is your data frame passed to lm or glm via data argument;
  • subset_vec is the index vector passed to lm or glm via subset argument.

Output: a list with

  • nlevels (a list) gives the number of factor levels for all factor variables;
  • levels (a vector) gives levels for all factor variables.

The function produces a warning, if there are no complete cases or no factor variables to summarize.

debug_contr_error <- function (dat, subset_vec = NULL) {
  if (!is.null(subset_vec)) {
    ## step 0
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' `subset_vec` provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("`subset_vec` must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    } else {
    ## step 1
    dat <- stats::na.omit(dat)
    }
  if (nrow(dat) == 0L) warning("no complete cases")
  ## step 2
  var_mode <- sapply(dat, mode)
  if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
  var_class <- sapply(dat, class)
  if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
    stop("matrix variables with 'AsI

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...