I am trying to understand how R determines reference groups for interactions in a linear model. Consider the following:
df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
year = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), .Label = c("1", "2"), class = "factor"),
treatment = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor"),
y = c(1.4068142116718, 2.67041187927052, 2.69166439169131,
3.56550324537293, 1.60021286173782, 4.26629963353237, 3.85741108250572,
5.15740731957689, 4.15629768365669, 6.14875441068499, 3.31277276551286,
3.47223277168367, 3.74152201649338, 4.02734382610191, 4.49388620764795,
5.6432833241724, 4.76639399631094, 4.16885857079297, 4.96830394378801,
5.6286092105837, 6.60521404151111, 5.51371821706176, 3.97244221149279,
5.68793413111161, 4.90457233598412, 6.02826151378941, 4.92468415350312,
8.23718422822134, 5.87695836962708, 7.47264895892575)), .Names = c("id",
"year", "treatment", "y"), row.names = c(NA, -30L), class = "data.frame")
lm(y ~ -1 + id + year + year:treatment, df)
#Coefficients:
# id1 id2 id3 id4
# 2.6585 3.9933 4.1161 5.3544
# id5 year2 year1:treatment1 year2:treatment1
# 6.1991 0.7149 -0.6317 NA
R tries to estimate the full set of interactions instead of consistently omitting a reference group. As a result, I am getting NA
's in the results.
Also, R is inconsistent with which groups it drops. I would like to estimate a model with the same omitted group (year1
) in the main effect and interactions. How to force R to omit year1
and year1:treatment1
from the preceding model?
I understand that there are several workarounds for this problem (e.g. creating all the variables by hand and writing them out in the model's formula). But the actual models I am estimating are much more complicated versions of this problem and such a workaround would be cumbersome.
See Question&Answers more detail:
os