
r - Analysis using linear regression based on subgroups

Assume I have data (t,y), where I expect a linear dependency y(t). Furthermore, there exist attributes to each observation par1, par2, par3. Is there an algorithm or technique to decide, if (one or both or all of the parameters) are relevant for the fit or not? I tried leaps::regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10) but was not able to get the formula for the best fit.
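
For reference, the terms and coefficients of a regsubsets fit can be read off its summary (a minimal sketch, assuming the mydata built below and using the default nbest = 1 for simplicity):

library(leaps)
fit_sub <- regsubsets(y ~ t + par1 + par2 + par3, data = mydata)
s <- summary(fit_sub)
best <- which.min(s$bic)  # best model by BIC
s$which[best, ]           # logical row: which terms that model includes
coef(fit_sub, best)       # its fitted coefficients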

The final result should look like this if plotted. For data see below.

[Figure: the desired result, three groups of points, each with its own fitted line; reproduced by the plotting code at the end of the example]
Thus, I want the following information:

  • Adding par1 and par2 gives the best fit
  • The models are y_i = a_i * t_i + b_i with given a_i and b_i

Reproducible example:

t <- seq(0,10, length.out = 1000) # large sample of x values
# Create 3 linear equations of the form y_i = a*t_i + b
a <- c(1, 0.3, 0.2) # slope
b <- c(-0.5, 0.5, 0.1) # offset

# create t_i (sampled), y_i (with noise) and the noise-free y_t_i
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
  set.seed(33*i)
  d[[i]] <- sort(sample(t, 50, replace = F))
  set.seed(33*i)
  noise <- rnorm(length(d[[i]])) # one noise value per observation
  y[[i]] <- a[i]*d[[i]] + b[i] + noise
  y_t[[i]] <- a[i]*d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t=d[[1]], y=y[[1]], par1=rep(1), par2=rep(10), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
df2 <- data.frame(t=d[[2]], y=y[[2]], par1=rep(2), par2=rep(20), par3=sample(c(100, 200, 300), length(d[[2]]), replace = T))
df3 <- data.frame(t=d[[3]], y=y[[3]], par1=rep(2), par2=rep(30), par3=sample(c(100, 200, 300), length(d[[3]]), replace = T))
mydata <- rbind(df1, df2, df3)
mydata <- mydata[sample(nrow(mydata)), ]

# This is what the data looks like:
plot(mydata$t, mydata$y)

# This is the result I am looking for (ideally):
plot(d[[1]], y[[1]], col = "black", xlim = c(0, 10), ylim = c(-2, 10), xlab = "t", ylab = "y",
     main = "Fit for three different groups")
points(d[[2]], y[[2]], col = "red")
points(d[[3]], y[[3]], col = "blue")
lines(d[[1]], y_t[[1]],col = "black")
lines(d[[2]], y_t[[2]], col = "red")
lines(d[[3]], y_t[[3]], col = "blue")

Comment and question on @Roland's answer:

I understand that the three given parameters define 2*3*3 = 18 factor-level combinations. But I would expect only 2^3 = 8 relevant candidate groupings, since for each parameter the choice is simply "include it or not". To me it does not make sense to include only a single level of a parameter.
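
To make the "include parameter x or not" idea concrete, here is a minimal sketch (a plain enumeration with lm, not regsubsets or the LASSO; it assumes mydata from the example above) that fits one varying-slope model per subset of parameters and compares the 2^3 = 8 fits by BIC:

params <- c("par1", "par2", "par3")
subsets <- lapply(0:7, function(k) params[bitwAnd(k, c(1, 2, 4)) > 0])
fits <- lapply(subsets, function(inc) {
  if (length(inc) == 0) return(lm(y ~ t, data = mydata))
  g <- interaction(mydata[inc], drop = TRUE) # one group per level combination
  lm(y ~ t * g, data = mydata)               # group-specific intercept and slope
})
data.frame(subset = sapply(subsets, function(s) if (length(s)) paste(s, collapse = "+") else "(none)"),
           BIC = sapply(fits, BIC))
# the row with the smallest BIC marks the relevant parameters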

I tried the following

g <- 0 # level "0" of the grouping factor g from the answer below
t_lin1 <- mydata$t[mydata$g == g]
y_lin1 <- mydata$y[mydata$g == g]
plot(mydata$t, mydata$y)
points(t_lin1, y_lin1, col = "red")
abline(lm(y_lin1 ~ t_lin1), col = "red")
points(pred.1se ~ t, data = mydata, col = as.integer(mydata$g), pch = 16) # pred.1se also from the answer below

and realized that the fit is off. Looking back this is clear because

  • I include the wrong factor levels (most likely parameter 3 is not relevant)
  • and thus get the wrong data for the fit

So my last question is:

  • Where can I find the relevant groups included in the best model and
  • what are the corresponding fit parameters from the regression?

Sorry if this is obvious, but to me it is a mystery.


1 Reply


The LASSO can come pretty close (although it still identifies too many effects):

#I assume these are supposed to be factors:
mydata$par1 <- factor(mydata$par1)
mydata$par2 <- factor(mydata$par2)
mydata$par3 <- factor(mydata$par3)

#create model matrix, remove intercept since glmnet adds it
x <- model.matrix(y ~ (par1 * par2 * par3) * t, data = mydata)[,-1]

#cross-validated LASSO
library(glmnet)
set.seed(42)
fit <- cv.glmnet(x, mydata$y, intercept = TRUE, nfolds = 10, alpha = 1)
plot(fit)

[Figure: cross-validation curve produced by plot(fit)]

coef <- as.matrix(coef(fit, s = "lambda.1se"))
coef[coef != 0,]
#(Intercept)      par230           t     par12:t    par230:t   par3300:t 
# 0.47542479 -0.27612966  0.75497711 -0.42493030 -0.15044371  0.03033057

#The groups (each digit flags one selected effect:
#units = par2 == 30, tens = par1 == 2, hundreds = par3 == 300):
mydata$g <- factor((mydata$par2 == 30) + 10 * (mydata$par1 == 2) + 100 * (mydata$par3 == 300))



mydata$pred.1se <- predict(fit, newx = x, s = "lambda.1se") # fitted values at the 1-SE lambda

library(ggplot2)
ggplot(mydata, aes(x = t, color = g)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred.1se))

[Figure: the data points colored by group g, with the predicted lines from pred.1se]

You can then calculate the desired intercepts and slopes from the coefficients.
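
For example, a minimal sketch using the coefficient names printed above (par230 is the indicator for par2 level 30, par12:t the slope adjustment for par1 level 2): the group with par1 = 2 and par2 = 30, for observations with par3 != 300, has

cf <- coef[coef != 0, ] # named vector of the non-zero lambda.1se coefficients
intercept <- cf["(Intercept)"] + cf["par230"]
slope     <- cf["t"] + cf["par12:t"] + cf["par230:t"]
unname(c(intercept, slope))
# roughly 0.20 and 0.18, close to the true b = 0.1 and a = 0.2 of that group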

