Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - rolling regression by group in the tidyverse?

There are many questions about rolling regression in R, but here I am specifically looking for something that uses dplyr, broom and (if needed) purrr.

This is what makes this question different: I want to be tidyverse-consistent. Is it possible to do a proper rolling regression with tidy tools such as purrr::map and dplyr?

Please consider this simple example:

library(dplyr)
library(purrr)
library(broom)
library(zoo)
library(lubridate)

# tibble() replaces the deprecated data_frame()
mydata <- tibble(group = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'),
                 y = c(1, 2, 3, 4, 2, 3, 4, 5),
                 x = c(2, 4, 6, 8, 6, 9, 12, 15),
                 date = ymd(c('2016-06-01', '2016-06-02', '2016-06-03', '2016-06-04',
                              '2016-06-03', '2016-06-04', '2016-06-05', '2016-06-06')))

  group     y     x date      
  <chr> <dbl> <dbl> <date>    
1 a      1.00  2.00 2016-06-01
2 a      2.00  4.00 2016-06-02
3 a      3.00  6.00 2016-06-03
4 a      4.00  8.00 2016-06-04
5 b      2.00  6.00 2016-06-03
6 b      3.00  9.00 2016-06-04
7 b      4.00 12.0  2016-06-05
8 b      5.00 15.0  2016-06-06

For each group (in this example, a or b):

  1. compute the rolling regression of y on x over the last 2 observations.
  2. store the coefficient of that rolling regression in a column of the dataframe.

Of course, as you can see, the rolling regression can only be computed for the last 2 rows in each group.

I have tried to use the following, but without success.

data %>% group_by(group) %>% 
  mutate(rolling_coef = do(tidy(rollapply(. ,
                    width=2, 
                    FUN = function(df) {t = lm(formula=y ~ x, 
                                              data = as.data.frame(df), 
                                              na.rm=TRUE); 
                    return(t$coef) },
                    by.column=FALSE, align="right"))))
Error in mutate_impl(.data, dots) : 
  Evaluation error: subscript out of bounds.
In addition: There were 21 warnings (use warnings() to see them)

Any ideas?

The expected output for the last two rows of group a is 0.5 and 0.5 (y is an exact linear function of x in this example).

More specifically:

mydata_1 <- mydata %>% filter(group == 'a',
                  row_number() %in% c(1,2))
# A tibble: 2 x 3
  group     y     x
  <chr> <dbl> <dbl>
1 a      1.00  2.00
2 a      2.00  4.00
> tidy(lm(y ~ x, mydata_1))['estimate'][2,]
[1] 0.5

and also

mydata_2 <- mydata %>% filter(group == 'a',
                              row_number() %in% c(2,3)) 
# A tibble: 2 x 3
  group     y     x
  <chr> <dbl> <dbl>
1 a      2.00  4.00
2 a      3.00  6.00
> tidy(lm(y ~ x, mydata_2))['estimate'][2,]
[1] 0.5

EDIT:

There is an interesting follow-up to this question: "rolling regression with confidence interval (tidyverse)".


1 Reply

Define a function Coef whose argument is formed from cbind(y, x), i.e. a matrix with y in the first column and x in the second, and which regresses y on x with an intercept and returns the coefficients. Then apply rollapplyr within each group, using the current and prior rows as the window. If by "last 2" you meant the 2 rows before the current row, i.e. excluding the current row, then replace 2 with list(-seq(2)) as the width argument to rollapplyr.

Coef <- . %>% as.data.frame %>% lm %>% coef

mydata %>% 
  group_by(group) %>% 
  do(cbind(reg_col = select(., y, x) %>% rollapplyr(2, Coef, by.column = FALSE, fill = NA),
           date_col = select(., date))) %>%
  ungroup

giving:

# A tibble: 8 x 4
  group `reg_col.(Intercept)` reg_col.x date      
  <chr>                 <dbl>     <dbl> <date>    
1 a      NA                      NA     2016-06-01
2 a       0                       0.500 2016-06-02
3 a       0                       0.500 2016-06-03
4 a       0                       0.500 2016-06-04
5 b      NA                      NA     2016-06-03
6 b       0.00000000000000126     0.333 2016-06-04
7 b     - 0.00000000000000251     0.333 2016-06-05
8 b       0                       0.333 2016-06-06
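
The difference between a width of 2 and a width of list(-seq(2)) is easier to see on a plain vector. The following is only a small illustration, assuming the zoo package is installed; the vector v is made up for the example and is not part of the question's data.

```r
library(zoo)

v <- c(1, 2, 3, 4, 5)  # made-up illustrative vector

# Window = current row and the one before it
incl_current <- rollapplyr(v, 2, sum, fill = NA)

# Window = the two rows before the current row (offsets -1 and -2),
# so the current row itself is excluded
excl_current <- rollapplyr(v, list(-seq(2)), sum, fill = NA)

print(incl_current)
print(excl_current)
```

With the offset form, the first two positions have no complete window and are filled with NA, and each later position only sees strictly earlier rows.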

Variation

A variation of the above would be:

mydata %>% 
       group_by(group) %>% 
       do(select(., date, y, x) %>% 
          read.zoo %>% 
          rollapplyr(2, Coef, by.column = FALSE, fill = NA) %>%
          fortify.zoo(names = "date")
       ) %>% 
       ungroup
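
Since the question asks specifically for dplyr and purrr, a version without zoo might look like the sketch below. This is not part of the original answer: roll_slope is a hypothetical helper written for this illustration, it returns only the slope, and it assumes the rows within each group are already in date order.

```r
library(dplyr)
library(purrr)

mydata <- tibble(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  y = c(1, 2, 3, 4, 2, 3, 4, 5),
  x = c(2, 4, 6, 8, 6, 9, 12, 15)
)

# Hypothetical helper: slope of y ~ x over rows (i - width + 1):i,
# NA wherever the window is not yet complete
roll_slope <- function(y, x, width = 2) {
  map_dbl(seq_along(y), function(i) {
    if (i < width) return(NA_real_)
    idx <- (i - width + 1):i
    fit <- lm(y ~ x, data = data.frame(y = y[idx], x = x[idx]))
    coef(fit)[["x"]]
  })
}

result <- mydata %>%
  group_by(group) %>%
  mutate(slope = roll_slope(y, x, width = 2)) %>%
  ungroup()

print(result)
```

Because mutate evaluates roll_slope once per group, the windows never cross group boundaries. Refitting lm for every window is slow on large data, so this is meant to show the shape of the approach rather than an efficient implementation.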

Slope Only

If only the slope is needed there are further simplifications possible. We use the fact that the slope equals cov(x, y) / var(x).

slope <- . %>% { cov(.[, 2], .[, 1]) / var(.[, 2])}
mydata %>%
       group_by(group) %>%
       mutate(slope = rollapplyr(cbind(y, x), 2, slope, by.column = FALSE, fill = NA)) %>%
       ungroup
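
The identity behind the slope-only version can be checked directly in base R. The sketch below uses the group b values from the question and compares the lm slope with cov(x, y) / var(x); the variable names are chosen for this illustration only.

```r
# Group "b" values from the question
x <- c(6, 9, 12, 15)
y <- c(2, 3, 4, 5)

# Slope from a least-squares fit with an intercept
slope_lm <- unname(coef(lm(y ~ x))["x"])

# Slope from the covariance / variance identity
slope_cov <- cov(x, y) / var(x)

same <- isTRUE(all.equal(slope_lm, slope_cov))
print(same)
```

The identity holds for simple regression with an intercept, which is the case here; it does not give the intercept, so the earlier Coef version is still needed when both coefficients are wanted.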
