I have irregular time-series data representing a certain type of transaction for users. Each row is timestamped and represents one transaction at that time. Because the data is irregular, some users might have 100 rows in a day while other users have 0 or 1 transactions in a day.
The data might look something like this:
data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01",
           "2015-01-01",
           "2015-01-05",
           "2015-01-25",
           "2015-02-15",
           "2015-05-05",
           "2015-01-01",
           "2015-08-01",
           "2015-01-01"),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)
  id       date n_widgets
1  1 2015-01-01         1
2  1 2015-01-01         2
3  1 2015-01-05         3
4  1 2015-01-25         4
5  1 2015-02-15         4
6  2 2015-05-05         5
7  2 2015-01-01         2
8  3 2015-08-01         4
9  4 2015-01-01         5
Often I'd like to know some rolling statistics about users. For example: for a given user on a given day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days, and so on.
For the example above, the desired output would look like:
  id       date n_widgets n_trans_30 total_widgets_30
1  1 2015-01-01         1          1                1
2  1 2015-01-01         2          2                3
3  1 2015-01-05         3          3                6
4  1 2015-01-25         4          4               10
5  1 2015-02-15         4          2                8
6  2 2015-05-05         5          1                5
7  2 2015-01-01         2          1                2
8  3 2015-08-01         4          1                4
9  4 2015-01-01         5          1                5
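To pin down the semantics of that table, here is a brute-force reference sketch. It is not the idiomatic dplyr answer being asked for, only a slow row-by-row definition of the window. It assumes that `date` is converted with `as.Date`, that the window covers the current date and the 30 days before it, and that rows on the same date are counted in row order (which is what rows 1 and 2 of the table imply). The helper `window_apply` is a hypothetical name introduced here for illustration.

```r
library(dplyr)

data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
                   "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01",
                   "2015-01-01")),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

# Hypothetical helper: for each row i, apply f to the values x[j] of the rows
# j <= i (in row order) whose date lies in [date[i] - days, date[i]].
window_apply <- function(date, x, days, f) {
  vapply(seq_along(date), function(i) {
    j <- seq_len(i)
    in_window <- date[j] <= date[i] & date[j] >= date[i] - days
    as.numeric(f(x[j][in_window]))
  }, numeric(1))
}

result <- data %>%
  group_by(id) %>%
  mutate(
    n_trans_30       = window_apply(date, n_widgets, 30, length),
    total_widgets_30 = window_apply(date, n_widgets, 30, sum)
  ) %>%
  ungroup()

print(result)
```

This scans every earlier row of the group for each row, so it is quadratic per user and only meant to make the target unambiguous.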
If the time window is daily then the solution is simple: data %>% group_by(id, date) %>% summarize(...)
Similarly if the time window is monthly this is also relatively simple with lubridate: data %>% group_by(id, year(date), month(date)) %>% summarize(...)
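For completeness, the two easy cases can be written out as a runnable sketch. The summary columns `n_trans` and `total_widgets` are names chosen here for illustration, and the `.groups = "drop"` argument assumes dplyr 1.0 or later.

```r
library(dplyr)
library(lubridate)

data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-05", "2015-01-25",
                   "2015-02-15", "2015-05-05", "2015-01-01", "2015-08-01",
                   "2015-01-01")),
  n_widgets = c(1, 2, 3, 4, 4, 5, 2, 4, 5)
)

# Daily window: one row per user per calendar day.
daily <- data %>%
  group_by(id, date) %>%
  summarize(n_trans = n(), total_widgets = sum(n_widgets), .groups = "drop")

# Monthly window: one row per user per calendar month.
monthly <- data %>%
  group_by(id, year = year(date), month = month(date)) %>%
  summarize(n_trans = n(), total_widgets = sum(n_widgets), .groups = "drop")

print(daily)
print(monthly)
```

Both of these are fixed calendar bins rather than rolling windows, which is why they fall out of a plain group_by.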
However, the challenge I'm having is how to set up a time window for an arbitrary period: 5 days, 10 days, etc.
There's also the RcppRoll library, but both RcppRoll and the rolling functions in zoo seem to be designed for regular time series. As far as I can tell, these window functions work on a number of rows rather than a specified time period. The key difference is that a given time period can contain a different number of rows depending on the date and the user.
For example, it's possible that user 1 has 100 transactions in the 5 days before 2015-01-01 and only 5 transactions in the 5 days before 2015-02-01. Looking back a fixed number of rows will therefore simply not work.
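A small sketch of why a row-based window breaks down, using user 1's dates from the example data. The window size of 3 rows is an arbitrary choice for illustration.

```r
library(dplyr)

user1 <- data.frame(
  date = as.Date(c("2015-01-01", "2015-01-01", "2015-01-05",
                   "2015-01-25", "2015-02-15"))
)

# Number of days covered by a trailing window of 3 rows
# (the current row and the two rows before it).
user1 <- user1 %>%
  mutate(span_days = as.numeric(date - lag(date, 2)))

print(user1)
# span_days is NA, NA, 4, 24, 41: the same 3-row window covers
# anywhere from 4 to 41 days, so it cannot stand in for "last 30 days".
```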
Additionally, there is another SO thread discussing rolling dates for irregular time-series data (Create new column based on condition that exists within a rolling date); however, the accepted solution uses data.table, and I'm specifically looking for a dplyr way of achieving this.
I suppose that, at its heart, this problem comes down to one question: how can I group_by arbitrary time periods in dplyr? Alternatively, if there's a different dplyr way to achieve the above without a complicated group_by, how can I do it?
EDIT: updated the example to make the nature of the rolling window clearer.