I am trying to calculate a cumulative sum over a given window based on a condition. I have seen threads where the solution does a conditional cumulative sum (Calculate a conditional running sum in R for every row in data frame) and a rolling sum (Rolling Sum by Another Variable in R), but I couldn't find the two together. I also saw, at R data.table sliding window, that data.table doesn't have a rolling window function. So, this problem is very challenging for me.
Moreover, the solution posted by Mike Grahan on rolling sum is beyond my comprehension. I am looking for a data.table-based method, primarily for speed. However, I am open to other methods if they are understandable.
Here's my input data:
DFI <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011,
2012, 2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010),
Customer = c(13575, 13575, 13575, 13575, 13575, 13575, 13575,
13575, 13575, 13575, 13575, 13578, 13578, 13578, 13578, 13578,
13578), Product = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "A", "A", "B", "C", "D", "E"), Rev = c(4,
3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)), .Names = c("FY",
"Customer", "Product", "Rev"), row.names = c(NA, 17L), class = "data.frame")
Here's my expected output (manually created; my apologies if there is a manual error):
DFO <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2012,
2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010), Customer = c(13575,
13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575,
13578, 13578, 13578, 13578, 13578, 13578), Product = c("A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "A", "A", "B", "C", "D",
"E"), Rev = c(4, 3, 3, 1, 2, 3, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2),
cumsum = c(4, 7, 10, 11, 9, 3, 6, 10, 15, 21, 3, 2, 2, 4,
2, 2)), .Names = c("FY", "Customer", "Product", "Rev", "cumsum"
), row.names = c(NA, 16L), class = "data.frame")
Some commentary about the logic:
1) I want to find a rolling sum over a 5-year period. Ideally, I would like this 5-year period to be variable, i.e. something I can specify elsewhere in the code. This way, I have the liberty to vary the window later on for my analysis.
2) The end of the window is based on the maximum year (i.e. FY in the example above). In the above example, the max FY in DFI is 2016. So, the starting year of the window would be 2016 - 5 + 1 = 2012 for all entries in 2016.
3) The window sum (or running sum) is calculated by Customer and for a specific Product.
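For reference, the three rules above can be written as a brute-force sketch. It is only meant to pin down the intended result, not to be the fast data.table solution asked about below. The names win_years, DT and agg are hypothetical, and the sample rows are the Customer 13575 rows copied from DFI. It assumes that multiple rows for the same Customer, Product and FY should be summed into one row first, as in the expected output.

```r
library(data.table)

# Hypothetical parameter name: window length in years.
win_years <- 5

# Customer 13575 rows copied from DFI above.
DT <- data.table(
  FY       = c(2011, 2012, 2013, 2015, 2016, 2011, 2011, 2012, 2013, 2014, 2015),
  Customer = 13575,
  Product  = rep(c("A", "B"), times = c(5, 6)),
  Rev      = c(4, 3, 3, 1, 2, 1, 2, 3, 4, 5, 6)
)

# One row per Customer/Product/FY (the two 2011 rows for product B collapse).
agg <- DT[, .(Rev = sum(Rev)), by = .(Customer, Product, FY)]

# For each row, sum Rev over the years FY - win_years + 1 .. FY
# within the same Customer/Product group.
agg[, cumsum := vapply(
  FY,
  function(y) sum(Rev[FY >= y - win_years + 1 & FY <= y]),
  numeric(1)
), by = .(Customer, Product)]

print(agg)
```

Changing win_years changes the window length. Because every row rescans its whole group, this is likely to be slow on large data.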
What I tried:
I wanted to try something before posting. Here's my code:
DFI <- data.table::as.data.table(DFI)
# Sort it first
DFI <- DFI[order(Customer, FY), ]
# Find cumulative sum; remove Rev column; order rows
DFOTest <- DFI[, cumsum := cumsum(Rev), by = .(Customer, Product)
               ][, .SD[which.max(cumsum)], by = .(FY, Customer, Product)
                 ][, ("Rev") := NULL
                   ][order(Customer, Product, FY)]
This code calculates the cumulative sum, but I am unable to define a 5-year window and then calculate the running sum within it. I have two questions:
Question 1) How do I calculate a 5-year running sum?
Question 2) Can someone please explain Mike's method on this thread? It seems to be fast. However, I am not really sure what's going on there. I did see that someone requested some commentary, but I am not sure whether it is self-explanatory.
Thanks in advance. I have been struggling with this problem for two days.