I was looking for a way to fill a time series data set by time, per group. The very inefficient way I was using was to split the data set per group and apply a custom time-series fill function (create a sequence between the min and max, and merge) to every element of that list. Needless to say, this operation would not get past the splitting step.
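The split-and-merge approach described above can be sketched roughly as follows. This is only a minimal base R illustration on a toy data frame: the helper name fill_group and the toy values are hypothetical, and the column names (source, grp, cnt) are assumed to match the data set shown in this question.

```r
# Toy data with the same column layout as the question (values are made up).
toy <- data.frame(
  source = c("a", "a", "b", "b"),
  grp = as.POSIXct(
    c("2017-06-06 13:00:00", "2017-06-06 16:00:00",
      "2017-06-04 02:00:00", "2017-06-04 04:00:00"),
    tz = "UTC"
  ),
  cnt = c(1L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the full hourly sequence for one group and
# left-join the observed rows onto it, so missing hours get cnt = NA.
fill_group <- function(df) {
  full <- data.frame(
    source = df$source[1],
    grp = seq(min(df$grp), max(df$grp), by = "hour"),
    stringsAsFactors = FALSE
  )
  merge(full, df, by = c("source", "grp"), all.x = TRUE)
}

# Split per group, fill each piece, and bind the pieces back together.
filled <- do.call(rbind, lapply(split(toy, toy$source), fill_group))
rownames(filled) <- NULL
```

On a data set with many groups, the split() call alone produces one data frame per group, which is where this approach stalls.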
My dataset looks like this:
    source                 grp cnt
 1:     83 2017-06-06 13:00:00   1
 2:     83 2017-06-06 23:00:00   1
 3:     83 2017-06-07 03:00:00   1
 4:     83 2017-06-07 07:00:00   2
 5:     83 2017-06-07 13:00:00   1
 6:     83 2017-06-07 19:00:00   1
 7:     83 2017-06-08 00:00:00   1
 8:     83 2017-06-08 14:00:00   1
 9:     83 2017-06-08 15:00:00   1
10:     83 2017-06-08 20:00:00   1
11:    137 2017-06-04 02:00:00   1
12:    137 2017-06-04 05:00:00   1
13:    137 2017-06-04 23:00:00   1
...
My attempt was to use tidyverse methods, utilising the complete function, i.e.
library(tidyverse)

d1 %>%
  group_by(source) %>%
  complete(source, grp = seq(min(grp), max(grp), by = 'hour'))
However, after about 40-45 seconds, a progress bar appeared (apparently a neat feature of some tidyverse functions; I suspect complete in this case), which estimated 9 hours to completion. My dataset is very big and this is not the lightest operation, so I am looking for something really efficient.
DATA
#dput(d1)
structure(list(source = c("83", "83", "83", "83", "83", "83",
"83", "83", "83", "83", "137", "137", "137", "137", "137", "137",
"137", "137", "137", "137", "137", "137", "137", "137"), grp = structure(c(1496743200,
1496779200, 1496793600, 1496808000, 1496829600, 1496851200, 1496869200,
1496919600, 1496923200, 1496941200, 1496530800, 1496541600, 1496606400,
1496617200, 1496649600, 1496696400, 1496808000, 1496844000, 1496876400,
1496962800, 1497880800, 1497888000, 1497978000, 1497996000), class = c("POSIXct",
"POSIXt"), tzone = ""), cnt = c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
)), .Names = c("source", "grp", "cnt"), row.names = c(NA, -24L
), class = "data.frame")