I have a data.table
(~30 million rows) consisting of a datetime
column in POSIXct
format, an id
column and a few other columns (in the example, I just left one irrelevant column x
to demonstrate that there are other columns present that need to be kept). A dput
is at the bottom of the post.
head(DT)
# datetime x id
#1: 2016-04-28 16:20:18 0.02461368 1
#2: 2016-04-28 16:41:34 0.88953932 1
#3: 2016-04-28 16:46:07 0.31818101 1
#4: 2016-04-28 17:00:56 0.14711365 1
#5: 2016-04-28 17:09:11 0.54406602 1
#6: 2016-04-28 17:39:09 0.69280341 1
Q: For each id
, I need to subset only those observations that differ by more than 30 minutes time. What could be an efficient data.table
approach to do this (if possible, without extensive looping)?
The logic can also be described as (like in my comment below):
Per id the first row is always kept. The next row that is at least 30
minutes after the first shall also be kept. Let's assume that row to
be kept is row 4. Then, compute time differences between row 4 and
rows 5:n and keep the first that differs by more than 30 mins and so
on
In the dput below, I added a colum keep
to indicate which rows should be kept in this example because they differ by more than 30 minutes from the previous observation that is kept per id. The difficulty is that it seems to be necessary to calculate the time differences iteratively (or at least, I cannot think of a more efficient approach at the moment).
library(data.table)
DT <- structure(list(
datetime = structure(c(1461853218.81561, 1461854494.81561,
1461854767.81561, 1461855656.81561, 1461856151.81561, 1461857949.81561,
1461858601.81561, 1461858706.81561, 1461859078.81561, 1461859103.81561,
1461852799.81561, 1461852824.81561, 1461854204.81561, 1461855331.81561,
1461855633.81561, 1461856311.81561, 1461856454.81561, 1461857177.81561,
1461858662.81561, 1461858996.81561), class = c("POSIXct", "POSIXt")),
x = c(0.0246136845089495, 0.889539316063747, 0.318181007634848,
0.147113647311926, 0.544066024711356, 0.6928034061566, 0.994269776623696,
0.477795971091837, 0.231625785352662, 0.963024232536554, 0.216407935833558,
0.708530468167737, 0.758459537522867, 0.640506813768297, 0.902299045119435,
0.28915973729454, 0.795467417687178, 0.690705278422683, 0.59414202044718,
0.655705799115822),
id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
keep = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)),
.Names = c("datetime", "x", "id", "keep"),
row.names = c(NA, -20L),
class = c("data.table", "data.frame"))
setkey(DT, id, datetime)
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA,type="lag"), units = "mins"),
by = id]
DT[is.na(difftime), difftime := 0]
DT[, difftime := cumsum(as.numeric(difftime)), by = id]
Explanation of the keep
column:
- Rows 2:3 differ by less than 30 minutes from row 1 -> delete
- Row 4 differs by more than 30 minutes from row 1 -> keep
- Row 5 dufferes by less than 30 minutes from row 4 -> delete
- Row 6 differs by more than 30 minutes from row 4 -> keep
- ...
Desired output:
desiredDT <- DT[(keep)]
Thanks for three expert answers I received. I tested them on 1 and 10 million rows of data. Here's an excerpt of the benchmarks.
a) 1 million rows
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.286647 1.277104 1.185216 1.267769 1.140614 1.036749 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901 3
# eddi2(DT_Eddi2) 9.966078 9.915651 9.210168 9.866330 8.877769 8.070281 3
b) 10 million rows
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.019561 1.025427 1.026681 1.031061 1.030028 1.029037 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143 3
# eddi2(DT_Eddi2) 9.796800 9.693823 9.526193 9.594931 9.398969 9.211019 3
Apparently, @Frank's data.table approach and @Roland's Rcpp based solution are similar in performance with Rcpp having a slight advantage, while @eddi's approaches were still fast but not as performant as the others.
However, when I checked for equality of the solutions, I found that @Roland's approach has a slightly different result than the others:
a) 1 million rows
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (982228, 982224) differ"
#[2] "Component “id”: Numeric: lengths (982228, 982224) differ"
#[3] "Component “x”: Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
b) 10 million rows
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (9981898, 9981891) differ"
#[2] "Component “id”: Numeric: lengths (9981898, 9981891) differ"
#[3] "Component “x”: Numeric: lengths (9981898, 9981891) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
My current assumption is that this difference might be related to whether the differnce is > 30 minutes or >= 30 minutes though I'm not sure about that yet.
Final thought: I decided to go with @Frank's solution for two reasons: 1. it performs very well, almost equal to the Rcpp solution, and 2. it doesn't require another package with which I'm not very familiar yet (I'm using data.table anyway)
See Question&Answers more detail:
os