I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.
So, I am using .SDcols
to pass in the column to summarize, and using functions on .SD
in the j
part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:
dt <- data.table(
a=1:10,
b=as.factor(letters[1:10]),
c=c(TRUE, FALSE),
d=runif(10, 0.5, 100),
e=c(0,1),
f=as.integer(c(0,1)),
g=as.numeric(1:10),
h=c("cat1", "cat2", "cat3", "cat4", "cat5"))
mean(dt$a)
[1] 5.5
dt[, mean(.SD), .SDcols = "a"]
[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA
dt[, sum(.SD), .SDcols = "a"]
[1] 55
dt[, max(.SD), .SDcols = "a"]
[1] 10
dt[, colMeans(.SD), .SDcols = "a"]
a
5.5
dt[, lapply(.SD, mean), .SDcols = "a"]
a
1: 5.5
Interestingly, weighted.mean
gives the wrong answer (55, the sum) when I use weighted.mean(.SD)
in j. But when I use lapply(.SD, weighted.mean)
in j, it gives the right answer (5.5, the mean).
I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.
Maybe this is just a problem with using mean()
on a list (which seems to be what .SD
returns)? I guess there is never a reason to NOT use the lapply
paradigm with .SD
? It seems that only the lapply
option returns a data.table. The others seem to return vectors, except for colMeans which is returning something else (list?).
My main question is why mean(.SD)
does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.
Thanks.
See Question&Answers more detail:
os