r - using mean with .SD and .SDcols in data.table

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.

So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:

library(data.table)

dt <- data.table(
  a = 1:10,
  b = as.factor(letters[1:10]),
  c = c(TRUE, FALSE),
  d = runif(10, 0.5, 100),
  e = c(0, 1),
  f = as.integer(c(0, 1)),
  g = as.numeric(1:10),
  h = c("cat1", "cat2", "cat3", "cat4", "cat5"))

mean(dt$a)
[1] 5.5

dt[, mean(.SD), .SDcols = "a"]

[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA

dt[, sum(.SD), .SDcols = "a"]
[1] 55

dt[, max(.SD), .SDcols = "a"]
[1] 10

dt[, colMeans(.SD), .SDcols = "a"]
  a 
5.5 

dt[, lapply(.SD, mean), .SDcols = "a"]
     a
1: 5.5

Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).

I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.

Maybe this is just a problem with using mean() on a list (which seems to be what .SD returns)? I guess there is never a reason to NOT use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans which is returning something else (list?).

My main question is why mean(.SD) does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.

Thanks.

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:36:22+0000

I think the simplest way to get what you want is the standard syntax, which applies the function to each column of .SD:

dt[ , lapply(.SD, mean), .SDcols = "a"]
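
As for why mean(.SD) fails: .SD is itself a data.table (a list of columns) even when .SDcols names a single column, and mean() expects a numeric or logical vector. The sketch below illustrates this on a small stand-in table; the names dt_small, m_vec, m_dt and wm are made up for the example. It also reproduces the weighted.mean result from the question, on the understanding that, with no weights supplied, the default method computes sum(x) / length(x), and length() of a one-column table is 1.

```r
library(data.table)

# Small stand-in table (hypothetical; not the question's full dataset).
dt_small <- data.table(a = 1:10, g = as.numeric(1:10))

# .SD is a data.table, not a vector, even for a single column.
dt_small[, class(.SD), .SDcols = "a"]       # "data.table" "data.frame"
dt_small[, is.numeric(.SD), .SDcols = "a"]  # FALSE, hence the warning from mean.default()

# Pull the single column out as a plain vector before calling mean().
m_vec <- dt_small[, mean(.SD[[1]]), .SDcols = "a"]

# lapply() passes each column to mean() as a vector and returns a data.table.
m_dt <- dt_small[, lapply(.SD, mean), .SDcols = "a"]

# weighted.mean() without weights: sum over the table divided by the
# number of columns (1), so the result is the column sum, not the mean.
wm <- dt_small[, weighted.mean(.SD), .SDcols = "a"]
```

So .SD can be used without an apply function, but only with functions that accept a data.frame (sum, max, colMeans) or after extracting the column with .SD[[1]].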

Alternatively, you can pass a variable by name as follows:

col_to_pass = "a"
dt[ , mean(get(col_to_pass)) ]

Finally, you can generalize this approach to multiple columns as follows:

col_to_pass = c("a", "d")
dt[ , lapply( mget(col_to_pass), mean) ]
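
For comparison, the same multi-column summary can be written with .SDcols, which ties back to the syntax in the question. This is a minimal sketch on a made-up table (dt_small, cols, via_mget and via_sdcols are names invented for the example); both forms should give a one-row data.table with one column per summarized variable.

```r
library(data.table)

# Hypothetical example table with two numeric columns and one character column.
dt_small <- data.table(
  a = 1:10,
  d = seq(0.5, 5, by = 0.5),
  h = rep(c("x", "y"), 5)
)

cols <- c("a", "d")  # columns to summarize

# mget() looks the columns up by name and returns them as a named list.
via_mget <- dt_small[, lapply(mget(cols), mean)]

# .SDcols restricts .SD to the same columns; lapply() then works column by column.
via_sdcols <- dt_small[, lapply(.SD, mean), .SDcols = cols]
```

The .SDcols form is usually preferable when a by= grouping is added later, since .SD is then rebuilt per group.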
