Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
288 views
in Technique[技术] by (71.8m points)

r - Blend of na.omit and na.pass using aggregate?

I have a data set containing product prototype test data. Not all tests were run on all lots, and not all tests were executed with the same sample sizes. To illustrate, consider this case:

> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
  var1 = rep(c(1:3, NA), 3),
  var2 = 1:12,
  var3 = c(rep(NA, 4), 1:8))

> test
   name var1 var2 var3
1     A    1    1   NA
2     A    2    2   NA
3     A    3    3   NA
4     A   NA    4   NA
5     B    1    5    1
6     B    2    6    2
7     B    3    7    3
8     B   NA    8    4
9     C    1    9    5
10    C    2   10    6
11    C    3   11    7
12    C   NA   12    8

In the past, I've only had to deal with cases of mis-matched repetitions, which has been easy with aggregate(cbind(var1, var2) ~ name, test, FUN = mean, na.action = na.omit) (or the default setting). I'll get averages for each lot over three values for var1 and over four values for var2.

Unfortunately, this will leave me with a dataset completely missing lot A in this case:

 aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.omit)
  name var1 var2 var3
1    B    2    6    2
2    C    2   10    6

If I use na.pass, however, I also don't get what I want:

 aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.pass)
  name var1 var2 var3
1    A   NA  2.5   NA
2    B   NA  6.5  2.5
3    C   NA 10.5  6.5

Now I lose the good data I had in var1 since it contained instances of NA.

What I'd like is:

  • NA as the output of mean() if all unique combinations of varN ~ name are NAs
  • Output of mean() if there are one or more actual values for varN ~ name

I'm guessing this is pretty simple, but I just don't know how. Do I need to use ddply for something like this? If so... the reason I tend to avoid it is that I end up writing really long equivalents to aggregate() like so:

ddply(test, .(name), summarise,
  var1 = mean(var1, na.rm = T),
  var2 = mean(var2, na.rm = T),
  var3 = mean(var3, na.rm = T))

Yeah... so the result of that apparently does what I want. I'll leave the question anyway in case there's 1) a way to do this with aggregate() or 2) shorter syntax for ddply.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Pass both na.action=na.pass and na.rm=TRUE to aggregate. The former tells aggregate not to delete rows where NAs exist; and the latter tells mean to ignore them.

aggregate(cbind(var1, var2, var3) ~ name, test, mean,
          na.action=na.pass, na.rm=TRUE)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...