Yet another option is ave. For good measure, I've collected the answers above, tried my best to make their output equivalent (a vector), and provided timings over 1000 runs using your example data as an input. First, my answer using ave: ave(df$x, df$index, FUN = function(z) z/sum(z)). I also show an example using the data.table package since it is usually pretty quick, but I know you're looking for base solutions, so you can ignore that if you want.
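To make the ave call concrete, here is a minimal sketch on a tiny made-up data frame (the toy values are hypothetical, not the data from the question). ave applies FUN within each level of the grouping variable and returns a vector with the same length and row order as the input, so the result can be assigned straight back as a column:

```r
# Hypothetical toy data: two groups, "a" and "b"
toy <- data.frame(x = c(1, 3, 2, 2, 4),
                  index = c("a", "a", "b", "b", "b"))

# Share of each x within its group; the result lines up with the rows of toy
toy$share <- ave(toy$x, toy$index, FUN = function(z) z / sum(z))

toy$share
# 0.25 0.75 0.25 0.25 0.50
```

Group "a" sums to 4 and group "b" sums to 8, which is where the shares come from.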
And now a bunch of timings:
library(data.table)
library(plyr)
dt <- data.table(df)
plyr <- function() ddply(df, .(index), transform, z = x / sum(x))
av <- function() ave(df$x, df$index, FUN = function(z) z/sum(z))
t.apply <- function() unlist(tapply(df$x, df$index, function(x) x/sum(x)))
l.apply <- function() unlist(lapply(split(df$x, df$index), function(x){x/sum(x)}))
b.y <- function() unlist(by(df$x, df$index, function(x){x/sum(x)}))
agg <- function() aggregate(df$x, list(df$index), function(x){x/sum(x)})
d.t <- function() dt[, x/sum(x), by = index]
library(rbenchmark)
benchmark(plyr(), av(), t.apply(), l.apply(), b.y(), agg(), d.t(),
          replications = 1000,
          columns = c("test", "elapsed", "relative"),
          order = "elapsed")
#-----
test elapsed relative
4 l.apply() 0.052 1.000000
2 av() 0.168 3.230769
3 t.apply() 0.257 4.942308
5 b.y() 0.694 13.346154
6 agg() 1.020 19.615385
7 d.t() 2.380 45.769231
1 plyr() 5.119 98.442308
The lapply() solution seems to win in this case, and data.table() is surprisingly slow. Let's see how this scales to a bigger aggregation problem:
df <- data.frame(x = sample(1:100, 1e5, TRUE), index = gl(1000, 100))
dt <- data.table(df)
# Replication code omitted for brevity; used 100 replications and dropped plyr() since I know it
# will be slow by comparison:
test elapsed relative
6 d.t() 2.052 1.000000
1 av() 2.401 1.170078
3 l.apply() 4.660 2.270955
2 t.apply() 9.500 4.629630
4 b.y() 16.329 7.957602
5 agg() 20.541 10.010234
That seems more consistent with what I'd expect.
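One caveat when comparing these approaches: they only return the same vector when the rows are already sorted by group, as they are with gl(). A minimal sketch of the difference, using a small made-up data frame (the names chk and shuf and the seed are assumptions for illustration only): ave keeps the original row order, while split plus lapply returns the values regrouped by index.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible
chk <- data.frame(x = sample(1:100, 12, TRUE), index = gl(3, 4))

by_ave    <- ave(chk$x, chk$index, FUN = function(z) z / sum(z))
by_lapply <- unlist(lapply(split(chk$x, chk$index), function(z) z / sum(z)))

# Rows are already grouped, so the values line up once the names are dropped
same_when_sorted <- isTRUE(all.equal(by_ave, unname(by_lapply)))

# Shuffle the rows: ave() follows the new row order, split()/lapply() regroups
shuf       <- chk[sample(nrow(chk)), ]
ave_shuf   <- ave(shuf$x, shuf$index, FUN = function(z) z / sum(z))
split_shuf <- unlist(lapply(split(shuf$x, shuf$index), function(z) z / sum(z)))

# Reordering the ave() result by group recovers the split()/lapply() order
same_after_reorder <- isTRUE(all.equal(ave_shuf[order(shuf$index)],
                                       unname(split_shuf)))
```

If the result needs to go back into the data frame as a column, ave is the safer choice since it never reorders anything.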
In summary, you've got plenty of good options. Find one or two methods that fit your mental model of how aggregation tasks should work and master them. There are many ways to skin a cat.
Edit - and an example with 1e7 rows
Probably not large enough for Matt, but as big as my laptop can handle without crashing:
df <- data.frame(x = sample(1:100, 1e7, TRUE), index = gl(10000, 1000))
dt <- data.table(df)
#-----
test elapsed relative
6 d.t() 0.61 1.000000
1 av() 1.45 2.377049
3 l.apply() 4.61 7.557377
2 t.apply() 8.80 14.426230
4 b.y() 8.92 14.622951
5 agg() 18.20 29.83606