The first thing to check is that data.table FAQ 3.1, point 2, has sunk in:
One memory allocation is made for the largest group only, then that
memory is reused for the other groups. There is very little garbage
to collect.
That's one reason data.table grouping is quick. But this approach doesn't lend itself to parallelization: parallelizing means copying the data to the other threads instead, which costs time. Even so, my understanding is that data.table grouping is usually faster than plyr with .parallel on anyway. It depends on the computation time of the task for each group, and on whether that compute time can easily be reduced. Moving the data around often dominates (when benchmarking one or three runs of large data tasks).
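As a rough sketch of the kind of grouped aggregation where the single-allocation strategy pays off (hypothetical data and sizes):

```r
library(data.table)

# Hypothetical benchmark data: 1e6 rows spread over ~1e4 groups.
set.seed(1)
DT <- data.table(id = sample(1e4, 1e6, replace = TRUE),
                 x  = rnorm(1e6))

# One working allocation, sized for the largest group, is reused for
# every group, so grouped aggregation creates very little garbage.
res <- DT[, .(sx = sum(x)), by = id]
```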
More often, so far, it's actually some gotcha biting in the j expression of [.data.table. For example, we recently saw poor performance from data.table grouping where the culprit turned out to be min(POSIXct) (Aggregating in R over 80K unique ID's). Avoiding that gotcha yielded over a 50-fold speedup.
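A sketch of that gotcha (hypothetical data; historically, min() dispatched the min.POSIXct S3 method once per group, which dominated run time):

```r
library(data.table)

# Hypothetical data: 1e5 timestamps across 1,000 ids.
set.seed(42)
DT <- data.table(id   = sample(1000, 1e5, replace = TRUE),
                 time = as.POSIXct("2012-01-01", tz = "UTC") +
                        runif(1e5, 0, 86400))

# Slow (historically): min() dispatches min.POSIXct in every group.
slow <- DT[, .(first = min(time)), by = id]

# Faster: aggregate the underlying numeric, convert back just once.
fast <- DT[, .(first = min(as.numeric(time))), by = id]
fast[, first := as.POSIXct(first, origin = "1970-01-01", tz = "UTC")]
```

Both queries return the same answer; the difference is purely how many times method dispatch happens.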
So the mantra is: Rprof, Rprof, Rprof.
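A minimal Rprof workflow (the workload here is a stand-in; substitute your own grouping query):

```r
# Capture profiling samples, then read the summary to see where time
# actually goes before reaching for parallelism.
Rprof("prof.out")
x <- sapply(1:100, function(i) sum(sort(runif(2e5))))  # stand-in workload
Rprof(NULL)
head(summaryRprof("prof.out")$by.total)
```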
Further, point 1 from the same FAQ might be significant:
Only that column is grouped, the other 19 are ignored because
data.table inspects the j expression and realises it doesn’t use the
other columns.
So data.table really doesn't follow the split-apply-combine paradigm at all; it works differently. split-apply-combine lends itself to parallelization, but it really doesn't scale to large data.
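A sketch of that FAQ point (hypothetical 20-column table): because j only mentions V1, the other 19 value columns are never touched during the grouping:

```r
library(data.table)

# Hypothetical table: one grouping column plus 20 value columns.
set.seed(1)
DT <- data.table(grp = rep(1:100, each = 1000))
for (v in paste0("V", 1:20)) set(DT, j = v, value = rnorm(nrow(DT)))

# data.table inspects j, sees that only V1 is used, and ignores the
# other 19 columns entirely while grouping.
res <- DT[, .(s = sum(V1)), by = grp]
```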
Also see footnote 3 in the data.table intro vignette:
We wonder how many people are deploying parallel techniques to code
that is vector scanning
That's trying to say: "sure, parallel is significantly faster, but how long should the task really take with an efficient algorithm?"
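A sketch of what that footnote means (hypothetical data): a keyed binary search replaces the vector scan outright, which is usually a much bigger win than parallelizing the scan:

```r
library(data.table)

# Hypothetical lookup table, keyed on id.
set.seed(1)
DT <- data.table(id = 1:1e6, x = rnorm(1e6))
setkey(DT, id)

scan  <- DT[DT$id == 123456L]   # vector scan: compares all 1e6 ids
keyed <- DT[.(123456L)]         # binary search on the key: O(log n)
```

Both return the same row; the keyed lookup just does vastly less work.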
BUT if you've profiled (using Rprof) and the task per group really is compute intensive, then the 3 posts on datatable-help that include the word "multicore" might help:
multicore posts on datatable-help
Of course there are many tasks where parallelization would be nice in data.table, and there is a way to do it. But it hasn't been done yet; since other factors usually bite first, it's been low priority. If you can post reproducible dummy data with benchmarks and Rprof results, that would help increase the priority.