Overview
I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:
1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will) potential DB interactions
4. There are some minor functionality differences (see "Examples/Usage" below)
In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question, asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).
Question
What I want to know is:
- Are there analytical tasks that are much easier to code with one package than the other for people familiar with both (i.e. some combination of keystrokes required and level of esotericism, where less of each is a good thing)?
- Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. the other?
One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):
library(dplyr)

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))
Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun; note here that I favored a single statement over the strictly most optimal solution):
setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
  by = list(id, job)
]
The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. it doesn't use some of the more esoteric tricks).
Ideally what I'd like to see are some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.
Examples
Usage
- dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question; a sketch of that pattern follows this list)
- data.table supports rolling joins (thanks @dholstius) as well as overlap joins (a rolling-join sketch follows this list)
- data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax. See here for some more details and a tiny benchmark; a small sketch also follows this list
- dplyr offers standard evaluation versions of its functions (e.g. regroup, summarize_each_) that can simplify programmatic use of dplyr (note that programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge); the last sketch after this list shows one pattern on each side
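
To make the first bullet concrete, here is a minimal sketch of a do()-based work-around (not necessarily @beginneR's exact code; the grouping column and per-group operation are chosen only for illustration, using the dat defined under "Data" below):

library(dplyr)

# Return an arbitrary number of rows per group (here, the first two);
# inside do(), '.' refers to the current group's data frame
dat %>%
  group_by(name) %>%
  do(head(., 2))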
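
A minimal rolling-join sketch, assuming made-up trades/quotes tables (all names and values here are invented for illustration):

library(data.table)

trades <- data.table(time = c(10L, 20L, 30L), qty = c(100L, 200L, 150L))
quotes <- data.table(time = c(5L, 15L, 25L), bid = c(99.1, 99.3, 99.5))
setkey(trades, time)
setkey(quotes, time)

# roll = TRUE matches each trade to the most recent quote at or before it
quotes[trades, roll = TRUE]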
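
A tiny sketch of the automatic-indexing behaviour (the table is made up; indices() reports any index created as a side effect of the first subset):

library(data.table)

DT <- data.table(id = sample(1e5), val = rnorm(1e5))

DT[id == 42L]          # first subset creates an index on 'id' as a side effect
DT[id %in% c(1L, 7L)]  # later subsets can use binary search via that index
indices(DT)            # shows which indices now exist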
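
And a sketch of the programmatic-use point. regroup and summarize_each_ above are from older dplyr; in current dplyr the analogous programmatic pattern is .data[[col]], and on the data.table side get() is one common substitution idiom (the table, grouping column, and col variable here are all invented for illustration):

library(data.table)
library(dplyr)

DT <- data.table(g = rep(c("a", "b"), each = 3), v = 1:6)
col <- "v"  # column name chosen at run time

# data.table: substitute the column programmatically via get()
DT[, .(total = sum(get(col))), by = g]

# dplyr (current versions): refer to the column via .data[[col]]
as_tibble(DT) %>%
  group_by(g) %>%
  summarise(total = sum(.data[[col]]))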
Benchmarks
Data
This is for the first example I showed in the question section.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane",
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob",
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L,
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager",
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager",
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id",
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA,
-16L))