I am trying to understand the logic of J() lookup when there're duplicate keys in a data.table in R.
Here's a little experiment I have tried:
library(data.table)
options(stringsAsFactors = FALSE)
x <- data.table(keyVar = c("a", "b", "c", "c"),
value = c( 1, 2, 3, 4))
setkey(x, keyVar)
y1 <- data.frame(name = c("d", "c", "a"))
x[J(y1$name), ]
## OK
y2 <- data.frame(name = c("d", "c", "a", "b"))
x[J(y2$name), ]
## Error: see below
x2 <- data.table(keyVar = c("a", "b", "c"),
value = c( 1, 2, 3))
setkey(x2, keyVar)
x2[J(y2$name), ]
## OK
The error message I am getting is :
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 5 rows; more than 4 = max(nrow(x),nrow(i)). Check for duplicate key
values in i, each of which join to the same group in x over and over again. If that's
ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group
to avoid the large allocation. If you are sure you wish to proceed, rerun with
allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki,
Stack Overflow and datatable-help for advice.
I don't really understand this. I know I should avoid duplicate keys in a lookup function, I just want to gain some insight so I won't make any error in the future.
Thanks a ton for help. This is a great tool.
See Question&Answers more detail:
os