r - Why is allow.cartesian required at times when joining data.tables with duplicate keys?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am trying to understand the logic of a J() lookup when there are duplicate keys in a data.table in R.

Here's a little experiment I have tried:

library(data.table)
options(stringsAsFactors = FALSE)

x <- data.table(keyVar = c("a", "b", "c", "c"),
                value  = c(  1,   2,   3,   4))
setkey(x, keyVar)

y1 <- data.frame(name = c("d", "c", "a"))
x[J(y1$name), ]
## OK

y2 <- data.frame(name = c("d", "c", "a", "b"))
x[J(y2$name), ]
## Error: see below

x2 <- data.table(keyVar = c("a", "b", "c"),
                 value  = c(  1,   2,   3))
setkey(x2, keyVar)
x2[J(y2$name), ]
## OK

The error message I get is:

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  :
Join results in 5 rows; more than 4 = max(nrow(x),nrow(i)). Check for duplicate key
values in i, each of which join to the same group in x over and over again. If that's
ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group
to avoid the large allocation. If you are sure you wish to proceed, rerun with 
allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, 
Stack Overflow and datatable-help for advice.

I don't really understand this. I know I should avoid duplicate keys in a lookup, but I would like to gain some insight so that I don't make this mistake in the future.

Thanks a ton for help. This is a great tool.

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:07:26+0000

You don't have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i)), you won't get this error, even if you have duplicates. It is basically a precautionary measure.

When you have duplicate keys, the resulting join can sometimes get much bigger than either input. Since data.table knows the total number of rows the join will produce before it allocates the result, it raises this error and asks you to pass allow.cartesian=TRUE if you're really sure.
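To see where the 5 and the 4 in the question's error message come from, here is a minimal sketch that counts the rows each lookup would produce. The helper join_rows is hypothetical (it is not part of data.table). It assumes the default nomatch = NA, so an unmatched value such as "d" still contributes one NA row. The limit max(nrow(x), nrow(i)) is taken from the error message quoted in the question; other data.table versions may apply a different limit.

```r
library(data.table)

x <- data.table(keyVar = c("a", "b", "c", "c"),
                value  = c(1, 2, 3, 4))
setkey(x, keyVar)

# Hypothetical helper: number of rows x[J(lookup)] would return.
# Each lookup value yields one row per matching key in x,
# or a single NA row when nothing matches (default nomatch = NA).
join_rows <- function(lookup) {
  hits <- vapply(lookup, function(k) sum(x$keyVar == k), integer(1))
  sum(pmax(hits, 1L))
}

y1 <- c("d", "c", "a")
y2 <- c("d", "c", "a", "b")

rows_y1 <- join_rows(y1)               # 1 + 2 + 1     = 4
rows_y2 <- join_rows(y2)               # 1 + 2 + 1 + 1 = 5

# Limit as stated in the question's error message (assumption: this
# version-specific rule is the one in effect).
limit_y1 <- max(nrow(x), length(y1))   # 4
limit_y2 <- max(nrow(x), length(y2))   # 4

exceeds_y1 <- rows_y1 > limit_y1       # FALSE: the join is allowed
exceeds_y2 <- rows_y2 > limit_y2       # TRUE: the join is refused

# Cross-check the count against the real join, with the guard switched off.
actual_y2 <- nrow(x[J(y2), allow.cartesian = TRUE])
```

Under these assumptions, the only difference between y1 and y2 is the extra "b" row, which pushes the result from 4 rows to 5 while the limit stays at 4. With x2, which has no duplicated "c", the same lookup yields 4 rows and stays within the limit.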

Here's an (exaggerated) example that illustrates the idea behind this error message:

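library(data.table)
# DT1 has 1e2 rows keyed "a" and 1e7 rows keyed "b"
DT1 <- data.table(x = rep(letters[1:2], c(1e2, 1e7)),
                  y = 1L, key = "x")
# DT2 repeats the key "b" three times
DT2 <- data.table(x = rep("b", 3), key = "x")

# not run
# DT1[DT2] ## error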

dim(DT1[DT2, allow.cartesian=TRUE])
# [1] 30000000        2

The duplicates in DT2 produced three copies of every "b" row in DT1 (3 * 1e7 = 3e7 rows). Imagine performing the join with 1e4 duplicated values in DT2: the result would explode! To guard against this, the allow.cartesian argument defaults to FALSE.
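The same multiplication can be tried at a size that runs instantly. This is a scaled-down sketch of the example above, not the original author's code: the row counts (2 "a" rows, 10 "b" rows) are chosen purely for illustration, and tryCatch is used only to capture the error instead of stopping the script.

```r
library(data.table)

# Small stand-ins for DT1 and DT2: 2 "a" rows, 10 "b" rows, 3 duplicate "b" lookups.
DT1 <- data.table(x = rep(c("a", "b"), c(2, 10)), y = 1L, key = "x")
DT2 <- data.table(x = rep("b", 3), key = "x")

# Without allow.cartesian, capture whatever the join does (error or result).
res <- tryCatch(DT1[DT2], error = function(e) e)
refused <- inherits(res, "error")

# With the guard switched off, each of the 3 lookups matches all 10 "b" rows.
big <- DT1[DT2, allow.cartesian = TRUE]
n_big <- nrow(big)   # 3 * 10 = 30
```

The result has more rows than DT1 and DT2 combined, which is the kind of growth the check is meant to flag before the memory is allocated.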

That being said, I think Matt once mentioned that it may be possible to raise the error only for "large" joins (that is, joins that result in a huge number of rows, with the threshold set somewhat arbitrarily, I guess). If and when that is implemented, joins that don't explode combinatorially will run without this error message.
