Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
89 views
in Technique[技术] by (71.8m points)

How to do cross join in R?

How can I achieve a cross join in R ? I know that "merge" can do inner join, outer join. But I do not know how to achieve a cross join in R.

Thanks

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If speed is an issue, I suggest checking out the excellent data.table package. In the example at the end it's ~90x faster than merge.

You didn't provide example data. If you just want to get all combinations of two (or more individual) columns, you can use CJ (cross join):

library(data.table)
CJ(x=1:2,y=letters[1:3])
#   x y
#1: 1 a
#2: 1 b
#3: 1 c
#4: 2 a
#5: 2 b
#6: 2 c

If you want to do a cross join on two tables, I haven't found a way to use CJ(). But you can still use data.table:

x2<-data.table(id1=letters[1:3],vals1=1:3)
y2<-data.table(id2=letters[4:7],vals2=4:7)

res<-setkey(x2[,c(k=1,.SD)],k)[y2[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
res
#    id1 vals1 id2 vals2
# 1:   a     1   d     4
# 2:   b     2   d     4
# 3:   c     3   d     4
# 4:   a     1   e     5
# 5:   b     2   e     5
# 6:   c     3   e     5
# 7:   a     1   f     6
# 8:   b     2   f     6
# 9:   c     3   f     6
#10:   a     1   g     7
#11:   b     2   g     7
#12:   c     3   g     7

Explanation of the res line:

  • Basically you add a dummy column (k in this example) to one table and set it as the key (setkey(tablename,keycolumns)), add the dummy column to the other table, and then join them.
  • The data.table structure uses column positions and not names in the join, so you have to put the dummy column at the beginning. The c(k=1,.SD) part is one way that I have found to add columns at the beginning (the default is to add them to the end).
  • A standard data.table join has a format of X[Y]. The X in this case is setkey(x2[,c(k=1,.SD)],k), and the Y is y2[,c(k=1,.SD)].
  • allow.cartesian=TRUE tells data.table to ignore the duplicate key values, and perform a cartesian join (prior versions didn't require this)
  • The [,k:=NULL] at the end just removes the dummy key from the result.

You can also turn this into a function, so it's cleaner to use:

# Version 1; easier to write:
CJ.table.1 <- function(X,Y)
  setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]

CJ.table.1(x2,y2)
#    id1 vals1 id2 vals2
# 1:   a     1   d     4
# 2:   b     2   d     4
# 3:   c     3   d     4
# 4:   a     1   e     5
# 5:   b     2   e     5
# 6:   c     3   e     5
# 7:   a     1   f     6
# 8:   b     2   f     6
# 9:   c     3   f     6
#10:   a     1   g     7
#11:   b     2   g     7
#12:   c     3   g     7

# Version 2; faster but messier:
CJ.table.2 <- function(X,Y) {
  eval(parse(text=paste0("setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],list(",paste0(unique(c(names(X),names(Y))),collapse=","),")][,k:=NULL]")))
}

Here are some speed benchmarks:

# Create a bigger (but still very small) example:
n<-1e3
x3<-data.table(id1=1L:n,vals1=sample(letters,n,replace=T))
y3<-data.table(id2=1L:n,vals2=sample(LETTERS,n,replace=T))

library(microbenchmark)
microbenchmark(merge=merge.data.frame(x3,y3,all=TRUE),
               CJ.table.1=CJ.table.1(x3,y3),
               CJ.table.2=CJ.table.2(x3,y3),
               times=3, unit="s")
#Unit: seconds
#       expr        min         lq     median         uq        max neval
#      merge 4.03710225 4.23233688 4.42757152 5.57854711 6.72952271     3
# CJ.table.1 0.06227603 0.06264222 0.06300842 0.06701880 0.07102917     3
# CJ.table.2 0.04740142 0.04812997 0.04885853 0.05433146 0.05980440     3

Note that these data.table methods are much faster than the merge method suggested by @danas.zuokas. The two tables with 1,000 rows in this example result in a cross-joined table with 1 million rows. So even if your original tables are small, the result can get big quickly and speed becomes important.

Lastly, recent versions of data.table require you to add the allow.cartesian=TRUE (as in CJ.table.1) or specify the names of the columns that should be returned (CJ.table.2). The second method (CJ.table.2) seems to be faster, but requires some more complicated code if you want to automatically specify all the column names. And it may not work with duplicate column names. (Feel free to suggest a simpler version of CJ.table.2)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...