Doing merge
on a condition is a little tricky. I don't think you can do it with merge
as it is written, so you end up having to write a custom function with by
. It is pretty inefficient, but then, so is merge
. If you have millions of rows, consider data.table
. This is how you would do a "inner join" where only rows that match are returned.
# I slightly modified your data to test multiple matches
a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
do.call(rbind,by(a,a$aID,f))
# aID aInfo bID bInfo
# 1234.1 1234 blue (1234) banana
# 1234.2 1234 blue2 (1234) banana
# 3645 3645 cerulean 23645 pomegranate
# 4567 4567 green 4567 apple
# 6789 6789 goldenrod 6789 kiwi
Doing a full join is a little trickier. This is one way, that is still inefficient:
f<-function(x,b) {
matches<-b[agrep(x[1,1],b[,1]),]
if (nrow(matches)>0) merge(x,matches,all=TRUE)
# Ugly... but how else to create a data.frame full of NAs?
else merge(x,b[NA,][1,],all.x=TRUE)
}
d<-do.call(rbind,by(a,a$aID,f,b))
left.over<-!(b$bID %in% d$bID)
rbind(d,do.call(rbind,by(b[left.over,],'bID',f,a))[names(d)])
# aID aInfo bID bInfo
# 1234.1 1234 blue (1234) banana
# 1234.2 1234 blue2 (1234) banana
# 3645 3645 cerulean 23645 pomegranate
# 4567 4567 green 4567 apple
# 6789 6789 goldenrod 6789 kiwi
# bID <NA> <NA> 63528973 lychee