regex - R merge data frames, allow inexact ID matching (e.g. with additional characters 1234 matches ab1234 )

Question

Welcome To Ask or Share your Answers For Others

regex - R merge data frames, allow inexact ID matching (e.g. with additional characters 1234 matches ab1234 )

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

regex - R merge data frames, allow inexact ID matching (e.g. with additional characters 1234 matches ab1234 )

I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's sample IDs are in many different formats, but most contain the required ID string for matching somewhere in their ID, e.g. sample "1234" in one table has got an ID of "ProjectB(1234)" in the other.

I have made a minimal reproducible example.

a<-data.frame(aID=c("1234","4567","6789","3645"),aInfo=c("blue","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

using merge gets part of the way:

merge(a,b, by.x="aID", by.y="bID", all=TRUE)

       aID     aInfo       bInfo
1     1234      blue        <NA>
2     3645  cerulean        <NA>
3     4567     green       apple
4     6789 goldenrod        kiwi
5   (1234)      <NA>      banana
6    23645      <NA> pomegranate
7 63528973      <NA>      lychee

but the output that would be liked is basically:

        ID     aInfo       bInfo
1     1234      blue      banana
2     3645  cerulean pomegranate
3     4567     green       apple
4     6789 goldenrod        kiwi
5 63528973      <NA>      lychee

I just wondered if there was a way to incorporate grep into this or another R-tastic method?

Thanks in advance

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:06:53+0000

Doing merge on a condition is a little tricky. I don't think you can do it with merge as it is written, so you end up having to write a custom function with by. It is pretty inefficient, but then, so is merge. If you have millions of rows, consider data.table. This is how you would do a "inner join" where only rows that match are returned.

# I slightly modified your data to test multiple matches    
a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
do.call(rbind,by(a,a$aID,f))

#         aID     aInfo    bID       bInfo
# 1234.1 1234      blue (1234)      banana
# 1234.2 1234     blue2 (1234)      banana
# 3645   3645  cerulean  23645 pomegranate
# 4567   4567     green   4567       apple
# 6789   6789 goldenrod   6789        kiwi

Doing a full join is a little trickier. This is one way, that is still inefficient:

f<-function(x,b) {
  matches<-b[agrep(x[1,1],b[,1]),]
  if (nrow(matches)>0) merge(x,matches,all=TRUE)
  # Ugly... but how else to create a data.frame full of NAs?
  else merge(x,b[NA,][1,],all.x=TRUE)
}
d<-do.call(rbind,by(a,a$aID,f,b))
left.over<-!(b$bID %in% d$bID)
rbind(d,do.call(rbind,by(b[left.over,],'bID',f,a))[names(d)])

#         aID     aInfo      bID       bInfo
# 1234.1 1234      blue   (1234)      banana
# 1234.2 1234     blue2   (1234)      banana
# 3645   3645  cerulean    23645 pomegranate
# 4567   4567     green     4567       apple
# 6789   6789 goldenrod     6789        kiwi
# bID    <NA>      <NA> 63528973      lychee

Categories

regex - R merge data frames, allow inexact ID matching (e.g. with additional characters 1234 matches ab1234 )

regex - R merge data frames, allow inexact ID matching (e.g. with additional characters 1234 matches ab1234 )

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags