Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Doing a "fuzzy" and non-fuzzy, many to 1 merge with data.table

Let's assume I have two datasets, dfA and dfB. One holds individual observations and the other holds country-level data, which applies to every observation from the same country and year. For each dataset I have created a key called matchcode, which is a combination of a country code and a year.

dfA <- read.table(
  text = "A   B   C   D   E   F   G   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10  1   0   1   0   0   1   0   BEL   2010   BEL2010
  11  0   1   1   1   0   1   0   NLD   2010   NLD2010
  12  1   0   0   0   1   0   1   NLD   2014   NLD2014
  13  0   0   0   1   1   0   0   AUS   2010   AUS2010
  14  1   0   1   0   0   1   0   AUS   2006   AUS2006
  15  0   1   0   1   0   1   1   USA   2008   USA2008
  16  0   0   1   0   0   0   1   USA   2010   USA2010
  17  0   1   0   1   0   0   0   USA   2012   USA2012
  18  1   0   1   0   0   1   0   BLG   2008   BLG2008
  19  0   1   0   1   1   0   1   BEL   2008   BEL2008
  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE
)

dfB <- read.table(
  text = "A   B   C   D   H   I   J   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2011   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)

library(data.table)
setDT(dfA)
setDT(dfB)

Usually, when I merge these datasets, I simply do:

dfA <- merge(dfA, dfB, by = "matchcode", all.x = TRUE, allow.cartesian = FALSE)
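
Before switching to a fuzzy match, it can help to see which keys have no exact partner. The following is only a sketch on hypothetical toy tables (obs and cty are made-up names, not the data above), using a data.table anti-join:

```r
library(data.table)

# Hypothetical toy tables; only the key column matters here
obs <- data.table(matchcode = c("NLD2010", "NLD2014", "USA2012"))
cty <- data.table(matchcode = c("NLD2009", "NLD2014", "USA2013"))

# Anti-join: rows of obs whose matchcode does not occur in cty
unmatched <- obs[!cty, on = "matchcode"]
unmatched$matchcode
```

Only the keys listed in unmatched would need the nearest-year treatment; the rest already match exactly.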

The problem is that the years sometimes do not match exactly, so I tried a rolling join:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

But this reduces the number of observations to 11:

# A tibble: 11 x 18
       A     B     C     D     E     F     G iso    year matchcode     K     L     M     N     O     P     Q i.matchcode
   <int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct>     <int> <int> <int> <int> <int> <int> <int> <fct>      
 1     0     1     1     1     0     1     0 NLD    2009 NLD2010       0     1     1     1     0     1     0 NLD2009    
 2     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    
 3     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    
 4     0     0     0     1     1     0     0 AUS    2011 AUS2010       0     0     0     1     1     0     0 AUS2011    
 5     1     0     1     0     0     1     0 AUS    2007 AUS2006       1     0     1     0     0     1     0 AUS2007    
 6     0     1     0     1     0     1     1 USA    2007 USA2008       0     1     0     1     0     1     1 USA2007    
 7     0     0     1     0     0     0     1 USA    2011 USA2010       0     0     1     0     0     0     1 USA2010    
 8     0     1     0     1     0     0     0 USA    2013 USA2012       0     1     0     1     0     0     0 USA2013    
 9     1     0     1     0     0     1     0 BLG    2007 BLG2008       1     0     1     0     0     1     0 BLG2007    
10     0     1     0     1     1     0     1 BEL    2009 BEL2008       0     1     0     1     1     0     1 BEL2009    
11     1     0     1     0     0     1     0 BEL    2012 BEL2010       1     0     1     0     0     1     0 BEL2012   
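
The row count probably follows from the direction of the join: in X[i], data.table looks up each row of i in X, so the result is driven by the rows of i. Here is a minimal sketch on hypothetical toy tables (obs, cty, x and h are made-up names) that compares both directions:

```r
library(data.table)

# Hypothetical toy tables: three observations, two country-level rows
obs <- data.table(iso = c("NLD", "NLD", "USA"),
                  year = c(2010L, 2014L, 2012L),
                  x = 1:3)
cty <- data.table(iso = c("NLD", "USA"),
                  year = c(2009L, 2013L),
                  h = c(10L, 20L))

# obs[cty]: one lookup per row of cty, so observations can be lost
a <- obs[cty, on = .(iso, year), roll = "nearest", nomatch = 0]

# cty[obs]: one lookup per row of obs, so every observation is kept
b <- cty[obs, on = .(iso, year), roll = "nearest"]

nrow(a)
nrow(b)
```

With the individual observations in the i position, each of them keeps its own row and receives the country-level values of the nearest year.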

The preferred output would be as follows:

#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009
#11: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
#12: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
#13: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
#14: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
#15: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
#16: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
#17: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
#18: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
#19: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#20: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

Additional Sources:

1. The previous try

2. The try before that


1 Reply


Here is my (default) approach for a join like this, using data.table:

code

library(data.table)

# rename the key columns so that both versions survive the join
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

# store the column order, to restore it at the end
namesA <- as.character(names(dfA))
namesB <- as.character(setdiff(names(dfB), names(dfA)))
colorder <- c(namesA, namesB)

# create the columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

# left join: each row of dfA gets the dfB row with the same iso and the nearest year
result <- dfB[dfA, on = c("iso.join", "year.join"), roll = "nearest"]

# drop dfA's copies of the shared columns A-D (prefixed "i.") and the helper join columns
result[, grep("^i\\.", names(result)) := NULL]
result[, grep("join$", names(result)) := NULL]

#set column order
setcolorder(result, colorder)

result

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009
#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014
#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011
#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007
#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007
#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010
#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010
#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007
#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009
# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009
# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009
# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014
# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011
# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007
# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007
# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010
# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010
# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007
# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009
# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009
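
As a possible variation (a sketch on hypothetical toy tables, not the code that produced the output above), the x. and i. prefixes in j can pick columns from either table directly, which avoids the renaming and the helper join columns. The gap column is an added assumption: because roll = "nearest" matches at any distance, it makes far-off matches visible.

```r
library(data.table)

# Hypothetical toy tables (obs = observations, cty = country-level data)
obs <- data.table(iso = c("NLD", "NLD", "USA"),
                  year = c(2010L, 2014L, 2012L),
                  E = c(0L, 1L, 0L))
cty <- data.table(iso = c("NLD", "USA"),
                  year = c(2009L, 2013L),
                  H = c(1L, 0L))

# i.* refers to columns of obs (the i table), x.* to the matched row of cty
res <- cty[obs, on = .(iso, year), roll = "nearest",
           .(iso, yearA = i.year, E = i.E, yearB = x.year, H = x.H)]

# distance in years between the observation and the matched country row
res[, gap := abs(yearA - yearB)]
res
```

Rows with a large gap could then be filtered out or set to NA, depending on how far apart the years are allowed to be.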

sample data

dfA <- fread(
  "A   B   C   D   E   F   G   iso   year   matchcode
  0   1   1   1   0   1   0   NLD   2010   NLD2010
     1   0   0   0   1   0   1   NLD   2014   NLD2014
     0   0   0   1   1   0   0   AUS   2010   AUS2010
     1   0   1   0   0   1   0   AUS   2006   AUS2006
     0   1   0   1   0   1   1   USA   2008   USA2008
     0   0   1   0   0   0   1   USA   2010   USA2010
     0   1   0   1   0   0   0   USA   2012   USA2012
     1   0   1   0   0   1   0   BLG   2008   BLG2008
     0   1   0   1   1   0   1   BEL   2008   BEL2008
    1   0   1   0   0   1   0   BEL   2010   BEL2010
    0   1   1   1   0   1   0   NLD   2010   NLD2010
    1   0   0   0   1   0   1   NLD   2014   NLD2014
    0   0   0   1   1   0   0   AUS   2010   AUS2010
    1   0   1   0   0   1   0   AUS   2006   AUS2006
    0   1   0   1   0   1   1   USA   2008   USA2008
    0   0   1   0   0   0   1   USA   2010   USA2010
    0   1   0   1   0   0   0   USA   2012   USA2012
    1   0   1   0   0   1   0   BLG   2008   BLG2008
    0   1   0   1   1   0   1   BEL   2008   BEL2008
    1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE
)


dfB <- fread(
  "A   B   C   D   H   I   J   iso   year   matchcode
     0   1   1   1   0   1   0   NLD   2009   NLD2009
     1   0   0   0   1   0   1   NLD   2014   NLD2014
     0   0   0   1   1   0   0   AUS   2011   AUS2011
     1   0   1   0   0   1   0   AUS   2007   AUS2007
     0   1   0   1   0   1   1   USA   2007   USA2007
     0   0   1   0   0   0   1   USA   2011   USA2010
     0   1   0   1   0   0   0   USA   2013   USA2013
     1   0   1   0   0   1   0   BLG   2007   BLG2007
     0   1   0   1   1   0   1   BEL   2009   BEL2009
     1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)
