r - dplyr::inner_join -- how to include certain observations even if they don't appear in both dataframes

Question

Welcome To Ask or Share your Answers For Others

r - dplyr::inner_join -- how to include certain observations even if they don't appear in both dataframes

posted Feb 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - dplyr::inner_join -- how to include certain observations even if they don't appear in both dataframes

I have two dataframes that I want to merge by a common variable. Some observations appear in one data but not in the other, and vice verse. When merging, I want to keep only the observations that appear in both dataframes, and therefore dplyr::inner_join() is appropriate.

However, there's an exclusion. There are some observations that I'd like to include in the merged data no matter what. That is, even if they don't appear in both original dataframes. The indication to which ones are those "special" observations to keep is given as certain values in certain columns.

Example

I want to merge the following dataframes (df_population and df_gdp)

1. df_population

library(tibble)
library(dplyr)

## helper function
myFun <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

set.seed(2021)

df_population <-
  tribble(~country, ~population,
        "australia", 24.99,
        "united_kingdom", 66.65,
        "france", 66.99,
        "spain", 46.94,
        "canada", 37.59,
        "brazil", 209.5) %>%
  mutate(col_of_strings = c(myFun(5), "dont_leave_me_behind"))

## # A tibble: 6 x 3
##   country        population col_of_strings      
##   <chr>               <dbl> <chr>               
## 1 australia            25.0 GLNWN9968R          
## 2 united_kingdom       66.6 FTELH3426F          
## 3 france               67.0 NFOSZ6335V          
## 4 spain                46.9 ZFGRD8875F          
## 5 canada               37.6 GFICE2875O          
## 6 brazil              210.  dont_leave_me_behind

2. df_gdp

df_gdp <-
  tribble(~country, ~gdp_growth,
        "australia", 2.9,
        "united_kingdom", 1.4,
        "france", 1.7,
        "spain", 2.4,
        "canada", 1.9,
        "greece", 1.9) %>%
  mutate(col_of_strings = sample(c(myFun(5), "dont_leave_me_behind")))


## # A tibble: 6 x 3
##   country        gdp_growth col_of_strings      
##   <chr>               <dbl> <chr>               
## 1 australia             2.9 dont_leave_me_behind
## 2 united_kingdom        1.4 RQHHI9679V          
## 3 france                1.7 PFSZX1552L          
## 4 spain                 2.4 BQTBY7537E          
## 5 canada                1.9 OECIK9698V          
## 6 greece                1.9 VXDQQ4718J

My Problem

Normally I'd go with

dplyr::inner_join(df_population, df_gdp, by = "country")

But: Although I want only the countries that are common to both dataframes, I still want to include any country that has col_of_strings == dont_leave_me_behind

I hope there's a simple solution to this. Thanks!

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-02-16T17:44:58+0000

Three ways come to mind.

Per @M.Viking's suggestion, full_join first and then filter.

dplyr::full_join(df_population, df_gdp, by = "country") %>%
  dplyr::filter(
    col_of_strings.y == "dont_leave_me_behind" | !is.na(col_of_strings.x),
    col_of_strings.x == "dont_leave_me_behind" | !is.na(col_of_strings.y)
  )
# # A tibble: 6 x 5
#   country        population col_of_strings.x     gdp_growth col_of_strings.y    
#   <chr>               <dbl> <chr>                     <dbl> <chr>               
# 1 australia            25.0 LQMPB3662R                  2.9 VKBCE2969H          
# 2 united_kingdom       66.6 WDXVX4684T                  1.4 FMAKF4470M          
# 3 france               67.0 VJHBH0078U                  1.7 dont_leave_me_behind
# 4 spain                46.9 XFJPD7687T                  2.4 RMPYK2467U          
# 5 canada               37.6 AQRCR0724P                  1.9 JXMMZ3736X          
# 6 brazil              210.  dont_leave_me_behind       NA   <NA>

Do the inner join, extract missing rows from each frame, and bind_rows them back in. This steps requires some renaming because of the .x/.y names post-join.

tmp1 <- dplyr::inner_join(df_population, df_gdp, by = "country")

missing_pop <- df_population %>%
  dplyr::filter(
    col_of_strings == "dont_leave_me_behind",
    !country %in% tmp1$country
  ) %>%
  dplyr::rename(col_of_strings.x = col_of_strings)
missing_pop
# # A tibble: 1 x 3
#   country population col_of_strings.x    
#   <chr>        <dbl> <chr>               
# 1 brazil        210. dont_leave_me_behind

missing_gdp <- df_gdp %>%
  dplyr::filter(
    col_of_strings == "dont_leave_me_behind",
    !country %in% tmp1$country
  ) %>%
  dplyr::rename(col_of_strings.y = col_of_strings)
missing_gdp
# # A tibble: 0 x 3
# # ... with 3 variables: country <chr>, gdp_growth <dbl>, col_of_strings.y <chr>

out <- dplyr::bind_rows(tmp1, missing_pop, missing_gdp)
out
# # A tibble: 6 x 5
#   country        population col_of_strings.x     gdp_growth col_of_strings.y    
#   <chr>               <dbl> <chr>                     <dbl> <chr>               
# 1 australia            25.0 LQMPB3662R                  2.9 VKBCE2969H          
# 2 united_kingdom       66.6 WDXVX4684T                  1.4 FMAKF4470M          
# 3 france               67.0 VJHBH0078U                  1.7 dont_leave_me_behind
# 4 spain                46.9 XFJPD7687T                  2.4 RMPYK2467U          
# 5 canada               37.6 AQRCR0724P                  1.9 JXMMZ3736X          
# 6 brazil              210.  dont_leave_me_behind       NA   <NA>

Similar to 2, but using anti_join:

tmp1 <- dplyr::inner_join(df_population, df_gdp, by = "country")
out <- dplyr::bind_rows(
  tmp1,
  dplyr::filter(df_population, col_of_strings == "dont_leave_me_behind") %>%
    dplyr::anti_join(., tmp1, by = "country") %>%
    dplyr::rename(col_of_strings.x = col_of_strings),
  dplyr::filter(df_gdp, col_of_strings == "dont_leave_me_behind") %>%
    anti_join(., tmp1, by = "country") %>%
    dplyr::rename(col_of_strings.y = col_of_strings)
)

The latter two perform about the same with benchmarking:

bench::mark(full1=..., inner2=..., inner3=...)
# # A tibble: 3 x 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 full1       5.83ms  7.72ms     127.     1.36KB     4.32    59     2      463ms <tibb~ <Rpro~
# 2 inner2      9.54ms 11.46ms      84.9   11.28KB     2.07    41     1      483ms <tibb~ <Rpro~
# 3 inner3     13.95ms 14.92ms      62.9   11.28KB     4.49    28     2      445ms <tibb~ <Rpro~
# # ... with 2 more variables: time <list>, gc <list>

where the full_join works much better in this case. Larger data may perform significantly differently, I haven't tested it yet.

Categories

r - dplyr::inner_join -- how to include certain observations even if they don't appear in both dataframes

r - dplyr::inner_join -- how to include certain observations even if they don't appear in both dataframes

Example

My Problem

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags