Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
246 views
in Technique[技术] by (71.8m points)

r - dplyr::inner_join -- how to include certain observations even if they don't appear in both dataframes

I have two dataframes that I want to merge by a common variable. Some observations appear in one data but not in the other, and vice verse. When merging, I want to keep only the observations that appear in both dataframes, and therefore dplyr::inner_join() is appropriate.

However, there's an exclusion. There are some observations that I'd like to include in the merged data no matter what. That is, even if they don't appear in both original dataframes. The indication to which ones are those "special" observations to keep is given as certain values in certain columns.

Example

I want to merge the following dataframes (df_population and df_gdp)

1. df_population

library(tibble)
library(dplyr)

## helper function
myFun <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

set.seed(2021)

df_population <-
  tribble(~country, ~population,
        "australia", 24.99,
        "united_kingdom", 66.65,
        "france", 66.99,
        "spain", 46.94,
        "canada", 37.59,
        "brazil", 209.5) %>%
  mutate(col_of_strings = c(myFun(5), "dont_leave_me_behind"))

## # A tibble: 6 x 3
##   country        population col_of_strings      
##   <chr>               <dbl> <chr>               
## 1 australia            25.0 GLNWN9968R          
## 2 united_kingdom       66.6 FTELH3426F          
## 3 france               67.0 NFOSZ6335V          
## 4 spain                46.9 ZFGRD8875F          
## 5 canada               37.6 GFICE2875O          
## 6 brazil              210.  dont_leave_me_behind

2. df_gdp

df_gdp <-
  tribble(~country, ~gdp_growth,
        "australia", 2.9,
        "united_kingdom", 1.4,
        "france", 1.7,
        "spain", 2.4,
        "canada", 1.9,
        "greece", 1.9) %>%
  mutate(col_of_strings = sample(c(myFun(5), "dont_leave_me_behind")))


## # A tibble: 6 x 3
##   country        gdp_growth col_of_strings      
##   <chr>               <dbl> <chr>               
## 1 australia             2.9 dont_leave_me_behind
## 2 united_kingdom        1.4 RQHHI9679V          
## 3 france                1.7 PFSZX1552L          
## 4 spain                 2.4 BQTBY7537E          
## 5 canada                1.9 OECIK9698V          
## 6 greece                1.9 VXDQQ4718J 

My Problem

Normally I'd go with

dplyr::inner_join(df_population, df_gdp, by = "country")

But: Although I want only the countries that are common to both dataframes, I still want to include any country that has col_of_strings == dont_leave_me_behind

I hope there's a simple solution to this. Thanks!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Three ways come to mind.

  1. Per @M.Viking's suggestion, full_join first and then filter.

    dplyr::full_join(df_population, df_gdp, by = "country") %>%
      dplyr::filter(
        col_of_strings.y == "dont_leave_me_behind" | !is.na(col_of_strings.x),
        col_of_strings.x == "dont_leave_me_behind" | !is.na(col_of_strings.y)
      )
    # # A tibble: 6 x 5
    #   country        population col_of_strings.x     gdp_growth col_of_strings.y    
    #   <chr>               <dbl> <chr>                     <dbl> <chr>               
    # 1 australia            25.0 LQMPB3662R                  2.9 VKBCE2969H          
    # 2 united_kingdom       66.6 WDXVX4684T                  1.4 FMAKF4470M          
    # 3 france               67.0 VJHBH0078U                  1.7 dont_leave_me_behind
    # 4 spain                46.9 XFJPD7687T                  2.4 RMPYK2467U          
    # 5 canada               37.6 AQRCR0724P                  1.9 JXMMZ3736X          
    # 6 brazil              210.  dont_leave_me_behind       NA   <NA>                
    
  2. Do the inner join, extract missing rows from each frame, and bind_rows them back in. This steps requires some renaming because of the .x/.y names post-join.

    tmp1 <- dplyr::inner_join(df_population, df_gdp, by = "country")
    
    missing_pop <- df_population %>%
      dplyr::filter(
        col_of_strings == "dont_leave_me_behind",
        !country %in% tmp1$country
      ) %>%
      dplyr::rename(col_of_strings.x = col_of_strings)
    missing_pop
    # # A tibble: 1 x 3
    #   country population col_of_strings.x    
    #   <chr>        <dbl> <chr>               
    # 1 brazil        210. dont_leave_me_behind
    
    missing_gdp <- df_gdp %>%
      dplyr::filter(
        col_of_strings == "dont_leave_me_behind",
        !country %in% tmp1$country
      ) %>%
      dplyr::rename(col_of_strings.y = col_of_strings)
    missing_gdp
    # # A tibble: 0 x 3
    # # ... with 3 variables: country <chr>, gdp_growth <dbl>, col_of_strings.y <chr>
    
    out <- dplyr::bind_rows(tmp1, missing_pop, missing_gdp)
    out
    # # A tibble: 6 x 5
    #   country        population col_of_strings.x     gdp_growth col_of_strings.y    
    #   <chr>               <dbl> <chr>                     <dbl> <chr>               
    # 1 australia            25.0 LQMPB3662R                  2.9 VKBCE2969H          
    # 2 united_kingdom       66.6 WDXVX4684T                  1.4 FMAKF4470M          
    # 3 france               67.0 VJHBH0078U                  1.7 dont_leave_me_behind
    # 4 spain                46.9 XFJPD7687T                  2.4 RMPYK2467U          
    # 5 canada               37.6 AQRCR0724P                  1.9 JXMMZ3736X          
    # 6 brazil              210.  dont_leave_me_behind       NA   <NA>                
    
  3. Similar to 2, but using anti_join:

    tmp1 <- dplyr::inner_join(df_population, df_gdp, by = "country")
    out <- dplyr::bind_rows(
      tmp1,
      dplyr::filter(df_population, col_of_strings == "dont_leave_me_behind") %>%
        dplyr::anti_join(., tmp1, by = "country") %>%
        dplyr::rename(col_of_strings.x = col_of_strings),
      dplyr::filter(df_gdp, col_of_strings == "dont_leave_me_behind") %>%
        anti_join(., tmp1, by = "country") %>%
        dplyr::rename(col_of_strings.y = col_of_strings)
    )
    

The latter two perform about the same with benchmarking:

bench::mark(full1=..., inner2=..., inner3=...)
# # A tibble: 3 x 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 full1       5.83ms  7.72ms     127.     1.36KB     4.32    59     2      463ms <tibb~ <Rpro~
# 2 inner2      9.54ms 11.46ms      84.9   11.28KB     2.07    41     1      483ms <tibb~ <Rpro~
# 3 inner3     13.95ms 14.92ms      62.9   11.28KB     4.49    28     2      445ms <tibb~ <Rpro~
# # ... with 2 more variables: time <list>, gc <list>

where the full_join works much better in this case. Larger data may perform significantly differently, I haven't tested it yet.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...