Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
808 views
in Technique[技术] by (71.8m points)

search - r Finding a match based on age and sex

I have a dataset with these columns

 ID    Cancer.Date    Age   Gender   Col1     Col2  
 15    1998-03-26     35    F        Yes       No
 53    NA             65    F        No        Yes
 37    1996-11-10     84    M        Yes       No
 58    NA             90    F        Yes       No
 60    2016-12-08     70    M        Yes       No
 12    2000-04-29     20    M        No        Yes
 46    NA             72    F        Yes       No
 59    2008-05-26     34    F        Yes       No
 99    NA             89    M        Yes       No
 46    2009-06-22     87    M        No        Yes
 35    2000-02-20     24    F        Yes       Yes
 26    NA             80    F        Yes       No
 43    2001-02-20     74    M        No        No
 77    NA             81    F        No        Yes
 16    2015-11-03     52    F        No        Yes
 04    NA             27    M        Yes       No
 82    2004-05-08     45    M        No        No
 01    2006-04-25     49    F        No        Yes
 92    2004-10-26     40    F        Yes       Yes
 67    2002-09-20     67    F        No        No
            

My goal is to perform the following tasks.

Step1: Arrange the Cancer.Date column in ascending order. Earliest date on top. This case row with date 1996-11-10

Step2: Check if the date is NA. If the date is not NA, then find 3 observations that are similar to that row in Gender and closest in Age.

For example, after sorting by date (earliest first), the third row will be the 1st row. The Gender = M, Age = 84. So the three IDs that similar in gender and closest in Age are , (ID 46, Gender =M, Age = 87), (ID 99, Gender =M, Age = 89), (ID 43, Gender =M, Age = 74).

Step3: Repeat Step2 for all rows where Cancer.Date is not NA (Not Missing).

The expected Output

 ID    Cancer.Date    Age   Gender   Col1     Col2  Match.ID 
 37    1996-11-10     84    M        Yes       No   46,99,43
 15    1998-03-26     35    F        Yes       No   59,35,12
 .     .              .     .        .         .    .

Perhaps I could do this using for-loops, subset by Gender and distance by age but I suspect this would be painfully slow. I would appreciate any suggestions on accomplishing this more efficiently.

question from:https://stackoverflow.com/questions/65880695/r-finding-a-match-based-on-age-and-sex

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can use purr::map to make this work.

library(tidyverse)
read.table(textConnection("ID    Cancer.Date    Age   Gender   Col1     Col2  
                          15    1998-03-26     35    F        Yes       No
                          53    NA             65    F        No        Yes
                          37    1996-11-10     84    M        Yes       No
                          58    NA             90    F        Yes       No
                          60    2016-12-08     70    M        Yes       No
                          12    2000-04-29     20    M        No        Yes
                          46    NA             72    F        Yes       No
                          59    2008-05-26     34    F        Yes       No
                          99    NA             89    M        Yes       No
                          46    2009-06-22     87    M        No        Yes
                          35    2000-02-20     24    F        Yes       Yes
                          26    NA             80    F        Yes       No
                          43    2001-02-20     74    M        No        No
                          77    NA             81    F        No        Yes
                          16    2015-11-03     52    F        No        Yes
                          04    NA             27    M        Yes       No
                          82    2004-05-08     45    M        No        No
                          01    2006-04-25     49    F        No        Yes
                          92    2004-10-26     40    F        Yes       Yes
                          67    2002-09-20     67    F        No        No"), header = T) %>% 
  as_tibble() -> df
df
#> # A tibble: 20 x 6
#>       ID Cancer.Date   Age Gender Col1  Col2 
#>    <int> <chr>       <int> <chr>  <chr> <chr>
#>  1    15 1998-03-26     35 F      Yes   No   
#>  2    53 <NA>           65 F      No    Yes  
#>  3    37 1996-11-10     84 M      Yes   No   
#>  4    58 <NA>           90 F      Yes   No   
#>  5    60 2016-12-08     70 M      Yes   No   
#>  6    12 2000-04-29     20 M      No    Yes  
#>  7    46 <NA>           72 F      Yes   No   
#>  8    59 2008-05-26     34 F      Yes   No   
#>  9    99 <NA>           89 M      Yes   No   
#> 10    46 2009-06-22     87 M      No    Yes  
#> 11    35 2000-02-20     24 F      Yes   Yes  
#> 12    26 <NA>           80 F      Yes   No   
#> 13    43 2001-02-20     74 M      No    No   
#> 14    77 <NA>           81 F      No    Yes  
#> 15    16 2015-11-03     52 F      No    Yes  
#> 16     4 <NA>           27 M      Yes   No   
#> 17    82 2004-05-08     45 M      No    No   
#> 18     1 2006-04-25     49 F      No    Yes  
#> 19    92 2004-10-26     40 F      Yes   Yes  
#> 20    67 2002-09-20     67 F      No    No

df %>% 
  mutate(Cancer.Date = Cancer.Date %>% lubridate::as_date()) %>% 
  arrange(Cancer.Date) -> df1

df1
#> # A tibble: 20 x 6
#>       ID Cancer.Date   Age Gender Col1  Col2 
#>    <int> <date>      <int> <chr>  <chr> <chr>
#>  1    37 1996-11-10     84 M      Yes   No   
#>  2    15 1998-03-26     35 F      Yes   No   
#>  3    35 2000-02-20     24 F      Yes   Yes  
#>  4    12 2000-04-29     20 M      No    Yes  
#>  5    43 2001-02-20     74 M      No    No   
#>  6    67 2002-09-20     67 F      No    No   
#>  7    82 2004-05-08     45 M      No    No   
#>  8    92 2004-10-26     40 F      Yes   Yes  
#>  9     1 2006-04-25     49 F      No    Yes  
#> 10    59 2008-05-26     34 F      Yes   No   
#> 11    46 2009-06-22     87 M      No    Yes  
#> 12    16 2015-11-03     52 F      No    Yes  
#> 13    60 2016-12-08     70 M      Yes   No   
#> 14    53 NA             65 F      No    Yes  
#> 15    58 NA             90 F      Yes   No   
#> 16    46 NA             72 F      Yes   No   
#> 17    99 NA             89 M      Yes   No   
#> 18    26 NA             80 F      Yes   No   
#> 19    77 NA             81 F      No    Yes  
#> 20     4 NA             27 M      Yes   No

closest <- function(x, df = df1){
  if(is.na(x)){
    NA
    } else{
      df1 %>% 
        filter(Cancer.Date == x) -> s_row 
      df1 %>% 
        filter((Gender == s_row$Gender & !Cancer.Date == x) %>% replace_na(T)) %>% 
        pull(Age) %>% 
        enframe(name = NULL) %>% 
        mutate(num = s_row$Age,
               diff = abs(num-value)) %>% 
        arrange(diff) %>% 
        slice(1:3) %>% 
        pull(value) -> near_ages
      df1 %>% 
        filter(Age %in% near_ages & Gender == s_row$Gender) %>% 
        pull(ID) %>% 
        paste(collapse = ",")
    }
}

df1 %>% 
  mutate(Match.ID = Cancer.Date %>% map_chr(closest))
#> # A tibble: 20 x 7
#>       ID Cancer.Date   Age Gender Col1  Col2  Match.ID
#>    <int> <date>      <int> <chr>  <chr> <chr> <chr>   
#>  1    37 1996-11-10     84 M      Yes   No    43,46,99
#>  2    15 1998-03-26     35 F      Yes   No    35,92,59
#>  3    35 2000-02-20     24 F      Yes   Yes   15,92,59
#>  4    12 2000-04-29     20 M      No    Yes   82,60,4 
#>  5    43 2001-02-20     74 M      No    No    37,46,60
#>  6    67 2002-09-20     67 F      No    No    53,46,26
#>  7    82 2004-05-08     45 M      No    No    12,60,4 
#>  8    92 2004-10-26     40 F      Yes   Yes   15,1,59 
#>  9     1 2006-04-25     49 F      No    Yes   15,92,16
#> 10    59 2008-05-26     34 F      Yes   No    15,35,92
#> 11    46 2009-06-22     87 M      No    Yes   37,43,99
#> 12    16 2015-11-03     52 F      No    Yes   92,1,53 
#> 13    60 2016-12-08     70 M      Yes   No    37,43,46
#> 14    53 NA             65 F      No    Yes   <NA>    
#> 15    58 NA             90 F      Yes   No    <NA>    
#> 16    46 NA             72 F      Yes   No    <NA>    
#> 17    99 NA             89 M      Yes   No    <NA>    
#> 18    26 NA             80 F      Yes   No    <NA>    
#> 19    77 NA             81 F      No    Yes   <NA>    
#> 20     4 NA             27 M      Yes   No    <NA>

if you want to increase the efficiency you can look into furrr package which will make the code run in parallel.

Created on 2021-01-25 by the reprex package (v0.3.0)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...