r - Merging of dataframes based off substrings in a column

Question

posted Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I have two data frames. One (df_protein) contains experimentally measured data from protein fragments carrying a modification; the other (df_modificaton) is a database of the names of all modifications. I am trying to merge the two.

Both have a column with the modified sequence (the modified amino acid is marked with an asterisk). However, df_protein stores the sequence of the whole fragment (!), starting and ending with "_", while df_modificaton gives only the 7 amino acids before and after the modification (if the modification is near the start or the end of the protein, the remaining places are filled with "_").

To illustrate, here is an MWE:

library(tibble)  # tibble() replaces the deprecated data_frame()

# Measured fragments: full fragment sequence, wrapped in "_"
df_protein <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250)
)

# Modification database: short window around the modified residue, padded with "_"
df_modificaton <- tibble(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG", "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)

# How can I merge the above to the following result:
df_merged <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250),
  Modification = c("Y77", "S125", "S127", "T456", "S3")
)

I am using tidyverse but I am also fine with other packages. Thanks.

Question from: https://stackoverflow.com/questions/65829768/merging-of-dataframes-based-off-substrings-in-a-column

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:37:17+0000

One approach is to use the fuzzyjoin package to perform a stringdist join (here with the Jaro-Winkler distance) and keep the closest match for each fragment:

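library(dplyr)
library(fuzzyjoin)

# Join on the Jaro-Winkler ("jw") string distance between the Sequence columns,
# then keep the closest candidate for each fragment.
stringdist_inner_join(df_protein, df_modificaton,
                      by = "Sequence", method = "jw", distance_col = "distance") %>%
  group_by(Sequence.x) %>%
  slice_min(distance)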
# A tibble: 5 x 7
# Groups:   Sequence.x [5]
  Protein.x Sequence.x              Counts Protein.y Sequence.y       Modification distance
  <chr>     <chr>                    <dbl> <chr>     <chr>            <chr>           <dbl>
1 A         _EPTPSIASDIY*LPIATQELR_   3.46 A         PSIASDIY*LPIATQ  Y77             0.260
2 A         _S*SSSLLASPGHISVK_        6.13 A         PEQRLSSS*SLLASPG S127            0.294
3 B         _SMS*VDLSHIPLK_           7.25 B         _____SMS*VDLSHIP S3              0.15 
4 A         _SSS*SLLASPGHISVK_       10.0  A         PEQRLSSS*SLLASPG S127            0.294
5 B         _TQDPVPPET*PSDSDHK_       0    B         DPVPPET*PSDSDHK  T456            0.137
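
Note that in the output above `_S*SSSLLASPGHISVK_` is paired with S127, whereas the question expects S125, so the smallest string distance does not always pick the intended window. As a possible alternative (not part of the original answer), here is a minimal sketch of an exact match. It assumes that every sequence contains exactly one asterisk and that "_" only ever appears as padding. After the padding is removed, a fragment and a window are treated as matching when the text up to the asterisk agrees at its end (one is a suffix of the other) and the text after the asterisk agrees at its start (one is a prefix of the other). The helper `split_at_mod` and the columns `frag_id`, `frag_left`, `frag_right`, `win_left` and `win_right` are names made up for this illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Data from the question
df_protein <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_",
               "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250)
)
df_modificaton <- tibble(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG",
               "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)

# Hypothetical helper: drop the "_" padding and split each sequence at the
# asterisk. Returns a two-column character matrix (before "*", after "*").
split_at_mod <- function(x) {
  str_split_fixed(str_remove_all(x, "_"), fixed("*"), 2)
}

prot_parts <- split_at_mod(df_protein$Sequence)
mod_parts  <- split_at_mod(df_modificaton$Sequence)

prot <- df_protein %>%
  mutate(frag_id = row_number(),          # remembers the original row order
         frag_left = prot_parts[, 1],
         frag_right = prot_parts[, 2])

mods <- df_modificaton %>%
  transmute(Protein, Modification,
            win_left = mod_parts[, 1],
            win_right = mod_parts[, 2])

# Pair every fragment with every window of the same protein, then keep only
# the pairs whose flanks are consistent on both sides of the asterisk.
df_exact <- merge(prot, mods, by = "Protein") %>%
  filter(endsWith(frag_left, win_left) | endsWith(win_left, frag_left),
         startsWith(frag_right, win_right) | startsWith(win_right, frag_right)) %>%
  arrange(frag_id) %>%
  select(Protein, Sequence, Counts, Modification)

print(df_exact)
```

Restricting candidates to the same Protein keeps windows from unrelated proteins out of the comparison. A very short fragment could still be consistent with more than one window, in which case it would appear once per matching window, so it may be worth checking the row count against `nrow(df_protein)`.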
