Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Merging of dataframes based off substrings in a column

I have two data frames. One (df_protein) contains experimentally measured data from protein fragments carrying a modification; the other (df_modificaton) is a database of the names of all modifications. I am trying to merge the two.

Both have a column with the modified sequence (the modified amino acid is marked with an asterisk). In df_protein the sequence of the whole fragment (!) is stored (starting and ending with "_"), while in df_modificaton only the 7 amino acids before and after the modification are given (if the modification is near the start or the end of the protein, the remaining places are filled with "_").

For better illustration, here is an MWE:

library(tibble)  # tibble() replaces the deprecated data_frame()
df_protein <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_" , "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_") ,
  Counts = c(3.456, 6.126, 10.023 ,0.000, 7.250)
)

df_modificaton <- tibble(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG", "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL","_____SMS*VDLSHIP"), 
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)

# How can I merge the above to the following result:
df_merged <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_" , "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_") ,
  Counts = c(3.456, 6.126, 10.023 ,0.000, 7.250),
  Modification = c("Y77", "S125", "S127", "T456", "S3")
) 

I am using tidyverse but I am also fine with other packages. Thanks.

Question from: https://stackoverflow.com/questions/65829768/merging-of-dataframes-based-off-substrings-in-a-column


1 Reply


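One approach is to use the fuzzyjoin package to perform a stringdist join on Sequence (here with the Jaro-Winkler distance, method = "jw") and then keep the closest match for each fragment. The match is approximate: in the output below, _S*SSSLLASPGHISVK_ is paired with S127, whereas the expected result in the question lists S125 for that fragment.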

library(dplyr)
library(fuzzyjoin)
stringdist_inner_join(df_protein, df_modificaton,
                      by = "Sequence", method = "jw", distance_col = "distance") %>%
  group_by(Sequence.x) %>%
  slice_min(distance)
# A tibble: 5 x 7
# Groups:   Sequence.x [5]
  Protein.x Sequence.x              Counts Protein.y Sequence.y       Modification distance
  <chr>     <chr>                    <dbl> <chr>     <chr>            <chr>           <dbl>
1 A         _EPTPSIASDIY*LPIATQELR_   3.46 A         PSIASDIY*LPIATQ  Y77             0.260
2 A         _S*SSSLLASPGHISVK_        6.13 A         PEQRLSSS*SLLASPG S127            0.294
3 B         _SMS*VDLSHIPLK_           7.25 B         _____SMS*VDLSHIP S3              0.15 
4 A         _SSS*SLLASPGHISVK_       10.0  A         PEQRLSSS*SLLASPG S127            0.294
5 B         _TQDPVPPET*PSDSDHK_       0    B         DPVPPET*PSDSDHK  T456            0.137
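Because a string-distance join only picks the most similar sequence, an exact alternative may be worth considering. The following is a minimal sketch in base R, not part of the original answer. It assumes every sequence contains exactly one asterisk and that "_" is only padding. The helper names left_of_star, right_of_star and overlaps are hypothetical, and the example data is rebuilt with data.frame() so the block runs without extra packages. The idea is to split each sequence at the asterisk and accept a pair when the shorter left part is a suffix of the longer one and the shorter right part is a prefix of the longer one.

```r
# Example data from the question, rebuilt with base R
df_protein <- data.frame(
  Protein  = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_",
               "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts   = c(3.456, 6.126, 10.023, 0.000, 7.250),
  stringsAsFactors = FALSE
)
df_modificaton <- data.frame(
  Protein      = c("A", "A", "A", "B", "B", "B"),
  Sequence     = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG",
                   "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3"),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: drop the "_" padding, then take the text
# before or after the asterisk (assumes exactly one asterisk)
left_of_star  <- function(x) sub("\\*.*$", "", gsub("_", "", x, fixed = TRUE))
right_of_star <- function(x) sub("^.*\\*", "", gsub("_", "", x, fixed = TRUE))

# Hypothetical helper: compare two character vectors over the length of the
# shorter string, counted from the end (suffix) or from the start (prefix)
overlaps <- function(a, b, from_end) {
  n <- pmin(nchar(a), nchar(b))
  if (from_end) {
    substring(a, nchar(a) - n + 1, nchar(a)) == substring(b, nchar(b) - n + 1, nchar(b))
  } else {
    substring(a, 1, n) == substring(b, 1, n)
  }
}

# Remember the original row order, and rename the database sequence column
frags <- cbind(row = seq_len(nrow(df_protein)), df_protein)
mods  <- df_modificaton
names(mods)[names(mods) == "Sequence"] <- "Window"

# Pair every fragment with every database entry of the same protein
candidates <- merge(frags, mods, by = "Protein")

# Keep pairs whose sequences agree on both sides of the asterisk
keep <- overlaps(left_of_star(candidates$Sequence),
                 left_of_star(candidates$Window), from_end = TRUE) &
        overlaps(right_of_star(candidates$Sequence),
                 right_of_star(candidates$Window), from_end = FALSE)

hits <- candidates[keep, ]
hits <- hits[order(hits$row), ]
df_merged <- hits[, c("Protein", "Sequence", "Counts", "Modification")]
rownames(df_merged) <- NULL
print(df_merged)
```

On the sample data this keeps one database entry per fragment. With real data, a fragment could match several entries or none, so the number of rows in df_merged should be checked against df_protein.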
