I have two data frames: one (df_protein) contains experimentally measured data from protein fragments carrying a modification, and the other (df_modificaton) is a database of the names of all modifications. I am trying to merge the two.
Both have a column with the modified sequence (the modified amino acid is followed by an asterisk). However, df_protein stores the sequence of the whole fragment (!), starting and ending with "_", while df_modificaton only gives the 7 amino acids before and after the modification (if the modification is near the start or the end of the protein, the remaining positions are filled with "_").
To illustrate, here is a minimal working example (MWE):
library(tidyverse)

# Measured fragments: full fragment sequence, delimited by "_"
df_protein <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250)
)
# Modification database: sequence window around the modified residue
df_modificaton <- tibble(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG", "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)
# How can I merge the above to get the following result?
df_merged <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250),
  Modification = c("Y77", "S125", "S127", "T456", "S3")
)
I am using the tidyverse, but I am also fine with other packages. Thanks.
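One possible approach, sketched in base R and assuming that every sequence contains exactly one asterisk: once the "_" markers are removed, a fragment and a database window describe the same site when, left of the asterisk, the shorter string is a suffix of the longer one and, right of the asterisk, the shorter string is a prefix of the longer one. The helper names (left_of_mod, right_of_mod, suffix_overlap, prefix_overlap) are made up for this illustration, and the data is copied from the example above.

```r
# Example data from the question, as plain data frames.
df_protein <- data.frame(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_",
               "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250),
  stringsAsFactors = FALSE
)
df_modificaton <- data.frame(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG",
               "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3"),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: text left / right of the asterisk, with "_" removed.
# Assumption: exactly one asterisk per sequence.
left_of_mod  <- function(s) sub("\\*.*$", "", gsub("_", "", s, fixed = TRUE))
right_of_mod <- function(s) sub("^.*\\*", "", gsub("_", "", s, fixed = TRUE))

# TRUE when the shorter string is a suffix (or prefix) of the longer one.
suffix_overlap <- function(a, b) endsWith(a, b) | endsWith(b, a)
prefix_overlap <- function(a, b) startsWith(a, b) | startsWith(b, a)

# Pair every fragment with every window of the same protein.
pairs <- merge(df_protein, df_modificaton, by = "Protein",
               suffixes = c("", "_window"))

# Keep only the pairs whose sequences line up around the asterisk.
aligned <- suffix_overlap(left_of_mod(pairs$Sequence),
                          left_of_mod(pairs$Sequence_window)) &
  prefix_overlap(right_of_mod(pairs$Sequence),
                 right_of_mod(pairs$Sequence_window))
df_merged <- pairs[aligned, c("Protein", "Sequence", "Counts", "Modification")]

# merge() reorders rows, so sort back into the order of df_protein.
df_merged <- df_merged[order(match(df_merged$Sequence, df_protein$Sequence)), ]
rownames(df_merged) <- NULL
print(df_merged)
```

For the example data this gives the modifications Y77, S125, S127, T456 and S3, in the row order of df_protein. Fragments without an aligned window are dropped by this sketch, and because it builds every fragment/window pair per protein it may become slow on large tables.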
question from:
https://stackoverflow.com/questions/65829768/merging-of-dataframes-based-off-substrings-in-a-column