r - Reading txt file in which a new row starts every n line, delimited by special character

Question

Welcome To Ask or Share your Answers For Others

r - Reading txt file in which a new row starts every n line, delimited by special character

posted Feb 19, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - Reading txt file in which a new row starts every n line, delimited by special character

I am reading a file that contains data about amino acid sequences for approx. 600000 proteins. for whomever this might be of interest, here the source

I am using data.table::fread due to the file size and for convenience. The "problem" is that the file contains a new entry only every 2nd line, introduced with a ">". It's not a biggie, because I can just do some minor wrangling and I have it as I want. (see desired output, or even "ideal output").

I wondered if there is a direct way to read in a file that has this kind of structure. Any other package also welcome of course, but it should handle that type of size well.

library(tidyverse)

# "text = ..." contains a shortened version of the first two entries of the downloaded txt file
prot <- data.table::fread(text = 
">101m_A mol:protein length:154  MYOGLOBIN
QGAMNKALEL
>102l_A mol:protein length:165  T4 LYSOZYME
RAKRVITTFR", 
header = FALSE
)

prot <- as.data.frame(prot)

# expected output
exp_out <- bind_cols(prot = prot[c(T, F), ], aminoseq = prot[c(F, T), ] )
exp_out
#> # A tibble: 2 x 2
#>   prot                                        aminoseq  
#>   <chr>                                       <chr>     
#> 1 >101m_A mol:protein length:154  MYOGLOBIN   QGAMNKALEL
#> 2 >102l_A mol:protein length:165  T4 LYSOZYME RAKRVITTFR

# ideal output
exp_out %>%
  separate(prot, c("mol", "length"), sep = ":protein length:") %>%
  separate(length, c("length", "name"), sep = "\s{2}+")
#> # A tibble: 2 x 4
#>   mol         length name        aminoseq  
#>   <chr>       <chr>  <chr>       <chr>     
#> 1 >101m_A mol 154    MYOGLOBIN   QGAMNKALEL
#> 2 >102l_A mol 165    T4 LYSOZYME RAKRVITTFR

^{Created on 2021-01-07 by the reprex package (v0.3.0)}

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-02-19T03:53:50+0000

Read odd and even rows separately, using sed, then fread with column bind, this will get you to "expected output", it is pretty fast, too, around 2 seconds with unzipped input:

# get the data
# wget ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
library(data.table)

# unzip on the fly
started.at = proc.time()
d <- cbind(
  fread(cmd = "zcat pdb_seqres.txt.gz | sed -n 'p;n'", sep = "|"),
  fread(cmd = "zcat pdb_seqres.txt.gz | sed -n 'n;p'"))
cat("Finished in", timetaken(started.at), "
")
# Finished in 4.585s elapsed (1.788s cpu)

# read unzipped input
started.at = proc.time()
d <- cbind(
  fread(cmd = "sed -n 'p;n' pdb_seqres.txt", sep = "|"),
  fread(cmd = "sed -n 'n;p' pdb_seqres.txt"))
cat("Finished in", timetaken(started.at), "
")
# Finished in 1.796s elapsed (1.111s cpu)

In theory below should work, i.e. we are column binding using bash paste before freading, but it keeps giving me errors about tempfile permissions, might work on your set up.

fread(cmd = "paste -d'|' <(sed -n 'p;n' pdb_seqres.txt) <(sed -n 'n;p' pdb_seqres.txt)",
      sep = "|")

Categories

r - Reading txt file in which a new row starts every n line, delimited by special character

r - Reading txt file in which a new row starts every n line, delimited by special character

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags