Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Reading a txt file in which a new row starts every nth line, delimited by a special character

I am reading a file that contains amino acid sequences for approx. 600,000 proteins. For anyone interested, the source is pdb_seqres.txt from the wwPDB.

I am using data.table::fread because of the file size and for convenience. The "problem" is that each entry spans two lines: a header line introduced with ">", followed by the sequence line. It's not a big deal, because a little wrangling gets me what I want (see "expected output", or even "ideal output", below).

I wondered if there is a direct way to read in a file with this kind of structure. Other packages are welcome too, as long as they handle a file of that size well.

library(tidyverse)

# "text = ..." contains a shortened version of the first two entries of the downloaded txt file
prot <- data.table::fread(text = 
">101m_A mol:protein length:154  MYOGLOBIN
QGAMNKALEL
>102l_A mol:protein length:165  T4 LYSOZYME
RAKRVITTFR", 
header = FALSE
)

prot <- as.data.frame(prot)

# expected output
exp_out <- bind_cols(prot = prot[c(TRUE, FALSE), ], aminoseq = prot[c(FALSE, TRUE), ])
exp_out
#> # A tibble: 2 x 2
#>   prot                                        aminoseq  
#>   <chr>                                       <chr>     
#> 1 >101m_A mol:protein length:154  MYOGLOBIN   QGAMNKALEL
#> 2 >102l_A mol:protein length:165  T4 LYSOZYME RAKRVITTFR

# ideal output
exp_out %>%
  separate(prot, c("mol", "length"), sep = ":protein length:") %>%
  separate(length, c("length", "name"), sep = "\\s{2}+")
#> # A tibble: 2 x 4
#>   mol         length name        aminoseq  
#>   <chr>       <chr>  <chr>       <chr>     
#> 1 >101m_A mol 154    MYOGLOBIN   QGAMNKALEL
#> 2 >102l_A mol 165    T4 LYSOZYME RAKRVITTFR

Created on 2021-01-07 by the reprex package (v0.3.0)



1 Reply


Read the odd and even rows separately with sed, then fread each stream and column-bind the results. This gets you to the "expected output". It is pretty fast, too, at around 2 seconds with unzipped input:

# get the data
# wget ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
library(data.table)

# unzip on the fly
started.at = proc.time()
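# sed -n 'p;n' prints the odd lines (headers), sed -n 'n;p' the even lines (sequences)
# header = FALSE keeps fread from treating the first record as column names
d <- cbind(
  fread(cmd = "zcat pdb_seqres.txt.gz | sed -n 'p;n'", sep = "|", header = FALSE, col.names = "prot"),
  fread(cmd = "zcat pdb_seqres.txt.gz | sed -n 'n;p'", header = FALSE, col.names = "aminoseq"))
cat("Finished in", timetaken(started.at), "\n")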
# Finished in 4.585s elapsed (1.788s cpu)

# read unzipped input
started.at = proc.time()
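# header = FALSE keeps fread from treating the first record as column names
d <- cbind(
  fread(cmd = "sed -n 'p;n' pdb_seqres.txt", sep = "|", header = FALSE, col.names = "prot"),
  fread(cmd = "sed -n 'n;p' pdb_seqres.txt", header = FALSE, col.names = "aminoseq"))
cat("Finished in", timetaken(started.at), "\n")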
# Finished in 1.796s elapsed (1.111s cpu)
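
If sed is not available (for example on Windows), the same odd/even split can be sketched in base R, and the header can be parsed towards the "ideal output" in the same pass. This is only a sketch under assumptions: the two sample records come from the question, the column names id, mol, length and name are my own choice, and the regular expression assumes every header line looks like the two samples. It has not been timed on the full file.

```r
# Sketch in base R. The sample records are the two shortened entries from the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  ">101m_A mol:protein length:154  MYOGLOBIN",
  "QGAMNKALEL",
  ">102l_A mol:protein length:165  T4 LYSOZYME",
  "RAKRVITTFR"
), path)

lines <- readLines(path)

# The logical index is recycled along the vector:
# odd lines are headers, even lines are sequences.
is_header <- c(TRUE, FALSE)
header <- lines[is_header]
aminoseq <- lines[!is_header]

# Assumed header layout: ">ID mol:TYPE length:N  NAME"
pattern <- "^>([^ ]+) mol:([^ ]+) length:([0-9]+) +(.*)$"
parts <- do.call(rbind, regmatches(header, regexec(pattern, header)))

# Column 1 of `parts` is the full match; columns 2 to 5 are the capture groups.
d <- data.frame(
  id = parts[, 2],
  mol = parts[, 3],
  length = as.integer(parts[, 4]),
  name = parts[, 5],
  aminoseq = aminoseq,
  stringsAsFactors = FALSE
)
print(d)
```

A header that does not match the pattern yields an empty match and would silently drop a row from parts, so on real data it is worth checking that nrow(parts) equals length(header) before building the data frame.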

In theory the following should also work: it column-binds the two streams with paste in the shell before calling fread. It keeps giving me errors about tempfile permissions, but it might work on your setup.

fread(cmd = "paste -d'|' <(sed -n 'p;n' pdb_seqres.txt) <(sed -n 'n;p' pdb_seqres.txt)",
      sep = "|")
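
One possible reason the command above fails: fread(cmd = ...) passes the command to R's system(), which on Unix-alikes runs it through /bin/sh, and the <( ... ) process substitution is a bash extension that a plain sh may reject. That is a guess, not a verified diagnosis of the tempfile error. A sketch that avoids process substitution altogether lets paste read standard input twice per output row, so each header/sequence pair is joined with "|" in a single pass. It assumes a Unix-like shell with paste on the PATH and that "|" never occurs inside a header line. The temporary file and the column names prot and aminoseq are only for illustration.

```r
library(data.table)

# Sample file with the two shortened entries from the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  ">101m_A mol:protein length:154  MYOGLOBIN",
  "QGAMNKALEL",
  ">102l_A mol:protein length:165  T4 LYSOZYME",
  "RAKRVITTFR"
), path)

# "paste - -" takes two consecutive lines from stdin per output row;
# -d'|' joins them with a pipe, which fread then uses as the separator.
d <- fread(
  cmd = paste("paste -d'|' - - <", shQuote(path)),
  sep = "|",
  header = FALSE,
  col.names = c("prot", "aminoseq")
)
print(d)
```

For the real data, path would point at the unzipped pdb_seqres.txt instead of the temporary file.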
