Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
441 views
in Technique[技术] by (71.8m points)

r - How to efficiently read the first character from each line of a text file?

I'd like to read only the first character from each line of a text file, ignoring the rest.

Here's an example file:

x <- c(
  "Afklgjsdf;bosfu09[45y94hn9igf",
  "Basfgsdbsfgn",
  "Cajvw58723895yubjsdw409t809t80",
  "Djakfl09w50968509",
  "E3434t"
)
writeLines(x, "test.txt")

I can solve the problem by reading everything with readLines and using substring to get the first character:

lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"

This seems inefficient though. Is there a way to persuade R to only read the first characters, rather than having to discard them?

I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low level file manipulation (maybe with seek).


Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:

set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)    
x2 <- vapply(
  nch, 
  function(nch)
  {
    paste0(
      sample(letters, nch, replace = TRUE), 
      collapse = ""
    )    
  },
  character(1)
)
writeLines(x2, "bigtest.txt")

Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to be using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or to treat the file as binary (Martin Morgan's readBin solution).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you allow/have access to Unix command-line tools you can use

scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE) 

Obviously less portable but probably very fast.

Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:

           expr         min          lq        mean      median          uq
     RC readLines   14.797830   17.083849   19.261917   18.103020   20.007341
      RS read.fwf  125.113935  133.259220  148.122596  138.024203  150.528754
 BB scan pipe cut    6.277267    7.027964    7.686314    7.337207    8.004137
      RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
          RS scan   13.927765   14.752597   16.634288   15.274470   16.992124

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.9k users

...