Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Extracting Data from Text Files

There appear to be similar questions to this in other languages but I can't find one in R.

I have a number of text files in the subdirectories of a directory; they all have the extension (.log) and they contain a mixture of text and data. I want to extract a couple of lines from these relatively large files.

For example, one file goes as follows ...

blahblahblah

NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =  210

blahblahblah

 ----------------------------------------
 CPU timing information for all processes
 ========================================
 0: 8853.469 + 133.948 = 8987.417
 1: 8850.817 + 126.587 = 8977.405
 2: 8851.925 + 128.576 = 8980.501
 3: 8847.992 + 125.871 = 8973.864
 ----------------------------------------
 ddikick.x: exited gracefully.

blahblahblah

I want to harvest the number of basis functions (210 in this example) and the total CPU times (the last column of the timing block).

The line "NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =" occurs exactly once in each file; i.e., if I open a file in a text editor and search for this string, only this one line is found. The same is true of "CPU timing information for all processes" and "exited gracefully".

I appreciate that it appears that I haven't done a lot to help myself but I just don't know where to start. If someone could point me in the right direction, I hope to be able to fill in the rest.

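After the help given to me by @Ben (see below), here is the code that I ended up using:

library(stringr)  # provides str_extract() and str_extract_all()

filesearch <- function(x) {
  f <- readLines(x)
  # Line holding the basis-function count; the number ends the line
  cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f,
                value = TRUE)
  val <- as.numeric(str_extract(cline, "[0-9]+$"))
  # Index of the CPU timing header; the four timing rows start two lines below it
  coline <- grep("^ +CPU timing information", f)
  numstr <- sapply(str_extract_all(f[coline + 2:5], "[0-9.]+"), as.numeric)
  # Row 4 holds the per-process totals; sum them and divide by 60
  cline1 <- sum(numstr[4, ]) / 60
  output <- c(val, cline1)
  cat(output, "\n")   # print both values on one line
  invisible(output)   # also return them so the result can be reused
}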

I sourced this function and keyed in the file that I needed each time, then I transferred the two results to another file by hand. Not as elegant as I'd like but it saved me a lot of time doing it this way. Thanks again to @Ben.


1 Reply


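Something like this may work:

library(stringr)
f <- readLines("datafile.txt")
# value=TRUE returns the matching line itself rather than its index
cline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS",f,
              value=TRUE)
# The count is the run of digits at the end of that line
val <- as.numeric(str_extract(cline,"[0-9]+$"))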

To get the other values, try

cline <- grep("^ +CPU timing information",f)
(numstr <- sapply(str_extract_all(f[cline+2:5],"[0-9.]+"),as.numeric))
##         [,1]     [,2]     [,3]     [,4]
## [1,]    0.000    1.000    2.000    3.000
## [2,] 8853.469 8850.817 8851.925 8847.992
## [3,]  133.948  126.587  128.576  125.871
## [4,] 8987.417 8977.405 8980.501 8973.864

The sapply has transposed the matrix of values, so the last row is the bit we want (corresponds to the last column in the file). Extract it using numstr[4,] or numstr[nrow(numstr),] or tail(numstr,1).
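The offsets 2:5 above assume exactly four timing rows. As a hedged alternative (not part of the original answer), the sketch below uses base R only and takes whatever rows sit between the "CPU timing" header and the "exited gracefully" line, which the question says occurs once per file. The sample vector f is copied from the question's example; the names start, end, timing and totals are just illustrative.

```r
# Sample lines copied from the example log in the question.
f <- c(
  "blahblahblah",
  "NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS =  210",
  "blahblahblah",
  " ----------------------------------------",
  " CPU timing information for all processes",
  " ========================================",
  " 0: 8853.469 + 133.948 = 8987.417",
  " 1: 8850.817 + 126.587 = 8977.405",
  " 2: 8851.925 + 128.576 = 8980.501",
  " 3: 8847.992 + 125.871 = 8973.864",
  " ----------------------------------------",
  " ddikick.x: exited gracefully.",
  "blahblahblah"
)

# Both marker strings are assumed to match exactly one line each.
start <- grep("CPU timing information for all processes", f, fixed = TRUE)
end   <- grep("exited gracefully", f, fixed = TRUE)

# Skip the "=====" rule after the header and the "-----" rule before the
# closing line, so the number of processes does not need to be known.
timing <- f[(start + 2):(end - 2)]

# The per-process total is whatever follows the "=" sign on each row.
totals    <- as.numeric(sub("^.*= *", "", timing))
total_cpu <- sum(totals)
```

This relies on at least one timing row being present; a file with an empty timing block would need an extra check before indexing.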

(edit: allow spaces before the "CPU timing" string) (edit: do it right!)

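(To do this for all the log files, package it in a function and use list.files(pattern="\\.log$", recursive=TRUE, full.names=TRUE) in combination with sapply. The backslash must be doubled inside an R string, and recursive=TRUE is needed because the files sit in subdirectories.)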
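A minimal sketch of that batch step, in base R so it does not depend on stringr. The function names extract_log and summarise_logs, the column names, and the output file name in the usage comment are all hypothetical; the 2:5 offsets carry over the four-process assumption from the code above.

```r
# Hypothetical helper: pull both values out of a single log file.
extract_log <- function(path) {
  f <- readLines(path)
  # The basis-function count follows the "=" on its (unique) line.
  bline <- grep("NUMBER OF CARTESIAN GAUSSIAN BASIS FUNCTIONS", f,
                fixed = TRUE, value = TRUE)
  basis <- as.numeric(sub("^.*= *", "", bline))
  # Four timing rows start two lines below the header (assumed layout).
  cline  <- grep("CPU timing information", f, fixed = TRUE)
  totals <- as.numeric(sub("^.*= *", "", f[cline + 2:5]))
  data.frame(file = path, basis_functions = basis, cpu_total = sum(totals))
}

# Hypothetical wrapper: one row per .log file found under `dir`.
# Returns NULL if no log files are found.
summarise_logs <- function(dir) {
  logs <- list.files(dir, pattern = "\\.log$",
                     recursive = TRUE, full.names = TRUE)
  do.call(rbind, lapply(logs, extract_log))
}

# Usage (paths are placeholders):
# results <- summarise_logs("path/to/top/directory")
# write.csv(results, "log_summary.csv", row.names = FALSE)
```

Returning a data frame rather than printing with cat means the results can be written out in one go instead of being copied by hand. Divide cpu_total by 60 afterwards if you want the same scaling as the asker's function.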

