I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by a binary vector V
of length 10 million, which assumes value 0
if I don't want to read the row and 1
if I do want to read the row.
How do I get the io function fread
from the data.table
package to do this? I ask because fread
is so so fast compared to all other io approaches.
The best solution this question, Reading specific rows of large matrix data file, gives the following solution:
read.csv( pipe( paste0("sed -n '" , paste0( c( 1 , which( V == 1 ) + 1 ) , collapse = "p; " ) , "p' C:/Data/target.csv" , collapse = "" ) ) , head=TRUE)
where C:/Data/target.csv
is the large CSV file and V
is the vector of 0
or 1
.
However I have noticed that this is orders of magnitude slower than simply using fread
on the entire matrix, even if the V
will only be equal to 1
for a small subset of the total number of rows.
Thus, since fread
on the whole matrix will dominate the above solution, how do I combine fread
(and specifically fread
) with row sampling?
This is not a duplicate because it is only about the function fread
.
Here's my problem setup:
#create csv
csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
#my csv has a header:
colnames(csv) <- LETTERS[1:5]
#save csv
write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
#create vector of 0s and 1s that I want to read the CSV from
read_vec <- rep(0,50)
read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
#the following is the effect that I want, but I want an efficient approach to it:
csv <- read.csv("/home/user/test_csv.csv") #inefficient!
csv <- csv[which(read_vec==1),] #inefficient!
#the alternative approach, too slow when scaled up!
csv <- fread( pipe( paste0("sed -n '" , paste0( c( 1 , which( read_vec == 1 ) + 1 ) , collapse = "p; " ) , "p' /home/user/test_csv.csv" , collapse = "" ) ) , head=TRUE)
#the fastest approach yet still not optimal because it needs to read all rows
require(data.table)
csv <- data.matrix(fread('/home/user/test_csv.csv'))
csv <- csv[which(read_vec==1),]
See Question&Answers more detail:
os