
r - Fastest way to read in 100,000 .dat.gz files

I have a few hundred thousand very small .dat.gz files that I want to read into R as efficiently as possible. I read each file in and then immediately aggregate and discard the data, so I am not worried about managing memory as I get near the end of the process. I just want to speed up the bottleneck, which happens to be unzipping and reading in the data.

Each dataset consists of 366 rows and 17 columns. Here is a reproducible example of what I am doing so far:

Building reproducible data:

require(data.table)

# Make dir
system("mkdir practice")

# Function to create data
create_write_data <- function(file.nm) {
  dt <- data.table(Day=0:365)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rnorm(n=366))]
  write.table(dt, paste0("./practice/", file.nm), row.names=FALSE, sep="\t", quote=FALSE)  # tab-separated so it can be parsed back later
  system(paste0("gzip ./practice/", file.nm))    
}

And here is the code that applies it:

# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))

And here is my most efficient code so far to read in the data:

# Function to read in files as fast as possible
read_Fast <- function(path.gz) {
  system(paste0("gzip -d ", path.gz))      # Unzip the file in place
  path.dat <- gsub("\\.gz$", "", path.gz)  # Path of the decompressed file
  fread(path.dat)
}

# Apply above function
dat.files <- list.files(path="./practice", full.names = TRUE)
system.time(dat.list <- rbindlist(lapply(dat.files, read_Fast), fill=TRUE))
dat.list

I have wrapped this up in a function and applied it in parallel (sketched below), but it is still much too slow for what I need.
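The parallel application is along these lines (a minimal sketch, assuming a Unix-alike machine so that parallel::mclapply can fork; mc.cores = 4 is an arbitrary example):

# Same read_Fast() as above, fanned out across cores with mclapply
library(parallel)
library(data.table)

dat.files <- list.files(path = "./practice", full.names = TRUE)
system.time(
  dat.list <- rbindlist(mclapply(dat.files, read_Fast, mc.cores = 4), fill = TRUE)
)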

I have already tried h2o.importFolder from the wonderful h2o package, but it is actually much slower than plain R with data.table. Maybe there is a way to speed up the unzipping of the files, but I am unsure. From the few times I have run this, I have noticed that unzipping usually takes about two-thirds of the function's time.
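One thing I may try to avoid the separate unzip step is to let fread() read from a gunzip pipe, so the decompressed data is never written back to disk. This is only a sketch, assuming a recent data.table release that has the cmd= argument (older versions accept a shell command directly as fread()'s first argument):

# Decompress to a pipe instead of to disk; fread() parses the stream
read_fast_pipe <- function(path.gz) {
  fread(cmd = paste("gunzip -c", shQuote(path.gz)))
}

dat.files <- list.files(path = "./practice", full.names = TRUE)
system.time(dat.list <- rbindlist(lapply(dat.files, read_fast_pipe), fill = TRUE))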


1 Reply


I'm sort of surprised that this actually worked. Hopefully it works for your case. I'm quite curious how the speed compares to reading the compressed data from disk directly in R instead (albeit with a penalty for non-vectorization); see the sketch after the code below.

# Run from the directory containing the .dat.gz files.
# The first fread() reads only the first line of the concatenated stream
# to recover the column names; the second streams every file and drops
# the repeated header rows with grep.
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl
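For the comparison I mention above, reading the compressed data directly from R would look something like this (a sketch only; it goes file by file through gzfile() connections and assumes tab-separated files with a header row, as written in the question):

# Per-file alternative: let R decompress each file on the fly via gzfile()
read_gz_direct <- function(path.gz) {
  read.table(gzfile(path.gz), header = TRUE, sep = "\t")
}

dat.files <- list.files(path = "./practice", pattern = "\\.gz$", full.names = TRUE)
system.time(tbl_direct <- rbindlist(lapply(dat.files, read_gz_direct)))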
