
for loop - r - read in multiple files and select max value from column by group

I am looking for the most elegant way to loop through and read in multiple files organized by date, keeping the most recent value for each combination of keys whenever anything changed.

Sadly, the reason I need to read in all the files, and not just the last one, is that a key combination can disappear from later files, and I still want to capture its last known value.

Here is an example of what the files look like (I'm posting them comma-separated even though the real files are fixed width):

file_20200101.txt
key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101

file_20200102.txt
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102

and an example of the desired output:

desired_df
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
100,foo,15,20200101

In addition, here is some code I know works to read in multiple files and then get my ideal output, but I need the filtering to happen inside the loop: the data frame would be way too big if I import and bind all the files first.

library(data.table)  # fread()
library(dplyr)
library(purrr)       # map()
library(stringr)     # str_remove_all()

files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)

df <- files %>%
  map(function(f) {
    print(f)
    df <- fread(f)
    # keep the source file name; don't overwrite the date_as_numb column
    df <- df %>% mutate(source_file = f)
    return(df)
  }) %>%
  bind_rows()

df <- df %>%
  # "file_20200101.txt" -> 20200101
  mutate(file_date = as.numeric(str_remove_all(basename(source_file), "[^0-9]"))) %>%
  group_by(key_1, key_2) %>%
  filter(file_date == max(file_date))

Thanks in advance!

Question from: https://stackoverflow.com/questions/65945347/r-read-in-multiple-files-and-select-max-value-from-column-by-group


1 Reply


I don't know what counts as "way too big" for you; data.table can (allegedly) handle really big data. So if bind_rows() on the list is not OK, maybe use data.table.

(In my own experience, dplyr::group_by() can be really slow with many groups (say 10^5 groups) in large-ish data (around 10^6 rows). I don't have much experience with data.table, but all the threads mention its superiority for large data.)
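
For reference, here is a rough sketch of that bind-then-filter route in pure data.table, with rbindlist() in place of bind_rows(); it assumes `files` is the vector of paths from list.files() in the question:

library(data.table)

# assumption: `files` holds full paths to the input files
all_dt <- rbindlist(lapply(files, fread))

# per (key_1, key_2), keep the row with the largest date_as_numb
latest <- all_dt[all_dt[, .I[which.max(date_as_numb)], by = .(key_1, key_2)]$V1]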

I've used this answer for merging a list of data tables:

library(data.table)
dt1 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101")

dt2 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102")

ls_files <- list(dt1, dt2) 

# you would have created this list by calling fread with lapply, like 
# ls_files <- lapply(files, fread)

# Now here's a two-liner with data.table.
# merge(..., all = TRUE) joins on every shared column, so here it acts as a
# de-duplicating union of the tables.

alldata <- Reduce(function(...) merge(..., all = TRUE), ls_files)

# .I[which.max(date_as_numb)] gives, per key pair, the row index of the
# newest entry; $V1 extracts those indices to subset the table
alldata[alldata[, .I[which.max(date_as_numb)], by = .(key_1, key_2)]$V1]
#>    key_1 key_2 value date_as_numb
#> 1:   100   foo    15     20200101
#> 2:   123   abc    50     20200102
#> 3:   456   def   500     20200102
#> 4:   789   xyz   300     20200102

Created on 2021-01-28 by the reprex package (v0.3.0)
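
If even the stacked table is too big to hold in memory at once, one possible variant (my sketch, not tested against your data) is to fold in the per-group maximum as each file is read, so memory stays bounded by the number of distinct key pairs rather than the total row count:

library(data.table)

# helper: keep only the newest row per (key_1, key_2)
keep_latest <- function(dt) {
  dt[dt[, .I[which.max(date_as_numb)], by = .(key_1, key_2)]$V1]
}

running <- NULL
for (f in files) {  # `files` as returned by list.files(..., full.names = TRUE)
  # rbindlist() silently drops the initial NULL
  running <- keep_latest(rbindlist(list(running, fread(f))))
}

# `running` now holds the latest value per key pair, including keys
# (like 100/foo) that disappeared from later files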

