Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

dataframe - Need to speed up R loop

I need to speed up the nested loop below. Scores linked to item IDs are recorded by date. For each item with multiple scores, I need to relate the scores to the time distance between them. On toy data like the example below it works fine, but on real data with tens of thousands of rows it becomes too slow to be useful. Is there a better way to do the same thing?

library(lubridate)  # provides time_length(), used in the loop below

# create some simulated data
test <- matrix(1:18, byrow=TRUE, nrow=6)
test[,1] <- c(1,2,1,3,2,3)
test[,2] <- c(70,92,62,90,85,82)
test[,3] <- c("2019-01-01","2019-01-01", "2020-01-01", "2019-01-01", "2020-01-01", "2020-01-01")
colnames(test) <- c("ID", "Score", "Date")
test <- data.frame(test)
test$Date <- as.Date(test$Date)

# create a dataframe to hold all the post-loop data
df <- data.frame(matrix(ncol = 4, nrow = 0))
col_names <- c("ID", "Years", "BeginScore", "EndScore")

# get all the unique item IDs
ids <- unique(test$ID)

# loop through each unique item ID
for (i in seq_along(ids))
{
   # get all the instances of that single item
   item <- test[test$ID == ids[i], ]
   # an item with a single score has no pair of scores to compare
   if (nrow(item) < 2) next
   # create a data frame to hold one row per consecutive pair of scores
   scores <- data.frame(matrix(NA, nrow = nrow(item) - 1, ncol = 4))
   colnames(scores) <- col_names

   # create an index, starting at the last row (because the real data is ordered by date)
   index <- nrow(item)
   # loop through the instances of the single item and fill in row j
   for (j in 1:(nrow(item) - 1))
   {
     # assign to row j only; assigning to the whole column would overwrite earlier pairs
     scores$Years[j] <- time_length(item[index, 3] - item[(index - 1), 3], "years")
     scores$BeginScore[j] <- item[(index - 1), 2]
     scores$EndScore[j] <- item[index, 2]
     scores$ID[j] <- item[index, 1]
     index <- index - 1
   }
   # bind the single item to the collected data and then loop to the next unique item
   df <- rbind(df, scores)
}
Question from: https://stackoverflow.com/questions/65877428/need-to-speed-up-r-loop


1 Reply

A for loop is not the right tool for this kind of operation. Creating an empty matrix or data frame and filling it piece by piece is also very inefficient in R, because every rbind() call copies everything collected so far.
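As a rough illustration of that point, the sketch below keeps the per-item logic but collects the pieces in a list and binds them once at the end, instead of growing a data frame on every pass. It uses base R only. The names `toy`, `pieces` and `result` are illustrative and not from the question, and dividing the day difference by 365.25 is an assumption meant to approximate time_length(..., "years").

```r
# Toy data in the same shape as the question's `test` (illustrative name: toy)
toy <- data.frame(
  ID    = c(1, 2, 1, 3, 2, 3),
  Score = c(70, 92, 62, 90, 85, 82),
  Date  = as.Date(c("2019-01-01", "2019-01-01", "2020-01-01",
                    "2019-01-01", "2020-01-01", "2020-01-01"))
)

# Build one small data frame per item and keep them all in a list
pieces <- lapply(split(toy, toy$ID), function(item) {
  item <- item[order(item$Date), ]
  n <- nrow(item)
  if (n < 2) return(NULL)  # a single score has nothing to compare against
  data.frame(
    ID         = item$ID[-1],
    # assumption: 365.25 days per year, to approximate time_length(x, "years")
    Years      = as.numeric(diff(item$Date)) / 365.25,
    BeginScore = item$Score[-n],
    EndScore   = item$Score[-1]
  )
})

# Bind once at the end instead of calling rbind() inside the loop
result <- do.call(rbind, pieces)
rownames(result) <- NULL
```

Here each item contributes one row per consecutive pair of scores, and the only rbind() happens after all the pieces exist.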

Tens of thousands of rows is not much data. You can try this dplyr approach, which takes the last two scores of each item.

library(dplyr)
library(lubridate)

test %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(ID) %>%
  summarise(BeginScore = nth(Score, n() - 1),
            EndScore = last(Score), 
            Years = time_length(last(Date) - nth(Date, n() - 1), 'years'))

#  ID    BeginScore EndScore Years
#  <chr> <chr>      <chr>    <dbl>
#1 1     70         62       0.999
#2 2     92         85       0.999
#3 3     90         82       0.999
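The summarise() call above returns one row per ID, built from the last two scores only. If some items have more than two scores and every consecutive pair is wanted, a lag()-based variant is one option. This is only a sketch under that assumption: `toy` and `pairs` are illustrative names, and the toy data here gives item 1 a third score so that it produces two pairs.

```r
library(dplyr)
library(lubridate)

# Illustrative data: item 1 has three scores, items 2 and 3 have two each
toy <- data.frame(
  ID    = c(1, 2, 1, 3, 2, 3, 1),
  Score = c(70, 92, 62, 90, 85, 82, 75),
  Date  = as.Date(c("2019-01-01", "2019-01-01", "2020-01-01",
                    "2019-01-01", "2020-01-01", "2020-01-01",
                    "2021-01-01"))
)

pairs <- toy %>%
  arrange(ID, Date) %>%                 # make sure each item is in date order
  group_by(ID) %>%
  mutate(BeginScore = lag(Score),       # previous score of the same item
         EndScore   = Score,
         Years      = time_length(Date - lag(Date), "years")) %>%
  ungroup() %>%
  filter(!is.na(BeginScore)) %>%        # drop each item's first row (no previous score)
  select(ID, Years, BeginScore, EndScore)
```

Items with a single score drop out at the filter() step, since they have no previous score to pair with.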
