dataframe - Need to speed up R loop

Question

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

I need to speed up the nested loop below. Scores linked to item IDs are recorded by date. For each item with multiple scores, I need to relate each score to the one before it and to the time elapsed between them. On toy data like the example below it works fine, but on real data with tens of thousands of rows it becomes too slow to be useful. Is there a better way to do the same thing?

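# create some simulated data
# (assigning the date strings coerces the whole matrix to character,
#  so ID and Score end up as character columns in the data frame)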
test <- matrix(1:18, byrow=TRUE, nrow=6)
test[,1] <- c(1,2,1,3,2,3)
test[,2] <- c(70,92,62,90,85,82)
test[,3] <- c("2019-01-01","2019-01-01", "2020-01-01", "2019-01-01", "2020-01-01", "2020-01-01")
colnames(test) <- c("ID", "Score", "Date")
test <- data.frame(test)
test$Date <- as.Date(test$Date)

# create a dataframe to hold all the post-loop data
df <- data.frame(matrix(ncol = 4, nrow = 0))
col_names <- c("ID", "Years", "BeginScore", "EndScore")

# get all the unique item IDs
ids <- unique(test$ID)

library(lubridate)  # provides time_length(), used inside the loop

# loop through each unique item id
for(i in seq_along(ids))
{
   # get all the instances of that single item
   item <- test[test$ID == ids[i],]
   # an item with a single score has no pair to compare, so skip it
   if (nrow(item) < 2) next
   # create a data frame to hold the data (one row per consecutive pair)
   scores <- data.frame(matrix(1:((nrow(item)-1)*4), byrow=TRUE, nrow=nrow(item)-1))
   colnames(scores) <- col_names

   # create an index, starting at the last (bc real data is ordered by date)
   index <- nrow(item)
   # loop through the list of instances of the single item and assign info
   for(j in 1:(nrow(item)-1))
   {
     # assign to row j only; without [j] every pass overwrites the whole column
     scores$Years[j] <- time_length(item[index,3]-item[(index -1),3], "years")
     scores$BeginScore[j] <- item[(index-1),2]
     scores$EndScore[j] <- item[index, 2]
     scores$ID[j] <- item[index,1]
     index <- index - 1
   }
   # bind the single item to the collected data and then loop to next unique item
   df <- rbind(df, scores)
}

Question from: https://stackoverflow.com/questions/65877428/need-to-speed-up-r-loop

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:22:21+0000

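A for loop is not the right tool for this kind of operation. Creating an empty matrix or data frame and filling it by calling rbind() on every iteration is also very inefficient in R, because each call copies everything collected so far.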
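
To illustrate the point about filling a data frame inside the loop, here is a minimal base R sketch that builds one small piece per ID and binds all pieces once at the end. It is not part of the original answer. It assumes numeric ID and Score columns (the toy data is rebuilt here so the block runs on its own) and approximates a year as 365.25 days; the names pieces and result are only illustrative.

```r
# toy data with the same values as in the question
test <- data.frame(
  ID    = c(1, 2, 1, 3, 2, 3),
  Score = c(70, 92, 62, 90, 85, 82),
  Date  = as.Date(c("2019-01-01", "2019-01-01", "2020-01-01",
                    "2019-01-01", "2020-01-01", "2020-01-01"))
)

# build one small data frame per ID; nothing shared is modified inside the loop
pieces <- lapply(split(test, test$ID), function(item) {
  n <- nrow(item)
  if (n < 2) return(NULL)               # a single score has nothing to compare against
  item <- item[order(item$Date), ]      # make sure scores are in date order
  data.frame(
    ID         = item$ID[-1],
    # days between consecutive dates, converted with an assumed 365.25 days per year
    Years      = as.numeric(diff(item$Date), units = "days") / 365.25,
    BeginScore = item$Score[-n],        # every score except the last
    EndScore   = item$Score[-1]         # every score except the first
  )
})

# bind everything once at the end instead of once per iteration
result <- do.call(rbind, pieces)
print(result)
```

Binding once keeps the amount of copying proportional to the size of the result, whereas rbind() inside the loop recopies the accumulated rows on every pass.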

Tens of thousands of rows is not a lot of data. You can try this dplyr approach, which relates the last two scores of each ID.

library(dplyr)
library(lubridate)

test %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(ID) %>%
  summarise(BeginScore = nth(Score, n() - 1),
            EndScore = last(Score), 
            Years = time_length(last(Date) - nth(Date, n() - 1), 'years'))

#  ID    BeginScore EndScore Years
#  <chr> <chr>      <chr>    <dbl>
#1 1     70         62       0.999
#2 2     92         85       0.999
#3 3     90         82       0.999
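
The summarise() call above returns one row per ID, built from its last two scores. If an item has more than two scores and every consecutive pair is wanted, as the inner loop in the question suggests, a sketch along the following lines may help. It assumes numeric ID and Score columns (the toy data is rebuilt here so the block runs on its own) and uses dplyr's lag() to look at the previous row within each ID; the name pairs is only illustrative.

```r
library(dplyr)
library(lubridate)

# toy data with the same values as in the question
test <- data.frame(
  ID    = c(1, 2, 1, 3, 2, 3),
  Score = c(70, 92, 62, 90, 85, 82),
  Date  = as.Date(c("2019-01-01", "2019-01-01", "2020-01-01",
                    "2019-01-01", "2020-01-01", "2020-01-01"))
)

pairs <- test %>%
  arrange(ID, Date) %>%                 # date order within each ID
  group_by(ID) %>%
  mutate(BeginScore = lag(Score),       # score on the previous date
         EndScore   = Score,
         Years      = time_length(Date - lag(Date), "years")) %>%
  filter(row_number() > 1) %>%          # the first row of each ID has no predecessor
  ungroup() %>%
  select(ID, Years, BeginScore, EndScore)

print(pairs)
```

On the toy data each ID has exactly two scores, so this yields the same three pairs as the summarise() version; the difference only shows up for IDs with three or more scores.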
