Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
325 views
in Technique[技术] by (71.8m points)

r - How to speed up or vectorize a for loop?

I would like to increase the speed of my for loop via vectorization or using Data.table or something else. I have to run the code on 1,000,000 rows and my code is really slow.

The code is fairly self-explanatory. I have included an explanation below just in case. I have included the input and the output of the function. Hopefully you will help me make the function faster.

My goal is to bin the vector "Volume", where each bin is equal to 100 shares. The vector "Volume" contains the number of shares traded. Here is what it looks like:

head(Volume, n = 60)
[1]  5  3  1  5  3  1  1  1  1  1  1  1 18  1  1 18  2  7 13  2  7 13  3  2  1  1  3  2  1  1  1
[32]  1  6  6  1  1  1  1  1  1  1  1 18  2  1  1  2  1 14 18  2  1  1  2  1 14  1  1  9  5

The vector "binIdexVector" is the same length of "Volume", and it contains the bin number; that is each element of the first 100 shares get the number 1, each elements of the next 100 shares get the number 2, each elements of the next 100 shares get the number 3, and so on. Here is what that vector looks like:

 head(binIdexVector, n = 60)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[48] 2 2 3 3 3 3 3 3 3 3 3 3 3

Here is my function:

#input as a vector
Volume<-c(5L, 3L, 1L, 5L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 18L, 1L, 1L, 
                   18L, 2L, 7L, 13L, 2L, 7L, 13L, 3L, 2L, 1L, 1L, 3L, 2L, 1L, 1L, 
                   1L, 1L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 18L, 2L, 1L, 
                   1L, 2L, 1L, 14L, 18L, 2L, 1L, 1L, 2L, 1L, 14L, 1L, 1L, 9L, 5L, 
                   2L, 1L, 1L, 1L, 1L, 9L, 5L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L, 
                   1L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 1L, 2L, 9L, 9L, 3L, 3L, 1L, 1L, 
                   1L, 1L, 5L, 5L, 8L, 8L, 2L, 1L, 2L, 1L, 10L, 10L, 10L, 10L, 10L, 
                   10L, 10L, 10L, 9L, 9L, 1L, 1L, 8L, 1L, 8L, 1L, 8L, 8L, 2L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                   1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 
                   1L, 2L, 7L, 1L, 2L, 7L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 30L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 
                   1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 
                   10L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 30L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 
                   1L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 3L, 1L, 1L, 1L, 4L, 3L, 1L, 
                   1L, 1L, 4L, 25L, 1L, 1L, 25L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L)

binIdexVector <- numeric(length(Volume))

# initialize 
binIdex <-1
totalVolume <-0

for(i in seq_len(length(Volume))){

  totalVolume <- totalVolume + Volume[i]  

  if (totalVolume <= 100) {

    binIdexVector[i] <- binIdex

  } else {

    binIdex <- binIdex + 1
    binIdexVector[i] <- binIdex
    totalVolume <- Volume[i]
  }
}

# output:
> dput(binIdexVector)
c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
  1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
  2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
  3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
  3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
  4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
  7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
  8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
  8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
  9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 
  10, 10, 10, 10, 10, 10, 10, 10, 10, 10)

Thank a lot for your help!

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.2
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can use Rcpp when vectorization is difficult.

library(Rcpp)
cppFunction('
  IntegerVector bin(NumericVector Volume, int n) {
    IntegerVector binIdexVector(Volume.size());
    int binIdex = 1;
    double totalVolume =0;

    for(int i=0; i<Volume.size(); i++){
      totalVolume = totalVolume + Volume[i];
      if (totalVolume <= n) {
        binIdexVector[i] = binIdex;
      } else {
        binIdex++;
        binIdexVector[i] = binIdex;
        totalVolume = Volume[i];
      }
    }
    return binIdexVector;
  }')

all.equal(bin(Volume, 100), binIdexVector)
#[1] TRUE

It's faster than findInterval(cumsum(Volume), seq(0, sum(Volume), by=100)) (which of course gives an inexact answer)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...