Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Generating a very large matrix of string combinations using combn() and bigmemory package

I have a vector x of 1,344 unique strings. I want to generate a matrix that gives me all possible groups of three values, regardless of order, and export that to a CSV file.

I'm running R on an EC2 m1.large instance with 64-bit Ubuntu. When I call combn(x, 3) I get an out-of-memory error:

Error: cannot allocate vector of size 9.0 Gb

The resulting matrix would have C(1344, 3) = 403,716,544 rows and three columns, which is the transpose of what combn() returns.
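The figures above can be checked with a few lines of R. This is only a rough estimate: it assumes each element of a character matrix costs one 8-byte pointer on 64-bit R (the variable names are made up for illustration), which would account for the 9.0 Gb in the error message.

```r
# Number of unordered groups of three drawn from 1,344 items.
n_combos <- choose(1344, 3)       # 403716544

# combn(x, 3) returns a 3 x n_combos matrix, so it has this many cells.
n_cells <- 3 * n_combos

# Assumption: 8 bytes per cell (one pointer per string on 64-bit R).
bytes_per_cell <- 8
gib <- n_cells * bytes_per_cell / 2^30

print(format(n_combos, big.mark = ","))
print(round(gib, 1))
```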

I thought of using the bigmemory package to create a file-backed big.matrix that I could then assign the results of combn() to. I can create a preallocated big matrix:

library(bigmemory)
x <- as.character(1:1344)
combos <- 403716544
test <- filebacked.big.matrix(nrow = combos, ncol = 3, 
        init = 0, backingfile = "test.matrix")

But when I try to assign the values with test <- combn(x, 3), I still get the same error: Error: cannot allocate vector of size 9.0 Gb

I even tried coercing the result of combn(x, 3) directly, but I think that because combn() itself fails with an error, as.big.matrix() never gets a chance to run either.

test <- as.big.matrix(matrix(combn(x, 3)), backingfile = "abc")
Error: cannot allocate vector of size 9.0 Gb
Error in as.big.matrix(matrix(combn(x, 3)), backingfile = "abc") : 
  error in evaluating the argument 'x' in selecting a method for function 'as.big.matrix'

Is there a way to combine these two functions together to get what I need? Are there any other ways of achieving this? Thanks.


1 Reply


Here's a function I've written in R, which currently finds its (unexported) home in the LSPM package. You give it the total number of items n, the number of items to select r, and the index of the combination you want i; it returns the values in 1:n corresponding to combination i.

".combinadic" <- function(n, r, i) {

  # http://msdn.microsoft.com/en-us/library/aa289166(VS.71).aspx
  # http://en.wikipedia.org/wiki/Combinadic

  # Valid indices run from 1 to choose(n,r) = n!/(r!(n-r)!)
  if(i < 1 || i > choose(n,r)) stop("'i' must be 0 < i <= choose(n,r)")

  largestV <- function(n, r, i) {
    #v <- n-1
    v <- n                                  # Adjusted for one-based indexing
    #while(choose(v,r) > i) v <- v-1
    while(choose(v,r) >= i) v <- v-1        # Adjusted for one-based indexing
    return(v)
  }

  res <- rep(NA,r)
  for(j in 1:r) {
    res[j] <- largestV(n,r,i)
    i <- i-choose(res[j],r)
    n <- res[j]
    r <- r-1
  }
  res <- res + 1
  return(res)
}

It allows you to generate each combination based on the value of the lexicographic index:

> .combinadic(1344, 3, 1)
[1] 3 2 1
> .combinadic(1344, 3, 2)
[1] 4 2 1
> .combinadic(1344, 3, 403716544)
[1] 1344 1343 1342

So you just need to loop over the indices 1 to 403716544 and append the results to a file. It may take a while, but it's at least feasible (see Dirk's answer). You may also need to do it in several loops, since the vector 1:403716544 will not fit in memory on my machine.
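A minimal sketch of that chunked loop, not a tuned implementation. The names combinadic_idx and write_combos and the chunk_size parameter are hypothetical; combinadic_idx is a compact copy of the function above (without the range check) so the block runs on its own. Only one chunk of indices is held in memory at a time, and each chunk is appended to the same open file connection.

```r
# Compact copy of the unranking function, returning positions in 1:n.
combinadic_idx <- function(n, r, i) {
  res <- numeric(r)
  for (j in seq_len(r)) {          # seq_len(r) is evaluated once, before r changes
    v <- n
    while (choose(v, r) >= i) v <- v - 1
    res[j] <- v
    i <- i - choose(v, r)
    n <- v
    r <- r - 1
  }
  res + 1
}

# Write all r-element combinations of x to a CSV file, chunk by chunk.
write_combos <- function(x, r, path, chunk_size = 10000) {
  n <- length(x)
  total <- choose(n, r)
  con <- file(path, open = "w")
  on.exit(close(con))
  start <- 1
  while (start <= total) {
    end <- min(start + chunk_size - 1, total)
    # r x (chunk length) matrix of positions for this block of indices
    idx <- vapply(seq(start, end),
                  function(i) combinadic_idx(n, r, i),
                  numeric(r))
    chunk <- matrix(x[idx], nrow = r)
    # One combination per row; successive writes append to the open connection.
    write.table(t(chunk), con, sep = ",", quote = FALSE,
                row.names = FALSE, col.names = FALSE)
    start <- end + 1
  }
  invisible(total)
}

# Small usage example: 5 letters, groups of 3, chunks of 4 rows.
out_file <- tempfile(fileext = ".csv")
write_combos(letters[1:5], 3, out_file, chunk_size = 4)
print(readLines(out_file))
```

For the real data the call would be along the lines of write_combos(x, 3, "combos.csv"). Expect it to be slow in pure R, since the unranking function is called once per combination.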

Or you could just port the R code to C/C++ and do the looping / writing there, since it would be a lot faster.

