A vectorized beast:
## D[i, j] is TRUE when b_1[i] and b_2[j] differ by more than the tolerance
D <- abs(outer(b_1, b_2, "-")) > 0.045
## a row (column) of all TRUE means that element has no match in the other vector
in_b1_not_in_b2 <- b_1[rowSums(D) == length(b_2)]
#[1] 1002.570  301.569
in_b2_not_in_b1 <- b_2[colSums(D) == length(b_1)]
#[1]   22.12   53.00 5666.31  100.10
hours later...
Henrik shared a question complaining about the memory explosion when using outer for long vectors: Matching two very very large vectors with tolerance (fast! but working space sparing). However, the memory bottleneck of outer can easily be killed by blocking.
f <- function (b1, b2, threshold, chunk.size = 5000) {
  n1 <- length(b1)
  n2 <- length(b2)
  chunk.size <- min(chunk.size, n1, n2)
  RS <- numeric(n1)  ## rowSums, to be accumulated
  CS <- numeric(n2)  ## colSums, to be accumulated
  ## outer loop: process b2 in chunks
  j <- 0
  while (j < n2) {
    chunk.size_j <- min(chunk.size, n2 - j)
    ind_j <- (j + 1):(j + chunk.size_j)
    b2_j <- b2[ind_j]
    ## inner loop: process b1 in chunks
    i <- 0
    while (i < n1) {
      chunk.size_i <- min(chunk.size, n1 - i)
      ind_i <- (i + 1):(i + chunk.size_i)
      ## only a small chunk.size_i x chunk.size_j matrix is formed here
      M <- abs(outer(b1[ind_i], b2_j, "-")) > threshold
      RS[ind_i] <- RS[ind_i] + rowSums(M)
      CS[ind_j] <- CS[ind_j] + colSums(M)
      i <- i + chunk.size_i
    }
    j <- j + chunk.size_j
  }
  list(in_b1_not_in_b2 = b1[RS == n2], in_b2_not_in_b1 = b2[CS == n1])
}
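Before scaling up, a quick sanity check on small random vectors (a sketch with made-up data, not the vectors from the question) shows that f reproduces the plain vectorized result:
## small random test vectors (illustrative only)
set.seed(1)
x <- runif(1000, 0, 100)
y <- x + runif(1000, -0.1, 0.1)
## reference answer from a single big 'outer' call
D <- abs(outer(x, y, "-")) > 0.045
ref <- list(in_b1_not_in_b2 = x[rowSums(D) == length(y)],
            in_b2_not_in_b1 = y[colSums(D) == length(x)])
## blocked version with a deliberately small chunk size
stopifnot(identical(f(x, y, 0.045, chunk.size = 100), ref))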
With this function, outer never uses more memory than storing two chunk.size x chunk.size matrices. Now let's do something crazy.
b1 <- runif(1e+5, 0, 10000)
b2 <- b1 + runif(1e+5, -1, 1)
If we do a simple outer, we need memory to store two 1e+5 x 1e+5 matrices, which is up to 149 GB. However, on my Sandy Bridge (2011) laptop with only 4 GB RAM, the blocked computation is feasible.
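To see where the 149 GB comes from:
## one 1e+5 x 1e+5 double-precision matrix, in GiB
1e+5 * 1e+5 * 8 / 1024 ^ 3
#[1] 74.50581
Two such matrices account for the 149 GB above.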
system.time(oo <- f(b1, b2, 0.045, 5000))
#   user  system elapsed
#365.800 167.348 533.912
The performance is actually good enough, given that we have been using a very poor algorithm. All answers here do an exhaustive search, which has complexity length(b1) x length(b2). We could reduce this to length(b1) + length(b2) if we worked on sorted arrays. But such a deeply optimized algorithm can only be implemented efficiently in a compiled language.
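For the curious, the sorted idea can at least be sketched in pure R with findInterval() (binary search rather than a true linear merge, so roughly (n1 + n2) * log(n) work after sorting). Treat g below as an illustrative sketch of mine, not the optimized compiled implementation:
g <- function (b1, b2, threshold) {
  ## findInterval() requires its second argument to be sorted
  s1 <- sort(b1)
  s2 <- sort(b2)
  ## for each element, count how many elements of the other vector lie within
  ## the tolerance; the counted interval is (x - threshold, x + threshold],
  ## which only differs from the closed interval when a difference hits the
  ## threshold exactly
  n1_matched <- findInterval(b1 + threshold, s2) - findInterval(b1 - threshold, s2)
  n2_matched <- findInterval(b2 + threshold, s1) - findInterval(b2 - threshold, s1)
  list(in_b1_not_in_b2 = b1[n1_matched == 0], in_b2_not_in_b1 = b2[n2_matched == 0])
}
On the test above, g should agree with f (up to the boundary case noted in the comments) while doing no length(b1) x length(b2) work at all.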