I have two matrices of interest. The first is a "bag of words" matrix with two columns: the document ID and the term ID. For example:
bow[0:10]
Out[1]:
array([[ 0, 10],
       [ 0, 12],
       [ 0, 19],
       [ 0, 20],
       [ 1,  9],
       [ 1, 24],
       [ 2, 33],
       [ 2, 34],
       [ 2, 35],
       [ 3,  2]])
In addition, I have an "index" matrix, where every row contains the start (inclusive) and stop (exclusive) row indices for a given document ID in the bag of words matrix. Ex: row 0 gives the slice bounds for doc ID 0. For example:
index[0:4]
Out[2]:
array([[ 0,  4],
       [ 4,  6],
       [ 6,  9],
       [ 9, 10]])
What I'd like to do is take a random sample of document IDs and get all of the bag of words rows for those document IDs. The bag of words matrix is roughly 150M rows (~1.5 GB), so using numpy.in1d() is too slow. We need to return these rapidly for feeding into a downstream task.
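For reference, the in1d-based approach ruled out above would look roughly like this (a sketch; it is slow because np.in1d scans every row of bow on each call, regardless of how small the sample is):

```python
import numpy as np

def get_rows_slow(bow, ids):
    # np.in1d builds a boolean mask over all of bow's first column,
    # so each call is O(len(bow)) even for a tiny sample of IDs
    return bow[np.in1d(bow[:, 0], ids)]
```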
The naive solution I have come up with is as follows:
def get_rows(ids):
    indices = np.concatenate([np.arange(x1, x2) for x1, x2 in index[ids]])
    return bow[indices]
get_rows([4,10,3,5])
Generic sample
A generic sample that captures the problem would be something like this -
indices = np.array([[ 4,  7],
                    [10, 16],
                    [11, 18]])
The expected output would be -
array([ 4, 5, 6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])
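One possible way to produce that output without a Python-level loop (an assumption on my part, not from the original post; the helper name `ranges` is mine) is the repeat/cumsum trick: compute each range's length, then shift a single flat arange so it restarts at every start value.

```python
import numpy as np

def ranges(starts, stops):
    # vectorized equivalent of
    # np.concatenate([np.arange(a, b) for a, b in zip(starts, stops)])
    lens = stops - starts                          # length of each range
    # within block i the output is position + (starts[i] - offset[i]),
    # and starts - offsets simplifies to stops - cumsum(lens)
    shifts = np.repeat(stops - np.cumsum(lens), lens)
    return shifts + np.arange(lens.sum())

pairs = np.array([[ 4,  7],
                  [10, 16],
                  [11, 18]])
out = ranges(pairs[:, 0], pairs[:, 1])
# → [ 4  5  6 10 11 12 13 14 15 11 12 13 14 15 16 17]
```

The same idea would replace the list comprehension in get_rows: build the flat indices with `ranges(index[ids, 0], index[ids, 1])` and fancy-index bow once.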