Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
448 views
in Technique[技术] by (71.8m points)

python - Efficiently delete each row of an array if it occurs in another array in pure numpy

I have one numpy array, where indices are stored in the shape of (n, 2). E.g.:

[[0, 1],
 [2, 3], 
 [1, 2], 
 [4, 2]]

Then I do some processing and create an array in the shape of (m, 2), where n > m. E.g.:

[[2, 3]
 [4, 2]]

Now I want to delete every row in the first array that can be found in the second array as well. So my wanted result is:

[[0, 1], 
 [1, 2]]

My current solution is as follows:

for row in second_array:
        result = np.delete(first_array, np.where(np.all(first_array == second_array, axis=1)), axis=0)

However, this is quiet time consuming if the second is large. Does someone know a numpy only solution, which does not require a loop?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Here's one leveraging the fact that they are positive numbers using matrix-multiplication for dimensionality-reduction -

def setdiff_nd_positivenums(a,b):
    s = np.maximum(a.max(0)+1,b.max(0)+1)
    return a[~np.isin(a.dot(s),b.dot(s))]

Sample run -

In [82]: a
Out[82]: 
array([[0, 1],
       [2, 3],
       [1, 2],
       [4, 2]])

In [83]: b
Out[83]: 
array([[2, 3],
       [4, 2]])

In [85]: setdiff_nd_positivenums(a,b)
Out[85]: 
array([[0, 1],
       [1, 2]])

Also, it seems the second-array b is a subset of a. So, we can leverage that scenario to boost the performance even further using np.searchsorted, like so -

def setdiff_nd_positivenums_searchsorted(a,b):
    s = np.maximum(a.max(0)+1,b.max(0)+1)
    a1D,b1D = a.dot(s),b.dot(s)
    b1Ds = np.sort(b1D)
    return a[b1Ds[np.searchsorted(b1Ds,a1D)] != a1D]

Timings -

In [146]: np.random.seed(0)
     ...: a = np.random.randint(0,9,(1000000,2))
     ...: b = a[np.random.choice(len(a), 10000, replace=0)]

In [147]: %timeit setdiff_nd_positivenums(a,b)
     ...: %timeit setdiff_nd_positivenums_searchsorted(a,b)
10 loops, best of 3: 101 ms per loop
10 loops, best of 3: 70.9 ms per loop

For generic numbers, here's another using views -

# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

def setdiff_nd(a,b):
    # a,b are the nD input arrays
    A,B = view1D(a,b)    
    return a[~np.isin(A,B)]

Sample run -

In [94]: a
Out[94]: 
array([[ 0,  1],
       [-2, -3],
       [ 1,  2],
       [-4, -2]])

In [95]: b
Out[95]: 
array([[-2, -3],
       [ 4,  2]])

In [96]: setdiff_nd(a,b)
Out[96]: 
array([[ 0,  1],
       [ 1,  2],
       [-4, -2]])

Timings -

In [158]: np.random.seed(0)
     ...: a = np.random.randint(0,9,(1000000,2))
     ...: b = a[np.random.choice(len(a), 10000, replace=0)]

In [159]: %timeit setdiff_nd(a,b)
1 loop, best of 3: 352 ms per loop

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...