This question is pretty popular. Since similar questions keep getting closed and linked here, I think it's worth pointing out that even though the existing answers are quite fast for thousands of data points, they start to break down after that. My potato segfaults at 10k items in each array.
The potential problem with the other answers is the algorithmic complexity: they compare everything in X to everything in Y. To get around that, at least on average, we need a better strategy for ruling out some of the things in Y.
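To make the cost concrete, here's a minimal sketch of what that brute-force comparison looks like (the helper name is mine, not from any of the other answers). Building the full distance matrix with scipy's cdist takes time and memory proportional to len(X) * len(Y), which is exactly what stops scaling:

import numpy as np
from scipy.spatial.distance import cdist

def brute_force_nearest(X, Y):
    # Every pairwise distance: a len(Y)-by-len(X) matrix.
    dists = cdist(Y, X)
    # For each point in Y, the index of its nearest neighbor in X and that distance.
    neighbor_indices = dists.argmin(axis=1)
    neighbor_dists = dists[np.arange(len(Y)), neighbor_indices]
    return neighbor_dists, neighbor_indices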
In one dimension this is easy -- just sort everything and start popping out nearest neighbors. In two dimensions there are a variety of strategies, but KD-trees are reasonably popular and are already implemented in the scipy stack. On my machine, there's a crossover between the various methods around the point where each of X and Y has 6k things in it.
from scipy.spatial import KDTree
# Build the tree once over X, then query each point in Y for its nearest neighbor in X:
# neighbor_indices[i] is the row of X closest to Y[i], and neighbor_dists[i] is that distance.
tree = KDTree(X)
neighbor_dists, neighbor_indices = tree.query(Y)
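As an aside, the one-dimensional trick mentioned above can be written with a sort plus binary search rather than a tree. This is just an illustrative sketch (the function name is made up), assuming X and Y are flat arrays:

import numpy as np

def nearest_in_1d(X, Y):
    # Sort X once, then binary-search each value of Y into it.
    order = np.argsort(X)
    X_sorted = X[order]
    pos = np.searchsorted(X_sorted, Y)
    left = np.clip(pos - 1, 0, len(X) - 1)    # candidate just below
    right = np.clip(pos, 0, len(X) - 1)       # candidate just above
    pick_left = np.abs(Y - X_sorted[left]) <= np.abs(X_sorted[right] - Y)
    nearest = np.where(pick_left, left, right)
    return order[nearest]                     # indices back into the original X

That's roughly (len(X) + len(Y)) * log(len(X)) work instead of the quadratic scan.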
The extremely poor performance of scipy's KDTree implementation has been a sore spot of mine for a while, especially given how much has been built on top of it. There are probably data sets where it performs well, but I haven't seen one yet.
If you don't mind an extra dependency, you can get a 1000x speed boost just by switching your KDTree library. The package pykdtree is pip-installable, and I pretty much guarantee the conda packages work fine too. With this approach, my used, budget chromebook can process X and Y with 10 million points each in barely 30 seconds. That beats segfaulting at 10 thousand points, whatever the wall time ;)
from pykdtree.kdtree import KDTree
tree = KDTree(X)
neighbor_dists, neighbor_indices = tree.query(Y)
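To spell out the swap with placeholder data (the random arrays stand in for your real X and Y; I believe pykdtree wants C-contiguous float32/float64 input, so the example coerces it to be safe):

import numpy as np
from pykdtree.kdtree import KDTree

rng = np.random.default_rng(0)
X = np.ascontiguousarray(rng.random((1_000_000, 2)))  # stand-in data
Y = np.ascontiguousarray(rng.random((1_000_000, 2)))

tree = KDTree(X)
# Same (distances, indices) pair as the scipy version, one entry per row of Y.
neighbor_dists, neighbor_indices = tree.query(Y)

The call matches the scipy snippet above closely enough that nothing else in the surrounding code should need to change.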
Fight with dragons too long and you become a dragon yourself; gaze too long into the abyss and the abyss gazes back into you…