I have an application where I have multiple (10k-200k) points on a 2D-plane. I want to (greedily?) cluster these with a logic of sorts:
1. Select a random point on the plane.
2. Check for other points in radius x.
3. Form a cluster (with either 1 or more points).
These steps can be repeated with the newly formed clusters with higher x, which would result in a somewhat hierarchial clustering (I found this clustering logic on this post and I am fairly sure it would satisfy my needs).
As was mentioned in the post, the algorithm works better with some spatial indexing. Since I don't need "closest point" query type and am only interested in range query, I figured a 2D Range Tree would suit my situation the best.
I can understand the logic of a range tree to some extent, but it all goes quite complex when I have a lot of x-coordinate and y-coordinate duplicates on my data set. What also causes confusion is the median split with even number set.
Example:
Consider the following point set:
(2,7)
(2,6)
(2,5)
(2,4)
(2,3)
(5,6)
(5,5)
(7,5)
(9,3)
(9,2)
(9,1)
What would be the best approach to constructing a range tree from a dataset like that? If I remove the duplicates, how does that affect the corresponding y-tree?
Also, removing duplicates from x-tree would leave me with 2,5,7, and 9. The median of that group is 6, right? Or should I pick a point in the set? If so, is it 5 or 7?
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…