Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
667 views
in Technique[技术] by (71.8m points)

python - how do I cluster a list of geographic points by distance?

I have a list of points P=[p1,...pN] where pi=(latitudeI,longitudeI).

Using Python 3, I would like to find a smallest set of clusters (disjoint subsets of P) such that every member of a cluster is within 20km of every other member in the cluster.

Distance between two points is computed using the Vincenty method.

To make this a little more concrete, suppose I have a set of points such as

from numpy import *
points = array([[33.    , 41.    ],
       [33.9693, 41.3923],
       [33.6074, 41.277 ],
       [34.4823, 41.919 ],
       [34.3702, 41.1424],
       [34.3931, 41.078 ],
       [34.2377, 41.0576],
       [34.2395, 41.0211],
       [34.4443, 41.3499],
       [34.3812, 40.9793]])

Then I am trying to define this function:

from geopy.distance import vincenty
def clusters(points, distance):
    """Returns smallest list of clusters [C1,C2...Cn] such that
       for x,y in Ci, vincenty(x,y).km <= distance """
    return [points]  # Incorrect but gives the form of the output

NOTE: Many questions cluster on geo location and attribute. My question is for location only. This is for lat/lon, not Euclidean distance. There are other questions out there that give sort-of answers but not the answer to this question (many unanswered):

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This might be a start. the algorithm attempts to k means cluster the points by iterating k from 2 to the number of points validating each solution along the way. You should pick the lowest number.

It works by clustering the points and then checking that each cluster obeys the constraint. If any cluster is not compliant the solution is labeled as False and we move on to the next number of clusters.

Because the K-means algorithm used in sklearn falls into local minima, proving whether or not this is the solution you're looking for is the best one is still to be established, but it could be one

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import math

points = np.array([[33.    , 41.    ],
       [33.9693, 41.3923],
       [33.6074, 41.277 ],
       [34.4823, 41.919 ],
       [34.3702, 41.1424],
       [34.3931, 41.078 ],
       [34.2377, 41.0576],
       [34.2395, 41.0211],
       [34.4443, 41.3499],
       [34.3812, 40.9793]])


def distance(origin, destination): #found here https://gist.github.com/rochacbruno/2883505
    lat1, lon1 = origin[0],origin[1]
    lat2, lon2 = destination[0],destination[1]
    radius = 6371 # km
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) 
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d

def create_clusters(number_of_clusters,points):
    kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(points)
    l_array = np.array([[label] for label in kmeans.labels_])
    clusters = np.append(points,l_array,axis=1)
    return clusters

def validate_solution(max_dist,clusters):
    _, __, n_clust = clusters.max(axis=0)
    n_clust = int(n_clust)
    for i in range(n_clust):
        two_d_cluster=clusters[clusters[:,2] == i][:,np.array([True, True, False])]
        if not validate_cluster(max_dist,two_d_cluster):
            return False
        else:
            continue
    return True

def validate_cluster(max_dist,cluster):
    distances = cdist(cluster,cluster, lambda ori,des: int(round(distance(ori,des))))
    print(distances)
    print(30*'-')
    for item in distances.flatten():
        if item > max_dist:
            return False
    return True

if __name__ == '__main__':
    for i in range(2,len(points)):
        print(i)
        print(validate_solution(20,create_clusters(i,points)))

Once a benchmark established one would have to focus more one each cluster to establish whether its' points could be distributed to others without violating the distance constraint.

You can replace the lambda function in cdist with whatever distance metric you chose, I found the great circle distance in the repo i mentioned.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...