python - How to select which columns are good for visualisation in k-Means clustering algorithm?

Question

posted Feb 21, 2021 in Technique by 深蓝 (71.8m points)

I am trying to understand which columns of a CSV file should be taken into consideration when applying k-means.

In the link below, only annual income and spending score are taken as columns (from the Mall_Customers.csv file) for visualisation, and not age.

https://www.kaggle.com/shrutimechlearn/step-by-step-kmeans-explained-in-detail

Please help.

ask by Penguin Tech translate from so

1 Reply

深蓝 · Answer 1 · 2021-02-21T04:15:44+0000

The dataset has 3 features that can be used to cluster: age, annual income and spending score.

Usually k-means just takes the Euclidean distance over all the features to measure how far each point is from each cluster centre.

This is very easy to visualize in two dimensions.

Take two points, and the distance between them is the hypotenuse of a right triangle.

In three dimensions, it's a little harder to visualize.
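
To make the distance idea concrete, here is a minimal sketch using only the Python standard library. The customer values are made up for illustration and are not taken from Mall_Customers.csv. It shows that going from two features to three just adds one more squared difference under the square root.

```python
import math

# Two hypothetical customers described by (annual income, spending score).
a2 = (15, 39)
b2 = (18, 43)

# In 2D the distance is the hypotenuse of a right triangle whose legs are
# the differences along each axis (here 3 and 4).
d2 = math.hypot(b2[0] - a2[0], b2[1] - a2[1])

# The same customers with age added as a first coordinate: (age, income, score).
a3 = (19, 15, 39)
b3 = (19, 18, 43)

# math.dist (Python 3.8+) computes the Euclidean distance in any dimension.
d3 = math.dist(a3, b3)

print(d2)  # 5.0
print(d3)  # 5.0, because the two ages are equal and add nothing to the sum
```

The computation is no harder in three dimensions; only the picture is.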

The author is simply using 2 dimensions so she can plot it later.

However, to use all three dimensions you would simply modify the code to:

X = dataset.iloc[:, [2, 3, 4]].values  # assuming columns 2-4 are age, income and spending score

and that will use age, income and spending score in the algorithm.
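
As a rough end-to-end sketch of the difference, the snippet below clusters once on two columns and once on three. It assumes pandas, NumPy and scikit-learn are installed. The DataFrame is a randomly generated stand-in for Mall_Customers.csv, and its column names and order (CustomerID, Gender, Age, Annual Income, Spending Score) are an assumption; check the real file before reusing the indices. The choice of 5 clusters is arbitrary here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for Mall_Customers.csv: the column names and order
# are an assumption, and the values are randomly generated.
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "CustomerID": np.arange(1, 201),
    "Gender": rng.choice(["Male", "Female"], size=200),
    "Age": rng.integers(18, 70, size=200),
    "Annual Income (k$)": rng.integers(15, 140, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})

# Two features (income and spending score only, as described in the
# question): easy to draw as a 2D scatter plot.
X2 = dataset.iloc[:, [3, 4]].values

# Three features: age now takes part in the distance computation as well.
X3 = dataset.iloc[:, [2, 3, 4]].values

# n_clusters=5 is an arbitrary choice for this sketch.
labels2 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X2)
labels3 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X3)

print(X2.shape, X3.shape)          # (200, 2) (200, 3)
print(len(labels2), len(labels3))  # one cluster label per customer
```

Either set of labels can still be drawn on a 2D scatter plot of any two columns; the number of columns used for clustering and the number used for plotting do not have to match.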