Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

machine learning - Weighted Kmeans R

I want to run k-means clustering on a dataset (Sample_Data) with three variables (columns), like this:

     A  B  C
1    12 10 1
2    8  11 2
3    14 10 1
.    .   .  .
.    .   .  .
.    .   .  .

Typically, after scaling the columns and deciding on the number of clusters, I would call this in R:

Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)

But what if some variables matter more than others? Suppose variable (column) A is more important than the other two. How can I pass variable weights to the model? Thank you all.

1 Reply

You need a weighted k-means clustering, such as the one provided by the flexclust package:

https://cran.r-project.org/web/packages/flexclust/flexclust.pdf

The function signature is:

cclust(x, k, dist = "euclidean", method = "kmeans",
       weights = NULL, control = NULL, group = NULL, simple = FALSE,
       save.data = FALSE)

It performs k-means clustering, hard competitive learning or neural gas on a data matrix. The package documentation describes the weights argument as follows:

weights: An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.

A toy example using iris data:

library(flexclust)
data(iris)
cl <- cclust(iris[, -5], k = 3, save.data = TRUE,
             weights = c(1, 0.5, 1, 0.1), method = "hardcl")
cl  
    kcca object of family ‘kmeans’ 

    call:
    cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)

    cluster sizes:

     1  2  3 
    50 59 41 

As the output of cclust shows, the family is still kmeans even when competitive learning is used. The difference lies in how the centers are updated during the training phase:

If method is "kmeans", the classic kmeans algorithm as given by MacQueen (1967) is used, which works by repeatedly moving all cluster centers to the mean of their respective Voronoi sets. If "hardcl", on-line updates are used (AKA hard competitive learning), which work by randomly drawing an observation from x and moving the closest center towards that point (e.g., Ripley 1996).
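To make the two update rules quoted above more concrete, here is a minimal base-R sketch of one batch step and one online step. It is only an illustration, not flexclust's actual implementation: the helper names (nearest, batch_step, online_step) and the fixed learning rate are my own assumptions.

```r
# Toy data: the four numeric iris columns as a matrix.
x <- as.matrix(iris[, -5])
set.seed(1)
centers <- x[sample(nrow(x), 3), , drop = FALSE]  # 3 random rows as starting centers

# Index of the closest center (squared Euclidean distance) for one point p.
nearest <- function(p, centers) {
  which.min(rowSums(sweep(centers, 2, p)^2))
}

# Batch step (the "kmeans" idea): assign every point to its closest center,
# then move each center to the mean of the points assigned to it.
batch_step <- function(x, centers) {
  cl <- apply(x, 1, nearest, centers = centers)
  for (j in seq_len(nrow(centers))) {
    if (any(cl == j)) {
      centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
    }
  }
  centers
}

# Online step (the "hardcl" idea): draw one random point and move only its
# closest center part of the way towards it. `rate` is a hypothetical fixed
# learning rate chosen for illustration.
online_step <- function(x, centers, rate = 0.1) {
  p <- x[sample(nrow(x), 1), ]
  j <- nearest(p, centers)
  centers[j, ] <- centers[j, ] + rate * (p - centers[j, ])
  centers
}

centers_batch  <- batch_step(x, centers)   # all centers may move
centers_online <- online_step(x, centers)  # at most one center moves
```

The batch step touches every center on each pass, while the online step changes at most one center per drawn observation, which is why the two methods can end up with somewhat different partitions.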

The weights parameter is just a vector of numbers. In general I use values between 0.01 (minimum weight) and 1 (maximum weight).
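It may be worth checking on the flexclust help page whether weights in cclust applies to columns or to observations (rows). If the goal is specifically to weight variables, as in the question, one alternative that needs only base R is to rescale the columns before calling kmeans: multiplying column j by sqrt(w_j) makes the squared Euclidean distance equal to the weighted sum of w_j * (difference in column j)^2. The sketch below assumes that setup; the weight values are hypothetical and chosen only for illustration.

```r
# Hypothetical variable weights: larger means the column counts more.
w <- c(Sepal.Length = 1, Sepal.Width = 0.5, Petal.Length = 1, Petal.Width = 0.1)

x   <- scale(iris[, -5])             # standardize first, as in the question
x_w <- sweep(x, 2, sqrt(w), "*")     # multiply column j by sqrt(w_j)

set.seed(42)
fit <- kmeans(x_w, centers = 3, nstart = 50)
table(fit$cluster)

# Cluster centers mapped back to the standardized (unweighted) scale.
centers_std <- sweep(fit$centers, 2, sqrt(w), "/")
```

For the three-column Sample_Data in the question, the same idea would use a weight vector of length three, with the largest value for column A.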

