This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.
set.seed(1)
rstr <- function(n,k){ # vector of n random char(k) strings
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=3)
df <- data.frame(str,cutree(hc,k=3))
In this example, we create a set of 30 random char(5) strings artificially in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...)
, and we run heirarchal clustering using hclust(...)
. Then we cut the dendrogram into three clusters with cutree(...)
and append the cluster id's to the original strings.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…