Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
340 views
in Technique[技术] by (71.8m points)

r - Levenshtein type algorithm with numeric vectors

I have two vectors with numeric values. Such as

v1 <- c(1, 3, 4, 5, 6, 7, 8)
v2 <- c(54, 23, 12, 53, 7, 8)

I would like to compute the number of insertions, deletions and replacements that I need to turn one vector into the other with certain costs per operation c1 c2 and c3 respectively. I am aware that the function adist on the base package does this for strings but I have no knowledge of the equivalent function with numbers.

I thought about referencing each number with a letter but I have more than 2000 unique numbers so if anybody knows how to get 2000 different characters in R that would also be a solution for me.

Thanks for your help.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

An integer vector can be seen as a single string encoded in UTF-32 (in which one Unicode code point is represented as a single 32-bit integer). You may obtain an "ordinary" string, just by converting such a vector to UTF-8 with intToUtf8.

intToUtf8(c(65, 97))
## [1] "Aa"

By the way, adist does utf8ToInt (reverse op) by default on its inputs anyway. So internally, it computes the results according to integer vectors. No big hack.

This is the solution.

adist(intToUtf8(c(1, 3, 4, 5, 6, 7, 8)), intToUtf8(c(54, 23, 12, 53, 7, 8)), counts=TRUE)
##      [,1]
## [1,]    5
## attr(,"counts")
## , , ins
## 
##      [,1]
## [1,]    0
## 
## , , del
## 
##      [,1]
## [1,]    1
## 
## , , sub
## 
##      [,1]
## [1,]    4
## 
## attr(,"trafos")
##      [,1]     
## [1,] "SSSSDMM"

The above code should work if at least all the numbers are strictly greater than 0. R treats Unicode code points quite liberally (in fact, too liberally, but in this case you're a winner), even the largest possible integer is accepted:

utf8ToInt(intToUtf8(c(2147483647)))
## 2147483647

If you have a vector with negative values, you may transform it somehow, e.g. with x <- x-min(x)+1.

If you need different costs for insertion, removal, replacement, check out the adist's costs argument. There is also a package called stringdist, which included many other string metrics. The above scheme should also work there.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...