Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
699 views
in Technique[技术] by (71.8m points)

regex - Levenshtein distance in regular expression

Is it possible to include Levenshtein distance in a regular expression query?

(Except by making union between permutations, like this to search for "hello" with Levenshtein distance 1:

.ello | h.llo | he.lo | hel.o | hell.

since this is stupid and unusable for larger Levenshtein distances.)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can generate the regex programmatically. I will leave that as an exercise for the reader, but for the output of this hypothetical function (given an input of "word") you want something like this string:

"^(?>word|wodr|wrod|owrd|word.|wor.d|wo.rd|w.ord|.word|wor.?|wo.?d|w.?rd|.?ord)$"

In English, first you try to match on the word itself, then on every possible single transposition, then on every possible single insertion, then on every possible single omission or substitution (can be done simultaneously).

The length of that string, given a word of length n, is linear (and notably not exponential) with n.

Which is reasonable, I think.

You pass this to your regex generator (like in Ruby it would be Regexp.new(str)) and bam, you got a matcher for ANY word with a Damerau-Levenshtein distance of 1 from a given word.

(Damerau-Levenshtein distances of 2 are far more complicated.)

Note use of the (?> non-backtracing construct which means the order of the individual |'d expressions in that output matter.

I could not think of a way to "compact" that expression.

EDIT: I got it to work, at least in Elixir! https://github.com/pmarreck/elixir-snippets/blob/master/damerau_levenshtein_distance_1.exs

I wouldn't necessarily recommend this though (except for educational purposes) since it will only get you to distances of 1; a legit D-L library will let you compute distances > 1. Although since this is regex, it would probably work pretty fast once constructed (note that you should save the "compiled" regex somewhere since this code currently reconstructs it on EVERY comparison!)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...