algorithm - Efficient way of calculating likeness scores of strings when sample size is large?

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list.

I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another.

Let's say that I define "suspiciously close to another email address" as two strings having a Levenshtein score less than N.

Is there a more efficient way to find pairs of strings whose score is lower than this threshold besides comparing every possible string to every other possible string in the list? In other words, can this type of problem be solved quicker than O(n^2)?

Is Levenshtein score a poor choice of algorithms for this problem?
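For reference, the all-pairs comparison described above can be sketched roughly as follows. This is only a minimal baseline, not anyone's production code: the names `levenshtein` and `close_pairs` are made up for illustration, and "less than N" is taken to mean strictly below the threshold.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance using the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def close_pairs(strings, threshold):
    """Return every pair whose distance is strictly below `threshold`.

    This compares all n * (n - 1) / 2 pairs, i.e. the O(n^2) approach.
    """
    return [
        (s, t)
        for s, t in combinations(strings, 2)
        if levenshtein(s, t) < threshold
    ]


if __name__ == "__main__":
    emails = ["bob@example.com", "b0b@example.com", "alice@example.org"]
    print(close_pairs(emails, threshold=2))
```

With 10,000 addresses this performs about 50 million distance computations, which is the cost the question is trying to avoid.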

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:37:16+0000

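Yup, you can do better than comparing every pair. Put the strings in a BK-tree and you can find all strings within a given distance of a query string in roughly O(log n) time when the distance threshold is small. That is typical behavior rather than a guaranteed bound, and lookups slow down as the threshold grows. Running one lookup per address then costs about O(n log n) overall instead of O(n^2). Alternative solutions that generate every string within distance N of each address may be faster for a Levenshtein distance of 1, but the amount of work rapidly balloons out of control for larger distances.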
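A minimal sketch of the BK-tree idea, assuming a plain dynamic-programming Levenshtein function as the metric. The class name `BKTree` and its `add`/`search` methods are hypothetical names chosen for this example, not a particular library's API. The tree relies on the triangle inequality: if the query is at distance d from a node, only children on edges between d - max_distance and d + max_distance can contain matches.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


class BKTree:
    """Illustrative BK-tree; each node is (word, {edge_distance: child})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # exact duplicate, already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child  # descend along the edge labelled d

    def search(self, word, max_distance):
        """Return (distance, stored_word) for all words within max_distance."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # Triangle inequality: other subtrees cannot hold a match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return results


if __name__ == "__main__":
    tree = BKTree(levenshtein)
    for email in ["bob@example.com", "b0b@example.com",
                  "bobb@example.com", "alice@example.org"]:
        tree.add(email)
    print(sorted(tree.search("bob@example.com", 1)))
```

Since the question defines "suspiciously close" as a score less than N, the lookup would be called with `max_distance = N - 1`. A search for a string that is itself in the tree also returns that string at distance 0, so those self-matches need filtering out when collecting pairs.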
