Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
344 views
in Technique[技术] by (71.8m points)

algorithm - String similarity metrics in Python

I want to find string similarity between two strings. This page has examples of some of them. Python has an implemnetation of Levenshtein algorithm. Is there a better algorithm, (and hopefully a python library), under these contraints.

  1. I want to do fuzzy matches between strings. eg matches('Hello, All you people', 'hello, all You peopl') should return True
  2. False negatives are acceptable, False positives, except in extremely rare cases are not.
  3. This is done in a non realtime setting, so speed is not (much) of concern.
  4. [Edit] I am comparing multi word strings.

Would something other than Levenshtein distance(or Levenshtein ratio) be a better algorithm for my case?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq=difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can make this as a function

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...