There are several levels of optimization possible here to bring this problem down from O(n^2) to a lower time complexity.
Preprocessing: Sort your list in the first pass, creating an output map for each string; the key for the map can be the normalized string.
Normalizations may include:
- lowercase conversion,
- removal of whitespace and special characters,
- transforming Unicode to ASCII equivalents where possible (use unicodedata.normalize or the unidecode module).
This would result in "Andrew H Smith", "andrew h. smith", and "ándréw h. smith" all generating the same key "andrewhsmith", and would reduce your set of a million names to a smaller set of unique/similar grouped names.
You can use this utility method to normalize your string (it does not include the Unicode part, though):

    import re

    def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=None):
        """Process a string for similarity comparisons.

        Removes the substrings in ignore_list and, if normalized is True,
        cleans special characters and extra whitespace.

        Args:
            input_str (str): input string to be processed
            normalized (bool): if True, removes special characters and whitespace
                from the string and converts it to lowercase
            ignore_list (list): substrings to remove from the input string
        Returns:
            str: the processed string
        """
        # Remove ignored substrings case-insensitively; re.escape treats them literally.
        for ignore_str in ignore_list or []:
            input_str = re.sub(re.escape(ignore_str), "", input_str, flags=re.IGNORECASE)
        if normalized:
            input_str = input_str.strip().lower()
            # Strip every non-word character (spaces, punctuation).
            input_str = re.sub(r"\W", "", input_str).strip()
        return input_str

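The utility above leaves out the Unicode step. A minimal sketch of that step, combined with grouping names by their key, might look like the following. The names `normalized_key` and `group_by_key` are hypothetical helpers, not part of any library, and the sketch assumes that dropping characters with no ASCII decomposition is acceptable for your data (the unidecode module transliterates more aggressively if it is not).

```python
import re
import unicodedata
from collections import defaultdict


def normalized_key(name):
    """Hypothetical helper: fold accents to ASCII, lowercase, drop non-word characters."""
    # NFKD splits an accented letter into its base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\W", "", ascii_name.lower())


def group_by_key(names):
    """Map each normalized key to the list of original names that produced it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalized_key(name)].append(name)
    return dict(groups)


groups = group_by_key(
    ["Andrew H Smith", "andrew h. smith", "ándréw h. smith", "Andrew H. Smeeth"]
)
# groups has two keys: "andrewhsmith" (three names) and "andrewhsmeeth" (one name)
```

Only the keys of `groups` need to go through the more expensive fuzzy comparison afterwards.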
Now similar strings will already lie in the same bucket if their normalized key is the same.
For further comparison, you only need to compare the keys, not the names, e.g. andrewhsmith and andrewhsmeeth, since this kind of name similarity needs fuzzy string matching in addition to the normalized comparison done above.
Bucketing: Do you really need to compare a 5-character key with a 9-character key to see if they are a 95% match? No, you do not.
So you can bucket your strings by length, e.g. 5-character names will be matched with 4-6 character names, 6-character names with 5-7 character names, etc. A limit of n-1 to n+1 characters for an n-character key is a reasonably good bucket for most practical matching.
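One way to sketch the length bucketing described here is to index keys by length and then look only in the neighbouring buckets. The function names and the `tolerance` parameter are illustrative assumptions; `tolerance=1` corresponds to the n-1 to n+1 window.

```python
from collections import defaultdict


def bucket_by_length(keys):
    """Index keys by their length so lookups only touch nearby lengths."""
    buckets = defaultdict(set)
    for key in keys:
        buckets[len(key)].add(key)
    return buckets


def length_candidates(key, buckets, tolerance=1):
    """Return the keys whose length is within `tolerance` of len(key), excluding key itself."""
    n = len(key)
    candidates = set()
    for length in range(n - tolerance, n + tolerance + 1):
        # .get avoids creating empty buckets as a side effect
        candidates |= buckets.get(length, set())
    candidates.discard(key)
    return candidates


buckets = bucket_by_length(["andrewhsmith", "andrewhsmeeth", "bob", "andy"])
near = length_candidates("andrewhsmith", buckets)
# near contains only "andrewhsmeeth"; "bob" and "andy" are far too short
```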
Beginning match: Most variations of a name will have the same first character in the normalized format (e.g. Andrew H Smith, ándréw h. smith, and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith, and andrewhsmeeth respectively).
They will usually not differ in the first character, so you can match keys starting with a only against other keys that start with a and fall within the length buckets. This greatly reduces your matching time. There is no need to match the key andrewhsmith against bndrewhsmith, as a name variation in the first letter will rarely exist.
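Combining the first-character rule with the length window could be sketched as below. `candidate_pairs` is a hypothetical helper; comparisons inside each first-letter block are still pairwise, so the gain comes from the blocks being much smaller than the full list.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(keys, tolerance=1):
    """Yield unordered key pairs that share a first character and differ in length by at most `tolerance`."""
    by_first_char = defaultdict(list)
    for key in set(keys):
        if key:  # skip empty keys, which have no first character
            by_first_char[key[0]].append(key)
    for block in by_first_char.values():
        # sorted() only makes the output order deterministic
        for a, b in combinations(sorted(block), 2):
            if abs(len(a) - len(b)) <= tolerance:
                yield a, b


pairs = list(candidate_pairs(["andrewhsmith", "andrewhsmeeth", "bndrewhsmith", "al"]))
# pairs == [("andrewhsmeeth", "andrewhsmith")]
```

Here bndrewhsmith is never compared with the keys starting with a, and al is dropped by the length window.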
Then you can use something along the lines of this method (or the FuzzyWuzzy module) to find the string similarity percentage. You may drop either jaro_winkler or difflib to tune your speed and result quality:

    import difflib

    import jellyfish

    def find_string_similarity(first_str, second_str, normalized=False, ignore_list=None):
        """Calculate a matching ratio between two strings.

        Args:
            first_str (str): first string
            second_str (str): second string
            normalized (bool): if True, removes special characters and extra whitespace
                from the strings before calculating the matching ratio
            ignore_list (list): substrings to be replaced with "" in both strings
        Returns:
            float: a matching ratio between 1.0 (most matching) and 0.0 (not matching),
                using difflib's SequenceMatcher and jellyfish's Jaro-Winkler algorithms
                with equal weight given to each
        Examples:
            >>> find_string_similarity("hello world", "Hello,World!", normalized=True)
            1.0
            >>> find_string_similarity("entrepreneurship", "entreprenaurship")
            0.95625
            >>> find_string_similarity("Taj-Mahal", "The Taj Mahal", normalized=True, ignore_list=["the", "of"])
            1.0
        """
        first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
        second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
        # Recent jellyfish releases name this jaro_winkler_similarity; older ones call it jaro_winkler.
        match_ratio = (
            difflib.SequenceMatcher(None, first_str, second_str).ratio()
            + jellyfish.jaro_winkler_similarity(first_str, second_str)
        ) / 2.0
        return match_ratio

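As a rough illustration of the final step, the sketch below scores already-filtered candidate pairs and keeps those above a cutoff. It uses only difflib so that it runs without extra dependencies; `similar_pairs` and the 0.85 threshold are assumptions for the example, not recommended values.

```python
import difflib


def similar_pairs(pairs, threshold=0.85):
    """Keep (a, b, ratio) for candidate pairs whose SequenceMatcher ratio meets the threshold."""
    matches = []
    for a, b in pairs:
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            matches.append((a, b, ratio))
    return matches


matches = similar_pairs(
    [("andrewhsmith", "andrewhsmeeth"), ("andrewhsmith", "alexanderlee")]
)
# Only the Smith/Smeeth pair survives; difflib alone scores it 0.88.
```

Swapping in the blended score from find_string_similarity only changes the line that computes `ratio`; the cutoff would then need retuning for your data.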