So I am using Levenshtein distance to find the closest match and replace many values in a large data frame, using this answer as a base:
    import operator

    def levenshteinDistance(s1, s2):
        if len(s1) > len(s2):
            s1, s2 = s2, s1
        distances = range(len(s1) + 1)
        for i2, c2 in enumerate(s2):
            distances_ = [i2+1]
            for i1, c1 in enumerate(s1):
                if c1 == c2:
                    distances_.append(distances[i1])
                else:
                    distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
            distances = distances_
        return distances[-1]

    def closest_match(string, matchings):
        scores = {}
        for m in matchings:
            scores[m] = 1 - levenshteinDistance(string,m)
        return max(scores.items(), key=operator.itemgetter(1))[0]
Replacing many values in a moderately large dataframe (100k+ rows) with values from another of similar size, as follows, takes forever to run (it has been running for the last half hour, ha!):
    results2.products = [closest_match(string, results2.products)
                         if string not in results2.products else string
                         for string in results.products]
So is there a way to do this more efficiently? I added the if-else condition so that a direct match skips the distance calculation, which would have produced the same result anyway.
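To make the cost concrete: the loop above runs one distance calculation per (row, candidate) pair, so two 100k-row columns mean on the order of 10^10 calls. A minimal sketch of one way to cut that down, assuming the goal is to map each value in results2 to its nearest value in results: score each distinct value only once, and use a set for the exact-match check. The names `replace_with_closest`, `messy` and `clean` are hypothetical and only for illustration; plain lists stand in for the dataframe columns.

```python
from functools import lru_cache


def levenshtein_distance(s1, s2):
    """Edit distance between two strings (same algorithm as in the question)."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        current = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                current.append(previous[i1])
            else:
                current.append(1 + min(previous[i1], previous[i1 + 1], current[-1]))
        previous = current
    return previous[-1]


def replace_with_closest(values, candidates):
    """Map each value to its closest candidate, scoring each distinct value once."""
    # Hypothetical helper: de-duplicate candidates (keeping order) so repeated
    # strings are never scored twice, and keep a set for O(1) exact matches.
    unique_candidates = list(dict.fromkeys(candidates))
    candidate_set = set(unique_candidates)

    @lru_cache(maxsize=None)
    def closest(value):
        if value in candidate_set:  # direct match: no distance computation
            return value
        # Smallest distance wins; ties go to the earliest candidate.
        return min(unique_candidates, key=lambda c: levenshtein_distance(value, c))

    return [closest(v) for v in values]


messy = ["kechap", "chease", "marinara", "kechap"]  # stands in for results2.products
clean = ["pizza", "ketchup", "cheese", "marinara"]  # stands in for results.products
print(replace_with_closest(messy, clean))
# → ['ketchup', 'cheese', 'marinara', 'ketchup']
```

This only helps when values repeat; if nearly every row is distinct, the pairwise cost stays and a compiled distance implementation would likely matter more. With pandas, the lists could come from `results2.products.tolist()` and `results.products.tolist()`.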
Sample Data
results:
products
0, pizza
1, ketchup
2, salami
3, anchovy
4, pepperoni
5, marinara
6, olive
7, sausage
8, cheese
9, bbq sauce
10, stuffed crust
results2:
products
0, salaaaami
1, kechap
2, lives
3, ppprn
4, pizzas
5, marinara
6, sauce de bbq
7, marinara sauce
8, chease
9, sausages
10, crust should be stuffed
I want values in results2 to be replaced by the closest match in results.