Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to efficiently replace values in a large dataframe (100k+ rows) from another based on closest match?

I am using Levenshtein distance to find the closest match and replace many values in a large dataframe, using this answer as a base:

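import operator

def levenshteinDistance(s1, s2):
    # Keep s1 as the shorter string so the working row stays small.
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    # distances[i] is the edit distance between s1[:i] and the prefix of s2 seen so far.
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                # One edit plus the cheapest of substitution, deletion, insertion.
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

def closest_match(string, matchings):
    # Higher score means smaller distance; return the best-scoring candidate.
    scores = {}
    for m in matchings:
        scores[m] = 1 - levenshteinDistance(string,m)

    return max(scores.items(), key=operator.itemgetter(1))[0]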

Replacing many values in a moderately large dataframe (100k+ rows) with values from another of similar size, as follows, takes forever to run (it has been running for the last half hour, ha!):

results2.products = [closest_match(string, results2.products) 
                    if string not in results2.products else string 
                    for string in results.products]

Is there a way to do this more efficiently? I added the if-else condition so that a direct match skips the distance calculation entirely, since the calculation would have produced the same result anyway.

Sample Data

results:

   products
0, pizza
1, ketchup
2, salami
3, anchovy
4, pepperoni
5, marinara
6, olive
7, sausage
8, cheese
9, bbq sauce
10, stuffed crust

results2:

   products
0, salaaaami
1, kechap
2, lives
3, ppprn
4, pizzas
5, marinara
6, sauce de bbq
7, marinara sauce
8, chease
9, sausages
10, crust should be stuffed

I want each value in results2 to be replaced by its closest match in results.
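To make the intended direction concrete, here is a small check that runs the two functions from the question on a few of the sample values. The lists are trimmed copies of the sample data above, and the variable name `replaced` is only illustrative.

```python
import operator

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

def closest_match(string, matchings):
    scores = {}
    for m in matchings:
        scores[m] = 1 - levenshteinDistance(string, m)
    return max(scores.items(), key=operator.itemgetter(1))[0]

# Reference spellings (from results) and a few noisy values (from results2).
results = ["pizza", "ketchup", "salami", "anchovy", "pepperoni",
           "marinara", "olive", "sausage", "cheese"]
noisy = ["pizzas", "kechap", "chease", "sausages", "marinara"]

# Each noisy value is looked up against the reference list, not the other way round.
replaced = [closest_match(s, results) for s in noisy]
print(replaced)  # ['pizza', 'ketchup', 'cheese', 'sausage', 'marinara']
```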


Fight the dragon too long and you become the dragon; gaze too long into the abyss and the abyss gazes back…

1 Reply


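Use compiled Python. Some options:

Use Cython, which compiles Python-like code into a C extension. (CPython is the standard interpreter, so it gives no speedup on its own.)

Use PyPy, an alternative Python implementation with a JIT compiler. (It is not the same thing as Stackless Python.)

Use Numba for both of your functions, as follows: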

from numba import jit
@jit
def levenshteinDistance(s1, s2):
...
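Independently of which compiler is used, it may help to cut the number of distance computations first. The sketch below is one possible approach, not the only one. It assumes plain Python lists of strings, and `replace_with_closest` is a hypothetical helper name that does not appear in the question. It does three things: exact matches are checked against a `set`, each distinct value is matched only once, and a candidate is skipped when its length difference alone already rules it out (the length difference is a lower bound on the edit distance). Note also that `string not in results2.products` on a pandas Series tests the index labels, not the values, so the shortcut in the question probably never fires as intended.

```python
def levenshtein_distance(s1, s2):
    """Edit distance between two strings (same algorithm as in the question)."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        new_distances = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                new_distances.append(distances[i1])
            else:
                new_distances.append(
                    1 + min(distances[i1], distances[i1 + 1], new_distances[-1])
                )
        distances = new_distances
    return distances[-1]


def replace_with_closest(values, candidates):
    """Hypothetical helper: map each value to its closest candidate."""
    unique_candidates = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    candidate_set = set(unique_candidates)               # fast exact-match lookup
    cache = {}                                           # one search per distinct value

    def closest(value):
        if value in candidate_set:
            return value
        best, best_distance = None, None
        for candidate in unique_candidates:
            # The length difference is a lower bound on the edit distance,
            # so a candidate that cannot beat the current best is skipped.
            if (best_distance is not None
                    and abs(len(candidate) - len(value)) >= best_distance):
                continue
            distance = levenshtein_distance(value, candidate)
            if best_distance is None or distance < best_distance:
                best, best_distance = candidate, distance
        return best

    result = []
    for value in values:
        if value not in cache:
            cache[value] = closest(value)
        result.append(cache[value])
    return result
```

With the dataframes from the question, a call might look like `results2["products"] = replace_with_closest(results2["products"].tolist(), results["products"].tolist())`, assuming both columns hold strings. The worst case still grows with the number of distinct values times the number of candidates, so this reduces the work but does not replace a faster distance implementation.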
