I have the following problem.
I have a dataframe master that contains sentences, such as:
master
Out[8]:
                  original
0  this is a nice sentence
1      this is another one
2    stackoverflow is nice
For every row in master, I look up the best match in another dataframe, slave, using fuzzywuzzy. I use fuzzywuzzy because the matched sentences in the two dataframes could differ a bit (extra characters, etc.).
For instance, slave could be
slave
Out[10]:
   my_value                     name
0         2              hello world
1         1          congratulations
2         2  this is a nice sentence
3         3      this is another one
4         1    stackoverflow is nice
Here is a fully-functional, wonderful, compact working example :)
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
import difflib

master = pd.DataFrame({'original': ['this is a nice sentence',
                                    'this is another one',
                                    'stackoverflow is nice']})

slave = pd.DataFrame({'name': ['hello world',
                               'congratulations',
                               'this is a nice sentence ',
                               'this is another one',
                               'stackoverflow is nice'],
                      'my_value': [2, 1, 2, 3, 1]})

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    # use fuzzywuzzy to see how close original and name are
    slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x, orig_string))
    # return my_value corresponding to the highest score
    # (.loc replaces .ix, which was removed in pandas 1.0)
    return slave_df.loc[slave_df.score.idxmax(), 'my_value']

master['my_value'] = master.original.apply(lambda x: helper(x, slave))
The million-dollar question is: can I parallelize my apply code above?
After all, every row in master is compared to all the rows in slave (slave is a small dataset and I can hold many copies of the data in RAM).
I don't see why I could not run multiple comparisons (i.e. process multiple rows at the same time).
Problem: I don't know how to do that, or if that's even possible.
Any help greatly appreciated!
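To make the idea concrete, here is a rough sketch of the kind of thing I have in mind, not a tested solution. Since each row of master is scored independently, the rows could in principle be mapped across worker processes with the standard library's concurrent.futures. The names similarity, best_value and parallel_lookup are made up for illustration, and difflib.SequenceMatcher is only a stand-in scorer (an assumption, so the sketch runs without fuzzywuzzy); fuzz.token_set_ratio could be swapped in.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def similarity(a, b):
    # Stand-in scorer (assumption): replace with fuzz.token_set_ratio
    # to reproduce the scoring used in the example above.
    return SequenceMatcher(None, a, b).ratio()


def best_value(orig_string, names, values):
    # Score orig_string against every candidate name and return the
    # value paired with the highest-scoring name (first one wins on ties).
    scores = [similarity(name, orig_string) for name in names]
    return values[scores.index(max(scores))]


def parallel_lookup(originals, names, values, workers=None, chunksize=1000):
    # Rows are independent, so they can be distributed across processes.
    # names and values are small and get copied to the workers.
    lookup = partial(best_value, names=names, values=values)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with originals.
        return list(pool.map(lookup, originals, chunksize=chunksize))
```

With the dataframes above, this would presumably be called from inside an `if __name__ == "__main__":` guard as master['my_value'] = parallel_lookup(master.original.tolist(), slave.name.tolist(), slave.my_value.tolist()). Passing plain lists instead of the dataframe keeps what is sent to the workers small, and whether this is actually faster depends on how large master is relative to the process start-up cost.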