Let's try to analyze the problem for a second:
If you have N rows, then your similarity function has to consider N*N "pairs". In the general case there is no escape from evaluating all of them (this sounds reasonable, though I can't prove it), so the time complexity is at least O(N^2).
What you can try, however, is to play with the constant factors of that time complexity.
The possible options I found are:
1. Parallelization:
Since you have a large DataFrame, parallelizing the processing is the obvious first choice. It will not change the complexity, but it gives you an (almost) linear speedup: with 16 workers you get (almost) a 16x improvement.
For example, we can partition the rows of the df into disjoint parts, process each part individually, and then combine the results.
A very basic parallel code might look like this:
import numpy as np
import pandas as pd
from multiprocessing import cpu_count, Pool

def work(part):
    """
    Args:
        part (DataFrame): a part (collection of rows) of the whole DataFrame.
    Returns:
        DataFrame: the same part, with the desired property calculated and added as a new column.
    """
    # Note that we are using the original df (pandas_df) as a global variable,
    # but changes made to it inside this function will not be visible outside
    # (a side effect of using multiprocessing: each worker operates on its own copy).
    for index, _id, word in part.itertuples():  # iterate over the rows of this part
        value = sum(
            pandas_df[pandas_df['word'] != word].apply(  # evaluate the similarity against the whole original df
                lambda x: foo(x['word'], word),
                axis=1
            ) < threshold
        )
        part.loc[index, 'bar'] = value
    return part

# New code starts here ...
cores = cpu_count()                            # number of CPU cores on your system
data_split = np.array_split(pandas_df, cores)  # split the DataFrame into `cores` parts
pool = Pool(cores)                             # create a new process pool
new_parts = pool.map(work, data_split)         # apply `work` to each part; gives a list of processed parts
pool.close()                                   # close the pool
pool.join()
new_df = pd.concat(new_parts)                  # concatenate the processed parts back into one DataFrame
Note: I've tried to keep the code as close to the OP's code as possible. This is just a basic demonstration; many better alternatives exist.
2. "Low level" optimizations:
Another option is to optimize the computation of the similarity function itself and the way you iterate/map over the rows. I don't think this will gain you much speedup compared to the previous option or the next one.
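For example, one cheap win is to avoid repeated DataFrame.apply calls and work on a plain Python list of the words instead. A rough sketch (reusing foo and threshold from above; count_similar is just a hypothetical helper name):

words = pandas_df['word'].tolist()  # pull the column out once; iterating a list is much cheaper than DataFrame.apply

def count_similar(word):
    # Count how many *other* words are within `threshold` of `word` according to `foo`.
    return sum(1 for other in words if other != word and foo(other, word) < threshold)

pandas_df['bar'] = [count_similar(w) for w in words]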
3. Function-dependent pruning:
The last thing you can try is similarity-function-dependent improvements. This doesn't work in the general case, but it works very well when you can analyze the similarity function. For example:
Assuming you are using the Levenshtein distance (LD), you can observe that the distance between any two strings is at least the difference of their lengths, i.e. LD(s1,s2) >= abs(len(s1)-len(s2)).
You can use this observation to prune the pairs that need to be evaluated: for each string of length l1, compare it only with strings of length l2 where abs(l1-l2) <= limit (limit being the maximum accepted dissimilarity, 2 in your provided example).
Another observation is that LD(s1,s2) = LD(s2,s1), which cuts the number of pairs in half.
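A rough sketch of how both observations could be combined (assuming the python-Levenshtein package for LD and a limit of 2 as in your example; adjust the <= comparison if your threshold is exclusive):

from collections import defaultdict
import Levenshtein  # pip install python-Levenshtein (any LD implementation will do)

limit = 2  # maximum accepted dissimilarity, as in the question
words = pandas_df['word'].tolist()

# Bucket row indices by string length, so each string is only compared
# with strings whose length differs by at most `limit`.
by_length = defaultdict(list)
for i, w in enumerate(words):
    by_length[len(w)].append(i)

counts = [0] * len(words)
for i, w in enumerate(words):
    for length in range(len(w) - limit, len(w) + limit + 1):  # candidate lengths only
        for j in by_length.get(length, []):
            if j <= i:
                continue  # LD is symmetric, so evaluate each pair only once
            # skip identical words (as in the code above) and count both directions
            if words[j] != w and Levenshtein.distance(w, words[j]) <= limit:
                counts[i] += 1
                counts[j] += 1

pandas_df['bar'] = counts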
This solution may actually get you down to O(n) time complexity (it depends highly on the data: roughly, it is linear whenever each row only has a constant number of candidates with a "close" length).
Why? you may ask.
Because if we have 10^9 rows but, on average, each row has only 10^3 rows with a "close" length, then we need to evaluate the function for about 10^9 * 10^3 / 2 pairs instead of 10^9 * 10^9 pairs. But (again) that depends on the data: this approach is useless if, in this example, all of your strings have length 3.