This answer builds upon @Rafael's answer.
process.extractOne in FuzzyWuzzy uses the scorer fuzz.WRatio by default. This is a combination of multiple scorers provided by FuzzyWuzzy that works well for the dataset SeatGeek is working with, so you might want to experiment with other scorers to see which one performs best for you. Note, however, that quite a few of your elements might be hard to distinguish using the edit distance. E.g. Vitamin E <-> Vitamin D needs only a single edit, even though they are completely different things. The same behavior also occurs with glycereth-7
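To make the single-edit point concrete, here is a minimal sketch of the edit distance in plain Python. The helper name levenshtein is my own for illustration and is not part of FuzzyWuzzy or RapidFuzz:

```python
def levenshtein(a: str, b: str) -> int:
    """Number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Vitamin E", "Vitamin D"))
```

A distance of 1 on a nine-character string leaves very little room for any scorer built on edit operations to separate the two.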
FuzzyWuzzy is relatively slow, so when you're working with a bigger dataset you might want to use RapidFuzz instead (I am the author), which provides similar algorithms but performs better.
process.extractOne preprocesses the input strings by default (it lowercases them and replaces non-alphanumeric characters with whitespace). Since you're probably searching for elements multiple times, it would make sense to preprocess the possible choices once ahead of time and deactivate this behavior to save some time:
process.extractOne(str2Match, strOptions, processor=None)
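The idea of preprocessing once can be sketched with the standard library alone. Here preprocess only approximates the default processor described above, and similarity (based on difflib) is a stand-in scorer rather than WRatio. Both names, and the example choices, are assumptions made for illustration:

```python
import re
from difflib import SequenceMatcher


def preprocess(text: str) -> str:
    # Approximation: replace each non-alphanumeric character with a space,
    # then lowercase and trim surrounding whitespace.
    return re.sub(r"[\W_]", " ", text).lower().strip()


def similarity(a: str, b: str) -> float:
    # Stand-in scorer on a 0-100 scale (not WRatio).
    return 100.0 * SequenceMatcher(None, a, b).ratio()


choices = ["Vitamin E", "Vitamin D", "Glycerin"]
# Done once, ahead of time, instead of on every lookup.
processed_choices = [preprocess(choice) for choice in choices]


def best_match(query: str):
    # Only the query is preprocessed per lookup.
    processed_query = preprocess(query)
    scores = [similarity(processed_query, c) for c in processed_choices]
    best = max(range(len(choices)), key=scores.__getitem__)
    return choices[best], scores[best]


print(best_match("vitamin-e"))
```

With RapidFuzz the equivalent would be to run the choices through its default processor once and then pass the preprocessed list together with processor=None, as in the call above.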
Differences between RapidFuzz and FuzzyWuzzy
Since you reported differences in the results between RapidFuzz and FuzzyWuzzy, here are some possible reasons:
- I do not round results, so you will get a floating point number like 42.22 instead of 42.
- In case you're not using the fast FuzzyWuzzy implementation that uses python-Levenshtein, you might get different results, since FuzzyWuzzy then falls back to difflib, which uses a different metric. It produces very similar results most of the time, but not always.
- In case you're using the fast implementation, any scorer based on the partial ratio, like partial_ratio, WRatio ..., might return wrong results in FuzzyWuzzy, since partial_ratio is broken (see here).
- Passing processor=None to extract/extractOne has a different meaning in RapidFuzz and FuzzyWuzzy. In RapidFuzz it deactivates preprocessing, while in FuzzyWuzzy the scorer still applies its own default preprocessing. As an example, for
extract(..., scorer=fuzz.WRatio, processor=None)
FuzzyWuzzy will still preprocess the strings inside WRatio, so there is no way to deactivate preprocessing. I personally think this is a bad design, so I changed it to give the user the possibility to deactivate the processor, which is likely what you want to achieve when passing processor=None.
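The first two points can be illustrated with a small standard-library sketch. It assumes that the Levenshtein-backed ratio amounts to 2 * LCS / (len(a) + len(b)); the helpers lcs_length, indel_ratio and difflib_ratio are my own names for illustration, and the input strings are contrived so that the two metrics disagree:

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    # Assumed formula for the Levenshtein-backed ratio (insertions/deletions only).
    total = len(a) + len(b)
    return 100.0 * 2 * lcs_length(a, b) / total if total else 100.0


def difflib_ratio(a: str, b: str) -> float:
    # difflib greedily matches the longest contiguous block first.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


a, b = "ccXddbbb", "bbbccYdd"
print(difflib_ratio(a, b))  # 37.5: only the block "bbb" is matched
print(indel_ratio(a, b))    # 50.0: the subsequence "ccdd" is matched

# Rounding: an unrounded score versus the integer FuzzyWuzzy would report.
score = difflib_ratio("Vitamin E", "Vitamin D")
print(score, round(score))
```

On most real inputs the two ratios agree, as they do for Vitamin E and Vitamin D, which is why the difference only shows up occasionally.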