python - Query segmentation with spell check

Question


posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)


Assuming I have a fixed list of multi-word names like:

Water
Tocopherol (Vitamin E)
Vitamin D
PEG-60 Hydrogenated Castor Oil

I want the following input/output results:

1. Water, PEG-60 Hydrogenated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
2. PEG-60 Hydrnated Castor Oil -> PEG-60 Hydrogenated Castor Oil
3. wter PEG-60 Hydrnated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
4. Vitamin E -> Tocopherol (Vitamin E)

I need it to be performant, and it must be able to recognize when there are either too many close matches or no close matches at all. Example 1 is relatively easy because I can split on the comma. Most of the time the input list is comma-separated, so this works about 80% of the time, but even then there is a small issue. Take example 4: once separated, its ideal match is not returned by most spellcheck libraries (I've tried a number of them) because the edit distance to Vitamin D is much smaller. There are some websites that do this well, but I'm lost as to how to do it.

The second part of the problem is how to do word segmentation on top of this. If a given list doesn't contain commas, I need to be able to recognize where one name ends and the next begins. The simplest example: Water Vtamin D should become Water, Vitamin D. I can give a ton of examples, but I think this gives a good idea of the problem.

Here's a list of names that can be used.

question from:https://stackoverflow.com/questions/65844582/query-segmentation-with-spell-check


1 Reply

深蓝 · Answer 1 · 2021-10-06T19:30:45+0000

This answer builds upon @Rafael's answer.

process.extractOne in FuzzyWuzzy uses the scorer fuzz.WRatio by default. This is a combination of several scorers provided by FuzzyWuzzy that works well for the dataset SeatGeek is working with, so you might want to experiment with other scorers to see which one performs best for you. Note, however, that quite a few of your elements might be hard to distinguish using the edit distance. For example, Vitamin E <-> Vitamin D differ by only a single edit, even though they are completely different things. The same behavior also occurs with glycereth-7.
FuzzyWuzzy is relatively slow, so when you're working with a bigger dataset you might want to use RapidFuzz instead (I am the author), which provides similar algorithms but performs better.
process.extractOne preprocesses the input strings by default (it lowercases them and replaces non-alphanumeric characters with whitespace). Since you're probably searching the same elements multiple times, it would make sense to preprocess the possible choices once ahead of time and deactivate this behavior to save some time:

process.extractOne(str2Match, strOptions, processor=None)

Differences between RapidFuzz and FuzzyWuzzy

Since you reported differences in the results between RapidFuzz and FuzzyWuzzy here are some possible reasons:

I do not round results, so you will get a floating-point score like 42.22 instead of 42.
If you're not using the fast FuzzyWuzzy implementation (the one that uses python-Levenshtein), you might get different results, since FuzzyWuzzy then uses difflib, which implements a different metric. It produces very similar results most of the time, but not always.
If you are using the fast implementation, any partial ratio (partial_ratio, WRatio, ...) might return wrong results in FuzzyWuzzy, since partial_ratio is broken (see here).
Passing processor=None to extract/extractOne has a different meaning in RapidFuzz and FuzzyWuzzy. In RapidFuzz it deactivates preprocessing, while in FuzzyWuzzy the scorer's default is still used. As an example, for

extract(..., scorer=fuzz.WRatio, processor=None)

FuzzyWuzzy will still preprocess the strings inside WRatio, so there is no way to deactivate preprocessing. I personally think this is a bad design, so I changed it to let the user deactivate the processor, which is likely what you want to achieve when passing processor=None.
