Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Query segmentation with spell check

Assuming I have a fixed list of multi-word names like:

  - Water
  - Tocopherol (Vitamin E)
  - Vitamin D
  - PEG-60 Hydrogenated Castor Oil

I want the following input/output results:

  1. Water, PEG-60 Hydrogenated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
  2. PEG-60 Hydrnated Castor Oil -> PEG-60 Hydrogenated Castor Oil
  3. wter PEG-60 Hydrnated Castor Oil -> Water, PEG-60 Hydrogenated Castor Oil
  4. Vitamin E -> Tocopherol (Vitamin E)

I need it to be performant and able to recognize when there are either too many close matches or no close matches. Case 1 is relatively easy because I can split on the comma. Most of the time the input list is comma-separated, so this works about 80% of the time, but even this has a small issue. Take case 4, for example. Once separated, the ideal match for 4 is not returned by most spell-check libraries (I've tried a number of them) because the edit distance to Vitamin D is much smaller. Some websites do this well, but I'm lost as to how to do it.
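
To make the issue in case 4 concrete, here is a minimal sketch of a plain Levenshtein distance in pure Python. The function name levenshtein is only for illustration and is not taken from any particular library; real spell-check libraries may weight edits differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions all cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


query = "vitamin e"
candidates = ["Vitamin D", "Tocopherol (Vitamin E)"]
# Compare case-insensitively, as most matchers do.
distances = {name: levenshtein(query, name.lower()) for name in candidates}
print(distances)  # {'Vitamin D': 1, 'Tocopherol (Vitamin E)': 13}
```

Under this metric the wrong name is one edit away and the desired name is thirteen edits away, which is why a pure edit-distance ranking picks Vitamin D.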

The second part of this problem is how to do word segmentation on top of that. If a given list doesn't contain commas, I need to be able to recognize the individual names anyway. The simplest example is Water Vtamin D, which should become Water, Vitamin D. I can give a ton of examples, but I think this gives a good idea of the problem.

Here's a list of names that can be used.

question from:https://stackoverflow.com/questions/65844582/query-segmentation-with-spell-check


1 Reply


This answer builds upon @Rafael's answer.

  1. process.extractOne in FuzzyWuzzy uses the scorer fuzz.WRatio by default. This is a combination of multiple scorers provided by FuzzyWuzzy that works well for the dataset SeatGeek is working with, so you might want to experiment with other scorers to see which one performs best for you. Note, however, that quite a few of your elements might be hard to distinguish using the edit distance. For example, Vitamin E <-> Vitamin D needs only a single edit, even though they are completely different things. The same behavior also occurs with glycereth-7

  2. FuzzyWuzzy is relatively slow, so when you're working with a bigger dataset you might want to use RapidFuzz instead (I am the author), which provides similar algorithms but performs better.

  3. process.extractOne preprocesses the input strings by default (it lowercases them and replaces non-alphanumeric characters with whitespace). Since you're probably searching for elements multiple times, it makes sense to preprocess the possible choices once ahead of time and deactivate this behavior to save some time:

process.extractOne(str2Match, strOptions, processor=None)
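
The "preprocess once" idea from point 3 can be sketched with the standard library only. This is not the RapidFuzz or FuzzyWuzzy implementation: normalize is a hypothetical helper that only roughly mirrors the default preprocessing described above, difflib.SequenceMatcher stands in for the scorer, and the 0.8 cutoff is an arbitrary value chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Hypothetical helper: lowercase, turn non-alphanumeric runs into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


NAMES = ["Water", "Tocopherol (Vitamin E)", "Vitamin D",
         "PEG-60 Hydrogenated Castor Oil"]

# Preprocess the fixed choices a single time, keeping the original spelling.
PROCESSED = [(normalize(name), name) for name in NAMES]


def best_match(query: str, cutoff: float = 0.8):
    """Return (original_name, score) for the best choice, or None below cutoff."""
    q = normalize(query)  # only the query is processed on each lookup
    score, name = max(
        (SequenceMatcher(None, q, processed).ratio(), original)
        for processed, original in PROCESSED
    )
    return (name, score) if score >= cutoff else None


print(best_match("PEG-60 Hydrnated Castor Oil"))
print(best_match("wter"))
print(best_match("xyz"))  # nothing is close enough, so this prints None
```

With a library matcher, the lookup inside best_match would be replaced by the process.extractOne call shown above, passing the already processed choices. The cutoff gives a simple way to report "no close match" instead of always returning something.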

Differences between RapidFuzz and FuzzyWuzzy

Since you reported differences in the results between RapidFuzz and FuzzyWuzzy, here are some possible reasons:

  1. I do not round results in RapidFuzz, so you will get a floating-point value like 42.22 instead of 42 as a result
  2. If you are not using the fast FuzzyWuzzy implementation, which uses python-Levenshtein, you might get different results, since the fallback uses difflib, which is based on a different metric. It produces very similar results most of the time, but not always
  3. If you are using the fast implementation, any scorer that relies on a partial ratio, such as partial_ratio or WRatio, might return wrong results in FuzzyWuzzy, since partial_ratio is broken (see here)
  4. Passing processor=None to extract/extractOne has a different meaning in RapidFuzz and FuzzyWuzzy. In RapidFuzz it deactivates preprocessing, while in FuzzyWuzzy the default processor of the scorer is still used. For example, with
extract(..., scorer=fuzz.WRatio, processor=None)

FuzzyWuzzy will still preprocess the strings inside WRatio, so there is no way to deactivate preprocessing. I personally think this is a bad design, so I changed it to give the user the option to deactivate the processor, which is likely what you want to achieve when passing processor=None.
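
Point 2 above (difflib being a different metric) can be illustrated with a small pure-Python comparison. The indel_ratio function below is a sketch of a similarity derived from an insert/delete-only edit distance; treating it as equivalent to what an edit-distance-based ratio computes is an assumption of this example, and both helper names are made up for illustration.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100] based on an insert/delete-only edit distance."""
    total = len(a) + len(b)
    return 100.0 if total == 0 else 100.0 * 2 * lcs_length(a, b) / total


def difflib_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100] as computed by difflib's matching-block heuristic."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


a, b = "tide", "diet"
print(indel_ratio(a, b))    # 50.0
print(difflib_ratio(a, b))  # 25.0
```

The two numbers agree for many inputs, but difflib greedily picks matching blocks and can miss the best alignment, so pairs like this one score differently.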

