I need to take a paragraph of text and extract a list of "tags" from it. Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities
I've used an implementation of the Porter stemmer algorithm (I'm writing in PHP, by the way):
http://tartarus.org/~martin/PorterStemmer/php.txt
This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".
I've tried "Snowball" (suggested within another Stack Overflow thread).
http://snowball.tartarus.org/demo.php
For my example (community / communities), Snowball stems to "communiti".
Question
Are there any other stemming algorithms that return real words rather than truncated stems? Has anyone else solved this problem?
My current thinking is that I could use a stemming algorithm to avoid duplicates and then pick the shortest word I encounter to be the actual word to display.
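To make that idea concrete, here is a rough sketch of what I have in mind. The function name `dedupeTags` and the `$toyStem` closure are made up for illustration; `$toyStem` is only a crude stand-in so the snippet runs on its own, and in practice I would pass in a real stemmer (Porter or Snowball) as the callable.

```php
<?php
// Group words by their stem and keep the shortest surface form seen
// for each stem. $stem is any callable that maps a word to its stem.
function dedupeTags(array $words, callable $stem): array
{
    $shortest = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        $key = $stem($word);
        // First word for this stem, or a shorter one than the current pick.
        if (!isset($shortest[$key]) || strlen($word) < strlen($shortest[$key])) {
            $shortest[$key] = $word;
        }
    }
    return array_values($shortest);
}

// Hypothetical toy stemmer, NOT a real algorithm: strips a trailing
// "ies", "y" or "s". Swap in a real Porter/Snowball stemmer here.
$toyStem = function (string $word): string {
    return preg_replace('/(ies|y|s)$/', '', $word);
};

$tags = dedupeTags(['Communities', 'community', 'tags', 'tag'], $toyStem);
print_r($tags); // 'community' and 'tag'
```

The stem is only used as a grouping key and is never displayed, so it doesn't matter that it isn't a real word. The obvious weakness is that the shortest word is only a candidate if it actually appears in the text, and it won't always be the form I'd want to show.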