Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
811 views
in Technique[技术] by (71.8m points)

python - nltk words corpus does not contain "okay"?

The NLTK word corpus does not have the phrase "okay", "ok", "Okay"?

> from nltk.corpus import words
> words.words().__contains__("check")
> True

> words.words().__contains__("okay")
> False

> len(words.words())
> 236736

Any ideas why?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

TL;DR

from nltk.corpus import words
from nltk.corpus import wordnet 

manywords = words.words() + wordnet.words() 

In Long

From the docs, the nltk.corpus.words are words a list of words from "http://en.wikipedia.org/wiki/Words_(Unix)

Which in Unix, you can do:

ls /usr/share/dict/

And reading the README:

$ cd /usr/share/dict/
/usr/share/dict$ cat README
#   @(#)README  8.1 (Berkeley) 6/5/93
# $FreeBSD$

WEB ---- (introduction provided by jaw@riacs) -------------------------

Welcome to web2 (Webster's Second International) all 234,936 words worth.
The 1934 copyright has lapsed, according to the supplier.  The
supplemental 'web2a' list contains hyphenated terms as well as assorted
noun and adverbial phrases.  The wordlist makes a dandy 'grep' victim.

     -- James A. Woods    {ihnp4,hplabs}!ames!jaw    (or jaw@riacs)

Country names are stored in the file /usr/share/misc/iso3166.


FreeBSD Maintenance Notes ---------------------------------------------

Note that FreeBSD is not maintaining a historical document, we're
maintaining a list of current [American] English spellings.

A few words have been removed because their spellings have depreciated.
This list of words includes:
    corelation (and its derivatives)    "correlation" is the preferred spelling
    freen               typographical error in original file
    freend              archaic spelling no longer in use;
                    masks common typo in modern text

--

A list of technical terms has been added in the file 'freebsd'.  This
word list contains FreeBSD/Unix lexicon that is used by the system
documentation.  It makes a great ispell(1) personal dictionary to
supplement the standard English language dictionary.

Since it's a fixed list of 234,936, there are bound to be words that don't exist in that list.

If you need to extend your word list, you can add to the list using the words from WordNet using nltk.corpus.wordnet.words().

Most probably, all you need is a large enough corpus of text, e.g. Wikipedia dump and then tokenize it and extract all unique words.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...