The NLTK book has a couple of word-count examples, but what they actually count is tokens, not words. For instance, the "Counting Vocabulary" section of Chapter 1 says that the following gives a word count:
text = nltk.Text(tokens)
len(text)
However, it doesn't - it gives a word and punctuation count.
How can you get a real word count (ignoring punctuation)?
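One possible approach, sketched here under the assumption that a "word" is any token containing at least one letter or digit, is to filter the tokens before counting. The token list is hard-coded to stand in for tokenizer output, and `is_word` and `count_words` are hypothetical helper names, not part of NLTK:

```python
def is_word(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


def count_words(tokens):
    """Count only the tokens that look like words (assumption: see is_word)."""
    return sum(1 for t in tokens if is_word(t))


# Hard-coded tokens, standing in for the output of a tokenizer.
tokens = ["Hello", ",", "world", "!", "It", "'s", "a", "test", "."]

token_count = len(tokens)         # counts punctuation too
word_count = count_words(tokens)  # punctuation-only tokens are skipped

print(token_count, word_count)
```

Note that under this definition a clitic such as "'s" still counts as a word, because it contains a letter. Whether that is desirable depends on what you mean by "word".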
Similarly, how can you get the average number of characters in a word?
The obvious answer is:
word_average_length = len(string_of_text) / len(text)
However, this would be off because:
- len(string_of_text) is a character count, including spaces
- len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.
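Both problems could be avoided by averaging over the word tokens themselves instead of dividing the raw character count by the token count. This is only a sketch, again assuming that a "word" is any token with at least one letter or digit; `average_word_length` is a hypothetical helper name and the token list is hard-coded for illustration:

```python
def average_word_length(tokens):
    """Mean length in characters of the word-like tokens.

    Assumption: a token is a word if it contains at least one
    alphanumeric character; punctuation-only tokens are ignored.
    """
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    if not words:
        return 0.0  # avoid dividing by zero when there are no words
    return sum(len(w) for w in words) / len(words)


# Hard-coded tokens, standing in for the output of a tokenizer.
tokens = ["The", "cat", "sat", ",", "then", "slept", "."]

avg = average_word_length(tokens)
print(avg)  # 18 characters over 5 words -> 3.6
```

Spaces never enter the calculation because only token lengths are summed, and punctuation tokens are excluded from both the numerator and the denominator.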
Am I missing something here? This must be a very common NLP task...