I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:
import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
'''
(Chapter 16)
A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'
print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs.
-----
Hussey?
-----
"
'''
I don't expect perfection here, considering that Melville's syntax is a bit dated, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the product of an unsupervised training algorithm, however, I can't figure out how to tinker with it.
Anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.
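To make "a simple heuristic that I can hack" concrete, here is a minimal sketch of what such a splitter might look like. It uses only the standard `re` module and is not NLTK's algorithm. The function name `split_sentences`, the `ABBREVIATIONS` set, and the `BOUNDARY` pattern are all illustrative assumptions: it treats `.`, `?`, or `!` (plus any closing quotes) as a boundary only when the next word starts with an uppercase letter, and it refuses to split after a listed abbreviation.

```python
import re

# Illustrative abbreviation list (an assumption, not taken from NLTK); extend as needed.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st", "capt"}

# Candidate boundary: terminal punctuation, optional closing quotes/brackets,
# whitespace, then (lookahead) optional opening quotes and an uppercase letter.
BOUNDARY = re.compile(r"""([.?!]+["')\]]*)\s+(?=["'(\[]*[A-Z])""")


def split_sentences(text):
    """Split text into sentences using a small, hackable heuristic."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Find the word immediately before the punctuation.
        words = text[start:match.start(1)].split()
        last_word = words[-1].strip("\"'([").lower() if words else ""
        # A period right after a known abbreviation is not a sentence end.
        if match.group(1).startswith(".") and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print("\n-----\n".join(
    split_sentences('I asked Mrs. Hussey. She said "No." Then we ate.')
))
```

On the Moby Dick snippet above this sketch returns the whole passage as a single sentence, because no punctuation mark in it is followed by a capitalized word except the periods after "Mrs." The trade-off is that a sentence that genuinely ends in a listed abbreviation, or that is followed by a lowercase word, will not be split.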