PunktSentenceTokenizer is the class behind NLTK's default sentence tokenizer, sent_tokenize(). It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/init.py#L79
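As a rough sketch of the class itself, it can also be instantiated directly. Note the assumption here: no training text or parameters are passed, so this tokenizer starts with empty (untrained) Punkt parameters and knows no abbreviations, unlike the pre-trained model that sent_tokenize() loads. The sample text is made up for illustration.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# No training text is given, so the tokenizer uses empty (untrained)
# Punkt parameters: no known abbreviations, collocations or sentence starters.
tokenizer = PunktSentenceTokenizer()

# Made-up sample text, chosen to have no abbreviations.
text = "This is the first sentence. This is the second one."
sentences = tokenizer.tokenize(text)

for sent in sentences:
    print(sent)
    print('--------')
```

An untrained tokenizer like this will likely mis-split text with abbreviations such as "Dr.", which is why the pre-trained or self-trained models are normally preferred.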
Given a paragraph with multiple sentences, e.g.:
>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '
You can use sent_tokenize():
>>> from nltk import sent_tokenize
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print(sent)
...     print('--------')
...
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world.
--------
sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the languages with pre-trained models in NLTK are:
alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README
Given a text in another language, do this:
>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
>>> for sent in sent_tokenize(german_text, language='german'):
...     print(sent)
...     print('---------')
...
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten.
---------
To train your own Punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the related question "training data format for nltk punkt".
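A minimal sketch of what training might look like, using PunktTrainer from the same module. The training text below is a tiny, made-up stand-in (an assumption for illustration only); real training needs a large amount of raw text in the target language or domain, and what the model learns from a toy corpus like this is not guaranteed.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical stand-in for a real training corpus (plain raw text).
training_text = (
    "Dr. Jones visited the lab on Monday. Dr. Jones met Dr. Patel there. "
    "They reviewed the results together. Dr. Patel wrote the report. "
) * 50

# Learn Punkt parameters (abbreviations, collocations, sentence starters)
# from the raw text in an unsupervised way.
trainer = PunktTrainer()
trainer.train(training_text, finalize=True)

# Build a sentence tokenizer from the learned parameters.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

test_text = "Dr. Jones left early. The lab was quiet."
sentences = tokenizer.tokenize(test_text)
for sent in sentences:
    print(sent)
    print('--------')
```

Passing the raw training text straight to the constructor, as in PunktSentenceTokenizer(training_text), is a shorter alternative that trains on construction.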