For nltk version 3.1, inside `nltk/tag/__init__.py`, `pos_tag` is defined like this:
```python
from nltk.tag.perceptron import PerceptronTagger

def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)
```
So each call to `pos_tag` first instantiates `PerceptronTagger`, which takes some time because it involves loading a pickle file. `_pos_tag` simply calls `tagger.tag` when `tagset` is `None`. So you can save some time by loading the file once and calling `tagger.tag` yourself instead of calling `pos_tag`:
```python
from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger()

def __remove_stop_words(self, tokenized_text, stop_words, tagger=tagger):
    sentences_pos = tagger.tag(tokenized_text)
    filtered_words = [word for (word, pos) in sentences_pos
                      if pos not in stop_words and word not in stop_words]
    return filtered_words
```
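The payoff of this load-once pattern can be demonstrated without NLTK at all. Here is a minimal sketch, where `ExpensiveTagger` is a hypothetical stand-in (not real NLTK code) whose constructor simulates loading a pickle file:

```python
import time

class ExpensiveTagger:
    """Stand-in for PerceptronTagger: construction simulates loading a model file."""
    def __init__(self):
        time.sleep(0.05)  # pretend to unpickle a large model

    def tag(self, tokens):
        # Dummy tagging: label every token "NN"
        return [(tok, "NN") for tok in tokens]

# Slow: re-instantiate the tagger on every call (what nltk 3.1's pos_tag does)
def tag_per_call(tokens):
    return ExpensiveTagger().tag(tokens)

# Fast: instantiate once at module level and reuse it across calls
_tagger = ExpensiveTagger()

def tag_shared(tokens):
    return _tagger.tag(tokens)
```

Calling `tag_per_call` in a loop pays the fake "loading" cost every time, while `tag_shared` pays it only once at import.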
`pos_tag_sents` uses the same trick as above -- it instantiates `PerceptronTagger` once before calling `_pos_tag` many times. So you'll get a comparable gain in performance using the above code as you would by refactoring and calling `pos_tag_sents`.
Also, if `stop_words` is a long list, you may save a bit of time by making `stop_words` a set:

```python
stop_words = set(stop_words)
```

since checking membership in a set (e.g. `pos not in stop_words`) is an O(1) (constant time) operation on average, while checking membership in a list is an O(n) operation (i.e. it requires time which grows proportionally to the length of the list).
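A quick way to see the difference (the list size and the `word…` names below are arbitrary, chosen only for illustration):

```python
import timeit

# Build a large stop list; the last element is the worst case for a list scan.
words = ["word{}".format(i) for i in range(10000)]
stop_list = list(words)  # membership check scans the list: O(n)
stop_set = set(words)    # membership check hashes the item: O(1) on average

target = "word9999"
t_list = timeit.timeit(lambda: target in stop_list, number=1000)
t_set = timeit.timeit(lambda: target in stop_set, number=1000)
print("list: {:.4f}s  set: {:.4f}s".format(t_list, t_set))
```

On a list of this size the set lookup is faster by several orders of magnitude, and the gap grows with the length of the list.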