
python - Why is pos_tag() so painfully slow and can this be avoided?

I want to be able to get the POS tags of sentences one at a time, like this:

def __remove_stop_words(self, tokenized_text, stop_words):

    sentences_pos = nltk.pos_tag(tokenized_text)  
    filtered_words = [word for (word, pos) in sentences_pos 
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

But the problem is that pos_tag() takes about a second for each sentence. Another option is to use pos_tag_sents() to do this batch-wise and speed things up (something like the sketch below), but my life would be easier if I could do this sentence by sentence.
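For context, the batch alternative would look roughly like this (a minimal sketch; the pre-tokenized sentences are made up purely for illustration):

    import nltk

    # Hypothetical pre-tokenized sentences, just for illustration.
    tokenized_sentences = [
        ['This', 'is', 'the', 'first', 'sentence', '.'],
        ['And', 'here', 'is', 'another', 'one', '.'],
    ]

    # pos_tag_sents tags the whole batch in one call instead of paying
    # the per-call overhead of pos_tag for every individual sentence.
    tagged_sentences = nltk.pos_tag_sents(tokenized_sentences)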

Is there a way to do this faster?



1 Reply


For nltk version 3.1, inside nltk/tag/__init__.py, pos_tag is defined like this:

from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)    

So each call to pos_tag first instantiates PerceptronTagger, which takes some time because it involves loading a pickle file. _pos_tag simply calls tagger.tag when tagset is None. So you can save some time by loading the file once and calling tagger.tag yourself instead of calling pos_tag:

from nltk.tag.perceptron import PerceptronTagger

# Load the tagger (and its pickled model) once, at module level.
tagger = PerceptronTagger()

def __remove_stop_words(self, tokenized_text, stop_words, tagger=tagger):
    # Reuse the already-loaded tagger instead of calling nltk.pos_tag,
    # which (in nltk 3.1) would reload the model on every call.
    sentences_pos = tagger.tag(tokenized_text)
    filtered_words = [word for (word, pos) in sentences_pos
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

pos_tag_sents uses the same trick as above: it instantiates PerceptronTagger once before calling _pos_tag many times. So you'll get a comparable performance gain using the above code as you would by refactoring your code to call pos_tag_sents.
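If you want to check this on your own data, a rough timing comparison along these lines should make the difference visible (just a sketch; the example sentence and repetition count are arbitrary):

    import timeit

    setup = (
        "import nltk\n"
        "from nltk.tag.perceptron import PerceptronTagger\n"
        "tagger = PerceptronTagger()\n"
        "sentence = ['This', 'is', 'a', 'short', 'example', 'sentence', '.']\n"
    )

    # nltk.pos_tag reloads the tagger on every call (in nltk 3.1),
    # so tagging one sentence at a time is slow.
    print(timeit.timeit('nltk.pos_tag(sentence)', setup=setup, number=10))

    # Reusing a single PerceptronTagger instance avoids the repeated pickle load.
    print(timeit.timeit('tagger.tag(sentence)', setup=setup, number=10))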


Also, if stop_words is a long list, you may save a bit of time by making stop_words a set:

stop_words = set(stop_words)

since checking membership in a set (e.g. pos not in stop_words) is an O(1) (constant time) operation, while checking membership in a list is an O(n) operation (i.e. it requires time that grows proportionally to the length of the list).
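A quick way to see that difference for yourself (a sketch; the word list and sizes are arbitrary):

    import timeit

    setup = "words = [str(i) for i in range(10000)]; as_set = set(words)"

    # Membership test against a list scans the elements one by one...
    print(timeit.timeit("'9999' in words", setup=setup, number=10000))

    # ...while a set does a hash lookup, which stays fast as the collection grows.
    print(timeit.timeit("'9999' in as_set", setup=setup, number=10000))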

