Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Calling NLTK's concordance - how to get text before/after a word that was used?

I would like to find out what text comes after each instance that concordance returns. For example, in the 'Searching Text' section of the NLTK book, they get the concordance of the word 'monstrous'. How would you get the words that come right after each instance of 'monstrous'?



1 Reply

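import nltk
import nltk.book as book  # loads the sample texts, including Moby Dick as text1

text1 = book.text1
# Case-insensitive index from each word to the positions where it occurs
c = nltk.ConcordanceIndex(text1.tokens, key=lambda s: s.lower())
# Token right after each occurrence (skipped if the match is the final token)
print([text1.tokens[offset + 1] for offset in c.offsets('monstrous')
       if offset + 1 < len(text1.tokens)])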

yields

['size', 'bulk', 'clubs', 'cannibal', 'and', 'fable', 'Pictures', 'pictures', 'stories', 'cabinet', 'size']

I found this by looking up how the concordance method is defined.

In IPython, text1.concordance? shows that the method is defined in /usr/lib/python2.7/dist-packages/nltk/text.py:

In [107]: text1.concordance?
Type:       instancemethod
Base Class: <type 'instancemethod'>
String Form:    <bound method Text.concordance of <Text: Moby Dick by Herman Melville 1851>>
Namespace:  Interactive
File:       /usr/lib/python2.7/dist-packages/nltk/text.py

In that file you'll find:

def concordance(self, word, width=79, lines=25):
    ... 
        self._concordance_index = ConcordanceIndex(self.tokens,
                                                   key=lambda s:s.lower())
    ...            
    self._concordance_index.print_concordance(word, width, lines)

This shows how to instantiate ConcordanceIndex objects.

And in the same file you'll also find:

class ConcordanceIndex(object):
    def __init__(self, tokens, key=lambda x:x):
        ...
    def print_concordance(self, word, width=75, lines=25):
        ...
        offsets = self.offsets(word)
        ...
        right = ' '.join(self._tokens[i+1:i+context])

Some experimentation in the IPython interpreter shows that self.offsets('monstrous') returns a list of integers (offsets), one for each position in the token list where the word monstrous occurs. You can access the actual word at a position with self._tokens[offset], which is the same as text1.tokens[offset].

So the next word after monstrous is given by text1.tokens[offset+1].
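
The same offset idea also covers the words before a match, and a wider window on either side. The following is only a sketch: build_offsets and neighbors are hypothetical helper names, not part of NLTK, and the plain-Python index is a stand-in for ConcordanceIndex so that the example runs without NLTK or its corpora. With NLTK installed, the offsets could instead come from c.offsets(word) as above.

```python
from collections import defaultdict


def build_offsets(tokens, key=lambda s: s.lower()):
    """Map each normalized token to the list of positions where it occurs.

    Hypothetical stand-in for the lookup that ConcordanceIndex provides.
    """
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[key(token)].append(position)
    return index


def neighbors(tokens, word, before=1, after=1, key=lambda s: s.lower()):
    """Return a (left, right) pair of token lists for every occurrence of word."""
    index = build_offsets(tokens, key)
    results = []
    for offset in index.get(key(word), []):
        # max(0, ...) keeps a match near the start from wrapping around
        # to the end of the list through a negative index.
        left = tokens[max(0, offset - before):offset]
        # Slicing past the end simply yields a shorter (or empty) list.
        right = tokens[offset + 1:offset + 1 + after]
        results.append((left, right))
    return results


# Made-up sample sentence, not taken from Moby Dick.
tokens = "The monstrous whale and a Monstrous fable about a most monstrous".split()
for left, right in neighbors(tokens, "monstrous"):
    print(left, right)
```

Slices are used instead of tokens[offset-1] and tokens[offset+1] because direct indexing raises IndexError when the match is the last token, and offset-1 silently wraps around to the end of the list when the match is the first token.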

