python - get indices of original text from nltk word_tokenize

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am tokenizing a text using nltk.word_tokenize, and I would also like to get the index, in the original raw text, of the first character of every token. For example:

import nltk
x = 'hello world'
tokens = nltk.word_tokenize(x)
# tokens == ['hello', 'world']

How can I also get the array [0, 6] corresponding to the raw indices of the tokens?

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:18:08+0000

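You can recover the offsets by searching for each token in the original text, starting from the end of the previous match: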

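import nltk


def spans(txt):
    """Yield (token, start, end) for each token of txt."""
    tokens = nltk.word_tokenize(txt)
    offset = 0
    for token in tokens:
        # Search from the end of the previous token so that repeated
        # words map to their own occurrence, not the first one.
        offset = txt.find(token, offset)
        yield token, offset, offset + len(token)
        offset += len(token)


s = "And now for something completely different and."
for token in spans(s):
    print(token)
    # Each span must slice back to exactly the token text.
    assert token[0] == s[token[1]:token[2]]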

This prints:

('And', 0, 3)
('now', 4, 7)
('for', 8, 11)
('something', 12, 21)
('completely', 22, 32)
('different', 33, 42)
('and', 43, 46)
('.', 46, 47)
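One caveat with the search-based approach: word_tokenize rewrites double quotes as `` and '', so those tokens may not appear literally in the raw text and str.find returns -1. Below is a minimal sketch of a more defensive aligner. The name align_tokens and the choice to raise ValueError are my own assumptions, not part of NLTK. It takes an already-computed token list, so it runs without NLTK installed.

```python
def align_tokens(text, tokens):
    """Return a list of (token, start, end) by scanning text left to right.

    Hypothetical helper, not an NLTK function. `tokens` is assumed to come
    from a tokenizer such as nltk.word_tokenize(text).
    """
    spans = []
    offset = 0
    for token in tokens:
        # The token itself is always a candidate. Treebank-style quote
        # tokens may stand for a plain double quote in the raw text.
        candidates = [token]
        if token in ("``", "''"):
            candidates.append('"')

        # Keep only candidates that actually occur at or after offset.
        hits = []
        for cand in candidates:
            pos = text.find(cand, offset)
            if pos != -1:
                hits.append((pos, cand))
        if not hits:
            raise ValueError("cannot align token %r after offset %d" % (token, offset))

        # Take the earliest match and continue searching after it.
        start, matched = min(hits)
        end = start + len(matched)
        spans.append((token, start, end))
        offset = end
    return spans


# The example from the question: only the start indices are needed.
starts = [start for _, start, _ in align_tokens("hello world", ["hello", "world"])]
print(starts)
```

For quote tokens the returned span covers the single " character in the raw text, so text[start:end] will differ from the token string in that case.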
