python - How is nltk.TweetTokenizer different from nltk.word_tokenize?

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:28:45+0000

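Both tokenizers split a sentence into tokens, but they follow different rules. word_tokenize is a general-purpose, Treebank-style tokenizer that treats most punctuation as separate tokens. TweetTokenizer is tuned for social-media text: it keeps hashtags, @mentions, and emoticons intact, while word_tokenize breaks them apart.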

The example below should make the difference clear:

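from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize
# word_tokenize relies on NLTK's Punkt sentence-tokenizer data (installed via nltk.download);
# TweetTokenizer is purely rule-based and needs no extra data.
tt = TweetTokenizer()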
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @remy: This is waaaaayyyy too much for you!!!!!!"
print(tt.tokenize(tweet))
print(word_tokenize(tweet))

# output
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
# ['This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--', '@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']

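In the output, word_tokenize splits #dummysmiley into '#' and 'dummysmiley', while TweetTokenizer keeps it as the single token '#dummysmiley'. The same holds for the handle @remy and for emoticons such as :-) and <3. TweetTokenizer also collapsed the six exclamation marks into three. TweetTokenizer is built mainly for analyzing tweets. You can learn more about the tokenizers in the NLTK documentation.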
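TweetTokenizer also accepts a few constructor options that are handy for tweets. The short sketch below reuses part of the tweet above (the variable names are only for illustration) and assumes NLTK is installed. It shows strip_handles, which drops @mentions, and reduce_len, which caps runs of a repeated character at three.

```python
from nltk.tokenize import TweetTokenizer

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"

# strip_handles=True removes @mentions from the text.
# reduce_len=True shortens runs of a repeated character to three.
tt = TweetTokenizer(strip_handles=True, reduce_len=True)
tokens = tt.tokenize(tweet)
print(tokens)

# output
# [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
```

Compared with the default settings, '@remy' is gone and 'waaaaayyyy' has become 'waaayyy'. A third option, preserve_case=False, lowercases the tokens, which may help when counting word frequencies.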
