When creating the sklearn.feature_extraction.text.CountVectorizer instance, pass a word tokenizer that treats punctuation as separate tokens via the tokenizer parameter.
For example, nltk.tokenize.TreebankWordTokenizer treats most punctuation characters as separate tokens:
import sklearn.feature_extraction.text
from nltk.tokenize import TreebankWordTokenizer

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size),
                                                        tokenizer=TreebankWordTokenizer().tokenize)
vect.fit(string)  # the vectorizer must be fitted before feature names are available
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))
outputs:
4-grams: [u"'s pretty awesome .", u", it 's pretty", u'i really like python',
u"it 's pretty awesome", u'like python , it', u"python , it 's",
u'really like python ,']
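The output above is from Python 2 with an older scikit-learn. As a minimal sketch for newer versions (assuming scikit-learn >= 1.0), get_feature_names() has been replaced by get_feature_names_out(), and passing token_pattern=None avoids the warning that the default token pattern is ignored when a custom tokenizer is supplied; the n-grams themselves come out the same, just without the u prefixes:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer

ngram_size = 4
docs = ["I really like python, it's pretty awesome."]
vect = CountVectorizer(ngram_range=(ngram_size, ngram_size),
                       tokenizer=TreebankWordTokenizer().tokenize,
                       token_pattern=None)  # silence the unused token_pattern warning
vect.fit(docs)
print('{1}-grams: {0}'.format(list(vect.get_feature_names_out()), ngram_size))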