python - How to preserve punctuation marks in Scikit-Learn text CountVectorizer or TfidfVectorizer?

Question

1 Reply

深蓝 · Answer 1 · 2021-10-23T20:03:30+0000

You should customize the token_pattern parameter when you instantiate the vectorizer. For example:

vent = CountVectorizer(token_pattern=r"(?u)ww+|!|?|"|'")