Is there any way to preserve the punctuation marks !, ?, " and ' from my text documents using CountVectorizer or TfidfVectorizer parameters in scikit-learn?
You should customize the token_pattern parameter when you instantiate the vectorizer. For example:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
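To see what a pattern like this extracts before fitting a vectorizer, you can try it with Python's re module. This is only a rough sketch: it assumes the vectorizer lowercases the text and then applies token_pattern with findall, which is my understanding of the default tokenizer. TOKEN_PATTERN and preview_tokens are names made up for this illustration, not part of scikit-learn.

```python
import re

# Word tokens of two or more characters, plus the four punctuation marks
# folded into one character class. The triple-quoted raw string avoids
# having to escape the quote characters.
TOKEN_PATTERN = r"""(?u)\b\w\w+\b|[!?"']"""


def preview_tokens(text):
    """Return the tokens the pattern extracts from the lowercased text."""
    return re.findall(TOKEN_PATTERN, text.lower())


print(preview_tokens('Hello world! Is it "fine"?'))
# ['hello', 'world', '!', 'is', 'it', '"', 'fine', '"', '?']
```

If the tokens look right, the same string can be passed as CountVectorizer(token_pattern=TOKEN_PATTERN), and TfidfVectorizer accepts the same parameter. Note that single-character words are still dropped, so "Don't" comes out as don and ' with the trailing t discarded.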