I am trying out this code
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
train_data = ["football is the sport","gravity is the movie", "education is imporatant"]
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
stop_words='english')
print "Applying first train data"
X_train = vectorizer.fit_transform(train_data)
print vectorizer.get_feature_names()
print "
Applying second train data"
train_data = ["cricket", "Transformers is a film","AIMS is a college"]
X_train = vectorizer.transform(train_data)
print vectorizer.get_feature_names()
print "
Applying fit transform onto second train data"
X_train = vectorizer.fit_transform(train_data)
print vectorizer.get_feature_names()
The output for this one is
Applying first train data
[u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']
Applying second train data
[u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']
Applying fit transform onto second train data
[u'aims', u'college', u'cricket', u'film', u'transformers']
I gave the first set of data using fit_transform to vectorizer so it gave me feature names like [u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']
after that i applied another train set to the same vectorizer but it gave me the same feature names as I didnt use fit or fit_transform. But I want to know how to update the features of a vectorizer without overwriting the previous oncs. If I use fit_transform again the previous features will get overwritten. So I want to update the feature list of the vectorizer. So i want something like [u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport',u'aims', u'college', u'cricket', u'film', u'transformers']
How can I get that.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…