Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
497 views
in Technique[技术] by (71.8m points)

python - How do I store a TfidfVectorizer for future use in scikit-learn?

I have a TfidfVectorizer that vectorizes collection of articles followed by feature selection.

vectroizer = TfidfVectorizer()
X_train = vectroizer.fit_transform(corpus)
selector = SelectKBest(chi2, k = 5000 )
X_train_sel = selector.fit_transform(X_train, y_train)

Now, I want to store this and use it in other programs. I don't want to re-run the TfidfVectorizer() and the feature selector on the training dataset. How do I do that? I know how to make a model persistent using joblib but I wonder if this is the same as making a model persistent.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can simply use the built in pickle library:

import pickle
pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))
pickle.dump(selector, open("selector.pickle", "wb"))

and load it with:

vectorizer = pickle.load(open("vectorizer.pickle", "rb"))
selector = pickle.load(open("selector.pickle", "rb"))

Pickle will serialize the objects to disk and load them in memory again when you need it

pickle lib docs


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...