Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - ColumnTransformer with TfidfVectorizer produces "empty vocabulary" error

I am running a very simple experiment with ColumnTransformer, intending to transform a list of columns (["a"] in this example):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})
tfidf = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf", tfidf, ["a"])], remainder="passthrough")
clmn.fit_transform(dataset)

Which gives me:

ValueError: empty vocabulary; perhaps the documents only contain stop words

TfidfVectorizer can run fit_transform() on the column by itself without any problem:

tfidf.fit_transform(dataset.a)
<2x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

What causes this error, and how can I fix it?


1 Reply

That's because you are providing ["a"] instead of "a" in ColumnTransformer. According to the documentation:

A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.

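TfidfVectorizer expects a single iterable of strings as input (a 1-D array of documents). Because you pass a list of column names to ColumnTransformer, even a list with only one column, a 2-D structure (here a one-column DataFrame) is handed to TfidfVectorizer. Iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer sees the single document "a". The default token pattern ignores one-character tokens, which leaves the vocabulary empty and raises the error.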
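To make that concrete, here is a small sketch using only pandas and the standard library. It assumes the vectorizer's default token_pattern, r"(?u)\b\w\w+\b", and mimics the tokenization step by hand rather than calling scikit-learn; the variable names are just for illustration.

```python
import re

import pandas as pd

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# Iterating over a Series yields its values: one string per row.
docs_from_series = list(dataset["a"])

# Iterating over a DataFrame yields its column labels, not its rows.
docs_from_frame = list(dataset[["a"]])

# Default token_pattern of TfidfVectorizer: tokens of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokens_from_frame = [
    tok for doc in docs_from_frame for tok in token_pattern.findall(doc)
]

print(docs_from_series)   # ['word gone wild', 'gone with wind']
print(docs_from_frame)    # ['a']
print(tokens_from_frame)  # []
```

The only "document" the 2-D selection provides is the label "a", and it produces no tokens, hence the empty vocabulary.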

Change that to:

clmn = ColumnTransformer([("tfidf", tfidf, "a")],
                         remainder="passthrough")
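
As a quick check, here is the question's setup run end to end with the scalar selector. This is a sketch rather than the asker's exact code: min_df=0 is dropped on the assumption that the default already keeps every term (and some scikit-learn versions may reject an integer 0), and the names result and vocabulary are mine.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# "a" (a scalar string) makes ColumnTransformer pass a 1-D Series to the vectorizer.
tfidf = TfidfVectorizer()
clmn = ColumnTransformer([("tfidf", tfidf, "a")], remainder="passthrough")
result = clmn.fit_transform(dataset)

# The fitted vectorizer is available by the name given in the transformer tuple.
vocabulary = sorted(clmn.named_transformers_["tfidf"].vocabulary_)

print(result.shape)  # (2, 6): five TF-IDF features plus the passthrough column "c"
print(vocabulary)    # ['gone', 'wild', 'wind', 'with', 'word']
```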

To see the difference, select the column from the pandas DataFrame both ways and compare the type and shape of what comes back:

dataset['a']

vs 

dataset[['a']]
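
For reference, a minimal sketch of that comparison on the DataFrame from the question (the names as_series and as_frame are only for illustration):

```python
import pandas as pd

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

as_series = dataset["a"]   # scalar key: 1-D Series
as_frame = dataset[["a"]]  # list key: 2-D DataFrame with a single column

print(type(as_series).__name__, as_series.shape, as_series.ndim)  # Series (2,) 1
print(type(as_frame).__name__, as_frame.shape, as_frame.ndim)     # DataFrame (2, 1) 2
```

ColumnTransformer follows the same rule: a scalar column name selects 1-D data, a list of names selects 2-D data.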

Update: @SergeyBushmanov, regarding your comment on the other answer, I think you are misreading the documentation. To apply TF-IDF to two columns, you need to pass two transformers, one per column. Something like this:

tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"), 
                          ("tfidf_2", tfidf_2, "b")
                         ],
                         remainder="passthrough")
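
The dataset in the question has no column "b", so the snippet above cannot be run as is. Below is a runnable sketch that adds a hypothetical text column "b" with made-up contents; min_df=0 is left out on the assumption that the default keeps every term anyway.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({
    "a": ["word gone wild", "gone with wind"],
    "b": ["red green blue", "green yellow"],  # hypothetical second text column
    "c": [1, 2],
})

# One vectorizer per text column, each selected with a scalar column name.
clmn = ColumnTransformer(
    [("tfidf_1", TfidfVectorizer(), "a"),
     ("tfidf_2", TfidfVectorizer(), "b")],
    remainder="passthrough",
)
result = clmn.fit_transform(dataset)

# Each vectorizer learns its own vocabulary from its own column.
vocab_a = sorted(clmn.named_transformers_["tfidf_1"].vocabulary_)
vocab_b = sorted(clmn.named_transformers_["tfidf_2"].vocabulary_)

print(result.shape)  # (2, 10): 5 features from "a", 4 from "b", plus "c"
print(vocab_a)       # ['gone', 'wild', 'wind', 'with', 'word']
print(vocab_b)       # ['blue', 'green', 'red', 'yellow']
```

Because the vocabularies are fitted separately, the same word appearing in both columns would get two distinct feature columns in the output.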
