I have a dataset of reviews with a positive/negative class label, and I am applying Naive Bayes to it. First, I convert the reviews into a bag-of-words representation. Here sorted_data['Text'] holds the reviews and final_counts is a sparse matrix:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
Then I split the data into train and test sets:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated

X_tr, X_test, y_tr, y_test = train_test_split(final_counts, labels, test_size=0.3, random_state=0)
I apply the Naive Bayes algorithm as follows:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

optimal_alpha = 1
NB_optimal = BernoulliNB(alpha=optimal_alpha)

# fit the model
NB_optimal.fit(X_tr, y_tr)

# predict the response
pred = NB_optimal.predict(X_test)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the NB classifier for alpha = %d is %f%%' % (optimal_alpha, acc))
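To make the pipeline above concrete, here is a self-contained sketch of the same steps on a tiny made-up dataset (the documents and labels are invented for illustration; the real data is sorted_data['Text']):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# toy reviews and labels (1 = positive, 0 = negative)
docs = ["good great movie", "great fun film", "bad boring movie",
        "terrible boring film", "good fun", "bad terrible"]
labels = [1, 1, 0, 0, 1, 0]

# bag of words -> train/test split -> Bernoulli Naive Bayes
final_counts = CountVectorizer().fit_transform(docs)
X_tr, X_test, y_tr, y_test = train_test_split(
    final_counts, labels, test_size=0.3, random_state=0)

clf = BernoulliNB(alpha=1)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_test)
print('accuracy: %.1f%%' % (accuracy_score(y_test, pred) * 100))
```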
Here X_test is the test dataset, and pred tells us, for each vector in X_test, whether it was predicted as the positive or the negative class. The shape of X_test is (54626, 82343), i.e. 54626 rows and 82343 dimensions, and pred accordingly has length 54626.
My question: I want to get the words with the highest probability in each vector, so that I can see from those words why a review was predicted as the positive or negative class. How can I get the words with the highest probability in each vector?