I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in naive Bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right.
Here is some code for reproduction. First on the quanteda side:
library(quanteda)
library(quanteda.corpora)
library(caret)
corp <- data_corpus_movies
set.seed(300)
id_train <- sample(docnames(corp), size = 1500, replace = FALSE)
# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE)
# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE) %>%
  dfm_select(pattern = training_dfm,
             selection = "keep")
# train model on sentiment
nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(nb_quanteda, newdata = test_dfm)
class_table_quanteda <- table(actual_class, predicted_class)
class_table_quanteda
#>             predicted_class
#> actual_class neg pos
#>          neg 202  47
#>          pos  49 202
Not bad. The accuracy is 80.8% without tuning. Now the same (as far as I know) in caret:
training_m <- convert(training_dfm, to = "matrix")
test_m <- convert(test_dfm, to = "matrix")
nb_caret <- train(x = training_m,
                  y = as.factor(docvars(training_dfm, "Sentiment")),
                  method = "naive_bayes",
                  trControl = trainControl(method = "none"),
                  tuneGrid = data.frame(laplace = 1,
                                        usekernel = FALSE,
                                        adjust = FALSE),
                  verbose = TRUE)
predicted_class_caret <- predict(nb_caret, newdata = test_m)
class_table_caret <- table(actual_class, predicted_class_caret)
class_table_caret
#>             predicted_class_caret
#> actual_class neg pos
#>          neg 246   3
#>          pos 249   2
Not only is the accuracy abysmal here (49.6%, roughly chance), the pos class is hardly ever predicted at all! So I'm pretty sure I'm missing something crucial, as I would assume the implementations should be fairly similar, but I'm not sure what.
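For reference, both accuracy figures can be re-derived from the two tables. This is a minimal base R sketch with the counts typed in by hand from the output above; the names tab_quanteda, tab_caret and accuracy are introduced here for illustration only and are not part of either package.

```r
# Confusion matrices rebuilt from the counts printed above
# (rows = actual class, columns = predicted class).
classes <- c("neg", "pos")
tab_quanteda <- matrix(c(202, 47,
                         49, 202),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(actual = classes, predicted = classes))
tab_caret <- matrix(c(246, 3,
                      249, 2),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = classes, predicted = classes))

# Accuracy = correctly classified documents (the diagonal) / all documents.
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

accuracy(tab_quanteda)
#> [1] 0.808
accuracy(tab_caret)
#> [1] 0.496

# Number of test documents each model assigned to "pos".
colSums(tab_quanteda)[["pos"]]
colSums(tab_caret)[["pos"]]
```

The column sums make the imbalance explicit: the caret model assigns almost every test document to neg, which is why its accuracy sits at chance level on a roughly balanced test set.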
I already looked at the source code for the quanteda function (hoping that it might be built on caret or the underlying package anyway) and saw that there is some weighting and smoothing going on. If I apply the same to my dfm before training (setting laplace = 0 later on), accuracy is a bit better, but still only 53%.