r - How to recreate same DocumentTermMatrix with new (test) data

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

Suppose I have two text-based data sets, training and testing. Each has one text column, which is the column of interest for the job at hand.

I used the tm package in R to process the text column in the training data set. After removing white space, punctuation, and stop words, I stemmed the corpus and created a document-term matrix (DTM) of unigrams containing the count of each word in each document. I then applied a pre-determined cut-off of, say, 50 and kept only those terms with a count greater than 50.

Following this, I train a model (say, GLMNET) using the DTM and the dependent variable, which is present in the training data. Everything runs smoothly up to this point.

However, how do I proceed when I want to score/predict the model on the testing data or any new data that might come in the future?

Specifically, how do I create a DTM with exactly the same terms on new data?

If the new data set shares no words with the original training data, then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM, in terms of structure, on any new corpus.

Any ideas/thoughts?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:43:57+0000

tm has many pitfalls. See the much more efficient text2vec package and its vectorization vignette, which fully answers this question.
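As a rough sketch of the text2vec route (not copied from the vignette), the idea is to build the vocabulary and vectorizer once from the training text and reuse the same vectorizer for any new text. The toy strings, the variable names, and the cut-off of 2 (standing in for the 50 from the question) are assumptions made for illustration.

```r
library(text2vec)

# Toy data, assumed for illustration only
train_text <- c("oil prices rise sharply",
                "crude oil supply falls",
                "prices of crude oil")
test_text  <- c("oil demand rises",
                "completely unrelated words")

# Tokenize the training text and build a vocabulary from it
it_train <- itoken(train_text, preprocessor = tolower,
                   tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it_train)

# Drop rare terms (2 here; the question uses a cut-off of 50)
vocab <- prune_vocabulary(vocab, term_count_min = 2)

# The vectorizer fixes the set and order of columns
vectorizer <- vocab_vectorizer(vocab)
dtm_train  <- create_dtm(it_train, vectorizer)

# New data: same preprocessing and tokenizer, same vectorizer
it_test  <- itoken(test_text, preprocessor = tolower,
                   tokenizer = word_tokenizer, progressbar = FALSE)
dtm_test <- create_dtm(it_test, vectorizer)

# Both matrices should share the same columns
same_columns <- identical(colnames(dtm_train), colnames(dtm_test))
```

Because the vectorizer is built only from the training vocabulary, words that appear only in the new data are ignored, and training terms absent from the new data get zero counts.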

For tm, a simpler way to rebuild the DTM for the second corpus is probably to pass the terms of the first DTM as a dictionary:

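# crude1.dtm is the DTM built from the first (training) corpus;
# crude2 is the second corpus, preprocessed the same way.
crude2.dtm <- DocumentTermMatrix(
  crude2,
  control = list(
    dictionary  = Terms(crude1.dtm),  # count only the terms of the first DTM
    wordLengths = c(3, 10)            # keep terms of 3 to 10 characters
  )
)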
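To connect this to the workflow in the question, here is a minimal end-to-end sketch with tm. The toy strings, the helper `make_corpus`, and the cut-off of 1 (standing in for 50) are hypothetical. Stemming is left out to avoid the extra SnowballC dependency; if the training corpus was stemmed, the new corpus would need the same step.

```r
library(tm)

# Toy data, assumed for illustration only
train_text <- c("oil prices rise sharply",
                "crude oil supply falls",
                "prices of crude oil")
test_text  <- c("oil demand rises",
                "completely unrelated words")

# Hypothetical helper: one preprocessing pipeline shared by both corpora
make_corpus <- function(x) {
  corp <- VCorpus(VectorSource(x))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, stripWhitespace)
  corp
}

# Training DTM, then keep terms whose total count exceeds the cut-off
train_dtm   <- DocumentTermMatrix(make_corpus(train_text))
term_totals <- colSums(as.matrix(train_dtm))
keep_terms  <- names(term_totals)[term_totals > 1]
train_mat   <- as.matrix(train_dtm)[, keep_terms, drop = FALSE]

# Test DTM restricted to the kept training terms
test_dtm <- DocumentTermMatrix(make_corpus(test_text),
                               control = list(dictionary = keep_terms))

# Reorder columns explicitly so they line up with the training matrix
test_mat <- as.matrix(test_dtm)[, keep_terms, drop = FALSE]
```

Here `test_mat` has the same columns, in the same order, as `train_mat`, so it can be passed to the fitted model's predict method. Documents with none of the kept terms simply become all-zero rows.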
