I am using LDA from the topicmodels package, and I have run it on about 30.000 documents, acquired 30 topics, and got the top 10 words for the topics, they look very good. But I would like to see which documents belong to which topic with the highest probability, how can I do that?
myCorpus <- Corpus(VectorSource(userbios$bio))
docs <- userbios$twitter_id
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
myStopwords <- c("twitter", "tweets", "tweet", "tweeting", "account")
# remove stopwords from corpus
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# stem words
# require(rJava) # needed for stemming function
# library(Snowball) # also needed for stemming function
# a <- tm_map(myCorpus, stemDocument, language = "english")
myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths=c(2,Inf), weighting=weightTf))
myDtm2 <- removeSparseTerms(myDtm, sparse=0.85)
dtm <- myDtm2
library(topicmodels)
rowTotals <- apply(dtm, 1, sum)
dtm2 <- dtm[rowTotals>0]
dim(dtm2)
dtm_LDA <- LDA(dtm2, 30)
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…