Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Math of tm::findAssocs: how does this function work?

I have been using findAssocs() for text mining (tm package) but realized that something doesn't seem right with my dataset.

My dataset is 1500 open-ended answers saved in one column of a CSV file. I loaded the dataset like this and used the typical tm_map calls to turn it into a corpus.

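library(tm)

# Read the open-ended answers; the text is in column Q29
Q29 <- read.csv("favoritegame2.csv", stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(Q29$Q29))
# Base functions such as tolower should be wrapped in content_transformer()
# in recent versions of tm so the documents keep their tm structure
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)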

findAssocs(dtm, "like", .2)
> cousin  fill  ....
  0.28    0.20      

Q1. When I look up the terms associated with like, the term like itself (with a score of 1) is not part of the output. However,

dtm.df <-as.data.frame(inspect(dtm))

this data frame consists of 1500 obs. of 1689 variables. (Or is it because the data is saved in a row of the CSV file?)

Q2. Even though cousin and fill each showed up once when the target term like showed up once, their scores differ, as shown above. Shouldn't they be the same?

I'm trying to find the math behind findAssocs() but have had no success yet. Any advice is highly appreciated!


1 Reply

I don't think anyone has answered your final question:

"I'm trying to find the math of findAssocs() but no success yet. Any advice is highly appreciated!"

The math of findAssocs() is based on the standard function cor() in the stats package of R. Given two numeric vectors, cor() computes their covariance divided by the product of their standard deviations (the Pearson correlation, which is cor()'s default method).
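
Written out, the quantity described above is the Pearson correlation coefficient. The block below is only a sketch of that formula plus one worked case; the vectors x and y are illustrative term-count columns chosen for the arithmetic, not values taken from the asker's data.

```latex
r_{xy} \;=\; \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
       \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                  {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \;\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Worked case with n = 6 documents:
%   x = (0, 1, 1, 1, 1, 1),  \bar{x} = 5/6
%   y = (0, 0, 1, 1, 1, 1),  \bar{y} = 2/3
\sum (x_i - \bar{x})(y_i - \bar{y}) = \tfrac{2}{3}, \qquad
\sum (x_i - \bar{x})^2 = \tfrac{5}{6}, \qquad
\sum (y_i - \bar{y})^2 = \tfrac{4}{3}

r_{xy} = \frac{2/3}{\sqrt{(5/6)(4/3)}} = \frac{2/3}{\sqrt{10/9}} = \frac{2}{\sqrt{10}} \approx 0.632
```

Because the means and spreads of both whole columns enter the formula, two terms that each co-occur once with the target term can still receive different scores if their counts differ elsewhere in the corpus.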

So given a DocumentTermMatrix dtm containing terms "word1" and "word2" such that findAssocs(dtm, "word1", 0) returns "word2" with a value of x, the correlation of the term vectors for "word1" and "word2" is x.

For a long-winded example:

> data <-  c("", "word1", "word1 word2","word1 word2 word3","word1 word2 word3 word4","word1 word2 word3 word4 word5") 
> dtm <- DocumentTermMatrix(VCorpus(VectorSource(data)))
> as.matrix(dtm)
    Terms
Docs word1 word2 word3 word4 word5
   1     0     0     0     0     0
   2     1     0     0     0     0
   3     1     1     0     0     0
   4     1     1     1     0     0
   5     1     1     1     1     0
   6     1     1     1     1     1
> findAssocs(dtm, "word1", 0) 
$word1
word2 word3 word4 word5 
 0.63  0.45  0.32  0.20 

> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word2"])
[1] 0.6324555
> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word3"])
[1] 0.4472136

and so on for words 4 and 5.
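
To check all four values in one step, the same term-count matrix can be rebuilt by hand and passed to cor() directly. This is a minimal sketch that uses only base R; the names m and assoc are made up for illustration, and it assumes (as the output above suggests) that findAssocs() simply reports these correlations rounded to two decimals and leaves out the target term itself.

```r
# Rebuild the 6 x 5 term-count matrix from the example above (no tm required).
m <- rbind(
  c(0, 0, 0, 0, 0),
  c(1, 0, 0, 0, 0),
  c(1, 1, 0, 0, 0),
  c(1, 1, 1, 0, 0),
  c(1, 1, 1, 1, 0),
  c(1, 1, 1, 1, 1)
)
colnames(m) <- paste0("word", 1:5)

# cor() on a matrix returns all pairwise column correlations;
# take the row for word1 and drop its correlation with itself (always 1).
assoc <- cor(m)["word1", ]
assoc <- assoc[names(assoc) != "word1"]

# Round to two decimals for comparison with the findAssocs() output.
round(assoc, 2)
```

The self-correlation that gets dropped here is exactly 1, which would explain why the target term never shows up in its own list of associations.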

See also http://r.789695.n4.nabble.com/findAssocs-tt3845751.html#a4637248

