Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
290 views
in Technique[技术] by (71.8m points)

php - How to build a 'related questions' engine?

One of our bigger sites has a section where users can send questions to the website owner which get evaluated personally by his staff. When the same question pops up very often they can add this particular question to the Faq.

In order to prevent them from receiving dozens of similar questions a day we would like to provide a feature similar to the 'Related questions' on this site (stack overflow).

What ways are there to build this kind of feature? I know that i should somehow evaluate the question and compare it to the questions in the faq but how does this comparison work? Are keywords extracted and if so how?

Might be worth mentioning this site is built on the LAMP stack thus these are the technologies available.

Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you wanted to build something like this yourself from scratch, you'd use something called TF/IDF: Term Frequency / Inverse document frequency. That means, to simplify it enormously, you find words in the query that are uncommon in the corpus as a whole and find documents that have those words.

In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.

I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:

http://en.wikipedia.org/wiki/Tf%E2%80%93idf


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...