Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

algorithm - Python digest/hash for string similarity

I'm looking for an algorithm which can generate a short (fx 16 chars (not important) hashcode/digest from a longer string.

The main requirement is that strings which is almost identical should result in the same digest.

Fx 2 almost identical mail:

Hi Martin. Here are some ... spam for you. Regards XYZ. => AAAA AAAA AAAA AAAA

Hi Bo. Here are some ... spam for you. Regards EFG. => AAAA AAAA AAAA AAAA

returns the same diges (or almost the same), where as a different mail:

Hello Finn. This is a test mail. => CCCC CCCC CCCC CCCC

will return a different digest.

This algorithm would be part of a spam filter. The filter will remember digests from mails which it is certain is spam. If the same digest shows up in mails where it is in doubt, the identical digest will cause the filter to increase the spamscore.

I know about Levenshtein, but it requires me to know the strings up front. In this situation i do not have this information. I could have this information, but that would require the filter for store all spam e-mail and check against each one, which would be a very slow process.

Maybe some loose compression algorithm coupled with a calc of the Levenshtein distance between the two could work.

Any pointers appreciated.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It looks like you want locality-sensitive hashing. Consider using minhash or shingling. There's a great explanation of both in Rajaraman & Ullman's book, Mining Massive Datasets. You'll find numerous, short implementations in python searching blogs for the keywords above.

There seem to be other approaches to this (that I don't know much about), but that may be of interest to you since they are specially tailored for spam messages, in particular the nilsimsa hash:


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...