In an effort to accelerate fast-out behaviour on testing strings for anagrams, I came up with a prime-based hashing scheme -- although it looks like I wasn't the first.
The basic idea is to map letters to prime numbers, and to compute the product of these primes. Any rearrangement of the letters will have the same product, and if the result can be arbitrarily large then no combination of other letters can produce the same result.
I had initially envisioned this as just a hash. Eventually the product would overflow and start to alias other letter combinations. However, by mapping the most frequent letters to the smallest primes the product grows slowly and can often avoid overflow altogether. In this case we get a perfect hash, giving both definite positive and negative results without additional testing.
What's notable is that it doesn't fill the coding space very efficiently before overflowing. No result will have any prime factors greater than 103, and the distribution of small primes is fixed and not necessarily a great match to letter frequency.
Now I'm wondering if there's something substantially better than this. Something that covers more results with perfect hashes and has strong distribution in the remaining cases.
The densest coding scheme I can think of is to sort the letters and then pack them into a word with an entropy coder. In this scheme the letter frequency will obviously be enormously biased because of the range constraints applied to each position (eg., the likelihood of a sorted array starting with z is substantially lower than that of a sorted array ending with a z).
That sounds like a whole lot of work, though -- and I can't see it guaranteeing to give good distribution in the overflow case.
Perhaps there's a better set of factors to map the letters to, and a better way to detect when the risk of aliasing has started. Or a hashing scheme that doesn't rely on multiplication? Something that's easy to calculate?
So that's:
- A perfect hash for as much real-world input as possible (for some sensible number of bits).
- A strong hash for remaining cases, with a means of distinguishing the two cases.
- Easy to calculate.
English language constraints (26 letters with typical English-like word structure) will do fine. Multi-byte coding schemes are a whole other problem.
C code preferred because I understand it.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…