Another tool from me: a simple and practical implementation of string hash functions. If needed, familiarize yourself a little with hashing here: http://en.wikipedia.org/wiki/Hash_fu... To summarize, it's where you have a function on a key (in my case, a string) that returns a (hopefully!) unique numerical value.
In particular, I have written two functions (+ an auxiliary function): - A fast Fowler-Noll-Vo customized hash function, useful for quick hash generation when you're willing to work with sparse arrays and risks of collision, described here: http://isthe.com/chongo/tech/comp/fn... - A MP hash function using the Hash, Displace, and Compress method, useful for high-quality zero-collision hashes when you know all the keys, described in this paper: http://cmph.sourceforge.net/papers/e...
The MP hash generator should work in O(n) - the actual hashing functions work in O(1) (well, technically O(N) where N is the length of the key). Merely the fact that it handled the 2031 keys in the example so quickly is a phenomenal result. Note, however, that HDC is dependent on FNV hash, so include both if you want to use HDC hash only.
I have plans to rewrite my IsWord? function in terms of this more complex but far faster way of storing dictionaries.----------------- Hopefully PA is inconsistent.
Hashing is the concept of storing a value into a unique number, with no chance that another value could result in the same number. Correct? Well, in the terms of string hashing, why not compress the string into a binary number? You can store strings of any length, with any combination of characters, and no other string will ever result in the same number. Would that work, or am I missing the point, here?----------------- Orange is my favorite number.
A hash function does effectively what you're saying by "[compressing] the string into a binary number", but it also doesn't do exactly what you say. The condition that there is no chance of any two strings have the same hash is the perfect condition. The FNV hash is not perfect, which is why it's more useful when you want sparse data storage.
What you're suggesting with compression also has a few issues. It's very easy to say "oh, let's just compress the string", but the question is what compression algorithm will you use? If you try directly converting the UTF-16 string into an array of bytes which correspond to a number, you've actually done what's called a trivial hashing. This conversion is indeed perfect, but often absolutely useless, as the number you converted to could either overflow easily or be completely outside the bounds of what you want.
If you try design your compression method to be lossless, then keep in mind that the inherent limit to how much you can compress the string is wholly dependent on its entropy as well. The lossless condition will also imply that every key that you won\'t use will also get a unique value, which is also pointless when you want to have a one-to-one function for your keys only: the minimal condition.
There's also one very fundamental issue with compression vs. hashing. You can't usually control the behavior of collisions when applying the compressed string value modulo size, while hash functions are commonly designed to minimize these collisions via prime numbers. Funnily enough, a hash function is like a very lossy compression algorithm.
I highly recommend reading up on hashing. It's both mathematically and computationally interesting.----------------- Hopefully PA is inconsistent.
I've played around with it quite a bit before. xD I am a programmer, after all. And yes, it is a very interesting topic! I know several ways that a string can be hashed, but I only suggested binary conversion because it was simple, and still made my point. You could also use index based hashing, and stuff like that. There's no way you can design a perfect collision-less system without overflowing integers being an issue. Also the method of compression is also highly dependent on the algorithm used. But the same issue still applies.
All and all, good job on your algorithm. ^^----------------- Orange is my favorite number.