
After reading the JDK's source code, I found HashMap's hash() function interesting. Its source code looks like this:

    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

The parameter h is the hashCode of the object that was put into the HashMap. How does this method work, and why? How can it defend against poor hashCode functions?


1 Answer


Hashtable uses the 'classical' approach of prime numbers: to get the 'index' of a value, you take the hash of the key and compute the modulus against the table size. Choosing a prime number as the size (normally) gives a nice spread over the indexes (depending on the hash as well, of course).
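As a rough illustration of the modulus approach (a sketch, not Hashtable's actual source: the class name `PrimeModDemo` and the sample hash values are made up for this example, and 11 is used simply as a small prime table size):

```java
class PrimeModDemo {
    // Index by modulus. The mask clears the sign bit so that
    // negative hash values still produce a valid, non-negative index.
    static int indexFor(int hash, int size) {
        return (hash & 0x7FFFFFFF) % size;
    }

    public static void main(String[] args) {
        int size = 11; // a small prime, chosen for illustration
        // All of these hash values end in four zero bits.
        for (int hash = 0; hash <= 64; hash += 16) {
            System.out.println(hash + " -> slot " + indexFor(hash, size));
        }
    }
}
```

Although the hash values 0, 16, 32, 48 and 64 all share the same lower four bits, they end up in slots 0, 5, 10, 4 and 9 respectively.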

HashMap uses a 'power of two'-approach, meaning the table sizes are always a power of two. The reason is that this is supposed to be faster than the modulus calculation used with prime sizes. However, since a power of two is not a prime number, there would be more collisions, especially between hash values that share the same lower bits.

Why? When the size is a power of two, the modulus against the size to get the (bucket/slot) index is simply calculated as: hash & (size-1) (which is exactly what's used in HashMap to get the index!). That's basically the problem with the 'power-of-two' approach: if the length is small, e.g. 16, the default value of HashMap, only the last bits are used and hence hash values with the same lower bits will result in the same (bucket) index. In the case of 16, only the last 4 bits are used to calculate the index.
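A minimal sketch of that masking step, to make the problem concrete. The class name `MaskIndexDemo` and the sample hash values are assumptions for illustration; the method body is just the `hash & (size-1)` expression described above.

```java
class MaskIndexDemo {
    // For a power-of-two length, this keeps only the low bits of the hash.
    // With length 16 that means the last 4 bits.
    static int indexFor(int hash, int length) {
        return hash & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        // Hypothetical hash values that differ only above the lowest 4 bits.
        int[] hashes = {0x10, 0x20, 0x30, 0x7ABCDEF0};
        for (int h : hashes) {
            System.out.printf("%08x -> bucket %d%n", h, indexFor(h, length));
        }
    }
}
```

Every one of these values ends up in bucket 0, because their last 4 bits are all zero and nothing above those bits takes part in the calculation.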

That's why an extra hash is calculated: it basically shifts the higher bits down and XORs them with the lower bits, so the higher bits also influence the index. I don't really know the reason for the numbers 20, 12, 7 and 4. They used to be different (in Java 1.5 or so, the hash function was a little different). I suppose there's more advanced literature available. You might find more info about why they use the numbers they use in all kinds of algorithm-related literature, e.g.

http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming

http://mitpress.mit.edu/books/introduction-algorithms

http://burtleburtle.net/bob/hash/evahash.html#lookup uses different algorithms depending on the length (which makes some sense).

http://www.javaspecialists.eu/archive/Issue054.html is probably interesting as well. Check the reaction of Joshua Bloch near the bottom of the article: "The replacement secondary hash function (which I developed with the aid of a computer) has strong statistical properties that pretty much guarantee good bucket distribution." So, if you ask me, the numbers come from some kind of analysis performed by Josh himself, probably assisted by who knows who.
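To see the effect of the extra hash on the bucket index, here is a small sketch that combines the hash() function from the question with the `hash & (size-1)` masking. The class name `SpreadDemo` and the three sample hashCodes are assumptions picked for illustration; they differ only in bits above the lowest 16.

```java
class SpreadDemo {
    // The supplemental hash quoted in the question.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Bucket index for a power-of-two table length.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        // Hypothetical hashCodes whose lowest 16 bits are all zero.
        int[] hashCodes = {0x10000, 0x20000, 0x30000};
        for (int hc : hashCodes) {
            System.out.printf("%x: raw bucket %d, bucket after hash() %d%n",
                    hc, indexFor(hc, length), indexFor(hash(hc), length));
        }
    }
}
```

Without the extra step all three values fall into bucket 0; after hash() they land in buckets 1, 2 and 3, because the higher bits have been XORed into the lowest four.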

So: a power-of-two size gives a faster index calculation, but it requires an additional hash calculation in order to get a nice spread over the slots/buckets.

