Expected pair of collisions in two choice hashing

In two-choice hashing (with chaining), two random hash functions h1 and h2 are selected to hash n keys to m positions. The process goes like this:

Insert all n keys sequentially: for each key x, evaluate both hash functions and add the key to the shorter of the two lists indexed by h1(x) and h2(x). Let Y be the number of pairs (x, x') of distinct keys that end up in the same linked list. What is the expectation E[Y]?

Assume h1 and h2 hash keys uniformly and independently.
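To make the quantity Y concrete, here is a minimal Monte Carlo sketch of the process described above. It is an illustration under stated assumptions, not part of the original question: the names `two_choice_pairs` and `estimate_expected_pairs` are hypothetical, the two hash functions are modelled as independent uniform random draws, and ties between equally long lists go to the list indexed by h1.

```python
import random

def two_choice_pairs(n, m, rng):
    """Insert n keys into m chained buckets with two-choice hashing; return Y."""
    loads = [0] * m  # only the list lengths matter for counting pairs
    for _ in range(n):
        # Stand-ins for h1(x) and h2(x): independent uniform positions (assumption).
        a, b = rng.randrange(m), rng.randrange(m)
        # Add the key to the shorter list; ties go to h1's list (assumption).
        target = a if loads[a] <= loads[b] else b
        loads[target] += 1
    # A list of length k contributes k*(k-1)/2 pairs of keys sharing that list.
    return sum(k * (k - 1) // 2 for k in loads)

def estimate_expected_pairs(n, m, trials=200, seed=0):
    """Average Y over independent trials as an estimate of E[Y]."""
    rng = random.Random(seed)
    return sum(two_choice_pairs(n, m, rng) for _ in range(trials)) / trials
```

Calling `estimate_expected_pairs(1000, 1000)` gives an empirical estimate that can be compared against any analytic or numerical answer.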


Solution 1:^[1]

I doubt that there is a clean analytic solution to this problem.

But it is very tractable to produce numerical approximations for the case of a large number of buckets. Let x be the ratio between the number of keys stored and the number of buckets. Let f_n(x) be the expected portion of buckets with exactly n keys and F_n(x) be the expected portion of buckets with at most n keys. (In this answer, n denotes a bucket size, not the number of keys from the question.)

Then trivially F_n(x) = f_0(x) + f_1(x) + ... + f_n(x). Also, f_n'(x) is the probability that the next key is added to a bucket of size n-1, minus the probability that the next key is added to a bucket of size n.

The probability that a bucket of size n will be added to next is the probability that the first hash function chooses a bucket of size n and the second chooses one of size at least n, plus the probability that the first hash function chooses a bucket of size greater than n and the second chooses one of size n. The probability of choosing a bucket of size greater than n is simply 1 - F_n(x), and the probability of choosing one of size at least n is 1 - F_{n-1}(x). So this probability is f_n(x)(1 - F_{n-1}(x)) + (1 - F_n(x))f_n(x).

That makes

    f_n'(x) = f_{n-1}(x)(1 - F_{n-2}(x)) + (1 - F_{n-1}(x))f_{n-1}(x) - f_n(x)(1 - F_{n-1}(x)) - (1 - F_n(x))f_n(x)

which is a horribly non-linear system of equations, and there is no reason to expect any nice solution to it. But it is also amenable to Runge-Kutta methods (http://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods), and you shouldn't need too many terms to get accurate estimates since the portion of heavily filled buckets falls off faster than exponentially. (To see why, convince yourself that if n is significantly larger than x, then f_{n+1}'(x) is approximately f_n(x)^2. The repeated squaring of small terms means that there is a very sharp falloff from f_n(x) < 0.01 to f_n(x) < 0.00000001.)
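As a rough illustration of the numerical approach, the system above can be truncated at a maximum bucket size and integrated with the classical fourth-order Runge-Kutta scheme. This is a sketch under assumptions, not the answer author's code: the function names, the truncation `max_size=12`, and the step count `steps=1000` are all hypothetical choices. It uses the convention that terms with a negative index vanish (F_{-1}(x) = 0, and nothing flows into size 0), and starts from f_0(0) = 1 with every other f_n(0) = 0.

```python
from math import comb

def derivative(f):
    """Right-hand side f_n'(x) = p_{n-1} - p_n of the truncated system."""
    p = []
    below = 0.0  # F_{n-1}(x): portion of buckets with fewer than n keys
    for fn in f:
        upto = below + fn  # F_n(x)
        # p_n: probability that the next key is added to a bucket of size n.
        p.append(fn * ((1.0 - below) + (1.0 - upto)))
        below = upto
    return [(p[n - 1] if n > 0 else 0.0) - p[n] for n in range(len(f))]

def bucket_size_distribution(x_end, max_size=12, steps=1000):
    """Integrate from x = 0 (all buckets empty) to x_end with classical RK4."""
    f = [1.0] + [0.0] * max_size
    h = x_end / steps
    for _ in range(steps):
        k1 = derivative(f)
        k2 = derivative([a + 0.5 * h * b for a, b in zip(f, k1)])
        k3 = derivative([a + 0.5 * h * b for a, b in zip(f, k2)])
        k4 = derivative([a + h * b for a, b in zip(f, k3)])
        f = [a + h / 6.0 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(f, k1, k2, k3, k4)]
    return f

def expected_pairs_per_bucket(x_end, **kwargs):
    """Sum of f_n(x) * C(n, 2): expected number of key pairs per bucket."""
    f = bucket_size_distribution(x_end, **kwargs)
    return sum(fn * comb(n, 2) for n, fn in enumerate(f))
```

For a large number of buckets m and x equal to the number of keys divided by m, multiplying `expected_pairs_per_bucket(x)` by m should approximate E[Y]. The truncation discards buckets larger than `max_size`, which is only reasonable while f_n(x) is negligible at that size.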

Solution 2:^[2]

It depends on m. For m = Ω(n^(3/2)), that is, m growing at least as fast as n^(3/2), E[Y] = O(1).
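One way to sanity-check this claim empirically is to simulate the insertion process with m set close to n^(3/2) for several values of n and see whether the average of Y stays bounded. The sketch below is only an experiment under assumptions, not a proof: `average_pairs` is a hypothetical name, the hash functions are modelled as independent uniform draws, and ties go to the list indexed by h1.

```python
import math
import random

def average_pairs(n, trials, rng):
    """Average Y over `trials` runs with m = ceil(n^(3/2)) buckets."""
    m = math.ceil(n ** 1.5)
    total = 0
    for _ in range(trials):
        loads = {}  # sparse map from bucket index to list length
        for _ in range(n):
            # Stand-ins for h1(x) and h2(x) (assumption: uniform, independent).
            a, b = rng.randrange(m), rng.randrange(m)
            target = a if loads.get(a, 0) <= loads.get(b, 0) else b
            k = loads.get(target, 0)
            total += k  # the new key forms a pair with each key already there
            loads[target] = k + 1
    return total / trials

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 400, 1600):
        print(n, average_pairs(n, 100, rng))
```

If the claim holds, the printed averages should not grow with n.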

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	btilly
Solution 2	morteza hosseini
