Hash Function PDF

Average number of collisions in a hash function
Sergei Winitzki
2001-10-13 to December 30, 2008
1 Statistics of random hash function

1.1 Formulation of the problem

A $p$-bit hash function is a function from $\mathbb{N}$ to the integer range $\{0, 1, ..., 2^p - 1\}$. Such functions are used as check sums on data files. A data file is considered as a stream of bits, that is, a binary representation of a nonnegative integer number. If the hash function gives different results on two files, the files are surely different. For example, the MD5 sum is a 128-bit hash function frequently used to verify file integrity. A good hash function will yield different results for even slightly different files; heuristically, a good hash function yields a random value. However, it is clear that there will be, by pure chance, some cases where different inputs yield the same hash function value. These are called hash collisions. The problem is to estimate the frequency of hash collisions, assuming a perfect hash, i.e. that the hash values are perfectly random, uniformly distributed numbers in the hash range.

Therefore, the problem of finding the frequency of hash collisions is equivalent to the following mathematical problem. Suppose $x_1, ..., x_n$ are independent, uniformly randomly chosen integers, each ranging from 1 to $N$ (in the case of a $p$-bit hash function, we choose $N = 2^p$). We need to compute the average number of different integers in the set $\{x_1, ..., x_n\}$. We would like to compute also the average number of pair collisions, triple collisions, etc.

1.2 The basic generating function

One drawing of $n$ integers can be described if we specify how many times each possible integer from the set $\{1, ..., N\}$ is selected. Consider the probability $p(n; s_1, ..., s_N)$ that the integer $i$ is selected $s_i$ times ($i = 1, ..., N$). The generating function for this probability can be defined as

$$G(n; q_1, ..., q_N) = \sum_{s_i \ge 0} q_1^{s_1} \cdots q_N^{s_N} \, p(n; s_1, ..., s_N). \tag{1}$$

For $n = 1$ we have

$$p(1; s_1, ..., s_N) = \begin{cases} \frac{1}{N}, & \text{if only one of the } s_i \text{ is 1 and the others are 0,} \\ 0, & \text{otherwise.} \end{cases}$$

So the generating function for $n = 1$ is simply

$$G(1; q_1, ..., q_N) = \frac{1}{N} (q_1 + ... + q_N). \tag{2}$$

For $n > 1$ the generating function is equal to the product of the $n$ (identical) generating functions (2):

$$G(n; q_1, ..., q_N) = \frac{1}{N^n} (q_1 + ... + q_N)^n = \frac{1}{N^n} \sum_{s_i \ge 0;\ \sum_i s_i = n} \frac{n!}{s_1! \cdots s_N!} \, q_1^{s_1} \cdots q_N^{s_N}. \tag{3}$$

This generating function contains, in principle, the complete information about the probabilities of drawing various sets of integers. Our task now is to use this generating function for the computations we need to perform.

1.3 Average number of different integers

Each possible drawing of the $n$ random integers is represented in the generating function $G$ by a term such as $q_1 q_3^2 q_4$, which signifies a drawing of $\{1, 3, 3, 4\}$. The number of different integers in this drawing is 3. The generating function $G$ is the sum of all these terms with the coefficients equal to the probabilities of the drawings. The average number of different integers will be computed if we replace in $G(n; q_1, ..., q_N)$ every term $q_1^{s_1} \cdots q_N^{s_N}$ by the number of different $q_i$'s in that term.

The number of different $q_i$'s in the term $q_1^{s_1} \cdots q_N^{s_N}$ can be computed as $f(s_1) + ... + f(s_N)$, where the function $f(s)$ is defined as

$$f(s) = \begin{cases} 0, & s = 0, \\ 1, & s \ge 1. \end{cases} \tag{4}$$

So we only need to replace $q_1^{s_1} \cdots q_N^{s_N}$ by $f(s_1) + ... + f(s_N)$.

An elegant way of doing this is to find an explicit formula for a linear map from polynomials in $\{q_i\}$ to numbers, so that $q_1^{s_1} \cdots q_N^{s_N}$ is mapped to $f(s_1) + ... + f(s_N)$. This map can be found as follows.

First let us try to find the map for just one variable. We need a formula for a linear map such that $q^s$ is mapped into $f(s)$. In particular, we need $f(s) = 1$ for all $s \ge 1$. In other words, $q^2$ is equivalent to $q$ after the map; this suggests that $q$ should be replaced by a projection matrix. However, once we have the idea of using a matrix, we do not need to limit ourselves to a particular choice of $f(s)$. Let us keep $f(s)$ general and substitute instead of $q$ some matrix $T$ such that $T^s$ is mapped into $f(s)$. This can be arranged if we choose
some vector $u \in V$ and some covector $v^* \in V^*$ such that

$$\langle v^*, T^s u \rangle = f(s), \tag{5}$$

where the operator $T$ acts in the vector space $V$. This construction yields a linear map from polynomials in $q$ into numbers, such that $q^s$ is mapped into $f(s)$.

Now let us generalize to $N$ variables $\{q_i\}$. We need a linear map that yields $f(s_1) + ... + f(s_N)$. This suggests that we use a direct sum of $N$ copies of the linear space $V$ and substitute instead of $q_i$ the operators

$$T_i \equiv 1_V \oplus ... \oplus T \oplus ... \oplus 1_V \in \mathrm{End}(V \oplus ... \oplus V), \tag{6}$$

where the operator $T$ acts on the $i$-th copy of $V$ and $1_V$ is the identity operator in $V$. We now define the vector $\mathbf{u}$ and the covector $\mathbf{v}^*$,

$$\mathbf{u} \equiv u \oplus ... \oplus u, \qquad \mathbf{v}^* \equiv v^* \oplus ... \oplus v^*, \tag{7}$$

and verify that

$$\langle \mathbf{v}^*, T_i^s \mathbf{u} \rangle = f(s). \tag{8}$$

(Eq. (8) uses $f(0) = 0$, as in Eq. (4): each of the other $N - 1$ copies of $V$ contributes $\langle v^*, u \rangle = f(0)$.)

When we substitute $T_i$ instead of $q_i$ in a polynomial term $q_1^{s_1} \cdots q_N^{s_N}$, we obtain an operator $T^{s_1} \oplus ... \oplus T^{s_N}$, which will yield

$$\langle \mathbf{v}^*, (T^{s_1} \oplus ... \oplus T^{s_N}) \mathbf{u} \rangle = f(s_1) + ... + f(s_N). \tag{9}$$

Therefore, we constructed a linear map that can be applied directly to the polynomial $G(n; q_1, ..., q_N)$ to yield the average number of different integers if $f(s)$ is chosen as shown above.

Let us perform this computation using the explicit form of $G(n; q_1, ..., q_N)$. We substitute $T_i$ instead of $q_i$ and obtain

$$G(n; T_1, ..., T_N) = \left( \frac{T_1 + ... + T_N}{N} \right)^n.$$

The operator $T_1 + ... + T_N$ can be simplified to

$$T_1 + ... + T_N = [(N - 1) 1_V + T] \oplus ... \oplus [(N - 1) 1_V + T].$$

Let us denote for brevity

$$Q \equiv \frac{N - 1}{N} 1_V + \frac{1}{N} T.$$

Then we can write

$$G(n; T_1, ..., T_N) = Q^n \oplus ... \oplus Q^n. \tag{10}$$

Now we can evaluate the application

$$\langle \mathbf{v}^*, G(n; T_1, ..., T_N) \mathbf{u} \rangle = N \langle v^*, Q^n u \rangle.$$

Consider the function $f(s)$ defined by Eq. (4). One can certainly choose an operator $T$ and vectors $u$, $v^*$ such that Eq. (5) holds for this $f(s)$. Then we find

$$\langle v^*, Q^n u \rangle = \frac{1}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N - 1)^{n-k} \langle v^*, T^k u \rangle = \frac{1}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N - 1)^{n-k} f(k) \tag{11}$$

$$= \frac{1}{N^n} \sum_{k=1}^{n} \binom{n}{k} (N - 1)^{n-k} = \frac{N^n - (N - 1)^n}{N^n}.$$

Therefore the average number of distinct integers is

$$n_d = N \left[ 1 - \left( 1 - \frac{1}{N} \right)^n \right]. \tag{12}$$

This formula describes the average number of collisions in a perfect hash function, which is $n - n_d$.

As a realistic example, let us assume that we have computed the 32-bit hash sums of one million different files. How many different hash sums do we have on the average? We substitute $N = 2^{32}$ and $n = 10^6$ into Eq. (12) and find

$$n_d(2^{32}, 10^6) \approx 10^6 - 116.4,$$

which means that about 116 files will have the same hash sum as another file even though the files are different. So we need to use a larger hash range; with $N = 2^{64}$ we find

$$n_d(2^{64}, 10^6) \approx 10^6 - 2.7 \cdot 10^{-8}. \tag{13}$$

This indicates a negligible chance of hash collisions. Therefore, a 64-bit hash sum is sufficient for a million files.

Let us perform an asymptotic estimate of the collision rate for very large $N$. We may expand Eq. (12) as

$$n_d \approx N \left[ 1 - 1 + \frac{n}{N} - \frac{n(n - 1)}{2N^2} \right] = n - \frac{n(n - 1)}{2N}.$$

Therefore, the collision rate is negligible ($n - n_d \ll 1$) when $N \gg n^2$.

1.4 Average number of pairs, triples, etc.

If we wanted to find the average number of pairs, we could replace the term $q_1^{s_1} \cdots q_N^{s_N}$ in the generating function $G(n; q_1, ..., q_N)$ by $f_2(s_1) + ... + f_2(s_N)$, where $f_2(s)$ is defined as

$$f_2(s) = \delta_{s2} = \begin{cases} 1, & s = 2; \\ 0, & \text{otherwise.} \end{cases}$$

We can similarly consider the triples or, more generally, $p$-tuples of coincident integers, by taking the function $f_p(s) = \delta_{sp}$. We can describe all these $p$-tuples at once if we consider the generating function of the average number of $p$-tuples; this means introducing an additional formal parameter $t$ and defining

$$f(s; t) \equiv \sum_{p \ge 0} f_p(s) t^p = t^s.$$

Hence, we use the same derivation as in the previous section up to Eq. (11), but now we substitute the function $f(s) = t^s$ instead of the previously used $f(s)$ in Eq. (11). Then we find

$$\langle \mathbf{v}^*, G(n; T_1, ..., T_N) \mathbf{u} \rangle = N \langle v^*, Q^n u \rangle = \frac{N}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N - 1)^{n-k} t^k = \frac{(N - 1 + t)^n}{N^{n-1}}. \tag{14}$$
The average number of pairs is read off from Eq. (14) as the coefficient at $t^2$. The average number of $p$-tuples is

$$n_p = \binom{n}{p} \frac{(N - 1)^{n-p}}{N^{n-1}}.$$

For example, with $p = 2$ we find

$$n_2 = \frac{n(n - 1)}{2} \, \frac{(N - 1)^{n-2}}{N^{n-1}} = \frac{n(n - 1)}{2N} \left( 1 - \frac{1}{N} \right)^{n-2}.$$
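
The closed-form results of Sections 1.3 and 1.4 are straightforward to evaluate numerically. The following is a minimal Python sketch and not part of the original derivation. The function names `average_distinct` and `average_p_tuples` are illustrative, and the 60-digit `decimal` precision is an assumption chosen so that the tiny difference $n - n_d$ is not lost to rounding when $N = 2^{64}$ (ordinary double precision cannot even represent $1 - 2^{-64}$ as distinct from 1).

```python
from decimal import Decimal, getcontext
from math import comb

# Assumption: 60 significant digits is ample for N up to 2**64 and n ~ 10**6.
getcontext().prec = 60


def average_distinct(N: int, n: int) -> Decimal:
    """Eq. (12): n_d = N * (1 - (1 - 1/N)**n)."""
    one = Decimal(1)
    return N * (one - (one - one / N) ** n)


def average_p_tuples(N: int, n: int, p: int) -> Decimal:
    """n_p = C(n, p) * (N - 1)**(n - p) / N**(n - 1).

    Rewritten as C(n, p) * (1 - 1/N)**(n - p) / N**(p - 1), which is
    algebraically the same but avoids forming astronomically large powers.
    """
    one = Decimal(1)
    return comb(n, p) * (one - one / N) ** (n - p) / Decimal(N) ** (p - 1)


def average_collisions(N: int, n: int) -> Decimal:
    """Average number of inputs that do not produce a new value: n - n_d."""
    return n - average_distinct(N, n)
```

With these definitions, `average_collisions(2**32, 10**6)` evaluates to roughly 116.4 and `average_collisions(2**64, 10**6)` to roughly 2.7e-8, in agreement with the example in Section 1.3. Two consistency checks follow from the definitions: summing $n_p$ over $p \ge 1$ gives $n_d$, and summing $p \, n_p$ over all $p$ gives $n$.
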
1.5 Remarks
Perhaps the calculation can be performed directly
without the procedure with the substitution of some
complicated operators into the generating function.
Maybe one can directly consider the generating func-
tion of the average number of p-tuples, starting with
Eq. (11).
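
One way to check the results without relying on the operator construction is a direct simulation. The sketch below is an illustration under stated assumptions, not the author's method: it draws $n$ uniform integers many times, counts how many distinct values appear and how many values appear exactly $p$ times, and compares the empirical averages with Eq. (12) and the formula for $n_p$. The names `simulate`, `predicted_distinct` and `predicted_p_tuples`, the seed, and the use of the range $0, ..., N - 1$ instead of $1, ..., N$ are all choices made here for convenience.

```python
import random
from collections import Counter
from math import comb


def simulate(N: int, n: int, trials: int, seed: int = 0):
    """Estimate the average number of distinct values and of exact p-tuples."""
    rng = random.Random(seed)
    distinct_total = 0
    multiplicity_totals = Counter()  # p -> number of values seen exactly p times
    for _ in range(trials):
        # counts[x] = how many times the integer x was drawn in this trial
        counts = Counter(rng.randrange(N) for _ in range(n))
        distinct_total += len(counts)
        multiplicity_totals.update(counts.values())
    avg_distinct = distinct_total / trials
    avg_p = {p: c / trials for p, c in multiplicity_totals.items()}
    return avg_distinct, avg_p


def predicted_distinct(N: int, n: int) -> float:
    """Eq. (12) in floating point (adequate for small N)."""
    return N * (1 - (1 - 1 / N) ** n)


def predicted_p_tuples(N: int, n: int, p: int) -> float:
    """n_p = C(n, p) * (N - 1)**(n - p) / N**(n - 1)."""
    return comb(n, p) * (N - 1) ** (n - p) / N ** (n - 1)
```

For example, `simulate(50, 30, 20000)` can be compared with `predicted_distinct(50, 30)` and `predicted_p_tuples(50, 30, 2)`; the empirical and predicted averages should agree to within the statistical noise of the simulation.
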
