Analysis of Algorithms: Hash Tables

Andres Mendez-Vazquez

October 3, 2014

Outline

1 Basic data structures and operations
2 Hash tables
   Hash tables: Concepts
   Analysis of hashing under Chaining
3 Hashing Methods
   The Division Method
   The Multiplication Method
   Clustering Analysis of Hashing Functions
   A Possible Solution: Universal Hashing
4 Open Addressing
   Linear Probing
   Quadratic Probing
   Double Hashing
5 Exercises

First: About Basic Data Structures

Remark
It is quite interesting to notice that many data structures actually share similar operations!

Yes
If you think of them as ADTs.


Examples

Search(S, k)
Example: Search in a BST

[Figure: a BST with keys 8, 3, 1, 6, 4, 7, 10, 14, 13; searching for k = 7]

Examples

Insert(S, x)
Example: Insert in a linked list

[Figure: a singly linked list with nodes A–E, a firstNode pointer and a NULL terminator; a new node K is being inserted]

And Again

Delete(S, x)
Example: Delete in a BST

[Figure: a BST with keys 8, 3, 1, 6, 4, 7, 10, 14, 13, shown before and after deleting key 6]

Basic data structures and operations

Therefore
These are basic structures; it is up to you to read about them.

Chapter 10, Cormen's book.
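The linked-list versions of these shared ADT operations make the point concrete. A minimal C sketch (the Node layout and function names are illustrative, not taken from the book):

```c
#include <stdlib.h>

/* A minimal singly linked list with the Search/Insert/Delete interface. */
typedef struct Node {
    int key;
    struct Node *next;
} Node;

/* Search(S, k): walk the list until the key is found or we hit NULL. */
Node *list_search(Node *head, int k)
{
    while (head != NULL && head->key != k)
        head = head->next;
    return head;               /* NULL when k is absent */
}

/* Insert(S, x): push a new node at the front, O(1). */
Node *list_insert(Node *head, int k)
{
    Node *x = malloc(sizeof *x);
    x->key = k;
    x->next = head;
    return x;                  /* new head of the list */
}

/* Delete(S, k): unlink the first node carrying k, if any. */
Node *list_delete(Node *head, int k)
{
    Node **p = &head;
    while (*p != NULL && (*p)->key != k)
        p = &(*p)->next;
    if (*p != NULL) {
        Node *dead = *p;
        *p = dead->next;
        free(dead);
    }
    return head;
}
```

The same three names reappear below for direct-address tables and hash tables; only the underlying representation changes.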


Hash tables: Concepts

Definition
A hash table or hash map T is a data structure, most commonly an array, that uses a hash function to efficiently map certain identifiers of keys (e.g. person names) to associated values.

Advantages
They have the advantage of having an expected complexity of operations of O(1 + α).
   Still, be aware of α.

However, if you have a large number of keys, U
Then, it is impractical to store a table of the size of |U|.
Thus, you can use a hash function h : U → {0, 1, ..., m − 1}.


When you have a small universe of keys, U

Remarks
It is not necessary to map the key values.
Key values are direct addresses in the array.
Direct implementation or Direct-address tables.

Operations
1 Direct-Address-Search(T, k)
   return T[k]
2 Direct-Address-Insert(T, x)
   T[x.key] = x
3 Direct-Address-Delete(T, x)
   T[x.key] = NIL
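Since the key itself is the array index, each of the three operations is a single array access. A minimal C sketch (the Item layout and the toy universe size are illustrative, not from the slides):

```c
#include <stddef.h>

/* Direct-address table for a small universe U = {0, ..., 99}. */
enum { UNIVERSE = 100 };

typedef struct { int key; const char *value; } Item;

/* Direct-Address-Search(T, k): return T[k]. */
Item *direct_address_search(Item *T[], int k)   { return T[k]; }

/* Direct-Address-Insert(T, x): T[x.key] = x. */
void direct_address_insert(Item *T[], Item *x)  { T[x->key] = x; }

/* Direct-Address-Delete(T, x): T[x.key] = NIL. */
void direct_address_delete(Item *T[], Item *x)  { T[x->key] = NULL; }
```

Every operation is O(1), but the table needs one slot per possible key, which is exactly what becomes impractical for a large |U|.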


When you have a large universe of keys, U

Then
It is impractical to store a table of the size of |U|.

You can use a special function for mapping

h : U → {0, 1, ..., m − 1}   (1)

Problem
With a large enough universe U, two keys can hash to the same value.
This is called a collision.


Collisions

This is a problem
We might try to avoid this by using a suitable hash function h.

Idea
Make h appear to be “random” enough to avoid collisions altogether (highly improbable) or to minimize their probability.

You still have the problem of collisions
Possible solutions to the problem:
1 Chaining
2 Open Addressing


Hash tables: Chaining

A Possible Solution
Insert the elements that hash to the same slot into a linked list.

[Figure: the universe of keys U hashed into table slots, each slot holding a linked list of the colliding elements]
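A chained table keeps one linked list per slot. A minimal C sketch of insert and search under chaining (the division hash with m = 7, and all names, are illustrative choices):

```c
#include <stdlib.h>

/* Chained hash table: each of the m slots holds a linked list of the
   elements whose keys hash to that slot. */
enum { M = 7 };

typedef struct Entry {
    int key;
    struct Entry *next;
} Entry;

static unsigned h(int k) { return (unsigned)k % M; }   /* division method */

/* Chained-Hash-Insert: push at the head of the slot's list, O(1). */
void chain_insert(Entry *T[], int key)
{
    Entry *e = malloc(sizeof *e);
    e->key  = key;
    e->next = T[h(key)];
    T[h(key)] = e;
}

/* Chained-Hash-Search: scan only the one list the key hashes to. */
Entry *chain_search(Entry *T[], int key)
{
    for (Entry *e = T[h(key)]; e != NULL; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}
```

A search now touches only the chain at h(key), whose expected length is the load factor α analyzed next.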


Analysis of hashing with Chaining: Assumptions

Assumptions
We have a load factor α = n/m, where m is the size of the hash table T and n is the number of elements to store.
Simple uniform hashing property:
   This means that any of the m slots can be selected.
   This means that if n = n_0 + n_1 + ... + n_{m−1}, we have that E[n_j] = α.

To simplify the analysis, you need to consider two cases
Unsuccessful search
Successful search


Why?

After all
You are always looking for keys when
   Searching
   Inserting
   Deleting

It is clear that we have two possibilities
Finding the key or not finding the key


For this, we have the following theorems

Theorem 11.1
In a hash table in which collisions are resolved by chaining, an unsuccessful search takes average-case time Θ(1 + α), under the assumption of simple uniform hashing.

Theorem 11.2
In a hash table in which collisions are resolved by chaining, a successful search takes average-case time Θ(1 + α), under the assumption of simple uniform hashing.


Analysis of hashing: Constant time

Finally
These two theorems tell us that if n = O(m), then

α = n/m = O(m)/m = O(1)

Or: search time is constant.

Analysis of hashing: Which hash function?

Consider that:
Good hash functions should maintain the property of simple uniform hashing!
The keys have the same probability 1/m to be hashed to any bucket!!!
A uniform hash function minimizes the likelihood of an overflow when keys are selected at random.

Then:
What should we use?
If we know that the keys are distributed uniformly in the interval 0 ≤ k < 1, then h(k) = ⌊km⌋.


What if...

Question:
What about something with keys in a normal distribution?

Possible hash functions when the keys are natural numbers

The division method
h(k) = k mod m.
Good choices for m are primes not too close to a power of 2.

The multiplication method
h(k) = ⌊m (kA mod 1)⌋ with 0 < A < 1.
The value of m is not critical.
Easy to implement in a computer.


When they are not, we need to interpret the keys as natural numbers

Keys interpreted as natural numbers
Given a string “pt”, we can say p = 112 and t = 116 (ASCII numbers).
ASCII has 128 possible symbols.
Then (128^1 × 112) + (128^0 × 116) = 14452.

Nevertheless
This is highly dependent on the origins of the keys!!!
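The radix-128 interpretation above can be sketched as a short loop (the function name is illustrative; unsigned long is assumed wide enough for short strings):

```c
/* Interpret an ASCII string as a natural number in radix 128:
   "pt" -> 128 * 112 + 116 = 14452, as in the slide. */
unsigned long key_from_string(const char *s)
{
    unsigned long k = 0;
    for (; *s != '\0'; s++)
        k = k * 128 + (unsigned char)*s;   /* shift in one radix-128 digit */
    return k;
}
```

The resulting natural number can then be fed to the division or multiplication method below.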


Hashing methods: The division method

Hash function
h(k) = k mod m

Problems with some selections
m = 2^p: h(k) is only the p lowest-order bits of k.
m = 2^p − 1: when k is a character string interpreted in radix 2^p, permuting the characters of k does not change the value.

It is better to select
Prime numbers not too close to an exact power of two.
   For example, given n = 2000 elements, we can use m = 701 because it is a prime near 2000/3 but not near a power of two.
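With the slide's numbers, the division method is a one-liner (hash_div and the fixed m = 701 are illustrative):

```c
/* Division-method hash with the slide's choice m = 701,
   a prime near 2000/3 and not close to a power of two. */
enum { M_DIV = 701 };

unsigned hash_div(unsigned long k)
{
    return (unsigned)(k % M_DIV);   /* h(k) = k mod m */
}
```

With n = 2000 keys this gives a load factor α = 2000/701 ≈ 2.85, so chains average about three elements.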



Hashing methods: The multiplication method

The multiplication method for creating hash functions has two steps
1 Multiply the key k by a constant A in the range 0 < A < 1 and extract the fractional part of kA.
2 Then, multiply the value by m and take the floor:

h(k) = ⌊m (kA mod 1)⌋

The mod allows us to extract that fractional part!!!
kA mod 1 = kA − ⌊kA⌋, 0 < A < 1.

Advantages:
m is not critical; normally m = 2^p.


Implementing in a computer

First
Imagine that the machine word has w bits and that k fits in those bits.

Second
Then, select an s in the range 0 < s < 2^w and assume A = s/2^w.

Third
Now, we multiply k by the number s = A·2^w.


Example

Fourth
The result of that is r_1·2^w + r_0, a 2w-bit value, where the p most significant bits of r_0 are the desired hash value.

[Figure: the 2w-bit product k·s split into the high word r_1 and the low word r_0; the hash extracts the top p bits of r_0]

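The four steps above can be sketched for w = 32. Here s = ⌊A·2^32⌋ with Knuth's often-suggested A = (√5 − 1)/2, and p = 14 is an arbitrary illustrative choice; unsigned multiplication in C wraps modulo 2^32, which yields exactly the low word r_0:

```c
#include <stdint.h>

/* Multiplication method with w = 32: s = floor(A * 2^32) for
   A = (sqrt(5) - 1) / 2; any 0 < s < 2^32 works. The hash is the
   p most significant bits of the low word of k * s. */
enum { P = 14 };                        /* table size m = 2^p = 16384 */
static const uint32_t S = 2654435769u;  /* floor(0.6180339887 * 2^32) */

uint32_t hash_mul(uint32_t k)
{
    uint32_t r0 = k * S;       /* low w bits of the 2w-bit product k*s */
    return r0 >> (32 - P);     /* keep the p most significant bits     */
}
```

No division is needed, only one multiply and one shift, which is why m = 2^p is the convenient choice here.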


However

Sooner or later
We may pick a hash function that does not give us the desired uniform randomized property.

Thus
We are required to analyze the possible clustering of the data by the hash function.


Measuring Clustering through a metric C

Definition
If bucket i contains n_i elements, then

C = (m / (n − 1)) [ (Σ_{i=1}^{m} n_i^2) / n − 1 ]   (2)

Properties
1 If C = 1, then you have uniform hashing.
2 If C > 1, the performance of the hash table is slowed down by clustering by approximately a factor of C.
3 If C < 1, the spread of the elements is more even than uniform!!! Not going to happen!!!
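Given observed bucket counts n_i, the metric C can be computed directly from equation (2) (the function name and the n ≤ 1 guard are illustrative):

```c
/* Clustering metric C = (m / (n - 1)) * ((1/n) * sum of n_i^2  -  1),
   computed from the observed bucket counts n_i of a table with m slots. */
double clustering(const int *counts, int m)
{
    long n = 0, sumsq = 0;
    for (int i = 0; i < m; i++) {
        n     += counts[i];
        sumsq += (long)counts[i] * counts[i];
    }
    if (n <= 1)
        return 1.0;   /* metric undefined for n <= 1; report the neutral value */
    return ((double)m / (double)(n - 1)) * ((double)sumsq / (double)n - 1.0);
}
```

With all n elements piled into one of m buckets this returns m, the worst case; a perfectly even spread returns a value below 1.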


However

Unfortunately
Hash tables do not give a way to measure clustering.

Thus, table designers
They should provide some clustering estimation as part of the interface.

Thus
The reason the clustering measure works is that it is based on an estimate of the variance of the distribution of bucket sizes.


Thus

First
If clustering is occurring, some buckets will have more elements than they should, and some will have fewer.

Second
There will be a wider range of bucket sizes than one would expect from a random hash function.


Analysis of C: First, keys are uniformly distributed

Consider the following random variable
Consider bucket i containing n_i elements, with X_ij = I{element j lands in bucket i}.

Then, given

n_i = Σ_{j=1}^{n} X_ij   (3)

We have that

E[X_ij] = 1/m,   E[X_ij^2] = 1/m   (4)


Next

We look at the dispersion of X_ij

Var[X_ij] = E[X_ij^2] − (E[X_ij])^2 = 1/m − 1/m^2   (5)

What about the expected number of elements at each bucket

E[n_i] = E[Σ_{j=1}^{n} X_ij] = n/m = α   (6)


Then, we have

Because of the independence of {X_ij}, the scattering of n_i is

Var[n_i] = Var[Σ_{j=1}^{n} X_ij] = Σ_{j=1}^{n} Var[X_ij] = n Var[X_ij]

Then
What about the range of the possible number of elements at each bucket?

Var[n_i] = n/m − n/m^2 = α − α/m

But, we have that

E[n_i^2] = E[Σ_{j=1}^{n} X_ij^2 + Σ_{j=1}^{n} Σ_{k=1, k≠j}^{n} X_ij X_ik]   (7)

Or

E[n_i^2] = n/m + Σ_{j=1}^{n} Σ_{k=1, k≠j}^{n} 1/m^2   (8)


Thus

We re-express the range in terms of expected values of n_i

E[n_i^2] = n/m + n(n − 1)/m^2   (9)

Then

E[n_i^2] − E[n_i]^2 = n/m + n(n − 1)/m^2 − n^2/m^2
                    = n/m − n/m^2
                    = α − α/m


Then

Finally, we have that

E[n_i^2] = α(1 − 1/m) + α^2   (10)

Then, we have that

Now we build an estimator of the mean of n_i^2, which is part of C

(1/n) Σ_{i=1}^{m} n_i^2   (11)

Thus

E[(1/n) Σ_{i=1}^{m} n_i^2] = (1/n) Σ_{i=1}^{m} E[n_i^2]
                           = (m/n) [α(1 − 1/m) + α^2]
                           = 1 − 1/m + α


Finally

We can plug back into C using the expected value

E[C] = (m/(n − 1)) [E[(Σ_{i=1}^{m} n_i^2)/n] − 1]
     = (m/(n − 1)) [1 − 1/m + α − 1]
     = (m/(n − 1)) [n/m − 1/m]
     = (m/(n − 1)) [(n − 1)/m]
     = 1

Explanation

Using a hash function that enforces a uniform distribution in the buckets
We get that C = 1, i.e., the best distribution of keys.

Now, we have a really horrible hash function ≡ it hits only one of every b buckets

Thus

E[X_ij] = E[X_ij^2] = b/m   (12)

Thus, we have

E[n_i] = αb   (13)

Then, we have

E[(1/n) Σ_{i=1}^{m} n_i^2] = (1/n) Σ_{i=1}^{m} E[n_i^2] = αb − b/m + 1


Finally

We can plug back into C using the expected value

E[C] = (m/(n − 1)) [E[(Σ_{i=1}^{m} n_i^2)/n] − 1]
     = (m/(n − 1)) [αb − b/m + 1 − 1]
     = (m/(n − 1)) [nb/m − b/m]
     = (m/(n − 1)) [b(n − 1)/m]
     = b

Explanation

Using a hash function that hits only one of every b buckets
We get that C = b > 1, i.e., a really bad distribution of the keys!!!

Thus, you only need the following to evaluate a hash function

(1/n) Σ_{i=1}^{m} n_i^2   (14)



A Possible Solution: Universal Hashing

Issues
In practice, keys are not randomly distributed.
Any fixed hash function might yield Θ(n) retrieval time.

Goal
To find hash functions that produce uniform random table indexes irrespective of the keys.

Idea
To select a hash function at random, from a carefully designed class of functions, at the beginning of the execution.


Hashing methods: Universal hashing

Example

[Figure: a set of hash functions; one is chosen at random at the beginning of the execution and used to index the hash table]

Definition of universal hash functions

Definition
Let H = {h : U → {0, 1, ..., m − 1}} be a family of hash functions. H is called a universal family if

∀x, y ∈ U, x ≠ y : Pr_{h∈H}(h(x) = h(y)) ≤ 1/m   (15)

Main result
With universal hashing, the chance of collision between distinct keys k and l is no more than the 1/m chance of collision we would get if locations h(k) and h(l) were randomly and independently chosen from the set {0, 1, ..., m − 1}.


Hashing methods: Universal hashing

Theorem 11.3
Suppose that a hash function h is chosen randomly from a universal collection of hash functions and has been used to hash n keys into a table T of size m, using chaining to resolve collisions. If key k is not in the table, then the expected length E[n_{h(k)}] of the list that key k hashes to is at most the load factor α = n/m. If key k is in the table, then the expected length E[n_{h(k)}] of the list containing key k is at most 1 + α.

Corollary 11.4
Using universal hashing and collision resolution by chaining in an initially empty table with m slots, it takes expected time Θ(n) to handle any sequence of n INSERT, SEARCH, and DELETE operations containing O(m) INSERT operations.


Example of Universal Hash

Proceed as follows:
Choose a prime number p large enough so that every possible key k is in the range [0, ..., p − 1], and let

Z_p = {0, 1, ..., p − 1} and Z*_p = {1, ..., p − 1}

Define the following hash function:

h_{a,b}(k) = ((ak + b) mod p) mod m, ∀a ∈ Z*_p and b ∈ Z_p

The family of all such hash functions is:

H_{p,m} = {h_{a,b} : a ∈ Z*_p and b ∈ Z_p}

Important
a and b are chosen randomly at the beginning of execution.
The class H_{p,m} of hash functions is universal.
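A sketch of this family in C, using the slides' p = 977 and m = 50; drawing a and b with rand() is an illustrative simplification, not a recommendation:

```c
#include <stdlib.h>

/* The universal family H_{p,m}: h_{a,b}(k) = ((a*k + b) mod p) mod m,
   with p prime, a in Z*_p = {1,...,p-1}, and b in Z_p = {0,...,p-1}. */
enum { PRIME = 977, M_UNIV = 50 };

typedef struct { unsigned long a, b; } UnivHash;

/* Pick (a, b) once, at the beginning of execution. */
UnivHash univ_pick(void)
{
    UnivHash h;
    h.a = 1 + (unsigned long)rand() % (PRIME - 1);  /* a in Z*_p */
    h.b = (unsigned long)rand() % PRIME;            /* b in Z_p  */
    return h;
}

unsigned univ_hash(UnivHash h, unsigned long k)     /* k in [0, p-1] */
{
    return (unsigned)(((h.a * k + h.b) % PRIME) % M_UNIV);
}
```

Because the function is fixed for the lifetime of the table, lookups stay consistent; only an adversary who knows (a, b) can force collisions.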


Example: Universal hash functions

Example
p = 977, m = 50, a and b random numbers:
   h_{a,b}(k) = ((ak + b) mod p) mod m

Example of key distribution

[Figure: distribution of the keys, mean = 488.5, dispersion = 5]

Example with 10 keys
[Figure: Universal Hashing vs. the Division Method, 10 keys]

Example with 50 keys
[Figure: Universal Hashing vs. the Division Method, 50 keys]

Example with 100 keys
[Figure: Universal Hashing vs. the Division Method, 100 keys]

Example with 200 keys
[Figure: Universal Hashing vs. the Division Method, 200 keys]

Another Example: Matrix Method

Then
Let us say keys are u bits long.
Say the table size M is a power of 2.
An index is b bits long, with M = 2^b.

The h function
Pick h to be a random b-by-u 0/1 matrix, and define h(x) = hx, where after the inner products we apply mod 2.

Example (b = 3, u = 4)

        | 1 0 0 0 |         | 1 |            | 1 |
    h = | 0 1 1 1 |     x = | 0 |     h(x) = | 1 |
        | 1 1 1 0 |         | 1 |            | 0 |
                            | 0 |
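A sketch of the matrix method for this example, storing each row of h as a u-bit mask (the names are illustrative; parity implements the inner product mod 2):

```c
/* Matrix-method hash with b rows of u bits each: bit i of h(x) is the
   parity of (row_i AND x), i.e. the row-vector inner product mod 2. */
static int parity(unsigned v)
{
    int p = 0;
    while (v) { p ^= 1; v &= v - 1; }  /* flip once per set bit */
    return p;
}

unsigned matrix_hash(const unsigned rows[], int b, unsigned x)
{
    unsigned idx = 0;
    for (int i = 0; i < b; i++)
        idx = (idx << 1) | (unsigned)parity(rows[i] & x);
    return idx;                        /* a b-bit table index */
}
```

For the slide's matrix (rows 1000, 0111, 1110 in binary) and x = 1010, this produces the index 110 shown above.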


Proof of being a Universal Family

First, assume that you have two different keys l ≠ m
Without loss of generality, assume the following:
1 l_i ≠ m_i, say l_i = 0 and m_i = 1
2 l_j = m_j ∀j ≠ i

Thus
Column i does not contribute to the final answer of h(l), because of the zero!!!

Now
Imagine that we fix all the other columns in h; then there is only one answer for h(l).


Now
For the i-th column, there are 2^b possible columns when changing the ones and zeros.

Thus, given the randomness of the zeros and ones
The probability that we get the zero column

(0, 0, ..., 0)^T (16)

is equal to

1/2^b (17)

with h(l) = h(m)

60 / 81

Page 106: 08 Hash Tables

Then

We get the probability

P(h(l) = h(m)) ≤ 1/2^b (18)

61 / 81

Page 107: 08 Hash Tables

Implementation of the row · vector product mod 2

Code

int product(int row, int vector) {
    int i = row & vector;
    /* SWAR population count of i */
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    i = (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
    return i & 0x00000001; /* popcount mod 2 = parity */
}

62 / 81

Page 108: 08 Hash Tables

Advantages of universal hashing

Advantages
Universal hashing provides good results on average, independently of the keys to be stored.
It guarantees that no single input will always elicit the worst-case behavior.
Poor performance occurs only when the random choice returns an inefficient hash function; this has a small probability.

63 / 81

Page 109: 08 Hash Tables

Open addressing

Definition
All the elements occupy the hash table itself.

What is it?
We systematically examine table slots until either we find the desired element or we have ascertained that the element is not in the table.

Advantages
The advantage of open addressing is that it avoids pointers altogether.

64 / 81

Page 112: 08 Hash Tables

Insert in Open addressing

Extended hash function to probe
Instead of probing in the fixed order 0, 1, 2, ..., m − 1, with Θ(n) search time, extend the hash function to

h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1}

This gives the probe sequence ⟨h(k, 0), h(k, 1), ..., h(k, m − 1)⟩:

    A permutation of ⟨0, 1, 2, ..., m − 1⟩

65 / 81

Page 113: 08 Hash Tables

Hashing methods in Open Addressing

HASH-INSERT(T, k)
1  i = 0
2  repeat
3      j = h(k, i)
4      if T[j] == NIL
5          T[j] = k
6          return j
7      else i = i + 1
8  until i == m
9  error "Hash Table Overflow"

66 / 81

Page 114: 08 Hash Tables

Hashing methods in Open Addressing

HASH-SEARCH(T, k)
1  i = 0
2  repeat
3      j = h(k, i)
4      if T[j] == k
5          return j
6      i = i + 1
7  until T[j] == NIL or i == m
8  return NIL

67 / 81

Page 115: 08 Hash Tables

Outline
1 Basic data structures and operations

2 Hash tables
    Hash tables: Concepts
    Analysis of hashing under Chaining

3 Hashing Methods
    The Division Method
    The Multiplication Method
    Clustering Analysis of Hashing Functions
    A Possible Solution: Universal Hashing

4 Open Addressing
    Linear Probing
    Quadratic Probing
    Double Hashing

5 Exercises

68 / 81

Page 116: 08 Hash Tables

Linear probing: Definition and properties

Hash function
Given an ordinary hash function h′ : U → {0, 1, ..., m − 1}, for i = 0, 1, ..., m − 1 we get the extended hash function

h(k, i) = (h′(k) + i) mod m, (19)

Sequence of probes
Given key k, we first probe T[h′(k)], then T[h′(k) + 1] and so on up to T[m − 1]. Then we wrap around to T[0], T[1], ..., T[h′(k) − 1].

Distinct probes
Because the initial probe determines the entire probe sequence, there are only m distinct probe sequences.

69 / 81

Page 119: 08 Hash Tables

Linear probing: Definition and properties

Disadvantages
Linear probing suffers from primary clustering.
Long runs of occupied slots build up, increasing the average search time.

    Clusters arise because an empty slot preceded by i full slots gets filled next with probability (i + 1)/m.

Long runs of occupied slots therefore tend to get longer, and the average search time increases.

70 / 81

Page 120: 08 Hash Tables

Example

Example using uniformly distributed keys
It was generated using the division method

Then

71 / 81

Page 122: 08 Hash Tables

Example

Example using Gaussian keys
It was generated using the division method

Then

72 / 81

Page 124: 08 Hash Tables

Outline
1 Basic data structures and operations

2 Hash tables
    Hash tables: Concepts
    Analysis of hashing under Chaining

3 Hashing Methods
    The Division Method
    The Multiplication Method
    Clustering Analysis of Hashing Functions
    A Possible Solution: Universal Hashing

4 Open Addressing
    Linear Probing
    Quadratic Probing
    Double Hashing

5 Exercises

73 / 81

Page 125: 08 Hash Tables

Quadratic probing: Definition and properties

Hash function
Given an auxiliary hash function h′ : U → {0, 1, ..., m − 1}, for i = 0, 1, ..., m − 1 we get the extended hash function

h(k, i) = (h′(k) + c1·i + c2·i^2) mod m, (20)

where c1, c2 are auxiliary constants.

Sequence of probes
Given key k, we first probe T[h′(k)]; later positions probed are offset by amounts that depend in a quadratic manner on the probe number i.
The initial probe determines the entire sequence, and so only m distinct probe sequences are used.

74 / 81

Page 127: 08 Hash Tables

Quadratic probing: Definition and properties

Advantages
This method works much better than linear probing, but to make full use of the hash table, the values of c1, c2, and m are constrained.

Disadvantages
If two keys have the same initial probe position, then their probe sequences are the same, since h(k1, 0) = h(k2, 0) implies h(k1, i) = h(k2, i). This property leads to a milder form of clustering, called secondary clustering.

75 / 81

Page 129: 08 Hash Tables

Outline
1 Basic data structures and operations

2 Hash tables
    Hash tables: Concepts
    Analysis of hashing under Chaining

3 Hashing Methods
    The Division Method
    The Multiplication Method
    Clustering Analysis of Hashing Functions
    A Possible Solution: Universal Hashing

4 Open Addressing
    Linear Probing
    Quadratic Probing
    Double Hashing

5 Exercises

76 / 81

Page 130: 08 Hash Tables

Double hashing: Definition and properties

Hash function
Double hashing uses a hash function of the form

h(k, i) = (h1(k) + i·h2(k)) mod m, (21)

where i = 0, 1, ..., m − 1 and h1, h2 are auxiliary hash functions (normally drawn from a universal family).

Sequence of probes
Given key k, we first probe T[h1(k)]; successive probe positions are offset from previous positions by the amount h2(k), mod m.
Thus, unlike the case of linear or quadratic probing, the probe sequence here depends in two ways upon the key k, since the initial probe position, the offset, or both may vary.

77 / 81

Page 132: 08 Hash Tables

Double hashing: Definition and properties

Advantages
When m is prime or a power of 2, double hashing improves over linear or quadratic probing in that Θ(m^2) probe sequences are used, rather than Θ(m), since each possible (h1(k), h2(k)) pair yields a distinct probe sequence.
The performance of double hashing appears to be very close to the performance of the “ideal” scheme of uniform hashing.

78 / 81

Page 133: 08 Hash Tables

Why?

Look At This

79 / 81

Page 134: 08 Hash Tables

Analysis of Open Addressing

Theorem 11.6
Given an open-address hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1 − α), assuming uniform hashing.

Corollary
Inserting an element into an open-address hash table with load factor α requires at most 1/(1 − α) probes on average, assuming uniform hashing.

Theorem 11.8
Given an open-address hash table with load factor α < 1, the expected number of probes in a successful search is at most (1/α) ln(1/(1 − α)), assuming uniform hashing and assuming that each key in the table is equally likely to be searched for.

80 / 81

Page 137: 08 Hash Tables

Exercises

From Cormen’s book, chapter 11:
11.1-2
11.2-1
11.2-2
11.2-3
11.3-1
11.3-3

81 / 81