Hash Tables

1

Hash Tables

Gordon College

CS212

2

Hash Tables

• Recall order of magnitude of searches– Linear search O(n)

– Binary search O(log2n)

– Balanced binary tree search O(log2n)

– Unbalanced binary tree can degrade to O(n)

3

Hash Tables

• In some situations faster search is needed– Solution is to use a hash function– Value of key field given to hash function– Location in a hash table is calculated

Like an array but much better:

Do not have to set aside space to account for every possible key

4

Hash Functionsmapping from key to index

• Simple function: mod (%) the key by arbitrary integer

int h(int i){ return i % maxSize; }

• Note the max number of locations in table maxSize

5

Hash Function Access

• Note that we have traded speed for wasted space–Table must be considerably larger than number of items anticipated

6

Hash Function Access

• Example:7 digit serial number

Need 10 million records*

* Not practical to have this much space when in reality you are only stocking at most a few thousand records

Why 10 million records?

n!/(n-r)! 10!/(10-7)!Number of r-permutations of a set with n elements

7

Hash Function (Mapping)

• Example:7 digit serial number

Use only 10000 slots

Hashing (Mapping) function - unsigned int Hf(int key)

Hf(1234567) = 1234567 % 10000 = 4567

1234567/10000 = 123.4567

1234567 - (123 * 10000) = 4567

8

Hash Function (Mapping)

• Design Considerations– Efficient– Minimize collisions– Produce uniformly distributed mappings

• (helps minimize collisions)

– Must be able to deal with int, char, string, etc. types for keys

– Must be able to associate a hash function with a container

9

Function Objects• Can pass a function to a function• Can use Function Objects

template<typename t>class functionobject{

public:returntype operator() (arguments) const

{return returnvalue;

}…….};

10

Function ObjectsExample function class: less than

template<typename T>

class lessThan {public:

bool operator() (const T& x, const T& y) const

{return x < y;

}

};

11

Function ObjectsExample function class use

template <typename T, typename Compare>void insertionSort(vector<T>& v, Compare comp){ int i, j, n = v.size(); T temp;

…..}

Called:

insertionSort(v, lessThan<int>());

12

Function ObjectsExample function class use (as seen with the SET container)

template<typename T>class lessThan {

public:bool operator() (const T& x, const T& y) const{

return x < y;}

};

set <int, lessThan<int> > A(arr, arr+arrSize);

for(set <int, lessThan<int> >::iterator ii=A.begin();ii!=A.end();ii++)cout << *ii << " ";

cout << endl;

13

CollisionsHash Function Access Problem

Collisions are possible: Depending on the number of slots and the size of the key mapping

14

CollisionsHash Function Access Problem

• Problem: same value returned by h(i) for different values of i– Called collisions

• Simple solution: linear probing– Linear search begins at

collision location– Continues until empty

slot found for insertion

15

Linear Probing77

89

14

94

0

1

2

3

4

5

6

7

8

9

10

(a)

1

1

1

1

1

Insert54, 77, 94, 89, 14

2

77

89

45

14

94

0

1

2

3

4

5

6

7

8

9

10

(b)

1

1

1

1

1

Insert45

2

77

89

45

14

35

94

0

1

2

3

4

5

6

7

8

9

10

(c)

1

1

1

1

1

Insert35

3

2

77

89

45

14

35

76

94

0

1

2

3

4

5

6

7

8

9

10

(d)

1

1

1

1

1

Insert76

3

7

54 54 5454

16

Hash Functions

• Retrieving a value:linear probe until found– If empty slot encountered

then value is not in table

• What if deletions permitted? Slot can be marked so it will

not be empty and cause an

invalid linear probe

17

Hash Functions

• Improved performance strategies:– Increase table capacity (less collisions)– Use different collision resolution technique– Devise different hash function

• Hash table capacity– Size of table must be 1.5 to 2 times the size of

the number of items to be stored– Otherwise probability of collisions is too high

18

Other Collision Strategies

• Linear probing can result in primary clustering

Consider:• quadratic probing

– Probe sequence from location i isi + 1, i – 1, i + 4, i – 4, i + 9, i – 9, …

– Secondary clusters can still form

• Double hashing – Use a second hash function to determine probe

sequence• hF(key) --> index hF(index)--> next index

19

Collision Strategies

• Chaining– Table is a list or vector of head nodes to linked

lists– When item hashes to location, it is added to that

linked list

20

Chaining< bucket0 >

< bucketn-1 >

< bucket2 >

< bucket1 >

. . . .

< Bucket1 > 89(1) 45(2)

< Bucket0 >

< Bucket3 > 14(1)

< Bucket2 > 35(1)

< Bucket10> 54(1) 76(2)

< Bucket6 > 94(1)

< Bucket9 >

< Bucket8 >

< Bucket7 >

< Bucket5 >

< Bucket4 >

77(1)

21

Improving the Hash Function

• Ideal hash function– Simple to evaluate (fast)– Scatters items uniformly throughout table

• Modulo arithmetic not so good for strings– Possible to manipulate numeric (ASCII) value of

first and last characters of a name

22

Hash Function (basic mapping)

class hFintID

{

public:

unsigned int operator() (int item) const

{

return (unsigned int) item % 10000;

}

};

hFintID hf;Hf(12341234) = 1234;

23

Hash Function (better)

class hFint {

public:unsigned int operator() (int item) const{

unsigned int value = (unsigned int) item;value *= value;value /=256; //discard low order 8 bits

// (division performs a shift right)

return value % 65536;}

};

Midsquare techniquemixes up the digits in the serial number

24

String Hash Functions

class hFstring {

public:unsigned int operator() (const string & item) const{

unsigned int prime = 2049982463;int n = 0, i;for (i = 0; i < item.length(); i++)

n = n*8 + item[i];

return n > 0 ? (n % prime) : (-n % prime);

}};

GOAL: random distribution

25

Custom Hash Functions

class hfCode {

public:unsigned int operator() (const code & item) const{

return (unsigned int )item.getNum % NumofSlots;

}};

FILE0000.CHK, FILE0001.CHK, FILE0002.CHK

26

Search AlgorithmsSequential Search

- search O(n) (fairly slow)

+ good when data set size is small and does have to be sorted

Binary Search (sorted vector)+ search O(log n) [much faster]

+ low cost when it comes to space

- however, requires data be sorted

- not good when the data set is very dynamic (sorting overhead)

Binary Search Tree+ search O(log n)

+ can scan data in order

- higher cost when it comes to space (various pointers)

Hashing+ search O(1) [fastest]

- higher cost when it comes to space (depends on method)

Documents

Hash Tables