Lecture 10: Hashing and Dynamic Dictionary Shang-Hua Teng


Page 1: Lecture 10: Hashing and Dynamic Dictionary

Lecture 10: Hashing and Dynamic Dictionary

Shang-Hua Teng

Page 2: Lecture 10: Hashing and Dynamic Dictionary

Dictionary/Table

Keys

Operation supported: search
Given a student ID, find the record (entry).

Page 3: Lecture 10: Hashing and Dynamic Dictionary

Data Format

Keys Entry

Page 4: Lecture 10: Hashing and Dynamic Dictionary

What if the student ID is a 9-digit social security number?

• Well, we can still sort by the ids and apply binary search.

• If we have n students, we need O(n) space

• And O(log n) search time
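A minimal sketch of the sorted-array lookup described above, assuming the IDs are stored as integers in ascending order with a parallel list of records. The function name `find_record` and the sample data are hypothetical and only serve to illustrate the O(log n) search.

```python
from bisect import bisect_left

def find_record(sorted_ids, records, student_id):
    """Return the record for student_id, or None if it is absent.

    sorted_ids must be in ascending order; records[i] belongs to sorted_ids[i].
    """
    i = bisect_left(sorted_ids, student_id)  # binary search, O(log n)
    if i < len(sorted_ids) and sorted_ids[i] == student_id:
        return records[i]
    return None

# Hypothetical sample data: 9-digit IDs stored as integers.
ids = [123456789, 234567890, 987654321]
recs = ["Ann", "Bob", "Cy"]
```

Both lists together take O(n) space, matching the bound on the slide.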

Page 5: Lecture 10: Hashing and Dynamic Dictionary

What if new students come and current students leave?

• Dynamic dictionary
  – Yellow pages are updated once in a while
  – Which is not truly dynamic

• Operations to support
  – Insert: add a new (key, entry) pair
  – Delete: remove a (key, entry) pair from the dictionary
  – Search: given a key, find whether it is in the dictionary and, if it is, return the data record associated with the key

Page 6: Lecture 10: Hashing and Dynamic Dictionary

How should we implement a dynamic dictionary?

• How often are entries inserted and removed?

• How many of the possible key values are likely to be used?

• What is the likely pattern of searching for keys?

Page 7: Lecture 10: Hashing and Dynamic Dictionary

(Key, Entry) pair

• For searching purposes, it is best to store the key and the entry separately (even though the key’s value may be inside the entry)

key        entry
“Smith”    “Smith”, “124 Hawkers Lane”, “9675846”
“Yeo”      “Yeo”, “1 Apple Crescent”, “0044 1970 622455”

Page 8: Lecture 10: Hashing and Dynamic Dictionary

Implementation 1: unsorted sequential array

• An array in which (key, entry) pairs are stored consecutively in any order

• insert: add to back of array; O(1)

• search: search through the keys one at a time, potentially all of the keys; O(n)

• remove: find + replace removed node with last node; O(n)

[Figure: array with slots 0, 1, 2, 3, and so on, each holding a (key, entry) pair]
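The unsorted-array operations can be sketched as follows. This is an illustration only: the class name `UnsortedArrayDict` is hypothetical, keys are assumed to be distinct, and a Python list stands in for the array (so the O(1) insert is amortized).

```python
class UnsortedArrayDict:
    """Dictionary kept as an unsorted array of (key, entry) pairs."""

    def __init__(self):
        self.pairs = []  # (key, entry) tuples in arbitrary order

    def insert(self, key, entry):
        # Add to the back of the array; assumes key is not already present.
        self.pairs.append((key, entry))

    def _index(self, key):
        # Linear scan over potentially all keys: O(n).
        for i, (k, _) in enumerate(self.pairs):
            if k == key:
                return i
        return -1

    def search(self, key):
        i = self._index(key)
        return self.pairs[i][1] if i >= 0 else None

    def remove(self, key):
        i = self._index(key)          # the find step dominates: O(n)
        if i < 0:
            return False
        self.pairs[i] = self.pairs[-1]  # overwrite with the last pair
        self.pairs.pop()                # then drop the last slot: O(1)
        return True
```

Replacing the removed pair with the last one avoids shifting elements, which is possible only because the order of the array does not matter.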

Page 9: Lecture 10: Hashing and Dynamic Dictionary

Implementation 2: sorted sequential array

• An array in which (key, entry) pairs are stored consecutively, sorted by key

• insert: add in sorted order; O(n)

• find: binary search; O(log n)

• remove: find, remove node and shuffle down; O(n)

[Figure: array with slots 0, 1, 2, 3, and so on, each holding a (key, entry) pair in key order]

Page 10: Lecture 10: Hashing and Dynamic Dictionary

Implementation 3: linked list (unsorted or sorted)

• (key, entry) pairs are again stored in sequence, one pair per list node

• insert: add to front; O(1), or O(n) for a sorted list

• find: search through potentially all the keys, one at a time; O(n), still O(n) for a sorted list

• remove: find, remove using pointer alterations; O(n)

[Figure: linked list of (key, entry) nodes, and so on]

Page 11: Lecture 10: Hashing and Dynamic Dictionary

Direct Addressing

• Suppose:
  – The range of keys is 0..m-1 (the universe)
  – Keys are distinct

• The idea:
  – Set up an array T[0..m-1] in which
    • T[i] = x if x ∈ T and key[x] = i
    • T[i] = NULL otherwise

Page 12: Lecture 10: Hashing and Dynamic Dictionary

• Direct addressing is a simple technique that works well when the universe of keys is small.

• Assuming each key corresponds to a unique slot:

Direct-Address-Search(T, k)
    return T[k]

Direct-Address-Insert(T, x)
    T[key[x]] ← x

Direct-Address-Delete(T, x)
    T[key[x]] ← NIL

[Figure: direct-address table T with slots 0..7; slots 1, 5, and 7 point to entries, all other slots are empty (/)]

O(1) time for all operations
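A small sketch of a direct-address table, assuming integer keys in the range 0..m-1 and using `None` in place of NULL. The class name `DirectAddressTable` and the sample entries are illustrative choices, not part of the lecture.

```python
class DirectAddressTable:
    """Array T[0..m-1] indexed directly by key; every operation is O(1)."""

    def __init__(self, m):
        self.slots = [None] * m  # None plays the role of NULL

    def search(self, k):
        return self.slots[k]

    def insert(self, key, entry):
        self.slots[key] = entry

    def delete(self, key):
        self.slots[key] = None

# Keys 1, 5, and 7 out of a universe of size 8.
t = DirectAddressTable(8)
for key, entry in [(1, "one"), (5, "five"), (7, "seven")]:
    t.insert(key, entry)
```

Note that creating the table already costs O(m) time and space, regardless of how many keys are stored.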

Page 13: Lecture 10: Hashing and Dynamic Dictionary

The Problem With Direct Addressing

• Direct addressing works well when the range m of keys is relatively small

• But what if the keys are 32-bit integers?
  – Example: spell checking
  – Problem 1: the direct-address table will have 2^32 entries, more than 4 billion
  – Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be prohibitive

• Solution: map keys to a smaller range 0..m-1

• This mapping is called a hash function

Page 14: Lecture 10: Hashing and Dynamic Dictionary

Hash function

• A hash function determines the slot of the hash table where the key is placed.

• In the previous example, the hash function is the identity function

• We say that a record with key k hashes into slot h(k)

[Figure: universe of keys U containing the set K of actual keys k1..k5, mapped by h into slots 0..m-1 of table T; h(k2) = h(k5)]

Page 15: Lecture 10: Hashing and Dynamic Dictionary

Next Problem

• Collision: two keys hash to the same slot

[Figure: keys k1..k5 from K ⊆ U mapped by h into table T[0..m-1]; k2 and k5 collide because h(k2) = h(k5)]

Page 16: Lecture 10: Hashing and Dynamic Dictionary

Pigeonhole Principle

[Photo: Parque de las Palomas, San Juan, Puerto Rico]

Page 17: Lecture 10: Hashing and Dynamic Dictionary

Resolving Collisions

• How can we solve the problem of collisions?

• Solution 1: chaining

• Solution 2: open addressing

Page 18: Lecture 10: Hashing and Dynamic Dictionary

Chaining

• Chaining puts elements that hash to the same slot in a linked list:

[Figure: table T whose slots point to linked lists of the keys k1..k8 from K ⊆ U; keys that collide share a list, e.g. k4 → k1 and k5 → k2; unused slots are empty]

Page 19: Lecture 10: Hashing and Dynamic Dictionary

Chaining (insert at the head)

[Figure: table T after k1 has been inserted at the head of its slot's list; all other slots are empty]

Page 20: Lecture 10: Hashing and Dynamic Dictionary

Chaining (insert at the head)

[Figure: table T after k1, k2, and k3 have been inserted]

Page 21: Lecture 10: Hashing and Dynamic Dictionary

Chaining (insert at the head)

[Figure: table T after k4 is inserted; k4 collides with k1 and is placed at the head of that list, giving k4 → k1]

Page 22: Lecture 10: Hashing and Dynamic Dictionary

Chaining (insert at the head)

[Figure: table T after k5 and k6 are inserted; k5 collides with k2 and is placed at the head of that list, giving k5 → k2]

Page 23: Lecture 10: Hashing and Dynamic Dictionary

Chaining (insert at the head)

[Figure: table T after all eight keys k1..k8 have been inserted; colliding keys share a list, e.g. k4 → k1 and k5 → k2]

Page 24: Lecture 10: Hashing and Dynamic Dictionary

Operations

Chained-Hash-Search(T, k)
    Search for an element with key k in list T[h(k)]
    (running time is proportional to the length of the list)

Chained-Hash-Insert(T, x)
    Insert x at the head of the list T[h(key[x])]
    (worst case O(1))

Chained-Hash-Delete(T, x)
    Delete x from the list T[h(key[x])]
    (For a singly linked list we might need to find the predecessor first, so the complexity is the same as that of search)
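The three chaining operations can be sketched with singly linked lists as below. The names `Node` and `ChainedHashTable` are hypothetical, keys are assumed to be integers, and `key % m` is used as the hash function purely for illustration.

```python
class Node:
    def __init__(self, key, entry, nxt=None):
        self.key = key
        self.entry = entry
        self.next = nxt

class ChainedHashTable:
    """Hash table with collisions resolved by chaining."""

    def __init__(self, m):
        self.m = m
        self.heads = [None] * m  # head of the list for each slot

    def _h(self, key):
        return key % self.m  # assumed hash function, for illustration only

    def insert(self, key, entry):
        # Insert at the head of the list: O(1) worst case.
        slot = self._h(key)
        self.heads[slot] = Node(key, entry, self.heads[slot])

    def search(self, key):
        # Time proportional to the length of the list in slot h(key).
        node = self.heads[self._h(key)]
        while node is not None:
            if node.key == key:
                return node.entry
            node = node.next
        return None

    def delete(self, key):
        # Singly linked list: walk the chain to find the predecessor.
        slot = self._h(key)
        prev, node = None, self.heads[slot]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[slot] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False
```

With a doubly linked list and a pointer to the node itself, deletion could avoid the walk; this sketch deletes by key, so it costs the same as a search.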

Page 25: Lecture 10: Hashing and Dynamic Dictionary

Analysis of hashing with chaining

• Given a hash table with m slots and n elements

• The load factor α = n/m

• The worst-case behavior is when all n elements hash into the same location (Θ(n) for searching)

• The average performance depends on how well the hash function distributes elements

• Assumption: simple uniform hashing: any element is equally likely to hash into any of the m slots

• For any key k, h(k) can be computed in O(1)

• Two cases for a search:
  – The search is unsuccessful
  – The search is successful

Page 26: Lecture 10: Hashing and Dynamic Dictionary

Unsuccessful search

Theorem 11.1: In a hash table in which collisions are resolved by chaining, an unsuccessful search takes Θ(1 + α) time, on average, under the assumption of simple uniform hashing.

Proof:

• Simple uniform hashing ⇒ any key k is equally likely to hash into any of the m slots.

• The average time to search for a given key k is the time it takes to search a given slot.

• The average length of the list in each slot is α = n/m, the load factor.

• The time it takes to compute h(k) is O(1). Total time is Θ(1 + α).

Page 27: Lecture 10: Hashing and Dynamic Dictionary

Successful Search

Theorem 11.2: In a hash table in which collisions are resolved by chaining, a successful search takes Θ(1 + α) time, on average, under the assumption of simple uniform hashing.

Proof:

• Simple uniform hashing ⇒ any key k is equally likely to hash into any of the m slots.

• Note: Chained-Hash-Insert inserts a new element at the front of the list

• The expected number of elements visited during the search is 1 more than the number of elements of the list after the element is inserted

Page 28: Lecture 10: Hashing and Dynamic Dictionary

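Successful Search

• Take the average over the n elements:

$$\frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right) = 1 + \frac{1}{nm}\sum_{i=1}^{n}(i-1) \qquad (1)$$

$$= 1 + \frac{1}{nm}\cdot\frac{(n-1)\,n}{2} \qquad (2)$$

$$= 1 + \frac{\alpha}{2} - \frac{1}{2m} \qquad (3)$$

• (i − 1)/m is the expected length of the list to which element i was added. The expected length of each list increases as more elements are added.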
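As a sanity check on the derivation, the average of 1 + (i − 1)/m over i = 1..n can be compared with the closed form 1 + α/2 − 1/(2m) using exact rational arithmetic. The function names below are hypothetical; this is only a numerical check, not part of the proof.

```python
from fractions import Fraction

def average_successful_cost(n, m):
    """(1/n) * sum over i = 1..n of (1 + (i - 1)/m), computed exactly."""
    return sum(1 + Fraction(i - 1, m) for i in range(1, n + 1)) / n

def closed_form(n, m):
    """1 + alpha/2 - 1/(2m), with alpha = n/m."""
    alpha = Fraction(n, m)
    return 1 + alpha / 2 - Fraction(1, 2 * m)
```

For example, with n = 10 and m = 5 both expressions evaluate to 19/10.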

Page 29: Lecture 10: Hashing and Dynamic Dictionary

Analysis of Chaining

• Assume simple uniform hashing: each key in the table is equally likely to be hashed to any slot

• Given n keys and m slots in the table, the load factor α = n/m = average # keys per slot

• What will be the average cost of an unsuccessful search for a key? O(1 + α)

• What will be the average cost of a successful search? O(1 + α/2) = O(1 + α)

Page 30: Lecture 10: Hashing and Dynamic Dictionary

Analysis of Chaining Continued

• So the cost of searching = O(1 + α)

• If the number of keys n is proportional to the number of slots in the table, what is α?

• A: α = O(1)
  – In other words, we can make the expected cost of searching constant if we make α constant

Page 31: Lecture 10: Hashing and Dynamic Dictionary

Choosing A Hash Function

• Choosing the hash function well is crucial
  – A bad hash function puts all elements in the same slot
  – A good hash function:
    • Should distribute keys uniformly into slots
    • Should not depend on patterns in the data

• Three popular methods:
  – Division method
  – Multiplication method
  – Universal hashing

Page 32: Lecture 10: Hashing and Dynamic Dictionary

The Division Method

• h(k) = k mod m
  – In words: hash k into a table with m slots using the slot given by the remainder of k divided by m

• Elements with adjacent keys hashed to different slots: good

• If keys bear relation to m: bad

• In practice: pick table size m = a prime number not too close to a power of 2 (or 10)
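A short illustration of the division method and of why keys that "bear relation to m" are a problem. The function name `division_hash` and the sample keys are assumptions made for this sketch.

```python
def division_hash(k, m):
    """Division method: the slot is the remainder of k divided by m."""
    return k % m

# Hypothetical keys that are all multiples of 8.
keys = [8, 16, 24, 32, 40]

slots_pow2 = [division_hash(k, 8) for k in keys]    # m = 8: every key lands in slot 0
slots_prime = [division_hash(k, 13) for k in keys]  # m = 13: the keys spread out
```

With m = 8 all five keys collide, while the prime table size 13 sends them to five different slots.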

Page 33: Lecture 10: Hashing and Dynamic Dictionary

The Multiplication Method

• For a constant A, 0 < A < 1:

• h(k) = ⌊m (kA − ⌊kA⌋)⌋, where kA − ⌊kA⌋ is the fractional part of kA

• In practice:
  – Choose m = 2^p
  – Choose A not too close to 0 or 1
  – Knuth: a good choice is A = (√5 − 1)/2
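The multiplication method can be sketched in floating point as below, assuming non-negative integer keys small enough that the product kA keeps enough precision in a double. The function name `multiplication_hash` is hypothetical; practical implementations usually work with fixed-width integer arithmetic instead.

```python
import math

A = (math.sqrt(5) - 1) / 2  # Knuth's suggested constant, about 0.618

def multiplication_hash(k, m, a=A):
    """h(k) = floor(m * frac(k * a)), where frac is the fractional part."""
    frac = (k * a) % 1.0  # fractional part of k*a, in [0, 1)
    return math.floor(m * frac)

# Table size m = 2^p with p = 14.
m = 2 ** 14
slot = multiplication_hash(123456, m)
```

Because the fractional part lies in [0, 1), the result always falls in the range 0..m-1.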

Page 34: Lecture 10: Hashing and Dynamic Dictionary

Universal Hashing

• When attempting to foil a malicious adversary, randomize the algorithm

• Universal hashing: pick a hash function randomly when the algorithm begins
  – Guarantees good performance on average, no matter what keys the adversary chooses
  – Need a family of hash functions to choose from
  – Think of quicksort

Page 35: Lecture 10: Hashing and Dynamic Dictionary

Universal Hashing

• Let ℋ be a (finite) collection of hash functions
  – …that map a given universe U of keys…
  – …into the range {0, 1, …, m − 1}.

• ℋ is said to be universal if:
  – for each pair of distinct keys x, y ∈ U, the number of hash functions h ∈ ℋ for which h(x) = h(y) is |ℋ|/m
  – In other words:
    • With a random hash function from ℋ, the chance of a collision between x and y is exactly 1/m (x ≠ y)

Page 36: Lecture 10: Hashing and Dynamic Dictionary

Universal Hashing

• Theorem 11.3:
  – Choose h from a universal family of hash functions
  – Hash n keys into a table T of m slots, n ≤ m
  – Then the expected number of collisions involving a particular key x is less than 1
  – Proof:
    • For each pair of keys y, z, let c_yz = 1 if y and z collide, 0 otherwise
    • E[c_yz] = 1/m (by definition)
    • Let C_x be the total number of collisions involving key x
    • $$E[C_x] = \sum_{y \in T,\; y \neq x} E[c_{xy}] = \frac{n-1}{m}$$
    • Since n ≤ m, we have E[C_x] < 1

Page 37: Lecture 10: Hashing and Dynamic Dictionary

A Universal Hash Function

• Choose table size m to be prime

• Decompose key x into r+1 bytes, so that x = {x_0, x_1, …, x_r}
  – Only requirement is that the max value of a byte < m
  – Let a = {a_0, a_1, …, a_r} denote a sequence of r+1 elements chosen randomly from {0, 1, …, m − 1}
  – Define the corresponding hash function h_a ∈ ℋ:

    $$h_a(x) = \sum_{i=0}^{r} a_i x_i \bmod m$$

  – With this definition, ℋ has m^{r+1} members

Page 38: Lecture 10: Hashing and Dynamic Dictionary

A Universal Hash Function

• ℋ is a universal collection of hash functions (Theorem 11.5)

• How to use:
  – Pick r based on m and the range of keys in U
  – Pick a hash function by (randomly) picking the a’s
  – Use that hash function on all keys

Page 39: Lecture 10: Hashing and Dynamic Dictionary

Example

• Let m = 5, and let the size of each string be 2 bits (binary). Note that the maximum value of a string is 3, which is less than m = 5

• a = (1, 3), chosen at random from {0, 1, 2, 3, 4}

• Example for x = 4 = (01, 00) (note r = 1)

• h_a(4) = (1 · (01) + 3 · (00)) mod 5 = 1
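A sketch of this hash family, assuming the key is split into r+1 pieces of a fixed bit width with the most significant piece taken as x_0 (which matches the example x = 4 = 01,00). The helper names `split_key` and `make_hash` are hypothetical.

```python
import random

def split_key(x, r, bits):
    """Split x into r+1 pieces of `bits` bits each, most significant first."""
    mask = (1 << bits) - 1
    return [(x >> (bits * (r - i))) & mask for i in range(r + 1)]

def make_hash(m, r, bits, a=None):
    """Return h_a(x) = (sum of a_i * x_i) mod m.

    If a is not given, each a_i is drawn at random from {0, ..., m-1}.
    m should be prime and larger than the maximum value of a piece.
    """
    if a is None:
        a = [random.randrange(m) for _ in range(r + 1)]

    def h(x):
        pieces = split_key(x, r, bits)
        return sum(ai * xi for ai, xi in zip(a, pieces)) % m

    return h

# The example from the slide: m = 5, 2-bit pieces, r = 1, a = (1, 3).
h = make_hash(5, 1, 2, a=[1, 3])
```

Here `h(4)` reproduces the value 1 computed on the slide; in real use the coefficients are picked once at random and then the same function is applied to every key.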

Page 40: Lecture 10: Hashing and Dynamic Dictionary

Open Addressing

• Basic idea (details in Section 12.4):
  – To insert: if a slot is full, try another slot, …, until an open slot is found (probing)
  – To search, follow the same sequence of probes as would be used when inserting the element
    • If we reach an element with the correct key, return it
    • If we reach a NULL pointer, the element is not in the table

• Good for fixed sets (adding but no deletion)

• Table needn’t be much bigger than n