
COMP3600/6466 – Algorithms: Hash Tables [CLRS ch. 11]

Hanna Kurniawati

https://cs.anu.edu.au/courses/comp3600/
Comp_3600_6466@anu.edu.au

Assignment 2
• The end of the grace period has been extended to Saturday 26 Sep 13:00
• Note: This extends the end of the grace period. Hence, we will no longer accept submissions beyond the above time, unless you have been granted an additional extension prior to the end of that grace period
• This week's tutorial will be help with A2, esp. programming help
• For those who have submitted/uploaded a draft and do not perform any more updates in Wattle after today 13:00, we will reward your hard work: you'll get a 7.5-pt bonus on this assignment
• Note: The maximum total mark for this class remains 100 and we do not mark/grade on a curve

Topics
• What is a hash table?
• Commonly Used: Hashing with Chaining + simple uniform hashing
• Other hash function: Universal hashing
• Another type of storage: Open addressing
• Perfect hashing

What is a hash table?
• An abstract data structure, where
  • The data maintained are pairs of a key and satellite data. The pairs are often called items. The keys are unique.
  • The operations are:
    • Insert(item)
    • Delete(item)
    • Search(key): Find the item with this key
• The desired time complexity for all of the above operations is O(1)
• Can be thought of as a generalized array
• Instead of using an index which is independent of the data, hash tables generate the index based on a function of the key

Idea: A bit more global picture

[CLRS] Fig. 11.2

Some terminologies and notations
• Universe (U): The set of all possible keys
• u: Number of elements of U
• n: Number of elements currently in the table
• m: Number of slots in the hash table
• A hash function: A function that transforms a key into an integer index ∈ [0, m − 1]

Why not just use an array?
• Hash tables can be thought of as in-between an array and a linked list
• Array:
  • Constant time access
  • Limited in size (can resize, but that takes linear time), or reserve a large size (which is inefficient)
• Linked list:
  • Linear time access
  • Not limited in size
• Hash tables take the best of both worlds:
  • Constant time access
  • Not limited in size

Issue: Collision
• Happens when 2 keys map to the same index
• Solutions:
  • How data is stored:
    • Hashing with chaining: Use a linked list to allow multiple data items to be stored at the same slot in the table
    • Open addressing: All data are in the hash table itself
  • Better hash functions

Topics
✓ What is a hash table?
• Commonly Used: Hashing with Chaining + simple uniform hashing
• Other hash function: Universal hashing
• Another type of storage: Open addressing
• Perfect hashing

Hashing with Chaining
• Each slot in the hashtable contains a linked list.
• Insert(K, D): If a key is hashed into a non-empty slot, place the new pair at the front of the list at that particular slot
• To search for an item (k, d): compute h(k) and search for the item in the linked list at index h(k) of the hashtable

[Figure: keys with satellite data hashing into slots; k → h(k) = idx₁, k′ → h(k′) = idx₂, k″ → h(k″) = idx₁, so k and k″ collide and share a chain]
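To make the operations above concrete, here is a minimal Python sketch of hashing with chaining (my own illustration; the lecture does not prescribe this implementation, and the modular hash below is an assumed placeholder). Python lists stand in for the linked lists.

```python
# Minimal sketch of hashing with chaining (illustrative only).

class ChainedHashTable:
    def __init__(self, m):
        self.m = m                           # number of slots
        self.table = [[] for _ in range(m)]  # each slot holds a chain

    def _hash(self, key):
        # Placeholder hash (division method); any hash function works here.
        return hash(key) % self.m

    def insert(self, key, data):
        # Place the new (key, data) pair at the front of the chain.
        self.table[self._hash(key)].insert(0, (key, data))

    def search(self, key):
        # Compute h(k), then scan the chain at that slot.
        for k, d in self.table[self._hash(key)]:
            if k == key:
                return (k, d)
        return None

    def delete(self, key):
        chain = self.table[self._hash(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
```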

Simple Uniform Hashing
• A hash function where any given key is equally likely to hash into any of the m slots, independently of where any other key has hashed to, i.e., a hash function h such that

P(h(k) = v) = 1/m for all v ∈ [0, m − 1]

• Assuming the input keys are uniformly distributed and independent, the following holds
• Collision probability:

P(h(k₁) = h(k₂)) = 1/m for any k₁ ≠ k₂

Time complexity?
• Does it satisfy the desire to have constant time access?
• On average, yeah, sort of
• Time complexity to search a key (regardless of successful outcome or not) is on average Θ(1 + α), where α = n/m
• The notation α is usually called the load factor. If we can keep the load factor constant, then the time complexity for search is constant
• Proof? There are 2 cases to prove:
  • When the key is not found
  • When the key is found

Proof for Unsuccessful Outcome
• To compute the average time to get a fail from searching for an item (k, d), we compute:
  • Time to compute the hash h(k): Θ(1)
  • Average #elements to check before returning a fail, which is the same as the average number of elements in the linked list pointed to by index h(k) in the hash table: n/m
• Total average time: Θ(1 + n/m) = Θ(1 + α)

Proof for Successful Outcome
• To compute the average time to get a success from searching for an item (k, d), we compute:
  • Time to compute the hash h(k): Θ(1)
  • Average #elements to check before (k, d) is found, which can be computed as the average #items added to the hashtable that have the same hash as h(k) and were added after (k, d) was added.
• To compute the above, let's first define an indicator random variable

X_{ij} = 1 if h(k_i) = h(k_j), and 0 otherwise

Based on the collision probability, E[X_{ij}] = P(X_{ij} = 1) = 1/m

Proof for Successful Outcome
• Now, we can compute the average #items added to the linked list after an item is added as:

E[(1/n) ∑_{i=1}^n (1 + ∑_{j=i+1}^n X_{ij})]

where the "1 +" is the check that returns true, i.e., the comparison that finds the item itself
• Note: The average above is taken over all elements in the hashtable
• Now, we just need to compute the above average #elements that appear before the i-th item (k_i, d_i)

Proof for Successful Outcome

E[(1/n) ∑_{i=1}^n (1 + ∑_{j=i+1}^n X_{ij})]
= (1/n) ∑_{i=1}^n (1 + ∑_{j=i+1}^n E[X_{ij}])    (linearity of expectation)
= (1/n) ∑_{i=1}^n (1 + ∑_{j=i+1}^n 1/m)          (the expectation we computed 2 slides ago)
= (1/n) (n + (1/m) ∑_{i=1}^n ∑_{j=i+1}^n 1)
= 1 + (1/(nm)) ∑_{i=1}^n (n − i)
= 1 + (1/(nm)) (∑_{i=1}^n n − ∑_{i=1}^n i)
= 1 + (1/(nm)) (n² − n(1 + n)/2)
= 1 + n/m − (1 + n)/(2m)
= 1 + (2n − (1 + n))/(2m)
= 1 + (n − 1)/(2m)
= 1 + n/(2m) − 1/(2m)
= 1 + α/2 − α/(2n)

Proof for Successful Outcome
• To compute the average time to get a success from searching for an item (k, d), we compute:
  • Time to compute the hash h(k): Θ(1)
  • Average #elements to check before (k, d) is found, which can be computed as the average #items added to the hashtable that have the same hash as h(k) and were added after (k, d) was added: Θ(1 + α/2 − α/(2n))
• Total: Θ(1) + Θ(1 + α/2 − α/(2n)) = Θ(2 + α/2 − α/(2n)) = Θ(1 + α)
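As an illustrative sanity check (not part of the slides), the following sketch builds a chaining table with random keys, measures the average number of comparisons for a successful search, and compares it against the derived value 1 + α/2 − α/(2n). The division hash k mod m is an assumed stand-in for simple uniform hashing.

```python
# Empirical check of the successful-search cost 1 + α/2 − α/(2n).
import random

def avg_successful_probes(n, m, trials=20):
    total = 0.0
    for _ in range(trials):
        keys = random.sample(range(10**9), n)
        table = [[] for _ in range(m)]
        for k in keys:
            table[k % m].insert(0, k)      # insert at the front, as in chaining
        probes = 0
        for k in keys:                     # search every stored key once
            chain = table[k % m]
            probes += chain.index(k) + 1   # comparisons until k is found
        total += probes / n
    return total / trials

n, m = 1000, 500
alpha = n / m
print("measured :", avg_successful_probes(n, m))
print("predicted:", 1 + alpha / 2 - alpha / (2 * n))
```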

Topics
✓ What is a hash table?
✓ Commonly Used: Hashing with Chaining + simple uniform hashing
• Other hash function: Universal hashing
• Another type of storage: Open addressing
• Perfect hashing

Hash functions
• In simple uniform hashing, we assume that each key has equal probability of being mapped to any one of the indices in the hash table
• This assumption is actually considered a good property for a hash function.
• But it is not that easy to get, because:
  • In general, we don't know the distribution of the keys
  • Moreover, the keys might be drawn in a dependent manner

Some commonly used hash functions
• Some heuristics for hash functions (see the sketch below):
  • Simplest: h(k) = ⌊km⌋ when the key is a real number uniformly distributed in [0, 1)
  • Division method: h(k) = k mod m when the key is an integer. Usually, we want m to be a prime number.
  • Multiplication method: h(k) = ⌊m(kA mod 1)⌋, where A is a constant in the range 0 < A < 1
    • kA mod 1 = kA − ⌊kA⌋, i.e., the fractional part of kA
    • Reduces the dependency on the #slots in our hashtable.
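A brief sketch of the division and multiplication methods (the choice A = (√5 − 1)/2 is Knuth's commonly cited suggestion, not something fixed by the slide):

```python
# Sketches of the division and multiplication methods.
import math

def h_division(k, m):
    # Division method: h(k) = k mod m; m is ideally prime.
    return k % m

def h_multiplication(k, m, A=(math.sqrt(5) - 1) / 2):
    # Multiplication method: h(k) = floor(m * (k*A mod 1)), 0 < A < 1.
    frac = (k * A) % 1.0        # the fractional part of k*A
    return int(m * frac)

print(h_division(123456, 701), h_multiplication(123456, 701))
```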

Issues
• The heuristics on hash functions in the previous slides are deterministic, which we need. But they can easily be fooled into performing at the worst case by an adversary
  • Worst case: Average search time linear in n (#elements in the hash table)
• Ex.: Division method with m = 2ᵖ − 1, when the keys are integers in base 2ᵖ. What's the problem?
  • Those who have done A2 can think about this while the tutors are helping your colleagues

Universal Hashing: Idea
• Choose a random hash function h from a collection of hash functions, denoted as H
• The collection H is universal whenever, for any pair of distinct keys in the universe of keys U (i.e., k, k′ ∈ U), the number of hash functions in H such that h(k) = h(k′) is at most |H|/m (as usual, m is the #slots in the hashtable).
• The above definition implies that for any randomly selected function h ∈ H,

P_{h∈H}(h(k) = h(k′)) ≤ 1/m for any k ≠ k′ in the universe of keys U

• The above hashing function will provide an average #collisions similar to simple uniform hashing, independent of the key distribution

Universal Hashing: Average #Collisions
• Suppose we have a hashtable T of size m that uses chaining and universal hashing. And suppose T already contains n items with arbitrary distinct keys. Given a key k, if h(k) = i for a hash function h selected uniformly at random from the collection of universal hash functions H, then:
  • E[#elements already in T[i]] ≤ n/m if k is not already in T
  • E[#elements already in T[i]] ≤ n/m + 1 if k is already in T
• Next, we prove that the above bounds hold with no assumption on the distribution of the keys; they depend solely on the hash function

Proof
• Let's define an indicator random variable for any pair of keys k, l ∈ U:

X_{kl} = 1 if h(k) = h(l), and 0 otherwise

Based on the implication of a universal hashing function (2 slides back), we know E[X_{kl}] = P(X_{kl} = 1) ≤ 1/m
• We can then define the random variable Y_k as the number of keys other than k which are in T and hash to the same index as k:

E[Y_k] = E[∑_{l≠k, l∈keys(T)} X_{kl}]

where keys(T) is the set of keys already in T. Using linearity of expectation,

E[Y_k] = ∑_{l≠k, l∈keys(T)} E[X_{kl}] ≤ ∑_{l≠k, l∈keys(T)} 1/m

Proof
• If k is not already in T:

E[Y_k] ≤ ∑_{l≠k, l∈keys(T)} 1/m ≤ n/m = α

• If k is already in T:

E[#elements already in T[i]] = E[Y_k] + 1 ≤ n/m + 1 = α + 1

The additional 1 is because Y_k excludes k itself, which is in T

Generating Universal Hashing Functions: Dot Product Hash Family
• Suppose m is a prime number
• View key k in base m: k = ⟨k₀, k₁, …, k_{r−1}⟩, where k_i ∈ [0, m − 1], i ∈ [0, r − 1]
• Dot product hash family:
  • Choose a random key a = ⟨a₀, a₁, …, a_{r−1}⟩
  • Define h_a(k) = a · k mod m = (∑_{i=0}^{r−1} a_i k_i) mod m
  • The dot product hash family is H = {h_a | a ∈ {0, 1, …, u − 1}}, where u = mʳ

Example
• Suppose the keys are integers in [0, 20]
• Let's set m = 5
• Hashing IPs (let's use IPv4)? (see the sketch below)
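One possible reading of this example, as a hedged sketch: keys in [0, 20] have r = 2 digits in base m = 5, and a 32-bit IPv4 address can be viewed in base m = 257 (a prime), giving r = 4 digits. The helper names are my own.

```python
# Sketch of the dot-product hash family for the slide's example.
import random

def digits_base_m(k, m, r):
    # View key k as r digits in base m: k = <k_0, k_1, ..., k_{r-1}>.
    ds = []
    for _ in range(r):
        ds.append(k % m)
        k //= m
    return ds

def make_dot_product_hash(m, r):
    # Choose a random key a = <a_0, ..., a_{r-1}>, each a_i in [0, m-1].
    a = [random.randrange(m) for _ in range(r)]
    def h(k):
        ks = digits_base_m(k, m, r)
        return sum(ai * ki for ai, ki in zip(a, ks)) % m
    return h

# Keys in [0, 20] with m = 5 need r = 2 digits (20 is 40 in base 5).
h = make_dot_product_hash(m=5, r=2)
print([h(k) for k in range(21)])

# IPv4: view the 32-bit address in base m = 257 (prime), so r = 4 digits.
h_ip = make_dot_product_hash(m=257, r=4)
ip = (192 << 24) | (168 << 16) | (0 << 8) | 1   # 192.168.0.1 as an integer
print(h_ip(ip))
```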

Is the Dot Product Hash Family Universal?
• Yes
• We will skip the proof

Generating Universal Hashing Functions: [CLRS example]
• h_ab(k) = ((ak + b) mod p) mod m
• a, b, p are constants
• H = {h_ab | a ∈ [1, p − 1], b ∈ [0, p − 1]}
• p > m and p is prime
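A minimal sketch of this family (the prime p = 2³¹ − 1 is an illustrative choice; for the universality guarantee, keys should lie in [0, p − 1]):

```python
# Sketch of the CLRS universal family h_ab(k) = ((a*k + b) mod p) mod m.
import random

def make_hab(m, p=2_147_483_647):    # p must be prime and > m
    a = random.randrange(1, p)       # a in [1, p-1]
    b = random.randrange(0, p)       # b in [0, p-1]
    return lambda k: ((a * k + b) % p) % m

h = make_hab(m=701)
print(h(123456), h(123457))
```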

Is a family of universal hashing always good?
• No
• Example:
  • For each element of U, pick a random number in [0, m − 1], keep it in memory, and use this mapping for hashing
  • Problem: Takes Θ(u) time & memory; u is usually much larger than m (and n)

Topics
✓ What is a hash table?
✓ Commonly Used: Hashing with Chaining + simple uniform hashing
✓ Other hash function: Universal hashing
• Another type of storage: Open addressing
• Perfect hashing

Open Addressing
• All items are stored in the hash table itself (i.e., no linked lists; one item per slot)
• Here, the hash function maps (key, #trials) to an index in the hash table:

h: U × {0, 1, …, m − 1} → {0, 1, …, m − 1}

• The function specifies the order of slots to try
• And ⟨h(k, 0), h(k, 1), …, h(k, m − 1)⟩ is a permutation of ⟨0, 1, …, m − 1⟩
• If we keep trying, then eventually we hit all slots

Insertion

Search
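The pseudocode figures for the Insertion and Search slides did not survive extraction. Below is a minimal Python sketch in the spirit of CLRS HASH-INSERT and HASH-SEARCH, assuming linear probing; the `if table[j] is None` test plays the role of the "Insertion line-4" NIL check referenced on the next slide.

```python
# Minimal open-addressing sketch (linear probing assumed for h(k, i)).

def h(k, i, m):
    # Linear probing: h(k, i) = (h'(k) + i) mod m.
    return (hash(k) + i) % m

def insert(table, k):
    m = len(table)
    for i in range(m):                 # try the probe sequence h(k, 0..m-1)
        j = h(k, i, m)
        if table[j] is None:           # the "line-4" NIL test (see next slide)
            table[j] = k
            return j
    raise OverflowError("hash table overflow")

def search(table, k):
    m = len(table)
    for i in range(m):
        j = h(k, i, m)
        if table[j] is None:           # an empty slot ends the probe sequence
            return None
        if table[j] == k:
            return j
    return None

T = [None] * 8
insert(T, 42); insert(T, 50)
print(search(T, 42), search(T, 99))
```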

Be Careful with Deletion
• What's the issue?
  • Imagine data (k, d) has been stored in T[h(k, 3)]. Suppose we then delete T[h(k, 2)] as usual (i.e., assign T[h(k, 2)] = NIL). Then, we can no longer find (k, d) in the hash table, because the Search function will find that T[h(k, 2)] is NIL and therefore stop searching
• Solution (sketch below):
  • Mark as deleted. For instance, in the above example, set T[h(k, 2)] = Deleted
  • Modify Insertion line-4, so that it accepts a slot that is NIL or Deleted
  • The problem now becomes the time complexity of search
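A short sketch of the tombstone solution (DELETED is an illustrative sentinel; `h` is a probing hash like the one in the previous sketch, passed in as a parameter):

```python
# Deletion with a "Deleted" marker (tombstone).
DELETED = object()

def delete(table, k, h):
    m = len(table)
    for i in range(m):
        j = h(k, i, m)
        if table[j] is None:
            return                      # key not present
        if table[j] == k:
            table[j] = DELETED          # mark as deleted; don't set to NIL
            return

def insert(table, k, h):
    m = len(table)
    for i in range(m):
        j = h(k, i, m)
        if table[j] is None or table[j] is DELETED:   # modified line-4 test
            table[j] = k
            return j
    raise OverflowError("hash table overflow")

# Note: the earlier search() works unchanged, since it only stops at None
# and simply probes past DELETED slots.
```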

Probing Strategies
• Linear Probing: h(k, i) = (h′(k) + i) mod m, where h′ is a usual hash function (i.e., a hash function without the probing parameter)
  • Issue: Clustering, i.e., consecutive groups of occupied slots become longer
• Double hashing: h(k, i) = (h₁(k) + i·h₂(k)) mod m, where h₁ and h₂ are the usual hash functions
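A sketch of the two probing strategies (the auxiliary hashes h′, h₁, h₂ below are illustrative choices; for double hashing, h₂(k) must be nonzero, and with m prime any value in [1, m − 1] makes the probe sequence a full permutation):

```python
# Sketches of the two probing strategies from the slide.

def linear_probe(k, i, m):
    # h(k, i) = (h'(k) + i) mod m
    h_prime = k % m
    return (h_prime + i) % m

def double_hash(k, i, m):
    # h(k, i) = (h1(k) + i * h2(k)) mod m; h2(k) is nonzero and, with m
    # prime, relatively prime to m, so the probe sequence hits all slots.
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return (h1 + i * h2) % m

m = 11
print([linear_probe(42, i, m) for i in range(m)])
print([double_hash(42, i, m) for i in range(m)])
```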

Topics
✓ What is a hash table?
✓ Commonly Used: Hashing with Chaining + simple uniform hashing
✓ Other hash function: Universal hashing
✓ Another type of storage: Open addressing
• Perfect hashing

Perfect hashing
• If the data is static (i.e., no insert/delete), we can exploit this to ensure worst-case O(1) search time
• Idea: 2-level hashing
  • Each slot points to another hash table
  • Use universal hashing in both levels
• Properties:
  • Polynomial build time with high probability
  • O(1) search time in the worst case
  • O(n) space in the worst case

How to build a perfect hashing?
1. Pick a universal hash function h₁ ∈ H and set m = Θ(n); usually m is set to be a prime
2. If ∑_{i=0}^{m−1} n_i² > cn, for a selected constant c, redo step 1
3. If there are n_i elements in slot i, construct a hash table with n_i² slots, and choose a universal hash function h₂,ᵢ ∈ H to be the hash function for this 2nd-level hash table
4. As long as there is h₂,ᵢ(k) = h₂,ᵢ(k′) for any k ≠ k′, pick a different h₂,ᵢ and rehash those n_i elements
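A hedged sketch of this build procedure (make_hab is the assumed universal-hash generator from the CLRS-family sketch earlier; c = 4 and m = n are arbitrary illustrative choices for the constants):

```python
# Sketch of the two-level (perfect hashing) build, following steps 1-4.
import random

def make_hab(m, p=2_147_483_647):       # CLRS-style universal family
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect(keys, c=4):
    n = len(keys)
    m = max(1, n)                       # step 1: m = Θ(n)
    while True:
        h1 = make_hab(m)
        buckets = [[] for _ in range(m)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= c * n:   # step 2
            break                       # otherwise redo step 1
    second = []
    for bucket in buckets:              # steps 3-4: one table per slot
        ni = len(bucket)
        while True:
            mi = max(1, ni * ni)        # 2nd-level table of n_i^2 slots
            h2 = make_hab(mi)
            slots = [None] * mi
            ok = True
            for k in bucket:
                j = h2(k)
                if slots[j] is not None:   # collision: pick a new h2, rehash
                    ok = False
                    break
                slots[j] = k
            if ok:
                second.append((h2, slots))
                break
    return h1, second

def perfect_search(k, h1, second):
    h2, slots = second[h1(k)]
    return slots[h2(k)] == k            # worst-case O(1): two hashes, one probe

keys = random.sample(range(10**6), 100)
h1, second = build_perfect(keys)
print(all(perfect_search(k, h1, second) for k in keys))
```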

Perfect Hashing
• Once the hash table is built, it's guaranteed there's no collision in search. Hence, it's guaranteed to have O(1) search time.
• The question is how many times we need to repeat finding the hash functions (step 2 and step 3)

Additional Bounds on Time
• E[#trials] ≤ 2, and #trials = O(log n) with high probability
• Total time spent in step-3 & step-4: O(n log² n)
• Total time spent in step-1 & step-2: O(n log n)
• We will skip the proof

Topics
✓ What is a hash table?
✓ Commonly Used: Hashing with Chaining + simple uniform hashing
✓ Other hash function: Universal hashing
✓ Another type of storage: Open addressing
✓ Perfect hashing

Next: Algorithm Design Techniques
