COMP3600/6466 – Algorithms: Hash Tables [CLRS ch. 11]
Hanna Kurniawati
https://cs.anu.edu.au/courses/comp3600/
[email protected]
Assignment 2
• The end of the grace period has been extended to Saturday 26 Sep 13:00
• Note: This extends the end of the grace period. Hence, we will no longer accept submissions beyond the above time, unless you have gotten an additional extension prior to the end of that grace period
• This week's tutorial will be help on A2, esp. programming help
• For those who have submitted / uploaded a draft and do not perform any more updates in Wattle after today 13:00, we will reward your hard work: you'll get a 7.5pts bonus in this assignment
• Note: The maximum total mark for this class remains 100 and we do not mark/grade on a curve
Topics
• What is a hash table?
• Commonly Used: Hashing with Chaining + simple uniform hashing
• Other hash functions: Universal hashing
• Another type of storage: Open addressing
• Perfect hashing
What is a hash table?
• An abstract data structure, where
  • The data maintained are pairs of a key and satellite data. The pairs are often called items. The keys are unique.
  • The operations are:
    • Insert(item)
    • Delete(item)
    • Search(key): Find the item with this key
• The desired time complexity for all of the above operations is O(1)
• Can be thought of as a generalized array
• Instead of using an index that is independent of the data, hash tables generate the index based on a function of the key
Idea: A bit more global picture
[CLRS] Fig. 11.2
Some terminologies and notations
• Universe (U): The set of all possible keys
• u: Number of elements of U
• n: Number of elements currently in the table
• m: Number of slots in the hash table
• A hash function: A function that transforms a key into an integer index ∈ [0, m − 1]
Why not just use an array?
• Hash tables can be thought of as in-between an array and a linked list
• Array:
  • Constant time access
  • Limited in size (can resize, but that takes linear time) or reserve a large size (which is inefficient)
• Linked list:
  • Linear time access
  • Not limited in size
• Hash tables take the best of both worlds:
  • Constant time access
  • Not limited in size
Issue: Collision
• Happens when 2 keys map to the same index
• Solutions:
  • How data is stored:
    • Hashing with chaining: Use a linked list to allow multiple data to be stored at the same slot in the table
    • Open addressing: All data are stored directly in the hash table
  • Better hash functions
Topics
✓ What is a hash table?
• Commonly Used: Hashing with Chaining + simple uniform hashing
• Other hash functions: Universal hashing
• Another type of storage: Open addressing
• Perfect hashing
Hashing with Chaining
• Each slot in the hashtable contains a linked list.
• Insert(k, d): compute h(k); if the key hashes to a non-empty slot, place the new pair at the front of the linked list at that particular slot
• Search(k): compute h(k) and search for the item in the linked list at index h(k) of the hashtable
• (key, satellite data) pairs map to slots, e.g.:
  k → h(k) = idx₁
  k′ → h(k′) = idx₂
  k″ → h(k″) = idx₁  (k and k″ collide, so they share the chain at idx₁)
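The scheme above can be sketched in a few lines. This is a minimal illustration, not the lecture's reference implementation: Python lists stand in for the linked lists, the class and method names are made up, and the division method h(k) = k mod m is assumed as the hash function.

```python
class ChainedHashTable:
    """Hashing with chaining: one chain (list) per slot."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # chain per slot; [] plays the linked list

    def _h(self, key):
        return key % self.m                    # assumed hash: division method

    def insert(self, key, data):
        # New pair goes to the FRONT of the chain, as on the slide: O(1)
        self.slots[self._h(key)].insert(0, (key, data))

    def search(self, key):
        # Scan the chain at index h(key); cost grows with the chain length
        for k, d in self.slots[self._h(key)]:
            if k == key:
                return d
        return None

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return

t = ChainedHashTable(5)
t.insert(7, "a")
t.insert(12, "b")        # 12 mod 5 == 7 mod 5 == 2: a collision, same chain
print(t.search(7))       # -> a
print(t.search(12))      # -> b
```

Note that insertion never scans the chain, which is why it is O(1) even when collisions pile up; only search/delete pay for long chains.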
Simple Uniform Hashing
• A hash function where any given key is equally likely to hash into any of the m slots, independently of where any other key has hashed to, i.e., a hash function h such that

  P(h(k) = v) = 1/m  for all v ∈ [0, m − 1]

• Assuming the input keys are uniformly distributed and independent, the following holds
  • Collision probability: P(h(k₁) = h(k₂)) = 1/m for any two distinct keys k₁ ≠ k₂
Time complexity?
• Does it satisfy the desire to have constant time access?
• On average, yes, sort of
• Time complexity to search a key (regardless of successful outcome or not) is on average Θ(1 + α), where α = n/m
• The quantity α is usually called the load factor. If we can keep the load factor constant, then the time complexity for search is constant
• Proof? There are 2 cases to prove
  • When the key is not found
  • When the key is found
Proof for Unsuccessful Outcome
• To compute the average time to get a fail from searching for an item (k, d), we compute:
  • Time to compute the hash h(k): Θ(1)
  • Average #elements to check before returning a fail is the same as the average number of elements in the linked list pointed to by index h(k) in the hash table: n/m
  • Total average time: Θ(1 + n/m) = Θ(1 + α)
Proof for Successful Outcome
• To compute the average time to get a success from searching for an item (k, d), we compute:
  • Time to compute the hash h(k): Θ(1)
  • Average #elements to check before (k, d) is found can be computed as the average #items added to the hashtable that have the same hash as h(k) and were added after (k, d) was added.
• To compute the above, let's first define an indicator random variable

  X_ij = 1 if h(k_i) = h(k_j), 0 otherwise

  Based on the collision probability, E[X_ij] = P(X_ij = 1) = 1/m
Proof for Successful Outcome
• Now, we can compute the average #items added to the linked list after an item is added as:

  E[ (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]

• Note: The average above is taken over all elements in the hashtable
• Here Σ_{j=i+1}^{n} X_ij counts the elements that appear before the ith item (k_i, d_i) in its chain (items inserted later sit at the front), and the 1 is the check on (k_i, d_i) itself: the check that returns true
• Now, we just need to compute the above average
Proof for Successful Outcome

  E[ (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]
  = (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} E[X_ij] )    (linearity of expectation)
  = (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} 1/m )        (the expectation we've computed 2 slides ago)
  = (1/n) ( n + (1/m) Σ_{i=1}^{n} Σ_{j=i+1}^{n} 1 )
  = 1 + (1/(nm)) Σ_{i=1}^{n} (n − i)
  = 1 + (1/(nm)) ( Σ_{i=1}^{n} n − Σ_{i=1}^{n} i )
  = 1 + (1/(nm)) ( n² − n(1 + n)/2 )
  = 1 + n/m − (1 + n)/(2m)
  = 1 + (2n − (1 + n))/(2m)
  = 1 + (n − 1)/(2m)
  = 1 + n/(2m) − 1/(2m)
  = 1 + α/2 − α/(2n)
Proof for Successful Outcome
• To compute the average time to get a success from searching for an item (k, d), we compute:
  • Time to compute the hash h(k): Θ(1)
  • Average #elements to check before (k, d) is found can be computed as the average #items added to the hashtable that have the same hash as h(k) and were added after (k, d) was added: Θ(1 + α/2 − α/(2n))
  • Total: Θ(1) + Θ(1 + α/2 − α/(2n)) = Θ(2 + α/2 − α/(2n)) = Θ(1 + α)
Topics
✓ What is a hash table?
✓ Commonly Used: Hashing with Chaining + simple uniform hashing
• Other hash functions: Universal hashing
• Another type of storage: Open addressing
• Perfect hashing
Hash functions
• In simple uniform hashing, we assume that each key has equal probability of being mapped to any one of the indices in the hash table
• This assumption is actually considered a good property for a hash function.
• But it is not that easy to get, because:
  • In general, we don't know the distribution of the keys
  • Moreover, the keys might be drawn in a dependent manner
Some commonly used hash functions
• Some heuristics for hash functions:
  • Simplest: h(k) = ⌊km⌋ when the key is a real number uniformly distributed in [0, 1)
  • Division method: h(k) = k mod m when the key is an integer. Usually, we want m to be a prime number.
  • Multiplication method: h(k) = ⌊m (kA mod 1)⌋, where A is a constant in the range 0 < A < 1
    • kA mod 1 = kA − ⌊kA⌋, the fractional part of kA
    • Reduces the dependency on #slots in our hashtable.
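The three heuristics above can be sketched directly. The table size m = 13 and the test keys are arbitrary illustrative choices; for A, CLRS quotes Knuth's suggestion A ≈ (√5 − 1)/2.

```python
import math

m = 13  # illustrative table size (a prime)

def h_real(k):
    # Simplest heuristic: key is a real number uniformly distributed in [0, 1)
    return math.floor(k * m)

def h_division(k):
    # Division method: integer key, h(k) = k mod m
    return k % m

A = (math.sqrt(5) - 1) / 2   # Knuth's suggested constant, 0 < A < 1

def h_multiplication(k):
    # Multiplication method: h(k) = floor(m * (kA mod 1))
    frac = (k * A) % 1.0     # fractional part of kA
    return math.floor(m * frac)

print(h_real(0.5))             # -> 6
print(h_division(100))         # -> 9
print(h_multiplication(100))   # some index in [0, m - 1]
```

A practical point of the multiplication method, as the slide notes, is that the quality of the spread depends mostly on A, so m is free to be whatever is convenient (e.g. a power of 2).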
Issues
• The heuristics on hash functions in the previous slides are deterministic, which we need. But they can easily be fooled into performing at the worst case by an adversary
  • Worst case: Average search time linear in n (#elements in the hash table)
• Ex.: Division method with m = 2^p − 1, when the keys are integers represented in base 2^p. What's the problem?
  • Those who have done A2 can think about this while the tutors are helping your colleagues
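One way to see the problem (a sketch, spoiling the exercise a little): since 2^p ≡ 1 (mod 2^p − 1), each base-2^p digit of the key contributes only its raw value to k mod m. So if keys are character strings interpreted in radix 2^p, any two strings that are permutations of each other collide, and an adversary can flood one slot. Below, p = 8 (bytes) and m = 255 are illustrative choices.

```python
p = 8
m = 2**p - 1   # 255: the problematic choice of m

def string_key(s):
    # Interpret the string as an integer in base 2^p
    k = 0
    for ch in s.encode():
        k = k * 2**p + ch
    return k

def h(k):
    # Division method with m = 2^p - 1
    return k % m

# "listen" and "silent" are permutations of the same characters:
# different keys, but guaranteed to hash to the same index.
print(string_key("listen") == string_key("silent"))          # -> False
print(h(string_key("listen")) == h(string_key("silent")))    # -> True
```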
Universal Hashing: Idea
• Choose a random hash function h from a collection of hash functions, denoted as H
• The collection H is universal whenever, for any pair of distinct keys in the universe of keys U (i.e., k, k′ ∈ U with k ≠ k′), the number of hash functions in H such that h(k) = h(k′) is at most |H|/m (as usual, m is #slots in the hashtable).
• The above definition implies that for any randomly selected function h ∈ H,

  P_{h∈H}(h(k) = h(k′)) ≤ 1/m  for any k ≠ k′ in the universe of keys U

• The above hashing scheme will provide an average #collisions similar to simple uniform hashing, independent of the key distribution
Universal Hashing: Average #Collisions
• Suppose we have a hashtable T of size m that uses chaining and universal hashing. And suppose T already contains n items with arbitrary distinct keys. Given a key k, if h(k) = i for a hash function h selected uniformly at random from the collection of universal hash functions H, then:
  • E[#elements already in T[i]] ≤ n/m if k is not already in T
  • E[#elements already in T[i]] ≤ n/m + 1 if k is already in T
• Next, we prove the above bound holds with no assumption on the distribution of the keys; it depends solely on the hash function
Proof
• Let's define an indicator random variable for any pair of keys k, l ∈ U:

  X_kl = 1 if h(k) = h(l), 0 otherwise

  Based on the implication of universal hashing (2 slides back), we know E[X_kl] = P(X_kl = 1) ≤ 1/m
• We can then define the random variable Y_k as the number of keys other than k which are in T and hash to the same index as k:

  E[Y_k] = E[ Σ_{l≠k, l∈keys(T)} X_kl ]

  where keys(T) is the set of keys already in T. Using linearity of expectation,

  E[Y_k] = Σ_{l≠k, l∈keys(T)} E[X_kl] ≤ Σ_{l≠k, l∈keys(T)} 1/m
Proof
• If k is not already in T:

  E[Y_k] ≤ Σ_{l≠k, l∈keys(T)} 1/m ≤ n/m = α

• If k is already in T:

  E[#elements already in T[i]] = E[Y_k] + 1 ≤ n/m + 1 = α + 1

  The additional 1 is because E[Y_k] excludes k itself, which is in T
Generating Universal Hash Functions: Dot Product Hash Family
• Suppose m is a prime number
• View key k in base m: k = ⟨k₀, k₁, …, k_{r−1}⟩, where k_i ∈ [0, m − 1], i ∈ [0, r − 1]
• Dot product hash family:
  • Choose a random key a = ⟨a₀, a₁, …, a_{r−1}⟩
  • Define h_a(k) = a · k mod m = ( Σ_{i=0}^{r−1} a_i k_i ) mod m
  • The dot product hash family is H = { h_a | a ∈ {0, 1, …, u − 1} }, where u = m^r
Example
• Suppose the keys are integers in [0, 20]
• Let's set m = 5
• Hashing IP addresses (let's use IPv4)?
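The first example can be sketched as follows: with m = 5, every key in [0, 20] fits in r = 2 base-5 digits, so drawing a random a = ⟨a₀, a₁⟩ picks one member of the family. (For IPv4 one would instead take m prime and r = 4 with the four address bytes as digits, assuming m > 255.) Function names here are illustrative.

```python
import random

m, r = 5, 2   # table size (prime) and #base-m digits; covers keys in [0, 24]

def digits(k):
    # Base-m digits of k, least significant first: k = k0 + k1*m + ...
    return [(k // m**i) % m for i in range(r)]

def make_hash():
    # Choose a random "key" a = <a0, ..., a_{r-1}>, each a_i in [0, m - 1],
    # and return h_a(k) = (sum_i a_i * k_i) mod m
    a = [random.randrange(m) for _ in range(r)]
    return lambda k: sum(ai * ki for ai, ki in zip(a, digits(k))) % m

h = make_hash()
print([h(k) for k in range(21)])   # 21 indices, each in [0, m - 1]
```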
Is the Dot Product Hash Family Universal?
• Yes
• We will skip the proof
Generating Universal Hash Functions: [CLRS example]
• h_ab(k) = ((ak + b) mod p) mod m
• a, b, p are constants
• H = { h_ab | a ∈ [1, p − 1], b ∈ [0, p − 1] }
• p > m and p is prime
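Drawing a random member of this CLRS family is just drawing a and b. A sketch, with p = 101 and m = 10 as illustrative values (any prime p > m larger than the keys works):

```python
import random

p, m = 101, 10   # p prime, p > m; keys assumed to lie in [0, p - 1]

def random_hash():
    # Pick one h_ab uniformly at random from the family H
    a = random.randrange(1, p)   # a in [1, p - 1]
    b = random.randrange(0, p)   # b in [0, p - 1]
    return lambda k: ((a * k + b) % p) % m

h = random_hash()
print([h(k) for k in range(5)])   # indices in [0, m - 1]
```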
Is a family of universal hash functions always good?
• No
• Example:
  • For each element of U, pick a random number in [0, m − 1], keep it in memory, and use this mapping for hashing
  • Problem: Takes Θ(u) time & memory, and u is usually much larger than m (and n)
Topics
✓ What is a hash table?
✓ Commonly Used: Hashing with Chaining + simple uniform hashing
✓ Other hash functions: Universal hashing
• Another type of storage: Open addressing
• Perfect hashing
Open Addressing
• All items are stored in the hash table itself (i.e., no linked list, one item per slot)
• Here, the hash function maps (key, #trial) to an index in the hash table:

  h: U × {0, 1, …, m − 1} → {0, 1, …, m − 1}

• The function specifies the order of slots to try
• And ⟨h(k, 0), h(k, 1), …, h(k, m − 1)⟩ is a permutation of ⟨0, 1, …, m − 1⟩
• If we keep trying, then eventually we hit all slots
Insertion
Search
Be Careful with Deletion
• What's the issue?
  • Imagine data (k, d) has been stored in T[h(k, 3)]. Suppose we then delete T[h(k, 2)] as usual (i.e., assign T[h(k, 2)] = NIL). Then we can no longer find (k, d) in the hash table, because the Search function will find that T[h(k, 2)] is NIL and therefore stop searching
• Solution:
  • Mark as deleted. For instance, in the above example, set T[h(k, 2)] = Deleted
  • Modify Insertion line-4, so that a slot can be filled when it is NIL or Deleted
• The problem now becomes the time complexity of search
Probing Strategies
• Linear Probing: h(k, i) = (h′(k) + i) mod m, where h′ is a usual hash function (i.e., a hash function without the probing parameter)
  • Issue: Clustering, i.e., consecutive groups of occupied slots become longer
• Double hashing: h(k, i) = (h₁(k) + i·h₂(k)) mod m, where h₁ and h₂ are usual hash functions
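Open addressing with linear probing, including the Deleted marker discussed above, can be sketched as below. This is an illustration under assumed choices (h′(k) = k mod m, a Python sentinel object as the Deleted mark); the class and method names are made up, not CLRS's pseudocode.

```python
DELETED = object()   # the "Deleted" mark (a tombstone); never equals a real item

class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.T = [None] * m          # None plays the role of NIL

    def _probe(self, key, i):
        # Linear probing: h(k, i) = (h'(k) + i) mod m, with h'(k) = k mod m
        return (key + i) % self.m

    def insert(self, key, data):
        for i in range(self.m):
            j = self._probe(key, i)
            # A slot may be reused when it is NIL *or* Deleted
            if self.T[j] is None or self.T[j] is DELETED:
                self.T[j] = (key, data)
                return j
        raise RuntimeError("hash table overflow")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.T[j] is None:    # true empty slot: key cannot be further on
                return None
            if self.T[j] is not DELETED and self.T[j][0] == key:
                return self.T[j][1]
        return None

    def delete(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.T[j] is None:
                return
            if self.T[j] is not DELETED and self.T[j][0] == key:
                self.T[j] = DELETED  # mark, do NOT set back to None
                return

t = OpenAddressTable(5)
t.insert(1, "a")
t.insert(6, "b")         # 6 collides with 1 at slot 1, probes on to slot 2
t.delete(1)              # slot 1 becomes Deleted, not None
print(t.search(6))       # -> b  (search walks past the tombstone)
```

If delete had written None instead of DELETED, the final search would stop at slot 1 and wrongly report that 6 is absent, which is exactly the failure mode described on the deletion slide.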
Topics
✓ What is a hash table?
✓ Commonly Used: Hashing with Chaining + simple uniform hashing
✓ Other hash functions: Universal hashing
✓ Another type of storage: Open addressing
• Perfect hashing
Perfect hashing
• If the data is static (i.e., no insert / delete), we can exploit this to ensure worst-case O(1) search time
• Idea: 2-level hashing
  • Each slot points to another hash table
  • Use universal hashing at both levels
• Properties:
  • Polynomial build time with high probability
  • O(1) search time in the worst case
  • O(n) space in the worst case
How to build a perfect hash table?
1. Pick a universal hash function h₁ ∈ H and set m = Θ(n); usually m is set to be a prime number
2. If Σ_{i=0}^{m−1} n_i² > cn, for a selected constant c, redo step 1
3. If there are n_i elements in slot i, construct a hash table with n_i² slots, and choose a universal hash function h₂,ᵢ ∈ H to be the hash function for this 2nd-level hash table
4. As long as there is h₂,ᵢ(k) = h₂,ᵢ(k′) for any k ≠ k′, pick a different h₂,ᵢ and rehash the n_i elements in slot i
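The four steps above can be sketched as follows. This is an illustrative build, not the definitive one: it assumes the ((ak + b) mod p) mod m universal family from earlier, and the choices p = 10007, m = n, and c = 4 are arbitrary (any prime p larger than all keys and a suitable constant c work).

```python
import random

p = 10007   # assumed prime, larger than every key used below

def random_hash(m):
    # One random member of the universal family ((a*k + b) mod p) mod m
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect(keys, c=4):
    n = len(keys)
    m = n                                     # m = Theta(n)
    while True:
        h1 = random_hash(m)                   # step 1: pick h1
        buckets = [[] for _ in range(m)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b)**2 for b in buckets) <= c * n:
            break                             # step 2 passed; else redo step 1
    second = []
    for b in buckets:                         # steps 3-4, per slot i
        m2 = len(b)**2                        # 2nd-level table of n_i^2 slots
        while True:
            h2 = random_hash(m2) if m2 > 0 else None
            slots = [None] * m2
            ok = True
            for k in b:
                j = h2(k)
                if slots[j] is not None:      # 2nd-level collision:
                    ok = False                # pick a different h2 and rehash
                    break
                slots[j] = k
            if ok:
                break
        second.append((h2, slots))
    return h1, second

def search(table, k):
    # Worst-case O(1): exactly two hash evaluations, no probing
    h1, second = table
    h2, slots = second[h1(k)]
    return h2 is not None and slots[h2(k)] == k

keys = [3, 17, 42, 99, 123, 4567]
table = build_perfect(keys)
print(all(search(table, k) for k in keys))   # -> True
```

The quadratic 2nd-level tables make each inner retry succeed with probability at least 1/2 (a birthday-bound argument over n_i² slots), which is what keeps the expected number of retries small.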
Perfect Hashing
• Once the hash table is built, it's guaranteed there's no collision in search. Hence, it's guaranteed to have O(1) search time.
• The question is how many times we need to repeat finding the hash functions (the redo loops in steps 2 and 4)
Additional Bounds on Time
• E[#trials] ≤ 2 and #trials = O(log n) with high probability
• Total time spent in steps 3 & 4: O(n log² n)
• Total time spent in steps 1 & 2: O(n log n)
• We will skip the proof
Topics
✓ What is a hash table?
✓ Commonly Used: Hashing with Chaining + simple uniform hashing
✓ Other hash functions: Universal hashing
✓ Another type of storage: Open addressing
✓ Perfect hashing
Next: Algorithm Design Techniques