Hash tables
Dictionary
Definition
A dictionary is a data structure that stores a set S of elements, where each element has a unique key, and supports the following operations:
Search(S, k): Return the element whose key is k.
Insert(S, x): Add x to S.
Delete(S, x): Remove x from S.
Assume that the keys are from U = {0, . . . , u − 1}.
Direct addressing
S is stored in an array T[0..u − 1]. The entry T[k] contains a pointer to the element with key k, if such an element exists, and NULL otherwise.
Search(T, k): return T[k].
Insert(T, x): T[x.key] ← x.
Delete(T, x): T[x.key] ← NULL.
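The three operations above can be sketched in Python as follows. This is only an illustration: the class name, the (key, value) pair standing in for an element with satellite data, and the sample keys 2, 3, 5, 8 are assumptions made for the example.

```python
class DirectAddressTable:
    """Direct-address table for integer keys in range(u)."""

    def __init__(self, u):
        # slots[k] holds the element with key k, or None if there is none
        self.slots = [None] * u

    def search(self, key):
        return self.slots[key]

    def insert(self, key, value):
        # The element is modelled as a (key, satellite data) pair
        self.slots[key] = (key, value)

    def delete(self, key):
        self.slots[key] = None


table = DirectAddressTable(10)
for key in (2, 3, 5, 8):
    table.insert(key, f"data-{key}")
```

Every operation is a single array access, but the array always has u slots, no matter how few elements are stored.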
What is the problem with this structure?
[Figure: a direct-address table T[0..9] for U = {0, . . . , 9} and S = {2, 3, 5, 8}. Each used slot points to an element consisting of a key and satellite data.]
Hashing
In order to reduce the space complexity of direct addressing, map the keys to a smaller range {0, . . . , m − 1} using a hash function.
There is a problem of collisions (two or more keys that are mapped to the same value).
There are several methods to handle collisions:
Chaining
Open addressing
Hash table with chaining
Let h : U → {0, . . . , m − 1} be a hash function (m < u).
S is stored in a table T[0..m − 1] of linked lists. The element x ∈ S is stored in the list T[h(x.key)].
Search(T , k): Search the list T [h(k)].
Insert(T , x): Insert x at the head of T [h(x .key)].
Delete(T , x): Delete x from the list containing x .
Example: S = {30, 9, 26, 19, 1, 6}, inserted in this order, with m = 5 and h(x) = x mod 5. Since each element is inserted at the head of its list, the table after all insertions is:
T[0]: 30
T[1]: 6 → 1 → 26
T[2]: empty
T[3]: empty
T[4]: 19 → 9
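A minimal Python sketch of a hash table with chaining, using the example keys and h(x) = x mod 5 from above. Python lists stand in for the linked lists and the class name is illustrative, so this is a sketch of the idea rather than a tuned implementation.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        # One chain per slot; a Python list stands in for a linked list
        self.table = [[] for _ in range(m)]

    def _h(self, key):
        return key % self.m

    def insert(self, key):
        # Insert at the head of the chain (O(1) with a real linked list;
        # list.insert(0, ...) is used here only for clarity)
        self.table[self._h(key)].insert(0, key)

    def search(self, key):
        return key in self.table[self._h(key)]

    def delete(self, key):
        # Raises ValueError if the key is not present
        self.table[self._h(key)].remove(key)


t = ChainedHashTable(5)
for key in [30, 9, 26, 19, 1, 6]:
    t.insert(key)
```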
Analysis — Worst case
The worst case running times of the operations are:
Search: Θ(n).
Insert: Θ(1).
Delete: Θ(n) (can be reduced to Θ(1) using a doubly linked list).
Analysis
Assumption of simple uniform hashing: any element is equally likely to hash into any of the m slots, independently of where other elements have hashed into.
The above assumption is true when the keys are chosen uniformly and independently at random (with repetitions), h(x) = x mod m, and u is divisible by m.
We want to analyze the performance of hashing under the assumption of simple uniform hashing. This is the balls into bins problem.
Suppose we randomly place n balls into m bins. Let X be the number of balls in bin 0.
The expected time of an unsuccessful search in a hash table is Θ(1 + E[X]).
The expectation of X
Each ball has probability 1/m to be in bin 0.
The random variable X has binomial distribution withparameters n and 1/m.
Therefore, E [X ] = n/m.
The expected time of an unsuccessful search in a hash table is Θ(1 + α), where α = n/m.
The value α is called the load factor.
The expected time of searching a random key from S is also Θ(1 + α).
Example
If α = 1 and n is large:
Pr[X = 0] ≈ 0.368 Pr[X = 4] ≈ 0.015
Pr[X = 1] ≈ 0.368 Pr[X = 5] ≈ 0.003
Pr[X = 2] ≈ 0.184 Pr[X = 6] ≈ 0.0005
Pr[X = 3] ≈ 0.061 Pr[X = 7] ≈ 0.00007
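The values in this example can be reproduced numerically. For α = 1 and large n, the binomial distribution with parameters n and 1/n is close to a Poisson distribution with mean 1, so Pr[X = k] ≈ e^(−1)/k!. The sketch below compares the two; the function names and the choice n = 10^6 are illustrative.

```python
import math


def poisson_pmf(k, lam=1.0):
    """Pr[X = k] for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def binomial_pmf(k, n, p):
    """Pr[X = k] for a binomial variable with parameters n and p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


for k in range(8):
    print(k, round(poisson_pmf(k), 5), round(binomial_pmf(k, 10**6, 1e-6), 5))
```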
The maximum number of balls in a bin
Suppose that n = m = 10^6. Most bins contain 0–3 balls.
The probability that a specific bin contains at least 8 balls is ≈ 0.000009.
However, with 10^6 bins it is very likely that some bin will contain at least 8 balls.
Theorem
If m ∈ Θ(n) then the size of the largest bin is Θ(log n / log log n) with probability at least $1 - 1/n^{\Theta(1)}$.
Universal hash functions
Random keys + fixed hash function (e.g. mod) ⇒ The hashed keys are random numbers.
A fixed set of keys + random hash function (selected from a universal set) ⇒ The hashed keys are semi-random numbers.
Definition
A collection H of hash functions is universal if for every pair of distinct keys x, y ∈ U, $\Pr_{h \in H}[h(x) = h(y)] \le 1/m$.
Example
Let p be a prime number larger than u.
$f_{a,b}(x) = ((ax + b) \bmod p) \bmod m$
$H_{p,m} = \{ f_{a,b} \mid a \in \{1, 2, \ldots, p-1\},\ b \in \{0, 1, \ldots, p-1\} \}$
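The family in this example can be sketched in Python. The first function draws a random member of the family; the second enumerates all pairs (a, b) and returns the exact collision probability of two keys. The function names and the parameters p = 101, m = 10 used in the usage lines are assumptions chosen to keep the enumeration small.

```python
import random
from fractions import Fraction


def make_universal_hash(p, m, rng=random):
    """Draw f_{a,b}(x) = ((a*x + b) mod p) mod m with random a and b."""
    a = rng.randrange(1, p)  # a in {1, ..., p-1}
    b = rng.randrange(0, p)  # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m


def collision_probability(x, y, p, m):
    """Exact Pr[h(x) = h(y)] over all choices of (a, b)."""
    hits = 0
    total = 0
    for a in range(1, p):
        for b in range(p):
            total += 1
            if ((a * x + b) % p) % m == ((a * y + b) % p) % m:
                hits += 1
    return Fraction(hits, total)


h = make_universal_hash(101, 10, random.Random(0))
prob = collision_probability(17, 42, 101, 10)
```

For distinct keys smaller than p, the computed probability should not exceed 1/m.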
Theorem
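Suppose that H is a universal collection of hash functions. If a hash table for S is built using a randomly chosen h ∈ H, then for every k ∈ U, the expected length of T[h(k)] is at most 1 + n/m.
Proof.
Let X = length of T[h(k)]. Then $X = \sum_{y \in S} I_y$, where $I_y = 1$ if h(y.key) = h(k) and $I_y = 0$ otherwise. By linearity of expectation,
$$E[X] = E\Big[\sum_{y \in S} I_y\Big] = \sum_{y \in S} E[I_y] = \sum_{y \in S} \Pr_{h \in H}[h(y.key) = h(k)].$$
If k ∉ S, every term in the sum is at most 1/m, so $E[X] \le n \cdot \frac{1}{m}$. Otherwise, the term of the element whose key is k equals 1 and each of the other n − 1 terms is at most 1/m, so $E[X] \le 1 + (n - 1) \cdot \frac{1}{m}$.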
Rehashing
We wish to maintain n ∈ O(m) in order to have Θ(1) expected search time.
This can be achieved by rehashing. Suppose we want n ≤ m. When the table has m elements and a new element is inserted, create a new table of size 2m and copy all elements into the new table. The cost of rehashing is Θ(n).
[Figure: a chained table with m = 5 and h(x) = x mod 5 holding 30, 6, 21, 19, 34 is rehashed into a table with m = 10 and h′(x) = x mod 10.]
In order to keep the space usage Θ(n), we need n ∈ Ω(m).
We can keep α ∈ [1/4, 1]. If an insert would make α > 1, create a new table of size 2m. If a delete would make α < 1/4, create a new table of size m/2.
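The resizing rule can be sketched on top of a chained table. This is a minimal illustration, assuming integer keys, h(x) = x mod m, and a minimum table size of 4 (the minimum size and all names are assumptions, not part of the slides).

```python
MIN_SIZE = 4  # assumed lower bound on the table size


class ResizingChainedTable:
    def __init__(self, m=MIN_SIZE):
        self.m = m
        self.n = 0
        self.table = [[] for _ in range(m)]

    def _rehash(self, new_m):
        # Theta(n): move every key into a fresh table of size new_m
        keys = [key for chain in self.table for key in chain]
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for key in keys:
            self.table[key % new_m].insert(0, key)

    def insert(self, key):
        if self.n + 1 > self.m:  # the insert would make alpha > 1
            self._rehash(2 * self.m)
        self.table[key % self.m].insert(0, key)
        self.n += 1

    def delete(self, key):
        self.table[key % self.m].remove(key)
        self.n -= 1
        if self.m > MIN_SIZE and 4 * self.n < self.m:  # alpha < 1/4
            self._rehash(self.m // 2)


t = ResizingChainedTable()
for key in range(9):
    t.insert(key)
```

After these nine insertions the table has doubled twice, from 4 to 8 to 16 slots.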
Time complexity of chaining
Under the assumption of simple uniform hashing, the expected time of a search is Θ(1 + α).
If α = Θ(1), and under the assumption of simple uniform hashing, the worst case time of a search is Θ(log n / log log n), with probability at least $1 - 1/n^{\Theta(1)}$.
If the hash function is chosen from a universal collection at random, the expected time of a search is Θ(1 + α).
The worst case time of insert is Θ(1) if there is no rehashing.
Using doubly linked lists, the worst case time of delete is Θ(1).
Linear probing
In the following, we assume that the elements in the hash table are keys with no satellite information.
To insert element k, try inserting to T[h(k)]. If T[h(k)] is not empty, try T[(h(k) + 1) mod m], then try T[(h(k) + 2) mod m], etc.
Insert(T, k)
(1) j ← h(k)
(2) for i = 0 to m − 1
(3)     if T[j] = NULL
(4)         T[j] ← k
(5)         return
(6)     j ← (j + 1) mod m
(7) error “hash table overflow”
Example with m = 10 and h(k) = k mod 10, inserting 14, 20, 34, 19, 55, 49 in this order:
insert(T, 14): 14 goes to slot 4.
insert(T, 20): 20 goes to slot 0.
insert(T, 34): slot 4 is occupied, so 34 goes to slot 5.
insert(T, 19): 19 goes to slot 9.
insert(T, 55): slot 5 is occupied, so 55 goes to slot 6.
insert(T, 49): slots 9 and 0 are occupied, so 49 goes to slot 1.
Searching
Search(T, k)
    j ← h(k)
    for i = 0 to m − 1
        if T[j] = k
            return j
        if T[j] = NULL
            return NULL
        j ← (j + 1) mod m
    return NULL
Example with m = 10, h(k) = k mod 10 and the keys 20, 49, 14, 34, 55, 19 stored in slots 0, 1, 4, 5, 6, 9:
search(T, 55) probes slot 5 (which holds 34), then slot 6, and returns 6.
search(T, 25) probes slots 5 and 6, then reaches the empty slot 7 and returns NULL.
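The Insert and Search procedures for linear probing can be written in Python roughly as follows. The sketch assumes integer keys and h(k) = k mod m, with None playing the role of NULL; the function names are illustrative.

```python
EMPTY = None  # plays the role of NULL


def lp_insert(T, k):
    """Insert key k using linear probing; return the slot used."""
    m = len(T)
    j = k % m
    for _ in range(m):
        if T[j] is EMPTY:
            T[j] = k
            return j
        j = (j + 1) % m
    raise OverflowError("hash table overflow")


def lp_search(T, k):
    """Return the slot holding k, or None if k is not in the table."""
    m = len(T)
    j = k % m
    for _ in range(m):
        if T[j] == k:
            return j
        if T[j] is EMPTY:
            return None
        j = (j + 1) % m
    return None


T = [EMPTY] * 10
for key in [14, 20, 34, 19, 55, 49]:
    lp_insert(T, key)
```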
Deletion
If we want to delete the element T[i], we cannot simply write NULL into T[i]: a later search for a key whose probe sequence passes through slot i would stop at the empty slot and fail.
[Figure: a table with m = 10 and h(k) = k mod 10 holding 20, 49, 23, 14, 34, 55, 19 in slots 0, 1, 3, 4, 5, 6, 9. If delete 14 writes NULL into slot 4, a search for 34 starts at the empty slot 4 and wrongly returns NULL.]
Delete method 1: To delete element T[i], store in T[i] a special value DELETED.
Change line (3) in Insert to:
    if T[j] = NULL OR T[j] = DELETED
[Figure: after delete 14, slot 4 holds the marker D.]
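Delete method 1 can be sketched in Python as follows, assuming integer keys and h(k) = k mod m. The DELETED marker is modelled by a unique sentinel object, and the function names are illustrative.

```python
EMPTY = None
DELETED = object()  # sentinel marking a slot whose key was deleted


def insert(T, k):
    m = len(T)
    j = k % m
    for _ in range(m):
        if T[j] is EMPTY or T[j] is DELETED:  # the modified test
            T[j] = k
            return j
        j = (j + 1) % m
    raise OverflowError("hash table overflow")


def search(T, k):
    m = len(T)
    j = k % m
    for _ in range(m):
        if T[j] is EMPTY:
            return None
        # A DELETED slot does not stop the search
        if T[j] is not DELETED and T[j] == k:
            return j
        j = (j + 1) % m
    return None


def delete(T, k):
    j = search(T, k)
    if j is not None:
        T[j] = DELETED


T = [EMPTY] * 10
for key in [14, 20, 34, 19, 55, 49, 23]:
    insert(T, key)
delete(T, 14)
```

After the deletion, a search for 34 passes over the marker in slot 4 and still finds 34 in slot 5.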
Delete method 2: Erase T[i] from the table (replace it by NULL) and also erase the consecutive block of elements following T[i]. Then, reinsert the latter elements to the table.
[Figure: in the same table, delete 14 erases slot 4 and the block 34, 55 in slots 5 and 6. Reinserting 34 places it in slot 4, and reinserting 55 places it in slot 5.]
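Delete method 2 can be sketched in Python as follows, again assuming integer keys, h(k) = k mod m, and None as NULL. The function names are illustrative.

```python
def insert(T, k):
    m = len(T)
    j = k % m
    for _ in range(m):
        if T[j] is None:
            T[j] = k
            return
        j = (j + 1) % m
    raise OverflowError("hash table overflow")


def delete(T, k):
    """Remove k, then erase and reinsert the block that follows it."""
    m = len(T)
    j = k % m
    # Locate k by linear probing
    for _ in range(m):
        if T[j] is None:
            return False
        if T[j] == k:
            break
        j = (j + 1) % m
    else:
        return False
    T[j] = None
    # Erase the consecutive block after slot j, remembering its keys.
    # The loop stops at the latest when it wraps around to slot j.
    block = []
    j = (j + 1) % m
    while T[j] is not None:
        block.append(T[j])
        T[j] = None
        j = (j + 1) % m
    for key in block:
        insert(T, key)
    return True


T = [None] * 10
for key in [14, 20, 34, 19, 55, 49, 23]:
    insert(T, key)
delete(T, 14)
```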
Time complexity of linear probing
Assuming simple uniform hashing and that α ≤ 1 − ε for some fixed constant ε > 0:
The expected time of search is Θ(1).
The expected time of insert/delete without rehashing is Θ(1).
Open addressing
Open addressing is a generalization of linear probing.
Let h : U × {0, . . . , m − 1} → {0, . . . , m − 1} be a hash function such that h(k, 0), h(k, 1), . . . , h(k, m − 1) is a permutation of 0, . . . , m − 1 for every k ∈ U.
The slots examined during search/insert are h(k, 0), then h(k, 1), h(k, 2), etc.
In the example in the figure, h(k, i) = (h1(k) + i · h2(k)) mod 13, where
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
[Figure: a table of size 13 holding the keys 79, 69, 98, 72, 50. insert(T, 14) probes h(14, 0) = 1, h(14, 1) = 5 and h(14, 2) = 9, and stores 14 in slot 9.]
Open addressing
Insertion is the same as in linear probing, except that the i-th probed slot is h(k, i):
Insert(T, k)
    for i = 0 to m − 1
        j ← h(k, i)
        if T[j] = NULL OR T[j] = DELETED
            T[j] ← k
            return
    error “hash table overflow”
Deletion is done using delete method 1 defined above (using the special value DELETED).
Double hashing
In the double hashing method,
h(k , i) = (h1(k) + ih2(k)) mod m
for some hash functions h1 and h2.
The value h2(k) must be relatively prime to m for the entire hash table to be searched. This can be ensured by either
taking m to be a power of 2 and choosing h2 so that its image contains only odd numbers, or
taking m to be a prime number and choosing h2 so that its image is contained in {1, . . . , m − 1}.
For example,
h1(k) = k mod m
h2(k) = 1 + (k mod m′)
where m is prime and m′ < m.
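The probe sequence of double hashing is easy to compute. The sketch below uses m = 13 and m′ = 11 as default values (an assumption made for the example); the function name is illustrative.

```python
def probe_sequence(k, m=13, m2=11):
    """Slots h(k, 0), ..., h(k, m-1) for double hashing.

    h1(k) = k mod m and h2(k) = 1 + (k mod m2), with m prime and m2 < m.
    """
    h1 = k % m
    h2 = 1 + (k % m2)
    return [(h1 + i * h2) % m for i in range(m)]


seq = probe_sequence(14)
```

Because m is prime and h2(k) lies in {1, . . . , m − 1}, each sequence visits every slot exactly once.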
Comparison of hash table methods
[Figures: four plots of clock cycles (y-axis) against log n (x-axis, from 8 to 22) for Cuckoo, Two-Way Chaining, Chained Hashing, and Linear Probing. The panels are Successful Lookup, Unsuccessful Lookup, Insert, and Delete.]
Hash function for strings
A string can be viewed as a sequence of integers, where each character is replaced by its ASCII value.
For example, the string ABXY is the sequence 65, 66, 88, 89.
To hash a sequence of integers $x_0, \ldots, x_{r-1}$ from {1, . . . , u}, we use a prime number p > u. We select a random number z ∈ {0, . . . , p − 1} and use the function
$h_z(x_0, \ldots, x_{r-1}) = (x_0 z^0 + x_1 z^1 + \cdots + x_{r-1} z^{r-1}) \bmod p$.
For example, $h_{300}(65, 66, 88, 89) = (65 + 66 \cdot 300 + 88 \cdot 300^2 + 89 \cdot 300^3) \bmod p$.
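The polynomial hash can be evaluated with Horner's rule, as in the following sketch. The prime p = 2^31 − 1 and the function names are assumptions chosen for the example; any prime larger than u would do.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1 (an assumed choice of p)


def poly_hash(xs, z, p=P):
    """(x_0*z^0 + x_1*z^1 + ... + x_{r-1}*z^(r-1)) mod p."""
    h = 0
    # Horner's rule, starting from the highest-power term x_{r-1}
    for x in reversed(xs):
        h = (h * z + x) % p
    return h


def string_hash(s, z, p=P):
    return poly_hash([ord(c) for c in s], z, p)


z = random.randrange(P)  # random z in {0, ..., p-1}
value = string_hash("ABXY", 300)
```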
Theorem
Given two different sequences $x_0, \ldots, x_{r-1}$ and $y_0, \ldots, y_{r-1}$ over {1, . . . , u},
$\Pr[h_z(x_0, \ldots, x_{r-1}) = h_z(y_0, \ldots, y_{r-1})] \le (r - 1)/p$ when z is chosen randomly from {0, . . . , p − 1}.
Proof.
A non-trivial polynomial of degree d over the field $Z_p$ has at most d roots. The polynomial
$Q(z) = h_z(x_0, \ldots, x_{r-1}) - h_z(y_0, \ldots, y_{r-1}) = (x_0 - y_0) z^0 + \cdots + (x_{r-1} - y_{r-1}) z^{r-1}$
is non-trivial, because the sequences differ in some position and all values are smaller than p. Therefore, Q has at most r − 1 roots over the field $Z_p$.
The probability of choosing z to be a root of Q is ≤ (r − 1)/p.
Applications of hash functions
Data deduplication: Suppose that we have many files, and some files have duplicates. In order to save storage space, we want to store only one instance of each distinct file.
Rabin-Karp pattern matching algorithm: To find the occurrences of a string P of length m in a string T, compute a hash function on every substring of T of length m and compare with h(P).
Distributed storage: Storing files on several servers.
Rabin-Karp pattern matching algorithm
Suppose we are given strings P and T, and we want to find all occurrences of P in T.
The Rabin-Karp algorithm is as follows:
    Compute h(P)
    for every substring T′ of T of length |P|
        if h(T′) = h(P) check whether T′ = P.
The values h(T′) for all T′ can be computed in Θ(|T|) time using a rolling hash.
Example
Let T = CABXYZ, P = ABXY.
$h_{300}(CABX) = (67 + 65 \cdot 300 + 66 \cdot 300^2 + 88 \cdot 300^3) \bmod p$
$h_{300}(ABXY) = (65 + 66 \cdot 300 + 88 \cdot 300^2 + 89 \cdot 300^3) \bmod p = ((h_{300}(CABX) - 67)/300 + 89 \cdot 300^3) \bmod p$
Here the division by 300 is carried out in $Z_p$, that is, by multiplying with the inverse of 300 modulo p.
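A minimal Python sketch of Rabin-Karp with this rolling hash follows. The values z = 300 and p = 2^31 − 1 are assumed defaults, and the three-argument pow with exponent −1 (Python 3.8 or later) computes the inverse of z modulo p.

```python
def rabin_karp(T, P, z=300, p=2_147_483_647):
    """Return the start indices of all occurrences of P in T."""
    r = len(P)
    if r == 0 or r > len(T):
        return []
    z_inv = pow(z, -1, p)  # inverse of z modulo p
    top = pow(z, r - 1, p)  # z^(r-1) mod p, weight of the last character

    def h(s):
        v = 0
        for c in reversed(s):  # Horner's rule, lowest power on s[0]
            v = (v * z + ord(c)) % p
        return v

    hp = h(P)
    hw = h(T[:r])  # hash of the current window
    matches = []
    for i in range(len(T) - r + 1):
        # Compare the strings only when the hashes agree
        if hw == hp and T[i:i + r] == P:
            matches.append(i)
        if i + r < len(T):
            # Drop T[i], divide by z, append T[i + r] with weight z^(r-1)
            hw = ((hw - ord(T[i])) * z_inv + ord(T[i + r]) * top) % p
    return matches


found = rabin_karp("CABXYZ", "ABXY")
```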
Distributed storage
Suppose we have many files, and we want to store them on several servers.
Let m denote the number of servers.
The simple solution is to use a hash function h : U → {1, . . . , m}, and assign file x to server h(x).
The problem with this solution is that if we add or remove a server, we need to do rehashing, which will move most files between servers.
Distributed storage: Consistent hashing
Suppose that the servers have identifiers s1, . . . , sm.
Let h : U → [0, 1] be a hash function.
For each server i, associate a point h(si) on the unit circle.
For each file f, assign f to the server whose point is the first point encountered when traversing the unit circle anti-clockwise starting from h(f).
[Figure: the unit circle with the point 0 marked and the points of server 1, server 3 and server 2.]
When a new server m + 1 is added, let i be the server whose point is the first server point after h(s_{m+1}).
We only need to reassign some of the files that were assigned to server i.
The expected number of file reassignments is n/(m + 1), where n is the number of files.
[Figure: the unit circle after the point of server 4 is added.]
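Consistent hashing can be sketched in Python with a sorted list of server points. In this illustration, SHA-256 stands in for the hash function h, increasing point value plays the role of the traversal direction, and the class and function names are hypothetical.

```python
import bisect
import hashlib


def point(name):
    """Map a string to a point in [0, 1] (SHA-256 is an assumed choice of h)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ConsistentHashRing:
    def __init__(self, servers=()):
        self.ring = []  # sorted list of (point, server) pairs
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        bisect.insort(self.ring, (point(server), server))

    def remove_server(self, server):
        self.ring.remove((point(server), server))

    def lookup(self, filename):
        # First server point at or after h(f), wrapping around the circle
        i = bisect.bisect_left(self.ring, (point(filename), ""))
        return self.ring[i % len(self.ring)][1]


ring = ConsistentHashRing(["s1", "s2", "s3"])
owner = ring.lookup("report.pdf")
```

Adding a server only moves files onto the new server; all other assignments stay as they were.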
Bloom Filter
Blocking malicious web pages
Suppose we want to implement blocking of malicious web pages in a web browser.
Assume we have a list of 10,000,000 malicious pages.
We can store all pages in a hash table, but this requires a large amount of memory.
A Bloom filter is a solution for this problem that uses a small amount of memory.
A Bloom filter stores a set of elements and supports Membership and Insert operations.
For an element that is in the set, the Membership operation always returns a positive answer.
For an element that is not in the set, the Membership operation can return either a negative answer, or a positive answer (false positive).
Suppose that the list of malicious pages is stored in a Bloom filter.
For a query URL, if the answer is negative, we know that the URL is not a malicious page.
If the answer is positive, we do not know the correct answer. In this case, we get the answer from a server.
We want a small probability for a false positive.
Bloom filter — simple version
Suppose we want to store a set S.
We use an array A of size m initialized to 0.
We then choose a hash function h, and set A[h(x)] ← 1 for every x ∈ S.
For a query y, if A[h(y)] = 0, we know that y ∉ S.
If A[h(y)] = 1, either y is in S or not.
[Figure: a bit array A of size m = 10, indexed 0 to 9 and initially all 0. Inserting the elements of S sets the corresponding bits to 1.]
Probability of false positive
Assume simple uniform hashing: any element is equally likely to hash into any of the m slots, independently of where other elements have hashed into.
For fixed i, the probability that A[i] = 0 is $p = (1 - 1/m)^n$.
For y ∉ S, the probability that A[h(y)] = 1 is 1 − p.
Reducing the false positive probability
We choose t hash functions h1, h2, . . . , ht, and set A[hj(x)] ← 1 for j = 1, 2, . . . , t, for every x ∈ S.
For a query y, if A[hj(y)] = 0 for some j, we know that y ∉ S.
If A[hj(y)] = 1 for all j, either y is in S or not.
[Figure: the same bit array of size 10 with t hash functions. Each inserted element sets up to t bits to 1.]
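A Bloom filter with t hash functions can be sketched in Python as follows. The t hash functions are simulated by hashing the item together with the index j using SHA-256; this is an assumption made for the example, as are the class and method names.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, t):
        self.m = m  # number of bits
        self.t = t  # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # h_j(item) is simulated by hashing the pair (j, item)
        for j in range(self.t):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set";
        # True means "in the set, or a false positive"
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, t=3)
for url in ["a.example", "b.example"]:
    bf.insert(url)
```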
Probability of false positive
For fixed i, the probability that A[i] = 0 is $p = (1 - 1/m)^{tn}$.
For y ∉ S, the probability that A[hj(y)] = 1 for all j (a false positive) is approximately $(1 - p)^t$.
Optimizing the probability of false positive
For the previous example (n = 2, m = 10):
For t = 1, p = 0.81, $(1 - p)^1$ = 0.19.
For t = 2, p ≈ 0.656, $(1 - p)^2$ ≈ 0.118.
For t = 3, p ≈ 0.531, $(1 - p)^3$ ≈ 0.103.
For t = 4, p ≈ 0.430, $(1 - p)^4$ ≈ 0.105.
For t = 5, p ≈ 0.349, $(1 - p)^5$ ≈ 0.117.
Let f(t) be the probability of false positive when using t hash functions.
$p = (1 - 1/m)^{tn} \approx e^{-tn/m}$
$f(t) = (1 - p)^t \approx (1 - e^{-tn/m})^t = e^{t \ln(1 - e^{-tn/m})}$
Minimizing f(t) is equivalent to minimizing $g(t) = t \ln(1 - e^{-tn/m})$.
$\frac{dg}{dt} = \ln(1 - e^{-tn/m}) + \frac{tn}{m} \cdot \frac{e^{-tn/m}}{1 - e^{-tn/m}}$
The minimum of g(t) is when dg/dt = 0 ⟹ $t = \frac{m}{n} \ln 2$.
The minimum of f(t) is ≈ $0.6185^{m/n}$.
Example: For m = 8n, $\frac{m}{n} \ln 2$ ≈ 5.55. f(5) ≈ 0.0217, f(6) ≈ 0.0216.
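The false positive probabilities quoted above can be checked numerically. The sketch below evaluates the formula (1 − p)^t, its exponential approximation, and the optimal number of hash functions; the function names are illustrative.

```python
import math


def false_positive(t, n, m):
    """(1 - p)^t with p = (1 - 1/m)^(t*n)."""
    p = (1 - 1 / m) ** (t * n)
    return (1 - p) ** t


def false_positive_approx(t, n, m):
    """The approximation (1 - e^(-t*n/m))^t."""
    return (1 - math.exp(-t * n / m)) ** t


def optimal_t(n, m):
    """The minimizer t = (m/n) * ln 2 of the approximation."""
    return (m / n) * math.log(2)


# The example with n = 2 and m = 10
for t in range(1, 6):
    print(t, round(false_positive(t, 2, 10), 3))
```

Since t must be an integer, in practice one compares the two integers around (m/n) ln 2, as in the example with m = 8n.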