46
Chapter 17 Hashing

Ch17 Hashing

Embed Size (px)

Citation preview

Page 1: Ch17 Hashing

Chapter 17

Hashing

Page 2: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-2

Chapter Objectives

• Define hashing

• Examine various hashing functions

• Examine the problem of collisions in hash tables

• Explore the Java Collections API implementations of hashing

Page 3: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-3

Hashing

• In the collections we have discussed thus far, we have made one of three assumptions regarding order:– Order is unimportant

– Order is determined by the way elements are added

– Order is determined by comparing the values of the elements

Page 4: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-4

Hashing

• Hashing is the concept that order is determined by some function of the value of the element to be stored– More precisely, the location within the

collection is determined by some function of the value of the element to be stored

Page 5: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-5

Hashing

• In hashing, elements are stored in a hash table with their location in the table determined by a hashing function

• Each location in the hash table is called a cell or a bucket

Page 6: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-6

Hashing

• Consider a simple example where we create an array that will hold 26 elements

• Wishing to store names in our array, we create a simple hashing function that uses the first letter of each name

Page 7: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-7

FIGURE 17.1 A simple hashing example

Page 8: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-8

Hashing

• Notice that using a hashing approach results in the access time to a particular element being independent of the number of elements in the table

• This means that all of the operations on an element of a hash table should be O(1)

• However, this efficiency is only realized if each element maps to a unique position in the table

Page 9: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-9

Hashing

• A collision occurs when two or more elements map to the same location

• A hashing function that maps each element to a unique location is called perfect hashing function

• While it is possible to develop a perfect hashing function, a function that does a good job of distributing the elements in the table will still result in O(1) operations

Page 10: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-10

Hashing

• Another issue surrounding hashing is how large the table should be

• If we are assured of a dataset of size n and a perfect hashing function, then the table should be of size n

• If a perfect hashing function is not available but we know the size of the dataset (n), then a good rule of thumb is to make the table 150% the size of the dataset

Page 11: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-11

Hashing

• The third case is the case where we do not know the size of the dataset

• In this case, we depend upon dynamic resizing

• Dynamic resizing of a hash table involves creating a new, larger, hash table and then inserting all of the elements of the old table into the new one

Page 12: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-12

Hashing

• Deciding when to resize is the key to dynamic resizing

• One possibility is simply to resize when the table is full– However, it is the nature of hash tables that their

performance seriously degrades as they become full

• A better approach is to use a load factor

• The load factor of a hash table is the percentage occupancy of the table at which the table will be resized

Page 13: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-13

Hashing Functions - Extraction

• There are a wide variety of hashing functions that provide good distribution of various types of data

• The method we used in our earlier example is called extraction

• Extraction involves using only a part of an element’s value or key to compute the location at which to store the element

• In our previous example, we simply extracted the first letter of a string and computed its value relative to the letter A

Page 14: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-14

Hashing Functions - Division

• Creating a hashing function by division simply means we will use the remainder of the key divided by some positive integer pHashcode(key) = Math.abs(key)%p

• This function will yield a result in the range 0 to p-1

• If we use our table size as p, we then have an index that maps directly to the table

Page 15: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-15

Hashing Functions - Folding

• In the folding method, the key is divided into two parts that are then combined or folded together to create an index into the table

• This is done by first dividing the key into parts where each of the parts of the key will be the same length as the desired key

• In the shift folding method, these parts are then added together to create the index– Using the SSN 987-65-4321 we divide into three parts 987,

654, 321, and then add these together to get 1962

– We would then use either division or extraction to get a three digit index

Page 16: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-16

Hashing Functions - Folding

• In the boundary folding method, some parts of the key are reversed before adding

• Using the same example of a SSN 987-65-4321– divide the SSN into parts 987, 654, 321

– reverse the middle element 987, 456, 321

– add yielding 1764

– then use either division or extraction to get a three digit index

• Folding may also be used on strings

Page 17: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-17

Hashing Functions - Mid-square

• In the mid-square method, the key is multiplied by itself and then the extraction method is used to extract the needed number of digits from the middle of the result

• For example, if our key is 4321– Multiply the key by itself yielding 18671041

– Extract the needed three digits

• It is critical that the same three digits by extract each time

• We may also extract bits and then reconstruct an index from the bits

Page 18: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-18

Hashing Functions - Radix Transformation• In the radix transformation method, the key

is transformed into another numeric base

• For example, if our key is 23 in base 10, we might convert it to 32 in base 7

• We then use the division method and divide the converted key by the table size and use the remainder as our index

Page 19: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-19

Hashing Functions - Digit Analysis• In the digit analysis method, the index is formed by

extracting, and then manipulating specific digits from the key

• For example, if our key is 1234567, we might select the digits in positions 2 through 4 yielding 234

• The manipulation can then take many forms– Reversing the digits (432)

– Performing a circular shift to the right (423)

– Performing a circular shift to the left (342)

– Swapping each pair of digits (324)

Page 20: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-20

Hashing Functions - Length-Dependent• In the length-dependent method, the key and the

length of the key are combined in some way to form either the index itself or an intermediate version

• For example, if our key is 8765, we might multiply the first two digits by the length and then divide by the last digit yielding 69

• If our table size is 43, we would then use the division method resulting in an index of 26

Page 21: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-21

Hashing Functions in the Java Language• The java.lang.Object class defines a method called

hashcode that returns an integer based on the memory location of the object

• This is generally not very useful

• Class that are derived from Object often override the inherited definition of hashcode to provide their own version

• For example, the String and Integer classes define their own hashcode methods

• These more specific hashcode functions are more effective

Page 22: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-22

Hashing - Resolving Collisions

• If we are able to develop a perfect hashing function, then we do not need to be concerned about collisions or table size

• However, often we do not know the size of the dataset and are not able to develop a perfect hashing function

• In these cases, we must choose a method for handling collisions

Page 23: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-23

Resolving Collisions - Chaining

• The chaining method simply treats the hash table conceptually as a table of collections rather than a table of individual elements

• Thus each cell is a pointer to the collection associated with that location in the table

• Usually, this internal collection is either an unordered list or an ordered list

• Chaining can be accomplished using a linked structure or an array based structure with an overflow area

Page 24: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-24

FIGURE 17.2 The chaining method of collision handling

Page 25: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-25

FIGURE 17.3 Chaining using an overflow area

Page 26: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-26

Resolving Collisions - Open Addressing

• The open addressing method looks for another open position in the table other than the one to which the element is originally hashed

• The simplest of these methods is linear probing

• In linear probing, if an element hashes to position p and that position is occupied we simply try position (p+1)%s where s is the size of the table

• One problem with linear probing is the development of clusters of occupied cells called primary clusters

Page 27: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-27

FIGURE 17.4 Open addressing using linear probing

Page 28: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-28

Resolving Collisions - Open Addressing

• A second form of open addressing is quadratic probing

• In this method, once we have a collision, the next index to try is computed bynewhashcode(x) = hashcode(x) + (-1)i-1((i+1)/2)2

• Quadratic probing helps to eliminate the problem of primary clusters

Page 29: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-29

FIGURE 17.5 Open addressing using quadratic probing

Page 30: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-30

Resolving Collisons - Open Addressing

• A third form of the open addressing method is double hashing

• In this method, collisions are resolved by supplying a secondary hashing function

• For example, if a key x hashes to a position p that is already occupied, then the next position p’ will be

p’ = p + secondaryhashcode(x)

• Given the same example, if we use a secondary hashing function that uses the length of the string, we get the following result

Page 31: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-31

FIGURE 17.6 Open addressing using double hashing

Page 32: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-32

Deleting Elements from a Hash Table

• Removing an element from a chained implementation falls into one of five cases

• If the element we are attempting to remove is the only one mapped to the location, we simply remove the element by setting the table position to null

Page 33: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-33

Deleting Elements from a Hash Table

• If the element we are attempting to remove is stored in the table but has an index into the overflow area– Replace the element and the next index value in

the table with the element and next index value of the array position pointed to by the element to be removed

– We then must also set the position in the overflow area to null and add it back to our list of free overflow cells

Page 34: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-34

Deleting Elements from a Hash Table

• If the element we are attempting to remove is at the end of the list of elements stored at that location the table– Set its position in the table to null

– Set the next pointer of the previous element to null

– Add that position to the list of free overflow cells

Page 35: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-35

Deleting Elements from a Hash Table

• If the element we are attempting to remove is in the middle of the list of elements stored at that location– Set its position in the overflow area to null

– Set the next pointer of the previous element to point to the next element of the element being removed

– Add the position back to the list of free overflow cells

• If the element we are attempting to remove is not in the table, then we must throw an exception

Page 36: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-36

Deleting Elements from a Hash Table

• If we have chosen to implement our table using open addressing, the deletion creates more of a challenge

• This is because we cannot remove a deleted item with possibly affecting our ability to find items that collided with it on entry

• The solution to this problem is to mark items as deleted but not actually remove them from the table until some future point when either the deleted item is overwritten or the entire table is rehashed

Page 37: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-37

FIGURE 17.7 Open addressing and deletion

Page 38: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-38

Hash Tables in the Java Collections API

• The Java Collections API provides seven implementations of hashing:– Hashtable

– HashMap

– HashSet

– IdentityHashMap

– LinkedHashSet

– LinkedHashMap

– WeakHashMap

Page 39: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-39

TABLE 17.1 Operations on the Hashtable class

Page 40: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-40

TABLE 17.1 Operations on the Hashtable class (cont.)

Page 41: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-41

TABLE 17.2 Operations on the HashSet class

Page 42: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-42

TABLE 17.3 Operations on the HashMap class

Page 43: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-43

TABLE 17.4 Operations on the IdentityHashMap class

Page 44: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-44

TABLE 17.5 Operations on the WeakHashMap class

Page 45: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-45

TABLE 17.6 Additional operations on the LinkedHashSet class

Page 46: Ch17 Hashing

Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 17-46

TABLE 17.7 Additional operations on the LinkedHashMap class