1 Chapter 11 Hash Tables Part 1. 2 / 48 Introduction Very VERY fast way to build and access tables (and records in files too). Provide nearly O(1) performance

Chapter 11 Hash Tables

Part 1

2 / 48

Introduction

Very VERY fast way to build and access tables (and records in files too).

Provide nearly O(1) performance.

Simple to do too.

Extremely fast; significantly faster than trees, which are O(log2 n)!

3 / 48

More…

Similar to arrays – which is a disadvantage.

You do need some idea of how many items will populate the hash table so that the table size can be determined.

Also quite difficult (makes no sense) to access sequentially (as you will see).

But you will know the size of the table and hence the size of your ‘database.’

4 / 48

Introduction to Hashing

Often called Randomizing…

We will have a range of (input) values (e.g. state names, account numbers, String value, etc.) that will be mapped into a range of array values.

The function that will map the key values of these records, objects, etc. to table locations (or relative file locations) is called a hash function.

So, items (objects, records, attribute, etc.) will ‘hash’ to locations in the hash table based on a value within a record / object.

Then, the record / object will be moved into and occupy those locations.

5 / 48

Introduction to Hashing - Example

Let’s take objects of type Person that have an SSAN attribute (field) in them.

We may take the last four digits, like 1234, and divide (mod) that by some prime number, say 47, such that ALL objects will hash to values between 0 and 46.

So, given an input object, we will access this key value, hash to an index between 0 and 46, and then move the entire object (or whatever it is…) into this location (indexed by the hashed-to value, 0 to 46).
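The SSAN example can be sketched in a few lines of Java. This is only an illustration: the class name SsanHashDemo, the method name hash, and the sample SSAN are made up here, and the SSAN is assumed to be stored as a non-negative int.

```java
// Sketch only: SsanHashDemo and hash() are hypothetical names, not from the book.
class SsanHashDemo {
    static final int TABLE_SIZE = 47;   // the prime divisor used in the example

    // Hash on the last four digits of a nine-digit SSAN (assumed non-negative).
    static int hash(int ssan) {
        int lastFour = ssan % 10_000;   // e.g. 123451234 -> 1234
        return lastFour % TABLE_SIZE;   // always in the range 0..46
    }

    public static void main(String[] args) {
        int ssan = 123_451_234;         // made-up SSAN ending in 1234
        System.out.println("index = " + hash(ssan));   // 1234 % 47 = 12
    }
}
```

Here 1234 = 26 × 47 + 12, so this Person object would be placed at index 12.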

6 / 48

Hashing: Exception Cases

Existing Key: Sometimes there is an attribute within an object that can be directly used as an index into the hash table without invoking a hashing function.

Examples (from your book): an employee number in an object or record, or, in a laboratory, an ‘experiment’ number attribute, which might start with a lab number or experiment number 0 or 1.

In these cases, the index number into the hash table is the same as the key value, and we don’t have to actually ‘hash’ to a table value.

We can use this integer as the index into the table we are building.

7 / 48

Introduction Hashing Algorithms

So to ultimately store an object into a hash table (or to a location on disk), we need to arrive at a number, an integer that will be an index into the hash table or the address of the record on disk.

We can start with numbers and restrict the range of acceptable values by using the mod function (remainder method).

We discussed this a couple of slides back.

But sometimes our input into the hashing algorithm may not be a number.

8 / 48

Hashing: Converting Words to Numbers

Our input into a hashing algorithm may sometimes be characters or a String. So we may want to convert them to numeric form to facilitate randomizing.

One way to convert letters to numbers is to translate each letter to its position in the alphabet: a = 1, b = 2, c = 3, …, s = 19, t = 20. Add up the values for a word such as cats (3 + 1 + 20 + 19), which yields 43. So ‘cats’ would be stored in location 43.

Unfortunately, it may be that many words might ‘hash’ to 43. These are called ‘synonyms.’

But synonyms cannot all occupy the same single space…So what to do? (Coming soon!)
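A minimal sketch of the letter-sum conversion described above. The class and method names are hypothetical, and the code assumes plain ASCII letters; any other character is simply ignored.

```java
// Sketch only: WordSumDemo and letterSum() are hypothetical names.
class WordSumDemo {
    // Sum of letter positions: a = 1, b = 2, ..., z = 26.
    static int letterSum(String word) {
        int sum = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                sum += c - 'a' + 1;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(letterSum("cats"));   // 3 + 1 + 20 + 19 = 43
        System.out.println(letterSum("acts"));   // same letters, so also 43
    }
}
```

Any rearrangement of the same letters gives the same sum, so ‘cats’ and ‘acts’ are synonyms under this scheme.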

9 / 48

Introduction Hashing Algorithms

Given a key (attribute) in an object (or record), we need a hash function that will map all possible keys in the input space into a specific location within the bounds of the hash table.

What we’d like is to hash to a table location and discover this location is unoccupied and then move the object into the table at the hashed-to location.

It just doesn’t always work that way in practice.

10 / 48

Introduction Hashing Algorithms (Collisions and Synonyms – Terminology)

When two or more keys hash to the same hash table location, we call this situation a collision.

Those elements (objects, records, …) that do hash to this location are called synonyms.

So, we need a mechanism (algorithm) to deal with synonyms.

11 / 48

Introduction Hashing Algorithms

So, practical considerations require us to map key values (including synonyms) into hash table locations that may NOT be one-to-one.

Also, all input values themselves must map to the range of hash table entries (the table size) either directly or via a hashing algorithm.

Thus, we may have something like:

arrayIndex = someNumber % arraySize

or

arrayIndex = someKeyConvertedToANumber % arraySize

12 / 48

Introduction Hashing Algorithms

arrayIndex = someNumber % arraySize is a classic example of a hashing function!

Called Division-Remainder Method.

This algorithm hashes (converts) a ‘number’ in a large range (naturally occurring values – say social security numbers) into a number in a much smaller range (the table size).

(As one might expect, there is a very large number of hashing functions available. (See Donald Knuth))

13 / 48

Hashing and Collision Example

So, we have: arrayIndex = someNumber % arraySize

So, any non-negative integer mod arraySize results in an arrayIndex in the range 0 to arraySize-1.

5280 (number of feet in a mile) mod 51 yields a number between 0 (if 51 divides 5280 evenly) and 50. Here the result is 27, since 5280 = 103 × 51 + 27.

Any non-negative integer mod 51 yields a number between 0 and 50. Period!

Again, this particular hashing algorithm is called the Division Remainder method!
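A small sketch of the Division Remainder method in Java. One caveat that the slides do not cover: Java's % operator returns a negative result for a negative left operand, so the "0 to arraySize-1" guarantee only holds for non-negative keys. The hash() helper below (a hypothetical name) uses Math.floorMod as one way to handle that case.

```java
// Sketch only: DivisionRemainderDemo and hash() are hypothetical names.
class DivisionRemainderDemo {
    // Division-remainder hash that stays in 0..arraySize-1 even for negative keys.
    static int hash(int key, int arraySize) {
        return Math.floorMod(key, arraySize);
    }

    public static void main(String[] args) {
        System.out.println(5280 % 51);       // 27
        System.out.println(-7 % 51);         // -7 (not a valid array index)
        System.out.println(hash(-7, 51));    // 44
    }
}
```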

Consider the following exercise:

14 / 48

Hashing and Collision Example

Clearly, we do not have space for all naturally-occurring numbers uniquely. Consider an attribute (key) divided by a table size of 125:

1. 42000 / 125: QUOT = 336, REM = 0 => index = 0

2. 42200 / 125: QUOT = 337, REM = 75 => index = 75

3. 42220 / 125: QUOT = 337, REM = 95 => index = 95

4. 42249 / 125: QUOT = 337, REM = 124 => index = 124

LOOKS GOOD...

ALL KEY-TO-ADDRESS TRANSFORMATIONS HASH TO ADDRESSES BETWEEN 0 AND 124.

Actually, we ‘forced’ all input attributes to hash into this range.

15 / 48

BUT: CONSIDER THE KEYS: 42125, 42250, 42375, ... ALL HASH TO THE SAME ADDRESS (0)!!!

And 40,001, 42,126, 42,251 ALL HASH TO 1. THAT IS, REL KEY = 1 or index = 1.

THUS WE WILL HAVE / may have MANY keys that will hash to the same address (location) in the hash table.

Those input keys (integers in this case) are examples of synonyms!
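The synonyms can be checked directly. The following sketch (hypothetical class and method names, table size 125 as in the example) hashes several of the keys mentioned and prints the index each one lands on.

```java
// Sketch only: SynonymDemo and hash() are hypothetical names.
class SynonymDemo {
    static int hash(int key, int tableSize) {
        return key % tableSize;   // division-remainder; keys assumed non-negative
    }

    public static void main(String[] args) {
        int tableSize = 125;
        int[] keys = {42000, 42125, 42250, 42375, 40001, 42126};
        for (int key : keys) {
            System.out.println(key + " -> index " + hash(key, tableSize));
        }
        // The first four keys all land on index 0; the last two land on index 1.
    }
}
```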

16 / 48

This oftentimes arises from a large range of input numbers mapping into much smaller target array sizes.

This is the price we pay to save space.

We cannot reserve a ‘spot’ for every possible value that just ‘might’ hash to a value in a large range. Our table would have to be much too large to be practical.

There is no guarantee a key will not hash to the same location occupied by another object.

As stated, when two or more keys hash to the same location, we have a collision.

So, what do we do? Only one object can occupy a single location in the hash table.

17 / 48

Organization of Chapter 11

Much of the remainder of this chapter is devoted to collision algorithms.

Then, much later in the chapter, your author goes back to hashing algorithms.

So, for now, only consider the division remainder hashing algorithm (dividing by a prime number).

(I realize we did not mention prime numbers, but we will get back to this later. Please accept this simple division / remainder hashing algorithm for the time being.)

18 / 48

Examples of Collision Algorithms

Open Addressing – three schemes:

Linear probing,

Quadratic probing, and

Double hashing

19 / 48

1. Open Addressing – Linear Probe

If we hash to an address that is occupied, we merely add 1 and try the next slot in the hash table, repeating (with wraparound) until we find an empty slot for the object.

Hash Algorithm: arrayIndex = key % arraySize.

All operations require numeric values or something converted to numeric values (such as a String).

This scheme will work well if the hash table is not more than half full or at most two-thirds full.

As table becomes more full, Open Addressing does have problems…
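To make the linear probe concrete, here is a small trace using a plain int array. This is a simplified sketch, not the book's HashTable class: the names are hypothetical, 0 is assumed to mean "empty" (so keys must be positive), and the table is assumed never to be full.

```java
// Sketch only: LinearProbeTrace and insert() are hypothetical names.
class LinearProbeTrace {
    // Inserts key with linear probing and returns the slot it ended up in.
    // Assumes 0 marks an empty cell and that the table is not full.
    static int insert(int[] table, int key) {
        int slot = key % table.length;          // home address
        while (table[slot] != 0) {              // occupied: probe the next cell
            slot = (slot + 1) % table.length;   // wraparound if necessary
        }
        table[slot] = key;
        return slot;
    }

    public static void main(String[] args) {
        int[] table = new int[7];
        System.out.println(insert(table, 10));  // 10 % 7 = 3, empty     -> slot 3
        System.out.println(insert(table, 17));  // 17 % 7 = 3, occupied  -> slot 4
        System.out.println(insert(table, 24));  // 24 % 7 = 3, 3 and 4 occupied -> slot 5
        System.out.println(insert(table, 4));   // 4 % 7 = 4, 4 and 5 occupied  -> slot 6
    }
}
```

Note that key 4 is not a synonym of the other three, yet it is pushed two cells away from its home address because the synonyms of 3 have spilled over into it.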

20 / 48

Open Addressing – Linear Probe

Major problems with linear probe: Clustering

If a number of items hash to the same location, the linear probe merely places them in the ‘next’ spot, which, unfortunately is the home address of some other key. But for a table not full, this may yield good performance.

As table becomes more full, the probability of clustering increases dramatically.

Performance can become quite poor as the number of table entries and table size approach each other.

Duplicates can be included – at your peril…

21 / 48

Open Addressing – Linear Probe

Clustering can result in very long probe lengths especially when the table starts to become full.

Yet, if table is large enough such that perhaps at most 2/3 to ¾ will be filled, the simplicity may be worth the potential degradation in performance.
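One way to see the effect of table fullness is to measure it. The sketch below is an illustrative experiment, not part of hash.java: the class name, method name, table size of 1009, and random seed are all arbitrary choices made here. It fills a table to several load levels with random keys and reports the average number of cells examined per insert.

```java
import java.util.Random;

// Sketch only: ProbeLengthDemo and averageProbes() are hypothetical names.
class ProbeLengthDemo {
    // Inserts itemCount random non-negative keys into an empty table using
    // linear probing; returns the average number of cells examined per insert.
    // Assumes 0 < itemCount < tableSize so every insert finds an empty cell.
    static double averageProbes(int tableSize, int itemCount, long seed) {
        boolean[] occupied = new boolean[tableSize];
        Random rng = new Random(seed);
        long cellsExamined = 0;
        for (int i = 0; i < itemCount; i++) {
            int key = rng.nextInt(Integer.MAX_VALUE);   // non-negative key
            int slot = key % tableSize;
            cellsExamined++;                            // the home cell
            while (occupied[slot]) {                    // linear probe
                slot = (slot + 1) % tableSize;
                cellsExamined++;
            }
            occupied[slot] = true;
        }
        return (double) cellsExamined / itemCount;
    }

    public static void main(String[] args) {
        int tableSize = 1009;                           // arbitrary size for the demo
        for (int percent : new int[] {50, 67, 75, 90}) {
            int items = tableSize * percent / 100;
            System.out.printf("%d%% full: %.2f cells examined per insert%n",
                              percent, averageProbes(tableSize, items, 42L));
        }
    }
}
```

The exact figures depend on the keys generated, but the average should rise noticeably as the table goes from half full toward 90% full.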

22 / 48

Example

Let’s look at some code and, as we do this, look at the find() method.

23 / 48

// hash.java
// demonstrates hash table with linear probing
// Note: this is just the data item (object here) to be stored in the hash table…
import java.io.*;                   // for the console input helpers in HashTableApp

class DataItem
{                                   // (could have more data)
    private int iData;              // data item (key)
//--------------------------------------------------------------
    public DataItem(int ii)         // constructor
    { iData = ii; }
//--------------------------------------------------------------
    public int getKey()
    { return iData; }
//--------------------------------------------------------------
}  // end class DataItem

Linear Probe Collision Algorithm

We will look at some of the methods here….

24 / 48

class HashTable
{
    private DataItem[] hashArray;    // array holds hash table
    private int arraySize;
    private DataItem nonItem;        // for deleted items
// -------------------------------------------------------------
    public HashTable(int size)       // constructor  (Observe…)
    {
        arraySize = size;
        hashArray = new DataItem[arraySize];   // creates the hash table (array)
        nonItem = new DataItem(-1);            // creates an item (object) with -1 value
    }  // end constructor
// -------------------------------------------------------------
    public void displayTable()
    {
        System.out.print("Table: ");
        for(int j=0; j<arraySize; j++)
        {
            // What does this do and how does it do it?
            if(hashArray[j] != null)
                System.out.print(hashArray[j].getKey() + " ");
            else
                System.out.print("** ");
        }  // end for
        System.out.println("");
    }  // end displayTable()
// -------------------------------------------------------------

25 / 48

    public int hashFunc(int key)
    {
        return key % arraySize;         // hash function
    }
// -------------------------------------------------------------
    public void insert(DataItem item)   // insert a DataItem
                                        // (assumes table not full)
    {
        int key = item.getKey();        // extract key from object to be inserted into table
        int hashVal = hashFunc(key);    // hash the key (hopefully, cell is empty)
        while(hashArray[hashVal] != null &&         // location is occupied...
              hashArray[hashVal].getKey() != -1)    // ...and not by a deleted item (-1)
        {                               // if true (occupied), undertake linear probe!!!
            ++hashVal;                  // go to next cell (open addressing…)
            hashVal %= arraySize;       // wraparound if necessary
        }
        hashArray[hashVal] = item;      // cell is open: insert item
    }  // end insert()
// -------------------------------------------------------------

Note: -1 is used to indicate a logically deleted item… (ahead)

26 / 48

Searching Hash Table

Must use same scheme to search as to build!!

27 / 48

Let’s look at find() (next slide)

We first need to hash the key of the item being searched for to get an index into the hash table. Thus, we call hashFunc().

Then, given the index, we look to see if the location in the hash table is empty. If empty, in this scheme, the item is not there! Return a suitable message to the client.

If non-empty, it checks to see if the value is the one we are looking for; if so, return it; if not, loop.

Continue until hit or empty slot in hash table.

Next slide has the code for find()

28 / 48

    public DataItem find(int key)       // find item with key
    {
        int hashVal = hashFunc(key);    // hash the key; get hashVal

        while(hashArray[hashVal] != null)   // until empty cell,
        {
            if(hashArray[hashVal].getKey() == key)   // found the key?
                return hashArray[hashVal];           // yes, return item; leave method

            ++hashVal;                  // go to next cell if occupied
            hashVal %= arraySize;       // wraparound if necessary
        }
        return null;                    // can't find item
    }
// -------------------------------------------------------------
}  // end class HashTable

Be aware, table could be full!!! Then what??
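One possible answer, offered as a sketch rather than the book's solution: if every cell is occupied and the key is absent, a loop that stops only at an empty cell never ends. Counting probes and giving up after one full pass over the table avoids that. The class below is a hypothetical, simplified table of Integer keys (null means "never used"), not the HashTable class from hash.java.

```java
// Sketch only: BoundedFindDemo is a hypothetical, simplified table of Integer keys.
class BoundedFindDemo {
    private final Integer[] cells;      // null = empty cell

    BoundedFindDemo(int size) {
        cells = new Integer[size];
    }

    // Returns false instead of looping forever when the table is full.
    boolean insert(int key) {
        int slot = key % cells.length;
        for (int probes = 0; probes < cells.length; probes++) {
            if (cells[slot] == null) {
                cells[slot] = key;
                return true;
            }
            slot = (slot + 1) % cells.length;   // linear probe with wraparound
        }
        return false;                            // every cell was occupied
    }

    // Stops at an empty cell OR after examining every cell once.
    Integer find(int key) {
        int slot = key % cells.length;
        for (int probes = 0; probes < cells.length; probes++) {
            if (cells[slot] == null) return null;       // empty cell: not present
            if (cells[slot] == key) return cells[slot]; // found it
            slot = (slot + 1) % cells.length;
        }
        return null;                                    // full pass, not present
    }

    public static void main(String[] args) {
        BoundedFindDemo table = new BoundedFindDemo(3);
        table.insert(3);
        table.insert(4);
        table.insert(5);                        // table is now full
        System.out.println(table.find(7));      // null, and the search terminates
        System.out.println(table.find(4));      // 4
    }
}
```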

29 / 48

Let’s look at delete()

Delete marks the item as ‘logically deleted’ by moving a -1 value (other flags / values could be used…) to the key value (often a status byte is used too…)

Later, the hash table can be reorganized where logically deleted entries are physically deleted and the hash table rebuilt…Lots of complexity here…. Not a free lunch and is likely not done immediately.

Discuss: What if there is a chain of synonyms and, say, the second one is deleted? What to do?

Do you delete an item in its ‘home address’ (thus moving ‘null’ into this location) if there are synonyms that follow?

How do you know if the next location is a synonym or the home address of a real key?
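The discussion questions can be explored with a small experiment. The sketch below uses hypothetical names and a plain int array (0 = never used, -1 = deleted, keys assumed positive). It deletes the second of three synonyms in two ways: by resetting the cell to empty, and by marking it deleted.

```java
// Sketch only: TombstoneDemo and its methods are hypothetical names.
class TombstoneDemo {
    static final int EMPTY = 0;      // cell never used (keys assumed positive)
    static final int DELETED = -1;   // logically deleted marker

    static void insert(int[] table, int key) {          // assumes table not full
        int slot = key % table.length;
        while (table[slot] != EMPTY && table[slot] != DELETED) {
            slot = (slot + 1) % table.length;
        }
        table[slot] = key;
    }

    static boolean contains(int[] table, int key) {
        int slot = key % table.length;
        for (int probes = 0; probes < table.length; probes++) {
            if (table[slot] == EMPTY) return false;      // search stops at an empty cell
            if (table[slot] == key) return true;
            slot = (slot + 1) % table.length;            // DELETED cells are skipped over
        }
        return false;
    }

    // Overwrites the cell holding key with the given marker (EMPTY or DELETED).
    static void remove(int[] table, int key, int marker) {
        int slot = key % table.length;
        for (int probes = 0; probes < table.length; probes++) {
            if (table[slot] == EMPTY) return;
            if (table[slot] == key) { table[slot] = marker; return; }
            slot = (slot + 1) % table.length;
        }
    }

    static int[] sampleTable() {
        int[] table = new int[7];
        insert(table, 10);   // 10 % 7 = 3 -> slot 3
        insert(table, 17);   // 17 % 7 = 3 -> slot 4
        insert(table, 24);   // 24 % 7 = 3 -> slot 5
        return table;
    }

    public static void main(String[] args) {
        int[] a = sampleTable();
        remove(a, 17, EMPTY);
        System.out.println(contains(a, 24));   // false: the empty cell cuts the chain

        int[] b = sampleTable();
        remove(b, 17, DELETED);
        System.out.println(contains(b, 24));   // true: the marker keeps the chain intact
    }
}
```

Resetting the cell to empty makes 24 unreachable even though it is still in the table, which is why a logical-delete marker such as -1 is used instead.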

30 / 48

    public DataItem delete(int key)     // delete a DataItem (a method of class HashTable)
    {
        int hashVal = hashFunc(key);    // hash the key

        while(hashArray[hashVal] != null)   // until empty cell,
        {                                   // found the key?
            if(hashArray[hashVal].getKey() == key)
            {
                DataItem temp = hashArray[hashVal];   // save item for return
                hashArray[hashVal] = nonItem;         // delete item; -1?
                return temp;                          // return item
            }
            ++hashVal;                  // go to next cell
            hashVal %= arraySize;       // wraparound if necessary
        }  // end while()
        return null;                    // can't find item
    }  // end delete()
// -------------------------------------------------------------
// (What if hash table is full???)

31 / 48

// >>>> NEXT THREE SLIDES >>>> This is the entire application… HashApp uses a user interface…
// GO THROUGH THIS APP ON YOUR OWN.
// (requires import java.io.*; at the top of the file)
class HashTableApp
{
    public static void main(String[] args) throws IOException
    {
        DataItem aDataItem;
        int aKey, size, n, keysPerCell;
                                        // get sizes
        System.out.print("Enter size of hash table: ");
        size = getInt();
        System.out.print("Enter initial number of items: ");
        n = getInt();
        keysPerCell = 10;
                                        // make table
        HashTable theHashTable = new HashTable(size);

        for(int j=0; j<n; j++)          // insert data
        {
            aKey = (int)(java.lang.Math.random() * keysPerCell * size);
            aDataItem = new DataItem(aKey);
            theHashTable.insert(aDataItem);
        }

32 / 48

        while(true)
        {                               // interact with user
            System.out.print("Enter first letter of ");
            System.out.print("show, insert, delete, or find: ");
            char choice = getChar();
            switch(choice)
            {
                case 's':
                    theHashTable.displayTable();
                    break;
                case 'i':
                    System.out.print("Enter key value to insert: ");
                    aKey = getInt();
                    aDataItem = new DataItem(aKey);
                    theHashTable.insert(aDataItem);
                    break;
                case 'd':
                    System.out.print("Enter key value to delete: ");
                    aKey = getInt();
                    theHashTable.delete(aKey);
                    break;
                case 'f':
                    System.out.print("Enter key value to find: ");
                    aKey = getInt();
                    aDataItem = theHashTable.find(aKey);
                    if(aDataItem != null)
                        System.out.println("Found " + aKey);
                    else
                        System.out.println("Could not find " + aKey);
                    break;
                default:
                    System.out.print("Invalid entry\n");
            }  // end switch
        }  // end while
    }  // end main()

33 / 48

    public static String getString() throws IOException
    {
        InputStreamReader isr = new InputStreamReader(System.in);
        BufferedReader br = new BufferedReader(isr);
        String s = br.readLine();
        return s;
    }
//--------------------------------------------------------------
    public static char getChar() throws IOException
    {
        String s = getString();
        return s.charAt(0);
    }
//-------------------------------------------------------------
    public static int getInt() throws IOException
    {
        String s = getString();
        return Integer.parseInt(s);
    }
//--------------------------------------------------------------
}  // end class HashTableApp

Go through on your own.
