Lecture 18 - University of California, San Diego
cseweb.ucsd.edu/~kube/cls/100/Lectures/lec18.hashing2/lec18.pdf
CSE 100, UCSD: LEC 18

Lecture 18

• Separate chaining
• Dictionary data types
• Hashtables vs. balanced search trees
• A hashtable implementation: java.util.Hashtable

Reading: Weiss, Ch 5


Open addressing vs. separate chaining

• Linear probing, double hashing, and random hashing are appropriate if the keys are kept as entries in the hashtable itself. Doing that is called "open addressing"; it is also called "closed hashing".

• Another idea: Entries in the hashtable are just pointers to the head of a linked list (“chain”); elements of the linked list contain the keys. This is called "separate chaining"; it is also called "open hashing".

• Collision resolution becomes easy with separate chaining: no need to probe other table locations; just insert a key in its linked list if it is not already there.

• (It is possible to use fancier data structures than linked lists for this; but linked lists work very well in the average case, as we will see)


Separate chaining: basic algorithms

• When inserting a key K in a table with hash function H(K)

1. Set indx = H(K)
2. Insert key in linked list headed at indx. (Search the list first to avoid duplicates.)

• When searching for a key K in a table with hash function H(K)

1. Set indx = H(K)
2. Search for key in linked list headed at indx, using linear search.

• When deleting a key K in a table with hash function H(K)

1. Set indx = H(K)
2. Delete key in linked list headed at indx

• Advantages: average-case performance stays good as the number of entries approaches and even exceeds the table size M; delete is easier to implement than with open addressing

• Disadvantages: requires dynamic data, requires storage for pointers in addition to data, can have poor locality which causes poor caching performance
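The three algorithms above can be sketched in a few lines of Java. This is a minimal illustration, not code from the lecture: the class name ChainedIntSet is hypothetical, keys are plain ints, H(K) = K mod M is assumed as the hash function, and java.util.LinkedList stands in for a hand-written chain.

```java
import java.util.LinkedList;

// Hypothetical illustration class; not from the lecture or the JDK.
class ChainedIntSet {
    private final LinkedList<Integer>[] table; // table[i] is the chain for index i
    private int count;                         // N, the number of keys stored

    @SuppressWarnings("unchecked")
    ChainedIntSet(int m) {
        table = new LinkedList[m];
        for (int i = 0; i < m; i++) {
            table[i] = new LinkedList<>();
        }
    }

    // H(K) = K mod M, kept non-negative so it is always a valid index.
    private int hash(int key) {
        return Math.floorMod(key, table.length);
    }

    // Insert: search the chain first to avoid duplicates.
    boolean insert(int key) {
        LinkedList<Integer> chain = table[hash(key)];
        if (chain.contains(key)) {
            return false;
        }
        chain.add(key);
        count++;
        return true;
    }

    // Find: linear search of the one chain the key can be in.
    boolean find(int key) {
        return table[hash(key)].contains(key);
    }

    // Delete: remove the key from its chain; no other slot is touched.
    boolean delete(int key) {
        boolean removed = table[hash(key)].remove(Integer.valueOf(key));
        if (removed) {
            count--;
        }
        return removed;
    }

    int size() { return count; }

    double loadFactor() { return (double) count / table.length; }
}
```

Note that nothing stops the number of keys from exceeding M: the chains simply get longer.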


Separate chaining, an example

M = 7, H(K) = K mod M
Insert these keys: 701, 145, 217, 19, 13, 749 in this table, using separate chaining:

index: 0 1 2 3 4 5 6
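One way to check the exercise is to run it. The sketch below assumes each new key is appended at the tail of its chain; an implementation that inserts at the head would hold the same keys in each chain, in reverse order. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for working the example; not from the lecture.
class ChainingExample {
    // Builds M chains and inserts each key at the tail of chain (key mod M).
    static List<List<Integer>> build(int m, int[] keys) {
        List<List<Integer>> table = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            table.add(new ArrayList<>());
        }
        for (int k : keys) {
            List<Integer> chain = table.get(k % m);
            if (!chain.contains(k)) {   // skip duplicates
                chain.add(k);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {701, 145, 217, 19, 13, 749};
        List<List<Integer>> table = build(7, keys);
        for (int i = 0; i < table.size(); i++) {
            System.out.println(i + ": " + table.get(i));
        }
    }
}
```

With these keys, 217 and 749 both hash to index 0, 701 hashes to 1, 145 and 19 both hash to 5, and 13 hashes to 6; indexes 2, 3, and 4 stay empty.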


Analysis of separate-chaining hashing

• Keep in mind the load factor, a measure of how full the table is:

$\alpha = N/M$

where M is the size of the table, and N is the number of keys that have been inserted in the table

• With separate chaining, it is possible to have $\alpha > 1$

• Given a load factor $\alpha$, we would like to know the time costs, in the best, average, and worst case, of

new-key insert and unsuccessful find (these are the same)

successful find

• The best case is O(1) and worst case is O(N) for all of these... let’s analyze the average case


Average case costs with separate chaining

• Assume a table with load factor $\alpha = N/M$

• There are N items total distributed over M linked lists (some of which may be empty), so the average number of items per linked list is $N/M = \alpha$

• In any unsuccessful find/insert, the hash table entry for the key is accessed; then the linked list headed there is exhaustively searched

• Therefore, assuming all table entries are equally likely to be hit by the hash function, the average number of steps for insert or unsuccessful find with separate chaining is

$U = 1 + \alpha$

• In successful find, the hash table entry for the key is accessed; then the linked list headed there is linearly searched. Therefore (with the same probabilistic assumption), the average number of steps for successful find with separate chaining is

$S = 1 + \alpha/2$

• These are less than 2 and 1.5 respectively, when $\alpha < 1$

• And these remain O(1), independent of M, even when $\alpha$ exceeds 1.
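To get a feel for the numbers, the two average-case formulas can be evaluated for a few load factors. The sketch below only restates the formulas from this slide in code; the class and method names are hypothetical.

```java
// Hypothetical helper; it evaluates the slide's formulas, nothing more.
class ChainingCosts {
    // Load factor: N keys in a table of size M.
    static double loadFactor(int n, int m) {
        return (double) n / m;
    }

    // Average steps for insert or unsuccessful find: U = 1 + alpha.
    static double unsuccessfulFind(double alpha) {
        return 1.0 + alpha;
    }

    // Average steps for successful find: S = 1 + alpha / 2.
    static double successfulFind(double alpha) {
        return 1.0 + alpha / 2.0;
    }

    public static void main(String[] args) {
        for (double alpha : new double[] {0.5, 0.75, 1.0, 2.0}) {
            System.out.printf("alpha=%.2f  U=%.3f  S=%.3f%n",
                    alpha, unsuccessfulFind(alpha), successfulFind(alpha));
        }
    }
}
```

For the earlier example with N = 6 keys and M = 7, the load factor is 6/7, so U is about 1.86 and S is about 1.43. Even at a load factor of 2, U is only 3 steps and S is 2 steps.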


Dictionary data types

• A data structure is intended to hold data

An insert operation inserts a data item into the structure; a find operation says whether a data item is in the structure; delete removes a data item; etc.

• A Dictionary is a specialized kind of data structure:

A Dictionary structure is intended to hold pairs: each pair consists of a key, together with some related data

An insert operation inserts a key-data pair in the table; a find operation takes a key and returns the data in the key-data pair with that key; delete takes a key and removes the key-data pair with that key; etc.

• Dictionaries are sometimes called “Table” or “Map” abstract data types, or “associative memories”


Dictionary as ADT

• Domain: a collection of pairs; each pair consists of a key, and some additional data

• Operations (typical):

Create a table (initially empty)

Insert a new key-data pair in the table; if a key-data pair with the same key is already there, update the data part of the pair

Find the key-data pair in the table corresponding to a given key; return the data

Delete the key-data pair corresponding to a given key

Enumerate (traverse) all key-data pairs in the table
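In Java, these typical operations correspond closely to methods of the java.util.Map interface. The sketch below uses HashMap as the implementation; the class name DictionaryOps and the sample keys are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class mapping the ADT operations onto java.util.Map.
class DictionaryOps {
    static Map<String, Integer> demo() {
        Map<String, Integer> counts = new HashMap<>(); // create (initially empty)
        counts.put("hash", 1);    // insert a new key-data pair
        counts.put("tree", 5);    // insert another pair
        counts.put("hash", 2);    // same key already there: update the data
        counts.remove("tree");    // delete the pair with the given key
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = demo();
        System.out.println(counts.get("hash"));         // find: returns the data
        for (Map.Entry<String, Integer> e : counts.entrySet()) { // enumerate
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```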


Implementing the Dictionary ADT

• A Dictionary can be implemented in various ways: using a list, binary search tree, hashtable, etc., etc.

• In each case:

the implementing data structure has to be able to hold key-data pairs

the implementing data structure has to be able to do insert, find, and delete operations paying attention to the key

• This could be done in a generic data structure, where the user can specify the comparison function to be used by the insert, find, and delete functions
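As one concrete instance of a user-specified comparison function, java.util.TreeMap accepts a Comparator at construction time and uses it for every insert, find, and delete. The sketch below passes the standard String.CASE_INSENSITIVE_ORDER comparator; the class name is hypothetical.

```java
import java.util.TreeMap;

// Hypothetical demo: a dictionary whose key comparison is chosen by the user.
class ComparatorDictionary {
    // Keys that differ only in letter case are treated as the same key.
    static TreeMap<String, Integer> caseInsensitive() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> d = caseInsensitive();
        d.put("Apple", 1);
        d.put("APPLE", 2);                    // same key under this comparator: update
        System.out.println(d.size());         // one pair, not two
        System.out.println(d.get("apple"));   // found via the comparator
    }
}
```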


The Dictionary ADT and search engine indexes

• The Dictionary ADT is useful in any situation where you want to store, retrieve, and manipulate data based on associated keys

• One important application is a document search engine index

• An index associates words (keys) with information (data) such as what documents a word occurs in, how many times it occurs, what its position is within the document, etc.

When a word is read for the first time, an "insert" operation is done in the index to associate that word with the document in which it occurs (and possibly other information)

When a word is encountered again, an "insert" or "update" operation is done to add or modify associations with that word (additional document in which it occurs, increment the number of times it occurs, etc.)

If a document is no longer available, words contained in it have their associations changed, and the "delete" operation may be necessary

By doing a “find” operation in the index using a word as key, a user can find the documents that contain that word
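A toy version of such an index fits in a few lines. In this sketch the index is a dictionary from each word to a second dictionary that maps document names to occurrence counts. The class TinyIndex, its method names, and the choice to split text on non-word characters are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index; real search engine indexes store much more.
class TinyIndex {
    // word -> (document name -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    // First sighting of a word inserts it; later sightings update its counts.
    void addDocument(String doc, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(word, w -> new HashMap<>())
                 .merge(doc, 1, Integer::sum);
        }
    }

    // Document no longer available: drop its associations, then delete
    // any word that is left with no documents.
    void removeDocument(String doc) {
        index.values().forEach(postings -> postings.remove(doc));
        index.values().removeIf(Map::isEmpty);
    }

    // Find: the documents that contain the given word.
    Set<String> find(String word) {
        Map<String, Integer> postings = index.get(word.toLowerCase());
        return postings == null ? Collections.<String>emptySet() : postings.keySet();
    }

    // How many times the word occurs in the document (0 if never).
    int count(String word, String doc) {
        Map<String, Integer> postings = index.get(word.toLowerCase());
        return postings == null ? 0 : postings.getOrDefault(doc, 0);
    }
}
```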


Hashtables vs. balanced search trees

• Hashtables and balanced search trees can both be used in applications that need fast insert and find

• What are advantages and disadvantages of each?

Balanced search trees guarantee worst-case performance O(log N), which is quite good

A well-designed hash table has typical performance O(1), which is excellent; but worst-case is O(N), which is bad

Search trees require that keys be well-ordered: For any keys K1, K2, either K1 < K2, K1 = K2, or K1 > K2

Hashtables only require that keys be testable for equality, and that you can compute a hash function for them


Hashtables vs. balanced search trees, cont’d

A search tree can easily be used to return keys close in value to a given key, or to return the smallest key in the tree, or to output the keys in sorted order

A hashtable does not normally deal with ordering information efficiently

In a balanced search tree, delete is as efficient as insert

In a hashtable that uses open addressing, delete can be inefficient, and somewhat tricky to implement (easy with separate chaining though)

Overall, balanced search trees are rather difficult to implement correctly

Hash tables are relatively easy to implement
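The ordering queries mentioned above are directly available on java.util.TreeMap, which is a balanced search tree (a red-black tree, according to the JDK documentation). The sketch below reuses the keys from the earlier chaining example; the class name is hypothetical. A HashMap offers no comparable methods: answering the same questions with one would mean scanning or sorting all of its keys.

```java
import java.util.TreeMap;

// Hypothetical demo of ordering queries on a balanced search tree.
class OrderedQueries {
    static TreeMap<Integer, String> sample() {
        TreeMap<Integer, String> t = new TreeMap<>();
        for (int k : new int[] {701, 145, 217, 19, 13, 749}) {
            t.put(k, "data for " + k);
        }
        return t;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> t = sample();
        System.out.println(t.firstKey());       // smallest key in the tree
        System.out.println(t.floorKey(200));    // largest key <= 200
        System.out.println(t.ceilingKey(200));  // smallest key >= 200
        System.out.println(t.keySet());         // all keys in sorted order
    }
}
```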


A look at Java’s Hashtable

• The java.util.Hashtable class has existed in the Java standard library since JDK1.0

• In JDK 1.2, Hashtable was incorporated into the “Collections Framework”, and declared to implement Map

• java.util.Hashtable is similar to java.util.HashMap

They both implement Map, so they have the same public interface, but the implementation is slightly different

One difference is Hashtable has synchronized methods (this makes them slightly slower; if you don’t need synchronization for multithreaded programming, use HashMap)

• In JDK 1.5, Hashtable and HashMap were made generic, with type parameters for keys and values
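Because both classes implement Map, code written against the interface works with either one. The sketch below also shows a second behavioral difference that the JDK documentation describes: Hashtable rejects null keys and values, while HashMap permits them. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

// Hypothetical demo: the same code runs against either Map implementation.
class MapChoice {
    // Returns true if the given map accepts a null key.
    static boolean acceptsNullKey(Map<String, Integer> m) {
        try {
            m.put(null, 1);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> unsynchronizedMap = new HashMap<>();
        Map<String, Integer> synchronizedMap = new Hashtable<>();
        System.out.println("HashMap accepts null key:   " + acceptsNullKey(unsynchronizedMap));
        System.out.println("Hashtable accepts null key: " + acceptsNullKey(synchronizedMap));
    }
}
```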


Hashtable.java

package java.util;
import java.io.*;

/**
 * This class implements a hashtable, which maps keys to values.
 * Any non-null object can be used as a key or as a value.
 *
 * To successfully store and retrieve objects from a hashtable, the
 * objects used as keys must implement the hashCode
 * method and the equals method.
 */
public class Hashtable<K,V> extends Dictionary<K,V>
    implements Map<K,V>, Cloneable, java.io.Serializable {


Dictionary abstract class

• Dictionary is an abstract class that specifies some abstract methods. It acts like an interface specification, and probably should have been an interface instead of a class. It is very similar to the interface java.util.Map. Methods shown here, without comments:

public abstract class Dictionary<K,V> {

    abstract public int size();

    abstract public boolean isEmpty();

    abstract public Enumeration<K> keys();

    abstract public Enumeration<V> elements();

    abstract public V get(Object key);

    abstract public V put(K key, V value);

    abstract public V remove(Object key);
}


Instance variables

• Here are the instance variables declared in the Hashtable class:

/**
 * The hash table data.
 */
private transient Entry table[];

/**
 * The total number of entries in the hash table.
 */
private transient int count;

/**
 * Rehashes the table when count exceeds this threshold.
 */
private int threshold;

• What is the type of elements of the array implementing the hashtable?


Entry

• The Hashtable.java file also defines this inner class:

private static class Entry<K,V> implements Map.Entry<K,V> {
    int hash;
    K key;
    V value;
    Entry<K,V> next;
}

• Entries in a Hashtable object’s table[] array are pointers to objects of this class.

• From these declarations so far, can you tell what collision resolution strategy is used?


Hashtable methods

• We will look at these instance methods in the Hashtable class:

constructors

get()

put()

keySet()


Hashtable constructors

/**
 * Constructs a new, empty hashtable with the specified initial
 * capacity and the specified load factor.
 *
 * @param initialCapacity the initial capacity of the table
 * @param loadFactor a number between 0.0 and 1.0.
 * @exception IllegalArgumentException if the initial capacity is
 *            less than zero, or if the load factor
 *            is less than or equal to zero.
 * @since JDK1.0
 */
public Hashtable(int initialCapacity, float loadFactor) {

    if ((initialCapacity < 0) || (loadFactor

Hashtable default constructor

/**
 * Constructs a new, empty hashtable with a default capacity and
 * load factor.
 *
 * @since JDK1.0
 */
public Hashtable() {
    this(11, 0.75f);
}

• How do the default values for size and load factor compare to the hash table design principles we talked about?...


get()

/**
 * Returns the value to which the specified key is mapped in this
 * hashtable.
 *
 * @param key a key in the hashtable.
 * @return the value to which the key is mapped in this hashtable;
 *         null if the key is not mapped to any value in
 *         this hashtable.
 */
public synchronized V get(Object key) {
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % table.length;
    for (Entry<K,V> e = table[index] ; e != null ; e = e.next) {
        if ( e.hash == hash && e.key.equals(key) ) {
            return e.value;
        }
    }
    return null;
}
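The expression (hash & 0x7FFFFFFF) % table.length deserves a second look. hashCode() may return a negative int, and in Java the % operator keeps the sign of its left operand, so hash % table.length alone could produce a negative array index. Masking with 0x7FFFFFFF clears the sign bit first. The sketch below isolates that step; the class and method names are hypothetical.

```java
// Hypothetical helper isolating the index computation used in get() and put().
class BucketIndex {
    // Clear the sign bit, then reduce modulo the table length.
    static int indexFor(int hash, int tableLength) {
        return (hash & 0x7FFFFFFF) % tableLength;
    }

    public static void main(String[] args) {
        int[] hashes = {-7, Integer.MIN_VALUE, "hello".hashCode()};
        for (int h : hashes) {
            System.out.println(h + ": plain % gives " + (h % 11)
                    + ", masked index is " + indexFor(h, 11));
        }
    }
}
```

Math.abs would not be a safe substitute for the mask, because Math.abs(Integer.MIN_VALUE) is still negative.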


put()

• Here are the javadoc comments:

/**
 * Maps the specified key to the specified
 * value in this hashtable. Neither the key nor the
 * value can be null.
 *
 * The value can be retrieved by calling the get
 * method with a key that is equal to the original key.
 *
 * @param key the hashtable key.
 * @param value the value.
 * @return the previous value of the specified key in this
 *         hashtable, or null if it did not have one.
 * @exception NullPointerException if the key or value is
 *            null.
 * @since JDK1.0
 */

• ... and the code follows.


public synchronized V put(K key, V value) {
    // Make sure the value is not null
    if (value == null) {
        throw new NullPointerException();
    }

    // If the key is already in the hashtable, update its value
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % table.length;
    for (Entry<K,V> e = table[index] ; e != null ; e = e.next) {
        if ( e.hash == hash && e.key.equals(key) ) {
            V old = e.value;
            e.value = value;
            return old;
        }
    }

    if (count >= threshold) {
        // Rehash the table if the threshold is exceeded
        rehash(); // this enlarges the capacity of the table
        index = (hash & 0x7FFFFFFF) % table.length;
    }

    // Create and add the new entry at the head of its chain.
    Entry<K,V> e = new Entry<K,V>();
    e.hash = hash;
    e.key = key;
    e.value = value;
    e.next = table[index];
    table[index] = e;
    count++;
    return null;
}


Rehashing

/**
 * Increases the capacity of and internally reorganizes this
 * hashtable, in order to accommodate and access its entries more
 * efficiently.
 */
protected void rehash() {
    int oldCapacity = table.length;
    Entry oldMap[] = table;
    int newCapacity = oldCapacity * 2 + 1;
    Entry newMap[] = new Entry[newCapacity];
    threshold = (int)(newCapacity * loadFactor);
    table = newMap;
    // Walk every old chain and relink each entry into the new table
    for (int i = oldCapacity ; i-- > 0 ;) {
        for (Entry old = oldMap[i] ; old != null ; ) {
            Entry e = old;
            old = old.next;
            // The stored hash is reused; hashCode() is not called again
            int index = (e.hash & 0x7FFFFFFF) % newCapacity;
            e.next = newMap[index];
            newMap[index] = e;
        }
    }
}
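To see what this growth rule does over time, the sketch below replays newCapacity = oldCapacity * 2 + 1 and threshold = (int)(newCapacity * loadFactor), starting from the default capacity of 11 and load factor of 0.75. The class and method names are hypothetical; only the two formulas come from the code above.

```java
// Hypothetical helper that replays the growth rule from rehash().
class GrowthSchedule {
    // Capacities after 0, 1, 2, ... rehashes, starting from `initial`.
    static int[] capacities(int initial, int steps) {
        int[] caps = new int[steps];
        int c = initial;
        for (int i = 0; i < steps; i++) {
            caps[i] = c;
            c = c * 2 + 1;
        }
        return caps;
    }

    // Same truncating computation as in rehash().
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        for (int c : capacities(11, 5)) {
            System.out.println("capacity " + c + ", threshold " + threshold(c, 0.75f));
        }
    }
}
```

The capacities run 11, 23, 47, 95, 191. The rule keeps the capacity odd but not necessarily prime: 95 = 5 × 19.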


keySet()

• For any key value, you can find out if that key is in the table or not: just use get()

• But how can you get a listing of all the keys in the table? There are many possible keys, and only a few of them will be in the table; it’s not feasible to check them all with get()

• The keySet() method returns a Set object that contains only the keys in the table:

/**
 * Returns a Set view of the keys contained in this Hashtable.
 * The Set supports element removal (which removes the
 * corresponding entry from the Hashtable), but not element
 * addition.
 *
 * @return a Set view of the keys contained in this Map.
 * @since 1.2
 */
public Set<K> keySet() {
    //...
}

• An Iterator for the Set can then be used to iterate efficiently over the keys in the table
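A short usage sketch: iterate over the key set, and remove entries through the iterator. Because the Set is a view, removing a key from it removes the corresponding entry from the Hashtable. The class name, method name, and sample keys are hypothetical.

```java
import java.util.Hashtable;
import java.util.Iterator;

// Hypothetical demo of iterating over, and removing through, keySet().
class KeySetDemo {
    // Removes every entry whose key is shorter than minLength characters.
    static void removeShortKeys(Hashtable<String, Integer> table, int minLength) {
        Iterator<String> it = table.keySet().iterator();
        while (it.hasNext()) {
            if (it.next().length() < minLength) {
                it.remove(); // also removes the entry from the Hashtable
            }
        }
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> table = new Hashtable<>();
        table.put("a", 1);
        table.put("hashing", 2);
        table.put("of", 3);
        removeShortKeys(table, 3);
        for (String key : table.keySet()) {
            System.out.println(key + " -> " + table.get(key));
        }
    }
}
```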


Serializable objects

• Since JDK1.1, Java has had the ability to “serialize” objects

• Serialization is the process of converting an existing object to a sequence of bytes, in order to be sent over a stream (e.g. saved to a file, or transmitted over a network connection, etc.). Serializing an object is also sometimes called ‘persisting’ or ‘pickling’ the object

• This is done in such a way that the object can be deserialized, i.e. reconstituted, later (e.g. by reading from the file, or when the serialized object is received at the other end of the network connection, etc.)

• In order for an object to be serialized, its class must be declared to implement the java.io.Serializable interface

• This interface does not specify any methods: a class that declares itself to implement it is just indicating that instances of it can be serialized

• Many Java library classes are serializable; user-defined classes can also be serializable


Serializing a serializable class

• If a class is Serializable, objects that are instances of that class or a subclass can be serialized

• To serialize an object, pass it to the writeObject() method of an appropriately created java.io.ObjectOutputStream object

• The object can be deserialized by creating a corresponding java.io.ObjectInputStream object and calling its readObject() method (you will want to downcast the returned Object reference to be of the appropriate type)
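The round trip can be tried without touching a file or a network by serializing to an in-memory byte array. In this sketch a Hashtable is the object being serialized, since the class is declared Serializable; the class SerializeRoundTrip and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Hashtable;

// Hypothetical demo: serialize to a byte array, then deserialize.
class SerializeRoundTrip {
    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // Convenience wrapper: readObject() returns Object, so downcast.
    @SuppressWarnings("unchecked")
    static Hashtable<String, Integer> copyOf(Hashtable<String, Integer> table) {
        try {
            return (Hashtable<String, Integer>) deserialize(serialize(table));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> table = new Hashtable<>();
        table.put("lecture", 18);
        Hashtable<String, Integer> copy = copyOf(table);
        System.out.println(copy.equals(table)); // same contents
        System.out.println(copy == table);      // but a different object
    }
}
```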


Designing a serializable class

• If all the instance variables of a user-defined class are of primitive types or Serializable class types, then the class can be declared to implement the Serializable interface and instances of the class can be serialized

• If an instance variable is not of a Serializable class type, or you do not want it to be part of the serialized representation, the instance variable must be marked transient

• transient instance variables are not written to the serialized representation; when the object is deserialized they get their default values (null for class types, “zero” for primitive types)

To change this, you can write your own serialization and deserialization methods, which can call the default methods; see online documentation for how to do this

• Classes themselves are not serialized, only objects!

So, to get everything to work, the same class definition must be available in both serialization and deserialization contexts

As a corollary, static variables are never serialized: they are created and initialized when the class is loaded into the Java virtual machine, not when an instance of the class is deserialized
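The effect of transient is easy to observe with a small user-defined class. In the sketch below, the class Session and its fields are invented for illustration: one ordinary field and two transient fields are set, the object is serialized and deserialized in memory, and the result is inspected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical demo of how transient fields behave across serialization.
class TransientDemo {
    static class Session implements Serializable {
        private static final long serialVersionUID = 1L;
        String user;             // serialized normally
        transient String token;  // not serialized; null after deserialization
        transient int attempts;  // not serialized; 0 after deserialization

        Session(String user, String token, int attempts) {
            this.user = user;
            this.token = token;
            this.attempts = attempts;
        }
    }

    // Serialize to memory and read the object back.
    static Session roundTrip(Session s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(s);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Session) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Session copy = roundTrip(new Session("kube", "secret", 3));
        System.out.println(copy.user + " " + copy.token + " " + copy.attempts);
    }
}
```

This is the same mechanism Hashtable relies on: its table and count fields are declared transient, so the class must supply its own serialization methods to write and rebuild the entries.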


Next time

• Self-organizing data structures
• Self-organizing lists
• Splay trees
• Spatial data structures
• K-D trees
• The C++ Standard Template Library

