Hashing 15-211 Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005

Hashing

15-211 Fundamental Data Structures and Algorithms

Margaret Reid-Miller

18 January 2005

Plan

Today

Seat assignments

Hash functions

Reading:

For today and next time:

Sedgewick Chapter 14

Reminder: HW0 due on Thursday

Hash Tables

An Alternative Representation for Dictionaries

Dictionary Interface

An Abstract Data Type that maintains a dynamic set is a Dictionary.

Crucial operations:

Insert

Find

Remove

Standard operations:

create, destroy, copy,…

Dictionary Interface

insert: may or may not allow multiple occurrences

find: membership query, often also retrieve associated information

remove: may use deferred actions for speed-up (amortized running time)

Small Universe

Suppose we have a small universe U = {0,1,2,…,M-1} of items.

We want to maintain a subset A of U.

Easy: use an array of bits (booleans) of size M.

Insert: A[k] = 1

Find: return A[k] != 0

Remove: A[k] = 0

Operations are constant time.
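
The three operations above can be sketched in Java as follows. This is only a minimal illustration: the class name SmallUniverseSet and its method names are hypothetical, and a boolean array stands in for the array of bits.

```java
// Sketch of a set over the small universe U = {0, ..., M-1}.
// SmallUniverseSet is a hypothetical name, not from the lecture.
class SmallUniverseSet {
    private final boolean[] present;   // present[k] is true iff k is in the set

    SmallUniverseSet(int universeSize) {
        present = new boolean[universeSize];   // all entries start as false
    }

    void insert(int k) { present[k] = true; }      // Insert: A[k] = 1

    boolean find(int k) { return present[k]; }     // Find: return A[k] != 0

    void remove(int k) { present[k] = false; }     // Remove: A[k] = 0

    public static void main(String[] args) {
        SmallUniverseSet a = new SmallUniverseSet(8);
        a.insert(3);
        System.out.println(a.find(3));   // prints true
        a.remove(3);
        System.out.println(a.find(3));   // prints false
    }
}
```

Each operation is a single array access, which is where the constant running time comes from.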

Direct Access Tables

In most applications we do not store simple items but pairs

(key, object).

Use an array of pointers (references to objects).

Insert: A[key] = object

Find: return A[key]

Remove: A[key] = null

Again operations are constant time.
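
A direct access table along these lines might look like the sketch below. The class name DirectAccessTable and the generic parameter V are assumptions made for illustration; keys are taken to be integers in 0 .. M-1.

```java
// Sketch of a direct access table: one slot per possible key.
// DirectAccessTable is a hypothetical name, not from the lecture.
class DirectAccessTable<V> {
    private final Object[] slots;   // slots[key] holds the object, or null

    DirectAccessTable(int universeSize) {
        slots = new Object[universeSize];
    }

    void insert(int key, V object) { slots[key] = object; }   // Insert: A[key] = object

    @SuppressWarnings("unchecked")
    V find(int key) { return (V) slots[key]; }                // Find: return A[key]

    void remove(int key) { slots[key] = null; }               // Remove: A[key] = null

    public static void main(String[] args) {
        DirectAccessTable<String> t = new DirectAccessTable<>(100);
        t.insert(42, "answer");
        System.out.println(t.find(42));   // prints answer
        t.remove(42);
        System.out.println(t.find(42));   // prints null
    }
}
```

Note that the array needs one slot for every key in the universe, whether or not that key is ever stored.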

Large Universe

But what if the universe U of keys is large (and the subset is small)?

e.g., names, symbol table of a compiler.

Even when the identifiers are at most 16 characters long, there are some 10^28 possibilities.
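
To get a feel for the size of this universe, the following sketch counts the identifiers of one fixed length. The alphabet of 62 symbols (upper-case letters, lower-case letters, digits) and the restriction to exactly 16 characters are assumptions made for illustration; the lecture does not say which alphabet it has in mind.

```java
import java.math.BigInteger;

// Rough count of possible identifiers, under an assumed alphabet size.
// IdentifierCount is a hypothetical name, not from the lecture.
class IdentifierCount {
    // Number of strings of exactly `length` symbols over an alphabet of the given size.
    static BigInteger count(int alphabetSize, int length) {
        return BigInteger.valueOf(alphabetSize).pow(length);
    }

    public static void main(String[] args) {
        BigInteger n = count(62, 16);
        // The result has 29 decimal digits, i.e. it is on the order of 10^28.
        System.out.println(n + " (" + n.toString().length() + " digits)");
    }
}
```

A direct access table with that many slots is clearly out of the question.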

Hashing – the Idea

Map keys into integers in the range 0 .. m-1, where m is the table size and m << M.

Pick a “good” mapping from keys to integers:

Easy to compute

Even distribution into the table

[Figure: the letters a through z mapped into a table with indices 0 through 10]

Hashing – Terminology

The array in which we store the objects is the hash table.

To enter an object into the table, we compute an index from the key.

The map from the key to the index is a hash function h:

h(key) = index

Space-Time Tradeoff

A direct access table has O(1) operations in the worst case, but the space may be prohibitive.

Sequential search through a list of only the stored items minimizes space, but takes linear time.

Hashing balances space and time (on average) by changing the size of the hash table.

Problem - Collisions

Fundamental problem: Some keys map to the same location, a collision: h(x) = h(y).

Can we prevent collisions?

Pigeonhole Principle

There is no way to avoid collisions.

Since m << M there must be at least two keys that map to the same index.

The famous Pigeonhole Principle:

If you put more than k items into k bins, then at least one bin contains more than one item.
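
The principle can be watched in action with a small sketch: place keys into m bins and report the first bin that receives a second key. The class name PigeonholeDemo, the sample keys, and the use of k mod m as the map into bins are all assumptions made for illustration; any map into 0 .. m-1 would show the same effect.

```java
// Sketch: with more than m keys in m bins, some bin must get two keys.
// PigeonholeDemo is a hypothetical name, not from the lecture.
class PigeonholeDemo {
    // Returns the index of the first bin that receives a second key,
    // or -1 if every bin ends up with at most one key.
    static int firstCrowdedBin(int[] keys, int m) {
        int[] counts = new int[m];
        for (int k : keys) {
            int bin = Math.floorMod(k, m);   // illustrative map into 0 .. m-1
            counts[bin]++;
            if (counts[bin] > 1) {
                return bin;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] sixKeys = {12, 7, 30, 45, 23, 8};   // 6 keys, 5 bins
        System.out.println(firstCrowdedBin(sixKeys, 5) != -1);   // prints true
    }
}
```

With at most m keys a collision is possible but not forced; with m+1 or more keys it is unavoidable, whatever the map.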

Problem - Hash Function

Second problem: How do we find a suitable hash function?

Ideally, we want to distribute the keys uniformly over the hash table to minimize collisions.

That is, we want h to appear random, as though “hashing” the keys.

Hash Functions

Hashing-Efficiency

We also need to make sure h(k) is easy to compute.

Note that k could be a fairly complicated data structure. How do you turn an array of integers into a single integer? Or how about a tree?

Goal: All operations should be constant time.

But things can go badly wrong on rare occasions.
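
One possible answer to the question above about arrays of integers is to combine the elements one at a time with a multiplier, then reduce modulo the table size. This is only a sketch under assumptions: the class name ArrayHash and the multiplier 37 are illustrative choices, not something the lecture prescribes for arrays.

```java
// Sketch: fold an array of integers into a single table index.
// ArrayHash and the multiplier 37 are illustrative assumptions.
class ArrayHash {
    static int hash(int[] a, int m) {
        int h = 0;
        for (int x : a) {
            h = 37 * h + x;            // int arithmetic wraps around on overflow
        }
        return Math.floorMod(h, m);    // always in 0 .. m-1, even if h is negative
    }

    public static void main(String[] args) {
        System.out.println(hash(new int[]{1, 2}, 11));   // prints 6
        System.out.println(hash(new int[]{2, 1}, 11));   // prints 9
    }
}
```

Because each step multiplies the running value before adding the next element, the order of the elements affects the result. The cost is linear in the length of the array, so "constant time" only holds if key size is treated as bounded.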

Division method

Assume wlog the keys are integers.

A simple hash function is

h(k) = k mod m,

where m is the table size.

The choice of m is crucial.

Good choice: m prime.
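
The division method is a one-liner in Java. The sketch below also hints at why the choice of m matters, using keys that are all multiples of 16; that particular key pattern and the class name DivisionHash are assumptions chosen for illustration.

```java
// Sketch of the division method h(k) = k mod m.
// DivisionHash is a hypothetical name, not from the lecture.
class DivisionHash {
    static int hash(int k, int m) {
        // Math.floorMod never returns a negative value for positive m,
        // unlike the % operator, which can for negative k.
        return Math.floorMod(k, m);
    }

    public static void main(String[] args) {
        int[] keys = {16, 32, 48, 64};   // a regular pattern: multiples of 16
        for (int k : keys) {
            // With m = 16 every key lands in slot 0; with the prime m = 17 they spread out.
            System.out.println(k + ": m=16 -> " + hash(k, 16) + ", m=17 -> " + hash(k, 17));
        }
    }
}
```

When m shares a factor with a regularity in the keys, many keys pile into a few slots; a prime m avoids sharing factors with most such patterns.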

Division method

Primes are fairly dense, so this is no great restriction on the table size.

In fact, we can choose primes that nearly double the hash table size at each step:

31, 61, 127, 251, 509, 1021, 2039, …

Store these values in a table; don’t try to compute on the fly.
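
Storing the sizes in a table could look like the sketch below. The array holds exactly the primes listed above; the class name PrimeSizes, the method nextSize, and the -1 return value for an exhausted table are assumptions made for illustration.

```java
// Sketch: precomputed prime table sizes, each roughly double the previous one.
// PrimeSizes and nextSize are hypothetical names, not from the lecture.
class PrimeSizes {
    static final int[] PRIMES = {31, 61, 127, 251, 509, 1021, 2039};

    // Smallest stored prime strictly greater than `current`,
    // or -1 if the stored list is exhausted.
    static int nextSize(int current) {
        for (int p : PRIMES) {
            if (p > current) {
                return p;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(nextSize(31));    // prints 61
        System.out.println(nextSize(100));   // prints 127
    }
}
```

A real table would extend the list much further; the lookup itself stays a short scan.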

Multiplication Method

Another hash function is

h(k) = floor( m (k r mod 1) )

where 0 < r < 1 is cleverly chosen.

Advantage: the choice of m is not critical

Ideally r should be irrational; then the values (i r mod 1), i = 1, 2, ..., M, are very evenly distributed over [0,1].

Of course, there is a little problem here: an irrational r cannot be represented exactly in finite-precision arithmetic.
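
A floating-point sketch of the multiplication method follows. The value r = (sqrt(5) - 1) / 2 is an assumption: it is a commonly suggested choice, often attributed to Knuth, but the lecture does not name a specific r. A double can only approximate an irrational r, and keys are assumed to be non-negative. The class name MultiplicationHash is hypothetical.

```java
// Sketch of the multiplication method h(k) = floor(m * (k*r mod 1)).
// MultiplicationHash and the choice of R are illustrative assumptions.
class MultiplicationHash {
    // Approximation of the irrational number (sqrt(5) - 1) / 2, about 0.618.
    static final double R = (Math.sqrt(5.0) - 1.0) / 2.0;

    // Assumes k >= 0 and m > 0.
    static int hash(int k, int m) {
        double product = k * R;
        double frac = product - Math.floor(product);   // k*r mod 1, in [0, 1)
        return (int) (m * frac);                       // truncation equals floor for values >= 0
    }

    public static void main(String[] args) {
        int m = 16;   // m need not be prime here
        for (int k = 1; k <= 8; k++) {
            System.out.println(k + " -> " + hash(k, m));
        }
    }
}
```

Production code usually replaces the floating-point arithmetic with fixed-width integer multiplication and a shift, which is a different (though related) formulation from the one sketched here.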

Random Input

Note that good hash functions are easy to come by if the input is random (as a bit pattern). Then we can simply take a few bits from the input (say, the first or last 16 bits).

However, such a method would fail miserably if the input shows some regularity. No good for general use.
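
The failure mode is easy to reproduce. The sketch below takes the last 16 bits of an int key and feeds it keys that are all multiples of 65536; the class name LowBitsHash and that key pattern are assumptions chosen to make the regularity obvious.

```java
// Sketch: hashing by taking the last 16 bits of the key.
// LowBitsHash is a hypothetical name, not from the lecture.
class LowBitsHash {
    static int low16(int k) {
        return k & 0xFFFF;   // keep only the lowest 16 bits
    }

    public static void main(String[] args) {
        // Random-looking input: the low bits vary, so this works fine.
        System.out.println(low16(0x12345678));   // prints 22136 (0x5678)

        // Regular input: multiples of 65536 have all-zero low bits,
        // so every one of these keys collides in slot 0.
        for (int i = 1; i <= 4; i++) {
            System.out.println(low16(i * 65536));   // prints 0 each time
        }
    }
}
```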

Integer keys?

The assumption that the objects in U are integers has to be taken with a grain of salt.

Often we have to massage things a bit to extract numbers.

Of course, in the end everything is just one (possibly huge) number written in binary. This can be used in some languages like C to directly extract hash values from these bits.
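
Java does not allow arbitrary reinterpretation of memory the way C does, but for some types the standard library exposes the underlying bits. As a loosely analogous sketch, the code below hashes a double from its 64-bit pattern. The class name BitsHash and the particular way of folding 64 bits into 32 are assumptions made for illustration.

```java
// Sketch: derive a hash value from the raw bit pattern of a double.
// BitsHash and the folding step are illustrative assumptions.
class BitsHash {
    static int hash(double x, int m) {
        long bits = Double.doubleToLongBits(x);        // the 64-bit representation of x
        int folded = (int) (bits ^ (bits >>> 32));     // mix the high half into the low half
        return Math.floorMod(folded, m);               // index in 0 .. m-1
    }

    public static void main(String[] args) {
        System.out.println(hash(0.0, 101));    // prints 0: the bit pattern of 0.0 is all zeros
        System.out.println(hash(1.5, 101) == hash(1.5, 101));   // prints true
    }
}
```

One caveat of working on raw bits: values that compare equal can have different bit patterns (0.0 and -0.0, for example) and would then hash differently.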

Example: Strings

public int hashCode(String key, int m) {
    int h = 0;
    for (int i = 0; i < key.length(); i++)
        h = 37 * h + key.charAt(i);   // 37 is a magic number
    h %= m;
    if (h < 0)   // int overflow can make h negative
        h += m;
    return h;
}

This really interprets the string as a number in base 37 (though not in ordinary radix notation, since the character codes used as digits can exceed 36).
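
One consequence of the "digits" being larger than the base is that two different strings can produce the same number even before the reduction modulo m. The sketch below restates the function as a static method and shows one such pair; the class name StringHashDemo, the table size 1021, and the sample strings are assumptions chosen for illustration.

```java
// Sketch: the base-37 string hash, plus a pair of strings that always collide.
// StringHashDemo is a hypothetical name, not from the lecture.
class StringHashDemo {
    static int hash(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 37 * h + key.charAt(i);
        }
        h %= m;
        if (h < 0) {   // int overflow can make h negative
            h += m;
        }
        return h;
    }

    public static void main(String[] args) {
        // 'a' = 97, 'b' = 98, '=' = 61, so
        // "ab" gives 37 * 97 + 98 = 3687 and "b=" gives 37 * 98 + 61 = 3687.
        System.out.println(hash("ab", 1021) == hash("b=", 1021));   // prints true
    }
}
```

Since the two strings yield the same value before the modulo step, they collide for every table size m.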

Hash functions

Desired properties

Approximates a random distribution over the range of table index values

Efficient calculation

Approaches

Modular arithmetic

Many

Perfect hashing: when the full set of input keys is known in advance

Next time: Collisions
