Hashing 15-211 Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005

Hashing

15-211 Fundamental Data Structures and Algorithms

Margaret Reid-Miller

18 January 2005

Plan

Today

Seat assignments

Hash functions

Reading:

For today and next time:

Sedgewick Chapter 14

Reminder: HW0 due on Thursday

Hash Tables

An Alternative Representation for Dictionaries

Dictionary Interface

A Dictionary is an abstract data type that maintains a dynamic set.

Crucial operations:

Insert

Find

Remove

Standard operations:

create, destroy, copy,…

Dictionary Interface

insert: may or may not allow multiple occurrences

find: membership query, often also retrieve associated information

remove: may use deferred actions for speed-up (amortized running time)

Small Universe

Suppose we have a small universe U = {0,1,2,…,M-1} of items.

We want to maintain a subset A of U.

Easy: use an array A of bits (booleans) of size M.

Insert: A[k] = 1

Find: return A[k] != 0

Remove: A[k] = 0

Operations are constant time.
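A minimal Java sketch of the bit-array idea above. The class name BitArraySet and its method names are hypothetical, chosen only for this illustration; a boolean array stands in for the array of bits.

```java
// Sketch of a set over the small universe U = {0, 1, ..., M-1}.
// BitArraySet is a hypothetical name used only for this example.
class BitArraySet {
    private final boolean[] bits;   // bits[k] is true exactly when k is in the set

    BitArraySet(int universeSize) {
        bits = new boolean[universeSize];   // all false: the empty set
    }

    void insert(int k)  { bits[k] = true; }    // Insert: A[k] = 1
    boolean find(int k) { return bits[k]; }    // Find: return A[k] != 0
    void remove(int k)  { bits[k] = false; }   // Remove: A[k] = 0
}
```

Each operation is a single array access, which is where the constant running time comes from.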

Direct Access Tables

In most applications we do not store simple items but pairs

(key, object).

Use an array of pointers (references to objects).

Insert: A[key] = object

Find: return A[key]

Remove: A[key] = null

Again operations are constant time.
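The same operations can be sketched in Java as follows. DirectAccessTable is a hypothetical name, and the sketch assumes keys are already integers in the range 0 .. M-1.

```java
// Sketch of a direct access table: one slot per possible key.
// DirectAccessTable is a hypothetical name used only for this example.
class DirectAccessTable<V> {
    private final Object[] slots;   // slots[key] holds the object stored under key, or null

    DirectAccessTable(int universeSize) {
        slots = new Object[universeSize];
    }

    void insert(int key, V object) { slots[key] = object; }   // Insert: A[key] = object

    @SuppressWarnings("unchecked")
    V find(int key) { return (V) slots[key]; }                 // Find: return A[key]

    void remove(int key) { slots[key] = null; }                // Remove: A[key] = null
}
```

Note that the array needs one slot for every key in the universe, whether or not that key is ever used.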

Large Universe

But what if the universe U of keys is large (and the subset is small)?

e.g., names, symbol table of a compiler.

Even when identifiers are at most 16 characters long, there are some 10^28 possibilities.

Hashing – the Idea

Map keys to integers in the range 0 .. m-1, where m is the table size and m << M.

Pick a “good” mapping from keys to integers:

Easy to compute

Even distribution into the table

[Figure: the keys a through z mapped into a table with slots 0 through 10]

Hashing – Terminology

The array in which we store the objects is the hash table.

To enter an object into the table, we compute an index from the key.

The map from the key to the index is a hash function h:

h(key) = index

Space-Time Tradeoff

A direct table has O(1) operations in the worst case. But space may be prohibitive.

A sequential search through an unordered list minimizes space, but operations take linear time.

Hashing balances space and time (on average) by changing the size of the hash table.

Problem - Collisions

Fundamental problem: two distinct keys x and y may map to the same location, a collision: h(x) = h(y).

Can we prevent collisions?

Pigeonhole Principle

There is no way to avoid collisions.

Since m << M there must be at least two keys that map to the same index.

The famous Pigeonhole Principle:

If you put more than k items into k bins, then at least one bin contains more than one item.
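To see the principle at work, the sketch below counts how many of the 26 keys a through z land in each of m = 11 slots. The hash used here (alphabet position mod m) is an assumption made for the illustration, not a function taken from the slides.

```java
// Sketch: 26 keys into 11 slots must produce collisions.
// PigeonholeDemo and its hash are hypothetical, for illustration only.
class PigeonholeDemo {
    // Assumed hash: position of the letter in the alphabet, reduced mod m.
    static int h(char c, int m) {
        return (c - 'a') % m;
    }

    // Returns, for each slot, the number of letters a..z that map to it.
    static int[] slotCounts(int m) {
        int[] counts = new int[m];
        for (char c = 'a'; c <= 'z'; c++) {
            counts[h(c, m)]++;
        }
        return counts;
    }
}
```

With m = 11, slots 0 through 3 each receive three letters and slots 4 through 10 each receive two, so every slot holds a collision.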

Problem - Hash Function

Second problem: How do we find a suitable hash function?

Ideally, we want to distribute the keys uniformly over the hash table to minimize collisions.

That is, we want h to appear random, as though “hashing” the keys.

Hash Functions

Hashing – Efficiency

We also need to make sure h(k) is easy to compute.

Note that k could be a fairly complicated data structure. How do you turn an array of integers into a single integer? Or how about a tree?

Goal: All operations should be constant time.

But things can go badly wrong on rare occasions.
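One possible answer to the array question, as a hedged sketch: combine the elements into a single int and then reduce it to a table index. The sketch relies on the standard library method java.util.Arrays.hashCode(int[]), which folds the elements together with multiplier 31; the wrapper class ArrayKeyHash and the parameter m are hypothetical names.

```java
import java.util.Arrays;

// Sketch: turning an array of integers into a single table index.
// ArrayKeyHash is a hypothetical name used only for this example.
class ArrayKeyHash {
    static int hash(int[] key, int m) {
        int combined = Arrays.hashCode(key);   // one int computed from all elements
        return Math.floorMod(combined, m);     // index in 0..m-1, even if combined is negative
    }
}
```

The cost is linear in the length of the array, so "constant time" here assumes that keys have bounded size.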

Division method

Assume without loss of generality (wlog) that the keys are integers.

A simple hash function is

h(k) = k mod m,

where m is the table size.

The choice of m is crucial.

Good choice: m prime.
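A small Java sketch of the division method, with an experiment that hints at why the choice of m matters. The class DivisionHash and the helper slotsUsed are hypothetical; the experiment (keys that are multiples of a fixed step) is an assumed example of regular input, not one taken from the slides.

```java
// Sketch of the division method h(k) = k mod m.
// DivisionHash and slotsUsed are hypothetical names for this example.
class DivisionHash {
    // Math.floorMod keeps the result in 0..m-1 even for negative k.
    static int hash(int k, int m) {
        return Math.floorMod(k, m);
    }

    // Counts how many distinct slots the keys 0, step, 2*step, ... occupy.
    static int slotsUsed(int step, int count, int m) {
        boolean[] used = new boolean[m];
        int distinct = 0;
        for (int i = 0; i < count; i++) {
            int slot = hash(i * step, m);
            if (!used[slot]) {
                used[slot] = true;
                distinct++;
            }
        }
        return distinct;
    }
}
```

For 100 keys that are multiples of 4, slotsUsed(4, 100, 16) is 4 while slotsUsed(4, 100, 17) is 17: with m = 16 the keys crowd into a quarter of the table, and with the prime m = 17 they reach every slot.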

Division method

Primes are fairly dense, so this is no great restriction on the table size.

In fact, we can pick primes that nearly double the hash table size at each step:

31, 61, 127, 251, 509, 1021, 2039, …

Store these values in a table; don’t try to compute on the fly.
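Storing the sizes in a table might look like the sketch below. The class PrimeSizes and the method nextSize are hypothetical names, and the array holds only the seven primes listed above; a real table would continue further.

```java
// Sketch: precomputed prime table sizes, each close to double the previous one.
// PrimeSizes and nextSize are hypothetical names for this example.
class PrimeSizes {
    static final int[] PRIMES = {31, 61, 127, 251, 509, 1021, 2039};

    // Returns the smallest stored prime strictly greater than current,
    // or -1 if the stored list is exhausted.
    static int nextSize(int current) {
        for (int p : PRIMES) {
            if (p > current) {
                return p;
            }
        }
        return -1;
    }
}
```

Growing a table is then a lookup rather than a primality search.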

Multiplication method

Another hash function is

h(k) = floor( m (k r mod 1) )

where 0 < r < 1 is cleverly chosen.

Advantage: the choice of m is not critical

Ideally r should be irrational; then the values (i r mod 1), i = 1, 2, ..., M, are very evenly distributed over [0,1).

Of course, there is a little problem here: a computer cannot store an irrational number exactly, so r has to be approximated.
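A hedged Java sketch of the multiplication method. The slides do not name a value for r; the sketch assumes r = (sqrt(5) - 1) / 2, a commonly suggested choice, and the class name MultiplicationHash is hypothetical. The double used for r is only a floating-point approximation of the irrational value, and keys are assumed non-negative.

```java
// Sketch of the multiplication method h(k) = floor(m * (k*r mod 1)).
// MultiplicationHash is a hypothetical name; R is an assumed choice of r.
class MultiplicationHash {
    // Approximation of (sqrt(5) - 1) / 2, about 0.618.
    static final double R = (Math.sqrt(5.0) - 1.0) / 2.0;

    // Assumes k >= 0 and m > 0.
    static int hash(int k, int m) {
        double fraction = (k * R) % 1.0;         // fractional part of k*r, in [0, 1)
        return (int) Math.floor(m * fraction);   // scaled to an index in 0..m-1
    }
}
```

With m = 10, the keys 1, 2, 3 map to 6, 2, 8: consecutive keys are scattered rather than placed in adjacent slots.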

Random Input

Note that good hash functions are easy to come by if the input is random (as a bit pattern). Then we can simply take a few bits from the input (say, the first or last 16 bits).

However, such a method would fail miserably if the input shows some regularity. No good for general use.
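The failure mode is easy to reproduce. The sketch below takes the last 16 bits of an int key; the class name LowBitsHash and the example keys are assumptions made for the illustration.

```java
// Sketch: hashing by keeping only the last 16 bits of the key.
// LowBitsHash is a hypothetical name used only for this example.
class LowBitsHash {
    static int hash(int key) {
        return key & 0xFFFF;   // keep the low 16 bits, discard the rest
    }
}
```

Any keys that differ only in their upper bits collide: every multiple of 65536 (for example 1 << 16 and 5 << 16) hashes to 0.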

Integer keys?

The assumption that objects in U are integers has to be taken with a grain of salt.

Often we have to massage things a bit to extract numbers.

Of course, in the end everything is just one (possibly huge) number written in binary. This can be used in some languages like C to directly extract hash values from these bits.

Example: Strings

public int hashCode(String key, int m) {
    int h = 0;
    for (int i = 0; i < key.length(); i++)
        h = 37 * h + key.charAt(i);   // 37 is a magic number
    h %= m;
    if (h < 0)   // overflow?
        h += m;
    return h;
}

This is really an interpretation of the string as a number in base 37 (not ordinary radix notation, though.)
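A small worked check of the base-37 reading, as a sketch. The class StringHash wraps the same computation in a static method so it can be called directly; the class name and the sample key "ab" are assumptions made for the illustration.

```java
// Sketch: the string hash from the slide, wrapped for direct use.
// StringHash is a hypothetical name used only for this example.
class StringHash {
    static int hash(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 37 * h + key.charAt(i);   // Horner's rule in base 37
        }
        h %= m;
        if (h < 0) {   // h may be negative after int overflow on long keys
            h += m;
        }
        return h;
    }
}
```

For the key "ab", the characters have codes 97 and 98, so the loop computes 37 * 97 + 98 = 3687, and with m = 11 the result is 3687 mod 11 = 2.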

Hash functions

Desired properties

Approximates a random distribution over the range of table index values

Efficient calculation

Approaches

Modular arithmetic

Many

Perfect hashing: when the full set of input keys is known in advance

Next time: Collisions