65
Maps and Hashing

Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Embed Size (px)

Citation preview

Page 1: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Maps and Hashing

Page 2: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Part I: Basic Concepts

Eric RobertsCS 106B

Page 3: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

An Illustrative Mapping Application• Suppose that you want to write a program that displays the

name of a state given its two-letter postal abbreviation.

• This program is an ideal application for the Map class because what you need is a map between two-letter codes and state names. Each two-letter code uniquely identifies a particular state and therefore serves as a key for a Map; the state names are the corresponding values.

• To implement this program in C++, you need to perform the following steps, which are illustrated on the following slide:

Create a Map<string,string> containing the key/value pairs.1.Read in the two-letter abbreviation to translate.2.Call get on the Map to find the state name.3.Print out the name of the state.4.

Page 4: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

The PostalLookup Program

skip simulation

int main() { Map<string,string> stateMap; initStateMap(stateMap); while (true) { cout << "Enter two-letter state abbreviation: "; string code = getLine(); if (code == "") break; if (stateMap.containsKey(code)) { cout << code << " = " << stateMap.get(code) << endl; } else { cout << code << " = ???" << endl; } }}

code stateMap

Enter two-letter state abbreviation:

PostalLookup

HIHI = HawaiiEnter two-letter state abbreviation: WIWI = WisconsinEnter two-letter state abbreviation: VEVE = ???Enter two-letter state abbreviation:

AL=AlabamaAK=AlaskaAZ=Arizona

FL=FloridaGA=GeorgiaHI=Hawaii

WI=WisconsinWY=Wyoming

. . .

. . .

HIWIVE

void initStateMap(Map<string,string> & map) { map.put("AL", "Alabama"); map.put("AK", "Alaska"); map.put("AZ", "Arizona");

map.put("FL", "Florida"); map.put("GA", "Georgia"); map.put("HI", "Hawaii");

map.put("WI", "Wisconsin"); map.put("WY", "Wyoming");}

map

. . .

. . .

int main() { Map<string,string> stateMap; initStateMap(stateMap); while (true) { cout << "Enter two-letter state abbreviation: "; string code = getLine(); if (code == "") break; if (stateMap.containsKey(code)) { cout << code << " = " << stateMap.get(code) << endl; } else { cout << code << " = ???" << endl; } }}

stateMapcode

Page 5: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

A Simplified Version of the Map Class

map.size()Returns the number of key/value pairs in the map.

map.isEmpty()Returns true if the map is empty.

map.put(key, value)Makes an association between key and value, discarding any existing one.

map.get(key)Returns the most recent value associated with key.

map.containsKey(key)Returns true if there is a value associated with key.

map.clear()Removes all key/value pairs from the map.

Page 6: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Implementation Strategies for MapsThere are several strategies you might choose to implement the map operations get and put. Those strategies include:

Linear search. Keep track of all the name/value pairs in an array. In this model, both the get and put operations run in O(N) time.

1.

Page 7: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Exercise: Linear Search MapAs our programming exercise, we’ll build the linear search version of the map. We start with the PostalLookup.cpp program and the map.h interface. What we need to write are the following files: • The mappriv.h file that describes the representation.

• The mapimpl.cpp file that implements the public methods.

Page 8: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Implementation Strategies for MapsThere are several strategies you might choose to implement the map operations get and put. Those strategies include:

Linear search. Keep track of all the name/value pairs in an array. In this model, both the get and put operations run in O(N) time.

1.

Binary search. If you keep the array sorted by the two-character code, you can use binary search to find the key. Using this strategy improves the performance of get to O(log N).

2.

Table lookup in a grid. In this specific example, you can store the state names in a 26 x 26 Grid<string> in which the first and second indices correspond to the two letters in the code. Because you can now find any code in a single step, this strategy is O(1), although this performance comes at a cost in memory space.

3.

Page 9: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

The Idea of Hashing• The third strategy on the preceding slide shows that one can

make the get and put operations run very quickly, even to the point that the cost of finding a key is independent of the number of keys in the table. This O(1) performance is possible only if you know where to look for a particular key.

• To get a sense of how you might achieve this goal in practice, it helps to think about how you find a word in a dictionary. Most dictionaries have thumb tabs that indicate where each letter appear. Words starting with A are in the A section, and so on.

• The most common implementations of maps use a strategy called hashing, which is conceptually similar to the thumb tabs in a dictionary. The critical idea is that you can improve performance enormously if you use the key to figure out where to look.

Page 10: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hash Codes• The rest of today’s lecture focuses on the implementation of

the HashMap class in the Stanford libraries, which uses the hashing strategy.

• The HashMap class requires the existence of a free function called hashCode that transforms a key into a nonnegative integer. The hash code tells the implementation where it should look for a particular key, thereby reducing the search time dramatically.

• For today, I’ll focus on the case when the keys are strings. The important things to remember about hash codes are:

Every string has a hash code, even if you don’t know what it is.1.

The hash code for any particular string is always the same.2.

If two strings are equal (i.e., they contain the same characters), they have the same hash code.

3.

Page 11: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

The hashCode Function for Strings

const int HASH_SEED = 5381; /* Starting point for first cycle */const int HASH_MULTIPLIER = 33; /* Multiplier for each cycle */const int HASH_MASK = unsigned(-1) >> 1; /* All bits except the sign */

/* * Function: hashCode * Usage: int code = hashCode(key); * -------------------------------- * This function takes a string key and uses it to derive a hash code, * which is nonnegative integer related to the key by a deterministic * function that distributes keys well across the space of integers. * The general method is called linear congruence, which is also used * in random-number generators. The specific algorithm used here is * called djb2 after the initials of its inventor, Daniel J. Bernstein, * Professor of Mathematics at the University of Illinois at Chicago. */

int hashCode(string str) { unsigned hash = HASH_SEED; int nchars = str.length(); for (int i = 0; i < nchars; i++) { hash = HASH_MULTIPLIER * hash + str[i]; } return (hash & HASH_MASK);}

Page 12: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

The Bucket Hashing Strategy• One common strategy for implementing a map is to use the hash

code for each key to select an index into an array that will contain all the keys with that hash code. Each element of that array is conventionally called a bucket.

• In practice, the array of buckets is smaller than the number of hash codes, making it necessary to convert the hash code into a bucket index, typically by executing a statement like

int index = hashCode(key) % nBuckets;

• The value in each element of the bucket array cannot be a single key/value pair given the chance that different keys fall into the same bucket. Such situations are called collisions.

• To take account of the possibility of collisions, each elements of the bucket array is a linked list of the keys that fall into that bucket, as illustrated on the next slide.

Page 13: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Simulating Bucket HashingstateMap.put("AL", "Alabama")

hashCode("AL") 2091

2091 % 7 5

The key "AL" therefore goes in bucket 5.

stateMap.put("AK", "Alaska")

hashCode("AK") 2090

2090 % 7 4

The key "AK" therefore goes in bucket 4.

stateMap.put("AZ", "Arizona")

hashCode("AZ") 2105

2105 % 7 5

The key "AZ" therefore goes in bucket 5.

Because bucket 5 already contains "AL",the "AZ" must be added to the chain.

The rest of the keys are added similarly.CA

California

null

CACalifornia

null

COColorado

DE

Delaware

CACalifornia

null

COColorado

IDIdaho

CACalifornia

null

COColorado

DEDelaware

IDIdaho

CACalifornia

null

COColorado

DEDelaware

KSKansas

IDIdaho

CACalifornia

null

COColorado

DEDelaware

KSKansas

MTMontana

IDIdaho

CACalifornia

null

COColorado

DEDelaware

KSKansas

MTMontana

NJNew

Jersey

IDIdaho

CACalifornia

null

COColorado

DEDelaware

KSKansas

MTMontana

NJNew

Jersey

NCNorth

Carolina

IDIdaho

CACalifornia

null

COColorado

DEDelaware

KSKansas

MTMontana

NJNew

Jersey

NCNorth

Carolina

WYWyoming

ILIllinois

null

ILIllinois

null

MNMinnesota

ILIllinois

null

MNMinnesota

NYNew York

ILIllinois

null

MNMinnesota

NYNew York

NDNorth

Dakota

ILIllinois

null

MNMinnesota

NYNew York

NDNorth

Dakota

OHOhio

ILIllinois

null

MNMinnesota

NYNew York

NDNorth

Dakota

OHOhio

SCSouth

Carolina

ILIllinois

null

MNMinnesota

NYNew York

NDNorth

Dakota

OHOhio

SCSouth

Carolina

TNTennessee

ILIllinois

null

MNMinnesota

NYNew York

NDNorth

Dakota

OHOhio

SCSouth

Carolina

TNTennessee

VAVirginia

HIHawaii

null

HIHawaii

null

MAMassachusetts

HIHawaii

null

MAMassachusetts

MOMissouri

HIHawaii

null

MAMassachusetts

MOMissouri

NENebraska

HIHawaii

null

MAMassachusetts

MOMissouri

NENebraska

SDSouth

Dakota

INIndiana

null

INIndiana

null

MIMichigan

INIndiana

null

MIMichigan

NMNew

Mexico

INIndiana

null

MIMichigan

NMNew

Mexico

UTUtah

AKAlaska

null

AKAlaska

null

ARArkansas

AKAlaska

null

ARArkansas

IAIowa

OKOklahoma

AKAlaska

null

ARArkansas

IAIowa

OKOklahoma

AKAlaska

null

ARArkansas

IA

Iowa

OROregon

OKOklahoma

AKAlaska

null

ARArkansas

IAIowa

OROregon

PAPennsylvania

OKOklahoma

AKAlaska

null

ARArkansas

IAIowa

OROregon

PAPennsylvania

RIRhode

Island

OKOklahoma

AKAlaska

null

ARArkansas

IAIowa

OROregon

PAPennsylvania

RIRhode

Island

TXTexas

OKOklahoma

AKAlaska

null

ARArkansas

IAIowa

OROregon

PAPennsylvania

RIRhode

Island

TXTexas

WAWashington

WVWest

Virginia

OKOklahoma

AKAlaska

null

ARArkansas

IAIowa

OROregon

PAPennsylvania

RIRhode

Island

TXTexas

WAWashington

ALAlabama

null

ALAlabama

null

AZArizona

CTConnecticut

ALAlabama

null

AZArizona

GAGeorgia

ALAlabama

null

AZArizona

CTConnecticut

GAGeorgia

ALAlabama

null

AZArizona

CTConnecticut

MDMaryland

GAGeorgia

ALAlabama

null

AZArizona

CTConnecticut

MDMaryland

NVNevada

GAGeorgia

ALAlabama

null

AZArizona

CTConnecticut

MDMaryland

NVNevada

NHNew

Hampshire

GAGeorgia

ALAlabama

null

AZArizona

CTConnecticut

MDMaryland

NVNevada

NHNew

Hampshire

WIWisconsin

FLFlorida

null

FLFlorida

null

KYKentucky

FLFlorida

null

KYKentucky

LALouisiana

FLFlorida

null

KYKentucky

LALouisiana

MEMaine

FLFlorida

null

KYKentucky

LALouisiana

MEMaine

MSMississippi

FLFlorida

null

KYKentucky

LALouisiana

MEMaine

MSMississippi

VTVermont

0

1

2

3

4

5

6Suppose you call stateMap.get("HI")

hashCode("HI") 2305

2305 % 7 2

The key "HI" must therefore be in bucket 2and can be located by searching the chain.

SDSouth

Dakota

NENebraska

MOMissouri

MAMassachusetts

HIHawaii

null

skip simulation

Page 14: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Achieving O(1) Performance• The simulation on the previous side uses only seven buckets

to emphasize what happens when collisions occur: the smaller the number of buckets, the more likely collisions become.

• In practice, the implementation of HashMap uses a much larger value for nBuckets to minimize the opportunity for collisions. If the number of buckets is considerably larger than the number of keys, most of the bucket chains will either be empty or contain exactly one key/value pair.

• The ratio of the number of keys to the number of buckets is called the load factor of the map. Because a map achieves O(1) performance only if the load factor is small, the library implementation of HashMap increases the number of buckets when the table becomes too full. This process is called rehashing.

Page 15: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * File: hashmap.h * --------------- * This interface exports the HashMap class, which maintains a collection * of key/value pairs using a hashtable as the underlying data structure. */

#ifndef _hashmap_h#define _hashmap_h

#include <string>

/* * Class: HashMap<KeyType,ValueType> * --------------------------------- * The HashMap class maintains an association between keys and values. * The types used for keys and values are specified as template parameters, * which makes it possible to use this structure with any data type. */

template <typename KeyType, typename ValueType>class HashMap {

public:

The hashmap.h Interface

Page 16: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * File: hashmap.h * --------------- * This interface exports the HashMap class, which maintains a collection * of key/value pairs using a hashtable as the underlying data structure. */

#ifndef _hashmap_h#define _hashmap_h

#include <string>

/* * Class: HashMap<KeyType,ValueType> * --------------------------------- * The HashMap class maintains an association between keys and values. * The types used for keys and values are specified as template parameters, * which makes it possible to use this structure with any data type. */

template <typename KeyType, typename ValueType>class HashMap {

public:

/* * Constructor: HashMap * Usage: HashMap<KeyType,ValueType> map; * -------------------------------------- * Initializes a new empty map that associates keys and values of the * specified types. The type used for the key must define the == operator, * and there must be a free function with the following signature: * * int hashCode(KeyType key); * * that returns a positive integer determined by the key. This interface * exports hashCode functions for string and the C++ primitive types. */

HashMap();

/* * Destructor: ~HashMap * Usage: (usually implicit) * ------------------------- * Frees any heap storage associated with this map. */

~HashMap();

The hashmap.h Interface

Page 17: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * Constructor: HashMap * Usage: HashMap<KeyType,ValueType> map; * -------------------------------------- * Initializes a new empty map that associates keys and values of the * specified types. The type used for the key must define the == operator, * and there must be a free function with the following signature: * * int hashCode(KeyType key); * * that returns a positive integer determined by the key. This interface * exports hashCode functions for string and the C++ primitive types. */

HashMap();

/* * Destructor: ~HashMap * Usage: (usually implicit) * ------------------------- * Frees any heap storage associated with this map. */

~HashMap();

/* * Method: size * Usage: int nEntries = map.size(); * --------------------------------- * Returns the number of entries in this map. */

int size();

/* * Method: isEmpty * Usage: if (map.isEmpty()) . . . * ------------------------------- * Returns true if this map contains no entries. */

bool isEmpty();

/* * Method: clear * Usage: map.clear(); * ------------------- * Removes all entries from this map. */

void clear();

The hashmap.h Interface

Page 18: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * Method: size * Usage: int nEntries = map.size(); * --------------------------------- * Returns the number of entries in this map. */

int size();

/* * Method: isEmpty * Usage: if (map.isEmpty()) . . . * ------------------------------- * Returns true if this map contains no entries. */

bool isEmpty();

/* * Method: clear * Usage: map.clear(); * ------------------- * Removes all entries from this map. */

void clear();

/* * Method: put * Usage: map.put(key, value); * --------------------------- * Associates key with value in this map. Any previous value associated * with key is replaced by the new value. */

void put(KeyType key, ValueType value);

/* * Method: get * Usage: ValueType value = map.get(key); * -------------------------------------- * Returns the value associated with key in this map. If key is not * found, the get method signals an error. */

ValueType get(KeyType key);

/* * Method: containsKey * Usage: if (map.containsKey(key)) . . . * -------------------------------------- * Returns true if there is an entry for key in this map. */

bool containsKey(KeyType key);

The hashmap.h Interface

Page 19: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * Method: put * Usage: map.put(key, value); * --------------------------- * Associates key with value in this map. Any previous value associated * with key is replaced by the new value. */

void put(KeyType key, ValueType value);

/* * Method: get * Usage: ValueType value = map.get(key); * -------------------------------------- * Returns the value associated with key in this map. If key is not * found, the get method signals an error. */

ValueType get(KeyType key);

/* * Method: containsKey * Usage: if (map.containsKey(key)) . . . * -------------------------------------- * Returns true if there is an entry for key in this map. */

bool containsKey(KeyType key);

#include "hashmappriv.h"

};

#include "hashmapimpl.cpp"

/* * Function: hashCode * Usage: int hash = hashCode(key); * -------------------------------- * Returns a hash code for the specified key, which is always a * nonnegative integer. This function is overloaded to support * all of the primitive types and the C++ <code>string</code> type. */

int hashCode(std::string key);int hashCode(int key);int hashCode(char key);int hashCode(long key);int hashCode(double key);

#endif

The hashmap.h Interface

Page 20: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * File: hashmappriv.h * ------------------- * This file contains the private section of the hashmap.h interface. */

private:

/* Type definition for cells in the bucket chain */

struct Cell { KeyType key; ValueType value; Cell *link; };

/* Instance variables */

Cell **buckets; /* Dynamic array of pointers to cells */ int nBuckets; /* The number of buckets in the array */ int count; /* The number of entries in the map */

The hashmappriv.h File

Page 21: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * File: hashmappriv.h * ------------------- * This file contains the private section of the hashmap.h interface. */

private:

/* Type definition for cells in the bucket chain */

struct Cell { KeyType key; ValueType value; Cell *link; };

/* Instance variables */

Cell **buckets; /* Dynamic array of pointers to cells */ int nBuckets; /* The number of buckets in the array */ int count; /* The number of entries in the map */

/* * Private method: findCell * Usage: Cell *cp = findCell(bucket, key); * ---------------------------------------- * Finds a cell in the chain for the specified bucket that matches key. * If a match is found, the return value is a pointer to the cell containing * the matching key. If no match is found, the function returns NULL. */

Cell *findCell(int bucket, ValueType key) { Cell *cp = buckets[bucket]; while (cp != NULL && key != cp->key) { cp = cp->link; } return cp; }

The hashmappriv.h File

Page 22: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * File: hashmapimpl.cpp * --------------------- * This file contains the private section of the hashmap.cpp interface. * Because of the way C++ compiles templates, this code must be * available to the compiler when it reads the header file. */

#ifdef _hashmap_h

#include "error.h"

/* * Implementation notes: HashMap class * ----------------------------------- * In this map implementation, the entries are stored in a hashtable. * The hashtable keeps an array of buckets, where each bucket is a * linked list of elements that share the same hash code. If two or * more keys have the same hash code (which is called a "collision"), * each of those keys will be on the same list. Ideally, however, the * number of such collisions will be small, so that all of the operations * can run in constant time. To achieve that goal, it is necessary to * expand the number of buckets when the lists start to fill up. That * operation (called "rehashing") is not implemented here and is instead * left as an exercise. */

The hashmapimpl.cpp Implementation

Page 23: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* * File: hashmapimpl.cpp * --------------------- * This file contains the private section of the hashmap.cpp interface. * Because of the way C++ compiles templates, this code must be * available to the compiler when it reads the header file. */

#ifdef _hashmap_h

#include "error.h"

/* * Implementation notes: HashMap class * ----------------------------------- * In this map implementation, the entries are stored in a hashtable. * The hashtable keeps an array of buckets, where each bucket is a * linked list of elements that share the same hash code. If two or * more keys have the same hash code (which is called a "collision"), * each of those keys will be on the same list. Ideally, however, the * number of such collisions will be small, so that all of the operations * can run in constant time. To achieve that goal, it is necessary to * expand the number of buckets when the lists start to fill up. That * operation (called "rehashing") is not implemented here and is instead * left as an exercise. */

/* Constant definitions */

const int INITIAL_BUCKET_COUNT = 101;

/* * Implementation notes: constructor and destructor * ------------------------------------------------ * The constructor allocates the array of buckets and initializes * each bucket to the empty list. The destructor must free the memory, * but the clear function already takes care of that. */ template <typename KeyType,typename ValueType>HashMap<KeyType,ValueType>::HashMap() { nBuckets = INITIAL_BUCKET_COUNT; buckets = new Cell*[nBuckets]; for (int i = 0; i < nBuckets; i++) { buckets[i] = NULL; } count = 0;}

template <typename KeyType,typename ValueType>HashMap<KeyType,ValueType>::~HashMap() { clear();}

The hashmapimpl.cpp Implementation

Page 24: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

/* Constant definitions */

const int INITIAL_BUCKET_COUNT = 101;

/* * Implementation notes: constructor and destructor * ------------------------------------------------ * The constructor allocates the array of buckets and initializes * each bucket to the empty list. The destructor must free the memory, * but the clear function already takes care of that. */ template <typename KeyType,typename ValueType>HashMap<KeyType,ValueType>::HashMap() { nBuckets = INITIAL_BUCKET_COUNT; buckets = new Cell*[nBuckets]; for (int i = 0; i < nBuckets; i++) { buckets[i] = NULL; } count = 0;}

template <typename KeyType,typename ValueType>HashMap<KeyType,ValueType>::~HashMap() { clear();}

template <typename KeyType,typename ValueType>int HashMap<KeyType,ValueType>::size() { return count;}

template <typename KeyType,typename ValueType>bool HashMap<KeyType,ValueType>::isEmpty() { return count == 0;}

template <typename KeyType,typename ValueType>void HashMap<KeyType,ValueType>::clear() { for (int i = 0; i < nBuckets; i++) { Cell *cp = buckets[i]; while (cp != NULL) { Cell *oldCell = cp; cp = cp->link; delete oldCell; } } count = 0;}

The hashmapimpl.cpp Implementation

Page 25: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

template <typename KeyType,typename ValueType>int HashMap<KeyType,ValueType>::size() { return count;}

template <typename KeyType,typename ValueType>bool HashMap<KeyType,ValueType>::isEmpty() { return count == 0;}

template <typename KeyType,typename ValueType>void HashMap<KeyType,ValueType>::clear() { for (int i = 0; i < nBuckets; i++) { Cell *cp = buckets[i]; while (cp != NULL) { Cell *oldCell = cp; cp = cp->link; delete oldCell; } } count = 0;}

template <typename KeyType,typename ValueType>void HashMap<KeyType,ValueType>::put(KeyType key, ValueType value) { int bucket = hashCode(key) % nBuckets; Cell *cp = findCell(bucket, key); if (cp == NULL) { cp = new Cell; cp->key = key; cp->link = buckets[bucket]; buckets[bucket] = cp; } cp->value = value;}

template <typename KeyType,typename ValueType>ValueType HashMap<KeyType,ValueType>::get(KeyType key) { int bucket = hashCode(key) % nBuckets; Cell *cp = findCell(bucket, key); if (cp == NULL) error("get: No value for key"); return cp->value;}

template <typename KeyType,typename ValueType>bool HashMap<KeyType,ValueType>::containsKey(KeyType key) { int bucket = hashCode(key) % nBuckets; return findCell(bucket, key) != NULL;}

#endif

The hashmapimpl.cpp Implementation

Page 26: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Java Reviewpublic boolean equals(Object obj) {

return (this == obj);

}

Page 27: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Part II: Advanced Concepts

Jingyu Zhou

Page 28: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Map

• A very useful abstraction– Any kind of dictionary, lookup table, index,

databases, etc.

• Stroe key-value pairs– Fast access via key– Operations to optimize: put, get

Page 29: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Basic Idea

• Use hash function to map keys into positions in a hash table

Ideally

• If element e has key k and h is hash function, then e is stored in position h(k) of table

• To search for e, compute h(k) to locate position. If no element, dictionary does not contain e.

Page 30: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Example• Dictionary Student Records

– Keys are ID numbers (951000 - 952000), no more than 100 students

– Hash function: h(k) = k-951000 maps ID into distinct table positions 0-1000

– array table[1001]

...

0 1 2 3 1000

hash table

buckets

Page 31: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Analysis (Ideal Case)• O(b) time to initialize hash table (b number of

positions or buckets in hash table)

• O(1) time to perform insert, remove, search

Page 32: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Ideal Case is Unrealistic

• Works for implementing dictionaries, but many applications have key ranges that are too large to have 1-1 mapping between buckets and keys!

Example:

• Suppose key can take on values from 0 .. 65,535 (2 byte unsigned int)

• Expect 1,000 records at any given time

• Impractical to use hash table with 65,536 slots!

Page 33: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hash Functions• If key range too large, use hash table with fewer

buckets and a hash function which maps multiple keys to same bucket:

h(k1) = = h(k2): k1 and k2 have collision at slot • Popular hash functions: hashing by division

h(k) = k%D, where D number of buckets in hash table

• Example: hash table with 11 bucketsh(k) = k%1180 3 (80%11= 3), 40 7, 65 1058 3 collision!

Page 34: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Collision Resolution Policies

• Two classes:1. Open hashing, a.k.a. separate chaining

2. Closed hashing, a.k.a. open addressing

• Difference has to do with whether collisions are stored outside the table (open hashing) or whether collisions result in storing one of the records at another slot in the table (closed hashing)

Page 35: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Closed Hashing• Associated with closed hashing is a rehash strategy:

“If we try to place x in bucket h(x) and find it occupied, find alternative location h1(x), h2(x), etc. Try each in order, if none empty table is full,”

• h(x) is called home bucket

• Simplest rehash strategy is called linear hashinghi(x) = (h(x) + i) % D

• In general, collision resolution strategy is to generate a sequence of hash table slots (probe sequence) that can hold the record; test each slot until find empty one (probing)

Page 36: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Example Linear (Closed) Hashing

• D=8, keys a,b,c,d have hash values h(a)=3, h(b)=0, h(c)=4, h(d)=3– Where do we insert d? 3 already filled– Probe sequence using linear hashing:

• h1(d) = (h(d)+1)%8 = 4%8 = 4

• h2(d) = (h(d)+2)%8 = 5%8 = 5*

• h3(d) = (h(d)+3)%8 = 6%8 = 6

• etc.

• 7, 0, 1, 2

– Wraps around the beginning of the table!

0

2

3

4

5

6

7

1

b

a

c

d

Page 37: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Operations Using Linear Hashing• Test for membership: findItem

– Examine h(k), h1(k), h2(k), …, until we find k or an empty bucket or home bucket

• If no deletions possible, strategy works!

• What if deletions?– If we reach empty bucket, cannot be sure that k is not somewhere

else and empty bucket was occupied when k was inserted

– Need special placeholder deleted, to distinguish bucket that was never used from one that once held a value

– May need to reorganize table after many deletions

Page 38: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Performance Analysis - Worst Case

• Initialization: O(b), b# of buckets

• Insert and search: O(n), n number of elements in table; all n key values have same home bucket

• No better than linear list for maintaining dictionary!

Page 39: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Performance Analysis - Avg Case

• Distinguish between successful and unsuccessful searches– Delete = successful search for record to be deleted– Insert = unsuccessful search along its probe

sequence

• Expected cost of hashing is a function of how full the table is: load factor = n/m

Page 40: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hash AnalysisOpen Addressing Search :

Assuming: uniform hashing , the length is m

n+1

i=1

q i : not hit in the hash table for the ith probing 。

= ××× …...

1

m

n-1

m-1 m-2

n-2 n-i+2

m-i+2

n-(i-1)

m-(i-1)

1-

and : 1 <= i <= n+1

So : Un = ∑ i × qi = 1/(1-n/(m+1)) ≈n+1

i=1 1 - αα = n/m

n

• Unsuccessful Search : Insert

Un = ∑ i × qi

Page 41: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hash AnalysisOpen Addressing Search :

Assuming: uniform hashing , the length is m

• Successful Search: A search for a key k follows the same probe sequence as was followed when the element with key k was inserted.

•If k was the (i + 1)st key inserted into the hash table, the expected number of probes made in a search for k is at most 1 / (1 - i/m) = m/(m - i).

• Averaging over all n keys in the hash table gives us the average number of probes in a successful search:

Page 42: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Improved Collision Resolution• Linear probing: hi(x) = (h(x) + i) % D

– All buckets in table will be candidates for inserting a new record before the probe sequence returns to home position

– Clustering of records, leads to long probing sequences• Linear probing with skipping: hi(x) = (h(x) + ic) % D

– c constant other than 1– Records with adjacent home buckets will not follow

same probe sequence• (Pseudo)Random probing: hi(x) = (h(x) + ri) % D

– ri is the ith value in a random permutation of numbers from 1 to D-1

– Insertions and searches use the same sequence of “random” numbers

Page 43: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

0

1

2

3

4

5

6

7

8

9

10

1001

9537

3016

9874

2009

9875

1. What if next element has home bucket 0? go to bucket 3Same for elements with home bucket 1 or 2!Only a record with home position 3 will stay. p = 4/11 that next record will go to bucket 32. Similarly, records hashing to 7,8,9

will end up in 10

3. Only records hashing to 4 will end upin 4 (p=1/11); same for 5 and 6

I II

insert 1052 (h.b. 7)

0

1

2

3

4

5

6

7

8

9

10

1001

9537

3016

9874

2009

9875

1052

next element in bucket3 with p = 8/11

Example

Page 44: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hash Functions - Numerical Values

• Consider: h(x) = x%16– poor distribution, not very random– depends solely on least significant four bits of key

• Better, mid-square method– if keys are integers in range 0,1,…,K , pick integer C

such that DC2 about equal to K2, thenh(x) = x2/C % D

extracts middle r bits of x2, where 2r=D (a base-D

digit)– better, because most or all of bits of key contribute to

result

Page 45: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hash Function – Strings of Characters

• Folding Method:int h(String x, int D) {

int i, sum;

for (sum=0, i=0; i<x.length(); i++)sum+= (int)x.charAt(i);

return (sum%D);

}

– Sums the ASCII values of the letters in the string• ASCII value for “A” =65; sum will be in range 650-900 for

10 upper-case letters; good when D around 100, for example

– Order of chars in string has no effect

Page 46: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hash Function – Strings of Characters

• Much better: Cyclic Shiftstatic long hashCode(String key, int D) {

int h=0;

for (int i=0, i<key.length(); i++){h = (h << 4) | ( h >> 27);

h += (int) key.charAt(i);

}

return h%D;

}

Page 47: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Open Hashing

• Each bucket in the hash table is the head of a linked list

• All elements that hash to a particular bucket are placed on that bucket’s linked list

• Records within a bucket can be ordered in several ways– By order of insertion, by key value order, or by

frequency of access order

Page 48: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Open Hashing Data Organization0

1

2

3

4

D-1

...

...

...

Page 49: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Analysis• Open hashing is most appropriate when the hash table is kept

in main memory, implemented with a standard in-memory linked list

• We hope that number of elements per bucket roughly equal in size, so that the lists will be short

• If there are n elements in set, then each bucket will have roughly n/D

• If we can estimate n and choose D to be roughly as large, then the average bucket will have only one or two members

Page 50: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Analysis Cont’d

Average time per dictionary operation:• D buckets, n elements in dictionary average

n/D elements per bucket• insert, search, remove operation take O(1+n/D)

time each• If we can choose D to be about n, constant time• Assuming each element is likely to be hashed to

any bucket, running time constant, independent of n

Page 51: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Comparison with Closed Hashing• Worst case performance is O(n) for both

• Number of operations for hashing– 23 6 8 10 23 5 12 4 9 19– D=9– h(x) = x % D

Page 52: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hashing Problem

• Draw the 11 entry hashtable for hashing the keys 12, 44, 13, 88, 23, 94, 11, 39, 20 using the function (2i+5) mod 11, closed hashing, linear probing

• Pseudo-code for listing all identifiers in a hashtable in lexicographic order, using open hashing, the hash function h(x) = first character of x. What is the running time?

Page 53: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

53

Review: Expected Number of Probes in Searches

Let λ = n / m (load factor)

Page 54: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Hash Functions

• A hash function is denoted by

h: {0, 1}* {0, 1}n

where n is a security parameter, say 128, 160, 256 or 512.

• In English:– A function which is applicable to data of any size.

– Its produces a fixed length output, usually short.

• 3 Types of Security Requirements:– One-way: given an output z, it is difficult to find x such that z = h(x).

– Weak collision-resistant: given x, it is difficult to find y x such that h(y) = h(x).

– Strong collision-resistant: it is difficult to find any pair (x, y) such that h(x) = h(y).

• Note: Strong collision-resistant Weak collision-resistant One-way

• Let m be some message. h(x) is called the message digest.

54

Page 55: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Birthday Attack

• Birthday Paradox:– If there are 23 people in a room, the probability that at least two

people have the same birthday is slightly more than 50%. If there are 30, the probability is around 70%.

• This process is analogous to throwing k balls randomly into n bins and checking to see if some bin contains at least two balls.

• For having more than half chance of finding at least two balls in one bin,

k 1.17 n1/2

– E.g. n = 365 k 23

55

Page 56: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Birthday Attack Against a Hash Function

• Finding collisions of a hash function using Birthday Paradox.1. randomly chooses k messages, x1, x2, …, xk

2. search if there is a pair of messages, say xi and xj such that

h(xi) = h(xj).

If so, one collision is found.

• This birthday attack imposes a lower bound on the size of message digests.

• e.g. 40-bit message digest would be very insecure, since a collision could be found with probability at least ½ after doing slight over 220 (about a million) random hashes.

56

Page 57: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Size of a Message Digest / Hash Value

• h : {0,1}*{0,1}n

– If n = 64, the probability of finding one collision will be higher than half after slightly more than 232 random hashes being tried.

– If there exists a machine which can carry out 100,000 hashes per second, it takes 12 hours for finding the first collision with probability higher than half.

• Recommended message digest lengths (in bits): 128 (MD5), 160 (SHA-1), 256 (SHA-256) or 512 (SHA-512)

• For those recommended lengths, because the number of possible hashes is so large, the odds of finding one by chance is negligibly small (one in 280 for SHA-1).

57

Page 58: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Examples: MD5 and SHA-1

MD5• MD – Message Digest, designed by Ron Rivest in 1992.• Available at http://www.ietf.org/rfc/rfc1321• Output length: 128 bits• A Birthday Attack can be launched using 264 trials.

SHA-1• Developed by NIST based on MD4, a precursor to MD5, in 1995• Available at http://www.itl.nist.gov/fipspubs/fip180-1.htm• Output length: 160 bits• More difficult to launch a birthday attack: needs 280 trials.

SHA-2 (SHA 256/384/512)• Based on SHA-1 with a longer hash value

58

Page 59: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Security Updates of Hash Functions

MD5

• In Aug 2004, Wang, et al. showed that it is “easy” to find collisions in MD5. They found many collisions in very short time (in minutes)

• http://eprint.iacr.org/2004/199.pdf

SHA-1

• In Feb 2005, Wang, et al. showed that collisions can be found in SHA-1 with an estimated effort of 269 hash computations.– Less than 280 hash computations by birthday attack.

• http://www.schneier.com/blog/archives/2005/02/sha1_broken.html

Impacts

• Hurts digital signatures

• Does not affect HMAC where collisions aren’t important.

• For applications require underlying hash functions should be collision resistant, it’s time to migrate away from SHA-1.

• Start using new standards SHA-256 and SHA-512.

• http://csrc.nist.gov/CryptoToolkit/tkhash.html 59

Page 60: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Some Details about Finding Collisions in SHA-1

Q: How hard would it be to find collisions in SHA-1?A: The reported attacks require an estimated work factor of 269 (approximately 590 billion billion) hash computations. While this is well beyond what is currently feasible using a normal computer, this is potentially feasible for attackers who have specialized hardware. For example, with 10,000 custom ASICs that can each perform 2 billion hash operations per second, the attack would take about one year. Computing improvements predicted by Moore 's Law will make the attack more practical over time, e.g. making it possible for a wide-spread Internet virus to use compromised computers to mount such attacks as well. Once a collision has been found, additional collisions can be found trivially by concatenating data to the matching messages.

v0.1 60

Borrowed from http://www.cryptography.com/cnews/hash.html

Page 61: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

SHA-1 Broken

• Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu (mostly from Shandong University in China)– collisions in the the full SHA-1 in 2**69 hash

operations, much less than the brute-force attack of 2**80 operations based on the hash length.

– collisions in SHA-0 in 2**39 operations.– collisions in 58-round SHA-1 in 2**33

operations.

Page 62: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

62

Insert

Search

The Art of Computer ProgrammingIII

Page 63: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Reference

• PAC Chapter 15, DS Chapter 7.4

Page 64: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

Next

• Expression Tree

• PAC Chapter 17

Page 65: Maps and Hashing. Part I: Basic Concepts Eric Roberts CS 106B

The End