
Meljun Cortes Algorithm Space and Time Tradeoffs II


Page 1: Meljun Cortes Algorithm Space and Time Tradeoffs II

Design and Analysis of Algorithms

Space and Time Tradeoffs *Property of STI Page 1 of 13

TOPIC TITLE: Space and Time Tradeoffs

Specific Objectives:

At the end of the topic session, the students are expected to:

Cognitive:
1. Identify the importance of space and time tradeoffs in designing algorithms.
2. Understand the importance of input enhancement in string matching.
3. Understand and apply Horspool's algorithm in string matching.
4. Understand the application of the Boyer-Moore algorithm design technique in pattern matching.
5. Understand the use of hashing in algorithm design.
6. Differentiate open hashing from closed hashing.

Affective:
1. Listen to others with respect.
2. Participate in class discussions actively.
3. Share ideas with the class.

MATERIALS/EQUIPMENT:

o topic slides
o OHP

TOPIC PREPARATION:

o Have the students research on the following:
   - Definition and application of space and time tradeoffs in designing algorithms
   - Definition of the Horspool, Boyer-Moore, and hashing algorithm design techniques
   - Application of the Horspool, Boyer-Moore, and hashing algorithm design techniques
o Provide sample problems that can be solved using string matching.
o It is imperative for the instructor to incorporate various kinds of teaching strategies while discussing the suggested topics. The instructor may use the suggested learning activities below to facilitate a thorough and creative discussion of the topic.
o Prepare the slides to be presented in the class.
o Prepare seatwork for the students to apply the lessons learned under the Space and Time Tradeoffs topic.

TOPIC PRESENTATION: The topic will cover the application of the space and time tradeoffs algorithm design technique. The following is the suggested flow of discussion for the course topic:

1. Call on students and ask for their ideas regarding space and time tradeoffs.

2. Discuss the application of the space and time tradeoffs algorithm design technique to the following:


a. Horspool's algorithm
b. Boyer-Moore algorithm
c. Hashing
   i. Open hashing
   ii. Closed hashing

3. Simulate sample problems solved using the space and time tradeoffs algorithm design technique.


Space and Time Tradeoffs

The following are the topics to be discussed under Space and Time Tradeoffs:

o Importance of space and time tradeoffs in designing good algorithms
o Strengths and weaknesses of space and time tradeoffs
o Input enhancement in string manipulation
o Horspool's algorithm
o Boyer-Moore algorithm
o Hashing algorithm
o Sample problems using the Horspool's, Boyer-Moore, and hashing algorithms

[Space and Time Tradeoffs, Page 1 of 29]


What is Space and Time Tradeoff?

A space-time or time-memory tradeoff is a situation where the memory use can be reduced at the cost of slower program execution, or vice versa [www.wikipedia.com].

The space and time tradeoff is one of the most important issues in computer science: computation time can often be reduced at the cost of increased memory use. Usually, the space and time tradeoff is applied in problems involving tables, such as a scenario where we need to compute the values of a function at many points in its domain. The best way to solve this is to precompute the function's values and store them in a table. Though precomputed tables have lost much of their allure with the extensive use of computers, the underlying idea has proven quite useful in the development of many algorithms. Generally, the idea is to preprocess the problem's input, in whole or in part, and store the additional information obtained in order to accelerate solving the problem afterward; this technique is referred to as input enhancement.

[What is Space and Time Tradeoff?, Page 2 of 29]


Input Enhancement in String Manipulation

Input enhancement can be applied to different string manipulation problems. Keep in mind that the problem of string matching requires finding occurrences of a given string of characters, referred to as the pattern, in a longer string of characters called the text. Input enhancement improves on the brute-force algorithm:

Figure 11.1 Representation of input enhancement using the brute-force algorithm
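The brute-force approach that input enhancement improves upon (align the pattern, compare left to right, shift by one on failure) can be sketched as follows; this is a minimal illustrative sketch, and the function name is ours:

```python
def brute_force_match(pattern, text):
    """Brute-force string matching: try every alignment of the
    pattern, comparing left to right; return the index of the
    first match, or -1 if the pattern does not occur."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):        # align pattern at position i
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                    # all m characters matched
            return i
    return -1
```

In the worst case this makes m comparisons at each of roughly n alignments, giving O(nm) time; the algorithms below trade extra preprocessing space for fewer comparisons.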

[Input Enhancement in String Manipulation, Page 3 of 29]

Horspool's Algorithm

Horspool's algorithm is a simple algorithm design technique that exploits input enhancement for most input patterns. Its main idea is to compare the characters of the pattern to the text from right to left instead of left to right. A shift table is created from the given pattern; it contains a shift value for each possible character and determines how far to shift the pattern when a mismatch occurs (this is the input enhancement). Note, however, that the pattern itself is still shifted from left to right; only the comparison is right to left. Let us examine the example below, searching for the pattern BARBER in some text:

s0 … c … sn-1
     B A R B E R

First, starting with the last R of the pattern and moving from right to left, compare the corresponding pairs of characters in the pattern and the text. If the whole pattern matches, a match of the given substring is found; the search can then be stopped, or continued if another occurrence of the same pattern is desired. If you encounter a mismatch, you have to shift the pattern to the right. Obviously, you would like to make as large a shift as possible without risking missing a matching substring in the text. Horspool's algorithm determines the size of such a shift by looking at the character c of the text aligned against the last character of the pattern. In general, the following four cases can occur:

(For comparison, the brute-force steps behind Figure 11.1 are: 1. Align the pattern at the beginning of the text. 2. Moving from left to right, compare each character of the pattern to the corresponding character in the text until either all characters are found to match (successful search) or a mismatch is detected. 3. While the pattern is not found and the text is not yet exhausted, realign the pattern one position to the right and repeat step 2.)

1. If there are no c's in the pattern (for instance, if c is the letter S in the illustration below), you can safely shift the pattern by its complete length; if you shifted less, some character of the pattern would be aligned against the text's character c, which is known not to be in the pattern.

Example:
s0 … S … sn-1
     B A R B E R
                 B A R B E R

2. If there are occurrences of the character c in the pattern but c is not the last pattern character (for instance, let c be the letter B), the shift should align the rightmost occurrence of c in the pattern with the c in the text.

Example:
s0 … B … sn-1
     B A R B E R
         B A R B E R

3. If c happens to be the last character in the pattern but there are no c's among its other m-1 characters, the shift should be similar to Case 1: the pattern should be shifted by the entire pattern's length m.

Example:
s0 … M E R … sn-1
 L E A D E R
             L E A D E R

4. Lastly, if c happens to be the last character in the pattern and there are other c's among its first m-1 characters, the shift should be similar to Case 2: the rightmost occurrence of c among the first m-1 characters of the pattern should be aligned with the c in the text.

Example:
s0 … O R … sn-1
 R E O R D E R
       R E O R D E R

These cases show that right-to-left character comparisons can lead to farther shifts of the pattern than the shifts by only one position invariably made by the brute-force algorithm. However, if the algorithm had to recompute the shift by inspecting all the characters of the pattern on every trial, it would lose much of this advantage. Fortunately, the idea of input enhancement makes such repetitive recomputation unnecessary: you can precompute the shift sizes and store them in a table. The table is indexed by all possible characters that can be encountered in a text, including, for natural language texts, spaces,


punctuation symbols, and other special characters. Remember that no other information about the text in which the eventual searching will be done is required. The table entries indicate the shift sizes, computed by the formula:

t(c) = the pattern's length m, if c is not among the first m-1 characters of the pattern;
       otherwise, the distance from the rightmost c among the first m-1 characters of the pattern to the pattern's last character.

For the pattern BARBER, all the table entries will be equal to 6, except for the entries for E, B, R, and A, which will be 1, 2, 3, and 4, respectively. The shift table is characterized by the following:

1. It stores the number of characters to shift by, depending on the first character compared (the text character aligned with the end of the pattern).
2. It is constructed by scanning the pattern before searching starts.
3. It is indexed by the alphabet of the text and pattern.
4. All entries are initialized to the pattern's length.

Example: BAOBAB

Figure 11.2 Illustration of the shift table

In the given figure, for each character c occurring in the pattern, the table entry is updated to the distance of the rightmost occurrence of c (among the first m-1 characters) from the end of the pattern. This can be done by processing the pattern from left to right. The figure below represents Horspool's algorithm:

Figure 11.3 Representation of the Horspool’s Algorithm
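The shift-table construction described above can be sketched as follows; this is a minimal sketch, and the function name is ours. Characters absent from the pattern simply keep the default shift of m (they are left out of the dictionary):

```python
def shift_table(pattern):
    """Build Horspool's shift table. Characters not stored in the
    dictionary take the default shift m = len(pattern)."""
    m = len(pattern)
    table = {}
    # Scan the first m-1 characters left to right; later occurrences
    # overwrite earlier ones, so each entry ends up as the distance of
    # the character's rightmost occurrence from the last position.
    for i in range(m - 1):
        table[pattern[i]] = m - 1 - i
    return table
```

For BAOBAB this gives A = 1, B = 2, O = 3, and 6 for every other character; for BARBER it gives E = 1, B = 2, R = 3, A = 4, matching the entries stated above.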

Example:
Pattern: ZIGZAG
Text: A ZIG, A ZAG, AGAIN A ZIGZAG
Shift Table:

Figure 11.4 Shift table of the given example applying Horspool’s algorithm

1. Construct the shift table for the given pattern and alphabet.

2. Align the pattern against the beginning of the text.

3. Repeat until a match is found or the pattern reaches beyond the text:
   o Starting from the end of the pattern, compare corresponding characters until either all m characters match (success!) or a mismatch is found.
   o On a mismatch, retrieve the shift table entry t(c), where c is the text character aligned with the last character of the pattern, and shift the pattern right by t(c).
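The complete search procedure above can be sketched as follows; a minimal sketch, with the function name ours:

```python
def horspool(pattern, text):
    """Horspool's algorithm: return the index of the first match,
    or -1 if the pattern does not occur in the text."""
    m, n = len(pattern), len(text)
    # Shift table: distance of each character's rightmost occurrence
    # (among the first m-1 characters) from the pattern's end.
    table = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = m - 1          # text index aligned with the pattern's last character
    while i < n:
        k = 0          # number of characters matched so far
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1            # successful search
        i += table.get(text[i], m)      # shift by t(c); default m
    return -1
```

On the example above, searching for ZIGZAG in "A ZIG, A ZAG, AGAIN A ZIGZAG" finds the match at index 22, the start of the final ZIGZAG.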


[Horspool's Algorithm, Pages 4-12 of 29]

Boyer-Moore Algorithm

When the comparison of the rightmost character in the pattern with the corresponding character c in the text fails, the Boyer-Moore algorithm does exactly the same thing as Horspool's algorithm. [LEV07] Although the two algorithms are closely related, they act differently when some positive number k (0 < k < m) of the pattern's characters are matched successfully before a mismatch is found.

In that case, the Boyer-Moore algorithm determines the shift size by considering two quantities. The first is based on the text character c that caused the mismatch and is referred to as the bad-symbol shift. If c is not in the pattern, you shift the pattern past this c in the text; the shift size can be computed as t1(c) - k, where t1(c) is an entry in a precomputed table (the same one as for Horspool's algorithm) and k is the number of matched characters.

If the shift computed as t1(c) - k is zero or negative, you do not shift the pattern by 0 or a negative number; you fall back on brute-force thinking and shift by 1:

Bad-symbol shift: d1 = max{t1(c) - k, 1}

The second type of shift is guided by the k > 0 successfully matched characters at the end of the pattern. This ending part of the pattern is called the suffix of size k, denoted suff(k), and the corresponding shift is called the good-suffix shift.
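The bad-symbol rule d1 = max{t1(c) - k, 1} can be sketched as follows; a minimal sketch, with the helper name ours and t1 built exactly as in Horspool's algorithm:

```python
def bad_symbol_shift(pattern, c, k):
    """Boyer-Moore bad-symbol shift d1 = max{t1(c) - k, 1}, where c is
    the mismatched text character and k is the number of pattern
    characters already matched (right to left) before the mismatch."""
    m = len(pattern)
    # Horspool-style table; characters absent from the pattern default to m.
    t1 = {pattern[i]: m - 1 - i for i in range(m - 1)}
    return max(t1.get(c, m) - k, 1)
```

For the pattern BARBER, a mismatch at text character S after 2 matched characters gives d1 = 6 - 2 = 4, while a mismatch at A after 4 matched characters gives t1(A) - k = 4 - 4 = 0, so the rule falls back to a shift of 1.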


You fill the good-suffix shift table using the same reasoning as for the bad-symbol shift table, for suffixes of sizes 1, ..., m-1. First case: there is another occurrence of suff(k) in the pattern, not preceded by the same character as the last occurrence (if the character before that occurrence were the same, you would simply repeat the same failure). You can then shift the pattern by the distance d2 between such a second rightmost occurrence of suff(k) and its rightmost occurrence.

k   PATTERN   d2
1   ABCBAB    2
2   ABCBAB    4

If there is no other occurrence of suff(k) not preceded by the same character, you can usually shift the pattern by its entire length m. To avoid missing a match, however, you must also consider the longest prefix of size l < k that matches a suffix of the same size; if such a prefix exists, the shift distance is the distance between this prefix and the matching suffix, m - l.

k   PATTERN   d2
1   ABCBAB    2
2   ABCBAB    4
3   ABCBAB    4
4   ABCBAB    4
5   ABCBAB    4
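The two cases above can be turned into a table-building routine as follows; this is a sketch under the description just given, and the function name is ours:

```python
def good_suffix_table(pattern):
    """Compute the Boyer-Moore good-suffix shifts d2 for k = 1..m-1,
    following the two cases described above."""
    m = len(pattern)
    d2 = {}
    for k in range(1, m):
        suff = pattern[m - k:]         # suffix of size k
        prev = pattern[m - k - 1]      # character preceding the suffix
        shift = None
        # Case 1: rightmost other occurrence of suff(k) not preceded
        # by the same character prev.
        for i in range(m - k - 1, -1, -1):
            if pattern[i:i + k] == suff and (i == 0 or pattern[i - 1] != prev):
                shift = (m - k) - i
                break
        if shift is None:
            # Case 2: longest prefix of size l < k matching a suffix
            # of the same size; otherwise shift by the full length m.
            shift = m
            for l in range(k - 1, 0, -1):
                if pattern[:l] == pattern[m - l:]:
                    shift = m - l
                    break
        d2[k] = shift
    return d2
```

Running this on ABCBAB reproduces the table above: d2 = 2 for k = 1 and d2 = 4 for k = 2 through 5.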

Steps in Boyer-Moore algorithm:

Figure 11.4 Boyer-Moore algorithm

After a mismatch, retrieve the bad-symbol shift distance d1 from the table t1(c); if k > 0, also retrieve d2 from the good-suffix table. The distance d to shift is computed as:

d = d1              if k = 0
d = max{d1, d2}     if k > 0,    where d1 = max{t1(c) - k, 1}

[Boyer-Moore Algorithm, Pages 13-17 of 29]

Step 1: For the given pattern and alphabet, construct the bad-symbol shift table.
Step 2: Using the pattern, construct the good-suffix shift table.
Step 3: Align the pattern against the beginning of the text.
Step 4: Repeat until a match is found or the pattern reaches beyond the text: starting with the last character of the pattern, compare corresponding characters until either all m characters match (stop) or a mismatch is found after k >= 0 character pairs are matched successfully, then shift the pattern right by the distance d.


Hashing Algorithm

Hashing is a different approach to searching, based on the value of the key. It rests on the idea of distributing keys among a one-dimensional array referred to as the hash table. [LEV07] Hashing is considered a very efficient way to implement dictionaries. Recall that a dictionary is an abstract data type, namely, a set with the operations of searching, inserting, and deleting elements. The elements of this set can be of an arbitrary nature: numbers, characters of some alphabet, character strings, and so on. In practice, the most important case is that of records (student records in schools, citizen records in a governmental office, book records in a library). Typically, records comprise several fields, each responsible for keeping a particular type of information about an entity the record represents. For instance, a student record may contain fields for the student's ID, name, date of birth, gender, home address, and so on. Among the record fields there is usually at least one, called a key, that is used for identifying the entity represented by the record, such as the student's ID number. In the subsequent discussion, we will assume that we have to implement a dictionary of n records with keys K1, K2, ..., Kn.

Hash Tables and Hash Functions

Hashing is based on the idea of distributing keys among a one-dimensional array H[0...m-1] called the hash table. The distribution is done by computing, for each key, the value of some predefined function h called the hash function. In other words, the idea of hashing is to map the keys of a given file of size n into a table of size m, the hash table, using a predefined hash function. This function assigns an integer between 0 and m-1, called the hash address, to a key.

For instance, if the keys are nonnegative integers, a hash function can be of the form h(K) = K mod m (obviously, the remainder of division by m is always between 0 and m-1). If the keys are letters of some alphabet, you can first assign each letter its position in the alphabet, denoted ord(K), and then apply the same kind of function used for integers. If K is a character string c0c1...cs-1, a very unsophisticated option is h(K) = (Σ ord(ci)) mod m; a better option is to compute h(K) as follows:

h ← 0; for i ← 0 to s-1 do h ← (h * C + ord(ci)) mod m,

where C is a constant larger than every ord(ci).

Example: student records, key = SSN. Hash function: h(K) = K mod m, where m is some integer (typically a prime). If m = 1000, the record with SSN = 314159265 is stored in cell 314159265 mod 1000 = 265.

Generally, a hash function should satisfy two conflicting requirements:
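The two hash functions just described can be sketched as follows; a minimal sketch, with the function names and the choice C = 256 ours (any constant larger than every character code works):

```python
def hash_ssn(ssn, m=1000):
    """Integer-key hash: h(K) = K mod m."""
    return ssn % m

def hash_string(key, m, C=256):
    """Polynomial string hash: h = (h * C + ord(c)) mod m for each
    character, keeping the intermediate value below m throughout."""
    h = 0
    for ch in key:
        h = (h * C + ord(ch)) % m
    return h
```

Taking the remainder at every step keeps the intermediate values small, which is the point of the "better option" over summing all character codes first.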

1. A hash function needs to distribute keys among the cells of the hash table as evenly as possible. (Because of this requirement, the value of m is usually chosen to be a prime number. The requirement also makes it desirable, for most applications, to have the hash function depend on all bits of a key, not just some of them.)

2. A hash function has to be easy to compute.

Collisions

Good hash functions result in fewer collisions, but some collisions should be expected. In particular, if you choose a hash table size m smaller than the number of keys, you are guaranteed to get collisions. A collision is the phenomenon of two or more keys being hashed into the same cell of the hash table: h(Ki) = h(Kj) for distinct keys Ki and Kj.

Figure 11.5 Collision of two keys Ki and Kj hashing into the same cell b of the table H[0...m-1]

There are two principal hashing schemes that handle collisions differently:

o Open hashing: each cell is the header of a linked list of all keys hashed to it.
o Closed hashing: one key per cell; in case of collision, another cell is found by
  - linear probing: use the next free cell, or
  - double hashing: use a second hash function to compute the increment.

Open Hashing (Separate Chaining)

In open hashing, the keys are stored in linked lists outside the hash table, whose elements serve as the lists' headers. For instance, consider the following list of words:

A, FOOL, AND, HIS, MONEY, ARE, SOON, PARTED

As a hash function, use the simple function for strings mentioned above: add the positions of the word's letters in the alphabet and compute the sum's remainder after division by 13:

h(K) = (sum of K's letters' positions in the alphabet) mod 13

Start with an empty table. The first key is the word A; its hash value is h(A) = 1 mod 13 = 1. The second key, FOOL, is installed in the 9th cell, since (6 + 15 + 15 + 12) mod 13 = 9. Do the same procedure with the other words in the list.

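The construction just described can be sketched as follows; a minimal sketch in which Python lists stand in for the linked lists, and the function name is ours:

```python
def build_chained_table(words, m=13):
    """Open hashing (separate chaining): each cell holds the list of
    keys hashed to it; h(K) = sum of letter positions mod m."""
    h = lambda w: sum(ord(c) - ord('A') + 1 for c in w) % m
    table = [[] for _ in range(m)]
    for w in words:
        table[h(w)].append(w)   # insertion at the end of the cell's list
    return table

words = ["A", "FOOL", "AND", "HIS", "MONEY", "ARE", "SOON", "PARTED"]
table = build_chained_table(words)
```

Cell 11 ends up holding both ARE and SOON (a collision), which is why a later search for KID, which also hashes to 11, must traverse that list before failing.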


The figure below is the output of the computation:

Figure 11.6 Example of a hash table

Recall that searching in such a dictionary uses the same procedure that was used for creating the table. For instance, if you want to search for the key KID in the hash table in Fig. 11.6, you compute the value of the same hash function for this key: h(KID) = 11. Since the list attached to cell 11 is not empty, its linked list may contain the search key. But because of possible collisions, you cannot tell whether this is the case until you traverse the list. After comparing the string KID with the string ARE and then with the string SOON, you end up with an unsuccessful search. In general, the efficiency of searching depends on the lengths of the linked lists, which, in turn, depend on the dictionary and table sizes, as well as on the quality of the hash function. If the hash function distributes the n keys among the m cells of the hash table about evenly, each list will be about n/m keys long. The ratio α = n/m is called the load factor. For ideal hash functions, the average numbers of probes in successful searches, S, and unsuccessful searches, U, are:

S ≈ 1 + α/2,   U = α

under the standard assumptions of searching for a randomly selected element and a hash function distributing keys uniformly among the table's cells. These results are quite natural. They are almost identical to searching sequentially in a linked list; what you have gained by hashing is a reduction in average list size by a factor of m, the size of the hash table. Usually, you want the load factor to be not far from 1. Having it too small would entail many empty lists and hence inefficient use of space; having it too large would mean longer linked lists and therefore longer search times. With a load factor around 1, you have an efficient scheme that makes it possible to search for a given key, on average, at the price of one or two comparisons. In addition to comparisons, you need to spend time computing the value of the hash function for the search key, but this is a constant-time operation, independent of n and m. Therefore, the two other dictionary operations, insertion and deletion, are almost identical to searching. Insertion is usually done at the end of a list, although sometimes it is not. Deletion is performed by searching for the key to be deleted and then


removing it from the list. Consequently, the efficiency of these operations is identical to that of searching, and they are all Θ(1) in the average case if the number of keys n is about equal to the hash table's size m.

Closed Hashing

In closed hashing, all the keys are stored in the hash table itself, without the use of linked lists. This implies that the table size m must be at least as large as the number of keys n. [LEV07] Various strategies can be employed for collision resolution. The simplest one, called linear probing, checks the cell following the one where the collision occurs. If that cell is empty, the new key is installed there; if the next cell is already occupied, the availability of that cell's immediate successor is checked, and so on (wrapping to the beginning of the table if necessary). Using the word list from the open hashing example:

Figure 11.7 Example of a hash table construction with linear probing

To search for a given key K, start by computing h(K), where h is the hash function used in the table's construction. If the cell h(K) is empty, the search is unsuccessful. If the cell is not empty, compare K with the cell's occupant: if they are equal, you have found a matching key; if not, compare K with the key in the next cell, and continue in this manner until you encounter either a matching key (a successful search) or an empty cell (an unsuccessful search). For instance, if you search for the word LIT in the table of Figure 11.7, you compute h(LIT) = (12 + 9 + 20) mod 13 = 2, and since cell 2 is empty, you can stop immediately. However, if you search for KID, with h(KID) = (11 + 9 + 4) mod 13 = 11, you will have to compare KID with ARE, SOON, PARTED, and A before you can declare the search unsuccessful. The mathematical analysis of linear probing is a much more difficult problem than that of separate chaining. The simplified versions of these results state that the average number of times the algorithm must access the hash table, with load factor α, in successful and unsuccessful searches is, respectively:
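The insertion and search procedures just described can be sketched as follows; a minimal sketch, with the function names ours, assuming the table never becomes completely full:

```python
def build_linear_probing_table(words, m=13):
    """Closed hashing with linear probing: one key per cell; on a
    collision, try successive cells (wrapping around) until an
    empty one is found."""
    h = lambda w: sum(ord(c) - ord('A') + 1 for c in w) % m
    table = [None] * m
    for w in words:
        i = h(w)
        while table[i] is not None:
            i = (i + 1) % m
        table[i] = w
    return table

def probe_search(table, key):
    """Return True if key is present; the search stops at the first
    empty cell (assumes the table is not completely full)."""
    m = len(table)
    i = sum(ord(c) - ord('A') + 1 for c in key) % m
    while table[i] is not None:
        if table[i] == key:
            return True
        i = (i + 1) % m
    return False

words = ["A", "FOOL", "AND", "HIS", "MONEY", "ARE", "SOON", "PARTED"]
table = build_linear_probing_table(words)
```

Here SOON collides with ARE at cell 11 and lands in cell 12, which in turn pushes PARTED around the wrap to cell 0; the search for LIT stops immediately at the empty cell 2, while the search for KID walks through ARE, SOON, PARTED, and A before failing.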

S = (½) (1+ 1/(1- α)) and U = (½) (1+ 1/(1- α)²)

These values are surprisingly small even for densely populated tables, that is, for large values of the load factor α:

α      S = (½)(1 + 1/(1-α))    U = (½)(1 + 1/(1-α)²)
50%    1.5                      2.5
75%    2.5                      8.5
90%    5.5                      50.5

Still, as the hash table gets closer to being full, the performance of linear probing deteriorates because of a phenomenon called clustering. A cluster in linear probing is a sequence of contiguously occupied cells (with possible wrapping). For instance, the final state of the hash table in Figure 11.7 has two clusters. Clusters are bad news in hashing because they make the dictionary operations less efficient. Also remember that as clusters become larger, the probability that a new element will be attached to a cluster increases; moreover, large clusters increase the probability that two clusters will coalesce after a new key's insertion, producing even more clustering.

[Hashing Algorithm, Pages 18-29 of 29]

Seatwork

Using the Boyer-Moore algorithm, detect the mismatched characters Ti, where P is shifted to the right to align Ti with the first encountered character equal to Ti, if such a character exists. Given the text T = aaaaebdaabadbda and the pattern P = dabacbd, what will be the value of the following?

a. T4   b. P4   c. T5   d. P5   e. T6   f. P6

Answer: a. e   b. c   c. b   d. b   e. d   f. d

GENERALIZATION:

o Space-time or time-memory tradeoff is a situation where the memory use can be reduced at the cost of slower program execution, or vice versa [www.wikipedia.com].

o Input enhancement can be applied to a string manipulation problem.

o Horspool's algorithm is a simple algorithm design technique based on input enhancement. Its main idea is to compare the characters of the pattern to the text from right to left instead of left to right.

o The Boyer-Moore algorithm does exactly the same thing as Horspool's algorithm when the comparison of the rightmost character in the pattern with the corresponding character c in the text fails; it differs when some pattern characters have already been matched.

o Hashing is based on the idea of distributing keys among a one-dimensional array called hash table.


o Open hashing is where the keys are stored in linked lists outside a hash table whose elements serve as the lists’ headers.

o Closed hashing is when all the keys are stored in the hash table itself without the use of linked list.

REFERENCES:

Space-time tradeoff. http://en.wikipedia.org/wiki/Space-time_tradeoff
http://www.cs.utsa.edu/~bylander/cs3343/chapter7handout.pdf
http://www.cs.tut.fi/~tiraka/english/material2005/lecture13.pdf
http://www.math.uaa.alaska.edu/~afkjm/cs351/handouts/strings.pdf
[LEV07] Anany Levitin (2007), The Design and Analysis of Algorithms (2nd ed.), Pearson Education Inc.