Data Structures

Course 2: Searching

Tianfu College, Southwestern University of Finance and Economics

Vocabulary

sequential search, element, order, binary search, target, algorithm, array, location, object (also: target), parameter

index (also: subscript, pointer), sentinel, probability, key, hash, collision, cluster, synonym, probe, load factor


Searching

One of the most common and time-consuming operations in computer science.

To find the location of a target among a list of objects.


Main contents (in chapter 2)

List searching (including two basic search algorithms)

• Sequential search (including three variations)
• Binary search

Hashed list searching: the key, through an algorithmic function, determines the location of the data

Collision resolution

The list search algorithms are discussed using an array structure


2-1 List searches (work with arrays)

The algorithm used to search a list depends on the structure of the list

Sequential search (any array) is used when:
• The list is not ordered
• The list is small
• The list is not searched often


Locating data in an unordered list

Index:  0   1   2   3   4   5   6   7   8   9   10  11
A:      4   21  36  14  62  91  8   22  7   81  77  10

Target given: 14
Location wanted: 3

Search concept: compare the target with each element in turn.
Index 0: 14 not equal 4
Index 1: 14 not equal 21
…
Index 3: 14 equal 14


Sequential search algorithms

The search needs to tell the calling algorithm two things:
• Did it find the data it was looking for?
• If it did, at what index was the target data found?

It requires four parameters:
• The list we are searching
• An index to the last element in the list
• The target
• The address where the found element's index location is to be stored

(It returns a Boolean.)


Sequential search algorithm

algorithm seqsearch (val list <array>, val last <index>,
                     val target <keytype>, ref locn <index>)
Locate the target in an unordered list.
   Pre    list must contain at least one element
          last is index to last element in the list
          target contains the data to be located
          locn is address of index in calling algorithm
   Post   if found: matching index stored in locn and found true
          if not found: last stored in locn and found false
   Return found <boolean>

   looker = 0
   loop (looker < last and target not equal list[looker])
      looker = looker + 1
   end loop
   locn = looker
   if (target equal list[looker])
      found = true
   else
      found = false
   end if
   return found
end seqsearch
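
A minimal Python sketch of the pseudocode above, assuming a Python list stands in for the array. The names seq_search and data are illustrative and not from the slides; the function returns the pair (found, locn) instead of using a reference parameter.

```python
def seq_search(items, target):
    """Sequential search; returns (found, locn) as in the postconditions."""
    last = len(items) - 1              # index of the last element (list must be non-empty)
    looker = 0
    # Advance until the target is met or the last element is reached.
    while looker < last and items[looker] != target:
        looker += 1
    return items[looker] == target, looker


data = [4, 21, 36, 14, 62, 91, 8, 22, 7, 81, 77, 10]
print(seq_search(data, 14))   # (True, 3)
print(seq_search(data, 5))    # (False, 11)
```

When the target is missing, locn is the index of the last element, matching the "not found" postcondition.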


Variations on sequential searches

Sentinel search

Probability search

Ordered list search


Sentinel search

algorithm sentinelSearch (val list <array>, val last <index>,
                          val target <keytype>, ref locn <index>)
Locate the target in an unordered list.
   Pre    list must contain at least one element
          last is index to last element in the list
          list[last + 1] must be available to hold the sentinel
          target contains the data to be located
          locn is address of index in calling algorithm
   Post   if found: matching index stored in locn and found true
          if not found: last stored in locn and found false
   Return found <boolean>

   list[last + 1] = target
   looker = 0
   loop (target not equal list[looker])
      looker = looker + 1
   end loop
   if (looker <= last)
      found = true
      locn = looker
   else
      found = false
      locn = last
   end if
   return found
end sentinelSearch
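
A short Python sketch of the sentinel idea, under the assumption that appending to a Python list plays the role of the spare slot list[last + 1]. The name sentinel_search is illustrative only.

```python
def sentinel_search(items, target):
    """Sentinel search; returns (found, locn)."""
    last = len(items) - 1
    items.append(target)          # place the sentinel at index last + 1
    looker = 0
    # Only one test per iteration: the sentinel guarantees termination.
    while items[looker] != target:
        looker += 1
    items.pop()                   # remove the sentinel again
    if looker <= last:
        return True, looker
    return False, last


values = [4, 21, 36, 14, 62, 91, 8, 22, 7, 81, 77, 10]
print(sentinel_search(values, 14))   # (True, 3)
print(sentinel_search(values, 5))    # (False, 11)
```

The loop stops on the sentinel itself when the target is absent, so looker ends at last + 1 and the search reports not found.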


Probability search

Locate the target in an unordered list.
   Pre    the same as above
   Post   if found: matching index stored in locn, found true, and the element is moved up in priority
          if not found: the same as above
   Return found <boolean>

   looker = 0
   loop (looker < last and target not equal list[looker])
      looker = looker + 1
   end loop
   if (target equal list[looker])
      found = true
      if (looker > 0)
         temp = list[looker - 1]
         list[looker - 1] = list[looker]
         list[looker] = temp
         looker = looker - 1
      end if
   else
      found = false
   end if
   locn = looker
   return found
end probability search
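
A possible Python rendering of the probability search, assuming the list may be modified in place. The name probability_search is illustrative and not from the slides.

```python
def probability_search(items, target):
    """Sequential search that moves a found element one position forward."""
    last = len(items) - 1
    looker = 0
    while looker < last and items[looker] != target:
        looker += 1
    if items[looker] != target:
        return False, looker
    if looker > 0:
        # Swap with the preceding element so frequent targets drift to the front.
        items[looker - 1], items[looker] = items[looker], items[looker - 1]
        looker -= 1
    return True, looker


sample = [4, 21, 36, 14]
print(probability_search(sample, 14))   # (True, 2)
print(sample)                           # [4, 21, 14, 36]
```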


Ordered list search

Locate the target in a list ordered on target.
   Pre    the same as the sequential search
   Post   if found: the same as above
          if not found: locn is index of first element > target, or locn equals last, and found is false
   Return found <boolean>

   if (target <= list[last])
      looker = 0
      loop (target > list[looker])
         looker = looker + 1
      end loop
   else
      looker = last
   end if
   if (target equal list[looker])
      found = true
   else
      found = false
   end if
   locn = looker
   return found

Note:
• It is not necessary to search to the end of the list
• It is only for small lists
• It incorporates the sentinel idea
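
A minimal Python sketch of the ordered list search, assuming an ascending list. The names ordered_search and ordered are illustrative.

```python
def ordered_search(items, target):
    """Sequential search of an ascending list; stops at the first element >= target."""
    last = len(items) - 1
    if target <= items[last]:
        looker = 0
        # The last element acts as a sentinel: the loop must stop by index last.
        while target > items[looker]:
            looker += 1
    else:
        looker = last
    return items[looker] == target, looker


ordered = [4, 7, 8, 10, 14, 21, 22, 36, 62, 77, 81, 91]
print(ordered_search(ordered, 22))    # (True, 6)
print(ordered_search(ordered, 11))    # (False, 4)
print(ordered_search(ordered, 100))   # (False, 11)
```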


Binary search

The sequential search algorithm is very slow
– But it is the only solution if the array is not sorted

Binary search (ordered list)
– For large lists
– First sort
– Then search


Binary search method

Suppose L is a sorted list and we are searching for a value X.

1. Compare X to the middle value (M) in L.
2. If X = M, we are done.
3. If X < M, we continue our search, but we can confine our search to the first half of L and entirely ignore the second half of L.
4. If X > M, we continue, but confine ourselves to the second half of L.


Binary search example

Index:  0   1   2   3   4   5   6   7   8   9   10  11
A:      4   7   8   10  14  21  22  36  62  77  81  91

Target found: target 22 is in the list

first  mid  last   comparison
0      5    11     22 > 21
6      8    11     22 < 62
6      6    7      22 = 22

Target not found: target 11 is not in the list

first  mid  last   comparison
0      5    11     11 < 21
0      2    4      11 > 8
3      3    4      11 > 10
4      4    4      11 < 14
4      4    3      function terminates (first > last)


Binary search (ordered list)

algorithm binary_search (val list <array>, val end <index>,
                         val target <keytype>, ref locn <index>)
   Pre    list is ordered; it must contain at least one element
          end is index to the largest element in the list
          target is the value of the element being sought
          locn is address of index in calling algorithm
   Post   found: locn assigned index to target element, found set true
          not found: locn = element below or above target, found set false
   Return found <boolean>

   first = 0
   last = end
   loop (first <= last)
      mid = (first + last) / 2
      if (target > list[mid])
         look in upper half
         first = mid + 1
      else if (target < list[mid])
         look in lower half
         last = mid - 1
      else
         found equal: force exit
         first = last + 1
      end if
   end loop
   locn = mid
   if (target equal list[mid])
      found = true
   else
      found = false
   end if
   return found
end binary_search
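
The pseudocode above can be sketched in Python as follows, assuming an ascending, non-empty list and integer division for the midpoint. The names binary_search and sorted_list are illustrative.

```python
def binary_search(items, target):
    """Binary search of an ascending list; returns (found, locn)."""
    first, last = 0, len(items) - 1
    mid = 0
    while first <= last:
        mid = (first + last) // 2
        if target > items[mid]:
            first = mid + 1       # look in upper half
        elif target < items[mid]:
            last = mid - 1        # look in lower half
        else:
            first = last + 1      # found equal: force exit
    return items[mid] == target, mid


sorted_list = [4, 7, 8, 10, 14, 21, 22, 36, 62, 77, 81, 91]
print(binary_search(sorted_list, 22))   # (True, 6)
print(binary_search(sorted_list, 11))   # (False, 4)
```

For target 11 the loop ends with first = 4 and last = 3, so locn is the index of the element just above the target.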


Analyzing the efficiency

Sequential search, sentinel search, ordered list search: O(n)
Binary search: O(log2 n)

Comparison of binary and sequential searches

Size        Binary   Sequential (average)   Sequential (worst case)
16          4        8                      16
10,000      14       5,000                  10,000
1,000,000   20       500,000                1,000,000


2-3 Hashed list searches

Ideal search: we would know exactly where the data are and go directly there

Goal of a hashed search: to find the data with only one test

Use an array of data

Hash function: key -> location of data

Hash algorithm: key -> index of array (address in the list)


Figure 2-6 Hash concept

Key      Name          Address (after hashing)
102002   Vu Nguyen     [005]
107095   John Adams    [100]
111060   Sarah Trapp   [002]

(The list also holds Harry Lee; addresses run from [000] to [100].)


Basic concepts

Hash search: a search in which the key, through an algorithmic function, determines the location of the data.

We use a hashing algorithm to transform the key into the index that contains the data we need to locate (key-to-address transformation).


Problem

Synonyms: a set of keys that hash to the same location

Collision: occurs when the list contains two or more synonyms, so that a key hashes to an address that is already occupied

Home address: the address produced by the hashing algorithm

Collision resolution: when two keys collide at a home address, one of the keys and its data is placed in another location

Prime area: the memory that contains all of the home addresses


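Figure 2-7 The collision resolution concept

The list has addresses [0] to [16]; after all three inserts, C is at [4], A at [8] and B at [16].
1. hash(A): A is stored at 8
2. hash(B): B and A collide at 8; collision resolution places B at 16
3. hash(C): C and B collide at 16; collision resolution places C at 4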


Locate an element in a hashed list

• Use the same algorithm that was used to insert it into the list
• First hash the key and check the home address
• If it contains the desired element, the search is complete
• If not, use the collision resolution algorithm to determine the next location and continue until we find the element or determine that it is not in the list
• Each calculation of an address and test for success is called a probe


Hashing methods

Figure 2-8 Basic hashing techniques: direct, subtraction, modulo division, digit extraction, midsquare, folding, rotation, pseudorandom generation


Direct method

The key is the address (one element per key, no synonyms)

Example 1: total monthly sales by the days of the month

Create an array of 31 accumulators

The accumulation code is:

dailySales[sale.day] = dailySales[sale.day] + sale.amount;
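
A small Python sketch of the accumulation, as an illustration only. The Sale record and the function total_by_day are hypothetical, and the array is given 32 slots so that days 1 to 31 can be used directly as indexes (index 0 stays unused), which is an assumption beyond the slide.

```python
from collections import namedtuple

# Hypothetical sale record; the slide only names the fields day and amount.
Sale = namedtuple("Sale", ["day", "amount"])


def total_by_day(sales):
    """Direct hashing: the day of the month is used as the array address."""
    daily_sales = [0.0] * 32      # index 0 unused, days 1..31
    for sale in sales:
        daily_sales[sale.day] += sale.amount
    return daily_sales


totals = total_by_day([Sale(1, 10.0), Sale(31, 5.0), Sale(1, 2.5)])
print(totals[1], totals[31])   # 12.5 5.0
```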


Example 2: a small company has fewer than 100 employees; the employee number is between 1 and 100

Figure 2-9 Direct hashing of employee numbers

Key   Address   Data
005   [005]     Vu Nguyen
100   [100]     John Adams
002   [002]     Sarah Trapp

Address [000] is not used, [001] holds Harry Lee, and the remaining addresses up to [100] are empty or not shown.


Subtraction method

• Keys are consecutive, but do not start from 1
• Example: your student ID number

Advantages
• The hashing function is very simple
• No collisions

Disadvantage
• Only for small lists


Note:

1. Generally speaking, hashed lists require some empty elements to reduce the number of collisions

2. The two methods above (direct and subtraction) are ideal, but their application is very limited (consider, for example, ID card numbers)


Modulo-division method (division remainder)

This method divides the key by the array size and uses the remainder for the address

The hashing algorithm is:

address = key modulo listsize

Note: a prime number listsize produces fewer collisions


Figure 2-10 Modulo-division hashing (listsize = 307)

Key      Address   Data
121267   [002]     Bryan Devaux
045128   [306]     Shouli Feldman
379452   [000]     Marry Dodd

The list also holds 378845 John Carver and 160252 Tuan Ngo.
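
The modulo-division addresses of Figure 2-10 can be checked with a few lines of Python. This is a sketch; the names modulo_hash and LIST_SIZE are illustrative, and keys are written without leading zeros because Python integer literals do not allow them.

```python
LIST_SIZE = 307   # prime list size used in the figure


def modulo_hash(key, list_size=LIST_SIZE):
    """Address = key modulo listsize."""
    return key % list_size


for key in (121267, 45128, 379452):
    print(key, "->", modulo_hash(key))
# 121267 -> 2
# 45128 -> 306
# 379452 -> 0
```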


Digit extraction method

Selected digits are extracted from the key and used as the address

Example: 6-digit employee number, 3-digit address; select the first, third and fourth digits

Key      Address
379452   394
121267   112
378845   388
160252   102
045128   051
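
A minimal Python sketch of digit extraction for the example above, assuming keys are padded to six digits. The function name and its parameters are illustrative.

```python
def digit_extraction_hash(key, positions=(1, 3, 4), width=6):
    """Build the address from the selected (1-based) digit positions of the key."""
    digits = str(key).zfill(width)             # e.g. 45128 -> "045128"
    return int("".join(digits[p - 1] for p in positions))


print(digit_extraction_hash(379452))   # 394
print(digit_extraction_hash(45128))    # 51, i.e. address 051
```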


Midsquare method

The key is squared and the address is selected from the middle of the squared number
Limitation: the size of the key
Example with 4-digit keys: 9452 * 9452 = 89340304, so the address is 3403

Variation: select a portion of the key
Select the first three digits of the key, square them, pad the square with zeros to 6 digits, and select digits 3-5 as the address

Key      Selected digits squared   Address
379452   379 * 379 = 143641        364
121267   121 * 121 = 014641        464
378845   378 * 378 = 142884        288
160252   160 * 160 = 025600        560
045128   045 * 045 = 002025        202
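
Both midsquare examples can be reproduced with a short Python sketch. The function names are illustrative, and the fixed widths (8 digits for the square of a 4-digit key, 6 digits for the square of a 3-digit portion) are taken from the examples above.

```python
def midsquare_hash(key):
    """4-digit key: square it, pad to 8 digits, take the middle 4 digits."""
    squared = str(key * key).zfill(8)
    return int(squared[2:6])


def midsquare_partial(key):
    """6-digit key: square the first 3 digits, pad to 6 digits, take digits 3-5."""
    prefix = int(str(key).zfill(6)[:3])
    squared = str(prefix * prefix).zfill(6)
    return int(squared[2:5])


print(midsquare_hash(9452))        # 3403
print(midsquare_partial(379452))   # 364
```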


Folding methods: fold shift and fold boundary

Figure 2-11 Hash fold examples

Key: 123456789, divided into 123 | 456 | 789

(a) Fold shift: 123 + 456 + 789 = 1368; the leading 1 is discarded, giving 368

(b) Fold boundary: the digits of the left and right parts are reversed: 321 + 456 + 987 = 1764; the leading 1 is discarded, giving 764
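
A Python sketch of the two folding methods of Figure 2-11, assuming the key length is a multiple of the address size. The helper split_parts and the function names are illustrative.

```python
def split_parts(key, size):
    """Split the decimal digits of key into consecutive parts of the given size."""
    digits = str(key)
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def fold_shift(key, size=3):
    """Add the parts; keep only the low `size` digits (the carry is discarded)."""
    total = sum(int(part) for part in split_parts(key, size))
    return total % 10 ** size


def fold_boundary(key, size=3):
    """Reverse the leftmost and rightmost parts before adding."""
    parts = split_parts(key, size)
    parts[0] = parts[0][::-1]
    parts[-1] = parts[-1][::-1]
    return sum(int(part) for part in parts) % 10 ** size


print(fold_shift(123456789))      # 368
print(fold_boundary(123456789))   # 764
```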


Rotation method: incorporated with other methods

Useful when keys are assigned serially

Figure 2-12 Rotation hashing

Original key   Rotated key
600101         160010
600102         260010
600103         360010
600104         460010
600105         560010
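
The rotation of Figure 2-12 moves the rightmost digit to the front. A minimal Python sketch, with the illustrative name rotate_key and an assumed key width of six digits:

```python
def rotate_key(key, width=6):
    """Move the rightmost digit of the key to the leftmost position."""
    digits = str(key).zfill(width)
    return int(digits[-1] + digits[:-1])


for key in range(600101, 600106):
    print(key, "->", rotate_key(key))
# 600101 -> 160010
# 600102 -> 260010
# 600103 -> 360010
# 600104 -> 460010
# 600105 -> 560010
```

The rotated keys are spread far apart, which is why rotation is normally followed by another method such as folding or modulo division.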


Pseudorandom method

In this method, the key is used as the seed in a pseudorandom number generator, and the resulting random number is scaled into the possible address range using modulo division

A common random generator is: y = ax + c
For efficiency, the factors a and c should be prime numbers, for example a = 17, c = 7


Figure: pseudorandom hashing with a = 17, c = 7, listsize = 307

Key      Calculation                     Address
121267   (17 * 121267 + 7) modulo 307    [041]
045128   (17 * 045128 + 7) modulo 307    [297]
379452   (17 * 379452 + 7) modulo 307    [007]

The list also holds 378845 John Carver and 160252 Tuan Ngo.
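
The three calculations above can be verified with a one-line hash function in Python. This is only a sketch; the name pseudorandom_hash and the default arguments are illustrative choices matching the example.

```python
def pseudorandom_hash(key, a=17, c=7, list_size=307):
    """y = a * x + c, scaled into the address range by modulo division."""
    return (a * key + c) % list_size


for key in (121267, 45128, 379452):
    print(key, "->", pseudorandom_hash(key))
# 121267 -> 41
# 45128 -> 297
# 379452 -> 7
```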


Hash algorithm

• Convert the alphanumeric key into a number by adding the American Standard Code for Information Interchange (ASCII) value of each character to an accumulator.
• Rotate the bits in the address to maximize the distribution of the values.
• Take the absolute value of the address and map it into the address range.

algorithm hash (val key <array>, val size <integer>,
                val maxAddr <integer>, ref addr <integer>)
This algorithm converts an alphanumeric key of size characters into an integral address.
   Pre    key is a key to be hashed
          size is the number of characters in the key
          maxAddr is the maximum possible address for the list
   Post   addr contains the hashed address

   looper = 0
   addr = 0
   Hash key
   loop (looper < size)
      if (key[looper] not space)
         addr = addr + key[looper]
         rotate addr 12 bits right
      end if
      looper = looper + 1
   end loop
   Test for negative address
   if (addr < 0)
      addr = absolute(addr)
   end if
   addr = addr modulo maxAddr
   return
end hash
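
A hedged Python sketch of this hash algorithm. The slides do not state a word size, so a 32-bit word is assumed here; because the arithmetic is kept unsigned with a mask, the address can never be negative and the absolute-value step is not needed. The names hash_key, rotate_right, WORD_BITS and MASK are illustrative.

```python
WORD_BITS = 32                       # assumed word size (not given in the slides)
MASK = (1 << WORD_BITS) - 1


def rotate_right(value, bits):
    """Rotate an unsigned WORD_BITS-wide value right by the given number of bits."""
    return ((value >> bits) | (value << (WORD_BITS - bits))) & MASK


def hash_key(key, max_addr):
    """Add character codes (skipping spaces), rotating 12 bits after each one."""
    addr = 0
    for ch in key:
        if ch != " ":
            addr = (addr + ord(ch)) & MASK
            addr = rotate_right(addr, 12)
    return addr % max_addr           # map into the address range


print(hash_key("Sarah Trapp", 307) == hash_key("SarahTrapp", 307))   # True: spaces are ignored
```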


2-4 Collision resolution

Except for the direct and subtraction methods, none of the hashing methods is a one-to-one mapping
Collisions therefore cannot be avoided
There are several methods for resolving hashing collisions


Figure 2-13 Collision resolution methods
• Open addressing: linear probe, quadratic probe, pseudorandom, key offset
• Linked lists
• Buckets


Several concepts

Load factor
• There must be some empty elements in a list:
  load factor = (the number of filled elements) / (the total number of elements) < 75%

Clustering
• The tendency of data to group within the list (to build up unevenly across a hashed list)
• A high degree of clustering increases the number of probes needed to locate an element and reduces the processing efficiency of the list. There are two types:
• Primary clustering: when data cluster around a home address
• Secondary clustering: when data become grouped along a collision path throughout a list
• Hashing algorithms need to be designed to minimize clustering


Open addressing

Resolves collisions in the prime area (which contains all of the home addresses)

• Linear probe
• Quadratic probe
• Double hashing: pseudorandom, key offset


Linear probe

Figure 2-14 Linear probe collision resolution

The list holds 379452 Marry Dodd, 070918 Sarah Trapp, 121267 Bryan Devaux, 378845 John Carver and 166702 Harry Eagle; addresses run from [000] to [306].

First insert (070918): hashes to address 1, no collision
Second insert (166702): hashes to address 1, collision; add 1 to the address


Linear probe (continued)

Variation: add 1, subtract 2, add 3, subtract 4, …

Advantage: simple to implement.

Disadvantages: first, it tends to produce primary clustering. Second, it tends to make the search algorithm more complex.
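
A minimal Python sketch of insertion and search with the basic linear probe (add 1, wrapping at the end of the list), assuming modulo division as the hashing method and None as the marker for an empty element. The function names are illustrative.

```python
def linear_probe_insert(table, key):
    """Insert key; on a collision, add 1 to the address until an empty slot is found."""
    size = len(table)
    addr = key % size                 # home address
    for _ in range(size):
        if table[addr] is None:
            table[addr] = key
            return addr
        addr = (addr + 1) % size      # collision: try the next address
    raise OverflowError("the list is full")


def linear_probe_search(table, key):
    """Follow the same probe path that the insert used."""
    size = len(table)
    addr = key % size
    for _ in range(size):
        if table[addr] is None:
            return None               # an empty slot ends the collision path
        if table[addr] == key:
            return addr
        addr = (addr + 1) % size
    return None


table = [None] * 307
print(linear_probe_insert(table, 70918))    # 1 (no collision)
print(linear_probe_insert(table, 166702))   # 2 (collision at 1, add 1)
```

The search uses the same collision path as the insert, which is why locating an element must use the same algorithm that stored it.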


Quadratic probe

To eliminate primary clustering

The increment is the collision probe number squared: first probe, add 1^2; second probe, add 2^2; …
The new address is taken modulo the list size.

Disadvantages:
1. The time required to square the probe number.
2. It is not possible to generate a new address for every element in the list.
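
A short Python sketch of the quadratic probe sequence. It assumes that each increment is added to the previous collision address (the slide does not say whether the increment is applied to the home address or to the previous address); the function name is illustrative.

```python
def quadratic_probe_sequence(home, list_size, probes):
    """Return the addresses tried after a collision at the home address."""
    addr = home
    sequence = []
    for probe in range(1, probes + 1):
        addr = (addr + probe * probe) % list_size   # increment = probe number squared
        sequence.append(addr)
    return sequence


print(quadratic_probe_sequence(1, 307, 4))   # [2, 6, 15, 31]
```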


Pseudorandom collision resolution

A form of double hashing: the address is rehashed
Uses a pseudorandom number to resolve the collision
The collision address is used as a factor in the random number calculation, such as:

New address = 3 * collision address + 5

Figure 2-15 shows this collision resolution applied to the collision of Figure 2-14


Pseudorandom probe

Figure 2-15 Pseudorandom collision resolution

First insert (070918): hashes to address 1, no collision
Second insert (166702): hashes to address 1, collision; the pseudorandom probe y = 3x + 5 with x = 1 gives the new address 8
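
The rehash can be sketched in Python as follows. The modulo by the list size is an added assumption that keeps the new address inside the list; the slide formula shows only 3 * collision address + 5. The function name is illustrative.

```python
def pseudorandom_rehash(collision_addr, list_size=307, a=3, c=5):
    """New address = a * collision address + c, wrapped into the list."""
    return (a * collision_addr + c) % list_size


print(pseudorandom_rehash(1))   # 8
```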


Key offset

Another form of double hashing
Produces different collision paths for different keys
Key offset calculates the new address as follows (the simplest version):

offset = key / listsize (integer quotient)
address = ((offset + old address) modulo listsize)

Example: the key is 166702 and the list size is 307; modulo division generates an address of 1
This synonym of 070918 produces a collision at 1
Using key offset to calculate the next address:

offset = 166702 / 307 = 543
address = ((543 + 001) modulo 307) = 237

If 237 were also a collision, repeat the process:

offset = 166702 / 307 = 543
address = ((543 + 237) modulo 307) = 166


To really see the effect of key offset, we need to calculate several different keys, all hashing to the same home address. Table 2-3 shows three keys that collide at address 001 and their next two collision probe addresses

Table 2-3 Key offset

Key      Home address   Key offset   Probe 1   Probe 2
166702   1              543          237       166
572556   1              1865         024       047
067234   1              219          220       132

Note: each key resolves its collision at a different address for both the first and second probes
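
The rows of Table 2-3 can be reproduced with a short Python sketch, assuming integer division for the offset and modulo division for the home address. The function name key_offset_probes is illustrative.

```python
def key_offset_probes(key, list_size=307, count=2):
    """Return (offset, probe addresses) for a key that collides at its home address."""
    offset = key // list_size         # integer quotient
    addr = key % list_size            # home address
    probes = []
    for _ in range(count):
        addr = (offset + addr) % list_size
        probes.append(addr)
    return offset, probes


for key in (166702, 572556, 67234):
    print(key, key_offset_probes(key))
# 166702 (543, [237, 166])
# 572556 (1865, [24, 47])
# 67234 (219, [220, 132])
```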


Linked list resolution

Eliminates a disadvantage of open addressing: each collision resolution increases the probability of future collisions
A linked list is an ordered collection of data in which each element contains the location of the next element


Figure 2-16 Linked list collision resolution

The prime area ([000] to [306]) holds 379452 Marry Dodd, 070918 Sarah Trapp, 121267 Bryan Devaux, 160252 Tuan Ngo and 045128 Shouli Feldman.
The synonyms of 070918 are chained from its prime-area element by pointers: 166702 Harry Eagle, then 572556 Chris Wallj.


Linked list resolution (continued)

• Linked list resolution uses a separate area to store collisions and chains all synonyms together in a linked list
• It uses two storage areas, the prime area and the overflow area
• Each element in the prime area contains an additional field, a link head pointer
• The linked list data can be stored in any order, but the most common is key sequence
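
A simplified Python sketch of the chaining idea. As an assumption for brevity, each address holds a Python list that stands in for the prime-area element plus its linked overflow chain, and the chain is kept in key sequence by sorting. The function names are illustrative.

```python
def chained_insert(table, key, value):
    """Store (key, value) in the chain of the key's home address, in key sequence."""
    chain = table[key % len(table)]
    chain.append((key, value))
    chain.sort()                      # keep synonyms ordered by key


def chained_search(table, key):
    """Walk the chain of the home address; return the data or None."""
    for stored_key, value in table[key % len(table)]:
        if stored_key == key:
            return value
    return None


chains = [[] for _ in range(307)]
chained_insert(chains, 572556, "Chris Wallj")
chained_insert(chains, 70918, "Sarah Trapp")
chained_insert(chains, 166702, "Harry Eagle")
print([key for key, _ in chains[1]])   # [70918, 166702, 572556]
```

All three keys hash to address 1, yet none of them occupies another key's home address, so later inserts are not affected by these collisions.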


Bucket hashing

Keys are hashed to buckets: nodes that accommodate multiple data occurrences. Collisions are postponed until the bucket is full

Figure 2-17 Bucket hashing

Bucket [000]: 379452 Marry Dodd
Bucket [001]: 070918 Sarah Trapp, 166702 Harry Eagle, 367173 Ann Georgis
Bucket [002]: 121267 Bryan Devaux, 572556 Chris Wallj (linear probe places it here)
…
Bucket [307]: 045128 Shouli Feldman


Two problems and combination approaches

First: it uses significantly more space, because many of the buckets will be empty or partially empty
Second: it does not completely resolve the collision problem
One way to resolve a collision in a full bucket is to use the linear probe
There are several approaches to resolving collisions, and a single application often uses multiple steps
Example: a large database hashes to a bucket; if the bucket is full, it uses a linear probe; if that fails, it uses a linked list overflow area


2-5 Summary

• Searching is the process of finding the location of a target among a list of objects
• There are two basic searching methods for arrays: sequential and binary search
• The sequential search is normally used when a list is not sorted. It starts at the beginning of the list and searches until it finds the data or hits the end of the list
• One variation of the sequential search is the sentinel search. In this method, the condition ending the search is reduced to only one by artificially inserting the target at the end of the list
• The second variation of the sequential search is called the probability search. In this method, the list is ordered with the most probable elements at the beginning of the list and the least probable at the end
• The sequential search can also be used to search a sorted list. In this case, we can terminate the search when the target is less than the current element
• If an array is sorted, we can use a more efficient algorithm called the binary search
• The binary search algorithm searches the list by first checking the middle element. If the target is not in the middle element, the algorithm eliminates the upper half or the lower half of the list, depending on the value of the middle element. The process continues until the target is found or the reduced list length becomes zero
• The efficiency of a sequential search is O(n)
• The efficiency of a binary search is O(log2 n)


Summary (continued)

• In a hashed search, the key, through an algorithmic transformation, determines the location of the data. It is a key-to-address transformation
• There are several hashing functions: we discussed direct, subtraction, modulo division, digit extraction, midsquare, folding, rotation, and pseudorandom generation
• In direct hashing, the key is the address without any algorithmic manipulation
• In subtraction hashing, the key is transformed to an address by subtracting a fixed number from it
• In modulo-division hashing, the key is divided by the list size (recommended to be a prime number) and the remainder is used as the address
• In digit-extraction hashing, selected digits are extracted from the key and used as an address
• In midsquare hashing, the key is squared and the address is selected from the middle of the result
• In fold shift hashing, the key is divided into parts whose sizes match the size of the required address. Then the parts are added to obtain the address
• In fold boundary hashing, the key is divided into parts whose sizes match the size of the required address. Then the left and right parts are reversed and added to the middle part to obtain the address
• In rotation hashing, the rightmost digit of the key is rotated to the left to determine an address. However, this method is usually used in combination with other methods
• In pseudorandom generation hashing, the key is used as the seed to generate a pseudorandom number. The result is then scaled to obtain the address
• Except in the direct and subtraction methods, collisions are unavoidable in hashing. Collisions occur when a new key is hashed to an address that is already occupied
• Clustering is the tendency of data to build up unevenly across a hashed list
• Primary clustering occurs when data build up around a home address
• Secondary clustering occurs when data build up along a collision path in the list
• To solve a collision, a collision resolution method is used
• Three general methods are used to resolve collisions: open addressing, linked lists, and buckets
• The open addressing method can be subdivided into linear probe, quadratic probe, pseudorandom rehashing, and key-offset rehashing
• In the linear probe method, when a collision occurs, the new data are stored in the next available address
• In the quadratic probe method, the increment is the collision probe number squared
• In the pseudorandom rehashing method, we use a random number generator to rehash the address
• In the key-offset rehashing method, we use an offset to rehash the address
• In the linked list technique, we use a separate area to store collisions and chain all synonyms together in a linked list
• In bucket hashing, we use a bucket that can accommodate multiple data occurrences


Homework

1. Using the modulo-division method and linear probing, store the keys shown below in an array with 19 elements. How many collisions occurred? What is the load factor of the list after all keys have been inserted?

224562, 137456, 214562, 140145, 214567, 162145, 144467, 199645, 234534

2. Repeat the problem above using the digit-extraction method (first, third and fifth digits) and quadratic probing.