Data Structures (数据结构)
Course 2: Searching
西南财经大学天府学院 (Southwestern University of Finance and Economics, Tianfu College)
Vocabulary
sequential search 顺序查找
element 元素
order 次序
binary search 二分查找
target 目标
algorithm 算法
array 数组
location 位置
object 对象, 目标
parameter 参数
index 下标, 索引, 指针
sentinel 哨兵
probability 概率
key 关键字
hash 散列, 杂凑
collision 冲突
cluster 聚集, 群集
synonym 同义语, 同义词
probe 探测
load factor 装填因子
Searching
Searching is one of the most common and time-consuming operations in computer science.
Its goal is to find the location of a target among a list of objects.
Main contents (in chapter 2)
List searching (two basic search algorithms)
• Sequential search (including three variations)
• Binary search
Hashed list searching: the key, through an algorithmic function, determines the location of the data
Collision resolution
The list search algorithms are discussed using an array structure
2-1 List searches (work with arrays)
The algorithm used to search a list depends on the structure of the list.
Sequential search (works on any array) is used when:
• The list is not ordered
• The list is small
• The list is not searched often
Locating data in an unordered list
List: 4 21 36 14 62 91 8 22 7 81 77 10 (stored in A[0], A[1], ..., A[11])
Target given: 14
Location wanted: 3
Search Concept
Target given: 14. Location wanted: 3.
List: 4 21 36 14 62 91 8 22 7 81 77 10 (A[0] ... A[11])
Index 0: 14 not equal 4
Index 1: 14 not equal 21
…
Index 3: 14 equal 14
Sequential search algorithms
The search needs to tell the calling algorithm two things:
• Did it find the data it was looking for?
• If it did, at what index were the target data found?
It requires four parameters:
• The list we are searching
• An index to the last element in the list
• The target
• The address where the found element's index location is to be stored
(Returns a Boolean)
Sequential search algorithm
algorithm seqsearch (val list <array>, val last <index>, val target <keytype>, ref locn <index>)
Locate the target in an unordered list.
  Pre    list must contain at least one element
         last is index to last element in the list
         target contains the data to be located
         locn is address of index in calling algorithm
  Post   if found: matching index stored in locn and found is true
         if not found: last stored in locn and found is false
  Return found <boolean>
  looker = 0
  loop (looker < last and target not equal list[looker])
    looker = looker + 1
  end loop
  locn = looker
  if (target equal list[looker])
    found = true
  else
    found = false
  end if
  return found
end seqsearch
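The seqsearch pseudocode can be sketched in Python roughly as follows. This is a minimal illustration, not the course's reference code: the function name `seq_search` and the choice to return a `(found, locn)` tuple instead of using a reference parameter are assumptions made for the example.

```python
def seq_search(lst, target):
    """Sequential search; returns (found, locn) like the pseudocode's found/locn pair.

    Assumes lst contains at least one element (the pseudocode's precondition).
    """
    last = len(lst) - 1                # index of the last element
    looker = 0
    while looker < last and lst[looker] != target:
        looker += 1
    # looker stops either on the target or on the last element
    return lst[looker] == target, looker


data = [4, 21, 36, 14, 62, 91, 8, 22, 7, 81, 77, 10]
print(seq_search(data, 14))   # (True, 3)
print(seq_search(data, 5))    # (False, 11)
```

When the target is absent, the returned index is that of the last element, matching the postcondition stated in the pseudocode.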
Variations on sequential searches
• Sentinel search
• Probability search
• Ordered list search
Sentinel search
algorithm sentinelSearch (val list <array>, val last <index>, val target <keytype>, ref locn <index>)
Locate the target in an unordered list.
  Pre    list must contain at least one element
         list must have an unused element at list[last + 1] to hold the sentinel
         last is index to last element in the list
         target contains the data to be located
         locn is address of index in calling algorithm
  Post   if found: matching index stored in locn and found is true
         if not found: last stored in locn and found is false
  Return found <boolean>
  list[last + 1] = target
  looker = 0
  loop (target not equal list[looker])
    looker = looker + 1
  end loop
  if (looker <= last)
    found = true
    locn = looker
  else
    found = false
    locn = last
  end if
  return found
end sentinelSearch
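A possible Python rendering of the sentinel search is shown below. It is a sketch under assumptions: a Python list can grow, so the sentinel is appended and then removed again to leave the caller's list unchanged; the name `sentinel_search` and the tuple return value are illustrative choices.

```python
def sentinel_search(lst, target):
    """Sentinel search; returns (found, locn)."""
    last = len(lst) - 1
    lst.append(target)               # place the sentinel at index last + 1
    looker = 0
    while lst[looker] != target:     # only one test per iteration
        looker += 1
    lst.pop()                        # assumption: restore the list afterwards
    if looker <= last:
        return True, looker          # stopped on a real element
    return False, last               # stopped on the sentinel


data = [4, 21, 36, 14, 62, 91, 8, 22, 7, 81, 77, 10]
print(sentinel_search(data, 14))  # (True, 3)
print(sentinel_search(data, 5))   # (False, 11)
```

The loop has a single condition because the sentinel guarantees that the comparison eventually succeeds.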
Probability search
Locate the target in an unordered list.
  Pre    the same as the sequential search
  Post   if found: matching index stored in locn, found is true, and the element is moved up one position in priority
         if not found: the same as the sequential search
  Return found <boolean>
  looker = 0
  loop (looker < last and target not equal list[looker])
    looker = looker + 1
  end loop
  if (target equal list[looker])
    found = true
    if (looker > 0)
      temp = list[looker - 1]
      list[looker - 1] = list[looker]
      list[looker] = temp
      looker = looker - 1
    end if
  else
    found = false
  end if
  locn = looker
  return found
end probability search
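The probability search can be sketched in Python as below. The function name `probability_search` and the tuple return value are assumptions for illustration; the swap step follows the pseudocode, exchanging the found element with its predecessor.

```python
def probability_search(lst, target):
    """Sequential search that moves a found element one position toward the front."""
    last = len(lst) - 1
    looker = 0
    while looker < last and lst[looker] != target:
        looker += 1
    found = lst[looker] == target
    if found and looker > 0:
        # exchange with the preceding element so frequent targets drift forward
        lst[looker - 1], lst[looker] = lst[looker], lst[looker - 1]
        looker -= 1
    return found, looker


data = [4, 21, 36, 14]
print(probability_search(data, 14))  # (True, 2)
print(data)                          # [4, 21, 14, 36]
```

Repeated searches for the same target move it one step closer to the front each time, so frequently requested elements are found with fewer comparisons.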
Ordered list search
Locate the target in a list ordered on target.
  Pre    the same as the sequential search
  Post   if found: the same as the sequential search
         if not found: locn is index of first element > target, or locn equals last; found is false
  Return found <boolean>
  if (target <= list[last])
    looker = 0
    loop (target > list[looker])
      looker = looker + 1
    end loop
  else
    looker = last
  end if
  if (target equal list[looker])
    found = true
  else
    found = false
  end if
  locn = looker
  return found
Note:
• It is not necessary to search to the end of the list
• It is suitable only for small lists
• It incorporates the sentinel idea
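A small Python sketch of the ordered list search follows. The name `ordered_search` is an assumption; the logic mirrors the pseudocode, where the last element acts like a sentinel because the loop is entered only when the target is not larger than it.

```python
def ordered_search(lst, target):
    """Sequential search on a sorted list; stops at the first element >= target."""
    last = len(lst) - 1
    if target <= lst[last]:
        looker = 0
        while target > lst[looker]:   # guaranteed to stop by the test above
            looker += 1
    else:
        looker = last                 # target is larger than every element
    return lst[looker] == target, looker


ordered = [4, 7, 8, 10, 14, 21, 22, 36, 62, 77, 81, 91]
print(ordered_search(ordered, 22))   # (True, 6)
print(ordered_search(ordered, 11))   # (False, 4)
print(ordered_search(ordered, 100))  # (False, 11)
```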
Binary search
The sequential search is slow for large lists: its efficiency is O(n).
– But it is the only solution if the array is not sorted.
Binary search (ordered list)
– For large lists
– First sort
– Then search
Binary search method
Suppose L is a sorted list and we are searching for a value X.
1. Compare X to the middle value (M) in L.
2. If X = M, we are done.
3. If X < M, we continue our search, but we can confine our search to the first half of L and entirely ignore the second half of L.
4. If X > M, we continue, but confine ourselves to the second half of L.
Binary search example: target found (target 22 is in the list)
List: 4 7 8 10 14 21 22 36 62 77 81 91 (A[0] ... A[11])
first   mid   last   comparison
0       5     11     22 > 21
6       8     11     22 < 62
6       6     7      22 = 22 (found)
Binary search example: target not found (target 11 is not in the list)
List: 4 7 8 10 14 21 22 36 62 77 81 91 (A[0] ... A[11])
first   mid   last   comparison
0       5     11     11 < 21
0       2     4      11 > 8
3       3     4      11 > 10
4       4     4      11 < 14
4       4     3      first > last: function terminates
Binary search (ordered list)
algorithm binarySearch (val list <array>, val end <index>, val target <keytype>, ref locn <index>)
  Pre    list is ordered; it must contain at least one element
         end is index to the largest element in the list
         target is the value of the element being sought
         locn is address of index in calling algorithm
  Post   found: locn assigned index to target element; found set true
         not found: locn = element below or above target; found set false
  Return found <boolean>
  first = 0
  last = end
  loop (first <= last)
    mid = (first + last) / 2
    if (target > list[mid])
      look in upper half
      first = mid + 1
    else if (target < list[mid])
      look in lower half
      last = mid - 1
    else
      found equal: force exit
      first = last + 1
    end if
  end loop
  locn = mid
  if (target equal list[mid])
    found = true
  else
    found = false
  end if
  return found
end binarySearch
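The binary search pseudocode translates to Python along these lines. This is a sketch, with two assumptions: the function name `binary_search` and tuple return value are illustrative, and `break` is used in place of the pseudocode's `first = last + 1` trick to force the loop to exit.

```python
def binary_search(lst, target):
    """Binary search on a sorted, non-empty list; returns (found, locn)."""
    first, last = 0, len(lst) - 1
    mid = 0
    while first <= last:
        mid = (first + last) // 2     # integer division
        if target > lst[mid]:
            first = mid + 1           # look in upper half
        elif target < lst[mid]:
            last = mid - 1            # look in lower half
        else:
            break                     # found equal: exit the loop
    return lst[mid] == target, mid


ordered = [4, 7, 8, 10, 14, 21, 22, 36, 62, 77, 81, 91]
print(binary_search(ordered, 22))  # (True, 6)
print(binary_search(ordered, 11))  # (False, 4)
```

The two calls follow the same first/mid/last steps as the traced examples for targets 22 and 11.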
Analyzing the efficiency
Sequential search, sentinel search, ordered list search: O(n)
Binary search: O(log2 n)
Comparison of binary and sequential searches
List size    Binary   Sequential (average)   Sequential (worst case)
16           4        8                      16
10,000       14       5,000                  10,000
1,000,000    20       500,000                1,000,000
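The figures in the comparison table can be reproduced with a few lines of Python. This sketch assumes the binary column is log2 n rounded up and the sequential columns are n/2 and n; the function name `iterations` is hypothetical.

```python
import math


def iterations(n):
    """Return (binary, sequential average, sequential worst case) for a list of size n."""
    binary = math.ceil(math.log2(n))   # assumption: log2 n rounded up
    return binary, n // 2, n


for n in (16, 10_000, 1_000_000):
    print(n, *iterations(n))
# 16 4 8 16
# 10000 14 5000 10000
# 1000000 20 500000 1000000
```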
2-3 Hashed list searches
Ideal search: we would know exactly where the data are and go directly there.
Goal of a hashed search: to find the data with only one test.
Use an array of data.
Hash function: key → location of data
Hash algorithm: key → index of the array (address in the list)
Figure 2-6 Hash concept
key → hash function → address
102002 → 005
107095 → 100
111060 → 002
List contents: [001] Harry Lee, [002] Sarah Trapp 111060, …, [005] Vu Nguyen 102002, …, [100] John Adams 107095
Basic Concepts
Hash search: a search in which the key, through an algorithmic function, determines the location of the data.
We use a hashing algorithm to transform the key into the index of the array element that contains the data we need to locate (key-to-address transformation).
Problem
Synonyms: a set of keys that hash to the same location
Collision: occurs when the list contains two or more synonyms, so that a key hashes to a location that is already occupied
Home address: the address produced by the hashing algorithm
Prime area: the memory that contains all of the home addresses
Collision resolution: when two keys collide at a home address, place one of the keys and its data in another location
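Figure 2-7 The collision resolution concept
Keys are inserted in the order 1. hash(A), 2. hash(B), 3. hash(C) into a list with locations [0] ... [16].
B and A collide at 8; collision resolution moves B to another location.
C and B collide at 16; collision resolution moves C to another location.
Final layout: C at [4], A at [8], B at [16].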
Locate an element in a hashed list
Use the same algorithm that was used to insert it into the list.
First hash the key and check the home address.
If it contains the element, the search is complete.
If not, use the collision resolution algorithm to determine the next location, and continue until the element is found or it is determined that it is not in the list.
Probe: each calculation of an address and test for success.
Hashing methods
Figure 2-8 Basic hashing techniques
• direct
• subtraction
• modulo division
• digit extraction
• midsquare
• folding
• rotation
• pseudorandom generation
Direct method
The key is the address (each element has its own key, so there are no synonyms).
Example 1: total monthly sales by the days of the month
Create an array of 31 accumulators.
The accumulation code is:
dailySales[sale.day] = dailySales[sale.day] + sale.amount;
Example 2: a small company has fewer than 100 employees; the employee number is between 1 and 100.
Figure 2-9 Direct hashing of employee numbers
key → address: 005 → 5, 100 → 100, 002 → 2
[000] (not used)
[001] 001 Harry Lee
[002] 002 Sarah Trapp
[003] [004] (empty)
[005] 005 Vu Nguyen
[006] ... [099] …
[100] 100 John Adams
Subtraction method
The address is obtained by subtracting a fixed number from the key.
• Used when keys are consecutive but do not start from 1
• Example: your student ID number
Advantages
• The hashing function is very simple
• No collisions
Disadvantage
• Suitable only for small lists
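The direct and subtraction methods can be illustrated with a short Python sketch. Everything named here is an assumption made for the example: the `daily_sales` array, the helper `record_sale`, and in particular the base value `2024000`, which stands in for whatever fixed number precedes the first real key.

```python
# Direct hashing: the day of the month is used directly as the address.
daily_sales = [0.0] * 32          # index 0 is unused; days run 1..31


def record_sale(day, amount):
    """Accumulate a sale into the slot addressed by its day."""
    daily_sales[day] += amount


# Subtraction hashing: consecutive keys that do not start from 1.
BASE = 2024000                    # hypothetical fixed number subtracted from each key


def subtraction_hash(key, base=BASE):
    """Map consecutive keys base+1, base+2, ... to addresses 1, 2, ..."""
    return key - base


record_sale(15, 120.0)
record_sale(15, 30.5)
print(daily_sales[15])            # 150.5
print(subtraction_hash(2024001))  # 1
```

Both mappings are one-to-one, which is why neither method produces collisions.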
Note:
1. Generally speaking, hashed lists require some empty elements to reduce the number of collisions.
2. The two methods above (direct and subtraction) are ideal, but their application is very limited; for example, they are impractical for long keys such as ID card numbers.
Modulo-division method (division remainder)
This method divides the key by the array size and uses the remainder as the address.
The hashing algorithm is:
address = key modulo listSize
Note: a list size that is a prime number produces fewer collisions.
Figure 2-10 Modulo-division hashing (listSize = 307)
key → address: 121267 → 2, 045128 → 306, 379452 → 0
[000] 379452 Marry Dodd
[002] 121267 Bryan Devaux
[007] 378845 John Carver
…
[305] 160252 Tuan Ngo
[306] 045128 Shouli Feldman
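The addresses in Figure 2-10 can be checked with a one-line hash function in Python. The names `LIST_SIZE` and `modulo_hash` are illustrative; keys with a leading zero such as 045128 are written without it because Python integers cannot start with 0.

```python
LIST_SIZE = 307  # prime list size used in the figure


def modulo_hash(key, list_size=LIST_SIZE):
    """Modulo-division hashing: the remainder is the address."""
    return key % list_size


for key in (121267, 45128, 379452, 378845, 160252):
    print(key, modulo_hash(key))
# 121267 2
# 45128 306
# 379452 0
# 378845 7
# 160252 305
```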
Digit extraction method
Selected digits are extracted from the key and used as the address.
Example: 6-digit employee number, 3-digit address; select the first, third, and fourth digits.
379452 → 394
121267 → 112
378845 → 388
160252 → 102
045128 → 051
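A minimal Python sketch of digit extraction is given below. The function name `digit_extract` and its parameters are assumptions; the address is returned as a string so that a leading zero (as in 051) is preserved.

```python
def digit_extract(key, positions=(1, 3, 4), width=6):
    """Select the given 1-based digit positions from a zero-padded key."""
    digits = f"{key:0{width}d}"               # e.g. 45128 -> "045128"
    return "".join(digits[p - 1] for p in positions)


for key in (379452, 121267, 378845, 160252, 45128):
    print(key, digit_extract(key))
# 379452 394
# 121267 112
# 378845 388
# 160252 102
# 45128 051
```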
Midsquare method
The key is squared and the address is selected from the middle of the squared number.
Limitation: the size of the key.
Example (4-digit keys): 9452 * 9452 = 89340304; the address is 3403.
Variation: select a portion of the key. Here digits 1-3 of the key are squared, the result is filled with leading zeros to 6 digits, and digits 3-5 are selected as the address.
Key      Digits 1-3 squared      Address
379452   379 * 379 = 143641      364
121267   121 * 121 = 014641      464
378845   378 * 378 = 142884      288
160252   160 * 160 = 025600      560
045128   045 * 045 = 002025      202
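Both midsquare examples can be reproduced with the following Python sketch. The function names `midsquare` and `midsquare_partial` are hypothetical, and the fixed widths (8 digits for a squared 4-digit key, 6 digits for a squared 3-digit portion) are assumptions taken from the examples.

```python
def midsquare(key):
    """Square a 4-digit key and take the middle four digits of the 8-digit result."""
    squared = f"{key * key:08d}"
    return squared[2:6]


def midsquare_partial(key):
    """Variation: square the first three digits of a 6-digit key, take digits 3-5."""
    part = key // 1000                  # first three digits
    squared = f"{part * part:06d}"      # pad with leading zeros to six digits
    return squared[2:5]


print(midsquare(9452))                  # 3403
for key in (379452, 121267, 378845, 160252, 45128):
    print(key, midsquare_partial(key))
# 379452 364
# 121267 464
# 378845 288
# 160252 560
# 45128 202
```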
Folding methods: fold shift and fold boundary
Figure 2-11 Hash fold examples (key 123456789, 3-digit address)
(a) Fold shift: 123 + 456 + 789 = 1368; the leading 1 is discarded; address = 368
(b) Fold boundary: the digits of the left and right parts are reversed: 321 + 456 + 987 = 1764; the leading 1 is discarded; address = 764
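The two folding variants in Figure 2-11 can be sketched in Python as follows. The function names are illustrative, and the sketch assumes the key has no leading zeros and that its length is a multiple of the address size.

```python
def _parts(key, size):
    """Split the key's digits into consecutive parts of the given size."""
    digits = str(key)
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def fold_shift(key, size=3):
    """Add the parts as they are and discard any carry beyond 'size' digits."""
    return sum(int(p) for p in _parts(key, size)) % 10 ** size


def fold_boundary(key, size=3):
    """Reverse the digits of the left and right parts, then add and discard the carry."""
    parts = _parts(key, size)
    parts[0] = parts[0][::-1]
    parts[-1] = parts[-1][::-1]
    return sum(int(p) for p in parts) % 10 ** size


print(fold_shift(123456789))     # 368
print(fold_boundary(123456789))  # 764
```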
Rotation method: incorporate with others
Useful when keys are assigned serially.
Figure 2-12 Rotation hashing
Original key   Rotated key
600101         160010
600102         260010
600103         360010
600104         460010
600105         560010
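Rotation, as shown in Figure 2-12, moves the rightmost digit of the key to the front. A minimal Python sketch, with the function name `rotate` and the fixed key width as assumptions:

```python
def rotate(key, width=6):
    """Move the rightmost digit of the key to the leftmost position."""
    digits = f"{key:0{width}d}"
    return digits[-1] + digits[:-1]


for key in range(600101, 600106):
    print(key, rotate(key))
# 600101 160010
# 600102 260010
# 600103 360010
# 600104 460010
# 600105 560010
```

The rotated keys differ in their leading digit, which spreads serial keys more evenly once another method (for example folding) is applied to them.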
Pseudorandom method
In this method, the key is used as the seed in a pseudorandom number generator, and the resulting random number is scaled into the possible address range using modulo division.
A common random number generator is: y = ax + c, where x is the key.
For efficiency, the factors a and c should be prime numbers, for example a = 17, c = 7.
Figure: pseudorandom hashing (a = 17, c = 7, listSize = 307)
(17 * 121267 + 7) modulo 307 = 41
(17 * 045128 + 7) modulo 307 = 297
(17 * 379452 + 7) modulo 307 = 7
[007] 379452 Marry Dodd
[041] 121267 Bryan Devaux
[297] 045128 Shouli Feldman
Other entries shown: 378845 John Carver, 160252 Tuan Ngo
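The three calculations above can be verified with a short Python function. The name `pseudorandom_hash` and its default arguments are assumptions matching the example values a = 17, c = 7, and a list size of 307.

```python
def pseudorandom_hash(key, a=17, c=7, list_size=307):
    """Compute y = a * key + c and scale it into the address range by modulo division."""
    return (a * key + c) % list_size


for key in (121267, 45128, 379452):
    print(key, pseudorandom_hash(key))
# 121267 41
# 45128 297
# 379452 7
```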
Hash Algorithm
1. Convert the alphanumeric key into a number by adding the American Standard Code for Information Interchange (ASCII) value of each character to an accumulator.
2. Rotate the bits in the address to maximize the distribution of the values.
3. Take the absolute value of the address and map it into the address range.
Hash Algorithm
algorithm hash (val key <array>, val size <integer>, val maxAddr <integer>, ref addr <integer>)
This algorithm converts an alphanumeric key of size characters into an integral address.
  Pre    key is a key to be hashed
         size is the number of characters in the key
         maxAddr is the maximum possible address for the list
  Post   addr contains the hashed address
  looper = 0
  addr = 0
  Hash key
  loop (looper < size)
    if (key[looper] not space)
      addr = addr + key[looper]
      rotate addr 12 bits right
    end if
    looper = looper + 1
  end loop
  Test for negative address
  if (addr < 0)
    addr = absolute(addr)
  end if
  addr = addr modulo maxAddr
  return
end hash
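The hash algorithm can be sketched in Python as below. Python integers are unbounded, so this sketch assumes a 32-bit word: the value is masked to 32 bits, rotated right by 12 bits, and reinterpreted as a signed number before the negative test. The function name `hash_key` and the `bits` parameter are assumptions, not part of the original algorithm.

```python
def hash_key(key, max_addr, bits=32):
    """Hash an alphanumeric key into the range 0 .. max_addr - 1."""
    mask = (1 << bits) - 1
    addr = 0
    for ch in key:
        if ch != " ":                                   # skip spaces, as in the pseudocode
            addr = (addr + ord(ch)) & mask              # add the character code
            addr = ((addr >> 12) | (addr << (bits - 12))) & mask   # rotate right 12 bits
    if addr & (1 << (bits - 1)):                        # sign bit set: value is negative
        addr -= 1 << bits
    addr = abs(addr)                                    # test for negative address
    return addr % max_addr


print(hash_key("AB 12", 307))
print(hash_key("A B", 307) == hash_key("AB", 307))      # True: spaces are ignored
```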
2-4 Collision resolution
Except for the direct and subtraction methods, none of the hashing methods is a one-to-one mapping.
Collisions cannot be avoided.
There are several methods for resolving hashing collisions.
Figure 2-13 Collision resolution methods
• Open addressing: linear probe, quadratic probe, pseudorandom, key offset
• Linked lists
• Buckets
Several concepts
Load factor
• There must be some empty elements in a list.
• load factor = (number of filled elements / total number of elements) x 100%; it should be less than 75%.
Clustering
• The tendency of data to group within the list (to build up unevenly across a hashed list).
• A high degree of clustering increases the number of probes needed to locate an element and reduces the processing efficiency of the list.
• There are two types:
  – Primary clustering: when data cluster around a home address
  – Secondary clustering: when data become grouped along a collision path throughout the list
• Hashing algorithms need to be designed to minimize clustering.
Open addressing
Resolves collisions in the prime area (the area that contains all of the home addresses).
• Linear probe
• Quadratic probe
• Double hashing: pseudorandom, key offset
Linear Probe
Figure 2-14 Linear probe collision resolution
Keys 070918 (Sarah Trapp) and 166702 (Harry Eagle) both hash to home address 1.
First insert (070918): no collision.
Second insert (166702): collision; add 1 to the address.
Other entries shown: 379452 Marry Dodd, 121267 Bryan Devaux, 378845 John Carver, 160252 Tuan Ngo, 045128 Shouli Feldman
Linear probe
When a collision occurs, the data are stored in the next available address (add 1 to the current address).
Variation: add 1, subtract 2, add 3, subtract 4, ...
Advantage: simple to implement.
Disadvantages: first, it tends to produce primary clustering; second, it tends to make the search algorithm more complex.
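Insertion and lookup with a linear probe can be sketched in Python as follows. This is an illustration under assumptions: the table is a Python list of `None` or `(key, data)` pairs, modulo division is the hash function, and the names `insert` and `search` are hypothetical. The search repeats the same hash and probe sequence that the insertion used.

```python
LIST_SIZE = 307


def insert(table, key, data):
    """Store (key, data) at the home address or the next free address; return the address."""
    addr = key % LIST_SIZE
    for _ in range(LIST_SIZE):               # at most one probe per element
        if table[addr] is None:
            table[addr] = (key, data)
            return addr
        addr = (addr + 1) % LIST_SIZE        # collision: add 1, wrapping around
    raise OverflowError("hashed list is full")


def search(table, key):
    """Return the address holding key, or None if it is not in the list."""
    addr = key % LIST_SIZE
    for _ in range(LIST_SIZE):
        entry = table[addr]
        if entry is None:
            return None                      # empty slot: the key was never stored
        if entry[0] == key:
            return addr
        addr = (addr + 1) % LIST_SIZE
    return None


table = [None] * LIST_SIZE
print(insert(table, 70918, "Sarah Trapp"))    # 1 (no collision)
print(insert(table, 166702, "Harry Eagle"))   # 2 (collision at 1, add 1)
print(search(table, 166702))                  # 2
```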
Quadratic probe
Used to eliminate primary clustering.
The increment is the collision probe number squared: for the first probe add 1², for the second probe add 2², and so on.
The new address is taken modulo the list size.
Disadvantages:
1. The time required to square the probe number.
2. It is not possible to generate a new address for every element in the list.
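The sequence of addresses generated by a quadratic probe can be sketched as below. The sketch assumes that each increment is added to the address of the most recent collision (rather than to the home address); the function name `quadratic_probes` is hypothetical.

```python
def quadratic_probes(home, list_size, count):
    """Return the first 'count' probe addresses after a collision at 'home'."""
    addrs = []
    addr = home
    for probe in range(1, count + 1):
        addr = (addr + probe * probe) % list_size   # add probe number squared
        addrs.append(addr)
    return addrs


print(quadratic_probes(1, 307, 4))   # [2, 6, 15, 31]
```

Because the increments grow quickly, the probe path moves away from the home address, which is how the method avoids primary clustering.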
Pseudorandom collision resolution
A double hashing method: the address is rehashed.
It uses a pseudorandom number to resolve the collision, using the collision address as a factor in the random number calculation, such as:
new address = 3 * collision address + 5
Figure 2-15 shows this method resolving the collision from Figure 2-14.
Pseudorandom probe
Figure 2-15 Pseudorandom collision resolution
First insert (070918 Sarah Trapp): no collision at address 1.
Second insert (166702 Harry Eagle): collision at address 1; the pseudorandom formula y = 3x + 5 gives the new address 3 * 1 + 5 = 8.
Other entries shown: 379452 Marry Dodd, 121267 Bryan Devaux, 378845 John Carver, 160252 Tuan Ngo, 045128 Shouli Feldman
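The pseudorandom rehash can be written as a small Python function. The modulo step is an assumption added here so that the result always stays inside the list; the function name `pseudorandom_rehash` and its defaults are illustrative.

```python
def pseudorandom_rehash(collision_addr, list_size=307, a=3, c=5):
    """New address = a * collision address + c, scaled into the list (assumed modulo)."""
    return (a * collision_addr + c) % list_size


print(pseudorandom_rehash(1))     # 8
print(pseudorandom_rehash(8))     # 29
print(pseudorandom_rehash(306))   # 2
```

Every key that collides at the same address follows the same path (1, 8, 29, ...), which is a source of secondary clustering.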
Key offset
Another double hashing method; it produces different collision paths for different keys.
In its simplest version, key offset calculates the new address as:
offset = key / listSize (integer quotient)
address = (offset + old address) modulo listSize
Example: the key is 166702 and the list size is 307; modulo division generates an address of 1.
This synonym of 070918 produces a collision at 1.
Using key offset to calculate the next address:
offset = 166702 / 307 = 543
address = (543 + 001) modulo 307 = 237
If 237 were also a collision, repeat the process:
offset = 166702 / 307 = 543
address = (543 + 237) modulo 307 = 166
To really see the effect of key offset, we need to calculate several different keys, all hashing to the same home address. Table 2-3 shows three keys that collide at address 001 and their next two collision probe addresses.
Table 2-3 Key offset
Key      Home address   Key offset   Probe 1   Probe 2
166702   1              543          237       166
572556   1              1865         024       047
067234   1              219          220       132
Note: each key resolves its collision at a different address for both the first and second probes.
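The rows of Table 2-3 can be recomputed with a short Python sketch. The function name `key_offset_probes` and its return shape are assumptions; the arithmetic follows the two key offset formulas, using integer division for the offset.

```python
def key_offset_probes(key, list_size=307, count=2):
    """Return (home address, offset, list of probe addresses) for a key."""
    home = key % list_size
    offset = key // list_size            # integer quotient
    addr = home
    probes = []
    for _ in range(count):
        addr = (offset + addr) % list_size
        probes.append(addr)
    return home, offset, probes


for key in (166702, 572556, 67234):
    print(key, key_offset_probes(key))
# 166702 (1, 543, [237, 166])
# 572556 (1, 1865, [24, 47])
# 67234 (1, 219, [220, 132])
```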
Linked list resolution
A disadvantage of open addressing is that each collision resolution increases the probability of future collisions; linked list resolution eliminates this disadvantage.
A linked list is an ordered collection of data in which each element contains the location of the next element.
Figure 2-16 Linked list collision resolution
Prime area: [000] 379452 Marry Dodd, [001] 070918 Sarah Trapp, [002] 121267 Bryan Devaux, …, [305] 160252 Tuan Ngo, [306] 045128 Shouli Feldman
Overflow chain from [001], linked by pointers: 166702 Harry Eagle → 572556 Chris Wallj
Linked list resolution
• Linked list resolution uses a separate area to store collisions and chains all synonyms together in a linked list.
• It uses two storage areas, the prime area and the overflow area.
• Each element in the prime area contains an additional field, a link head pointer.
• The linked list data can be stored in any order, but the most common is key sequence.
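Linked list resolution can be sketched in Python roughly as follows. This is a simplified illustration under assumptions: each prime-area slot holds a `Node` whose `next` field plays the role of the link head pointer, synonyms are chained in arrival order (the slides note that key sequence is more common), and the names `Node`, `insert`, and `search` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

LIST_SIZE = 307


@dataclass
class Node:
    key: int
    data: str
    next: Optional["Node"] = None     # link to the next synonym in the overflow area


def insert(prime, key, data):
    """Store the element in the prime area, or chain it behind its synonyms."""
    addr = key % LIST_SIZE
    node = Node(key, data)
    if prime[addr] is None:
        prime[addr] = node            # home address is free
        return
    cur = prime[addr]
    while cur.next is not None:       # walk to the end of the chain
        cur = cur.next
    cur.next = node                   # collision: add to the overflow chain


def search(prime, key):
    """Follow the chain from the home address; return the data or None."""
    cur = prime[key % LIST_SIZE]
    while cur is not None:
        if cur.key == key:
            return cur.data
        cur = cur.next
    return None


prime = [None] * LIST_SIZE
insert(prime, 70918, "Sarah Trapp")
insert(prime, 166702, "Harry Eagle")
insert(prime, 572556, "Chris Wallj")
print(search(prime, 572556))          # Chris Wallj
```

All three keys hash to address 1, but only the first occupies the prime area, so later insertions at neighbouring addresses are unaffected.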
Bucket hashing
Buckets are nodes that accommodate multiple data occurrences; collisions are postponed until the bucket is full.
Figure 2-17 Bucket hashing
[000] Bucket 0: 379452 Marry Dodd
[001] Bucket 1: 070918 Sarah Trapp, 166702 Harry Eagle, 367173 Ann Georgis
[002] Bucket 2: 121267 Bryan Devaux, 572556 Chris Wallj (linear probe places it here)
…
[307] Bucket 307: 045128 Shouli Feldman
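Bucket hashing with a linear probe for full buckets can be sketched in Python as below. The bucket capacity of 3, the list size of 307, and the function name `insert` are assumptions chosen to match the figure's example keys.

```python
LIST_SIZE = 307
BUCKET_SIZE = 3     # assumed capacity: each bucket holds up to three elements


def insert(buckets, key, data):
    """Add (key, data) to its home bucket, or to the next bucket with room."""
    addr = key % LIST_SIZE
    for _ in range(LIST_SIZE):
        if len(buckets[addr]) < BUCKET_SIZE:
            buckets[addr].append((key, data))
            return addr
        addr = (addr + 1) % LIST_SIZE     # bucket full: linear probe to the next bucket
    raise OverflowError("all buckets are full")


buckets = [[] for _ in range(LIST_SIZE)]
print(insert(buckets, 70918, "Sarah Trapp"))    # 1
print(insert(buckets, 166702, "Harry Eagle"))   # 1
print(insert(buckets, 367173, "Ann Georgis"))   # 1
print(insert(buckets, 572556, "Chris Wallj"))   # 2 (bucket 1 is full)
```

The first three synonyms share bucket 1 without any collision resolution; only the fourth has to be placed elsewhere.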
Two problems and combination approaches
First: bucket hashing uses significantly more space, because many of the buckets will be empty or partially empty.
Second: it does not completely resolve the collision problem. When a bucket is full, the collision is resolved with a linear probe.
There are several approaches to resolving collisions, and they are often combined in multiple steps.
Example: one large database hashes to a bucket; if the bucket is full, it uses a linear probe and then a linked list overflow area.
2-5 Summary
• Searching is the process of finding the location of a target among a list of objects.
• There are two basic searching methods for arrays: sequential search and binary search.
• The sequential search is normally used when a list is not sorted. It starts at the beginning of the list and searches until it finds the data or hits the end of the list.
• One variation of the sequential search is the sentinel search. In this method, the condition ending the search is reduced to only one by artificially inserting the target at the end of the list.
• The second variation of the sequential search is the probability search. In this method, the list is ordered with the most probable elements at the beginning of the list and the least probable at the end.
2-5 Summary (continued)
• The sequential search can also be used to search a sorted list. In this case, we can terminate the search when the target is less than the current element.
• If an array is sorted, we can use a more efficient algorithm called the binary search.
• The binary search algorithm searches the list by first checking the middle element. If the target is not in the middle element, the algorithm eliminates the upper half or the lower half of the list, depending on the value of the middle element. The process continues until the target is found or the reduced list length becomes zero.
• The efficiency of a sequential search is O(n).
• The efficiency of a binary search is O(log2 n).
Summary (continued)
• In a hashed search, the key, through an algorithmic transformation, determines the location of the data. It is a key-to-address transformation.
• There are several hashing functions: we discussed direct, subtraction, modulo division, digit extraction, midsquare, folding, rotation, and pseudorandom generation.
Summary (continued)
• In direct hashing, the key is the address without any algorithmic manipulation.
• In subtraction hashing, the key is transformed to an address by subtracting a fixed number from it.
• In modulo-division hashing, the key is divided by the list size (recommended to be a prime number) and the remainder is used as the address.
• In digit-extraction hashing, selected digits are extracted from the key and used as an address.
• In midsquare hashing, the key is squared and the address is selected from the middle of the result.
• In fold shift hashing, the key is divided into parts whose sizes match the size of the required address. Then the parts are added to obtain the address.
Summary (continued)
• In fold boundary hashing, the key is divided into parts whose sizes match the size of the required address. Then the left and right parts are reversed and added to the middle part to obtain the address.
• In rotation hashing, the rightmost digit of the key is rotated to the left to determine an address. However, this method is usually used in combination with other methods.
• In pseudorandom generation hashing, the key is used as the seed to generate a pseudorandom number. The result is then scaled to obtain the address.
• Except in the direct and subtraction methods, collisions are unavoidable in hashing. Collisions occur when a new key is hashed to an address that is already occupied.
Summary (continued)
• Clustering is the tendency of data to build up unevenly across a hashed list.
• Primary clustering occurs when data build up around a home address.
• Secondary clustering occurs when data build up along a collision path in the list.
• To solve a collision, a collision resolution method is used.
• Three general methods are used to resolve collisions: open addressing, linked lists, and buckets.
• The open addressing method can be subdivided into linear probe, quadratic probe, pseudorandom rehashing, and key-offset rehashing.
Summary (continued)
• In the linear probe method, when a collision occurs, the new data are stored in the next available address.
• In the quadratic probe method, the increment is the collision probe number squared.
• In the pseudorandom rehashing method, we use a random number generator to rehash the address.
• In the key-offset rehashing method, we use an offset to rehash the address.
Summary (continued)
• In the linked list technique, we use separate areas to store collisions and chain all synonyms together in a linked list.
• In bucket hashing, we use a bucket that can accommodate multiple data occurrences.
Homework
1. Using the modulo-division method and linear probing, store the keys shown below in an array with 19 elements. How many collisions occurred? What is the load factor of the list after all keys have been inserted?
224562, 137456, 214562, 140145, 214567, 162145, 144467, 199645, 234534
2. Repeat the problem above using the digit-extraction method (first, third, and fifth digits) and quadratic probing.