amit-maheshwari
File Organisation
Placing File on Disk
- File: a sequence of records
- Records
  - Record type
  - Record fields
  - Data type
  - Number of bytes in a field: fixed or variable
Record Characteristics
- A logical view:
  - SELECT * FROM STUDENTS, or
  - (Smith, 17, 1, CS), (Brown, 8, 2, CS), or
  - STUDENT(Name, Number, Class, Major)
- A physical view:
  - (20 bytes + 4 bytes + 4 bytes + 3 bytes)
  - data types determine record length
  - records can be of fixed or variable length
Fixed Versus Variable Length Records
- FIXED LENGTH:
  - every record has the same fields
  - a field can be located relative to the record start
- VARIABLE LENGTH FIELDS:
  - some fields have unknown length
  - use a field separator
  - use a record terminator
- WHAT IF RECORDS ARE SMALLER THAN A BLOCK? BLOCKING FACTOR
- WHAT IF RECORDS ARE LARGER THAN A BLOCK? SPANNING RECORDS
Record blocking
Allocating records to disk blocks
- Unspanned records
  - Each record is fully contained in one block
  - Many records in one block
  - Blocking factor bfr: the number of records that fit in one block
  - Example: block size B = 1024, record size (fixed) R = 150, so bfr = floor(1024 / 150) = 6, where floor rounds down to the nearest integer
- Spanned organization
  - A record is ‘continued’ on the consecutive block
  - A pointer is required to point to the block holding the remainder of the record
- If records are of variable length, bfr can represent the average number of records per block (the rounding function does not apply)
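The blocking-factor arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, sizes are assumed to be in bytes, and the 1000-record file is an invented example.

```python
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """Records per block for fixed-length, unspanned records: floor(B / R)."""
    if record_size <= 0 or record_size > block_size:
        raise ValueError("an unspanned record must fit inside one block")
    return block_size // record_size

def blocks_needed(num_records: int, bfr: int) -> int:
    """Blocks required when every block holds at most bfr records."""
    return math.ceil(num_records / bfr)

B, R = 1024, 150                    # values from the slide
bfr = blocking_factor(B, R)         # floor(1024 / 150)
unused_per_block = B - bfr * R      # space wasted in each unspanned block
print(bfr, unused_per_block, blocks_needed(1000, bfr))
```

The leftover space per block is the price of the unspanned organization; a spanned organization would use it at the cost of a pointer to the next block.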
File structure
- File: a set of pages (disk blocks) storing records
- File header
  - Record format, types of separators
  - Block address(es)
- Blocks allocated
  - Contiguous
  - Linked (use of block pointers)
  - Linked clusters
  - Indexed
Searching for a record
- To search for a record on disk, one or more file blocks are copied into buffers.
- Programs search for the desired record in the buffers, using the information in the file header.
- If the address of the block with the desired record is not known, the search programs must do a linear search through the file blocks. Each file block is copied into a buffer and searched until either the record is located or all the file blocks have been searched unsuccessfully.
- The goal of a good file organization is to locate the block that contains a desired record with a minimal number of block transfers.
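A minimal sketch of the linear search described above, counting block transfers. The representation is an assumption made for illustration: a file is a list of blocks, a block is a list of integer keys, and reading a block into a buffer is modelled by a counter.

```python
def linear_search(blocks, key):
    """Scan the file block by block; return (block_index, transfers).

    block_index is None when the key is not in the file.
    """
    transfers = 0
    for index, block in enumerate(blocks):
        transfers += 1              # one block copied into a buffer
        if key in block:            # search inside the buffered block
            return index, transfers
    return None, transfers

# Hypothetical unordered file of four blocks.
heap_file = [[4, 2, 3, 16], [1, 7, 35, 10], [14, 12, 23, 6], [24, 27]]
print(linear_search(heap_file, 35))   # found in the second block
print(linear_search(heap_file, 99))   # every block is read
```

An unsuccessful search always costs as many transfers as the file has blocks, which is what a better file organization tries to avoid.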
Operations on Files
Because of the complex path from stored data to user, a DBMS offers a range of I/O operations:
- OPEN - access the file and prepare pointer
- FIND (LOCATE) - find first record
- FINDNEXT
- FINDALL - set
- READ
- INSERT
- DELETE
- MODIFY
- CLOSE
- REORGANISE - set
- READ-ORDERED (FIND-ORDERED) - set
File organization and access method
Difference between the terms file organization and access method:
- A file organization is the organization of the data of a file into records, blocks, and access structures; it is the way of placing records and blocks on the storage medium.
- An access method provides a group of operations that can be applied to a file, resulting in retrieval, modification and reorganisation.
  - One file organization can accept many different access methods.
  - Some access methods, though, can be applied only to files with a specific file organization. For example, one cannot apply an indexed access method to a file without an index.
Why do Access Methods matter
- The unit of transfer between disk and main memory is a block.
- Data must be in memory for the DBMS to use it.
- DBMS memory is handled in units of a page, e.g. 4K, 8K. Pages in memory represent one or more hardware blocks from the disk.
- If a single item is needed, the whole block is transferred.
- The time taken for an I/O depends on the location of the data on the disk and is lower when few seeks and rotational delays are incurred. Recall that:
  access time = seek time + rotational delay + transfer time
- The reason many DBMS do not rely on the OS file system is:
  - higher-level DB operations, e.g. JOIN, have a known pattern of page accesses and can be translated into known sets of I/O operations
  - the buffer manager can PRE-FETCH pages by anticipating the next request. This is especially efficient when the required data are stored CONTIGUOUSLY on disk.
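To see why contiguous storage matters, the access-time formula can be applied to two cases. This is a rough sketch under loud assumptions: the timings (4000, 2000 and 100 microseconds) are invented for illustration, and a contiguous read is simplified to a single seek and a single rotational delay followed by back-to-back transfers.

```python
def random_read_us(n_blocks, seek_us, rotate_us, transfer_us):
    """Each block is scattered, so each one pays seek + rotation + transfer."""
    return n_blocks * (seek_us + rotate_us + transfer_us)

def contiguous_read_us(n_blocks, seek_us, rotate_us, transfer_us):
    """Blocks are adjacent: position the head once, then only transfer."""
    return seek_us + rotate_us + n_blocks * transfer_us

# Hypothetical drive parameters, in microseconds.
SEEK, ROTATE, TRANSFER = 4000, 2000, 100

scattered = random_read_us(100, SEEK, ROTATE, TRANSFER)
adjacent = contiguous_read_us(100, SEEK, ROTATE, TRANSFER)
print(scattered, adjacent)
```

Under these assumed numbers the positioning cost dominates the scattered case, which is why pre-fetching contiguous pages pays off.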
Simple File Organisations
Unordered files of records: Heap or Pile file
- New records are inserted at EOF, or anywhere
- locating a record is by a linear search
- insertion is easy
- retrieval of an individual record, or in any order, is difficult (time consuming)
- Question: how many blocks, on average, does one need to read to find a single record?
- Fast: Select * from Course
- Slow: Select count(*) from Course group by Course_Number
Operations on Unordered File
- Inserting a new record is very efficient:
  - The address of the last file block is kept in the file header.
  - The last disk block of the file is copied into a buffer page.
  - The new record is added, or a new page is opened; the page is then written back to the disk block.
- Searching for a record using any search condition in a file stored in b blocks:
  - Linear search through the file, block by block.
  - Cost = b/2 block transfers on average, if only one record satisfies the search condition.
  - Cost = b block transfers if no records or several records satisfy the search condition: the program must read and search all b blocks in the file.
- To delete a record:
  - find its block and copy the block into a buffer page,
  - delete the record from the buffer,
  - rewrite the updated page back to the disk block.
- Note: unused space in the block could be used in future for a new record if suitable (some bookkeeping is necessary on unused space in file blocks).
Special Deletion Procedures
Technique used for record deletion:
- Each record has an extra byte or bit, called a deletion marker, set to ‘1’ at insertion *)
- Do not remove a deleted record, but reset its deletion marker to ‘0’ when it is deleted
- A record with its deletion marker set to ‘0’ is not used by application programs
- From time to time reorganise the file: physically remove deleted records or reclaim unused space
*) Just for simplicity we assume that values of deletion markers are ‘0’ or ‘1’. A system can actually choose other characters or combinations of bits as values of deletion markers.
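The deletion-marker technique can be sketched as follows. The class and method names are hypothetical, records are plain strings, and the marker values 1 and 0 follow the simplifying convention stated above.

```python
class MarkedFile:
    """Records stored as [marker, record]; marker 1 = live, 0 = deleted."""

    def __init__(self):
        self.slots = []

    def insert(self, record):
        self.slots.append([1, record])      # marker set to 1 at insertion

    def delete(self, record):
        for slot in self.slots:
            if slot[0] == 1 and slot[1] == record:
                slot[0] = 0                 # logical delete only
                return True
        return False

    def scan(self):
        """What application programs see: live records only."""
        return [record for marker, record in self.slots if marker == 1]

    def reorganise(self):
        """Physically drop marked records; return how many were reclaimed."""
        before = len(self.slots)
        self.slots = [slot for slot in self.slots if slot[0] == 1]
        return before - len(self.slots)

f = MarkedFile()
for name in ["R4", "R2", "R3", "R16"]:
    f.insert(name)
f.delete("R3")
print(f.scan(), len(f.slots))   # R3 is hidden but still occupies a slot
print(f.reorganise(), len(f.slots))
```

Deletion touches only one marker, and the expensive compaction is deferred to the periodic reorganisation.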
Simple File Organisations
Ordered files of records - sequential files
- still extremely useful in a DBMS (auditing, recovery, security…)
- A record field is nominated and records are ordered based on that field: the ordering key
- insertion is expensive
- retrieval is easy (efficient) if exploiting the sort order
- binary search reduces time significantly
- Fast: Select * from Course order by <order>
- Slow: Select * from Course where <any other attribute> = c
Retrieval & Update in Sorted Files
Binary search on the ordering field to find the block with key = k:
B = # of blocks; Low := 1; High := B
Do while not (Found or NotThere)
    If Low > High
        Then NotThere
    Else
        Mid := (Low + High) div 2
        Read block Mid into the buffer
        If k < key field of first record in the block
            Then High := Mid - 1
        Else If k > key field of last record in the block
            Then Low := Mid + 1
        Else If record with key k is in the buffer
            Then Found
            Else NotThere
end
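A runnable sketch of the block-level binary search in Python. It assumes zero-based block numbers, non-empty blocks and integer keys sorted across the whole file; the function name and the sample file are illustrative only.

```python
def binary_search_blocks(blocks, k):
    """Binary search over blocks of a file ordered on the key.

    Returns (block_index, blocks_read); block_index is None if k is absent.
    """
    low, high = 0, len(blocks) - 1
    blocks_read = 0
    while low <= high:
        mid = (low + high) // 2
        block = blocks[mid]          # models reading block mid into a buffer
        blocks_read += 1
        if k < block[0]:             # smaller than the first key in the block
            high = mid - 1
        elif k > block[-1]:          # larger than the last key in the block
            low = mid + 1
        else:                        # k can only be in this block
            return (mid if k in block else None), blocks_read
    return None, blocks_read

ordered_file = [[1, 2, 3, 4], [6, 7, 10, 12], [14, 16, 23, 24], [27, 35]]
print(binary_search_blocks(ordered_file, 23))
print(binary_search_blocks(ordered_file, 15))
```

Only the first and last keys of each block are compared while narrowing the range, so the number of blocks read grows with log2(b) rather than with b.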
Operations on Ordered File
- Searching for records when criteria are specified in terms of the ordering field:
  - Reading the records in order of the ordering key values is extremely efficient.
  - Finding the next record from the current one in order of the ordering key usually requires no additional block accesses: the next record is in the same block or in the next block.
  - Using a search condition based on the value of an ordering key field results in faster access when the binary search technique is used.
  - A binary search can be done on the blocks rather than on the records. A binary search usually accesses log2(b) blocks, whether the record is found or not.
- There is no advantage if the search criterion is specified in terms of non-ordering fields.
Operations on Ordered File (contd)
- Inserting records is expensive. To insert a record:
  - find its correct position in the file, based on its ordering field value: cost log2(b)
  - make space in the file to insert the record in that position
  - on average, half the records of the file must be moved to make space for the new record
  - these file blocks must be read and rewritten to keep the order. The cost of insertion is then b/2 block transfers.
Operations on Ordered File (contd)
- Deleting a record:
  - Find the record using binary search based on the ordering field value: cost log2(b)
  - Delete the record
  - Reorganise part of the file (all records after the deleted one, b/2 blocks on average)
- Modifying a record:
  - Find the record using binary search and update as required
Operations on Ordered File
- Alternative ways for more efficient insertion:
  - keep some unused space in each block for new records (not good: the problem returns when that space is filled up)
  - create and maintain a temporary unordered file called an overflow file
    - New records are inserted at the end of the overflow file
    - Periodically, the overflow file is sorted and merged with the main file during file reorganization
    - Searching for a record must involve both files, main and overflow; searching is thus more expensive, but for a large main file the cost will still be close to log2(b)
- Alternative way for more efficient deletion:
  - Use the technique based on the deletion marker, as described earlier
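The overflow-file idea can be sketched as below. This is an in-memory illustration with hypothetical names: the main file is a sorted list of integer keys searched with the standard `bisect` module, the overflow file is an unsorted list, and reorganisation uses `heapq.merge`.

```python
import bisect
import heapq

class OrderedFileWithOverflow:
    def __init__(self, keys):
        self.main = sorted(keys)     # ordered main file
        self.overflow = []           # unordered overflow file

    def insert(self, key):
        self.overflow.append(key)    # cheap: append at the end of overflow

    def contains(self, key):
        i = bisect.bisect_left(self.main, key)       # binary search in main
        if i < len(self.main) and self.main[i] == key:
            return True
        return key in self.overflow                  # linear scan of overflow

    def reorganise(self):
        """Sort the overflow file and merge it into the main file."""
        self.main = list(heapq.merge(self.main, sorted(self.overflow)))
        self.overflow = []

f = OrderedFileWithOverflow([1, 2, 3, 4, 6, 7, 10, 12, 14, 16, 23, 24, 27, 35])
for key in (15, 9, 17):
    f.insert(key)
print(f.overflow, f.contains(9))
f.reorganise()
print(f.main)
```

Searches stay cheap only while the overflow file is small relative to the main file, which is why the merge has to run periodically.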
Access Properties of Simple Files
Note: in this and the following examples, record numbers correspond to values of the ordering field in ascending order. Each row below is one page.

Heap (sequential unordered) file:
R4 | R2 | R3 | R16
R1 | R7 | R35 | R10
R14 | R12 | R23 | R6
R24 | R27

Ordered (sequential) file:
R1 | R2 | R3 | R4
R6 | R7 | R10 | R12
R14 | R16 | R23 | R24
R27 | R35
Access Properties of Simple Files
Insert record R15 into the Heap file

Before insertion:
R4 | R2 | R3 | R16
R1 | R7 | R35 | R10
R14 | R12 | R23 | R6
R24 | R27

And after insertion:
R4 | R2 | R3 | R16
R1 | R7 | R35 | R10
R14 | R12 | R23 | R6
R24 | R27 | R15
Access Properties of Simple Files
Insert record R15 into the Ordered file

Before insertion:
R1 | R2 | R3 | R4
R6 | R7 | R10 | R12
R14 | R16 | R23 | R24
R27 | R35

And after insertion:
R1 | R2 | R3 | R4
R6 | R7 | R10 | R12
R14 | R15 | R16 | R23
R24 | R27 | R35

Notice that all records after R15 have changed their page location or position on the page.
Access Properties of Simple Files
Insert records R15, R9, R17 into the Ordered file using an overflow file

Before insertion:
Main File:
R1 | R2 | R3 | R4
R6 | R7 | R10 | R12
R14 | R16 | R23 | R24
R27 | R35
Overflow File: (empty)

And after insertion:
Main File:
R1 | R2 | R3 | R4
R6 | R7 | R10 | R12
R14 | R16 | R23 | R24
R27 | R35
Overflow File:
R15 | R9 | R17

Periodically the overflow file is sorted and merged with the main file.
Access Properties of Simple Files
Deletions from a Heap: R10, R3, R7
Simple delete:

Before the delete operations:
R4 | R2 | R3 | R16
R1 | R7 | R35 | R10
R14 | R12 | R23 | R6
R24 | R27

After the delete operations:
R4 | R2 | (free) | R16
R1 | (free) | R35 | (free)
R14 | R12 | R23 | R6
R24 | R27
Access Properties of Simple Files
Deletions from a Heap: R10, R3, R7
Using the deletion marker technique:

Before the delete operations:
R4 1 | R2 1 | R3 1 | R16 1
R1 1 | R7 1 | R35 1 | R10 1
R14 1 | R12 1 | R23 1 | R6 1
R24 1 | R27 1

After the delete operations:
R4 1 | R2 1 | R3 0 | R16 1
R1 1 | R7 0 | R35 1 | R10 0
R14 1 | R12 1 | R23 1 | R6 1
R24 1 | R27 1

The deletion markers are set to ‘0’, and later these records will be physically removed when the file is reorganised.
Access Properties of Simple Files
Deletions from an ordered file: R10, R3, R7
Simple delete:

Before the delete operations:
R1 | R2 | R3 | R4
R6 | R7 | R10 | R12
R14 | R16 | R23 | R24
R27 | R35

After the delete operations:
R1 | R2 | R4 | R6
R12 | R14 | R16 | R23
R24 | R27 | R35
Access Properties of Simple Files
Deletions from an ordered file: R10, R3, R7
Using the deletion marker technique:

Before the delete operations:
R1 1 | R2 1 | R3 1 | R4 1
R6 1 | R7 1 | R10 1 | R12 1
R14 1 | R16 1 | R23 1 | R24 1
R27 1 | R35 1

After the delete operations:
R1 1 | R2 1 | R3 0 | R4 1
R6 1 | R7 0 | R10 0 | R12 1
R14 1 | R16 1 | R23 1 | R24 1
R27 1 | R35 1

The deletion markers are set to ‘0’, and later these records will be physically removed when the file is reorganised.
Retrieval and Update In Heaps
Quick summary
- can only use linear search
- insertion is fast
- deletion, update are slow
- parameter search (e.g. SELECT…WHERE) is slow
- unconditional search can be fast if
  - records are of fixed length
  - records do not span blocks:
    - j-th record located by position in block j / bfr
    - average time to find a single record = b / 2 (b = number of blocks)
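Locating the j-th record by position can be sketched as integer division. The sketch assumes that records and blocks are both numbered from 0 and that j / bfr is meant as integer (floor) division; the function name is hypothetical, and the values bfr = 6 and R = 150 reuse the earlier blocking example.

```python
def locate_record(j: int, bfr: int, record_size: int):
    """Return (block_number, byte_offset_in_block) of the j-th record.

    Valid only for fixed-length, unspanned records numbered from 0.
    """
    block_number, slot = divmod(j, bfr)   # floor(j / bfr) and j mod bfr
    return block_number, slot * record_size

print(locate_record(0, 6, 150))
print(locate_record(13, 6, 150))
```

Because the block number is computed rather than searched for, only one block needs to be read.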
Retrieval & Update in Sorted Files
Quick summary
- retrieval on the key field is fast: the “next” record is nearby
- any other retrieval either requires a sort, or an index, or is as slow as a heap
- update, delete, insert are slow (find block, update block, rewrite block)
FAST ACCESS FOR DATABASE: HASHING
- Types of hashing: static or dynamic
- What is the point of hashing?
  - reduce a large address space
  - provide close to direct access
  - provide reasonable performance for all U, I, D, S (update, insert, delete, select)
- What is a hash function?
  - properties
  - behaviour
- Collisions
- Collision resolution
- Open addressing
- Summary
What Is Hashing, and What Is It For?
- “direct” access to the block containing the desired record
- reduce the number of blocks read or written
- allow for file expansion and contraction with minimal file reorganising
- permit retrieval on “hashed” fields without re-sorting the file
- no need to allocate contiguous disk areas
- if the file is small, internal hashing; otherwise external
- no direct access other than by hashing
A basic example of hashing:
- There are 25 rows of seats, with 3 seats per row (75 seats total)
- We have to allocate each person to a row in advance, at random
- We will hash on their family name so as to find the person’s row number directly, knowing only the name
- The database is logically a single table ROOM (Name, Age, Attention), implemented as a blocked, hashed file
The hashing process
The hash process is:
    Loc = 0
    Until no more characters in YourName
        Add the alphabetic position of the character to Loc
    Calculate RowNum = Loc mod 25

 A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Examples - Hashed Names

NAME        Loc   RowNum (mod 25)
REYE         53    3
ANDERSON     90   15
MCWILLIAM    95   20
TANG         42   17
LEE          22   22

Where is MCWILLIAM? Hash(MCWILLIAM) = Row 20
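The hash process can be sketched in Python and checked against the example names. The function name `row_for` is hypothetical, and ignoring non-letter characters (such as spaces or hyphens) is an assumption the slides do not state.

```python
def row_for(name: str, rows: int = 25):
    """Return (Loc, RowNum): the sum of alphabetic positions and Loc mod rows."""
    loc = 0
    for ch in name.upper():
        if "A" <= ch <= "Z":                 # assumption: skip non-letters
            loc += ord(ch) - ord("A") + 1    # A=1, B=2, ..., Z=26
    return loc, loc % rows

for name in ["REYE", "ANDERSON", "MCWILLIAM", "TANG", "LEE"]:
    print(name, row_for(name))
```

Any name can be mapped to its row without consulting the other records, which is the sense in which hashing gives close to direct access.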
Name Hashing Example continued

Name   Row | Name   Row | Name    Row | Name   Row
Lee     22 | Alex    17 | George    7 | Anne     9
West    17 | Rita    23 | Guy       3 | Will     6
James   23 | Jodie   18 | Dave      7 | Wilf     0
Anna     5 | Jill    18 | Don       8 | Walt     6
Anita   18 | Lily     8 | Dixy     12 | Jack     0
Jie     24 | Ash      3 | Jon      14 | Lana     3
Kenny   19 | Ben     21 | Nina     13 | Olga    10
Marie   21 | Kay     12 | May      14 | Fred     8
Lois     5 | Peter   14 | Max      13 | Tania   18
Best    21 | Paul     0 | Nora     23 | Tom     23
Rob     10 | Phil    20 | Cash      6 | Julia    3
Lou     23 | Pat     11 | Foot      6 | Leah     6
Axel    17 | Ed       9 | Tan      10 | Ling    17
The results after 52 arrivals

RowNo  #Sitting  Alloc +
 0      3
 1      0
 2      0
 3      3         1
 4      0
 5      2
 6      3         2
 7      2
 8      3
 9      2
10      3
11      1
12      2
13      2
14      3
15      0
16      0
17      3         1
18      3         1
19      1
20      1
21      3
22      1
23      3         2
24      1
The Room as a Hashed File
- Each person has a hash key: the name
- Each person is a record
- Each row is a hardware block (bucket)
- Each row number is the address of a bucket
- Records here are fixed-length (and 3 records per block)
- The leftover people are collisions (key collisions)
- They will have to be found a seat by collision resolution
Collision Resolution
- Leave an empty seat in each row
  - Under-population: blocks 66% full
- A notice on the end of the row: “extra seat for row N can be found at the rear exit”
  - the bucket’s overflow chain points to an overflow page containing the record
- “Everyone stand up while we reallocate seats”
  - file reorganisation
Collision Resolution
Strategy 1 (open addressing):
- “Nora” is the 4th arrival for row 23
- Place the new arrival in the next higher-numbered block with a vacancy
- Retrieval - search for “Nora”:
  - Retrieve block 23
  - Read the records in block 23 consecutively. If “Nora” is not found, try block 24…
- Disadvantages:
  - May need to read the whole file consecutively on some keys
  - Blocks will gradually fill up with out-of-place records
  - Deletions cause either immediate or periodic reorganisation
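Strategy 1 can be sketched as a small bucketed file. Several details are assumptions made for the illustration: the home bucket is passed in rather than hashed, probing wraps around from the last bucket to bucket 0, and a search stops at the first bucket that still has a vacancy, which is only safe while no records have been deleted.

```python
class OpenAddressedFile:
    def __init__(self, num_buckets=25, capacity=3):
        self.buckets = [[] for _ in range(num_buckets)]
        self.capacity = capacity

    def insert(self, key, home):
        """Place key in the first bucket at or after home that has room."""
        n = len(self.buckets)
        for step in range(n):
            b = (home + step) % n            # assumption: wrap around
            if len(self.buckets[b]) < self.capacity:
                self.buckets[b].append(key)
                return b
        raise RuntimeError("file is full")

    def find(self, key, home):
        """Return (bucket, blocks_read); bucket is None if key is absent."""
        n = len(self.buckets)
        blocks_read = 0
        for step in range(n):
            b = (home + step) % n
            blocks_read += 1
            if key in self.buckets[b]:
                return b, blocks_read
            if len(self.buckets[b]) < self.capacity:
                break                        # vacancy: key was never pushed past here
        return None, blocks_read

room = OpenAddressedFile()
for name in ["James", "Lou", "Rita", "Nora"]:   # all hash to row 23
    placed = room.insert(name, 23)
print(placed, room.find("Nora", 23))
```

The out-of-place record costs an extra block read on every lookup, and it also takes a seat that a later arrival for row 24 might need.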
Collision Resolution
Strategy 2:
- Reserve some rows (buckets) for overflow: blocks 25, 26 and 27, or recalculate the hash function with a smaller modulus, say 20 instead of 25
- “Julia” is then the 4th arrival for block 3
- Place her in the lowest-numbered overflow block with available space (say 26), optionally placing a pointer in bucket 3 that points to block 26
Retrieval - search for “Julia”:
- Retrieve block 3
- Read the records in block 3 consecutively. If “Julia” is not found, either:
  - search the overflow blocks consecutively, or
  - follow the pointer to block 26 (chaining)
Disadvantages:
- The overflow area gradually fills up, giving longer retrieval times
- Deletions and additions require periodic reorganisation
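A minimal Python sketch of Strategy 2 with the optional pointers follows. The class name, the block capacity of 3 and the per-bucket list of overflow pointers are assumptions made for illustration; as on the slide, the overflow blocks are numbered 25 to 27 and are shared by all primary blocks.

```python
class OverflowFile:
    """Primary blocks plus a shared overflow area, with chaining pointers."""

    def __init__(self, primary=25, overflow=(25, 26, 27), capacity=3):
        self.capacity = capacity
        self.overflow = list(overflow)
        self.blocks = {b: [] for b in list(range(primary)) + self.overflow}
        self.pointers = {}   # home block -> overflow blocks holding its records

    def insert(self, home, name):
        if len(self.blocks[home]) < self.capacity:
            self.blocks[home].append(name)
            return home
        for b in self.overflow:                     # lowest label first
            if len(self.blocks[b]) < self.capacity:
                self.blocks[b].append(name)
                chain = self.pointers.setdefault(home, [])
                if b not in chain:
                    chain.append(b)
                return b
        raise RuntimeError("overflow area is full: reorganise the file")

    def search(self, home, name):
        """Return (block number or None, number of blocks read)."""
        reads = 1
        if name in self.blocks[home]:
            return home, reads
        for b in self.pointers.get(home, []):       # follow the pointers
            reads += 1
            if name in self.blocks[b]:
                return b, reads
        return None, reads
```

If block 3 is full and overflow block 25 has already been filled by another bucket, inserting "Julia" with home block 3 places her in block 26, and a later search reads block 3 and then block 26. Without the pointers, the search would have to read the overflow blocks consecutively.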
Collision Resolution
More formally:
- Open addressing: if the location specified by the hash address is occupied, the subsequent positions are checked in order until an unused (empty) position is found.
- Chaining: various overflow locations are kept and a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
- Multiple hashing: a second hash function is applied if the first results in a collision.
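Multiple hashing can be sketched as follows. The two hash functions `h1` and `h2` are hypothetical choices, and the fallback to open addressing when the second function also collides is an assumption, since the slide does not say what happens in that case.

```python
def h1(key, size):
    """First hash function: key mod table size."""
    return key % size


def h2(key, size):
    """Assumed second hash function, used only after a collision on h1."""
    return (key // size) % size


def insert(table, key):
    """Insert key into a list of slots (None = empty); return the slot used."""
    size = len(table)
    slot = None
    for h in (h1, h2):
        slot = h(key, size)
        if table[slot] is None:
            table[slot] = key
            return slot
    # Both functions collided: fall back to open addressing (assumption).
    for step in range(1, size):
        probe = (slot + step) % size
        if table[probe] is None:
            table[probe] = key
            return probe
    raise RuntimeError("table is full")
```

In a table of 10 slots, 72 goes to slot 2 through `h1`; 62 collides there and is placed in slot 6 through `h2`; 32 also collides on `h1` and lands in slot 3.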
Performance on Hashed Files
Retrieve (SELECT): very fast if the name (the hash key) is known, otherwise very slow, since the whole file must be scanned
  SELECT * FROM ROOM WHERE NAME = 'McWilliam'
  SELECT * FROM ROOM WHERE AGE > 30
Update: same
  UPDATE ROOM SET ATTENTION = 'low' WHERE NAME = 'McWilliam'
  UPDATE ROOM SET ATTENTION = 'high' WHERE AGE > 50 OR AGE < 10
Performance on Hashed Files
Delete: same as SELECT, UPDATE
  DELETE FROM ROOM WHERE NAME = 'Nora'        (uses hash - fast)
  DELETE FROM ROOM WHERE NAME LIKE 'No%'      (can't use hash - slow)
Insert: unpredictable
  INSERT INTO ROOM VALUES ('Smyth', 'high')
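The difference between the fast and slow cases can be made concrete by counting bucket reads. The sketch below is an assumption-laden toy: the hash function on NAME, the bucket count, the (name, age) row layout and the ages are all hypothetical, and overflow is ignored.

```python
NUM_BUCKETS = 5  # assumed file size in buckets


def bucket_of(name):
    """Hypothetical hash function on the NAME field."""
    return sum(name.encode("utf-8")) % NUM_BUCKETS


def build(rows):
    """Distribute (name, age) rows over the buckets by hashing the name."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[0])].append(row)
    return buckets


def select_by_name(buckets, name):
    """WHERE NAME = ...: read only the bucket the hash function points to."""
    home = buckets[bucket_of(name)]
    return [r for r in home if r[0] == name], 1


def select_by_age_over(buckets, limit):
    """WHERE AGE > ...: the hash on NAME is no help, so every bucket is read."""
    hits = [r for b in buckets for r in b if r[1] > limit]
    return hits, len(buckets)
```

Each function returns the matching rows together with the number of buckets read: always 1 for the equality search on the hash key, and always the full bucket count for the condition on AGE. A prefix pattern on NAME falls into the second case as well, because the hash of a prefix says nothing about the hash of the full name.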
Internal Hashing
- Internal hashing is used as an internal search structure within a program whenever a group of records is accessed exclusively by using the value of one field.
- Applicable to smaller files
- Hashed in main memory: fast lookup in store
- R records, R-length array
- The hash function transforms the key field into an array subscript in the range 0 to R - 1
    hash(Key Value) = Key Value (mod R)
- The subscript is the record address in store
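A minimal in-memory sketch of internal hashing is given below, assuming integer keys and R = 7. Collisions are resolved by open addressing, which is one of the strategies described earlier but is an assumption here; the function names and the record "Jones" are hypothetical.

```python
R = 7  # number of record slots (assumed small for illustration)


def hash_key(key, r=R):
    """hash(Key Value) = Key Value mod R, giving a subscript in 0..R-1."""
    return key % r


def put(store, key, record):
    """Store record under key; return the subscript actually used."""
    r = len(store)
    start = hash_key(key, r)
    for step in range(r):
        i = (start + step) % r
        if store[i] is None or store[i][0] == key:
            store[i] = (key, record)
            return i
    raise RuntimeError("store is full")


def get(store, key):
    """Return the record stored under key, or None if it is absent."""
    r = len(store)
    start = hash_key(key, r)
    for step in range(r):
        i = (start + step) % r
        if store[i] is None:
            return None
        if store[i][0] == key:
            return store[i][1]
    return None


store = [None] * R
put(store, 17, "Smith")   # 17 mod 7 = 3
put(store, 8, "Brown")    # 8 mod 7 = 1
put(store, 24, "Jones")   # 24 mod 7 = 3 is taken, so slot 4 is used
```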
External Hashing
- Hashing for disk files is called external hashing.
- The address space is made of buckets, each of which holds multiple records.
- A bucket is either one disk block or a cluster of contiguous blocks.
- The hashing function maps a key into a relative bucket number.
- A table maintained in the file header converts the bucket number into the corresponding disk block address.
External Hashing (static)
- The hashing scheme is called static hashing if a fixed number of buckets M is allocated.
- To retrieve a record given a search condition on the key, apply the hashing function to the key to obtain the number of the bucket that may contain the record, then examine that bucket. If the record is not in that bucket, continue the search in the overflow buckets.
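The retrieval path through the header table and the overflow buckets might look like this in Python. The block addresses, the dictionary layout of a block and the record values are invented for illustration; only the steps (hash to a relative bucket number, translate it through the header table, examine the bucket, then follow overflow) come from the slides.

```python
def lookup(key, header, disk):
    """Return (record, blocks_read) for key, or (None, blocks_read)."""
    bucket = key % len(header)       # relative bucket number, M = len(header)
    address = header[bucket]         # header table: bucket number -> block address
    reads = 0
    while address is not None:
        block = disk[address]
        reads += 1
        for k, record in block["records"]:
            if k == key:
                return record, reads
        address = block["overflow"]  # continue in the overflow bucket, if any
    return None, reads


# Hypothetical file with M = 4 buckets; the block addresses are made up.
header = [112, 48, 305, 77]
disk = {
    112: {"records": [(4, "rec-4")], "overflow": None},
    48: {"records": [(1, "rec-1"), (5, "rec-5")], "overflow": None},
    305: {"records": [(2, "rec-2"), (6, "rec-6")], "overflow": 900},
    77: {"records": [], "overflow": None},
    900: {"records": [(10, "rec-10")], "overflow": None},
}
```

Here `lookup(6, header, disk)` finds its record after one block read, while `lookup(10, header, disk)` needs two because key 10 sits in the overflow bucket chained to bucket 2.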
External Hashing (static)
Construction of hashed file
- Identify the size of the file, choose the hashing function (according to the anticipated number of buckets) and select the collision resolution procedure. These choices hold for the life of the file.
- Apply the hashing function to each inserted record to get the bucket number, and place the record in the bucket with that number
External Hashing (static, continued)
- If the bucket is full, apply the selected collision resolution procedure
- If the number of records in overflow buckets is large and/or the distribution of records across buckets is highly non-uniform, reorganise the file using a changed hashing function (tuning)
External Hashing (static)
[Figure: key -> H -> H(key) mod N -> primary buckets 0, 1, ..., N-1, plus overflow page]
Problems of Static Hashing
- Number of buckets is fixed
  - shrinkage causes wasted space
  - growth causes long overflow chains
- Solutions:
  - reorganise
  - re-hash
  - use dynamic hashing...
Extendible Hashing
- Previously, to insert a new record into a full bucket:
  - add an overflow page, or
  - reorganise by doubling the bucket allocation and redistributing the records
- This is a poor solution:
  - the entire file is read
  - twice as many pages have to be written
- Solution: Extendible Hashing
  - add a directory of pointers to buckets
  - double the number of buckets by doubling the directory
  - split only the bucket that has overflowed
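The three ideas in the solution (a directory of pointers, doubling only the directory, splitting only the overflowed bucket) can be sketched as below. This is an illustrative sketch, not a definitive implementation: it assumes non-negative integer keys used directly as hash values, indexes the directory by the low-order bits, and uses a bucket capacity of 2. The class and method names are hypothetical, and more than `capacity` identical keys would make `insert` loop forever.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth      # local depth: number of hash bits this bucket uses
        self.keys = []


class ExtendibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        """Directory slot = the low-order global_depth bits of the key."""
        return key & ((1 << self.global_depth) - 1)

    def contains(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)     # split only the bucket that overflowed

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Double the directory: the new half points at the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        old, bucket.keys = bucket.keys, []
        for k in old:               # redistribute only this bucket's records
            self.directory[self._index(k)].keys.append(k)
```

Doubling the directory copies pointers only; no data pages are read or written until a specific bucket is split, which is the saving over reorganising the whole file.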
LINEAR HASHING
- It does not have a directory at all
- Instead, it has a “family” of hash functions to manage dynamic expansion and contraction of the file
- Start with a set number of M buckets 0..M-1 with hashing function mod M
- Split them in linear order when more space is needed. The next hashing function is mod 2M, and subsequent ones are mod 4M, mod 8M etc. as required
- Example: block capacity is 2 records. Records with key values 72, 62 and 32 collide under hashing function (mod 10), since all three hash to 2. Under the next hashing function (mod 20) they no longer all collide: bucket 12 holds 72 and 32, and bucket 2 holds 62.
- Combines controlled overflow with new space acquisition
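The example can be reproduced with a small sketch of linear hashing. It assumes the common formulation in which an overflow triggers a split of the bucket at a split pointer (not necessarily the bucket that overflowed) and the modulus doubles each round; the class and attribute names are hypothetical, and each Python list stands for a primary block together with its overflow chain.

```python
class LinearHashFile:
    def __init__(self, m=10, capacity=2):
        self.m = m                  # initial number of buckets M
        self.capacity = capacity    # records per primary block
        self.level = 0              # current function: key mod (2**level * M)
        self.next = 0               # next bucket to split, in linear order
        self.buckets = [[] for _ in range(m)]

    def address(self, key):
        n = self.m * (2 ** self.level)
        b = key % n
        if b < self.next:           # this bucket was already split this round
            b = key % (2 * n)
        return b

    def contains(self, key):
        return key in self.buckets[self.address(key)]

    def insert(self, key):
        b = self.address(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > self.capacity:
            self._split()           # an overflow triggers exactly one split

    def _split(self):
        n = self.m * (2 ** self.level)
        old = self.buckets[self.next]
        self.buckets.append([])     # the new bucket has number next + n
        self.buckets[self.next] = [k for k in old if k % (2 * n) == self.next]
        self.buckets[self.next + n] = [k for k in old if k % (2 * n) != self.next]
        self.next += 1
        if self.next == n:          # every bucket split: start a new round
            self.level += 1
            self.next = 0
```

With M = 10 and capacity 2, inserting 72, 62, 32, then 73, 63, 33, then 74, 64, 34 causes three overflows and therefore three splits, of buckets 0, 1 and 2 in that order. After the third split, bucket 2 holds 62 and the new bucket 12 holds 72 and 32, matching the example. Until then, 32 simply stays in bucket 2's overflow, which is the "controlled overflow" mentioned above.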
Linear Hashing - Advantages
- Another type of dynamic hashing
- does not require a directory
- manages collisions well
- accommodates insertions and deletions well
- allows overflow chain length to be traded against average space utilisation
- uses several hash functions
HASHING SUMMARY
Comparison of Simple and Hashed Files
- Pages in a hashed file are grouped into buckets (1 block or a cluster of contiguous blocks)
  - reduce a large address space
  - provide close to direct access
- Static hashing has some disadvantages, which are addressed in dynamic hashing solutions (extendible, linear)
- Static hashed files are kept at about 80% occupancy, then reorganised. New pages are added when each existing page is about 80% full
- Hence, the time to read the entire file is about 1.25 times that of a non-hashed file
- Dynamic hashing provides flexibility in usage of file storage space (expansion and contraction)
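The factor of 1.25 follows from the 80% occupancy figure. As a rough sketch, assume the non-hashed file packs its pages completely and that the time to read a file is proportional to its number of pages; the symbols P and T below are introduced only for this illustration. The same records then need 1/0.8 times as many pages in the hashed file:

```latex
\[
\frac{T_{\text{hashed}}}{T_{\text{non-hashed}}}
  \approx \frac{P_{\text{hashed}}}{P_{\text{non-hashed}}}
  = \frac{1}{0.8}
  = 1.25
\]
```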