File Organisation & Hashing

File Organisation

Placing a File on Disk

File: a sequence of records
Records
  Record type
  Record fields
  Data type
  Number of bytes in a field: fixed or variable

Record Characteristics

A logical view:
  SELECT * FROM STUDENTS, or
  (Smith, 17, 1, CS), (Brown, 8, 2, CS), or
  STUDENT(Name, Number, Class, Major)

A physical view:
  (20 bytes + 4 bytes + 4 bytes + 3 bytes)
  data types determine record length
  records can be of fixed or variable length
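The physical view can be made concrete with a small sketch. It assumes, purely for illustration, that Number and Class are 4-byte integers and that Name and Major are fixed-width ASCII fields. The names `STUDENT`, `pack_student` and `unpack_student` are hypothetical.

```python
import struct

# Assumed layout for STUDENT(Name, Number, Class, Major), matching the slide's
# 20 + 4 + 4 + 3 bytes. "<" means little-endian with no padding bytes.
STUDENT = struct.Struct("<20sii3s")

def pack_student(name, number, class_, major):
    """Encode one record; struct pads short strings with NUL bytes."""
    return STUDENT.pack(name.encode("ascii"), number, class_, major.encode("ascii"))

def unpack_student(raw):
    """Decode one record and strip the NUL padding from the text fields."""
    name, number, class_, major = STUDENT.unpack(raw)
    return (name.rstrip(b"\0").decode("ascii"), number, class_,
            major.rstrip(b"\0").decode("ascii"))

record = pack_student("Smith", 17, 1, "CS")
print(STUDENT.size, len(record))  # both equal 20 + 4 + 4 + 3
```

Because every field has a fixed width, a field can be read at a fixed offset from the record start. For example, Number always occupies bytes 20 to 23.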

Fixed Versus Variable Length Records

FIXED LENGTH:
  every record has the same fields
  a field can be located relative to the record start

VARIABLE LENGTH FIELDS:
  some fields have unknown length
  use a field separator
  use a record terminator

WHAT IF RECORDS ARE SMALLER THAN A BLOCK? - BLOCKING FACTOR

WHAT IF RECORDS ARE LARGER THAN A BLOCK? - SPANNING RECORDS
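A minimal sketch of variable-length records that use a field separator and a record terminator. The separator byte `|` and the terminator byte `\n` are assumptions chosen for readability. A real system would pick bytes that cannot occur inside field values, or escape them.

```python
FIELD_SEP = b"|"    # assumed field separator
RECORD_END = b"\n"  # assumed record terminator

def encode(records):
    """Serialise tuples of strings into one byte string."""
    return b"".join(
        FIELD_SEP.join(field.encode("ascii") for field in rec) + RECORD_END
        for rec in records
    )

def decode(raw):
    """Split on the terminator, then on the separator, to recover the records."""
    return [
        tuple(field.decode("ascii") for field in line.split(FIELD_SEP))
        for line in raw.split(RECORD_END)
        if line
    ]

data = encode([("Smith", "17", "1", "CS"), ("Brown", "8", "2", "CS")])
```

Unlike the fixed-length case, a field can only be found by scanning the record for separators.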

Record blocking

Allocating records to disk blocks

Unspanned records
  Each record is fully contained in one block
  Many records in one block
  Blocking factor bfr: the number of records that fit in one block
  Example: block size B = 1024, record size (fixed) R = 150
    bfr = floor(1024 / 150) = 6 (floor function; 1024 - 6 * 150 = 124 bytes per block stay unused)

Spanned organization
  A record is 'continued' on the consecutive block
  Requires a pointer to the block holding the remainder of the record

If records are of variable length, bfr can represent the average number of records per block (the rounding function does not apply)
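The blocking-factor arithmetic can be sketched as follows. The function names are hypothetical, and the spanned estimate deliberately ignores the space taken by the continuation pointers.

```python
import math

def blocking_factor(block_size, record_size):
    """Records per block for unspanned, fixed-length records: floor(B / R)."""
    if record_size > block_size:
        raise ValueError("record does not fit in one block; spanning is required")
    return block_size // record_size

def blocks_needed(num_records, block_size, record_size, spanned=False):
    """Blocks needed to store the file (pointer overhead ignored when spanned)."""
    if spanned:
        return math.ceil(num_records * record_size / block_size)
    return math.ceil(num_records / blocking_factor(block_size, record_size))

print(blocking_factor(1024, 150))      # the slide's example
print(blocks_needed(1000, 1024, 150))  # unspanned file of 1000 records
```

With B = 1024 and R = 150, 1000 records need 167 blocks unspanned and 147 blocks spanned under these assumptions.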

File structure

File: a set of pages (disk blocks) storing records
File header
  Record format, types of separators
  Block address(es)
Blocks allocated
  Contiguous
  Linked (use of block pointers)
  Linked clusters
  Indexed

Searching for a record

To search for a record on disk, one or more file blocks are copied into buffers.
Programs search for the desired record in the buffers, using the information in the file header.
If the address of the block holding the desired record is not known, the search program must do a linear search through the file blocks. Each file block is copied into a buffer and searched until either the record is located or all the file blocks have been searched unsuccessfully.
The goal of a good file organization is to locate the block that contains a desired record with a minimal number of block transfers.

Operations on Files

Because of the complex path from stored data to the user, a DBMS offers a range of I/O operations:

OPEN - access the file and prepare the pointer
FIND (LOCATE) - find the first record
FINDNEXT
FINDALL - set (set-at-a-time operation)
READ
INSERT
DELETE
MODIFY
CLOSE
REORGANISE - set
READ-ORDERED (FIND-ORDERED) - set

File organization and access method

Difference between the terms file organization and access method:

A file organization is the organization of the data of a file into records, blocks, and access structures: the way of placing records and blocks on the storage medium.
An access method provides a group of operations that can be applied to a file, resulting in retrieval, modification and reorganisation.
  One file organization can accept many different access methods.
  Some access methods, though, can be applied only to files with a specific file organization. For example, one cannot apply an indexed access method to a file without an index.

Why do Access Methods matter

The unit of transfer between disk and main memory is a block
Data must be in memory for the DBMS to use it
DBMS memory is handled in units of a page, e.g. 4K, 8K
Pages in memory represent one or more hardware blocks from the disk
If a single item is needed, the whole block is transferred
The time taken for an I/O depends on the location of the data on the disk and is lower when few seeks and rotational delays are needed; recall that:
  access time = seek time + rotational delay + transfer time

The reason many DBMSs do not rely on the OS file system is:
  higher-level DB operations, e.g. JOIN, have a known pattern of page accesses and can be translated into known sets of I/O operations
  the buffer manager can PRE-FETCH pages by anticipating the next request. This is especially efficient when the required data are stored CONTIGUOUSLY on disk

Simple File Organisations

Unordered files of records: Heap or Pile file
  New records are inserted at EOF, or anywhere
  Locating a record is by a linear search
  Insertion is easy
  Retrieval of an individual record, or in any order, is difficult (time consuming)

Question: how many blocks, on average, does one need to read to find a single record?

Fast: Select * from Course
Slow: Select count(*) from Course group by Course_Number

Operations on Unordered File

Inserting a new record is very efficient:
  The address of the last file block is kept in the file header
  The last disk block of the file is copied into a buffer page
  The new record is added, or a new page is opened; the page is then rewritten back to the disk block

Searching for a record using any search condition in a file stored in b blocks:
  Linear search through the file, block by block
  Cost = b/2 block transfers on average, if only one record satisfies the search condition
  Cost = b block transfers if no records or several records satisfy the search condition: the program must read and search all b blocks in the file

To delete a record:
  find its block and copy the block into a buffer page
  delete the record from the buffer
  rewrite the updated page back to the disk block

Note: unused space in the block could be used in future for a new record if suitable (some bookkeeping on unused space in file blocks is necessary)
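The insert and search steps above can be sketched with an in-memory stand-in for a heap file. The class `HeapFile` is hypothetical. A "block read" is simulated by counting how many block lists are visited, and records are reduced to their key values.

```python
class HeapFile:
    """Unordered file held as a list of blocks, each a list of records."""

    def __init__(self, bfr):
        self.bfr = bfr      # blocking factor: records per block
        self.blocks = [[]]  # the last entry plays the role of the last file block

    def insert(self, record):
        """Append to the last block, opening a new block when it is full."""
        if len(self.blocks[-1]) == self.bfr:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def find(self, key):
        """Linear search, block by block; returns (found, blocks_read)."""
        for blocks_read, block in enumerate(self.blocks, start=1):
            if key in block:
                return True, blocks_read
        return False, len(self.blocks)

heap = HeapFile(bfr=4)
for key in [4, 2, 3, 16, 1, 7, 35, 10, 14, 12, 23, 6, 24, 27]:
    heap.insert(key)
heap.insert(15)  # lands in the last block without touching the others
```

An unsuccessful search reads every block, while a successful one stops at the block that holds the record.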

Special Deletion Procedures

Technique used for record deletion:
  Each record has an extra byte or bit, called a deletion marker, set to '1' at insertion *)
  Do not remove a deleted record, but reset its deletion marker to '0' when deleted
  A record with its deletion marker set to '0' is not used by application programs
  From time to time reorganise the file: physically remove deleted records or reclaim unused space

*) Just for simplicity we assume that values of deletion markers are '0' or '1'. A system can actually choose other characters or combinations of bits as values of deletion markers.
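A minimal sketch of the deletion-marker technique, using the slide's convention of 1 for a live record and 0 for a deleted one. The class `MarkedFile` and its method names are hypothetical, and records are reduced to their key values.

```python
LIVE, DELETED = 1, 0  # marker values follow the slide's simplifying convention

class MarkedFile:
    def __init__(self, keys):
        # every record is stored together with its deletion marker
        self.slots = [[key, LIVE] for key in keys]

    def delete(self, key):
        """Logical delete: flip the marker, leave the record in place."""
        for slot in self.slots:
            if slot[0] == key and slot[1] == LIVE:
                slot[1] = DELETED
                return True
        return False

    def scan(self):
        """What application programs see: live records only."""
        return [key for key, marker in self.slots if marker == LIVE]

    def reorganise(self):
        """Physically remove the records marked as deleted."""
        self.slots = [slot for slot in self.slots if slot[1] == LIVE]

f = MarkedFile([4, 2, 3, 16, 1, 7, 35, 10, 14, 12, 23, 6, 24, 27])
for key in (10, 3, 7):
    f.delete(key)
```

Until `reorganise` runs, the file still occupies the same space; only the visible contents change.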

Simple File Organisations

Ordered files of records - sequential files
  still extremely useful in a DBMS (auditing, recovery, security...)
  A record field is nominated and records are ordered based on that field: the ordering key
  Insertion is expensive
  Retrieval is easy (efficient) if it exploits the sort order
  Binary search reduces search time significantly

Fast: Select * from Course order by <ordering field>
Slow: Select * from Course where <any other attribute> = c

Retrieval & Update in Sorted Files

Binary search on the ordering field to find the block with key = k:

B = # of blocks; Low := 1; High := B
Do while not (Found or NotThere)
    If Low > High
        Then NotThere
    Else
        Mid := (Low + High) div 2
        Read block Mid into the buffer
        If k < key field of first record in the block
            Then High := Mid - 1
        Else If k > key field of last record in the block
            Then Low := Mid + 1
        Else If a record with key k is in the buffer
            Then Found
        Else NotThere
end
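The block-level binary search can be sketched in Python as below. This is an illustration only: the file is modelled as a list of non-empty, sorted blocks held in memory, blocks are numbered from 0, and `find_block` is a hypothetical name.

```python
def find_block(blocks, k):
    """Binary search over blocks; returns (block_index or None, blocks_read)."""
    low, high = 0, len(blocks) - 1
    blocks_read = 0
    while low <= high:
        mid = (low + high) // 2
        block = blocks[mid]  # stands in for reading block mid into a buffer
        blocks_read += 1
        if k < block[0]:     # smaller than the first key in the block
            high = mid - 1
        elif k > block[-1]:  # larger than the last key in the block
            low = mid + 1
        else:                # k falls in this block's key range
            return (mid if k in block else None), blocks_read
    return None, blocks_read  # range exhausted: the key is not in the file

ordered = [[1, 2, 3, 4], [6, 7, 10, 12], [14, 16, 23, 24], [27, 35]]
print(find_block(ordered, 16))
```

Only the first and last keys of a block are compared until the right block is reached, so the number of blocks read grows with log2(b) rather than with b.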

Operations on Ordered File

Searching for records when criteria are specified in terms of the ordering field:
  Reading the records in order of the ordering key values is extremely efficient
  Finding the next record from the current one in order of the ordering key usually requires no additional block accesses: the next record is in the same block or in the next block
  A search condition based on the value of an ordering key field results in faster access when the binary search technique is used
  A binary search can be done on the blocks rather than on the records. A binary search usually accesses log2(b) blocks, whether the record is found or not

No advantage if the search criterion is specified in terms of non-ordering fields

Operations on Ordered File (contd)

Inserting records is expensive. To insert a record:
  find its correct position in the file, based on its ordering field value: cost log2(b)
  make space in the file to insert the record in that position
  on average, half the records of the file must be moved to make space for the new record
  these file blocks must be read and rewritten to keep the order; the cost of insertion is then about b/2 blocks, each read and rewritten

Operations on Ordered File (contd)

Deleting a record:
  Find the record using binary search based on the ordering field value: cost log2(b)
  Delete the record
  Reorganise part of the file (all records after the deleted one, b/2 blocks on average)

Modifying a record:
  Find the record using binary search and update as required

Operations on Ordered File

Alternative ways for more efficient insertion:
  Keep some unused space in each block for new records (not good: the problem returns when that space is filled up)
  Create and maintain a temporary unordered file called an overflow file
    New records are inserted at the end of the overflow file
    Periodically, the overflow file is sorted and merged with the main file during file reorganization
    Searching for a record must involve both files, main and overflow; searching is thus more expensive, but for a large main file the cost will still be close to log2(b)

Alternative way for more efficient deletion:
  Use the technique based on the deletion marker, as described earlier
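The overflow-file idea can be sketched at record level as follows. Blocks are left out to keep it short, and the class and method names are hypothetical. Search is a binary search on the main file followed by a linear scan of the overflow file.

```python
import bisect
import heapq

class OrderedFileWithOverflow:
    def __init__(self, sorted_keys):
        self.main = list(sorted_keys)  # ordered main file
        self.overflow = []             # unordered overflow file

    def insert(self, key):
        """Cheap insert: append to the end of the overflow file."""
        self.overflow.append(key)

    def contains(self, key):
        """Binary search in the main file, then linear search in the overflow."""
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        return key in self.overflow

    def reorganise(self):
        """Sort the overflow file and merge it into the main file."""
        self.main = list(heapq.merge(self.main, sorted(self.overflow)))
        self.overflow = []

f = OrderedFileWithOverflow([1, 2, 3, 4, 6, 7, 10, 12, 14, 16, 23, 24, 27, 35])
for key in (15, 9, 17):
    f.insert(key)
```

Search stays close to log2(b) only while the overflow file is small, which is why the merge has to be run periodically.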

Access Properties of Simple Files

Note: in this and the following examples, record numbers correspond to values of the ordering field in ascending order. Each row of the diagrams is one block (page) holding up to four records.

Heap (sequential unordered) file:
  R4    R2    R3    R16
  R1    R7    R35   R10
  R14   R12   R23   R6
  R24   R27

Ordered (sequential) file:
  R1    R2    R3    R4
  R6    R7    R10   R12
  R14   R16   R23   R24
  R27   R35


Insert record R15 into the Heap file

Before insertion:
  R4    R2    R3    R16
  R1    R7    R35   R10
  R14   R12   R23   R6
  R24   R27

After insertion (R15 is appended to the last block):
  R4    R2    R3    R16
  R1    R7    R35   R10
  R14   R12   R23   R6
  R24   R27   R15


Insert record R15 into the Ordered file

Before insertion:
  R1    R2    R3    R4
  R6    R7    R10   R12
  R14   R16   R23   R24
  R27   R35

After insertion:
  R1    R2    R3    R4
  R6    R7    R10   R12
  R14   R15   R16   R23
  R24   R27   R35

Notice that all records after R15 have changed their page location or position on the page


Insert records R15, R9, R17 into the Ordered file using an overflow file

Before insertion:
  Main file:
    R1    R2    R3    R4
    R6    R7    R10   R12
    R14   R16   R23   R24
    R27   R35
  Overflow file: (empty)

After insertion:
  Main file (unchanged):
    R1    R2    R3    R4
    R6    R7    R10   R12
    R14   R16   R23   R24
    R27   R35
  Overflow file:
    R15   R9    R17

Periodically the overflow file is sorted and merged with the main file


Deletions from a Heap: R10, R3, R7

Simple delete

Before delete operations:
  R4      R2      R3      R16
  R1      R7      R35     R10
  R14     R12     R23     R6
  R24     R27

After delete operations:
  R4      R2      (free)  R16
  R1      (free)  R35     (free)
  R14     R12     R23     R6
  R24     R27


Deletions from a Heap: R10, R3, R7

Using the deletion marker technique (the marker follows each record)

Before delete operations:
  R4 1    R2 1    R3 1    R16 1
  R1 1    R7 1    R35 1   R10 1
  R14 1   R12 1   R23 1   R6 1
  R24 1   R27 1

After delete operations:
  R4 1    R2 1    R3 0    R16 1
  R1 1    R7 0    R35 1   R10 0
  R14 1   R12 1   R23 1   R6 1
  R24 1   R27 1

Deletion markers are set to '0'; these records will be physically removed later, when the file is reorganised


Deletions from ordered file: R10, R3, R7

Simple delete

Before delete operations:
  R1    R2    R3    R4
  R6    R7    R10   R12
  R14   R16   R23   R24
  R27   R35

After delete operations:
  R1    R2    R4    R6
  R12   R14   R16   R23
  R24   R27   R35


Deletions from ordered file: R10, R3, R7

Using the deletion marker technique (the marker follows each record)

Before delete operations:
  R1 1    R2 1    R3 1    R4 1
  R6 1    R7 1    R10 1   R12 1
  R14 1   R16 1   R23 1   R24 1
  R27 1   R35 1

After delete operations:
  R1 1    R2 1    R3 0    R4 1
  R6 1    R7 0    R10 0   R12 1
  R14 1   R16 1   R23 1   R24 1
  R27 1   R35 1

Deletion markers are set to '0'; these records will be physically removed later, when the file is reorganised

Retrieval and Update in Heaps

Quick summary
  can only use linear search
  insertion is fast
  deletion, update are slow
  parameter search (e.g. SELECT...WHERE) is slow
  unconditional search can be fast if
    records are of fixed length
    records do not span blocks:
      the j-th record (counting from 0) is located by position, in block floor(j / bfr)
  average time to find a single record = b / 2 block transfers (b = number of blocks)
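Locating a record by position can be sketched as follows, assuming fixed-length, unspanned records and counting records, blocks and slots from 0. The function names and the example sizes (B = 1024, R = 150, so bfr = 6) are illustrative.

```python
def locate(j, bfr):
    """Return (block_number, slot_in_block) of the j-th record."""
    return divmod(j, bfr)

def byte_offset(j, bfr, block_size, record_size):
    """Byte position of the j-th record from the start of the file."""
    block, slot = locate(j, bfr)
    return block * block_size + slot * record_size

print(locate(13, 6))                  # block and slot of record 13
print(byte_offset(13, 6, 1024, 150))  # its byte offset in the file
```

No search is needed here: one division gives the block to read.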

Retrieval & Update in Sorted Files

Quick summary
  retrieval on the key field is fast - the "next" record is nearby
  any other retrieval either requires a sort, or an index, or is as slow as a heap
  update, delete, insert are slow (find block, update block, rewrite block)

FAST ACCESS FOR DATABASE: HASHING

Types of hashing: static or dynamic
What is the point of hashing?
  reduce a large address space
  provide close to direct access
  provide reasonable performance for all of update, insert, delete, select (U, I, D, S)
What is a hash function?
  properties
  behaviour
Collisions
Collision resolution
Open addressing
Summary

What Is Hashing, and What Is It For?

"direct" access to the block containing the desired record
reduce the number of blocks read or written
allow for file expansion and contraction with minimal file reorganising
permit retrieval on "hashed" fields without re-sorting the file
no need to allocate contiguous disk areas
if the file is small, internal hashing; otherwise external
no direct access other than by hashing

A basic example of hashing:

There are 25 rows of seats, with 3 seats per row (75 seats total)
We have to allocate each person to a row in advance, at random
We will hash on their family name so as to find the person's row number directly, knowing only the name
The database is logically a single table
  ROOM (Name, Age, Attention)
implemented as a blocked, hashed file

The hashing process

The hash process is:
  Loc = 0
  Until no more characters in YourName
    Add the alphabetic position of the character to Loc
  Calculate RowNum = Loc mod 25

A   B   C   D   E   F   G   H   I   J   K   L   M
1   2   3   4   5   6   7   8   9   10  11  12  13

N   O   P   Q   R   S   T   U   V   W   X   Y   Z
14  15  16  17  18  19  20  21  22  23  24  25  26

Examples - Hashed Names

NAME         Loc   RowNum (Loc mod 25)
REYE          53    3
ANDERSON      90   15
MCWILLIAM     95   20
TANG          42   17
LEE           22   22

Where is MCWILLIAM? Hash(MCWILLIAM) = Row 20
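The hash process can be written as a short function. This is a sketch: the name `name_hash` is hypothetical, and ignoring any character outside A to Z is an assumption that the slides do not state.

```python
def name_hash(name, rows=25):
    """Return (Loc, RowNum): sum of alphabetic positions, and that sum mod rows."""
    loc = 0
    for ch in name.upper():
        if "A" <= ch <= "Z":  # assumption: non-letters contribute nothing
            loc += ord(ch) - ord("A") + 1
    return loc, loc % rows

for name in ("REYE", "ANDERSON", "MCWILLIAM", "TANG", "LEE"):
    print(name, *name_hash(name))
```

Run on the five names of the example table, the function gives the Loc and RowNum values listed there.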

Name Hashing Example continued

Name    Row    Name    Row    Name    Row    Name    Row
Lee     22     Alex    17     George  7      Anne    9
West    17     Rita    23     Guy     3      Will    6
James   23     Jodie   18     Dave    7      Wilf    0
Anna    5      Jill    18     Don     8      Walt    6
Anita   18     Lily    8      Dixy    12     Jack    0
Jie     24     Ash     3      Jon     14     Lana    3
Kenny   19     Ben     21     Nina    13     Olga    10
Marie   21     Kay     12     May     14     Fred    8
Lois    5      Peter   14     Max     13     Tania   18
Best    21     Paul    0      Nora    23     Tom     23
Rob     10     Phil    20     Cash    6      Julia   3
Lou     23     Pat     11     Foot    6      Leah    6
Axel    17     Ed      9      Tan     10     Ling    17

The results after 52 arrivals

(#Sitting = people seated in the row; Alloc + = people hashed to the row beyond its 3 seats)

RowNo  #Sitting  Alloc +      RowNo  #Sitting  Alloc +
0      3                      13     2
1      0                      14     3
2      0                      15     0
3      3         1            16     0
4      0                      17     3         1
5      2                      18     3         1
6      3         2            19     1
7      2                      20     1
8      3                      21     3
9      2                      22     1
10     3                      23     3         2
11     1                      24     1
12     2

The Room as a Hashed File

Each person has a hash key - the name
Each person is a record
Each row is a hardware block (bucket)
Each row number is the address of a bucket
Records here are fixed-length (and 3 records per block)
The leftover people are collisions (key collisions)
They will have to be found a seat by collision resolution
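A small sketch of the room as a set of buckets, counting for each row how many people get a seat and how many are left over. The names `row_of` and `occupancy` are hypothetical, and the ten names used are a subset of the arrivals, chosen to fill rows 3 and 23.

```python
from collections import Counter

SEATS_PER_ROW = 3  # records per bucket in the room example

def row_of(name, rows=25):
    """Sum of alphabetic positions (A=1 ... Z=26) modulo the number of rows."""
    return sum(ord(c) - ord("A") + 1 for c in name.upper() if "A" <= c <= "Z") % rows

def occupancy(names):
    """Map each used row to (seated, leftover) for the given arrivals."""
    arrivals = Counter(row_of(name) for name in names)
    return {
        row: (min(count, SEATS_PER_ROW), max(0, count - SEATS_PER_ROW))
        for row, count in arrivals.items()
    }

people = ["Rita", "James", "Nora", "Tom", "Lou", "Guy", "Ash", "Lana", "Julia", "Lee"]
print(occupancy(people))
```

Five people hash to row 23 and four to row 3, so with three seats per row two and one of them, respectively, are collisions that still need a seat.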

Collision Resolution

Leave an empty seat in each row
  Under-population: blocks are 66% full
A notice on the end of the row: "extra seat for row N can be found at the rear exit"
  the bucket's overflow chain points to an overflow page containing the record
"Everyone stand up while we reallocate seats"
  file reorganisation

Collision Resolution

Strategy 1 (open addressing):
  "Nora" is the 4th arrival for row 23
  Place the new arrival in the next higher-numbered block with a vacancy

Retrieval - search for "Nora":
  Retrieve block 23
  Read the records in block 23 consecutively. If "Nora" is not found, try block 24...

Disadvantages:
  May need to read the whole file consecutively on some keys
  Blocks will gradually fill up with out-of-place records
  Deletions cause either immediate or periodic reorganisation
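Strategy 1 can be sketched as below. The class `OpenAddressedRoom` is hypothetical. Wrapping around from the last row to row 0 is an assumption, since the slide only says "next higher". The early stop in `find` is valid only while no records have been deleted.

```python
def row_of(name, rows=25):
    """Sum of alphabetic positions (A=1 ... Z=26) modulo the number of rows."""
    return sum(ord(c) - ord("A") + 1 for c in name.upper() if "A" <= c <= "Z") % rows

class OpenAddressedRoom:
    def __init__(self, rows=25, seats=3):
        self.seats = seats
        self.buckets = [[] for _ in range(rows)]

    def insert(self, name):
        """Place the record in its home block or the next block with a vacancy."""
        n = len(self.buckets)
        home = row_of(name, n)
        for step in range(n):
            row = (home + step) % n  # assumed wrap-around after the last row
            if len(self.buckets[row]) < self.seats:
                self.buckets[row].append(name)
                return row
        raise RuntimeError("file is full")

    def find(self, name):
        """Return (row or None, blocks_read), probing from the home block."""
        n = len(self.buckets)
        home = row_of(name, n)
        for step in range(n):
            row = (home + step) % n
            if name in self.buckets[row]:
                return row, step + 1
            if len(self.buckets[row]) < self.seats:
                return None, step + 1  # a vacancy ends the probe (no deletions assumed)
        return None, n

room = OpenAddressedRoom()
placed = [room.insert(name) for name in ("Rita", "James", "Tom", "Nora")]
```

With Rita, James and Tom already in row 23, Nora is placed in row 24, and finding her costs two block reads instead of one.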

Collision Resolution

Strategy 2: reserve some rows (buckets) for overflow
  Blocks 25, 26 and 27, or recalculate the hash function for a smaller mod, say 20 instead of 25
  "Julia" is then the 4th arrival for block 3
  Place her in the lowest-numbered overflow block with available space (e.g. 26), optionally placing a pointer in bucket 3 that points to block 26

Retrieval - search for "Julia":
  Retrieve block 3
  Read the records in block 3 consecutively. If "Julia" is not found, either:
    search the overflow blocks consecutively, or
    follow the pointer to block 26 (chaining)

Disadvantages:
  The overflow area gradually fills up, giving longer retrieval times
  Deletions/additions cause periodic reorganisation
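Strategy 2 with chaining can be sketched as follows. The class `ChainedRoom` is hypothetical, pointers are kept per block rather than per record to keep it short, and deletions are not handled. Because the overflow area starts empty here, Julia lands in block 25 rather than the slide's block 26.

```python
def row_of(name, rows=25):
    """Sum of alphabetic positions (A=1 ... Z=26) modulo the number of rows."""
    return sum(ord(c) - ord("A") + 1 for c in name.upper() if "A" <= c <= "Z") % rows

class ChainedRoom:
    def __init__(self, rows=25, seats=3, overflow_rows=3):
        self.rows, self.seats = rows, seats
        total = rows + overflow_rows  # overflow blocks are numbered 25, 26, 27
        self.buckets = [[] for _ in range(total)]
        self.next = [None] * total    # overflow pointer of each block

    def _free_overflow_block(self):
        """Lowest-numbered overflow block that still has space."""
        for block in range(self.rows, len(self.buckets)):
            if len(self.buckets[block]) < self.seats:
                return block
        raise RuntimeError("overflow area is full; reorganise the file")

    def insert(self, name):
        block = row_of(name, self.rows)
        while len(self.buckets[block]) >= self.seats:
            if self.next[block] is None:  # full and no chain yet: add a pointer
                self.next[block] = self._free_overflow_block()
            block = self.next[block]
        self.buckets[block].append(name)
        return block

    def find(self, name):
        """Follow the chain from the home block; returns (block or None, blocks_read)."""
        block, blocks_read = row_of(name, self.rows), 0
        while block is not None:
            blocks_read += 1
            if name in self.buckets[block]:
                return block, blocks_read
            block = self.next[block]
        return None, blocks_read

room = ChainedRoom()
placed = [room.insert(name) for name in ("Guy", "Ash", "Lana", "Julia")]
```

Following the pointer costs one extra block read for Julia, instead of a consecutive search of the whole overflow area.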

Collision Resolution: More formally
Open addressing: If the location specified by the hash address is occupied, then the subsequent positions are checked in order until an unused (empty) position is found.
Chaining: Various overflow locations are kept, and a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
Multiple hashing: A second hash function is applied if the first results in a collision.
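The three strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: keys are non-negative integers, the primary hash is key mod m, the function names are invented for the example, and the second hash function `(key // m) % m` is an arbitrary choice that the slides do not specify.

```python
def insert_open_addressing(table, key):
    """Open addressing: try the hash address, then the following slots in order."""
    m = len(table)
    home = key % m
    for step in range(m):
        slot = (home + step) % m          # wrap around at the end of the table
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def insert_chaining(keys, ptrs, m, key):
    """Chaining: slots 0..m-1 are hash addresses, later slots are overflow.

    ptrs[i] is the pointer field of location i (None means end of chain).
    This variant links the new record at the end of the chain.
    """
    home = key % m
    if keys[home] is None:
        keys[home] = key
        return home
    loc = home
    while ptrs[loc] is not None:          # walk to the last record in the chain
        loc = ptrs[loc]
    keys.append(key)                      # take an unused overflow location
    ptrs.append(None)
    ptrs[loc] = len(keys) - 1             # point the chain at it
    return ptrs[loc]


def insert_multiple_hashing(table, key):
    """Multiple hashing: apply a second hash function if the first collides."""
    m = len(table)
    for h in (lambda k: k % m, lambda k: (k // m) % m):   # second one is assumed
        slot = h(key)
        if table[slot] is None:
            table[slot] = key
            return slot
    # A fuller scheme would fall back to another strategy at this point.
    raise RuntimeError("both hash addresses are occupied")
```

With m = 5, the keys 3, 8 and 13 all hash to address 3, so each function has to resolve two collisions, and each resolves them differently.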

Performance on Hashed Files
Retrieve (SELECT): very fast if the name (the hash key) is known, otherwise hopeless
  SELECT * FROM ROOM WHERE NAME = 'McWilliam'    (uses hash - fast)
  SELECT * FROM ROOM WHERE AGE > 30    (can't use hash - slow)
Update: same
  UPDATE ROOM SET ATTENTION = 'low' WHERE NAME = 'McWilliam'
  UPDATE ROOM SET ATTENTION = 'high' WHERE AGE > 50 OR AGE < 10

Performance on Hashed Files
Delete: same as SELECT, UPDATE
  DELETE FROM ROOM WHERE NAME = 'Nora'    (uses hash - fast)
  DELETE FROM ROOM WHERE NAME LIKE 'No%'    (can't use hash - slow)
Insert: unpredictable (fast if the target bucket has space, slower if a collision must be resolved)
  INSERT INTO ROOM VALUES ('Smyth', 'high')

Internal Hashing
Internal hashing is used as an internal search structure within a program whenever a group of records is accessed exclusively by using the value of one field.
  Applicable to smaller files
  Hashed in main memory: fast lookup in store
  R records, R-length array
  Hash function transforms the key field into an array subscript in the range 0 to R - 1
    hash(Key Value) = Key Value (mod R)
  The subscript is the record address in store
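A minimal sketch of internal hashing, assuming integer key values and an illustrative array size R = 5. The record layout (Name, Number, Class, Major) follows the STUDENT example earlier in the slides; the helper names `put` and `get` are made up for the illustration, and collisions are only detected here, not resolved.

```python
R = 5                      # assumed array size for the example
store = [None] * R         # the in-memory array of R record slots


def hash_address(key_value, r=R):
    """hash(Key Value) = Key Value (mod R): a subscript in the range 0 to R - 1."""
    return key_value % r


def put(record, key_field=1):
    """Store a record at the subscript given by its key field."""
    slot = hash_address(record[key_field])
    if store[slot] is not None:
        raise ValueError("collision: a resolution strategy is needed")
    store[slot] = record
    return slot


def get(key_value, key_field=1):
    """Go straight to the computed subscript and check the key stored there."""
    record = store[hash_address(key_value)]
    if record is not None and record[key_field] == key_value:
        return record
    return None
```

Storing ("Smith", 17, 1, "CS") places it at subscript 17 mod 5 = 2, and a later `get(17)` inspects only that one slot.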

External Hashing
Hashing for disk files is called external hashing.
The address space is made of buckets, each of which holds multiple records.
A bucket is either one disk block or a cluster of contiguous blocks.
The hashing function maps a key into a relative bucket number, rather than an absolute block address.
A table maintained in the file header converts the bucket number into the corresponding disk block address.
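The two-step address calculation can be illustrated as follows. The number of buckets and the block addresses in `header_table` are invented values, used only to show the indirection from key to bucket number to disk block address.

```python
M = 4                                        # assumed number of buckets
header_table = [1040, 1048, 2312, 1056]      # hypothetical block address per bucket


def bucket_number(key):
    """Hashing function: key -> relative bucket number in 0..M-1."""
    return key % M


def block_address(key):
    """File-header table: bucket number -> disk block address."""
    return header_table[bucket_number(key)]
```

Because the hash function yields only a relative bucket number, a bucket can be moved on disk by updating one entry in the header table, without changing the hash function.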

External Hashing (static)
The hashing scheme is called static hashing if a fixed number of buckets M is allocated.
To retrieve a record by a search condition on the key, the hashing function is applied to the key to find the number of the bucket that may contain the record, and that bucket is then examined for the record. If the record is not in that bucket, the search continues in the overflow buckets.

External Hashing (static)
Construction of hashed file
  Identify the size of the file, choose the hashing function (according to the anticipated number of buckets) and select the collision resolution procedure - for the life of the file
  Apply the hashing function to each inserted record to get the bucket number and place the record in the bucket with that number

External Hashing (static, continued)
  If the bucket is full then apply the selected collision resolution procedure
  If the number of records in overflow buckets is large and/or the distribution of records in buckets is highly non-uniform, then reorganise the file using a changed hashing function (tuning)
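The construction and search steps for a static hashed file can be sketched as below. This is a simplified model, not a disk implementation: buckets are Python lists, each primary bucket has one overflow list, keys are integers hashed by mod M, and the class and method names are chosen for this example. The `overflow_ratio` method is an assumed way of measuring when tuning might be worthwhile.

```python
class StaticHashFile:
    def __init__(self, num_buckets, capacity):
        self.m = num_buckets                  # fixed for the life of the file
        self.capacity = capacity              # records per primary bucket
        self.primary = [[] for _ in range(num_buckets)]
        self.overflow = [[] for _ in range(num_buckets)]

    def insert(self, key):
        b = key % self.m                      # bucket number from the hash function
        if len(self.primary[b]) < self.capacity:
            self.primary[b].append(key)
        else:
            self.overflow[b].append(key)      # bucket full: use its overflow area
        return b

    def search(self, key):
        b = key % self.m
        if key in self.primary[b]:            # examine the hashed bucket first
            return ("primary", b)
        if key in self.overflow[b]:           # then continue in overflow
            return ("overflow", b)
        return None

    def overflow_ratio(self):
        """Fraction of records in overflow: a rough signal that tuning is due."""
        in_overflow = sum(len(o) for o in self.overflow)
        total = in_overflow + sum(len(p) for p in self.primary)
        return in_overflow / total if total else 0.0
```

With 5 buckets of capacity 2, inserting 3, 8 and 13 fills primary bucket 3 and pushes 13 into overflow, so one third of the records are then in overflow.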

External Hashing (static)
[Figure: a key is passed to hash function H; H(key) mod N selects one of the N primary buckets, numbered 0 to N-1, and a primary bucket may link to an overflow page.]

Problems of Static Hashing
Number of buckets is fixed
  shrinkage causes wasted space
  growth causes long overflow chains
Solutions:
  reorganise
  re-hash
  use dynamic hashing...

Extendible Hashing
Previously, to insert a new record into a full bucket:
  add an overflow page, or
  reorganise by doubling the bucket allocation and redistributing the records
This is a poor solution:
  the entire file is read
  twice as many pages have to be written
Solution: Extendible Hashing
  add a directory of pointers to buckets
  double the number of buckets by doubling the directory
  split only the bucket that has overflowed
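A compact sketch of the directory idea, under several assumptions that the slides do not state: keys are distinct non-negative integers, the directory is indexed by the low-order bits of the key, and each bucket records a local depth (the number of bits it actually distinguishes). The class and attribute names are illustrative.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHashFile:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]   # pointers to buckets

    def _index(self, key):
        # Directory slot = the low-order global_depth bits of the key.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # Double the directory only: both halves point at the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split only the bucket that overflowed.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the old keys plus the new one between the two halves.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:
            self.insert(k)
```

Inserting 0, 2 and 4 with capacity 2 overflows the bucket for keys ending in bit 0. The directory doubles from 2 to 4 entries, but only that bucket is split; the untouched bucket is simply shared by two directory entries.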

LINEAR HASHING
It does not have a directory at all
Instead, it has a “family” of hashing functions to manage dynamic expansion and contraction of the file
Start with a set number of M buckets 0..M-1 with hashing function mod M
Split them in linear order when more space is needed. The next hashing function is mod 2M, and subsequent ones are mod 4M, mod 8M etc. as required
Example: Block capacity is 2 records. Records with values 72, 62, 32 all hash to bucket 2 under hashing function (mod 10) and overflow it, but under the next hashing function (mod 20) they fit: bucket 12 holds 72 and 32, and bucket 2 holds 62.
Combines controlled overflow with new space acquisition
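The splitting scheme can be sketched as follows, assuming integer keys, buckets modelled as Python lists (a list longer than the capacity stands for an overflow chain), and a policy of splitting one bucket whenever an insert overflows. The split policy and all names are assumptions made for the illustration.

```python
class LinearHashFile:
    def __init__(self, m=10, capacity=2):
        self.m = m                  # initial number of buckets M
        self.level = 0              # current functions: mod 2**level * M and twice that
        self.next = 0               # next bucket to split, in linear order
        self.capacity = capacity
        self.buckets = [[] for _ in range(m)]

    def address(self, key):
        n = self.m * 2 ** self.level
        b = key % n
        if b < self.next:           # bucket already split in this round:
            b = key % (2 * n)       # use the next hashing function
        return b

    def insert(self, key):
        b = self.address(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > self.capacity:
            self.split_next()       # controlled overflow: split bucket `next`,
                                    # not necessarily the one that overflowed

    def split_next(self):
        n = self.m * 2 ** self.level
        old = self.buckets[self.next]
        self.buckets.append([])     # new bucket lands at index next + n
        self.buckets[self.next] = [k for k in old if k % (2 * n) == self.next]
        self.buckets[self.next + n] = [k for k in old if k % (2 * n) != self.next]
        self.next += 1
        if self.next == n:          # every bucket of this round has been split
            self.level += 1
            self.next = 0

    def search(self, key):
        return key in self.buckets[self.address(key)]
```

Inserting 72, 62 and 32 with M = 10 overflows bucket 2 but splits bucket 0, because splits proceed in linear order. Once the split pointer has passed bucket 2, its records are rehashed with mod 20, leaving 62 in bucket 2 and moving 72 and 32 to bucket 12.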

Linear Hashing - Advantages
Another type of dynamic hashing
  does not require a directory
  manages collisions well
  accommodates insertions and deletions well
  allows overflow chain length to be traded against average space utilisation
  uses several hash functions

HASHING SUMMARY
Comparison of Simple and Hashed Files
Pages in a hashed file are grouped into buckets (1 block or a cluster of contiguous blocks)
  reduce a large address space
  provide close to direct access
Static hashing has some disadvantages which are addressed in dynamic hashing solutions (extendible, linear)
Static hashed files are kept at about 80% occupancy then reorganised. New pages are added when each existing page is about 80% full
Hence, the time to read the entire file is about 1.25 times that of a non-hashed file (1 / 0.8 = 1.25)
Dynamic hashing provides flexibility in usage of file storage space (expansion and contraction)
