CMPT 454: Data Storage and Disk Access
Data Storage and Disk Access
Memory hierarchy
Hard disks
▪ Architecture
▪ Processing requests
▪ Writing to disk
▪ Hard disk reliability and efficiency
RAID
Solid State Drives
Buffer management
Data storage
Memory
DBMS and Memory
[Diagram: the memory hierarchy as seen by the DBMS – cache, main memory, disk (virtual memory and the file system), and tertiary storage.]
Memory Hierarchy
Primary memory: volatile
▪ Main memory
▪ Cache
Secondary memory: non-volatile
▪ Solid State Drive (SSD)
▪ Magnetic Disk (Hard Disk Drive, HDD)
Tertiary memory: non-volatile
▪ CD/DVD
▪ Tape – sequential access
▪ Usually used as backup or for long-term storage
(Cost and speed both increase towards the top of the hierarchy.)
Main Memory vs Secondary Storage
Speed
▪ Main memory is much faster than secondary memory
▪ 10 – 100 nanoseconds (0.00001 to 0.0001 milliseconds) to move data in main memory
▪ 10 milliseconds to read a block from an HDD
▪ 0.1 milliseconds to read a block from an SSD
Cost
▪ Main memory is around 100 times more expensive than secondary memory
▪ SSDs are more expensive than HDDs
Main Memory vs Secondary Storage
System limitations
▪ On a 32 bit system only 2^32 bytes can be directly referenced
▪ Many databases are larger than that
Volatility
▪ Data must be maintained between program executions, which requires non-volatile memory
▪ Non-volatile storage retains its contents when the device is turned off, or if there is a power failure
▪ Main memory is volatile, secondary storage is not
Hard Disk Drives
Magnetic Disks
Database data is usually stored on disks
▪ A database will often be too large to be retained in main memory
▪ When a query is processed, data will need to be retrieved from storage
Data is stored on disk blocks
▪ Also referred to simply as blocks or, in relation to the OS, pages
▪ A contiguous sequence of bytes, and the unit in which data is written to and read from disk
▪ Block size is typically between 4 and 16 kilobytes
Magnetic Disk Structure
A hard disk consists of a number of platters
▪ A platter can store data on one or both of its surfaces, so is referred to as single-sided or double-sided
Surfaces are composed of concentric rings called tracks
▪ The set of all tracks with the same diameter is called a cylinder
Sectors are arcs of a track
▪ Typically 4 kilobytes in size
▪ Block size is set when the disk is initialized, usually a small multiple of the sector size (hence 4 to 16 kilobytes)
Diagram of a Disk
[Diagram: platters and their surfaces, with a track and a cylinder labelled.*]
*statistics for Western Digital Caviar Black 1 TB hard drive
Disk Heads
Data is transferred to or from a surface by a disk head
▪ There is one disk head for each surface
▪ The disk heads are moved as a unit (called a disk head array)
▪ Therefore all the heads are in identical positions with respect to their surfaces
To read or write a block, a disk head must be positioned over it
Only one disk head can read or write at a time
Disk Anatomy
[Diagram: the disk head array moves in and out over the platters to reach a track; the disk spins at around 7,200 rpm.]
Disk Controller
Disk drives are controlled by a processor called a disk controller, which
▪ Controls the actuator that moves the head assembly
▪ Selects sectors and determines when the disk has rotated to a sector
▪ Transfers data between the disk and main memory
Some controllers buffer data from tracks in the expectation that the data will be required
Accessing Data in a Disk
The disk constantly spins, at 7,200 rpm*
* Western Digital Caviar Black 1 TB hard drive (again)
The head pivots over the desired track
The desired block is read as it passes underneath the head
Accessing A Block
The disk head is moved in or out to the track
▪ This seek time is typically 10 milliseconds
▪ WD Caviar Black 1TB: 8.9 ms
Wait until the block rotates under the disk head
▪ This rotational delay is typically 4 milliseconds
▪ WD Caviar Black 1TB: 4.2 ms
The data on the block is transferred to memory
▪ This transfer time is the time it takes for the block to completely rotate past the disk head
▪ Typically less than 1 millisecond
Transfer Time
The seek time and rotational delay depend on where the disk head is before the request, which track is being requested, and how far the disk has to rotate
The transfer time depends on the request size
▪ The transfer time (in ms) for one block equals (60,000 / disk rpm) / blocks per track
▪ The transfer time (in ms) for an entire track equals 60,000 / disk rpm
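As a rough illustration, the sketch below puts these formulas together to estimate the average time to access one block. The 7,200 rpm and 10 ms seek figures come from the slides; the 500 blocks-per-track value is only an assumed example.

```python
def block_access_time_ms(rpm=7200, blocks_per_track=500, seek_ms=10.0):
    """Estimate the average time (ms) to read one block from a hard disk."""
    rotation_ms = 60_000 / rpm          # 60,000 ms per minute / revolutions per minute
    rotational_delay = rotation_ms / 2  # expected wait: half a revolution
    transfer_ms = rotation_ms / blocks_per_track
    return seek_ms + rotational_delay + transfer_ms

# At 7,200 rpm a revolution takes ~8.33 ms, so the expected rotational delay
# is ~4.17 ms, matching the "typically 4 milliseconds" figure above.
print(round(block_access_time_ms(), 2))
```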
Main Memory versus Disk
Typical access time for a block on a hard disk: 15 milliseconds
Typical access time for a main memory frame: 60 nanoseconds
What's the difference?
▪ 1 millisecond = 1,000,000 nanoseconds, so 60 ns = 0.00006 ms
▪ Accessing a hard drive is around 250,000 times slower than accessing main memory
Reducing Disk Access Time
Disk latency (access time) has three components: seek time + rotational delay + transfer time
The overall access time can be shortened by reducing, or even eliminating, seek time and rotational delay
▪ Related data should be stored in close proximity
Accessing two records in adjacent blocks on a track
▪ Seek the desired track, rotate to the first block, and transfer two blocks = 10 + 4 + 2*1 = 16 ms
Accessing two records on different tracks
▪ Seek the desired track, rotate to the block, and transfer the block, then repeat = (10 + 4 + 1)*2 = 30 ms
Order of Closeness
What does it mean to say that related data should be stored close to each other?
▪ The term close refers not to physical proximity but to how the access time is affected
In order of closeness:
▪ Same block
▪ Adjacent blocks on the same track
▪ Same track
▪ Same cylinder, but different surfaces
▪ Adjacent cylinders
▪ …
Which is Closer
[Diagram: block 1 on one track, block 2 on the adjacent track of the same surface, and block 3 on the same cylinder as 1 but on a different surface.]
Is 2 or 3 "closer" to 1?
▪ 2 is in the adjacent track, and is clearly physically closer, but the disk head must be moved to access it
▪ 3 is in the same cylinder, so the disk head does not have to be moved
▪ Which is why 3 is closer
Fulfilling Disk Requests
A fair algorithm would take a first-come, first-served approach
▪ Insert requests in a queue and process them in the order in which they are received
[Diagram: six requests, for cylinders 2,000, 6,000, 14,000, 4,000, 16,000 and 10,000, numbered in the order in which they arrive.]

Cylinder | Received | Complete | Moved  | Total
2,000    | 0        | 5        | 2,000  | 2,000
6,000    | 0        | 14       | 4,000  | 6,000
14,000   | 0        | 27       | 8,000  | 14,000
4,000    | 10       | 43       | 10,000 | 24,000
16,000   | 20       | 60       | 12,000 | 36,000
10,000   | 30       | 72       | 6,000  | 42,000
Elevator Algorithm
The elevator algorithm usually performs better than FIFO
▪ Requests are buffered and the disk head moves in one direction, processing requests as it passes them
▪ The arm then reverses direction
[Diagram: the same six requests as before, for cylinders 2,000, 6,000, 14,000, 4,000, 16,000 and 10,000.]

Cylinder | Received | Complete | Moved | Total
2,000    | 0        | 5        | 2,000 | 2,000
6,000    | 0        | 14       | 4,000 | 6,000
14,000   | 0        | 27       | 8,000 | 14,000
16,000   | 20       | 35       | 2,000 | 16,000
10,000   | 30       | 46       | 6,000 | 22,000
4,000    | 30       | 58       | 6,000 | 28,000
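A minimal sketch of the elevator (SCAN) idea described above; the class and method names are illustrative, and the example assumes all six requests are pending at once (the table above also accounts for arrival times, so its service order differs).

```python
import bisect

class ElevatorScheduler:
    """Serve requests in cylinder order, sweeping in one direction at a time."""
    def __init__(self, start_cylinder=0):
        self.position = start_cylinder
        self.direction = +1          # +1 = moving towards higher cylinders
        self.pending = []            # sorted list of requested cylinders

    def add_request(self, cylinder):
        bisect.insort(self.pending, cylinder)

    def next_request(self):
        if not self.pending:
            return None
        if self.direction == +1:
            ahead = [c for c in self.pending if c >= self.position]
            if not ahead:                      # nothing ahead: reverse the arm
                self.direction = -1
                return self.next_request()
            chosen = ahead[0]
        else:
            behind = [c for c in self.pending if c <= self.position]
            if not behind:
                self.direction = +1
                return self.next_request()
            chosen = behind[-1]
        self.pending.remove(chosen)
        self.position = chosen
        return chosen

sched = ElevatorScheduler()
for cyl in (2000, 6000, 14000, 4000, 16000, 10000):
    sched.add_request(cyl)
print([sched.next_request() for _ in range(6)])   # one sweep: 2000 ... 16000
```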
Requests – Discussion
The elevator algorithm gives much better performance than FIFO on average
▪ And is a relatively fair algorithm
The elevator algorithm is not optimal
▪ The shortest-seek-first algorithm is closer to optimal but can result in a high variance in response time
▪ And may even result in starvation for distant requests
In some cases the elevator algorithm can perform worse than FIFO
Modifying a Record
To modify an existing record (on a disk) the following steps must be taken
▪ Read the record
▪ Modify the record in main memory
▪ Write the modified record back to disk
It is important to remember that the smallest unit of transfer to / from a disk is a block
▪ A single disk block usually contains many records
Read – Modify – Write Cycle
[Diagram, in three steps: a disk block containing the record "Landis#winner#Phonak#..." among other records is read into main memory; the desired record is modified in memory to "Landis#disq.#none#..."; the whole block is then written back to disk.]
Inserting Records
Consider creating a new record
▪ The user enters the data for the record through some application interface
▪ The record is created in main memory
▪ And then written to disk
Does this process require a read-modify-write cycle?
▪ YES! Because, otherwise, the existing contents of the disk block would be overwritten
Disk Failures
Intermittent failure
▪ Multiple attempts are required to read or write a sector
Media decay
▪ A bit or a number of bits are permanently corrupted and it is impossible to read a sector
Write failure
▪ A sector cannot be written to or retrieved
▪ Often caused by a power failure during a write
Disk crash
▪ The entire disk becomes unreadable
Checksums
An intermittent failure may result in incorrect data being read by the disk controller
▪ Such incorrect data can be detected by a checksum
▪ Each sector contains additional bits whose values are based on the data bits in the sector
▪ A simple single-bit checksum is to maintain an even parity on the sector
▪ If there is an odd number of 1s the parity is odd
▪ If there is an even number of 1s the parity is even
Parity Bits
Assume that there are seven data bits and a single checksum bit
▪ Data bits 0111011 – parity is odd
▪ The checksum bit is set to 1 so that the overall parity is even
Using a single checksum bit allows errors of only one bit to be detected reliably
Several checksum bits can be maintained to reduce the chance of failing to notice an error
▪ e.g. maintain 8 checksum bits, one for each bit position in the data bytes
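A minimal sketch of this single-bit parity scheme, using the seven data bits from the example above; the function names are illustrative.

```python
def parity_bit(data_bits):
    """Return the checksum bit that makes the overall parity even."""
    return sum(data_bits) % 2

def has_error(data_bits, checksum):
    """Detect a (single-bit) error: the stored bits should have even parity."""
    return (sum(data_bits) + checksum) % 2 != 0

data = [0, 1, 1, 1, 0, 1, 1]           # five 1s: odd parity
check = parity_bit(data)               # 1, so the overall parity is even
assert not has_error(data, check)

corrupted = data.copy()
corrupted[2] ^= 1                      # flip one bit
assert has_error(corrupted, check)     # a single-bit error is detected
```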
Stable Storage
Checksums can detect errors but can't correct them
Stable storage can be implemented on a disk to allow errors to be corrected
▪ Sectors are paired, with each pair representing a single sector
▪ The pairs are usually referred to as Left and Right
▪ Errors in a sector (L or R) are detected using checksums
Stable storage can cope with media failures and write failures
Stable Storage Policy
For writing, write the value of some sector X into XL
▪ Check that the value is correct (using checksums)
▪ If the value is not correct after a given number of attempts then assume that the sector has failed
▪ A spare sector should be substituted for XL
▪ Repeat the process for XR
For reading, XL and XR are read in turn until a correct value is returned
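The sketch below mirrors that policy under the stated assumptions; write_sector, read_sector, checksum_ok and substitute_spare stand in for the real disk and checksum routines and are purely illustrative, as is the retry limit.

```python
MAX_ATTEMPTS = 3   # assumed retry limit

def stable_write(value, write_sector, checksum_ok, substitute_spare):
    """Write value to both copies (XL then XR), retrying and substituting a
    spare sector if a copy repeatedly fails verification."""
    for copy in ("XL", "XR"):
        for _attempt in range(MAX_ATTEMPTS):
            write_sector(copy, value)
            if checksum_ok(copy):
                break
        else:                          # never verified: assume the sector failed
            substitute_spare(copy)
            write_sector(copy, value)

def stable_read(read_sector, checksum_ok):
    """Read XL and XR in turn until a copy with a good checksum is found."""
    for copy in ("XL", "XR"):
        value = read_sector(copy)
        if checksum_ok(copy):
            return value
    raise IOError("both copies of the sector are bad")
```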
RAID
Problems with Hard Disks
Hard disks act as bottlenecks for processing
▪ DB data is stored on disks, and must be fetched into main memory to be processed, and
▪ Disk access is considerably slower than main memory processing
There are also reliability issues with disks
▪ Disks contain mechanical components that are more prone to failure than electronic components
One solution is to use multiple disks
Multiple Disks
Single disk
▪ Multiple platters
▪ Disk heads are always over the same cylinder
Multiple disks
▪ Each disk contains multiple platters
▪ Disks can be read in parallel, and
▪ Different disks can read from different cylinders
▪ e.g. the first disk can access data from cylinder 6,000 while the second disk is accessing data from cylinder 11,000
Improving Efficiency
Using multiple disks to store data improves efficiency as the disks can be read in parallel
To satisfy a request, the physical disks and disk blocks that the data resides on must be identified
▪ The data may be on a single disk, or it may be split over multiple disks
The way in which data is distributed over the disks affects the cost of accessing it
▪ In the same way that related data should be stored close to each other on a single disk
Data Striping
A disk array gives the user the abstraction of a single, large disk
▪ When an I/O request is issued, the physical disk blocks to be retrieved have to be identified
▪ How the data is distributed over the disks in the array affects how many disks are involved in an I/O request
Data is divided into partitions called striping units
▪ The striping unit is usually either a block or a bit
▪ Striping units are distributed over the disks using a round robin algorithm
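A short sketch of the round-robin mapping from a logical striping unit to a (disk, position) pair; the four-disk array matches the illustrations that follow, but the helper function itself is just an assumption.

```python
def stripe_location(unit_number, num_disks=4):
    """Map a logical striping unit (numbered from 1) to its disk and its
    position on that disk, using round-robin placement."""
    disk = (unit_number - 1) % num_disks + 1      # disks numbered 1..num_disks
    position = (unit_number - 1) // num_disks     # 0-based offset on that disk
    return disk, position

# Units 1, 2, 3, 4 land on disks 1, 2, 3, 4; unit 5 wraps around to disk 1.
print([stripe_location(u) for u in range(1, 9)])
```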
Striping
[Diagram: a notional file is divided into numbered striping units of a given size; the striping units are distributed across the four disks of a RAID system in round-robin fashion. The size of the striping unit has an impact on the behaviour of the system.]

Striping Units – Block Striping
Assume that a file is to be distributed across a four disk RAID system using block striping, and that, purely for the sake of illustration, the block size is just one byte.
[Diagram: the numbers represent a sequence of individual bits in the file. With block striping each disk receives a whole block (eight bits) at a time: disk 1 holds bits 1–8, disk 2 bits 9–16, disk 3 bits 17–24, disk 4 bits 25–32, and so on.]

Striping Units – Bit Striping
Here is the same file distributed across a four disk RAID system, this time using bit striping; again, purely for the sake of illustration, the block size is just one byte.
[Diagram: with bit striping consecutive bits go to different disks: disk 1 holds bits 1, 5, 9, …, disk 2 bits 2, 6, 10, …, disk 3 bits 3, 7, 11, …, disk 4 bits 4, 8, 12, ….]
Disk Array Performance
Assume that a disk array consists of D disks
▪ Data is distributed across the disks using data striping
How does it perform compared to a single disk?
To answer this question we must specify the kinds of requests that will be made
▪ Random read – reading multiple, unrelated records
▪ Random write
▪ Sequential read – reading a number of records (such as one file or table), stored on more than D blocks
▪ Sequential write
The Basic Idea …
Use all D disks to improve efficiency, and distribute data using block striping
Random read performance
▪ Very good – up to D different records can be read at once, depending on which disks the records reside on
Random write performance – same as read performance
Sequential read performance
▪ Very good – as related data are distributed over all D disks, performance is D times faster than a single disk
Sequential write performance – same as read performance
But what about reliability …
Reliability
Hard disks contain mechanical components and are less reliable than other, purely electronic, components
▪ Increasing the number of hard disks decreases reliability, reducing the mean-time-to-failure (MTTF)
▪ The MTTF of a hard disk is 50,000 hours, or 5.7 years
In a disk array the overall MTTF decreases because the number of disks is greater
▪ The MTTF of a 100 disk array is 21 days: (50,000 / 100) / 24
▪ This assumes that failures occur independently, and
▪ That the failure probability does not change over time
Reliability is improved by storing redundant data
Redundancy
Reliability of a disk array can be improved by storing redundant data
If a disk fails, the redundant data can be used to reconstruct the data lost on the failed disk
▪ The redundant data can either be stored on a separate check disk, or
▪ Distributed uniformly over all the disks
Redundant data is typically stored using one of two methods
▪ Mirroring, where each disk is duplicated
▪ A parity scheme, where sufficient redundant data is maintained to recreate the data on any one disk
▪ Other redundancy schemes provide greater reliability
Parity Scheme
For each bit on the data disks there is a parity bit on a check disk
▪ If the sum of the data disks' bits is even, the parity bit is set to zero
▪ If the sum of the bits is odd, the parity bit is set to one
The data on any one failed disk can be recreated bit by bit
[Diagram: a 4 data disk system showing individual bit values, plus a 5th check disk containing the parity data for each bit position.]
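A small sketch of both sides of this scheme, computing a check disk from the data disks and rebuilding a failed disk from the survivors; the bit lists are illustrative.

```python
def parity_disk(data_disks):
    """Compute the check disk: one parity bit per bit position."""
    return [sum(column) % 2 for column in zip(*data_disks)]

def rebuild_disk(surviving_disks, check_disk):
    """Recreate a single failed data disk bit by bit: each missing bit is
    whatever value restores even parity across the group."""
    return [(sum(column) + p) % 2
            for column, p in zip(zip(*surviving_disks), check_disk)]

disks = [[0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
check = parity_disk(disks)                       # one parity bit per position
lost = disks.pop(2)                              # disk 3 fails
assert rebuild_disk(disks, check) == lost        # and is rebuilt exactly
```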
Parity Scheme Read and Write
Reading
▪ The parity scheme does not affect reading
Writing
▪ A naïve approach would be to calculate the new value of the parity bit from all the data disks
▪ A better approach is to compare the old and new values of the disk that is written to
▪ And change the value of a parity bit if the corresponding bits have changed
Introducing RAID
A RAID system consists of several disks organized to increase performance and improve reliability
▪ Performance is improved through data striping
▪ Reliability is improved through redundancy
▪ RAID stands for Redundant Arrays of Independent Disks
There are several RAID schemes or levels
▪ The levels differ in terms of their read and write performance, reliability, and cost
RAID Level 0
All D disks are used to improve efficiency, and data is distributed using block striping
No redundant information is kept
▪ Read and write performance is very good
▪ But reliability is poor
Unless data is regularly backed up, a RAID 0 system should only be used when the data is not important
A RAID 0 system is the cheapest of all RAID levels
▪ As there are no disks used for storing redundant data
Level 1: Mirrored
An identical copy is kept of each disk in the system, hence the term mirroring
Read performance is similar to a single disk
▪ There is no data striping, but parallel reads of the duplicate disks can be made, which improves random read performance
Write performance is worse than a single disk, as the duplicate disk also has to be written to
▪ Writes to the original and mirror should not be performed simultaneously in case there is a global system failure
▪ But write performance is superior to most other RAID levels
Very reliable but costly
▪ With D data disks, a level 1 RAID system has 2D disks
Level 1+0: Striping + Mirroring
Sometimes referred to as RAID level 10, combines both striping and mirroring
Very good read performance
▪ Similar to RAID level 0
▪ 2D times the speed of a single disk for sequential reads
▪ Up to 2D times the speed of a single disk for random reads
▪ Allows parallel reads of blocks that, conceptually, reside on the same disk
Poor write performance
▪ Similar to RAID level 1
Very reliable, but the most expensive RAID level
Writing and Redundant Data
Writing data is the Achilles heel of RAID systems
▪ Data and check disks should not be written to simultaneously
▪ Parity information may have to be read before check disks can be written to
In many RAID systems writing is less efficient than with a single disk!
Writing Parity Data
Sequential writes, or random writes in a RAID system using bit striping:
▪ Write to all D data disks, using a read-modify-write cycle
▪ Calculate the parity information from the written data
▪ Write to the check disk(s)
▪ A read-modify-write cycle is not required for the check disk(s)
Random writes in a system using block striping:
▪ Write to the data disk using a read-modify-write cycle
▪ Read the check disk(s), and calculate the new parity data
▪ Write to the check disk(s)
Striping Units Performance
A RAID system with D disks can read data up to D times faster than a single disk system
For sequential reads there is no performance difference between bit striping and block striping
Block striping is more efficient for random reads
▪ With bit striping all D disks have to be read to recreate a single record (and block) of the data file
▪ With block striping, a complete record is stored on one disk, so only one disk is required to satisfy a single random read
Write performance is similar, except that it is also affected by the parity scheme
Levels 2 and 3
Level 2 does not use the standard parity scheme
▪ It uses a scheme that allows the failed disk to be identified, increasing the number of disks required
▪ However, the failed disk can be detected by the disk controller, so this is unnecessary
▪ Can tolerate the loss of a single disk
Level 3 is Bit-Interleaved Parity
▪ The striping unit is a single bit
▪ Random read and write performance is poor, as all disks have to be accessed for each request
▪ Can tolerate the loss of a single disk
RAID Level 4
Uses block striping to distribute data over disks
Uses one redundant disk containing parity data
▪ The ith block on the redundant disk contains parity checks for the ith blocks of all data disks
Good sequential read performance
▪ D times single disk speed
Very good random read performance
▪ Disks can be read independently, up to D times single disk speed
RAID Level 4: Writing
When data is written, the affected block and the redundant disk must both be written to
To calculate the new value of the redundant disk
▪ Read the old value of the changed block
▪ Read the corresponding redundant disk block
▪ Write the new data block
▪ Recalculate the block of the redundant disk
To recalculate the redundant data, consider the changes in the bit pattern of the written data block
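A sketch of that incremental update: rather than reading every data disk, the new parity block is the old parity XORed with the bits that changed in the written block. The byte values are illustrative.

```python
def new_parity_block(old_data, new_data, old_parity):
    """RAID 4/5 style parity update for one block: wherever a bit of the data
    block changed, the corresponding parity bit must flip."""
    changed = bytes(o ^ n for o, n in zip(old_data, new_data))
    return bytes(p ^ c for p, c in zip(old_parity, changed))

old_data   = bytes([0b10110010])
new_data   = bytes([0b10010011])   # two bits changed
old_parity = bytes([0b01100001])
print(bin(new_parity_block(old_data, new_data, old_parity)[0]))
```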
RAID Level 4: Performance
Cost is moderate
▪ Only one check disk is required
The system can tolerate the loss of one drive
Write performance is poor for random writes
▪ Where different data disks are written independently
▪ For each such write, a write to the redundant disk is also required
Performance can be improved by distributing the redundant data across all disks – RAID level 5
Level 5: Block-Interleaved Distributed Parity
The dedicated check disk in RAID level 4 tends to act as a bottleneck for random writes
RAID level 5 does not have a dedicated check disk, but distributes the parity data across all disks
▪ This removes the bottleneck, increasing the performance of random writes
▪ Sequential write performance is similar to level 4
Cost is moderate, with the same effective space utilization as level 4
The system can tolerate the loss of one drive
Multiple Disk Crashes
RAID levels 4 and 5 can only cope with single disk crashes
▪ Therefore if multiple disks crash at the same time (or before a failed disk can be replaced) data will be lost
RAID level 6 allows systems to deal with multiple disk crashes
▪ These systems use more sophisticated error correcting codes
▪ One of the simpler error correcting codes is the Hamming Code
Hamming Code
Consider a system with seven disks, identified by the numbers 1 to 7
▪ Four of the disks are data disks, disks 1 to 4
▪ Three of the disks are redundant disks, disks 5 to 7
Each of the three check disks contains parity data for three of the four data disks
▪ Disk 5 contains parity data for disks 1, 2 and 3
▪ Disk 6 contains parity data for disks 1, 2 and 4
▪ Disk 7 contains parity data for disks 1, 3 and 4
Hamming Code Example

Disk:          1  2  3  4 | 5 (parity of 1,2,3)  6 (parity of 1,2,4)  7 (parity of 1,3,4)
Bit pattern 1: 1  1  1  0 | 1                    0                    0
Bit pattern 2: 1  1  0  1 | 0                    1                    0
Bit pattern 3: 1  0  1  1 | 0                    0                    1

Disks 1–4 hold the data; disks 5–7 hold the redundant (parity) data.
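A small sketch of this layout: each check disk holds the parity of its group of data disks. The group assignments follow the slide; the bit values are the first row of the example.

```python
# Which data disks each check disk covers (from the slide above).
GROUPS = {5: (1, 2, 3), 6: (1, 2, 4), 7: (1, 3, 4)}

def check_bits(data):
    """data maps disk number (1-4) to a bit; returns the bits for disks 5-7."""
    return {check_disk: sum(data[d] for d in members) % 2
            for check_disk, members in GROUPS.items()}

data = {1: 1, 2: 1, 3: 1, 4: 0}          # first row of the example
print(check_bits(data))                   # {5: 1, 6: 0, 7: 0}
```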
RAID Level 6
Reads are performed as normal
▪ Only the data disks are used
Writes are performed in a similar way to RAID level 4
▪ Except that multiple redundant disks may be involved
Cost is high, as more check disks are required
RAID Level 6 Recovery
If one disk fails, use the parity data to restore the failed disk, as in level 4
If two disks fail, then both disks can be rebuilt using three of the other disks, e.g.
▪ If disks 1 and 2 fail
▪ Rebuild disk 1 using disks 3, 4 and 7
▪ Rebuild disk 2 using disks 1, 3 and 5
▪ If disks 3 and 5 fail
▪ Rebuild disk 3 using disks 1, 4 and 7
▪ Rebuild disk 5 using disks 1, 2 and 3
Parity Scheme and Reliability
In real-life RAID systems the disk array is partitioned into reliability groups
▪ A reliability group consists of a set of data disks and a set of check disks
▪ The number of check disks depends on the reliability level that is selected
Consider a RAID system with 100 data disks and 10 check disks, i.e. 10 reliability groups
▪ The MTTF is increased from 21 days to 250 years!
Which RAID Level
Level 0 improves performance at the lowest cost but does not improve reliability
Level 1+0 is better than level 1 and has the best write performance
Levels 2 and 4 are always inferior to 3 and 5
▪ Level 3 is good for large transfer requests of several contiguous blocks, but bad for many small requests of a single disk block
Level 5 is a good general-purpose solution
Level 6 is appropriate if higher reliability is required
In practice the choice is usually between 0, 1, and 5
RAID Levels Comparison
The table that follows compares RAID levels, using RAID level 0 as a baseline
Comparisons of RAID systems vary depending on the metric used and the measurements of those metrics
▪ The three primary metrics are reliability, performance and cost
▪ These can be measured in I/Os per second, bytes per second, response time and so on
The comparison uses throughput per dollar for systems of equivalent file capacity
▪ File capacity is the amount of information that can be stored on the system, which excludes redundant data
RAID Levels Comparison
         Random Read   Random Write     Sequential Read   Sequential Write   Storage Efficiency
RAID 0   1             1                1                 1                  1
RAID 1   1             ½                1                 ½                  ½
RAID 3   1/G           1/G              (G-1)/G           (G-1)/G            (G-1)/G
RAID 4   (G-1)/G       max(1/G, ¼)      (G-1)/G           (G-1)/G            (G-1)/G
RAID 5   1             max(1/G, ¼)      1                 (G-1)/G            (G-1)/G
RAID 6   1             max(1/G, 1/6)    1                 (G-2)/G            (G-2)/G

G refers to the number of disks in a reliability group (both data disks and check disks)
RAID levels 10 and 2 are not shown
Solid State Drives
Solid State Drives
Solid State Drives (SSDs) use NAND flash memory and do not contain moving parts like an HDD
▪ Accessing an SSD does not require seek time or rotational latency, so SSDs are considerably faster
▪ Flash memory is non-volatile memory that is used by smartphones, mp3 players and thumb (or USB) drives
▪ NAND flash architecture is similar to a NAND (negated and) logic gate, hence the name
▪ NAND flash architecture is only able to read and write data one page at a time
There are two types of SSD
▪ Multi-level cell (MLC)
▪ Single-level cell (SLC)
MLC SSD
MLC cells can store multiple different charge levels
▪ And therefore more than one bit
▪ With four charge levels a cell can store 2 bits
Multiple threshold voltages make reading more complex but allow more data to be stored per cell
MLC SSDs are cheaper than SLC SSDs
▪ However, write performance is worse
▪ And their lifetimes are shorter
SLC SSD
SLC cells can only store a single charge level
▪ They are therefore on or off, and can contain only one bit
SLC drives are less complex
▪ They are more reliable and have a lower error rate
▪ They are faster, since it is easier to read or write a single charge value
SLC drives are more expensive
▪ And typically used for enterprise rather than home use
SSD Performance
Reads are much faster than HDDs since there are no moving parts
Writes are also faster than HDDs
▪ However, flash memory must be erased before it is written, and entire blocks must be erased
▪ Referred to as write amplification
The performance increase is greatest for random reads
Storing Data on a Disk
DBMS Structure
[Diagram: the layers of a DBMS – query evaluation sits on top of the file and access methods, which use the buffer manager and, below it, the disk space manager to reach the database on disk; the transaction manager, lock manager and recovery manager interact with these layers.]
Accessing Data
When an SQL command is evaluated, a request may be made for a DB record
▪ Such a request is passed to the buffer manager
▪ If the record is not stored in the (main memory) buffer, the page must be fetched from disk
The disk space manager provides routines for allocating, de-allocating, reading and writing pages
Disk Space Management
The disk space manager (DSM) keeps track of available disk space
▪ It is the lowest level of the DBMS architecture
▪ It supports the allocation and de-allocation of disk pages
Pages are abstract units of storage, mapped to disk blocks
▪ Reading and writing a page is performed in one disk I/O
▪ Sequences of pages are allocated to a contiguous sequence of blocks to increase access speed
The DSM hides the underlying details of storage
▪ Allowing higher level processes to consider the data to be a collection of pages
Tracking Free Blocks
A DB increases and decreases in size over time
▪ In addition to mapping pages to blocks, the DSM has to record which disk blocks are in use
▪ As time goes on, gaps in sequences of allocated blocks appear
Free blocks need to be recorded so that they can be allocated in the future, using either
▪ A linked list, whose head points to the first free block, or
▪ A bitmap, where each bit corresponds to a single block
▪ A bitmap allows for fast identification, and therefore allocation, of contiguous areas of free space
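A minimal sketch of the bitmap approach, including a scan for a contiguous run of free blocks; the class and method names are illustrative.

```python
class FreeSpaceBitmap:
    """One bit per disk block: True = free, False = allocated."""
    def __init__(self, num_blocks):
        self.free = [True] * num_blocks

    def allocate_contiguous(self, n):
        """Find and allocate n contiguous free blocks; return the first
        block number, or None if no such run exists."""
        run_start, run_len = 0, 0
        for i, is_free in enumerate(self.free):
            if is_free:
                if run_len == 0:
                    run_start = i
                run_len += 1
                if run_len == n:
                    for b in range(run_start, run_start + n):
                        self.free[b] = False
                    return run_start
            else:
                run_len = 0
        return None

    def release(self, start, n):
        for b in range(start, start + n):
            self.free[b] = True

bm = FreeSpaceBitmap(16)
print(bm.allocate_contiguous(4))   # 0
print(bm.allocate_contiguous(4))   # 4
bm.release(0, 4)
print(bm.allocate_contiguous(3))   # 0 again, reusing the released run
```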
Using OS as the DSM
An OS is required to manage space on a disk
▪ Typically an OS abstracts a file as a sequence of bytes
While it is possible to build a DSM on top of the OS, many DBMSs perform their own disk management
▪ This makes the DBMS more portable across platforms
▪ Using the OS may impose technical limitations, such as a maximum file size
▪ In addition, OS files cannot typically be stored on separate disks, which may be necessary in a DBMS
Record and Page Format
Record Formats
Attributes, or fields, must be organized within records
▪ Information that is common to all records of a particular type is stored in the system catalog
▪ Including the number and type of fields
Records of a single table can vary from each other
▪ In addition to differences in data (obviously)
▪ Different records may contain different numbers of fields, or
▪ Fields of varying length
Examples of Field Types
INTEGER, represented by two or four bytes
FLOAT, represented by four or eight bytes
CHAR(n), fixed length character strings of n bytes
▪ Unused characters are occupied with a pad character
▪ e.g. if a CHAR(5) stored "elm" it would be stored as "elm" followed by two pad characters
VARCHAR(n), character strings of varying lengths
▪ Stored as arrays of n+1 bytes
▪ i.e. even though a VARCHAR's contents can vary, n+1 bytes are dedicated to them
▪ The length of a VARCHAR is stored in the first byte, or
▪ Its end is specified by a null character
Fixed Length Records
Fields are a fixed length, and the number of fields is fixed
▪ Fields may then be stored consecutively
▪ And given the address of a record, the address of a particular field can be found
▪ By referring to the field sizes in the system catalog
It is common to begin all fields at a multiple of 4 or 8 bytes
Fixed Length Record Format
Consider an employee record: {name CHAR(30), address VARCHAR(255), salary FLOAT}
[Diagram: the record starts with a header (pointer to schema, length, timestamp) ending at offset 12, followed by name at offset 12, address at offset 44 and salary at offset 300, with the record ending at offset 308.]
Fields can be found by looking up the field sizes in the schema and performing an offset calculation
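A sketch of that offset calculation for the employee record above; the byte sizes (a 12-byte header, CHAR(30) padded to 32 bytes, 256 bytes for the VARCHAR(255), an 8-byte FLOAT) are assumptions chosen to reproduce the offsets in the diagram.

```python
# (field name, bytes occupied), in storage order, preceded by a 12-byte header.
HEADER_SIZE = 12
FIELDS = [("name", 32),      # CHAR(30), padded to a multiple of 4
          ("address", 256),  # VARCHAR(255): n+1 bytes reserved
          ("salary", 8)]     # FLOAT

def field_offset(field_name):
    """Offset of a field from the start of the record, from the catalog sizes."""
    offset = HEADER_SIZE
    for name, size in FIELDS:
        if name == field_name:
            return offset
        offset += size
    raise KeyError(field_name)

print(field_offset("name"), field_offset("address"), field_offset("salary"))
# 12 44 300 – matching the offsets shown in the record diagram
```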
Variable Length Fields
In the relational model each record contains the same number of fields
▪ However, fields may be of variable length
If a record contains both fixed and variable length fields, store the fixed length fields first
▪ The fixed length fields are easy to locate
To store variable length fields, include additional information in the record header
▪ The length of the record
▪ Pointers to the beginning of each variable length field
▪ A pointer to the end of the record
Variable Length Record
Consider an employee record with name and salary being fixed length and address being variable length
▪ The pointer to the first variable length field may be omitted
[Diagram: the record header holds the record length, a pointer to the address field, a pointer to the end of the record and other header information, followed by the fields name, salary and address.]
Repeating Fields
Records in a relational DB have the same number of fields
▪ But it is possible to have repeating fields
▪ For example, a many-to-many relationship in a record that represents an object
▪ References to other objects will have to be stored
▪ The references (or pointers) to other objects mean that different records will have different lengths
There are three alternatives for recording such data
Storing Repeating Fields
Store the entire record in one block
▪ Maintain a pointer to the first reference
Store the fixed length portion in one location, and the variable length portion in another
▪ The header contains a pointer to the variable length portion (the references to other objects), and
▪ The number of such objects
Store a fixed length record with a fixed number of occurrences of the repeating fields, and
▪ A pointer to (and count of) any additional occurrences
Fixed vs. Variable Length
There are many advantages to keeping records (and therefore fields) fixed length
▪ More efficient search
▪ Lower overhead (the header contains less data)
▪ Easier to move records around
The main advantage of using variable length fields is that it can save space
▪ This can result in fewer disk I/O operations
Variable Length Fields Issues
Modifying a variable length field in a record may make it larger
▪ Later fields in the same record have to be moved, and
▪ Other records may also have to be moved
When a variable length field is modified, the record's size may increase to the extent that it no longer fits on the page
▪ The record must then be moved to another page, but
▪ A "forwarding address" has to be maintained on the old page, so that external references to the rid are still valid
A record may grow larger than the page size
▪ The record must then be broken into sections and connected by pointers
Forwarding Addresses
Forwarding addresses may need to be maintained
▪ When a record grows too large, or
▪ When records are maintained in order (clustered)
When maintaining ordered data
▪ Provide a forwarding address if a record has to be moved to a new page to maintain the ordering
▪ Delete records by inserting a NULL value, or tombstone pointer, in the header
▪ The record slot can be re-used when another record is inserted
Insertion
[Diagram: the numbers represent the primary keys of the records; records with keys 6, 17 and 21 are inserted into a page that already holds records 1, 3 and 8.]
Tombstone Pointer
[Diagram: a page holding records 1, 3 and 8; record 3 is deleted, leaving a tombstone pointer in the header, and record 5 is then inserted into the reusable slot.]
Other Data Types
There are other data types that require special treatment in terms of record storage
▪ Pointers and reference variables
▪ Large objects, such as text, images, video, sound etc.
Records with Pointers
If a record represents an object, the object may contain pointers to, or addresses of, other objects
▪ Such pointers need to be managed by the DBMS
A data item may have two addresses
▪ A database address on disk, usually 8 bytes
▪ A memory address in main memory, usually 4 bytes
When an item is on disk (i.e. secondary storage) its database address must be used
▪ When an item is in the buffer pool it can be referred to by either its database or its memory address
▪ It is more efficient to use the memory address
Translation Tables
Database addresses of items in main memory should be translated to their current memory addresses
▪ To avoid unnecessary disk I/O
It is possible to create a translation table that maps database addresses to memory addresses
▪ However, when using such a table, addresses may have to be repeatedly translated
▪ Whenever a pointer of a record in main memory is accessed, the translation table must be used
Pointer swizzling is used to avoid repeated translation table look-ups
Pointer Swizzling
Whenever a block is moved from secondary to main memory, pointers in that block may be swizzled
▪ i.e. translated from the database address to the memory address
A pointer in main memory consists of
▪ A bit that indicates whether the pointer is a database or a memory address, and
▪ The memory address (4 bytes) or database address (8 bytes) as appropriate
▪ Space is always reserved for the database address
There are several strategies to decide when a pointer should be swizzled
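A minimal sketch of a swizzlable pointer and a translation table, assuming the flag-plus-address layout described above; the names and addresses are illustrative, and the behaviour shown corresponds to swizzling a pointer when it is first followed.

```python
from dataclasses import dataclass

@dataclass
class Pointer:
    """A pointer field in a main-memory record: a flag plus an address."""
    swizzled: bool        # True = address is a memory address
    address: int          # database address (8 bytes) or memory address (4 bytes)

translation = {}          # database address -> memory address of the loaded item

def follow(ptr):
    """Dereference a pointer, swizzling it if the target is already in memory."""
    if ptr.swizzled:
        return ptr.address                     # already a memory address
    if ptr.address in translation:             # target is in the buffer pool
        ptr.address = translation[ptr.address]
        ptr.swizzled = True
        return ptr.address
    # Otherwise the referenced block would have to be read from disk first.
    raise LookupError("block not in memory; read it in, then swizzle")

translation[0xDB0001] = 0x1A00                 # illustrative addresses
p = Pointer(swizzled=False, address=0xDB0001)
print(hex(follow(p)), p.swizzled)              # 0x1a00 True
```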
Pointer Swizzling Example
[Diagram: two blocks on disk contain records with database-address pointers; when Block 1 is read into memory, a pointer to a record that is already in memory is swizzled, while a pointer to a record still on disk is left unswizzled.]
Swizzling Strategies
When a new block is brought into main memory, pointers related to that block may be swizzled
▪ The block may contain pointers to records in the same block, or in other blocks, and
▪ Pointers in records in other blocks, already in main memory, may point to records in the newly copied block
There are four main swizzling strategies
▪ Automatic swizzling
▪ Swizzling on demand
▪ No swizzling – i.e. just use the translation table
▪ Programmer controlled swizzling – when access patterns are known
Automatic Swizzling
Enter the address of the block and its records into the translation table
Enter the address of any pointers in the records in the block into the translation table
▪ If such an address is already in the table, swizzle the pointer, giving it the appropriate memory address
▪ If the address is not already in the table, copy its block into memory and swizzle the pointer
This ensures that all pointers in the new block are swizzled when the block is loaded, which may save time
▪ However, it is possible that some of the pointers may never be followed, in which case the time spent swizzling them is wasted
Swizzling on Demand
Enter the address of the block and its records into the translation table
Leave all pointers in the block unswizzled
When an unswizzled pointer is followed, look up the address in the translation table
▪ If the address is in the table, swizzle the pointer
▪ If the address is not in the table, copy the appropriate block into main memory, and swizzle the pointer
Unlike automatic swizzling, this strategy does not result in unnecessary swizzling
Returning Blocks to Disk
When a block is written to disk its pointers must first be unswizzled
▪ That is, pointers to memory addresses must be replaced by the appropriate database addresses
The translation table can be searched (by memory address) to find the database address
▪ This is potentially time consuming
▪ The translation table should therefore be indexed to allow efficient lookup of both memory and database addresses
Pinned Records and Blocks
Pointer swizzling may result in blocks being pinned
▪ A block is pinned if it cannot safely be written back to disk
▪ A block that is pointed to by a swizzled pointer should be pinned
▪ Otherwise, the pointer can no longer be followed to the block at the specified memory address
If a block is to be unpinned, pointers to it must be unswizzled
▪ The translation table must therefore also record the memory addresses of the pointers that refer to an entry
▪ As a linked list attached to an entry in the translation table, or
▪ As a (pointer to a) linked list in the record's pointer field
Large Object Blocks
How are large data objects stored in records?
▪ Video clips, sound files, or the text of a book
LOB data types store and manipulate large blocks of unstructured data
▪ Tables can contain multiple LOB columns
▪ The maximum size of a LOB is large – at least 8 terabytes in Oracle 10g
▪ LOB data must be processed by application programs
LOB data is stored as either binary or character data
▪ BLOB – unstructured binary data
▪ CLOB, NCLOB – character data
▪ BFILE – unstructured binary data in OS files
LOB Storage
LOBs have to be stored on a sequence of blocks
▪ Ideally the blocks should be contiguous for efficient retrieval, but
▪ It is possible to store the LOB on a linked list of blocks, where each block contains a pointer to the next block
▪ If fast retrieval of LOBs is required, they can be striped across multiple disks for parallel access
It may be necessary to provide an index into a LOB
▪ For example, indexing a movie by seconds to allow a client to request small portions of the movie
Page Formats
Records are organized on pages
▪ Pages can be thought of as a collection of slots, each of which contains a single record
▪ A record can be identified by its record id (rid)
▪ The rid is the {page ID, slot number} pair
Before considering different organizations for managing slots, it is important to know whether records are fixed length or variable length
There are two organizations based on how records are deleted
Packed Page Format
Records are stored consecutively in slots
▪ When a record is deleted, the last record on the page is moved into its slot
▪ Records are found by an offset calculation
▪ All empty space is at the bottom of the page
But the rid includes the slot number
▪ As records are moved, external references become invalid
[Diagram: a page with slots 1 to N holding records, free space below them, and a footer recording N, the number of records.]
Unpacked Page Format
The page header contains a bitmap
▪ Each bit represents a single slot
▪ A slot's bit is turned off when the slot is empty
New records are inserted in empty slots
A record's slot number doesn't change
[Diagram: a page with slots 1 to M, and a header holding M, the number of slots, and a bitmap showing slot occupancy.]
Variable Length Records
With variable length records a page cannot be divided into fixed length slots
▪ If a new record is larger than a slot it cannot be inserted there
▪ If a new record is smaller it wastes space
▪ To avoid wasting space, records must be moved so that all the free space is contiguous, without changing the rids
One solution is to maintain a directory of page slots at the end of each page, which contains
▪ A pointer (an offset value) to each record, and
▪ The length of each record
[Diagram: a page whose slot directory records N, the number of slots, and an (offset, length) entry per record, with the free space in the middle of the page.]
Organizing Variable Length Records
Pointers are offsets to records
▪ Moving a record on the page has no impact on its rid
▪ Its pointer changes but its slot number does not
A pointer to the start of the free space is required
Records are deleted by setting their offset to -1
New records can be inserted in vacant slots
Pages should be periodically reorganized to remove gaps
The directory "grows" into the free space
[Diagram: the slot directory holds the number of slots, a pointer to the start of the free space, and one (offset, length) entry per record, e.g. records of length 24, 16 and 20 bytes.]
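A compact sketch of such a slotted page for variable length records; the page size and bookkeeping details are illustrative, and compaction of the free space is left out to keep it short.

```python
class SlottedPage:
    """Variable-length records addressed by slot number via an (offset, length) directory."""
    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.free_start = 0          # records grow from the top of the page
        self.slots = []              # slot directory: (offset, length); offset -1 = deleted

    def insert(self, record: bytes):
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        # Reuse a vacant slot if one exists, otherwise grow the directory.
        for slot_no, (off, _) in enumerate(self.slots):
            if off == -1:
                self.slots[slot_no] = (offset, len(record))
                return slot_no
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1

    def delete(self, slot_no):
        self.slots[slot_no] = (-1, 0)     # the slot (and rid) can be reused

    def read(self, slot_no):
        offset, length = self.slots[slot_no]
        if offset == -1:
            return None
        return bytes(self.data[offset:offset + length])

page = SlottedPage()
rid = page.insert(b"Landis#disq.#none#...")
print(page.read(rid))
```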
Files and Records
A page can be considered as a collection of records
Pages containing related records are organized into collections, or files
▪ One file usually represents a single table
▪ One file may span several pages
▪ It is therefore necessary to be able to access all of the pages that make up a file
The basic file structure is a heap file
Heap Files
Heap files are not ordered in any way
▪ But they do guarantee that all of the records in a file can be retrieved by repeatedly requesting the next record
▪ Each record in a file has a unique record ID (rid)
▪ And each page in the file is the same size
Heap files support the following operations:
▪ Creating and destroying files
▪ Inserting and deleting records
▪ Scanning all the records in the file
To support these operations it is necessary to:
▪ Keep track of the pages in the file
▪ Keep track of which of those pages contain free space
Heap File Organization 1
Maintain the heap file as a pair of doubly linked lists of pages
▪ One list for pages with free space, and
▪ One list for pages that are full
▪ The DBMS records the first page of each file in a table with one entry per file
If records are of variable length, most pages will end up in the list of pages with free space
▪ It may be necessary to search several pages on the free space list to find one with enough free space
Heap File Organization 2
Maintain the heap file as a directory of pages
▪ Each directory entry identifies a page (or a sequence of pages) in the heap file
The entries are kept in data page order, and each entry records:
▪ Whether or not the page is full, or
▪ The amount of free space on the page
▪ If the amount of free space is recorded, there is no need to visit a page to determine whether it contains enough space
Managing Data in Main Memory
The Buffer Manager
The buffer manager is responsible for bringing pages from disk to main memory as required
▪ Main memory is partitioned into a collection of pages called the buffer pool
▪ Main memory pages are referred to as frames
▪ Other processes must tell the buffer manager if a page is no longer required and whether or not it has been modified
A DB may be many times larger than the buffer pool
▪ Accessing the entire DB (or performing queries that require joins) can easily fill up the buffer pool
▪ When the buffer pool is full, the buffer manager must decide which pages to replace by following a replacement policy
Buffer Pool Management
[Diagram: programs request pages from the buffer manager; the buffer pool in main memory holds frames, some containing disk pages and some free, and the buffer manager reads pages from, and writes pages to, the database on disk.]
Page Frames
Buffer pool frames are the same size as disk pages
The buffer manager records two pieces of information for each frame
▪ dirty bit – on if the page has been modified
▪ pin-count – the number of times the page has been requested but not released
[Diagram: a data page held in a main memory frame, together with its dirty bit and pin-count.]
Requesting (Allocating) a Page
If the page is already in the buffer pool
▪ Increment the frame's pin-count (called pinning)
Otherwise
▪ Choose a frame to replace (using the replacement policy)
▪ A frame is only chosen for replacement if its pin-count is zero
▪ If there is no frame with a pin-count of zero, the transaction must either wait or be aborted
▪ If the chosen frame is dirty, write it to the disk
▪ Read the requested page into the replacement frame and set its pin-count to 1
▪ Return the address of the frame
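A minimal sketch of this request path, together with the unpinning described on the next slide; the frame bookkeeping and the read_page/write_page/choose_victim placeholders are illustrative, and the replacement policy is left abstract.

```python
class BufferManager:
    """Pin-count / dirty-bit bookkeeping for a fixed pool of frames."""
    def __init__(self, num_frames, read_page, write_page, choose_victim):
        self.frames = {}                      # page_id -> {"pin": int, "dirty": bool}
        self.num_frames = num_frames
        self.read_page, self.write_page = read_page, write_page
        self.choose_victim = choose_victim    # replacement policy over unpinned pages

    def request(self, page_id):
        if page_id in self.frames:            # already buffered: just pin it
            self.frames[page_id]["pin"] += 1
            return page_id
        if len(self.frames) >= self.num_frames:
            unpinned = [p for p, f in self.frames.items() if f["pin"] == 0]
            if not unpinned:
                raise RuntimeError("no replaceable frame: wait or abort")
            victim = self.choose_victim(unpinned)
            if self.frames[victim]["dirty"]:  # write back before replacing
                self.write_page(victim)
            del self.frames[victim]
        self.read_page(page_id)
        self.frames[page_id] = {"pin": 1, "dirty": False}
        return page_id

    def release(self, page_id, modified=False):
        frame = self.frames[page_id]
        frame["pin"] -= 1                     # unpinning
        frame["dirty"] = frame["dirty"] or modified
```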
Releasing a Page
When a process releases (de-allocates) a page, its pin-count is reduced; this is known as unpinning
▪ The program indicates whether the page has been modified; if so, the buffer manager sets the dirty bit to on
Processes for requesting and releasing pages are affected by concurrency and crash recovery policies
▪ These will be discussed at a later date
Buffer Replacement Policies
The policy used to replace frames can affect the efficiency of database operations
▪ Ideally a frame should not be replaced if it will be needed again in the near future
Least Recently Used (LRU) replacement policy
▪ Assumes that frames that haven't been used recently are no longer required
▪ Uses a queue to keep track of frames with a pin-count of zero
▪ Replaces the frame at the front of the queue
▪ Requires main memory space for the queue
Clock Replacement
A variant of the LRU policy with less overhead
▪ Instead of a queue, the policy requires one bit per frame, and a single variable called current
▪ Assume that the frames are numbered from 0 to B-1, where B is the number of frames
Each frame has an associated referenced bit
▪ The referenced bit is initially set to off, and is
▪ Set to on when the frame's pin-count reaches zero
current is initially set to 0, and is used to indicate the next frame to be considered for replacement
Clock Replacement Process
Consider the current frame for replacement
▪ If pin-count > 0, increment current
▪ If pin-count = 0 and the referenced bit is on, switch referenced to off and increment current
▪ If pin-count = 0 and referenced is off, replace the frame
▪ If current equals B-1, set it to 0
Only replace frames with pin-counts of zero
▪ Frames with a pin-count of zero are only replaced after all older candidates are replaced
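A sketch of that process as code, under the assumptions above (B frames, a referenced bit per frame, a current hand); pin-count management is simplified to keep the example focused on victim selection.

```python
class ClockReplacer:
    """Second-chance (clock) victim selection over B frames."""
    def __init__(self, num_frames):
        self.B = num_frames
        self.pin_count = [0] * num_frames
        self.referenced = [False] * num_frames
        self.current = 0

    def unpin(self, frame):
        self.pin_count[frame] -= 1
        if self.pin_count[frame] == 0:
            self.referenced[frame] = True   # give the frame a second chance

    def choose_victim(self):
        """Advance the clock hand until a replaceable frame is found."""
        for _ in range(2 * self.B):         # at most two sweeps are needed
            f = self.current
            self.current = (self.current + 1) % self.B
            if self.pin_count[f] > 0:
                continue                    # pinned: skip
            if self.referenced[f]:
                self.referenced[f] = False  # used recently: clear and move on
                continue
            return f                        # unpinned and not recently used
        return None                         # every frame is pinned
```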
Is LRU the Right Policy?
LRU and clock replacement are fair schemes
▪ They are not always the best strategies for a DB system
▪ It is common for some DB operations to require repeated sequential scans of data (e.g. Cartesian products, joins)
▪ With LRU, such operations may result in sequential flooding
An alternative is the Most Recently Used (MRU) policy
▪ This prevents sequential flooding but is generally poor
Most systems use some variant of LRU
▪ Some systems will identify certain operations, and apply MRU for those operations
No Sequential Flooding
Assume that a process requests repeated sequential scans of a file
▪ The file has nine pages, p1 to p9
▪ Assume that the buffer pool has ten frames
Read page 1 first, then page 2, … then page 9
▪ All the pages fit in the buffer, so when the next scan of the file is requested, no further disk access is required!
Sequential Flooding
Now assume that the file has eleven pages, p1 to p11, and that the buffer pool still has ten frames
▪ Read pages 1 to 10 first; page 11 is still to be read
▪ Using LRU, replace the least recently used frame, which contains p1, with p11
▪ The first scan is complete; start the second scan by reading p1 from the file, replacing the LRU frame (which now contains p2)
▪ Continue the scan by reading p2, which replaces the frame containing p3, and so on
Each scan of the file requires that every page is read from the disk!
▪ In this case LRU is the WORST possible replacement policy!
OS Buffer Management
There are similarities between OS virtual memory and DBMS buffer management
▪ Both have the goal of accessing more data than will fit in main memory
▪ Both bring pages from disk to main memory as needed and replace unneeded pages
A DBMS requires its own buffer management
▪ To increase the efficiency of database operations
▪ To control when a page is written to disk
DBMS Buffer Management
A DBMS can often predict patterns in the way in which pages are referenced
▪ Most page references are generated by processes such as query processing, with known patterns of page accesses
▪ Knowledge of these patterns allows for a better choice of pages to replace, and
▪ Allows prefetching of pages, where page requests can be anticipated and performed before they are issued
A DBMS requires the ability to force a page to disk
▪ To ensure that the page is updated on the disk
▪ This is necessary to implement crash recovery protocols, where the order in which pages are written is critical
Prefetching
Some DBMS buffer managers are able to predict page requests
▪ And fetch pages into the buffer before they are requested
▪ The pages are then available in the buffer pool as soon as they are requested, and
▪ If the pages to be prefetched are contiguous, the retrieval will be faster than if they had been retrieved individually
▪ If the pages are not contiguous, retrieval may still be faster, as access to them can be efficiently scheduled
The disadvantage of prefetching (aka double-buffering) is that it requires extra main memory buffers
Performance Strategies
Organizing data by cylinders
▪ Related data should be stored "close to" each other
Using a RAID system to improve efficiency or reliability
▪ Multiple disks and striping improve efficiency
▪ Mirroring or redundancy improves reliability
Scheduling requests using the elevator algorithm
▪ Reduces disk access time for random reads and writes
▪ Most effective when there are many requests waiting
Prefetching (or double-buffering) data in large chunks
▪ Speeds up access when needed blocks can be predicted, but requires more main memory buffers