Course schedule
Time   Aug. 2               Aug. 3              Aug. 4                    Aug. 5               Aug. 6
9:00   Intro & terminology  TP mons & ORBs      Logging & res. mgr.       Files & buffer mgr.  Structured files
11:00  Reliability          Locking theory      Res. mgr. & trans. mgr.   COM+                 Access paths
13:30  Fault tolerance      Locking techniques  CICS & TP & Internet      CORBA/EJB + TP       Groupware
15:30  Transaction models   Queueing            Advanced trans. mgr.      Replication          Performance & TPC
18:00  Reception            Workflow            Cyberbricks               Party                FREE
Files and Buffer Manager
Chapter 15
Jim Gray, Andreas Reuter, Transaction Processing - Concepts and Techniques, WICS August 2 - 6, 1999
Abstractions Provided by the File Manager
Device independence: The file manager turns the large variety of external storage devices, such as disks (with their different numbers of cylinders, tracks, arms, and read/write heads), ram-disks, tapes, and so on, into simple abstract data types.
Allocation independence: The file manager does its own space management for storing the data objects presented by the client. It may store the same objects in more than one place (replication).
Abstractions Provided by the File Manager
Address independence: Whereas objects in main memory are always accessed through their addresses, the file manager provides mechanisms for associative access. Thus, for example, the client can request access to all records with a specified value in some field of the record. Support for associative access comes in many flavors, from simple mechanisms yielding fast retrieval via the primary key up to the expressive power of the SQL select statement.
External Storage vs. Main Memory
Capacity: Main memory is usually limited to a size that is some orders of magnitude smaller than what large databases need.
Economics: External storage holds large volumes of data at reasonable cost.
Durability: Main memory is volatile. External storage devices such as magnetic or optical disks are inherently durable and therefore are appropriate for storing persistent objects. After a crash, recovery starts with what is found in durable storage.
External Storage vs. Main Memory
Speed: External storage devices are some orders of magnitude slower than main memory. As a result, it is more costly, both in terms of latency and in terms of pathlength, to get data from external storage to the CPU than to load data from main memory.
Functionality: Data cannot be processed directly on external storage: they can neither be compared nor modified “out there.”
The Storage Pyramid
[Figure: the storage pyramid, from top to bottom:
main memory (electronic RAM and bulk storage)
online external storage (magnetic / optical disks)
near-line (archive) storage (automated archives, e.g. optical disk jukeboxes, tape robots, etc.)
Side labels: typical capacity; current data at the top, stale data at the bottom.]
Interfacing to External Memory: Read-Write Mapping
[Figure: external storage holds Files A, B, C, and D. Main memory obtains individual objects through explicit calls: read object w, read object y, read object x / write object x.]
Interfacing to External Memory: File Mapping
[Figure: external storage holds Files A, B, C, and D. File C as a whole appears in main memory between the calls "map File C" and "unmap File C".]
Interfacing to External Memory: Single-Level Storage
[Figure: external storage holds Files A, B, C, and D. All four files appear in virtual memory, which is backed by main memory. Labels: virtual memory; explicit mapping.]
Locality and Caching
The movement of data through the pyramid is guided by the principle of locality:
Locality of active data: Data that have recently been referenced will very likely be referenced again.
Locality of passive data: Data that have not been referenced recently will most likely not be referenced in the future.
Levels of Abstraction in a File and Database Manager
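[Figure: layered architecture of the DBMS between the application and the storage devices.
Components, top to bottom: application (transaction programs); database access modules (tuple management, associative access); database buffer manager (buffer management); media and file manager (file management); archive manager (archive management). Logging and recovery sit beside these layers.
Interfaces, top to bottom: set-oriented access (sort, join, ...); tuple-oriented access; block-oriented access; device-oriented access (read, write).
Storage managed: the buffer manager manages main memory; the media and file manager manages online external memory; the archive manager manages nearline external memory.]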
Operations of the Basic File System
STATUS create(filename, allocparmp);
STATUS delete(filename);
STATUS open(filename, ACCESSMODE, FILEID);
STATUS close(FILEID);
STATUS extend(FILEID, allocparmp);
STATUS read(FILEID, BLOCKID, BLOCKP);
STATUS readc(FILEID, BLOCKID, blockcount, BLOCKP);
STATUS write(FILEID, BLOCKID, BLOCKP);
STATUS writec(FILEID, BLOCKID, blockcount, BLOCKP);
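The block-oriented read and write calls listed above can be illustrated with a minimal sketch on top of C standard I/O. This is not the book's implementation: the names block_read and block_write, the fixed BLOCK_SIZE of 512 bytes, and the convention that STATUS 0 means success are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 512            /* assumed block size, not taken from the slides */

typedef int  STATUS;              /* assumed convention: 0 = success, nonzero = error */
typedef long BLOCKID;             /* block number relative to the start of the file */

/* Write one block: position the file at BLOCKID * BLOCK_SIZE, then transfer. */
STATUS block_write(FILE *f, BLOCKID b, const void *blockp)
{
    if (fseek(f, b * (long)BLOCK_SIZE, SEEK_SET) != 0)
        return 1;
    return fwrite(blockp, BLOCK_SIZE, 1, f) == 1 ? 0 : 1;
}

/* Read one block into the caller's buffer; fails if the block does not exist. */
STATUS block_read(FILE *f, BLOCKID b, void *blockp)
{
    if (fseek(f, b * (long)BLOCK_SIZE, SEEK_SET) != 0)
        return 1;
    return fread(blockp, BLOCK_SIZE, 1, f) == 1 ? 0 : 1;
}
```

A caller would open a file in binary update mode, write blocks 0, 1, 2 in order, and then read any of them back by number. The point of the sketch is only that the client addresses blocks, not byte offsets or disk geometry.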
Mapping Files To Disk
[Figure: Files A, B, C, D, and E are mapped onto regions of Disk 1. The arrows denote the address mapping between disk and files.]
Issues in Managing Disk Space
Initial allocation: When a file is created, how many contiguous slots should be allocated to it?
Incremental expansion: If an existing file grows beyond the number of slots currently allocated, how many additional contiguous slots should be assigned to that file?
Reorganization: When and how should the free space on the disk be reorganized?
Extent-Based Allocation
[Figure: the file directory (entries A, B) points to an extent directory. Extent directory entry shown:]
               primary   1. secondary   2. secondary   3. secondary
disk-id        Disk-A    Disk-B         Disk-A         Disk-A
extent index   14        187            3              214
accum-length   100       350            600            850
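The accum-length column makes it cheap to find the extent that holds a given file-relative block: scan until the accumulated length exceeds the block number. The following is a minimal sketch under assumptions: the struct layout and the function name locate_block are hypothetical, and the extent index is treated as an opaque identifier because the slide does not say how it is interpreted on disk.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *disk_id;   /* disk holding the extent */
    long extent_index;     /* opaque extent identifier on that disk (assumption) */
    long accum_length;     /* blocks in this extent plus all earlier ones */
} EXTENT;

/* Values copied from the slide's extent directory. */
static const EXTENT extents_of_file[] = {
    { "Disk-A",  14, 100 },   /* primary      */
    { "Disk-B", 187, 350 },   /* 1. secondary */
    { "Disk-A",   3, 600 },   /* 2. secondary */
    { "Disk-A", 214, 850 },   /* 3. secondary */
};

/* Return the position of the extent holding file-relative block `blockno`
 * and store the block's offset inside that extent in *offset.
 * Returns -1 if the block lies outside the allocated space. */
int locate_block(const EXTENT *ext, int n, long blockno, long *offset)
{
    long start = 0;                       /* first block number of extent i */
    if (blockno < 0)
        return -1;
    for (int i = 0; i < n; i++) {
        if (blockno < ext[i].accum_length) {
            *offset = blockno - start;
            return i;
        }
        start = ext[i].accum_length;
    }
    return -1;
}
```

With the values above, block 599 falls into the third entry (the 2. secondary extent on Disk-A) at offset 249, and block 850 is past the end of the file's allocation.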
Buddy Systems
[Figure: 16 slots with addresses 0000 to 1111 (shaded extents are free), grouped into buddy types 0, 1, 2, and 3. A table lists the free extents per type; legible entries: 0110, 00101100, 1011.]
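The address arithmetic behind a buddy system is short enough to show directly. The sketch below assumes the standard binary buddy scheme, in which an extent of type k covers 2^k slots and starts at a multiple of 2^k; the slide's figure is consistent with this for 16 slots and types 0 to 3, but the function names are made up for the example.

```c
#include <assert.h>

/* An extent of type k covers 2^k slots and starts at a multiple of 2^k. */
int is_aligned(unsigned addr, unsigned type)
{
    return (addr & ((1u << type) - 1u)) == 0;
}

/* The buddy of an extent differs from it only in bit `type` of the address. */
unsigned buddy_of(unsigned addr, unsigned type)
{
    return addr ^ (1u << type);
}

/* When an extent and its buddy are both free, they merge into one extent
 * of type + 1 that starts at the lower of the two addresses. */
unsigned merged_start(unsigned addr, unsigned type)
{
    return addr & ~(1u << type);
}
```

For example, the type-1 extent at slot 0110 has the buddy 0100; if both are free they merge into a type-2 extent starting at 0100.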
Simple Mapping of Relations To Disks
[Figure: on the file system and operating system side, real disks hold file system segments. On the database system side, relations are stored in extents.]
A Usual Way of Mapping Relations To Disks
[Figure: on the operating system side: real disks, logical disks, OS-files. On the database system side: tablespaces, segments, relations, extents.]
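Principles of the Database Buffer
[Figure: five numbered steps between the process of the access module and the buffer manager. The access module calls bufferfix(P, ...) ("Give me page P") through the buffer manager interface. The buffer manager finds a frame in the buffer, determines the FILEID and block number from its directory, reads the block directly into the frame, and returns the frame address F. The buffer storage area is accessible from the caller's process (shared memory).]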
Design Options for the Buffer Manager
Buffer per file: Each file has its own private buffer pool.
Buffer per page size: In systems with different page (and block) sizes, there is usually at least one buffer for each page size.
Buffer per file type: There are files like indices, which are accessed in a significantly different way from other files. Therefore, some systems dedicate buffers to files depending on the access pattern and try to manage each of them in a way that is optimal for the respective file organization.
Logic of the Buffer Manager
Search in buffer: Check if the requested page is in the buffer. If found, return the address F of this frame to the caller.
Find free frame: If the page is not in the buffer, find a frame that holds no valid page.
Determine replacement victim: If no such frame exists, determine a page that can be removed from the buffer (in order to reuse its frame).
Logic of the Buffer Manager
Write modified page: If the replacement victim has been modified, write it to disk before reusing its frame.
Establish frame address: Denote the start address of the frame as F.
Determine block address: Translate the requested PAGEID P into a FILEID and a block number. Read the block into the frame selected.
Return: Return the frame address F to the caller.
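The steps above can be sketched in a few lines of C. This is a toy model, not the book's bufferfix: the disk is an in-memory array, the pool has two frames, the victim is chosen by least recent use (one possible policy; the slides do not fix one here), and the names bufferfix_sketch and mark_modified are hypothetical. Page numbers are assumed to be in the range 0 to NPAGES - 1.

```c
#include <assert.h>
#include <string.h>

#define NFRAMES   2
#define NPAGES    8
#define PAGE_SIZE 16

typedef struct {
    int pageid, valid, modified;
    unsigned long last_use;          /* logical time of the last reference */
    char data[PAGE_SIZE];
} FRAME;

static char  disk[NPAGES][PAGE_SIZE];   /* stand-in for external storage */
static FRAME pool[NFRAMES];
static unsigned long clock_tick;
static int disk_reads, disk_writes;

char *bufferfix_sketch(int pageid)
{
    int i, target = -1;

    /* Search in buffer: on a hit, return the frame address. */
    for (i = 0; i < NFRAMES; i++)
        if (pool[i].valid && pool[i].pageid == pageid) {
            pool[i].last_use = ++clock_tick;
            return pool[i].data;
        }
    /* Find free frame: one that holds no valid page. */
    for (i = 0; i < NFRAMES; i++)
        if (!pool[i].valid) { target = i; break; }
    /* Determine replacement victim: least recently used frame (assumed policy). */
    if (target < 0) {
        target = 0;
        for (i = 1; i < NFRAMES; i++)
            if (pool[i].last_use < pool[target].last_use) target = i;
        /* Write modified page before its frame is reused. */
        if (pool[target].modified) {
            memcpy(disk[pool[target].pageid], pool[target].data, PAGE_SIZE);
            disk_writes++;
        }
    }
    /* Determine block address (here simply the page number) and read the block. */
    memcpy(pool[target].data, disk[pageid], PAGE_SIZE);
    disk_reads++;
    pool[target].pageid   = pageid;
    pool[target].valid    = 1;
    pool[target].modified = 0;
    pool[target].last_use = ++clock_tick;
    /* Return the frame address F. */
    return pool[target].data;
}

/* The caller reports that it changed the page; the write happens later. */
void mark_modified(int pageid)
{
    for (int i = 0; i < NFRAMES; i++)
        if (pool[i].valid && pool[i].pageid == pageid) pool[i].modified = 1;
}
```

Fixing pages 0, 1, and 2 in that order with a two-frame pool evicts page 0; if page 0 was marked modified, it is written to the simulated disk first. The sketch deliberately ignores fix counts and synchronization.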
Synchronization in the Buffer
[Figure: three panels showing process A and process B. Each calls bufferfix(P, ...) and the buffer manager gives a copy to the caller. A changes P to P', B changes P to P''. Both rewrite the page; the resulting state of the page is marked "?".]
a) Access module in process A requests access to page P; gets private copy.
b) Access module in process B requests access to page P; gets private copy.
c) Both processes try to rewrite an updated version of the page, but these versions are different. Only the version written last will be on disk; this is the "lost update" anomaly.
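The anomaly in steps a) to c) can be replayed deterministically. The sketch below is only an illustration: the page holds a single integer, the two "processes" are simulated by two local copies in one function, and all names are made up for the example.

```c
#include <assert.h>

typedef struct { int balance; } PAGE;

static PAGE disk_page = { 100 };           /* the single shared page P */

/* Each caller receives a private copy of the page, as in steps a) and b). */
void get_private_copy(PAGE *copy)   { *copy = disk_page; }

/* Rewriting replaces the whole page, as in step c). */
void rewrite_page(const PAGE *copy) { disk_page = *copy; }

int lost_update_demo(void)
{
    PAGE a, b;
    disk_page.balance = 100;
    get_private_copy(&a);                  /* process A fixes P */
    get_private_copy(&b);                  /* process B fixes P */
    a.balance += 10;                       /* A changes P to P'  */
    b.balance += 20;                       /* B changes P to P'' */
    rewrite_page(&a);
    rewrite_page(&b);                      /* overwrites A's update */
    return disk_page.balance;              /* 120: A's +10 is lost */
}
```

Had both updates been applied to one shared copy under mutual exclusion, the result would be 130.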
What the Buffer Manager Does for Synchronization
Sharing: Pages are made addressable to all processes that run the database code.
Semaphore protection: Each requestor gets the address of a semaphore protecting the page.
Durable storage: The access modules inform the buffer manager if their page access has resulted in an update of the page; the actual write operation, however, is issued by the buffer manager, probably at a time when the update transaction is long gone.
The Interface to the Buffer Manager
/* control block for buffer access */
typedef struct
{
    PAGEID      pageid;    /* id of page in file */
    PAGEPTR     pageaddr;  /* base addr. in buffer */
    int         index;     /* record within page */
    semaphore * pagesem;   /* pointer to the sem. */
    Boolean     modified;  /* caller modif. page */
    Boolean     invalid;   /* destroyed page */
} BUFFER_ACC_CB, *BUFFER_ACC_CBP;
The Need for Fix and Unfix
[Figure: three panels. Transaction X calls bufferfix(P, ...); page P is not in the buffer, so the block containing P is read and its base address in the buffer is returned. Transaction Y calls bufferfix(Q, ...); page Q is not in the buffer, so the buffer manager replaces P and reads the block containing Q. The call by Y then returns the base address in the buffer.]
a) Transaction X requests access to page P; gets base address in buffer.
b) Transaction Y requests access to page Q; buffer mgr. decides to replace page P.
c) Transaction Y gets the base address of Q in the buffer, which is the same as P's.
The Fix-Use-Unfix Protocol I
FIX: The client requests access to a page using the bufferfix interface.
USE: The client uses the page and the pointer to the frame containing the page will remain valid.
UNFIX: The client explicitly waives further usage of the frame pointer; that is, it tells the buffer manager that it no longer wants to use that page.
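One common way to enforce this protocol is a fix counter per frame: a frame whose counter is above zero is never chosen as a replacement victim. The sketch below assumes such a counter and a least-recently-used choice among the unfixed frames; the struct and function names are hypothetical and not taken from the slides.

```c
#include <assert.h>

typedef struct {
    int pageid;
    int fixcount;              /* number of outstanding FIX calls */
    unsigned long last_use;    /* logical time of the last reference */
} FRAME_CB;

void fix(FRAME_CB *f)   { f->fixcount++; }

void unfix(FRAME_CB *f) { if (f->fixcount > 0) f->fixcount--; }

/* Pick the least recently used frame that nobody has fixed.
 * Returns the frame position, or -1 if every frame is in use. */
int choose_victim(const FRAME_CB *frames, int n)
{
    int victim = -1;
    for (int i = 0; i < n; i++) {
        if (frames[i].fixcount > 0)
            continue;          /* a client still holds a pointer into it */
        if (victim < 0 || frames[i].last_use < frames[victim].last_use)
            victim = i;
    }
    return victim;
}
```

While a client is between FIX and UNFIX, its frame is skipped even if it is the oldest one, so the pointer handed out by the buffer manager stays valid during the USE phase.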
The Fix-Use-Unfix Protocol II
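[Figure: timeline of one client working on pages P, Q, and R. Page P: fix, use, use, unfix. Page R: fix, use, use, use, unfix. Page Q: fix, use, use, use, unfix. The fix-use-unfix intervals of different pages may overlap in time.]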
Structure of the Buffer Manager
[Figure: the bufferpool consists of frames holding pages. A hash table entry (first_bcb) points to a chain of buffer control blocks in the same hash class, linked by next_in_hclass. Each buffer control block holds frame_index, the index of the frame holding the page (an address pointer in the case of the buffer access control block). The buffer control blocks are also linked into an LRU chain running from mru_page to lru_page. Buffer access control blocks are passed to and from the client.]
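The hash-class chain from the figure can be sketched as follows. The field names frame_index, first_bcb, and next_in_hclass come from the figure; everything else (the struct layout, four hash classes, the modulo hash function, the function names) is an assumption made for illustration, and the LRU chain is left out.

```c
#include <assert.h>
#include <stddef.h>

#define NCLASSES 4                       /* assumed number of hash classes */

typedef struct bcb {
    int pageid;                          /* page described by this block */
    int frame_index;                     /* index of frame holding the page */
    struct bcb *next_in_hclass;          /* next block in the same hash class */
} BCB;

static BCB *first_bcb[NCLASSES];         /* hash table: heads of the chains */

static unsigned hclass(int pageid)
{
    return (unsigned)pageid % NCLASSES;  /* assumed hash function */
}

/* Link a control block at the head of its hash-class chain. */
void bcb_insert(BCB *b)
{
    unsigned h = hclass(b->pageid);
    b->next_in_hclass = first_bcb[h];
    first_bcb[h] = b;
}

/* Walk the chain of the page's hash class; NULL means "not in buffer". */
BCB *bcb_lookup(int pageid)
{
    for (BCB *b = first_bcb[hclass(pageid)]; b != NULL; b = b->next_in_hclass)
        if (b->pageid == pageid)
            return b;
    return NULL;
}
```

Pages 5 and 9 land in the same hash class here, so looking up either one walks a two-element chain; a lookup for page 13 walks the same chain and fails.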
Logging and Recovery from the Buffer Manager's Perspective I
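[Table: states of pages A and B in the buffer and in the database. The marks that distinguish old from new page versions are not legible in the source.]
Transaction   Buffer   Database   Remark
running       A B      A B        OK; old state in DB
running       BA       A B        OK; old state in DB
running       BA       A B        database corrupted
running       BA       BA         conflicting view on TA
committed     A B      A B        OK; Read-only TA
committed     BA       A B        DB not in new state
committed     BA       A B        database corrupted
committed     BA       BA         OK; new state in DB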
Logging and Recovery from the Buffer Manager's Perspective II
state of transaction TA   state of page A in database   result of recovery using operation log
aborted                   old                           wrong tuple might be deleted
aborted                   new                           inverse operation succeeds
committed                 old                           operation succeeds
committed                 new                           duplicate of tuple is inserted
The Log and Page LSNs
[Figure: timeline. The log manager receives four log records for page A, with LSN1, LSN2, LSN3, and LSN4. The buffer manager writes page A to disk twice: first carrying LSN1, later carrying LSN3.]
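How a page LSN is typically used can be sketched as follows. The example assumes the usual convention that each page stores the LSN of the last log record applied to it, that REDO applies a record only if its LSN is greater than the page LSN, and that a page may be written only after the log is durable up to the page LSN. The slide shows the LSNs but does not spell out these rules, and all names in the code are hypothetical.

```c
#include <assert.h>

typedef unsigned long LSN;

typedef struct { LSN page_lsn; int value; } PAGE;      /* page A, reduced to one field */
typedef struct { LSN lsn; int delta; } LOGREC;         /* "add delta to value" */

/* Assumed write-ahead rule: the log must be durable up to the page's LSN. */
int can_write_page(const PAGE *p, LSN flushed_lsn)
{
    return p->page_lsn <= flushed_lsn;
}

/* REDO pass over the log records of one page. A record is applied only if
 * the page has not seen it yet, which makes the pass repeatable.
 * Returns the number of records actually applied. */
int redo(PAGE *p, const LOGREC *log, int n)
{
    int applied = 0;
    for (int i = 0; i < n; i++) {
        if (log[i].lsn > p->page_lsn) {
            p->value   += log[i].delta;
            p->page_lsn = log[i].lsn;
            applied++;
        }
    }
    return applied;
}
```

If the disk copy of page A carries LSN3 at the time of the crash, a REDO pass over records LSN1 to LSN4 applies only the last one, and running the pass a second time changes nothing.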
Different Buffer Management Policies
Steal policy: When the buffer manager needs space, it can decide to replace dirty pages.
No-Steal policy: Pages can be replaced only if they are clean.
Force policy: At end of transaction, all modified pages are forced to disk in a series of synchronous write operations.
No-Force policy: No modified page is forced during commit. REDO log records are written to the log.
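The four policies combine into a small decision table for crash recovery. The sketch below encodes the commonly cited consequences: with steal, uncommitted changes can reach disk, so UNDO information is needed; with no-force, committed changes may still be missing from disk, so REDO information is needed. It looks only at crash recovery (media recovery is ignored), and the type and function names are made up for the example.

```c
#include <assert.h>

typedef struct {
    int steal;   /* 1 = dirty pages of running transactions may be replaced */
    int force;   /* 1 = all modified pages are written at commit */
} POLICY;

/* Uncommitted updates can be on disk only if dirty pages may be stolen. */
int needs_undo(POLICY p) { return p.steal; }

/* Committed updates can be missing from disk only if commit does not force. */
int needs_redo(POLICY p) { return !p.force; }
```

A steal/no-force buffer manager therefore needs both UNDO and REDO log records, while no-steal/force needs neither for crash recovery, at the price of holding dirty pages in the buffer and writing synchronously at commit.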
The Problem of Hotspot Pages
[Figure: page A in the bufferpool is updated by transactions TA1 through TA8 (update operations). Below the bufferpool are durable storage and the log. The dotted arrows indicate an update of the page by the respective transaction. The arrows at 45 degrees indicate the forced writing of the page during commit processing. The downward arrows indicate the writing of log records for the respective transaction.]
The Basic Checkpoint Algorithm
Quiesce: Delay all incoming update DML calls until all fixes with exclusive semaphores have been released.
Flush the buffer: Write all modified pages.
Log the checkpoint: Write a record to the log, saying that a checkpoint has been generated.
Resume normal operation: The bufferfix requests for updates that have been delayed in order to take the checkpoint can now be processed again.
The Case for Indirect Checkpointing
[Figure: buffer frames numbered 1 to 10 plotted against the log, with checkpoints c1, c2, and c3 marked on the log.]
When taking a checkpoint, the PAGEIDs of the pages currently in buffer are written to the log.
The Indirect Checkpointing Algorithm
Record TOC: Log the list of PAGEIDs.
Compare with prev. ckpt: See if any modified pages have not been replaced since the last checkpoint.
Force lazy pages: Schedule the writing of those pages during the next checkpoint interval.
Low-water mark: Find the LSN of the oldest still-volatile update; write it to the log.
Write "Checkpoint done" record.
Resume normal operation.
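The low-water mark step can be sketched as a minimum over the modified frames. The sketch assumes that the buffer manager remembers, for each modified page, the LSN of the first update that has not yet been written to disk; the struct, its field names, and the function name are hypothetical.

```c
#include <assert.h>

typedef unsigned long LSN;

typedef struct {
    int pageid;
    int modified;            /* nonzero if the buffered copy differs from disk */
    LSN first_update_lsn;    /* oldest update of this page not yet on disk */
} FRAME_STATE;

/* The low-water mark is the LSN of the oldest still-volatile update.
 * If no frame is modified, nothing before the current end of the log
 * is needed for REDO, so end_of_log is returned. */
LSN low_water_mark(const FRAME_STATE *f, int n, LSN end_of_log)
{
    LSN lwm = end_of_log;
    for (int i = 0; i < n; i++)
        if (f[i].modified && f[i].first_update_lsn < lwm)
            lwm = f[i].first_update_lsn;
    return lwm;
}
```

At restart, REDO can begin at the logged low-water mark instead of at the start of the log, which is why forcing "lazy" pages during the next interval keeps restart time bounded.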
Further Possibilities for Optimization
Pre-flushing can be performed by an asynchronous process that scans the buffer for "old" modified pages. Writing is done under semaphore protection.
Pre-fetching can, among other things, be used to make restart more efficient. If page reads are logged, one can use the most recent checkpoint plus the log to prime the bufferpool, so that it looks almost exactly as it did at the moment of the crash.
Further Possibilities for Optimization
Transaction scheduling and buffer management can take hints from the query optimizer:
This relation will be scanned sequentially.
This is a sequential scan of the leaves of a B-tree.
This is the traversal of a B-tree, starting at the root.
This is a nested-loop join, where the inner relation is scanned in physically sequential order.