View
4.437
Download
1
Category
Tags:
Preview:
DESCRIPTION
This lecture is part of an Introduction to Databases course given at the Vrije Universiteit Brussel.
Citation preview
2 December 2005
Introduction to DatabasesAccess Methods
Prof. Beat Signer
Department of Computer Science
Vrije Universiteit Brussel
http://www.beatsigner.com
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 2April 24, 2015
Context of Today's Lecture
Access
Methods
System
Buffers
Authorisation
Control
Integrity
Checker
Command
Processor
Program
Object Code
DDL
Compiler
File
Manager
Buffer
Manager
Recovery
Manager
Scheduler
Query
Optimiser
Transaction
Manager
Query
Compiler
Queries
Catalogue
Manager
DML
Preprocessor
Database
Schema
Application
Programs
Database
Manager
Data
Manager
DBMS
Programmers Users DB Admins
Based on 'Components of a DBMS', Database Systems,
T. Connolly and C. Begg, Addison-Wesley 2010
Data, Indices and
System Catalogue
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 3April 24, 2015
Basic Index Concepts
An index is used to efficiently access specific data e.g. index in a textbook
Often database queries reference only a small number of
records in a file indices can be used to enhance the search for these records
most DBMSs automatically create an index for primary keys
A search key is an attribute (or set of attributes) that is
used to lookup records in a file
An index file consists of (searchKey, pointer) index
entries normally much smaller than the original file
pointer identifies a disk block and a record offset within the block
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 4April 24, 2015
Basic Index Concepts ...
There are two basic types of indices ordered indices
- based on a sorted ordering of search keys
hash indices
- search keys are uniformely distributed to different buckets based on a hash
function
Indexing techniques have to be evaluated based on access types
- e.g. records with a given attribute value or an attribute value in a specific range
access time
insertion and deletion time
- if the underlying data is changed, all indices have to be updated which may
result in a significant overhead for modifications
space overhead
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 5April 24, 2015
Ordered Indices
An ordered index stores the search key values in sorted order e.g. author catalogue in a library
A primary index (clustering index) is an index whose
search key also defines the sequential order of the file the primary key is often used as a search key of a primary index
files with a clustering index on the search key are calledindex-sequential-files
A secondary index (non-clustering index) is an index
whose search key has a different order than the
sequential file order
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 6April 24, 2015
Dense Index
A dense index contains an entry for each search key in the file
An index record of a dense primary index points to the first file record with the given search key
An index record of a dense secondary index stores a list of pointers to the records with the same search key value
de Rover
Frei
Jones
Meier
Michelin
2 de Rover Pieter Brussels
1 Frei Urs Zurich
8 Frei Urs Zurich
14 Jones Andrew Paris
53 Meier Beat Zurich
5 Michelin Robert Parisdense index
index-sequential file
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 7April 24, 2015
Dense Index Insertion
find the search key of the record to be inserted in the index
if (index contains search key) {
if (index stores pointers to all records) {add a pointer to the new record to the index entry
}else { // index stores pointer to the first record onlyensure that the record is inserted after all other records with thesame search key
}
}else { // index does not contain the search keyinsert an index record with the search key at the appropriate position
}
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 8April 24, 2015
Dense Index Deletion
D
Update can be modelled as a delete followed by an insert
look up the record to be deleted
if (deleted record is the only one with the search key value) {delete index record
}else { // there are other records with the same search key
if (index stores pointers to all records) {delete the corresponding pointer from the index record
}else { // index stores pointer to the first record only
if (deleted record was the first record with the search key value) {update the index record to point to the next record
}
}}
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 9April 24, 2015
Sparse Index
Contains only some of the search key values can only be used for primary indices!
To search a given record in a sparse index the following
steps have to be performed1) find the index record ri with the largest search key value that is
smaller or equal to the search key value to be found
2) start a linear search at record ri until the record is found
de Rover
Jones
Michelin
2 de Rover Pieter Brussels
1 Frei Urs Zurich
8 Frei Urs Zurich
14 Jones Andrew Paris
53 Meier Beat Zurich
5 Michelin Robert Paris
sparse index
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 10April 24, 2015
Sparse Index
A sparse index consumes less space and produces less
maintenance overhead for insertions and deletions
On the other hand it takes more time to locate a record
based on a sparse index
A good trade-off is a sparse index with one index record
for each block
...
sparse index
...
block 1
block 2
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 11April 24, 2015
Sparse Index Insertion
// assumption: one index entry for each block
if (new block is created) {insert first search key value in the new block into the index
}else { // no new block
if (new record has the smallest search key value in the block) {update the index record pointing to the block
}
}
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 12April 24, 2015
Sparse Index Deletion
Update can be modelled as a delete followed by an insert
if (index contains the search key value of deleted record) {if (deleted record is the only one with the search key value) {
if (search key value of the next record is not in the index {update the index record to point to the next record
}else { // search key value of next record is already in the indexdelete index entry
}
}else { // there are other records with the same search key
if (deleted record was the first record with the search key value) {update the index entry to point to the next record with the samesearch key value
}}
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 13April 24, 2015
Multilevel Index
A multilevel index can be used if the
index grows too large to be kept in memory inner index froms the primary index file
outer index is a sparse index on the inner index
...
inner index
...
datablock 1
datablock 2
...
...
outer index
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 14April 24, 2015
Multilevel Index ...
If even the outer index grows too large to be kept in
memory, the process can be repeated note that instead of this type of multilevel index it is preferred to
use B+-trees
The update of the multilevel index (on insertion and
deletion) is an extension of the updates described for the
single level indices
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 15April 24, 2015
Secondary Index
A secondary index has to be a dense index and must
point to all the records with the same search key value e.g. indirection via buckets that contain the pointers
The use of a secondary index normally requires more
I/O operations than a primary index
Brussels
Paris
Zurich
2 de Rover Pieter Brussels
1 Frei Urs Zurich
8 Frei Urs Zurich
14 Jones Andrew Paris
53 Meier Beat Zurich
5 Michelin Robert Paris
secondary index
buckets
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 16April 24, 2015
Index-Sequential File Problems
Index-sequential files have a number of disadvantages degradation and loss of performance when the file grows since
many overflow blocks have to be created
the entire file has to be periodically reorganised
Alternative index structures that do not show a loss of
performance after updates can be used e.g. the B+-tree index file structure is a widely used multilevel
index structure that is based on a B+-tree (balanced and sorted )
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 17April 24, 2015
B+-Tree
Properties of a balanced B+-tree every path from the root to a leaf node has the same length
for a given n, each leaf node has between n-1)/2 and n-1 values
- leaf nodes must always be at least half full
each non-leaf (and non-root) node has between n/2 and nchildren (fanout)
typical B+-tree node
- K1,..., Km-1 are the ordered search key values (Ki < Ki+1)
- a non-leaf node has pointers P1,..., Pm to child nodes
- a leaf node has pointers P1,..., Pm to file records or to buckets of pointers
- typically the size of a disk block
if the root node is not a leaf node it has at least 2 children; otherwise it has between 0 and n-1 values
P1 K1 P2 ... Pm-1 Km-1 Pm ...
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 18April 24, 2015
B+-Tree Leaf Nodes
Leaf node example (n=3)
for 1 i <n the pointers Pi
point to a file record with thesearch key value Ki or to a bucketof pointers for search keys that arenon-ordered non-candidate keys
the pointer Pn points to the next leaf node in search key order
de Rover Frei
2 de Rover Pieter Brussels
1 Frei Urs Zurich
8 Frei Urs Zurich
...
next leaf node
index-sequential file
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 19April 24, 2015
B+-Tree Non-Leaf (Internal) Nodes
The non-leaf nodes form a multilevel sparse index on the
leaf nodes
For a non-leaf node with m pointers the following holds
the values of the search keys in the subtree that P1 points to are all smaller than K1
for 2 i <m the search key values in the subtree to which Pi
points to have values greater than or equal to Ki-1 and smaller than Ki
the search keys in the subtree to which Pm points to have values greather than or equal to Km-1
P1 K1 P2 ... Pm-1 Km-1 Pm ...
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 20April 24, 2015
B+-Tree Example
Example B+-tree for n=3 root node must have at least two children
non-leaf nodes must have between 2 and 3 children (for n=3)
leaf nodes must have between 1 and 2 values (for n=3)
note that the node size is normally defined by the block size and the fanout is typically in the range of 50-100 (e.g. n=100)
Bush Frei Jones Meier Otlet Price
Jones Otlet
Meier
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 21April 24, 2015
B+-Tree Properties
Logically close blocks do not have to be physically close
since the inter-node connections are realised via pointers
B+-trees generally have a small height if there are k search key values, the B+-tree tree height is not
greater than log n/2 (k)
- e.g. for 5'000'000 search keys and n = 100 we get a height not greater than 4
(note that the height would be 23 for a binary tree)
search operations can be performed efficiently (without many block accesses)
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 22April 24, 2015
B+-Tree Lookup
Find all records for a given search key v
Bush Frei Jones Meier Otlet Price
Jones Otlet
Meier
c = root node
while (c is not a leaf node) { let Ki = smallest search key value in c greater than v
if (Ki exists) {c = node pointed to by Pi
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 23April 24, 2015
B+-Tree Lookup ...
The lookup requires O(lognk) I/O operations in worst case
}else { // no search key greater than vc = node pointed to by Pm // where m is the number of pointers
}
}
if (a key value Ki in c with the value v exists) {pointer Pi leads to the desired record or bucket
}else {a record with the given search key value v does not exist
}
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 24April 24, 2015
B+-Tree Insertion
(1) Find the leaf node in which the search key value would
appear using the lookup algorithm presented before and
add the new record to the file
(2) If the search key is already present in the leaf node if necessary add a new pointer to the bucket
(3) If the search key value is not yet present if there is room in the leaf node then insert a new index record
if there is no room then split the node
- take the n index records (inlcuding the one to be inserted) in sorted order and
place the first n/2records in the original node and the rest in a new node
- let the new node be p and the smallest value in p be k. Insert (p,k) in the parent
node (if there is no space in the parent node, the split is propagated upwards)
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 25April 24, 2015
B+-Tree Insertion ...
In worst case, the root node has to be split and the
height of the tree is increased by 1
The insertion requires O(lognk) I/O operations in worst
case
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 26April 24, 2015
B+-Tree Insertion Example
insertion of Codd
Bush Frei Jones Meier Otlet Price
Jones Otlet
Meier
Bush Codd Meier Otlet Price
Frei Jones Otlet
Meier
JonesFrei
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 27April 24, 2015
B+-Tree Deletion
(1) Find the record to be deleted by using the lookup
algorithm and delete it from the file
(2) Remove the search key and value from the leaf index
node if there is no bucket or if the bucket is empty
(3) If the node has no longer enough entries due to the
deleted index record and the entries in the node and a
sibling fit into a single node then merge the siblings insert the search key values of the two nodes into the left node
and delete the right node
recursively delete the entry for the deleted node from the parent node (propagation)
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 28April 24, 2015
B+-Tree Deletion ...
(4) If the node has no longer enough entries due to the
deleted index record but the entries in the node and a
sibling do not fit into a single node then redistribute the
pointers redistribute the pointers between the node and a sibling such that
both have more than the minimum number of entries
update the search key values in the parent node
If the root node has only one pointer after deletion, it is
deleted and its child becomes the root
The deletion of a previously located record requires
O(lognk) I/O operations in worst case note that there are variants where leaf nodes are not merged
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 29April 24, 2015
B+-Tree Deletion Example
deletion of Frei
Bush Codd Meier Otlet Price
Frei Jones Otlet
Meier
JonesFrei
Bush Codd Meier Otlet Price
Jones Otlet
Meier
Jones
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 30April 24, 2015
B+-Tree Deletion Example ...
deletion of Meier
Bush Codd Otlet PriceJones
Bush Codd Meier Otlet Price
Jones Otlet
Meier
Jones
Jones Otlet
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 31April 24, 2015
B+-Tree Deletion Example ...
deletion of Meier
Bush Codd Meier Otlet Price
Frei Jones Otlet
Meier
JonesFrei
Bush Codd Otlet Price
Frei Otlet
Jones
JonesFrei
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 32April 24, 2015
Features of B+-Tree Index Files
Advantages automatic reorganisation with small local changes for insertions
and deletions
no reorganisation of the file is required to maintain the performance (solves index-sequential file problem)
Disadvantages insertion and deletion overhead
space overhead
The advantages of B+-trees outweigh the disadvantages
which makes them an extensively used data structure
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 33April 24, 2015
B+-Tree File Organisation
We can use the same B+-tree structure to organise the
records in a file and thereby solve the problem of data
file degradation
In a B+-tree file organisation, the leaves of the tree store
records instead of pointers to records number of records in the leaf nodes normally smaller than the
number of pointers on non-leave nodes
leaf nodes still have to be at least half full
two siblings can be used in splits/merges for space optimisation
adjacent nodes in the tree may not be continuous on the disk
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 34April 24, 2015
Indexing Strings
The use of strings as search keys introduces two major
problems the strings can be of variable length which leads to different
fanouts
the split and merge operations should no longer be based on the number of search keys but on the fraction of the used space
The fanout of nodes can be increased by using a prefix
compression technique we no longer store the entire search key in non-leaf nodes but
only the part that is necessary to distinguish between the key values in the subtrees
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 35April 24, 2015
B+-Trees vs. B-Trees
B-trees store every search key value only once
(no redundancy) additional pointers to the file records have to be added for non-
leaf nodes
Advantages of B-trees may use less nodes than the corresponding B+-trees
may find the search key before reaching the leaf node
Disadvantages of B-trees non-leaf nodes get larger which results in a deeper trees
insertion and deletion gets more complicated
Normally the advantages of B-trees do not outweigh their
disadvantages
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 36April 24, 2015
Multiple-Key Access
For queries that need fast search on the combination of
multiple attributes we can define composite search keys e.g. search key (name, city) for customers
The ordering of composite search keys is the
lexicographic ordering e.g. (a1, a2) < (b1, b2) if
- a1 <b1, or
- a1 = b1 and a2 <b2
An index with composite search keys can be used to
efficiently answer queries of the form WHERE name = 'Max Frisch' AND city = 'Zurich'
WHERE name = 'Max Frisch' AND city < 'Brussels'
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 37April 24, 2015
Static Hashing
A bucket is a storage unit containing one or more records typically a disk block
In a hash file organisation we get the bucket in a file by
using a hash function on the search key value
A hash function h is a function from the set of search key
values K to the set of buckets B distribution should be uniform and random
The hash function can be used to locate, insert and
delete records
A bucket may contain records with different search keys bucket has to be sequentially searched
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 38April 24, 2015
Hash File Organisation Example
example of a hash function: binary representation of search key modulo 6
14 Jones Andrew Paris2 Rover Pieter Brussels
53 Meier Beat Zurich
1 Frei Urs Zurich
8 Frei Urs Zurich
5 Michelin Robert Paris
bucket 0 bucket 1
bucket 2 bucket 3
bucket 4 bucket 5
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 39April 24, 2015
Bucket Overflows
Buckets can overflow due to different reasons insufficent number of buckets
a skew in the distribution of records
- many records with the same search key
- non-uniform hash function
The possibility for bucket overflows can be reduced but
we cannot prevent overflows bucket overflows are handled by a linked list of additional
overflow buckets
this form of hashing with overflow buckets is called closed hashing
Open hashing does not allow overflow buckets other existing buckets have to be "misused" (linear probing)
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 40April 24, 2015
Hash Index
Hashing cannot only be used for file organisation but
also for the creation of an index
A hash index manages the search keys with their record
pointers in a hash file structure
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 41April 24, 2015
1
53
Hash Index Example
2 de Rover Pieter Brussels
1 Frei Urs Zurich
8 Frei Urs Zurich
14 Jones Andrew Paris
53 Meier Beat Zurich
5 Michelin Robert Paris
5
bucket 0
bucket 1
14
bucket 2
2
bucket 3
8
overflow bucket
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 42April 24, 2015
Static Hashing Problems
The hash function maps the search key values to a fixed
number of buckets B but the database will grow or shrink
over time if the file grows and the initial number of buckets is too small, the
performance decreases since overflow buckets are necessary
if space is preallocated for future growth, some space will be wasted initially
A possible solution would be to reorganise the file with a
new hash function from time to time too expensive and has to be performed exclusively
An alternative is to dynamically modify the number of
buckets (dynamic hashing)
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 43April 24, 2015
Ordered Indexing vs. Hashing
The database designer has to choose which form of
index should be used based on different criteria cost of periodic reorganisation of the index
frequency of update operations
optimised average or worst case access time
expected types of queries
- hashing is generally better for retrieving records with a specific attribute value
• constant average lookup time
• worst case lookup time proportional to the number of attribute values!
- ordered indices are preferred for range queries where the attribute value lies
within a range of values (e.g. WHERE A > C1 AND A < C2)
In practice most DBMSs support an ordered B+-tree index but not all of them
support hash indices
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 44April 24, 2015
Bitmap Index
A bitmap index can be used to efficiently query on
multiple keys/attributes
The following assumptions must hold records in a relation must be numbered sequentially (e.g. starting
from 0)
given a number i, it must be easy to find the ith record
- easy if records have fixed size
Bitmaps are applicable on attributes with a relatively
small number of distinct values e.g. gender, city, ...
A bitmap is an array with as many bits as records the ith bit is 1 if record i has the desired attribute value
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 45April 24, 2015
Bitmap Index ...
Queries can be answered by using bitmap operations and (intersection), or (union) and not (complementation)
- e.g. find all males in Zurich: 101101 AND 011010 = 001000
- e.g. find females not living in Paris: 010010 AND 111010 = 010010
Retrieve records based on the resulting bit array can also be used to count the number of records with a given condition
2 de Rover Pieter m Brussels
21 Brown Kathy f Zurich
8 Frei Urs m Zurich
14 Jones Andrew m Paris
9 Jones Lea f Zurich
5 Michelin Robert m Paris
101101
010010
100000
000101
011010
m
f
Brussel
Paris
Zurich
bitmaps for gender
bitmaps for location
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 46April 24, 2015
Index Definition in SQL
The create index command can be used to create an
index on a relation different DBMSs support different types of indices (e.g. B+-tree,
hash index, etc.)
Example
CREATE INDEX nameIndex ON Customer USING BTREE (name);
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 47April 24, 2015
Summary
Different types of indices can be used to speed up the
search of specific records in a file
The number and types of indices have to be designed
carefully since each index adds additional maintenance
costs on update operations
The structures that are used for the indices (e.g. B+-tree
or hash index) can also be used to manage the records
in a file and reduce data file degradation
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 48April 24, 2015
Homework
Study the following chapter of the
Database System Concepts book chapter 11
- sections 11.1-11.11
- Indexing and Hashing
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 49April 24, 2015
Exercise 9
Storage, Indexing and Hashing
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 50April 24, 2015
References
A. Silberschatz, H. Korth and S. Sudarshan,
Database System Concepts (Sixth Edition),
McGraw-Hill, 2010
2 December 2005
Next LectureQuery Processing and Optimisation
Recommended