Access Methods - Lecture 9 - Introduction to Databases (1007156ANR)

2 December 2005

Introduction to DatabasesAccess Methods

Prof. Beat Signer

Department of Computer Science

Vrije Universiteit Brussel

http://www.beatsigner.com

Beat Signer - Department of Computer Science - bsigner@vub.ac.be 2April 24, 2015

Context of Today's Lecture

Access

Methods

System

Buffers

Authorisation

Control

Integrity

Checker

Command

Processor

Program

Object Code

Compiler

Manager

Buffer

Manager

Recovery

Manager

Scheduler

Optimiser

Transaction

Manager

Compiler

Queries

Catalogue

Manager

Preprocessor

Database

Schema

Application

Programs

Database

Manager

Programmers Users DB Admins

Based on 'Components of a DBMS', Database Systems,

T. Connolly and C. Begg, Addison-Wesley 2010

Data, Indices and

System Catalogue

Basic Index Concepts

An index is used to efficiently access specific data e.g. index in a textbook

Often database queries reference only a small number of

records in a file indices can be used to enhance the search for these records

most DBMSs automatically create an index for primary keys

A search key is an attribute (or set of attributes) that is

used to lookup records in a file

An index file consists of (searchKey, pointer) index

entries normally much smaller than the original file

pointer identifies a disk block and a record offset within the block

Basic Index Concepts ...

There are two basic types of indices ordered indices

- based on a sorted ordering of search keys

hash indices

- search keys are uniformely distributed to different buckets based on a hash

function

Indexing techniques have to be evaluated based on access types

- e.g. records with a given attribute value or an attribute value in a specific range

access time

insertion and deletion time

- if the underlying data is changed, all indices have to be updated which may

result in a significant overhead for modifications

space overhead

Ordered Indices

An ordered index stores the search key values in sorted order e.g. author catalogue in a library

A primary index (clustering index) is an index whose

search key also defines the sequential order of the file the primary key is often used as a search key of a primary index

files with a clustering index on the search key are calledindex-sequential-files

A secondary index (non-clustering index) is an index

whose search key has a different order than the

sequential file order

Dense Index

A dense index contains an entry for each search key in the file

An index record of a dense primary index points to the first file record with the given search key

An index record of a dense secondary index stores a list of pointers to the records with the same search key value

de Rover

Michelin

2 de Rover Pieter Brussels

1 Frei Urs Zurich

8 Frei Urs Zurich

14 Jones Andrew Paris

53 Meier Beat Zurich

5 Michelin Robert Parisdense index

index-sequential file

Dense Index Insertion

find the search key of the record to be inserted in the index

if (index contains search key) {

if (index stores pointers to all records) {add a pointer to the new record to the index entry

}else { // index stores pointer to the first record onlyensure that the record is inserted after all other records with thesame search key

}else { // index does not contain the search keyinsert an index record with the search key at the appropriate position

Dense Index Deletion

Update can be modelled as a delete followed by an insert

look up the record to be deleted

if (deleted record is the only one with the search key value) {delete index record

}else { // there are other records with the same search key

if (index stores pointers to all records) {delete the corresponding pointer from the index record

}else { // index stores pointer to the first record only

if (deleted record was the first record with the search key value) {update the index record to point to the next record

Sparse Index

Contains only some of the search key values can only be used for primary indices!

To search a given record in a sparse index the following

steps have to be performed1) find the index record ri with the largest search key value that is

smaller or equal to the search key value to be found

2) start a linear search at record ri until the record is found

de Rover

Michelin

1 Frei Urs Zurich

8 Frei Urs Zurich

5 Michelin Robert Paris

sparse index

Sparse Index

A sparse index consumes less space and produces less

maintenance overhead for insertions and deletions

On the other hand it takes more time to locate a record

based on a sparse index

A good trade-off is a sparse index with one index record

for each block

sparse index

block 1

block 2

Sparse Index Insertion

// assumption: one index entry for each block

if (new block is created) {insert first search key value in the new block into the index

}else { // no new block

if (new record has the smallest search key value in the block) {update the index record pointing to the block

Sparse Index Deletion

Update can be modelled as a delete followed by an insert

if (index contains the search key value of deleted record) {if (deleted record is the only one with the search key value) {

if (search key value of the next record is not in the index {update the index record to point to the next record

}else { // search key value of next record is already in the indexdelete index entry

}else { // there are other records with the same search key

if (deleted record was the first record with the search key value) {update the index entry to point to the next record with the samesearch key value

Multilevel Index

A multilevel index can be used if the

index grows too large to be kept in memory inner index froms the primary index file

outer index is a sparse index on the inner index

inner index

datablock 1

datablock 2

outer index

Multilevel Index ...

If even the outer index grows too large to be kept in

memory, the process can be repeated note that instead of this type of multilevel index it is preferred to

use B+-trees

The update of the multilevel index (on insertion and

deletion) is an extension of the updates described for the

single level indices

Secondary Index

A secondary index has to be a dense index and must

point to all the records with the same search key value e.g. indirection via buckets that contain the pointers

The use of a secondary index normally requires more

I/O operations than a primary index

Brussels

Zurich

1 Frei Urs Zurich

8 Frei Urs Zurich

secondary index

buckets

Index-Sequential File Problems

Index-sequential files have a number of disadvantages degradation and loss of performance when the file grows since

many overflow blocks have to be created

the entire file has to be periodically reorganised

Alternative index structures that do not show a loss of

performance after updates can be used e.g. the B+-tree index file structure is a widely used multilevel

index structure that is based on a B+-tree (balanced and sorted )

B+-Tree

Properties of a balanced B+-tree every path from the root to a leaf node has the same length

for a given n, each leaf node has between n-1)/2 and n-1 values

- leaf nodes must always be at least half full

each non-leaf (and non-root) node has between n/2 and nchildren (fanout)

typical B+-tree node

- K1,..., Km-1 are the ordered search key values (Ki < Ki+1)

- a non-leaf node has pointers P1,..., Pm to child nodes

- a leaf node has pointers P1,..., Pm to file records or to buckets of pointers

- typically the size of a disk block

if the root node is not a leaf node it has at least 2 children; otherwise it has between 0 and n-1 values

P1 K1 P2 ... Pm-1 Km-1 Pm ...

B+-Tree Leaf Nodes

Leaf node example (n=3)

for 1 i <n the pointers Pi

point to a file record with thesearch key value Ki or to a bucketof pointers for search keys that arenon-ordered non-candidate keys

the pointer Pn points to the next leaf node in search key order

de Rover Frei

1 Frei Urs Zurich

8 Frei Urs Zurich

next leaf node

index-sequential file

B+-Tree Non-Leaf (Internal) Nodes

The non-leaf nodes form a multilevel sparse index on the

leaf nodes

For a non-leaf node with m pointers the following holds

the values of the search keys in the subtree that P1 points to are all smaller than K1

for 2 i <m the search key values in the subtree to which Pi

points to have values greater than or equal to Ki-1 and smaller than Ki

the search keys in the subtree to which Pm points to have values greather than or equal to Km-1

P1 K1 P2 ... Pm-1 Km-1 Pm ...

B+-Tree Example

Example B+-tree for n=3 root node must have at least two children

non-leaf nodes must have between 2 and 3 children (for n=3)

leaf nodes must have between 1 and 2 values (for n=3)

note that the node size is normally defined by the block size and the fanout is typically in the range of 50-100 (e.g. n=100)

Bush Frei Jones Meier Otlet Price

Jones Otlet

B+-Tree Properties

Logically close blocks do not have to be physically close

since the inter-node connections are realised via pointers

B+-trees generally have a small height if there are k search key values, the B+-tree tree height is not

greater than log n/2 (k)

- e.g. for 5'000'000 search keys and n = 100 we get a height not greater than 4

(note that the height would be 23 for a binary tree)

search operations can be performed efficiently (without many block accesses)

B+-Tree Lookup

Find all records for a given search key v

Jones Otlet

c = root node

while (c is not a leaf node) { let Ki = smallest search key value in c greater than v

if (Ki exists) {c = node pointed to by Pi

B+-Tree Lookup ...

The lookup requires O(lognk) I/O operations in worst case

}else { // no search key greater than vc = node pointed to by Pm // where m is the number of pointers

if (a key value Ki in c with the value v exists) {pointer Pi leads to the desired record or bucket

}else {a record with the given search key value v does not exist

B+-Tree Insertion

(1) Find the leaf node in which the search key value would

appear using the lookup algorithm presented before and

add the new record to the file

(2) If the search key is already present in the leaf node if necessary add a new pointer to the bucket

(3) If the search key value is not yet present if there is room in the leaf node then insert a new index record

if there is no room then split the node

- take the n index records (inlcuding the one to be inserted) in sorted order and

place the first n/2records in the original node and the rest in a new node

- let the new node be p and the smallest value in p be k. Insert (p,k) in the parent

node (if there is no space in the parent node, the split is propagated upwards)

B+-Tree Insertion ...

In worst case, the root node has to be split and the

height of the tree is increased by 1

The insertion requires O(lognk) I/O operations in worst

B+-Tree Insertion Example

insertion of Codd

Jones Otlet

Bush Codd Meier Otlet Price

Frei Jones Otlet

JonesFrei

B+-Tree Deletion

(1) Find the record to be deleted by using the lookup

algorithm and delete it from the file

(2) Remove the search key and value from the leaf index

node if there is no bucket or if the bucket is empty

(3) If the node has no longer enough entries due to the

deleted index record and the entries in the node and a

sibling fit into a single node then merge the siblings insert the search key values of the two nodes into the left node

and delete the right node

recursively delete the entry for the deleted node from the parent node (propagation)

B+-Tree Deletion ...

(4) If the node has no longer enough entries due to the

deleted index record but the entries in the node and a

sibling do not fit into a single node then redistribute the

pointers redistribute the pointers between the node and a sibling such that

both have more than the minimum number of entries

update the search key values in the parent node

If the root node has only one pointer after deletion, it is

deleted and its child becomes the root

The deletion of a previously located record requires

O(lognk) I/O operations in worst case note that there are variants where leaf nodes are not merged

B+-Tree Deletion Example

deletion of Frei

Frei Jones Otlet

JonesFrei

Jones Otlet

B+-Tree Deletion Example ...

deletion of Meier

Bush Codd Otlet PriceJones

Jones Otlet

B+-Tree Deletion Example ...

deletion of Meier

Frei Jones Otlet

JonesFrei

Bush Codd Otlet Price

Frei Otlet

JonesFrei

Features of B+-Tree Index Files

Advantages automatic reorganisation with small local changes for insertions

and deletions

no reorganisation of the file is required to maintain the performance (solves index-sequential file problem)

Disadvantages insertion and deletion overhead

space overhead

The advantages of B+-trees outweigh the disadvantages

which makes them an extensively used data structure

B+-Tree File Organisation

We can use the same B+-tree structure to organise the

records in a file and thereby solve the problem of data

file degradation

In a B+-tree file organisation, the leaves of the tree store

records instead of pointers to records number of records in the leaf nodes normally smaller than the

number of pointers on non-leave nodes

leaf nodes still have to be at least half full

two siblings can be used in splits/merges for space optimisation

adjacent nodes in the tree may not be continuous on the disk

Indexing Strings

The use of strings as search keys introduces two major

problems the strings can be of variable length which leads to different

fanouts

the split and merge operations should no longer be based on the number of search keys but on the fraction of the used space

The fanout of nodes can be increased by using a prefix

compression technique we no longer store the entire search key in non-leaf nodes but

only the part that is necessary to distinguish between the key values in the subtrees

B+-Trees vs. B-Trees

B-trees store every search key value only once

(no redundancy) additional pointers to the file records have to be added for non-

leaf nodes

Advantages of B-trees may use less nodes than the corresponding B+-trees

may find the search key before reaching the leaf node

Disadvantages of B-trees non-leaf nodes get larger which results in a deeper trees

insertion and deletion gets more complicated

Normally the advantages of B-trees do not outweigh their

disadvantages

Multiple-Key Access

For queries that need fast search on the combination of

multiple attributes we can define composite search keys e.g. search key (name, city) for customers

The ordering of composite search keys is the

lexicographic ordering e.g. (a1, a2) < (b1, b2) if

- a1 <b1, or

- a1 = b1 and a2 <b2

An index with composite search keys can be used to

efficiently answer queries of the form WHERE name = 'Max Frisch' AND city = 'Zurich'

WHERE name = 'Max Frisch' AND city < 'Brussels'

Static Hashing

A bucket is a storage unit containing one or more records typically a disk block

In a hash file organisation we get the bucket in a file by

using a hash function on the search key value

A hash function h is a function from the set of search key

values K to the set of buckets B distribution should be uniform and random

The hash function can be used to locate, insert and

delete records

A bucket may contain records with different search keys bucket has to be sequentially searched

Hash File Organisation Example

example of a hash function: binary representation of search key modulo 6

14 Jones Andrew Paris2 Rover Pieter Brussels

1 Frei Urs Zurich

8 Frei Urs Zurich

bucket 0 bucket 1

bucket 2 bucket 3

bucket 4 bucket 5

Bucket Overflows

Buckets can overflow due to different reasons insufficent number of buckets

a skew in the distribution of records

- many records with the same search key

- non-uniform hash function

The possibility for bucket overflows can be reduced but

we cannot prevent overflows bucket overflows are handled by a linked list of additional

overflow buckets

this form of hashing with overflow buckets is called closed hashing

Open hashing does not allow overflow buckets other existing buckets have to be "misused" (linear probing)

Hash Index

Hashing cannot only be used for file organisation but

also for the creation of an index

A hash index manages the search keys with their record

pointers in a hash file structure

Hash Index Example

1 Frei Urs Zurich

8 Frei Urs Zurich

bucket 0

bucket 1

bucket 2

bucket 3

overflow bucket

Static Hashing Problems

The hash function maps the search key values to a fixed

number of buckets B but the database will grow or shrink

over time if the file grows and the initial number of buckets is too small, the

performance decreases since overflow buckets are necessary

if space is preallocated for future growth, some space will be wasted initially

A possible solution would be to reorganise the file with a

new hash function from time to time too expensive and has to be performed exclusively

An alternative is to dynamically modify the number of

buckets (dynamic hashing)

Ordered Indexing vs. Hashing

The database designer has to choose which form of

index should be used based on different criteria cost of periodic reorganisation of the index

frequency of update operations

optimised average or worst case access time

expected types of queries

- hashing is generally better for retrieving records with a specific attribute value

• constant average lookup time

• worst case lookup time proportional to the number of attribute values!

- ordered indices are preferred for range queries where the attribute value lies

within a range of values (e.g. WHERE A > C1 AND A < C2)

In practice most DBMSs support an ordered B+-tree index but not all of them

support hash indices

Bitmap Index

A bitmap index can be used to efficiently query on

multiple keys/attributes

The following assumptions must hold records in a relation must be numbered sequentially (e.g. starting

from 0)

given a number i, it must be easy to find the ith record

- easy if records have fixed size

Bitmaps are applicable on attributes with a relatively

small number of distinct values e.g. gender, city, ...

A bitmap is an array with as many bits as records the ith bit is 1 if record i has the desired attribute value

Bitmap Index ...

Queries can be answered by using bitmap operations and (intersection), or (union) and not (complementation)

- e.g. find all males in Zurich: 101101 AND 011010 = 001000

- e.g. find females not living in Paris: 010010 AND 111010 = 010010

Retrieve records based on the resulting bit array can also be used to count the number of records with a given condition

2 de Rover Pieter m Brussels

21 Brown Kathy f Zurich

8 Frei Urs m Zurich

14 Jones Andrew m Paris

9 Jones Lea f Zurich

5 Michelin Robert m Paris

101101

010010

100000

000101

011010

Brussel

Zurich

bitmaps for gender

bitmaps for location

Index Definition in SQL

The create index command can be used to create an

index on a relation different DBMSs support different types of indices (e.g. B+-tree,

hash index, etc.)

Example

CREATE INDEX nameIndex ON Customer USING BTREE (name);

Summary

Different types of indices can be used to speed up the

search of specific records in a file

The number and types of indices have to be designed

carefully since each index adds additional maintenance

costs on update operations

The structures that are used for the indices (e.g. B+-tree

or hash index) can also be used to manage the records

in a file and reduce data file degradation

Homework

Study the following chapter of the

Database System Concepts book chapter 11

- sections 11.1-11.11

- Indexing and Hashing

Exercise 9

Storage, Indexing and Hashing

References

A. Silberschatz, H. Korth and S. Sudarshan,

Database System Concepts (Sixth Edition),

McGraw-Hill, 2010

2 December 2005

Next LectureQuery Processing and Optimisation

Access Methods - Lecture 9 - Introduction to Databases (1007156ANR)

Education

Information Retrieval Methods - Penn Engineeringzives/03s/cis650/ir.pdf · Databases vs. Information Retrieval DATABASES We know the schema in advance, so semantic correlation between

Security Methods for Statistical Databases by Karen Goodwin

Storage Management - Lecture 8 - Introduction to Databases (1007156ANR)

Relational Model and Relational Algebra - Lecture 3 - Introduction to Databases (1007156ANR)

Forecasting Health Expenditure: Methods and Applications ......CHEPA WORKING PAPER SERIES Paper 15-05 Forecasting Health Expenditure: Methods and Applications to International Databases

INFO 629 Dr. R. Weber Copyright R. Weber Knowledge representation methods Knowledge bases, case bases, databases

Lecture 09 Indexing Methods - University of Engineering ...web.uettaxila.edu.pk/CMS/seADMSbsSp09/notes/ADBMS-Lecture-9...Indexing Methods Lecture 9 Storage Requirements of Databases

Extended ER Model and other Modelling Languages - Lecture 2 - Introduction to Databases (1007156ANR)

Techniques, Methods and Advances in Missing Person ......Polymarker Offender DNA Databases 1997 2004 1987 Automated Sequencer 3 Techniques, Methods, Advances SNPs Electrospray Ionization

Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Time Series databases Text databases

Introduction and Conceptual Modelling - Lecture 1 - Introduction to Databases (1007156ANR)

Content Overview of TCM TCM and systems biology Computer methods and databases of TCM

Heuristic Methods€¦ · •Interpreting the results •Overview of annotated ‘meta’ or ‘curated’ databases Armstrong, 2007 Bioinformatics 2 DNA Sequence Databases •Raw

Databases@MPA, access methods and plans

Intro Protein structure Motifs Motif databases End … Protein structure Motifs Motif databases End Last time • Probability based methods • How ﬁnd a good root? • Reliability

A SURVEY OF THE TRENDS IN FACIAL AND EXPRESSION RECOGNITION DATABASES AND METHODS

Object and Object-Relational Databases - Lecture 12 - Introduction to Databases (1007156ANR)

Security-Control Methods for Statistical Databases: A ...utdallas.edu/~muratk/courses/privacy08f_files/stat_database_sec.pdf · Security-Control Methods for Statistical Databases:

Query Processing and Optimisation - Lecture 10 - Introduction to Databases (1007156ANR)

Transaction Management - Lecture 11 - Introduction to Databases (1007156ANR)