University of Dublin Trinity College Index Structures for Files Owen.Conlan@scss.tcd.ie


Page 1: Index Structures for Files - Trinity College, Dublin 4D2b Index Structures f… · Index Structures for Files Owen.Conlan@scss.tcd.ie . Why do we index ... We can ask the DBMS to

University of Dublin Trinity College

Index Structures for Files

Owen.Conlan@scss.tcd.ie


Why do we index in the physical world?

The last few pages of many books contain an index. Such an index is a table containing a list of topics (keys) and the numbers of the pages where the topics can be found (reference fields). All indexes are based on the same concepts: keys and reference fields.

Consider what would happen if we tried to binary search the words in a book:

•  Sorting the words would have a bad effect on the meaning of the book

•  Adding an index allows us to impose an order on a file without actually re-arranging it


Why do we index in the physical world?

We want to find some books in a library. We want to locate books by a specific author, by their title or by their subject area. One way to organise the books so we can do this is to have three separate copies of each book and three separate libraries. All the books in one library would be arranged (sorted or hashed) according to author, another would arrange them by subject and a third by title.


Why do we index in the physical world?

A better system is to use a card catalogue

•  A set of three indexes, each based on a different key field

•  All of the indexes use the same catalogue number as a reference field

•  Each index allows us to efficiently search a file based on the different data we are looking for

•  An index may be arranged as a sorted list which can be binary searched, a hash table, or a tree structure of the type we'll look at in the coming lectures


Indexing files in IT?

Indexes are auxiliary access structures

•  Speed up retrieval of records in response to certain search conditions

•  Any field can be used to create an index and multiple indexes on different fields can be created

The index is separate from the main file and can be created and destroyed without affecting the main file.

•  The index must be updated when records are inserted or deleted to/from the main file.

The issue is how to organise the index records for efficient access and ease of maintenance.


Advantages over Hashing

Multiple indexes can be built for the same file, allowing efficient access over multiple fields


Static/Dynamic Indexes

Static index structures

•  Static indexes are of fixed size and structure, though their contents may change.

•  As we will see, they require periodic reorganisation

•  IBM’s ISAM (Indexed Sequential Access Method) uses static index structures

•  Covered in these lectures

Dynamic index structures

•  Dynamic indexes change shape gradually in order to preserve efficiency.

•  Implemented as search trees (e.g. B-Trees, AVL Trees etc.)

•  Covered later in the course


Dense Index

An index record appears for every search-key value


Sparse Index

A sparse index contains index records for only some search-key values.

•  Applicable when records are sequentially ordered on the search key

To locate a record with search-key value K we:

•  Find the index record with the largest search-key value ≤ K

•  Search the file sequentially, starting at the record to which the index record points

Compared with a dense index, a sparse index needs less space and less maintenance overhead for insertions and deletions, but is generally slower for locating records.

Good tradeoff: a sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
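The sparse lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `blocks` list and its keys are invented, each block is a small in-memory list rather than a disk block, and there is one index entry per block holding the least key in that block. The lookup takes the last index entry whose key does not exceed K and then scans that block sequentially.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of records sorted on the search key.
blocks = [
    [2, 5, 8],
    [15, 24, 29],
    [35, 36, 44],
    [51, 53, 55],
]

# Sparse index: one entry per block, holding the least key in that block.
index_keys = [block[0] for block in blocks]


def sparse_lookup(key):
    """Return (block_number, position) of key, or None if it is absent."""
    # Last index entry whose search-key value does not exceed key.
    i = bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # key is smaller than every key in the file
    # Sequential search inside the block that the index entry points to.
    for pos, k in enumerate(blocks[i]):
        if k == key:
            return (i, pos)
    return None


print(sparse_lookup(36))  # (2, 1)
print(sparse_lookup(30))  # None
```

Only four index entries are kept for twelve records; a dense index over the same file would need twelve.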


Sparse Index


Single Level Index

A single level index is an auxiliary file that makes it more efficient to search for a record in the data file.

•  The index is usually specified on one field of the file

•  One form of an index is a file of entries <field value, pointer to record>, ordered by field value

•  The index is called an access path on the field

•  The index file usually occupies considerably fewer disk blocks than the data file because its entries are much smaller

•  A binary search on the index yields a pointer to the file record


Types of Single Level Indexes

•  Primary Index

•  Clustering Index

•  Secondary Index


Primary Index

A primary index is an ordered file whose entries are of fixed length with two fields:

<value of primary key; address of data block>

•  The data file is ordered on the primary key field, and the primary key of each record must be unique

•  The index includes one entry for each block in the data file; the entry holds the key field value of the first record in the block, which is called the block anchor

•  A similar scheme can use the last record in each block as the anchor


Example Performance Gain

Ordered file of r = 30000 records

Block size B = 1024 bytes

Records are fixed-size and unspanned, with record length R = 100 bytes

Blocking factor: $bfr = \lfloor B/R \rfloor = \lfloor 1024/100 \rfloor = 10$ records per block

Number of blocks needed for the file: $b = \lceil r/bfr \rceil = 30000/10 = 3000$ blocks

A binary search on the data file would need approximately $\lceil \log_2 b \rceil = \lceil \log_2 3000 \rceil = 12$ block accesses


Example Performance Gain

Now suppose the ordering key is V = 9 bytes long and a block pointer is P = 6 bytes long

Size of each index entry: $R_i = 9 + 6 = 15$ bytes

Blocking factor: $bfr_i = \lfloor 1024/15 \rfloor = 68$ entries per block

Total number of entries: $r_i$ = total number of blocks in the data file = 3000

Number of blocks needed for the index: $b_i = \lceil r_i/bfr_i \rceil = \lceil 3000/68 \rceil = 45$ blocks

A binary search of the index file would need $\lceil \log_2 b_i \rceil = \lceil \log_2 45 \rceil = 6$ block accesses, PLUS one block access into the data file itself using the pointer

Therefore 7 block accesses are needed


Problem with Primary Indexes

Insertion or deletion of a record in an ordered data file involves not only making or reclaiming space in the data file, but also changing the index entries to reflect the new situation.

Possible solutions

•  Use deletion markers for records

•  Maintain a linked list of overflow records for each block in the data file

•  Reorganise periodically


Clustering Index

A clustering index is an ordered file whose entries are of fixed length with two fields:

<value of clustering field; address of data block>

•  The data file is ordered on the clustering field, but the clustering field does not have a distinct value for each record

•  The index includes one entry for each distinct value of the clustering field; the entry points to the first data block that contains records with that field value

Insertion/deletion is still problematic because the main data file is ordered

•  To alleviate this, it is common to reserve a block or a cluster of contiguous blocks for each value of the clustering field (see diagram Separate Block Clustering)


Secondary Index

A secondary index is an ordered file whose entries are of fixed length with two fields:

<value of key; address of data block or record pointer>

•  The secondary key is some nonordering field of the data file

Frequently used to facilitate query processing. For example, say we know that queries related to genre are frequent

•  SELECT * FROM Movie WHERE genre = 'comedy';

We can ask the DBMS to create a secondary index on genre by issuing the following SQL command

•  CREATE INDEX Gindex ON Movie(genre);


Secondary Index where nonordering field is a key

If the nonordering field has distinct values (i.e. could be considered a secondary key) then

•  There is one index entry for each record in the data file

•  The pointer points to the block in which the record is stored or to the record itself

Index entries are still ordered, so we can do a binary search, but we need to know whether the pointer is a record pointer or a block pointer in order to process the search correctly


[Figure: Secondary Index of block pointers and secondary key type]


Secondary Index where nonordering field is not a key

Option #1

•  Include several index entries with the same key value, one for each record. This is a dense index

Option #2

•  Have variable length index entries, with a repeating field for the pointer. For example <K(i), P(i,1), …, P(i,k)>

For these two options the binary search algorithm needs modification


Secondary Index where nonordering field is not a key

Option #3

•  Keep one fixed-length index entry per value, with a pointer to a block of pointers; that is, add a level of indirection

•  If the pointers cannot fit in the space allocated for the block of pointers, use an overflow or linked list approach to cope


[Figure: Secondary Index of record pointers and nonkey type using Option #3]


Performance

Generally, a secondary index needs more storage space and a longer search time than a primary index, because it has a larger number of entries.

However, for an arbitrary record the improvement in search time is greater, as otherwise we would have to do a linear search!


Consider Earlier Example

Recap

•  r = 30000 fixed-length (100 byte) records, with block size B = 1024 bytes

•  The file has 3000 blocks, as we already calculated

A linear search would require on average $b/2 = 3000/2 = 1500$ block accesses

Suppose a secondary index on a field of V = 9 bytes

•  $R_i = 9 + 6 = 15$ bytes

•  $bfr_i = \lfloor B/R_i \rfloor = \lfloor 1024/15 \rfloor = 68$ entries per block

•  Number of index entries $r_i$ = number of records = 30000

•  Number of blocks needed is $b_i = \lceil r_i/bfr_i \rceil = \lceil 30000/68 \rceil = 442$ blocks

A binary search of the index file would need $\lceil \log_2 b_i \rceil = \lceil \log_2 442 \rceil = 9$ block accesses, PLUS one block access into the data file itself using the pointer

Therefore 10 block accesses as opposed to 1500 block accesses
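The arithmetic in the two performance examples can be checked with a short Python sketch. The function name `index_search_cost` is made up for this illustration; the figures (B = 1024, R = 100, V = 9, P = 6, r = 30000) are the ones used in the examples.

```python
from math import ceil, log2


def index_search_cost(block_size, entry_size, num_entries):
    """Return (index_blocks, block_accesses) for a binary-searched single-level index."""
    bfr_i = block_size // entry_size          # index entries per block
    index_blocks = ceil(num_entries / bfr_i)
    # Binary search over the index blocks, plus one access into the data file.
    return index_blocks, ceil(log2(index_blocks)) + 1


B, R, r = 1024, 100, 30000
bfr = B // R                  # 10 records per block
b = ceil(r / bfr)             # 3000 data blocks

binary_on_data = ceil(log2(b))                 # 12 block accesses
linear_average = b // 2                        # 1500 block accesses
primary = index_search_cost(B, 9 + 6, b)       # one entry per data block
secondary = index_search_cost(B, 9 + 6, r)     # one entry per record

print(primary)    # (45, 7)
print(secondary)  # (442, 10)
```

The secondary index is almost ten times larger than the primary index, yet costs only three more block accesses per lookup.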


Another way of organising Secondary Index: Inverted File

The inverted file contains one index entry for each value of the attribute in question. The entry contains a list of pointers to every record with that attribute value.

For example, with car manufacturer as the secondary key:

•  Audi: 11

•  BMW: 3, 9, 16, 17

•  Ford: 1, 4, 5, 7, 10, 14, 18, 19, 20

•  Honda: 15

•  VW: 2, 6, 8, 12, 13


Why called an “inverted file”?

Consider the main file as a function which maps addresses to (attribute, value) pairs:

•  file (address) -> (attribute1, value), (attribute2, value), ...

Inverted files are functionally the inverse of the main file

•  inv_file (attribute, value) -> address, address, ...

A number of attributes can be inverted. The degree of inversion of a file is the percentage of attributes indexed in this way; 100% inversion is when every attribute is indexed.

Queries on multiple keys need not refer to the main file if all the keys in the query are indexed.

•  Consider a query for the record numbers of "green Fords" if both colour and manufacturer are indexed.
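A minimal Python sketch of inversion, using the car manufacturer lists from the earlier example. The colour data (`green`) is invented purely for illustration, and the dictionaries stand in for files on disk. The "green Fords" query is answered from the two inverted lists alone, without touching the main file.

```python
from collections import defaultdict

# Manufacturer lists from the example; record addresses are the integers.
makers = {
    "Audi": [11],
    "BMW": [3, 9, 16, 17],
    "Ford": [1, 4, 5, 7, 10, 14, 18, 19, 20],
    "Honda": [15],
    "VW": [2, 6, 8, 12, 13],
}
green = {4, 9, 14, 20}  # hypothetical: addresses of the green cars

# Main file: address -> (attribute, value) pairs.
main_file = {
    addr: {"manufacturer": maker, "colour": "green" if addr in green else "other"}
    for maker, addrs in makers.items()
    for addr in addrs
}


def invert(file, attribute):
    """Map each value of the attribute to the sorted list of record addresses."""
    inverted = defaultdict(list)
    for addr in sorted(file):
        inverted[file[addr][attribute]].append(addr)
    return dict(inverted)


by_maker = invert(main_file, "manufacturer")
by_colour = invert(main_file, "colour")

# Both keys are indexed, so the query never reads the main file.
green_fords = sorted(set(by_maker["Ford"]) & set(by_colour["green"]))
print(green_fords)  # [4, 14, 20]
```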


Yet another way of organising Secondary Index: Threaded Files

Each record has a pointer field for each indexed secondary key value. This field is used to link (thread) all records with the same attribute value for that key. The threaded file has a number of separate threads running through it. The index then contains a pointer to the head of each list. To find all records with a particular secondary attribute value, find that value in the index, and follow the thread.


Example Retrieval in Threaded Files

To find all green Ford cars (i.e. a query based on two attributes) we have three options:

1. Traverse the list of Fords (using the Manufacturer index and the associated thread), checking each to see if it is green.

2. Traverse the list of green cars, checking each to see if it is a Ford. We would prefer to traverse the shorter list, so index entries could include a thread length.

3. Traverse both threads simultaneously (cf. sequential file merge). This requires that records are threaded in pointer order.
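The third option, walking both threads at once, might look like the Python sketch below. Everything here is an assumption made for illustration: the seven records, their colours, the `next_<attribute>` field naming and the helper names. Threads are built in ascending address order, which is the "pointer order" the merge relies on.

```python
# Hypothetical main file: address -> record. Colours are invented.
records = {
    1: {"make": "Ford", "colour": "blue"},
    2: {"make": "VW", "colour": "green"},
    3: {"make": "BMW", "colour": "red"},
    4: {"make": "Ford", "colour": "green"},
    5: {"make": "Ford", "colour": "red"},
    6: {"make": "VW", "colour": "blue"},
    7: {"make": "Ford", "colour": "green"},
}


def build_threads(records, attribute):
    """Thread records with equal values in address order; return {value: head}."""
    heads, last = {}, {}
    for addr in sorted(records):
        value = records[addr][attribute]
        records[addr]["next_" + attribute] = None   # end of thread so far
        if value in last:
            records[last[value]]["next_" + attribute] = addr
        else:
            heads[value] = addr
        last[value] = addr
    return heads


def merge_threads(records, head_a, attr_a, head_b, attr_b):
    """Follow two threads together, keeping addresses found on both."""
    a, b, hits = head_a, head_b, []
    while a is not None and b is not None:
        if a == b:
            hits.append(a)
            a = records[a]["next_" + attr_a]
            b = records[b]["next_" + attr_b]
        elif a < b:
            a = records[a]["next_" + attr_a]
        else:
            b = records[b]["next_" + attr_b]
    return hits


make_index = build_threads(records, "make")
colour_index = build_threads(records, "colour")
print(merge_threads(records, make_index["Ford"], "make",
                    colour_index["green"], "colour"))  # [4, 7]
```

Each step advances whichever thread is behind, so no record is visited twice on the same thread.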


Yet another way of organising Secondary Indexes: Multilists

Like threaded files, but the index contains a pointer to every kth record with the particular attribute value.

•  Speeds up merge operations

•  Can skip over the rest of a sublist if the next pointer in the index is still smaller than the current pointer in the other thread.

•  Note that threaded files can be considered as multilists with k = ∞

Cellular Multilist

•  Like an inverted file, but only lists the secondary storage blocks which contain records with the attribute value.

•  Searches must access the main file blocks to ascertain exactly which record has the value.

•  Compromise between threads and inverted files.


Review

Basic operations involving indexes

Retrieve a record based on a key

Create the original empty index and data files

•  Both the index file and the data file are created empty

Add records to the data file and index

•  Adding a new record to the data file requires that we also add a record to the index file

•  Adding to the data file is easy. Just add at the end or at an existing gap between records

•  Adding to the index is not easy, because entries of the index file have to be sorted.

•  This means that we have to shift all the index records after the one we are inserting

•  Essentially, we have the same problem as inserting records into a normal sorted file

•  One solution is to use a hash table file for the index, rather than a sorted file

•  Another solution is to use sorted structures that are very cheap to add to


Review

Indexes allow access using different keys without duplicating all the records

•  avoiding duplication saves storage

•  avoiding duplication makes modifying the data easier - we don't have lots of different copies to keep up to date

Indexes allow a lot of flexibility in the layout of the data file

•  We don't need fixed length records

•  We can store records in any order


University of Dublin Trinity College

Index Structures for Files Multi-Level Indexes

Owen.Conlan@scss.tcd.ie


Multi-Level Indexes

Because a single-level index is an ordered file, we can create a primary index to the index itself

•  the original index file is called the first-level index

•  the index to the index is called the second-level index

We can repeat the process, creating a 3rd, 4th, ..., top level until all entries in the top level fit in one disk block


Multi-Level Indexes

•  The indexing schemes so far have used an ordered index file

•  A binary search is performed on this index to locate a pointer to a disk block or record

•  This requires approximately $\log_2 n$ block accesses for an index with n blocks

•  Base 2 is used because binary search halves the part of the index left to search at each step

•  The idea behind multi-level indexes is to reduce the part of the index left to search by a factor of bfr, the blocking factor of the index (which is bigger than 2!)


Fan-Out

The value of bfr for the index is called the fan-out. We will refer to it using the symbol fo.

Searching a multi-level index requires approximately $\log_{fo} n$ block accesses.

This is smaller than for binary search when fo > 2


Multi Level Indexes

A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index occupies more than one disk block.

Such a multi-level index is a form of search tree; however, insertion and deletion of index entries is a severe problem because every level of the index is an ordered file.

Hence most multi-level indexes use B-tree or B+ tree data structures, which leave space in each tree node (disk block) to allow for new index entries.


Multi Level Indexes

A multi-level index considers the index file as an ordered file with a distinct entry for each K(i)

•  First level

We can create a primary index for the first level

•  Second level

•  Because the second level uses block anchors, we only need an entry for each block in the first level

We can repeat this process for the second level

•  The third level would be a primary index for the second level

And so on … until all the entries of a level fit in a single block


Example

Imagine we have a single level index with entries across 442 blocks and a blocking factor of 68

•  Blocks in first level: $n_1 = 442$

•  Blocks in second level: $n_2 = \lceil n_1/fo \rceil = \lceil 442/68 \rceil = 7$

•  Blocks in third level: $n_3 = \lceil n_2/fo \rceil = \lceil 7/68 \rceil = 1$

So t = 3, and searching the multi-level index takes t + 1 = 4 block accesses.

A binary search of a single level index of 442 blocks takes 9 + 1 = 10 accesses
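The level-by-level calculation can be written as a small loop. This Python sketch uses a made-up helper name, `index_levels`, and assumes a fan-out of at least 2 so that each level is strictly smaller than the one below it.

```python
from math import ceil


def index_levels(first_level_blocks, fan_out):
    """Return the number of blocks at each level, first level first."""
    levels = [first_level_blocks]
    while levels[-1] > 1:
        # One entry per block of the level below, fan_out entries per block.
        levels.append(ceil(levels[-1] / fan_out))
    return levels


levels = index_levels(442, 68)
t = len(levels)
print(levels)   # [442, 7, 1]
print(t + 1)    # 4 block accesses: one per index level plus the data block
```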


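How many levels?

The top level index (where all the entries fit in one block) is the t-th level.

Each level reduces the number of entries at the previous level by a factor of fo, the index fan-out.

Therefore, with n entries in the first level, level t has about $n / fo^{t-1}$ entries, and these fit in one block when

$n / fo^{t} \le 1$

We want the smallest such t, so

$t = \lceil \log_{fo} n \rceil$

For example, an index with n = 30000 first-level entries and fo = 68 (442 first-level blocks) has $t = \lceil \log_{68} 30000 \rceil = 3$ levels.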


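Two-Level Index

[Figure: a two-level index over an ordered data file]

•  Second (top) level: one block with entries 2, 35, 55, 85

•  First (base) level: blocks with entries (2, 8, 15, 24), (35, 39, 44, 51), (55, 63, 71, 80), ...

•  Data file: the blocks shown hold records (2, 5), (24, 29), (35, 36), (51, 53), (55, 61), (80, 82), ...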


Search Algorithm

For searching a multi-level primary index with t levels

p ← address of top level block of index;

for j ← t step -1 to 1 do

begin

read the index block (at jth index level) whose address is p;

search block p for entry i such that K_j(i) ≤ K < K_j(i+1) (if K_j(i) is the last entry in the block, it is sufficient to satisfy K_j(i) ≤ K);

p ← P_j(i) (* picks appropriate pointer at jth index level *);

end;

read the data file block whose address is p;

search block p for record with key = K;
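The algorithm above translates fairly directly into Python. This is a sketch under assumptions, not a disk-based implementation: blocks are in-memory lists, block addresses are list positions, the data keys are invented, and `build_levels` and `search` are names chosen for this example. The fan-out must be at least 2.

```python
from bisect import bisect_right


def build_levels(data_blocks, fan_out):
    """Build index levels bottom-up; each level is a list of blocks of (key, pointer)."""
    # First level: one entry per data block, keyed by the block anchor.
    entries = [(block[0], n) for n, block in enumerate(data_blocks)]
    levels = []
    while True:
        level = [entries[i:i + fan_out] for i in range(0, len(entries), fan_out)]
        levels.append(level)
        if len(level) == 1:          # top level fits in a single block
            return levels
        # Next level up: one entry per block of this level.
        entries = [(blk[0][0], n) for n, blk in enumerate(level)]


def search(levels, data_blocks, key):
    """Return (data_block_number, position) of key, or None if absent."""
    p = 0                                    # address of the top-level block
    for level in reversed(levels):           # j = t down to 1
        block = level[p]
        keys = [k for k, _ in block]
        i = bisect_right(keys, key) - 1      # entry with K_j(i) <= K < K_j(i+1)
        if i < 0:
            return None                      # K precedes every key in the file
        p = block[i][1]
    block = data_blocks[p]
    return (p, block.index(key)) if key in block else None


# Hypothetical ordered data file with two records per block.
data_blocks = [[2, 5], [8, 12], [15, 21], [24, 29],
               [35, 36], [39, 41], [44, 48], [51, 53]]
levels = build_levels(data_blocks, fan_out=4)
print(len(levels))                      # 2
print(search(levels, data_blocks, 44))  # (6, 0)
```

With a fan-out of 4 the eight first-level entries fill two blocks, so a second level with two entries is enough to reach the top.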


The Invention of the B-Tree

•  It is hard to think of a major general-purpose file system that is not built around a B-tree design

•  They were invented by two researchers at Boeing, R. Bayer and E. McCreight, in 1972

•  By 1979 B-trees were "de facto, the standard organization for indexes in a database system"

•  B-trees address the problem of speeding up indexing schemes that are too large to copy into memory


The Problem

The fundamental problem with keeping an index on secondary storage is that accessing secondary storage is slow. Why?

Binary searching requires too many seeks: searching for a key on a disk often involves seeking to different disk tracks. With a binary search, on average about 9.5 seeks are required to find a key in an index of 1000 items.


The Problem

It can be very expensive to keep the index in sorted order.

If inserting a key involves moving a large number of other keys in the index, index maintenance is very nearly impractical on secondary storage for indexes with large numbers of keys.

We need to find a way to make insertions and deletions to indexes that will not require massive reorganization.


Using B-trees and B+ trees as dynamic multi-level indexes

These data structures are variations of search trees that allow efficient insertion and deletion, i.e. they are good in dynamic situations.

Specifically designed for disk

•  each node corresponds to a disk block

•  each node is kept between half full and completely full


Using B-trees and B+ trees as dynamic multi-level indexes

•  An insertion into a node that is not full is very efficient; if a node is full then insertion causes the node to split into two nodes

•  Splitting may propagate upwards to other tree levels

•  Deletion is also efficient as long as a node does not become less than half full; if it does then it must be merged with neighbouring nodes


Definition of a B-Tree

A B-Tree of order m is a multi-way lexicographic search tree where:

•  every node has k keys, with

$\lceil m/2 \rceil - 1 \le k \le m - 1$

appearing in increasing order from left to right; an exception is the root, which may have as few as one key

•  a node with k keys either has k + 1 pointers to children, which correspond to the partition induced on the key-space by those k keys, or it has all its pointers null, in which case it is terminal

•  all terminal nodes are at the same level
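The structural rules in the definition can be checked mechanically. The Python sketch below is an illustration only: a node is represented as a `(keys, children)` pair with an empty child list for a terminal node, the function names are invented, and the sample tree is hypothetical. It checks key counts, pointer counts and the equal-depth rule, but not that keys are correctly partitioned between parent and children.

```python
from math import ceil


def key_bounds(m):
    """Return (min_keys, max_keys) for a non-root node of a B-tree of order m."""
    return ceil(m / 2) - 1, m - 1


def height(node, m, is_root=True):
    """Return the height of a valid subtree; raise ValueError if a rule is broken."""
    keys, children = node
    lo, hi = key_bounds(m)
    if is_root:
        lo = 1  # the root may hold as few as one key
    if list(keys) != sorted(keys) or not lo <= len(keys) <= hi:
        raise ValueError(f"bad key list {keys}")
    if not children:
        return 1  # terminal node
    if len(children) != len(keys) + 1:
        raise ValueError(f"node {keys} needs {len(keys) + 1} children")
    heights = {height(child, m, is_root=False) for child in children}
    if len(heights) != 1:
        raise ValueError("terminal nodes are not all at the same level")
    return 1 + heights.pop()


# Hypothetical order-3 tree used only to exercise the checks.
tree = ([30], [([20], [([10, 15], []), ([25], [])]),
               ([40], [([35], []), ([45, 50], [])])])

print(key_bounds(5))    # (2, 4)
print(height(tree, 3))  # 3
```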


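Comparison

[Figure: the same nine keys (10, 15, 20, 25, 30, 35, 40, 45, 50) stored in a 3-way tree and in a 2-way (binary) tree]

•  3-way tree: 2 keys at level 1 (20, 40), 6 keys at level 2 (10, 15, 25, 30, 45, 50) and 1 key at level 3 (35). Total search time = 2*1 + 6*2 + 1*3 = 17

•  2-way (binary) tree: 1 key at level 1, 2 keys at level 2, 4 keys at level 3 and 2 keys at level 4. Total search time = 1 + 2*2 + 4*3 + 2*4 = 25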


Insertion into a B-Tree

•  The terminal node where the key should be placed is found and the addition (in the appropriate place lexicographically) is made

•  If overflow occurs (i.e. > m-1 keys), the node splits in two and the middle key (along with pointers to its newly created children) is passed up to the next level for insertion, and so on

•  At worst, splitting will propagate along a path to the root, resulting in the creation of a new root
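A compact Python sketch of this insertion procedure follows. It is an in-memory illustration rather than a disk-based implementation: the `Node` class and function names are invented, keys are assumed to be distinct, and a node is allowed to overflow briefly before it is split around its middle key.

```python
from bisect import bisect_left, insort


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list means a terminal node


def insert(root, key, m):
    """Insert key into a B-tree of order m; return the (possibly new) root."""
    split = _insert(root, key, m)
    if split is None:
        return root
    middle, right = split
    return Node([middle], [root, right])   # splitting reached the root


def _insert(node, key, m):
    """Insert below node; return (middle_key, new_right_node) if node split."""
    if not node.children:
        insort(node.keys, key)             # add in lexicographic position
    else:
        i = bisect_left(node.keys, key)
        split = _insert(node.children[i], key, m)
        if split is not None:              # child split: take the promoted key
            middle, right = split
            node.keys.insert(i, middle)
            node.children.insert(i + 1, right)
    if len(node.keys) <= m - 1:
        return None                        # no overflow
    mid = len(node.keys) // 2
    middle = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return middle, right


# Order-3 tree with root (20, 40) and three terminal nodes; insert 35.
root = Node([20, 40], [Node([10, 15]), Node([25, 30]), Node([45, 50])])
root = insert(root, 35, m=3)
print(root.keys)                                     # [30]
print([child.keys for child in root.children])       # [[20], [40]]
```

Inserting 35 overflows the terminal node (25, 30), then the root, so the tree gains a level.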


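Example of insertion of key 35 into B-tree of order 3

1. Initial tree: root (20, 40) with terminal nodes (10, 15), (25, 30) and (45, 50). Key 35 is to be inserted.

2. 35 is added to terminal node (25, 30), giving (25, 30, 35): too many keys (> m-1).

3. The node splits into (25) and (35), and the middle key 30 is passed up to the root, giving (20, 30, 40): too many keys.

4. The root splits into (20) and (40), and the middle key 30 becomes the new root. Final B-tree: root (30); second level (20), (40); terminal nodes (10, 15), (25), (35), (45, 50).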


Deletion from a B-Tree

Delete the key

•  If the node still has at least $\lceil m/2 \rceil - 1$ keys then OK

If not:

•  If there is a lexicographic successor (i.e. the node with the deleted key is not a leaf node), promote it.

•  If any node is left with fewer than $\lceil m/2 \rceil - 1$ keys, merge that node with its left or right hand sibling and the key from their common parent

This may leave the parent node with too few keys, so continue merging. Merging may leave the root empty, in which case it is deleted, thereby reducing the number of levels in the tree by 1


Deletion from B-Tree (Case 1)

For a B-tree of order 5:

•  Maximum number of keys per node = m - 1 = 5 - 1 = 4

•  Minimum number of keys per node = $\lceil m/2 \rceil - 1 = \lceil 5/2 \rceil - 1 = 2$

Step 1: Replace ‘zem’ in node A by its lexicographic successor ‘zil’ in node B

•  Root: (xim, yun)

•  Second level: (xal, xen), (xum, yel, yin), (zem, zul) [node A]

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel), (zil, zon) [node B], (zum, zun)


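Step 2: Node D has too few keys and is therefore merged with its left-hand sibling, node E, and their common parent key ‘zil’ in node F

•  Root: (xim, yun)

•  Second level: (xal, xen), (xum, yel, yin), (zil, zul) [node F]

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel) [node E], (zon) [node D], (zum, zun)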


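Step 3: Node G now contains too few keys and is therefore merged with its left-hand sibling, node H, and their common parent key ‘yun’ in node I

•  Root: (xim, yun) [node I]

•  Second level: (xal, xen), (xum, yel, yin) [node H], (zul) [node G]

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel, zil, zon), (zum, zun)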


Step 4: Node J now contains too many keys and is therefore split, and the middle key ‘yin’ is passed up to the root, node K, for insertion

•  Root: (xim) [node K]

•  Second level: (xal, xen), (xum, yel, yin, yun, zul) [node J]

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel, zil, zon), (zum, zun)


Step 5: The deletion of ‘zem’ has now been completed

•  Root: (xim, yin)

•  Second level: (xal, xen), (xum, yel), (yun, zul)

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel, zil, zon), (zum, zun)


Deletion from B-Tree (Case 2)

Deletion of a key from a terminal node where the resulting node has fewer than the minimum number of keys

Step 1: Delete ‘zum’ from the previous tree

•  Root: (xim, yin)

•  Second level: (xal, xen), (xum, yel), (yun, zul)

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel, zil, zon), (zum, zun)


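Step 2: Node A now contains too few keys and is merged with its left-hand sibling, node B, and their common parent key ‘zul’ in node C

•  Root: (xim, yin)

•  Second level: (xal, xen), (xum, yel), (yun, zul) [node C]

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel, zil, zon) [node B], (zun) [node A]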


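Step 3: Node D is now too full; it therefore splits in two and the middle key, say ‘zon’, is passed up to node E for insertion

•  Root: (xim, yin)

•  Second level: (xal, xen), (xum, yel), (yun) [node E]

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel, zil, zon, zul, zun) [node D]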


Step 4: The deletion of ‘zum’ has now been completed

•  Root: (xim, yin)

•  Second level: (xal, xen), (xum, yel), (yun, zon)

•  Terminal nodes: (xac, xag), (xan, xat), (xet, xig), (xot, xul), (xut, yal), (yep, yes), (yol, yon), (zam, zel, zil), (zul, zun)


B-trees as Primary file organisation technique

•  An entry in a B-tree used as a dynamic multi-level index consists of <search key, record pointer, tree pointer>, i.e. data records are stored separately

•  A B-tree can also be used as a primary file organisation technique; each entry then consists of <search key, data record, tree pointer>

•  This works well for small files with small records; otherwise the fan-out becomes too small and the number of levels too great for efficient access


Review

Multi-Level Indexes are more efficient than a single level index for searching

•  Better than Binary Search

Definition of a B-Tree

B-Trees are quite efficient at inserting and deleting keys

Next Lecture – B+ Trees