File organization 1

File Organization & Indexing

DBMS stores data on hard disks

• This means that data needs to be
– Read from the hard disk into memory (RAM)
– Written from memory onto the hard disk

• Because disk I/O operations are slow, query performance depends on how data is stored on hard disks

• The lowest component of the DBMS performs storage management activities

• Other DBMS components need not know how these low-level activities are performed

Basics of Data storage on hard disk

• A disk is organized into a number of blocks or pages

• A page is the unit of exchange between the disk and the main memory

• A collection of pages is known as a file

• DBMS stores data in one or more files on the hard disk

File Organization

• The physical arrangement of data in a file into records and pages on the disk

• File organization determines the set of access methods for
– Storing and retrieving records from a file

• We study three types of file organization
– Unordered or heap files
– Ordered or sequential files
– Hash files

• We examine each of them in terms of the operations we perform on the database
– Insert a new record
– Search for a record (or update a record)
– Delete a record

Organization of Records in Files

• Heap – a record can be placed anywhere in the file where there is space

• Sequential – records are stored in sequential order, based on the value of the search key of each record

• Hashing – a hash function is computed on some attribute of each record, and the result determines where the record is placed. The term hash refers to splitting the key into pieces.

• Records of each relation may be stored in a separate file

Unordered Or Heap File

• Records are stored in the same order in which they are created

• Insert operation
– Fast – because the incoming record is written at the end of the last page of the file

• Search (or update) operation
– Slow – because a linear search is performed over the pages

• Delete operation
– Slow – because the record to be deleted must first be found
– Deleting the record creates a hole in the page
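The three heap-file operations can be sketched in a few lines. This is only an in-memory illustration: the class name `HeapFile`, the `page_capacity` parameter, and the use of `None` to mark a hole are assumptions made for the example, not part of any particular DBMS.

```python
class HeapFile:
    """Toy heap file: a list of fixed-capacity pages held in memory."""

    def __init__(self, page_capacity=3):
        self.page_capacity = page_capacity
        self.pages = [[]]  # each page is a list of record slots

    def insert(self, record):
        # Fast: write at the end of the last page, adding a page if it is full.
        if len(self.pages[-1]) >= self.page_capacity:
            self.pages.append([])
        self.pages[-1].append(record)

    def search(self, key):
        # Slow: linear scan over every page until the key is found.
        for page_no, page in enumerate(self.pages):
            for slot, record in enumerate(page):
                if record is not None and record[0] == key:
                    return page_no, slot
        return None

    def delete(self, key):
        # Slow: the record must be found first; deleting leaves a hole (None).
        location = self.search(key)
        if location is None:
            return False
        page_no, slot = location
        self.pages[page_no][slot] = None
        return True


heap = HeapFile(page_capacity=2)
for key in (42, 7, 19):
    heap.insert((key, f"record-{key}"))
```

Holes are not reused in this sketch, which is one reason heap files usually need periodic reorganization.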

Ordered or Sequential File

• Records are sorted on the values of one or more fields
– Ordering field – the field on which the records are sorted

• Search (or update) operation
– Fast – because binary search can be performed on the sorted records when searching on the ordering field

• Delete operation
– Fast – because finding the record is fast

• Insert operation
– Poor – because inserting the new record in its correct position means shifting all the subsequent records in the file (on average, half of the records)
– Alternatively, an 'overflow file' is created, which holds all the new records as a heap
– Periodically, the overflow file is merged with the main file
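A minimal sketch of the overflow-file idea, assuming records are reduced to their ordering-field value. The class name `OrderedFile` and its method names are made up for illustration.

```python
import bisect


class OrderedFile:
    """Toy sequential file: a sorted main file plus an unsorted overflow file."""

    def __init__(self, keys=()):
        self.main = sorted(keys)   # records kept sorted on the ordering field
        self.overflow = []         # new records are appended here as a heap

    def insert(self, key):
        # Avoid shifting records in the main file: append to the overflow file.
        self.overflow.append(key)

    def search(self, key):
        # Binary search on the main file, then a linear scan of the overflow.
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return "main"
        return "overflow" if key in self.overflow else None

    def merge(self):
        # Periodic reorganization: fold the overflow into the sorted main file.
        self.main = sorted(self.main + self.overflow)
        self.overflow = []


f = OrderedFile([10, 20, 30])
f.insert(25)
```

Until `merge()` runs, a search that misses in the main file pays for a linear scan of the overflow file, so the overflow file should stay small.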

Sequential access vs. random access

• Sequential access means that a group of elements is accessed in a predetermined, ordered sequence

• A random-access file may be split into pieces that are stored wherever space is available

• A sequential file may therefore load faster, while a random-access file may take longer

http://en.wikipedia.org/wiki/Sequence

Hash File

• Is an array of buckets
– Given a record k, a hash function h(k) computes the index of the bucket to which record k belongs
– h uses one or more fields in the record, called hash fields
– Hash key – the key of the file when it is used by the hash function
– h(K) = K mod M, where M is the number of buckets

• Example hash function
– Assume that the staff last name is used as the hash field
– Assume also that the hash file size is 26 buckets, each bucket corresponding to one letter of the alphabet
– Then a hash function can be defined that computes the bucket address (index) from the first letter of the last name
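Both hash functions from this slide can be written directly. The function names are illustrative, and the letter-based version assumes last names that start with an unaccented Latin letter.

```python
def mod_hash(key: int, num_buckets: int) -> int:
    """h(K) = K mod M for an integer hash key."""
    return key % num_buckets


def first_letter_hash(last_name: str) -> int:
    """Map a last name to one of 26 buckets using its first letter (A=0 ... Z=25)."""
    first = last_name.strip().upper()[:1]
    if not ("A" <= first <= "Z"):
        raise ValueError("last name must start with a letter A-Z")
    return ord(first) - ord("A")
```

The first-letter function is easy to follow but distributes real names unevenly, since some initials are far more common than others.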

A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).

A hash function is used to locate records for access, insertion, and deletion.

Hashing is an effective technique for computing the location of a data record on disk directly, without using an index structure.

Hash File

• Insert operation
– Fast – because the hash function computes the index of the bucket to which the record belongs
– If that bucket is full, the record goes to the next free one

• Search operation
– Fast – because the hash function computes the index of the bucket

• Delete operation
– Fast – for the same reason: the hash function locates the record quickly
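A rough sketch of these operations, assuming integer keys, h(K) = K mod M, and a fixed bucket capacity. The class `HashFile` and its parameters are hypothetical. To stay correct after deletions, this simplified search keeps probing through all buckets on a miss; a real system would bound that cost.

```python
class HashFile:
    """Toy hash file: M buckets, each holding at most `capacity` records."""

    def __init__(self, num_buckets=4, capacity=2):
        self.num_buckets = num_buckets
        self.capacity = capacity
        self.buckets = [[] for _ in range(num_buckets)]

    def _probe(self, key):
        # Yield bucket indexes starting at h(key) and wrapping around once.
        home = key % self.num_buckets
        for step in range(self.num_buckets):
            yield (home + step) % self.num_buckets

    def insert(self, key, value):
        # Use the home bucket, or the next bucket that still has free space.
        for i in self._probe(key):
            if len(self.buckets[i]) < self.capacity:
                self.buckets[i].append((key, value))
                return i
        raise RuntimeError("hash file is full")

    def search(self, key):
        for i in self._probe(key):
            for k, v in self.buckets[i]:
                if k == key:
                    return v
        return None

    def delete(self, key):
        for i in self._probe(key):
            for pos, (k, _) in enumerate(self.buckets[i]):
                if k == key:
                    del self.buckets[i][pos]
                    return True
        return False


hf = HashFile(num_buckets=4, capacity=1)
first = hf.insert(5, "a")    # 5 mod 4 = 1, so the record lands in bucket 1
second = hf.insert(9, "b")   # bucket 1 is full, so the record goes to bucket 2
```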

Internal Hashing:

• Open addressing: proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.

• Chaining: various overflow locations are kept, usually by extending the array with a number of overflow positions. A pointer field is added to each record location.

• Multiple hashing: a second hash function is applied if the first results in a collision.

External Hashing:

• Hashing for disk files is called external hashing.

• The goal of a good hash function is to distribute the records uniformly over the address space so as to minimize collisions.
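Chaining for internal hashing can be sketched as an array whose slots carry a pointer field, with overflow positions appended to the end of the array. The class name and the use of -1 as the end-of-chain marker are assumptions for the example.

```python
class ChainedHashTable:
    """Internal hashing with chaining into an overflow area."""

    def __init__(self, size=5):
        self.size = size
        # Each slot is [key, next]; `next` is the pointer field (-1 = end of chain).
        self.slots = [[None, -1] for _ in range(size)]

    def insert(self, key):
        i = key % self.size
        if self.slots[i][0] is None:
            self.slots[i][0] = key
            return i
        # Collision: follow the chain to its end, then add an overflow position.
        while self.slots[i][1] != -1:
            i = self.slots[i][1]
        self.slots.append([key, -1])            # extend the array
        self.slots[i][1] = len(self.slots) - 1  # link it into the chain
        return len(self.slots) - 1

    def contains(self, key):
        i = key % self.size
        while i != -1:
            if self.slots[i][0] == key:
                return True
            i = self.slots[i][1]
        return False


t = ChainedHashTable(size=5)
positions = [t.insert(k) for k in (7, 12, 17)]  # all three hash to slot 2
```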

Static Hashing

• The problem with static hashing is that it does not expand or shrink dynamically as the size of the database grows or shrinks.

Dynamic Hashing

• Dynamic hashing provides a mechanism in which data buckets are added and removed dynamically and on demand (e.g. extendible hashing).

Overflow Chaining: When a bucket is full, a new bucket is allocated for the same hash result and is linked after the previous one. This mechanism is called closed hashing.

Linear Probing: When the hash function generates an address at which data is already stored, the next free bucket is allocated to the record. This mechanism is called open hashing.

(These names follow the database-textbook convention; many data-structures texts use "open" and "closed" hashing the other way round.)

Hash file organization of account file, using branch_name as key

For a string search key, the binary representations of all the characters in the string could be added, and the sum modulo the number of buckets returned.
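As a small sketch of that string hash, assuming the characters are taken as their UTF-8 byte values (the function name is made up for the example):

```python
def string_hash(search_key: str, num_buckets: int) -> int:
    """Add the byte values of the characters and take the sum modulo the bucket count."""
    return sum(search_key.encode("utf-8")) % num_buckets
```

Because addition ignores character order, strings with the same letters in a different order (such as "ab" and "ba") always fall into the same bucket, so this function is simple rather than ideal.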

Use of Extendable Hash Structure: Example

Initial Hash structure, bucket size = 2
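The growth of such a structure can be sketched as follows. This is a simplified illustration, not the example from the figures: it assumes distinct non-negative integer keys, uses the key itself as the hash value, and addresses the directory with the low-order bits of the key (textbook figures often use the high-order bits instead). All names are hypothetical.

```python
class ExtendibleHash:
    """Toy extendible hash for distinct non-negative integer keys."""

    def __init__(self, bucket_size=2):
        self.bucket_size = bucket_size
        self.global_depth = 0
        self.directory = [{"depth": 0, "keys": []}]  # one bucket to start

    def _bucket(self, key):
        mask = (1 << self.global_depth) - 1
        return self.directory[key & mask]

    def contains(self, key):
        return key in self._bucket(key)["keys"]

    def insert(self, key):
        bucket = self._bucket(key)
        if len(bucket["keys"]) < self.bucket_size:
            bucket["keys"].append(key)
            return
        if bucket["depth"] == self.global_depth:
            # Double the directory; both halves point at the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full bucket on one more bit of the key.
        bucket["depth"] += 1
        sibling = {"depth": bucket["depth"], "keys": []}
        bit = 1 << (bucket["depth"] - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = sibling
        pending, bucket["keys"] = bucket["keys"] + [key], []
        for k in pending:
            self.insert(k)  # redistribute; may trigger a further split


h = ExtendibleHash(bucket_size=2)
for k in range(8):
    h.insert(k)
```

Only the bucket that overflows is split; the directory doubles only when that bucket's local depth already equals the global depth.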

Indexing

• Index file (same idea as a textbook index): an auxiliary structure designed to speed up access to desired data.

• Indexing field: the field on which the index file is defined.

• The index file stores each value of the indexing field along with pointer(s) (e.g. page numbers) to the block(s) that contain record(s) with that field value, or a pointer to the record itself: <Indexing Field, Pointer>

• To find a record in the data file based on a selection criterion on an indexing field, we first access the index file, which then gives us the location of the record in the data file.

• The index file is much smaller than the data file, so searching it is fast.

• Indexing is important for both file systems and DBMSs.

Choosing Indexing Technique

• Five factors are involved when choosing an indexing technique:
– Access type
– Access time
– Insertion time
– Deletion time
– Space overhead

Two Types of Indices

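• Ordered index – based on a sorted ordering of the search-key values, and used to access data in order of its values. It may be a primary (clustering) index or a secondary index.

• Hash index – based on distributing the search-key values uniformly across a range of buckets, with the bucket chosen by a hash function. It is usually treated as a secondary (non-clustering) index.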

Types of Indexes

• Indexes on ordered vs. unordered files

• Dense vs. non-dense (i.e. sparse) indexes
– Dense: an entry in the index file for each record of the data file
– Sparse: only some of the data records are represented in the index, often one index entry per block of the data file

• Primary indexes vs. secondary indexes

• Ordered indexes vs. hash indexes
– Ordered indexes: indexing field values are stored in sorted order
– Hash indexes: indexing field values are located using a hash function

• Single-level vs. multi-level
– A single-level index is an ordered file and is searched using binary search
– Multi-level indexes are tree-structured; they speed up the search but require a more elaborate search algorithm

• Index on a single indexing field vs. index on multiple indexing fields (i.e. composite index)

Primary Index: index built on the ordering key field of a file

Clustering Index: index built on an ordering non-key field of a file

Secondary Index: index built on any non-ordering field of a file

Single-Level Ordered Index : Primary Index

A primary index file is an index that is constructed using the sorting attribute of the main file.

• Physical records may be kept ordered on the primary key.

• The index is ordered but has only one entry for each block of the data file, so it is sparse

• Each index entry has the value of the primary key field for the first record (or the last record) in a block and a pointer to that block.

Procedure: First perform a binary search on the primary index file to find the address of the block holding the corresponding data, then read that block.

Performance: Very fast. A binary search over an index that occupies b_i blocks takes about log2(b_i) block accesses, plus one more access to fetch the data block.

Problem: The primary index works only if the main file is a sorted file.

Solution: New records are inserted into an unordered (heap) overflow file for the table. Periodically, the ordered and overflow files are merged; at that time the main file is sorted again and the primary index file is updated accordingly.
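To put a number on "very fast", the block accesses with and without a primary index can be estimated. The figures below (30,000 records of 100 bytes, 1,024-byte blocks, 15-byte index entries) are assumed purely for illustration, and the function name is hypothetical.

```python
import math


def block_accesses(num_records, record_size, block_size, index_entry_size):
    """Compare binary search on the data file with a search via a primary index."""
    records_per_block = block_size // record_size
    data_blocks = math.ceil(num_records / records_per_block)
    without_index = max(1, math.ceil(math.log2(data_blocks)))

    # One index entry per data block (sparse primary index).
    entries_per_block = block_size // index_entry_size
    index_blocks = math.ceil(data_blocks / entries_per_block)
    # Binary search on the index blocks, +1 access to read the data block.
    with_index = max(1, math.ceil(math.log2(index_blocks))) + 1
    return without_index, with_index


print(block_accesses(num_records=30_000, record_size=100,
                     block_size=1024, index_entry_size=15))  # prints (12, 7)
```

Under these assumptions the data file occupies 3,000 blocks while the index needs only 45, which is why the search drops from 12 block accesses to 7.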

Dense and Sparse Indices

There are two types of ordered indices:

Dense Index:
• An index record appears for every search-key value in the file.
• This record contains the search-key value and a pointer to the actual record.

Sparse Index:
• Index records are created only for some of the records.
• To locate a record, we find the index record with the largest search-key value that is less than or equal to the value we are looking for. We start at the record pointed to by that index record and proceed along the pointers in the file (that is, sequentially) until we find the desired record.
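The two lookup procedures can be compared on a tiny made-up file. The records, the choice of one sparse entry per three records, and the function names are all assumptions for the example.

```python
import bisect

# Hypothetical data file, sorted on the search key: (key, payload).
records = [(10, "a"), (20, "b"), (30, "c"), (40, "d"), (50, "e"), (60, "f")]

# Dense index: one entry per search-key value -> position of the record.
dense_index = {key: pos for pos, (key, _) in enumerate(records)}

# Sparse index: an entry for only every third record.
sparse_ptrs = list(range(0, len(records), 3))
sparse_keys = [records[pos][0] for pos in sparse_ptrs]


def dense_lookup(key):
    pos = dense_index.get(key)
    return None if pos is None else records[pos]


def sparse_lookup(key):
    # Find the last index entry whose key is <= the search key ...
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None
    # ... then scan the file sequentially from the record it points to.
    for record in records[sparse_ptrs[i]:]:
        if record[0] == key:
            return record
        if record[0] > key:
            break
    return None
```

The sparse index here has 2 entries against 6 for the dense one, at the price of a short sequential scan on each lookup.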

Figures 1 and 2 show dense and sparse indices for the deposit file.

Figure 1: Dense index.

• Notice how we would find records for the Perryridge branch using both methods.

Figure 2: Sparse index.

Index Choice

• Dense index requires more space overhead and more memory.

• Data can be accessed in a shorter time using a dense index.

• It is preferable to use a dense index when the index is a secondary index, or when the index file is small compared to the size of memory.

Single-Level Ordered Index: Clustering Index

• Records physically ordered by a non-key field

• Same general structure as an ordered file index
– <Clustering field, Block pointer>

• One entry in the index for each distinct value of the clustering field with a pointer to the first block in the data file that has a record with that value for its clustering field.

– Possibly many records for one index entry (non-dense)

• Sometimes entire blocks reserved for each distinct clustering field value
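A clustering index over a small made-up file might look like this. The data, the field name `dept_no`, and the function name are hypothetical.

```python
# Hypothetical data file: blocks of (dept_no, name), ordered on the
# non-key field dept_no.
blocks = [
    [(1, "Ann"), (1, "Bob")],
    [(1, "Cy"), (2, "Dee")],
    [(2, "Eve"), (3, "Fay")],
]

# One index entry per distinct clustering-field value -> first block holding it.
clustering_index = {}
for block_no, block in enumerate(blocks):
    for dept_no, _ in block:
        clustering_index.setdefault(dept_no, block_no)


def find_all(dept_no):
    """Return every record with the given clustering-field value."""
    start = clustering_index.get(dept_no)
    if start is None:
        return []
    matches = []
    for block in blocks[start:]:
        for record in block:
            if record[0] == dept_no:
                matches.append(record)
            elif record[0] > dept_no:
                return matches  # file is ordered, so no later record can match
    return matches
```

Each index entry leads to the first block for a value, and the remaining matches are found by reading on sequentially, which is why one entry can stand for many records.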

Secondary Indexes

• A secondary index must contain pointers to all the records.

• A pointer does not point directly to the file but to a bucket that contains pointers to the file.

• Secondary indices must be dense, with an index entry for every search-key value, and a pointer to every record in the file. Secondary indices improve the performance of queries on non-primary keys.
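A minimal sketch of a secondary index with pointer buckets, using a made-up account file that is ordered on account number but searched by branch name:

```python
from collections import defaultdict

# Hypothetical data file ordered on account number, not on branch name.
records = [
    (101, "Downtown"), (102, "Perryridge"), (103, "Brighton"),
    (104, "Perryridge"), (105, "Downtown"),
]

# Dense secondary index on the non-ordering field: each search-key value
# maps to a bucket of pointers (record positions) into the file.
secondary_index = defaultdict(list)
for position, (_, branch) in enumerate(records):
    secondary_index[branch].append(position)


def find_by_branch(branch):
    """Follow every pointer in the bucket for this branch name."""
    return [records[p] for p in secondary_index.get(branch, [])]
```

Because the file is not ordered on branch name, matching records can sit anywhere, so the bucket must list a pointer for each of them.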

Choosing Multi-Level Index

• In some cases an index may be too large for efficient processing.

• In that case use multi-level indexing.

• In multi-level indexing, the primary index is treated as a sequential file and a sparse index is created on it.

• The outer index is a sparse index of the primary index whereas the inner index is the primary index.
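A two-level lookup can be sketched as below. The key range, the four entries per index block, and all names are assumptions chosen to keep the example small.

```python
import bisect

# Inner (primary) index: sorted (first key in block, data block number) entries.
inner_index = [(k, k // 10) for k in range(0, 120, 10)]   # 12 entries
ENTRIES_PER_INDEX_BLOCK = 4

# Split the inner index into index blocks and build a sparse outer index
# holding the first key of each inner index block.
inner_blocks = [inner_index[i:i + ENTRIES_PER_INDEX_BLOCK]
                for i in range(0, len(inner_index), ENTRIES_PER_INDEX_BLOCK)]
outer_index = [block[0][0] for block in inner_blocks]


def find_data_block(key):
    """Return the number of the data block that may contain `key`."""
    # Level 1: pick the inner index block via the outer index.
    i = bisect.bisect_right(outer_index, key) - 1
    if i < 0:
        return None
    # Level 2: pick the entry inside that inner index block.
    keys = [k for k, _ in inner_blocks[i]]
    j = bisect.bisect_right(keys, key) - 1
    return inner_blocks[i][j][1]
```

Only one inner index block is read per lookup, however large the inner index grows; adding further levels repeats the same step.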

Multi-Level Index

B-Tree Index

• The B-tree is one of the most commonly used data structures for indexing.

• It is fully dynamic, that is, it can grow and shrink.

Three Types of B-Tree Nodes

• Root node - contains node pointers to branch nodes.

• Branch node - contains pointers to leaf nodes or other branch nodes.

• Leaf node - contains index items and horizontal pointers to other leaf nodes.

Full B Tree Structure

Dynamic Multilevel Indexes

– Retain the benefits of multilevel indexing while reducing the cost of index insertion and deletion

– Dynamic multilevel indexes are implemented as B-trees and, often, as B+-trees

• B-tree: allows an indexing field value to appear only once, at some level in the tree; each node holds pointers to data.

• B+-tree:
– Pointers to data are stored only at the leaf nodes of the tree
– Leaf nodes have an entry for every indexing field value
– The leaf nodes are usually linked together to provide ordered access on the indexing field to the records
– All the leaf nodes of the tree are at the same depth, so retrieval of any record takes the same number of node accesses

In a B-tree, search keys and data are stored in both internal and leaf nodes. In a B+-tree, data is stored only in the leaf nodes.

Searching in a B+-tree is simple because all data is found in the leaf nodes, whereas in a B-tree data may be found in a leaf or a non-leaf node.

Deleting from a non-leaf node of a B-tree is complicated. In a B+-tree, data is always found in a leaf node, so deletion is easier.

Insertion into a B-tree is more complicated than into a B+-tree.

A B+-tree stores redundant search keys, whereas a B-tree has no redundant values.

In a B+-tree the leaf nodes are linked in a sequential linked list, whereas the leaf nodes of a B-tree are not linked. Many database system implementers prefer the structural simplicity of a B+-tree.
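The search path and the linked leaves of a B+-tree can be illustrated on a hand-built tree. This sketch covers lookup and range scan only, with no insertion, splitting, or deletion, and the class and function names are made up for the example.

```python
import bisect


class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values
        self.next = None  # horizontal pointer to the next leaf


class Internal:
    def __init__(self, keys, children):
        # children[i] holds keys below keys[i]; the last child holds the rest.
        self.keys, self.children = keys, children


def find_leaf(node, key):
    # Descend from the root to the leaf that could contain `key`.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node


def search(root, key):
    leaf = find_leaf(root, key)
    i = bisect.bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None


def range_scan(root, low, high):
    """Yield (key, value) for low <= key <= high by walking the leaf list."""
    leaf = find_leaf(root, low)
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return
            if k >= low:
                yield k, v
        leaf = leaf.next


# Hand-built two-level tree; keys 20 and 30 appear in the root and in a leaf.
l1, l2, l3 = Leaf([5, 10], ["a", "b"]), Leaf([20, 25], ["c", "d"]), Leaf([30, 40], ["e", "f"])
l1.next, l2.next = l2, l3
root = Internal([20, 30], [l1, l2, l3])
```

Every lookup ends in a leaf, and a range query needs only one descent followed by a walk along the `next` pointers.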

B+-tree

B-tree
