IELM 230: File Storage and Indexes

Agenda:

- Physical storage of data in relational DBs

- Indexes and other means to speed up data access

- Defining indexes in SQL

Physical Data Storage

- All data in a DB is stored on hard disks (HD)

[Figure: hard disk schematic, labelled: platter, arm, arm motion, head, stepper motor, HD motor (7200 rpm)]

- All data in a file is a series of bits (0, 1)

- Each bit is stored (0 magnetised, 1 demagnetised) at a point along a track (tracks are concentric circles)

Physical Data Storage..

Typical IDE HD controller: 40-pin socket to motherboard

Data connections, including:

- 16 pins to carry data

- 3 pins to specify sector for I/O

HD storage details

[Figure: Schematic of data storage on a 1024-track disk, labelled: Track 0, Track n, Track 1023, Sector, Cluster of four sectors, R/W Head]

SECTOR: smallest unit of data exchange (typical size: 512 bytes) [why?]

CLUSTER: group of four sectors

The R/W heads move together; four R/W heads can read at the same time

CYLINDER: tracks on different platters that can be read simultaneously

Delays in HD-CPU data communication

Block: the amount of data in one sector

[Figure: the CPU/RAM sends a R/W request [sector address] to the HD controller and receives 1 sector of data (1 block)]

1. [request time] CPU/RAM sends the Block address to the HD controller

2. [seek time] Use stepper motor to locate R/W head above correct track

3. [read time] Read 1 Block of data from HD and store in HD buffer

4. [transfer time] Send Block of data on data bus to RAM

Typical DB concerns

Fast data access even when

- many users are simultaneously accessing a DB

- data is in a table with millions of rows

Typical operations

- Search for a particular row of data in a table: SELECT…

- Create a row in a table: INSERT…

- Modify some data in a row of a table: UPDATE…

- Delete a row of data from the table: DELETE…

How to store tables on the HD

1. Each table is stored as an independent file

2. The attributes in a table are often accessed together [Why ?]

We need to store the attribute values of each record contiguously, and the attributes MUST be stored in the same sequence for each record

3. We can choose the sequence in which different records are stored [Why ?]

Storage format for records

Employee( Lname, Fname, ID, DeptNo)

Record 1: Li Richard 99998888 D8

Record 2: Patten Christopher 44444444 D0

Within a record, fields are delimited by a field separator; records are delimited by a record separator.

Block n: Record 1, Record 2, Record 3, Record 4

Block n+1: Record 5, Record 6, Record 7

wasted disk space…
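The layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the separator characters, the function names, and the rule that a record is never split across two blocks are choices made here, not details given in the slides.

```python
FIELD_SEP = "|"    # hypothetical field separator
RECORD_SEP = "\n"  # hypothetical record separator


def encode(record):
    """Serialise one record: fields joined by FIELD_SEP, ended by RECORD_SEP."""
    return FIELD_SEP.join(record) + RECORD_SEP


def pack(records, block_size):
    """Pack encoded records into blocks; a record never spans two blocks."""
    blocks, current = [], ""
    for record in records:
        data = encode(record)
        if len(data) > block_size:
            raise ValueError("record larger than a block")
        if len(current) + len(data) > block_size:
            blocks.append(current)  # close the current block
            current = ""
        current += data
    if current:
        blocks.append(current)
    return blocks


def wasted(blocks, block_size):
    """Unused characters at the end of each block."""
    return [block_size - len(block) for block in blocks]


employees = [
    ("Li", "Richard", "99998888", "D8"),
    ("Patten", "Christopher", "44444444", "D0"),
]
```

With a toy block size of 40 characters, `pack(employees, 40)` places each record in its own block, and `wasted` reports the space left unused in each.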

Approximate time for different operations

1. [request time] < 10^-6 sec: CPU/RAM sends the Block address to the HD controller

2. [seek time] ~3 x 10^-3 sec: use stepper motor to locate R/W head on track

3. [read time] ~2-4 x 10^-3 sec (including mean latency): read 1 Block of data from HD and store in HD buffer

4. [transfer time] ~3 x 10^-3 sec: send 1 Block of data on data bus to RAM

CPU: search for a record in one Block of data stored in the RAM: ~10^-6 sec

Specifications: Seagate Cheetah 15K.7 600GB hard drive

- Capacity: 600 GB

- Buffer (cache) size: 16 MB

- Bytes per sector: 512

- Disk drive configuration: 4 disks, 8 heads

- Spindle speed: 15000 RPM

- Average read seek time: 3.4 ms

- Average rotational latency: 2.0 ms

- Transfer rate (SCSI): 600 MB/s

Time analysis of operation on DB

Total time for an operation (e.g. search for a record in a DB):

For each Block:

1. TRANSFER Block of data to RAM: a few 10^-3 secs

2. Search for data in Block, [transfer data from RAM to CPU] + [examine data] + [report output]: a few 10^-6 secs

Since TRANSFER time dominates, we will ignore CPU time for all further analysis.
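As a rough worked example, the per-block time t can be estimated by adding up the approximate delays quoted in the slides. The constant and function names below are made up for this sketch, and the values are order-of-magnitude figures rather than measurements.

```python
# Order-of-magnitude delays quoted in the slides, in seconds.
# The names are illustrative only.
REQUEST_TIME = 1e-6     # upper bound ("< 10^-6 sec")
SEEK_TIME = 3e-3
READ_TIME = 3e-3        # middle of the quoted 2-4 x 10^-3 range
TRANSFER_TIME = 3e-3
CPU_SEARCH_TIME = 1e-6  # search one block that is already in RAM


def block_time():
    """Approximate time t to bring one block from the HD into RAM."""
    return REQUEST_TIME + SEEK_TIME + READ_TIME + TRANSFER_TIME


def cpu_share():
    """Fraction of the total per-block time that is spent on CPU work."""
    t = block_time()
    return CPU_SEARCH_TIME / (t + CPU_SEARCH_TIME)


if __name__ == "__main__":
    print(f"t is about {block_time():.4f} s; CPU share is about {cpu_share():.4%}")
```

Under these assumed values the CPU work is a tiny fraction of the per-block time, which is the reason the later analysis counts block transfers only.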

Heap Files

HEAP file:

- All records of the table are stored in the order of creation

- Stored in one large file

- Stored on contiguous blocks on HD

Operation: Insert a new record

Method:

1. Get file data (location of 1st Block, size of file)

2. Transfer last Block from HD → RAM [t sec]

3. If (enough space): add record to Block; transfer updated Block → HD (write) [t sec]

4. Else: increment file size by 1 Block; add record to new Block; transfer new Block → HD (write) [t sec]

Worst case time = 2t sec (very fast)
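The insert procedure can be sketched with a toy in-memory model that counts block transfers. The class name, the record-count capacity, and the `transfers` counter are assumptions made for illustration; a real DBMS works on raw disk blocks.

```python
class HeapFile:
    """Toy heap file: a list of blocks, each holding up to `capacity` records."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []    # stands in for contiguous blocks on the HD
        self.transfers = 0  # number of block reads/writes, each costing t

    def insert(self, record):
        if self.blocks:
            self.transfers += 1      # read last block: HD -> RAM
            last = self.blocks[-1]
            if len(last) < self.capacity:
                last.append(record)
                self.transfers += 1  # write updated block: RAM -> HD
                return
        self.blocks.append([record])  # grow the file by one block
        self.transfers += 1           # write the new block


heap = HeapFile(capacity=2)
costs = []
for name in ["Li", "Patten", "Wong"]:
    before = heap.transfers
    heap.insert(name)
    costs.append(heap.transfers - before)
```

In this model no insert costs more than two transfers, in line with the 2t worst case.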

Heap file operations..

Operation: Search for a record

Method: Linear search

(1) Transfer 1st Block → RAM

(2) (CPU) Search for record in this Block

(3) If no match is found

(3.1) Copy the next Block into RAM (if no Block is left, stop: the record is not in the table)

(3.2) Go to (2)

Performance:

Let: Size of file: B blocks

Worst case: the data is in the last Block, or not in the table

Worst case time = Bt (very slow)

Average case time: Bt/2 (very slow)
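A minimal sketch of the linear search, counting how many blocks are transferred. The function name and the representation of a file as a list of blocks of tuples are assumptions for illustration.

```python
def linear_search(blocks, key, field=0):
    """Scan blocks in order; return (record, blocks_read).

    `record` is None when no record matches. Each block read costs t.
    """
    blocks_read = 0
    for block in blocks:
        blocks_read += 1  # transfer this block: HD -> RAM
        for record in block:
            if record[field] == key:
                return record, blocks_read
    return None, blocks_read


heap_blocks = [
    [("99998888", "Li"), ("44444444", "Patten")],
    [("12345678", "Wong")],
]
```

A record in the last block, or a key that is absent, forces all B blocks to be read.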


Operation: Update a record

Method: Linear search

(1) Search for the record to update (Linear search)

(2) If found: Modify record; Write the updated Block to HD

Performance:

Let: Size of file: B blocks

Worst case = Step (1): Bt; Step (2): t

Worst case time = Bt+t (very slow)

Average case time: Bt/2 + t (average search time plus one Block write) (very slow)


Operation: Delete a record

Method: Same as for Update

Performance: Same as for Update

Problem:

Extra space (‘Hole’) is left in the Block with the deleted record

Typical solutions:

(a) Periodic consolidation of Blocks

(b) Use of 1-bit ‘RECORD_DELETED’ markers
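Both solutions can be sketched together: a per-record deleted flag, plus a periodic pass that rewrites the file without the flagged records. The slot dictionary layout and the function names are hypothetical, chosen only for this example.

```python
def slot(record):
    """Wrap a record with a RECORD_DELETED-style marker (hypothetical layout)."""
    return {"record": record, "deleted": False}


def delete(blocks, key, field=0):
    """Mark the first live record matching `key` as deleted."""
    for block in blocks:
        for s in block:
            if not s["deleted"] and s["record"][field] == key:
                s["deleted"] = True  # leaves a hole in the block
                return True
    return False


def compact(blocks, capacity):
    """Periodic consolidation: keep live records and repack them densely."""
    live = [s for block in blocks for s in block if not s["deleted"]]
    return [live[i:i + capacity] for i in range(0, len(live), capacity)]


file_blocks = [
    [slot((1001, "Abbot")), slot((1002, "Akers"))],
    [slot((1008, "Anders"))],
]
```

Marking is cheap at delete time; the cost of reclaiming space is deferred to the occasional `compact` pass.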

Sorted Files

Main idea: sort the records in the file based on the value of one attribute (the ordering attribute/field).

Table sorted by SSN (columns: Lname, SSN, Job, Salary; Job and Salary values are not shown)

Block     Lname      SSN
Block 1   Abbot      1001
          Akers      1002
          Anders     1008
Block 2   Alex       1024
          Wong       1055
          Atkins     1086
Block 3   Arnold     1197
          Nathan     1208
          Jacobs     1239
Block 4   Anderson   1310
          Adams      1321
          Aaron      1412
Block 5   Allen      1413
          Zimmer     1514
          Ali        1615
…
Block n   Acosta     2085

Sorted file operations

Operation: Search for a record, given value of ordering field

Method: Binary search

Let file size = b Blocks.

1. Look in Block number b/2

2. If (searched record is in this Block), DONE

3. If (searched value) > (last ordering field value in this Block): binary search in Blocks between (b/2 + 1) and b

4. Else: binary search in Blocks between 1 and (b/2 - 1)

Performance: Worst case: t(1 + log2 b)
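The block-level binary search can be sketched as follows. The function name and the in-memory representation (a list of non-empty blocks, each a list of (key, value) tuples sorted by key) are assumptions for illustration; the sample data is the first five blocks and the last block of the example table.

```python
def block_binary_search(blocks, key):
    """Binary search over blocks of a sorted file.

    Returns (record, blocks_read); record is None if the key is absent.
    Assumes every block is non-empty and keys are sorted across the file.
    """
    lo, hi = 0, len(blocks) - 1
    blocks_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]
        blocks_read += 1           # transfer block: HD -> RAM
        if key < block[0][0]:      # before the first key in this block
            hi = mid - 1
        elif key > block[-1][0]:   # after the last key in this block
            lo = mid + 1
        else:                      # key can only be in this block
            for record in block:
                if record[0] == key:
                    return record, blocks_read
            return None, blocks_read
    return None, blocks_read


sorted_blocks = [
    [(1001, "Abbot"), (1002, "Akers"), (1008, "Anders")],
    [(1024, "Alex"), (1055, "Wong"), (1086, "Atkins")],
    [(1197, "Arnold"), (1208, "Nathan"), (1239, "Jacobs")],
    [(1310, "Anderson"), (1321, "Adams"), (1412, "Aaron")],
    [(1413, "Allen"), (1514, "Zimmer"), (1615, "Ali")],
    [(2085, "Acosta")],
]
```

Each iteration reads one block and halves the remaining range, so the number of block reads grows with log2 of the file size.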

Sorted file operations..

Heap file vs. Sorted File, Search time comparison

ASSUME: file size = 8192 blocks.

Heap file: Worst case time = 8192t

Sorted file: Worst case time = t(1 + log2 8192) = t(1 + 13) = 14t

Searching in the sorted file is 8192/14 ≈ 585 times faster
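The comparison can be reproduced for other file sizes with a short calculation. The function names are illustrative; the formulas are the worst-case expressions used above, measured in units of t.

```python
import math


def heap_worst_case(b):
    """Worst-case block transfers for linear search in a heap file of b blocks."""
    return b


def sorted_worst_case(b):
    """Worst-case block transfers for binary search in a sorted file of b blocks."""
    return 1 + math.log2(b)


def speedup(b):
    """How many times faster the sorted-file search is in the worst case."""
    return heap_worst_case(b) / sorted_worst_case(b)


if __name__ == "__main__":
    for b in (1024, 8192, 1_000_000):
        print(b, sorted_worst_case(b), round(speedup(b)))
```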

Sorted file operations…

Operations: Delete a record/update a value in a record

Method: Binary search for record; Modify and Write block

Performance: The worst case time = t(1 + log2 b) + t (fast), where t(1 + log2 b) is the worst case search time and t is the time to write the Block

NOTE

1. Still need to perform occasional ‘file compacting’ after deletions

2. What if we want to modify the ordering attribute value?

Sorted file operations….

Operations: Insert a new record; update the ordering attribute value of a record

Method 1: Insert the record in the correct position by ordering field.

1. Search for the correct Block to insert the record

2. If (Block is full)

2.1. Remove last record in Block

2.2. Insert new record and rewrite Block

2.3. Insert the record removed in step 2.1 into the next Block…

Performance: search for the insertion point ≈ t(1 + log2 b), plus read and write of each Block = 2bt in the worst case. Very inefficient.

Sorted file operations….

Operations: Insert a new record; update the ordering attribute value of a record

Method 2: Overflow files

Use two files to store a Table:

Main file: contains most of the records, SORTED

Overflow file: recently inserted records are stored here, HEAP

At periodic intervals, the Overflow file records are merged into the Main file.

Performance: Insertion time: 2t (constant time) (very fast) + occasional time to consolidate Overflow and Main files

Sorted file operations…..

Operations: Search for a record in a Table stored in Sorted + Overflow files

Method:

1. Binary search in Main file

2. Linear search in Overflow file

Performance: [exercise]
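A small sketch of the main-plus-overflow arrangement. It works on individual keys rather than disk blocks, and the class and method names are hypothetical; it is meant only to show where the binary search, the linear search, and the periodic merge fit.

```python
import bisect


class SortedWithOverflow:
    """Toy table stored as a sorted main file plus a heap overflow file."""

    def __init__(self, keys):
        self.main = sorted(keys)  # main file, kept sorted
        self.overflow = []        # overflow file, in insertion order

    def insert(self, key):
        self.overflow.append(key)  # no records in the main file are shifted

    def search(self, key):
        """Return which file holds the key, or None if it is absent."""
        i = bisect.bisect_left(self.main, key)  # binary search in main file
        if i < len(self.main) and self.main[i] == key:
            return "main"
        if key in self.overflow:                # linear search in overflow file
            return "overflow"
        return None

    def merge(self):
        """Periodic consolidation of the overflow file into the main file."""
        self.main = sorted(self.main + self.overflow)
        self.overflow = []


table = SortedWithOverflow([1001, 1024, 1197])
table.insert(1100)
```

Search cost grows with the size of the overflow file, which is why the merge has to run from time to time.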

Faster search: Hashing

Main idea: divide data into a series of organized “buckets”

Setting up a hash table:

1. Estimate maximum size of Table (e.g. 10,000 Blocks)

2. Specify maximum search time for a record (e.g. 10t)

3. Determine bucket size (here, 10 Blocks)

4. Determine a hashing attribute

5. Determine a hashing function, h( )

h( hash_attribute_value) = Bucket_number

6. Reserve max_size contiguous Blocks on HD

Using a hash file

Insert a record:

Let Bucket size = b blocks;

1. Compute the Bucket address: Addr = h( hash key value)

2. Get Block at address Addr to RAM

2.1. If enough space, insert the record and rewrite Block to HD

2.2. Else set (Addr = Addr+1); go to Step 2.

NOTE:

1. Selection of h( ) is critical: h( hash_key_values) must be uniformly distributed over Buckets 1,…,n

2. What happens if a Bucket is full ?

Performance: Constant time for Search, Insert, Delete, Update (bounded by the Bucket size b, independent of the Table size)
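The hash file can be sketched as an array of pre-reserved blocks grouped into buckets. Several choices here are assumptions for illustration: the class and method names, the modulo hash function, and the decision to raise an error when a bucket is full (the slides leave that case as an open question).

```python
class HashFile:
    """Toy hash file: n_buckets buckets of `blocks_per_bucket` blocks each."""

    def __init__(self, n_buckets, blocks_per_bucket, records_per_block):
        self.n_buckets = n_buckets
        self.b = blocks_per_bucket
        self.capacity = records_per_block
        # Reserve every block up front, like contiguous space on the HD.
        self.blocks = [[] for _ in range(n_buckets * blocks_per_bucket)]

    def h(self, key):
        """Illustrative hash function only: key modulo number of buckets."""
        return key % self.n_buckets

    def insert(self, key, value):
        start = self.h(key) * self.b
        for addr in range(start, start + self.b):  # Addr, Addr+1, ...
            if len(self.blocks[addr]) < self.capacity:
                self.blocks[addr].append((key, value))
                return addr
        raise OverflowError("bucket full")  # assumption: no overflow handling

    def search(self, key):
        """Return (value, blocks_read); at most b blocks are read."""
        start = self.h(key) * self.b
        blocks_read = 0
        for addr in range(start, start + self.b):
            blocks_read += 1
            for k, value in self.blocks[addr]:
                if k == key:
                    return value, blocks_read
        return None, blocks_read


hash_file = HashFile(n_buckets=4, blocks_per_bucket=2, records_per_block=2)
for k, v in [(1, "a"), (5, "b"), (9, "c")]:
    hash_file.insert(k, v)
```

Because a search never leaves its bucket, the number of block reads is bounded by the bucket size regardless of how large the table grows.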

Indexes

- Hash files sacrifice extra disk space [Why?] for operation speed

- Another way to use extra space for faster operations: Index files

A primary index file is an index that is constructed using the sorting attribute of the main file.

- default sorting attribute: primary key

Primary Index

Example:

Primary Index File (key attribute: SSN; each entry holds the anchor value of a Block and its Block address)

SSN (anchor value)   Block No
1001                 1
1024                 2
1197                 3
1310                 4
1413                 5
…                    …
2085                 n

Main File (sorted by SSN)

Block     Lname      SSN
Block 1   Abbot      1001
          Akers      1002
          Anders     1008
Block 2   Alex       1024
          Wong       1055
          Atkins     1086
Block 3   Arnold     1197
          Nathan     1208
          Jacobs     1239
Block 4   Anderson   1310
          Adams      1321
          Aaron      1412
Block 5   Allen      1413
          Zimmer     1514
          Ali        1615
…
Block n   Acosta     2085

Primary Index..

Operation: Search for a record in the main file

Procedure:

1. Binary search for Block address of record in primary index file

2. Fetch Block of Main file with searched record to RAM

2.1. Search this Block for the data

Performance:

Let size of Primary Index file = P blocks

Worst case time to locate Block address ≈ t(1 + log2 P)

Time to fetch located Block from Main file = t

Total worst case time ≈ t(1 + log2 P) + t = t(2 + log2 P) (very fast)
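The two-step lookup can be sketched with Python's standard `bisect` module. The index is modelled as a sorted list of (anchor value, block number) pairs and the main file as a dictionary from block number to records; both representations and the function names are assumptions for illustration.

```python
import bisect


def locate_block(index, key):
    """Binary search the primary index for the block that may hold `key`.

    `index` is a sorted list of (anchor value, block number) pairs.
    Returns None if `key` is smaller than the first anchor value.
    """
    anchors = [anchor for anchor, _ in index]
    i = bisect.bisect_right(anchors, key) - 1  # last anchor <= key
    if i < 0:
        return None
    return index[i][1]


def search(index, main_file, key):
    """Locate the block via the index, then scan that single block."""
    block_no = locate_block(index, key)
    if block_no is None:
        return None
    for record in main_file[block_no]:
        if record[0] == key:
            return record
    return None


primary_index = [(1001, 1), (1024, 2), (1197, 3), (1310, 4), (1413, 5)]
main_file = {
    1: [(1001, "Abbot"), (1002, "Akers"), (1008, "Anders")],
    2: [(1024, "Alex"), (1055, "Wong"), (1086, "Atkins")],
    3: [(1197, "Arnold"), (1208, "Nathan"), (1239, "Jacobs")],
    4: [(1310, "Anderson"), (1321, "Adams"), (1412, "Aaron")],
    5: [(1413, "Allen"), (1514, "Zimmer"), (1615, "Ali")],
}
```

Only one block of the main file is fetched per lookup; the rest of the work happens in the much smaller index.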

Primary Index…

Example: search for record of SSN= ‘1208’

[Figure: Primary Index File (Blocks 1 to P; entries SSN → Block No: 1001 → 1, 1024 → 2, 1197 → 3, 1310 → 4, 1413 → 5, …, 2085 → n) next to the Main File (Blocks 1 to n, sorted by SSN)]

1. Binary search in the P Blocks of the index: SSN = ‘1208’ is in Block 3 of the Main file (anchor values 1197 ≤ 1208 < 1310)

2. Fetch Block 3 of the Main file

3. Find the data for SSN = ‘1208’

Primary Index….

Operation: Insert a record into main file

Problem: the Main file must be sorted by the sorting attribute, and inserting into the correct position is too expensive

Solution: newly inserted records are stored in an Overflow file

NOTE: the Overflow file may be a Hash file (fast), or a Heap file

Performance analysis: Constant time (add record to last Block in Overflow file)

Secondary Indexes

A secondary index file is an index constructed on any non-sorting attribute of the Main table.

The Secondary Index is a two-column file storing the block address for every value of the secondary index attribute in the table.

Secondary Indexes..

Example: Secondary Index on Lname

Secondary Index File (key attribute value: Lname; each entry holds the Block address of the matching record)

Lname    Block No
Aaron    4
Abbot    1
Acosta   n
Adams    4
Akers    1
…        …
Ali      5
Allen    5
…        …
Wong     2
Zimmer   5
…        …

Main File: the table sorted by SSN (Blocks 1 to n), as shown for the primary index.
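A secondary index of this kind can be sketched as a sorted list with one entry per record. The function names and the dictionary-of-blocks representation of the main file are assumptions for illustration; only the first two blocks of the example are included.

```python
import bisect


def build_secondary_index(main_file, field):
    """Build a sorted list of (attribute value, block number), one per record."""
    entries = [
        (record[field], block_no)
        for block_no, records in main_file.items()
        for record in records
    ]
    entries.sort()
    return entries


def lookup(index, value):
    """Return the block numbers of all records whose attribute equals `value`."""
    i = bisect.bisect_left(index, (value,))  # first entry with this value
    block_numbers = []
    while i < len(index) and index[i][0] == value:
        block_numbers.append(index[i][1])
        i += 1
    return block_numbers


employee_blocks = {
    1: [(1001, "Abbot"), (1002, "Akers"), (1008, "Anders")],
    2: [(1024, "Alex"), (1055, "Wong"), (1086, "Atkins")],
}
lname_index = build_secondary_index(employee_blocks, field=1)
```

Unlike the primary index, this index needs an entry for every record, because the main file is not sorted by Lname.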

Secondary Indexes…

Operations and time analysis: Similar to Primary Index

Each table can have only one primary index

You can define more than one secondary index file

Why would we create more than one index for the same table?

Creating, Deleting Indexes in SQL

Example 1: Create an index file for Lname attribute of EMPLOYEE.

CREATE INDEX myLnameIndex ON EMPLOYEE(Lname);

Example 2: You can also create an Index on a combination of attributes.

CREATE INDEX myNamesIndex ON EMPLOYEE(Lname, Fname);

Example 3: Delete the index created in Example 2.

DROP INDEX myNamesIndex;
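The three statements can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the EMPLOYEE column types are guessed, SQLite is used only because it needs no server, and the exact `DROP INDEX` syntax differs between database systems (MySQL, for example, also requires `ON table_name`).

```python
import sqlite3

# In-memory database; the column types are assumptions for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (Lname TEXT, Fname TEXT, ID TEXT, DeptNo TEXT)"
)

# Examples 1 and 2: single-attribute and multi-attribute indexes.
conn.execute("CREATE INDEX myLnameIndex ON EMPLOYEE(Lname)")
conn.execute("CREATE INDEX myNamesIndex ON EMPLOYEE(Lname, Fname)")


def index_names(connection):
    """List the indexes SQLite currently records for EMPLOYEE."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'EMPLOYEE'"
    )
    return sorted(row[0] for row in rows)


before = index_names(conn)

# Example 3: delete the index created in Example 2.
conn.execute("DROP INDEX myNamesIndex")
after = index_names(conn)
```

Comparing `before` and `after` shows that only `myNamesIndex` was removed; the table and the other index are untouched.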
