CPSC-608 Database Systems
Fall 2008
Instructor: Jianer Chen
Office: HRBB 309B
Phone: 845-4259
Email: [email protected]
Notes #6
Optimizing Disk Access
[Figure: seek time vs. number of cylinders traveled: roughly linear (slope x per cylinder) beyond the first cylinder, with a startup cost of about 3x to 5x]
Average seek time = 20 ms
Shortest seek time = 5 ms
Average rotational delay = 8 ms
Dealing with many random accesses (disk scheduling)
• Suppose that we have a large (dynamic) sequence of disk read/write tasks, on blocks randomly distributed across the disk.
• How do we order the tasks so that the total time is minimized?
• Elevator algorithm
Elevator algorithm
The disk head makes sweeps across the disk, stopping at a cylinder whenever a task reads/writes a block in that cylinder, and reversing direction when no pending read/write tasks (at that moment) lie ahead.
Intuitively good, in particular when a large number of tasks read/write blocks uniformly distributed across the disk
• Works in a real-time (online) manner
• Precise analysis is difficult
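The sweep described above can be sketched in a few lines of Python; the cylinder numbers and starting head position below are made-up examples, not figures from the notes.

```python
# A minimal sketch of the elevator (SCAN) scheduling policy described above.
# Cylinder numbers and the starting head position are hypothetical.

def elevator_schedule(requests, head, direction=1):
    """Serve pending cylinder requests by sweeping the head in one
    direction, reversing only when no requests remain ahead."""
    pending = sorted(requests)
    order = []
    while pending:
        # Requests at or beyond the head in the current direction.
        ahead = [c for c in pending if (c - head) * direction >= 0]
        if not ahead:
            direction = -direction      # nothing ahead: reverse the sweep
            continue
        nxt = min(ahead, key=lambda c: abs(c - head))
        order.append(nxt)
        pending.remove(nxt)
        head = nxt
    return order

print(elevator_schedule([98, 183, 37, 122, 14, 124, 65, 67], head=53))
# [65, 67, 98, 122, 124, 183, 37, 14]
```

Note how the head first finishes the upward sweep (65 through 183) before turning around to serve 37 and 14, rather than chasing the globally closest request.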
Dealing with a long sequence of data on disk
• Data in consecutive cylinders
• Larger buffer
• Pre-fetch/double buffering
• Disk arrays
• Mirrored disks
Example. Sorting on disk (again)
• A relation R of 10M tuples takes 100K blocks
• Main memory can store 6400 blocks
• A disk block read/write takes 40 ms:
  seek time = 31 ms, rotational delay = 8 ms, transfer time = 1 ms
• Two-phase Multiway Sorting on randomly distributed blocks takes about 4.5 hours.
• Also assume that a track holds 500 blocks, and that traversing one cylinder takes 5 ms
Data in consecutive cylinders
In phase 1, suppose that the input relation is stored in consecutive tracks.
Then we can read/write 6400 consecutive blocks at a time between main memory and disk.
Phase 1 read/write: 2 × (100K/6400) × (31 + 8 + 12×5 + 6400×1) ≈ 203,000 ms < 4 minutes (saving about 2 hours)
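The phase-1 estimate can be re-derived in a few lines; all figures come from the example above (the slide's rounded total may differ slightly from the exact product).

```python
# Re-deriving the phase-1 cost from the figures in the example above.
import math

blocks_total = 100_000        # blocks in relation R
run_size     = 6_400          # consecutive blocks per memory-sized sweep
seek, rot, xfer = 31, 8, 1    # ms: seek, rotational delay, per-block transfer
cyl_move = 5                  # ms to move to the next cylinder
moves = math.ceil(run_size / 500) - 1   # 12 cylinder moves per run

per_run   = seek + rot + moves * cyl_move + run_size * xfer   # 6499 ms per sweep
phase1_ms = 2 * (blocks_total / run_size) * per_run           # one read + one write pass
print(round(phase1_ms))       # 203094, i.e. about 3.4 minutes
```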
Larger Buffer
In phase 2, we have 16 sublists, each taking one block in main memory, which leaves 6384 blocks free.
If we use all these 6384 blocks as an output buffer and write it to disk only when it is full:
Phase 2 writing: (100K/6384) × (31 + 8 + 12×5 + 6384×1) ≈ 102,000 ms < 2 minutes (saving about 1 hour)
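The phase-2 write cost follows the same pattern; again, every figure below is taken from the example.

```python
# Re-deriving the phase-2 write cost from the figures in the example above.
import math

buf = 6_384                   # free blocks used as one large output buffer
seek, rot, cyl_move, xfer = 31, 8, 5, 1   # ms, as in the example
moves = math.ceil(buf / 500) - 1          # 12 cylinder moves per flush

per_flush = seek + rot + moves * cyl_move + buf * xfer   # 6483 ms per flush
phase2_write_ms = (100_000 / buf) * per_flush
print(round(phase2_write_ms))  # 101551, i.e. under 2 minutes
```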
However, the reading in phase 2 is harder to improve: the access pattern is essentially random.
Double Buffering
For applications where the read/write pattern is predictable.
Have a program:
» Process B1
» Process B2
» Process B3
...
Single Buffer Solution
1. Read B1 into Buffer
2. Process data in Buffer
3. Read B2 into Buffer
4. Process data in Buffer
...

Let P = time to process one block
    R = time to read in one block
    n = # blocks
Single buffer time = n(P+R)
Double Buffering

[Figure: animation over disk blocks A, B, C, D, E, F, G: while the CPU processes the block just read into one memory buffer, the disk reads the next block into the other buffer]
If P ≥ R:
• Double buffering time = R + nP
• Single buffering time = n(R + P)

P = processing time per block
R = I/O time per block
n = # blocks
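The two formulas can be compared directly; the n, P, and R values below are hypothetical, chosen only to make the overlap visible.

```python
# Single vs. double buffering under the slide's cost model (assumes P >= R).

def single_buffer_time(n, P, R):
    return n * (P + R)        # each block: read, then process, strictly in turn

def double_buffer_time(n, P, R):
    assert P >= R             # next read finishes while the current block is processed
    return R + n * P          # only the very first read is not overlapped

n, P, R = 100_000, 1, 1       # hypothetical: n blocks, 1 ms to process, 1 ms to read
print(single_buffer_time(n, P, R))   # 200000
print(double_buffer_time(n, P, R))   # 100001
```

With P = R the overlap hides almost the entire I/O cost, cutting total time nearly in half.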
Disk Arrays

Taking advantage of the fact that disk reads/writes can be done in parallel between a single CPU and multiple disks.
logically one disk
Would not help if the blocks of interest are all on the same disk
Mirrored Disks
Duplicating disks so that multiple reads of the same data can be done in parallel.
[Figure: four disks holding A, A, B, B: each block mirrored on two disks]
Writing is somewhat (but not much) more expensive.
Disk Failures
• Partial vs. Total
• Intermittent vs. Permanent
Coping with Disk Failures
• Detection– Checksum
• Correction– Redundancy
At what level do we cope?
Operating System Level (Stable Storage)
[Figure: each logical block stored as two physical copies, Copy A and Copy B]
Database System Level (Log File)
[Figure: a log file, the current DB, and yesterday's DB]
Intermittent Failure Detection (Checksums)
• Idea: add n parity bits for every m data bits
  – Ex.: m = 8, n = 1
• Block A: 01101000:1 (data has an odd # of 1's, so parity bit = 1)
• Block B: 11101110:0 (data has an even # of 1's, so parity bit = 0)
• But suppose Block A is corrupted into
• Block A': 01000000:1 (also an odd # of 1's): the error goes undetected, so each parity bit gives only a 50% chance of detection
• More parity bits decrease the probability of an undetected failure to 1/2^n (with n ≤ m independent parity bits)
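The m = 8, n = 1 scheme above amounts to an even-parity bit per byte; a small sketch, with the bit strings copied from the example:

```python
# Even-parity checksum for m = 8 data bits, n = 1 parity bit.

def parity_bit(data_bits: str) -> int:
    """Parity bit chosen so data bits + parity bit contain an even # of 1's."""
    return data_bits.count("1") % 2

print(parity_bit("01101000"))  # 1  (block A)
print(parity_bit("11101110"))  # 0  (block B)
print(parity_bit("01000000"))  # 1  (corrupted A' yields the same bit: undetected)
```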
Disk Crash (Disk Arrays)
• RAIDs (Redundant Arrays of Inexpensive Drives)
logically one disk
Disk Arrays• RAID Level 1 (Mirroring)
– Keep exact copy of data on redundant disks
[Figure: data disks A, B each mirrored on a redundant disk]
Disk Arrays• RAID Level 4
– Keep only one redundant disk
– Entire parity blocks on the redundant disk
[Figure: data disks A, B, C plus one parity disk P]
Parity Blocks & Modulo-2 Sums
• Have an array of 3 data disks
  – Disk 1, block 1: 11110000
  – Disk 2, block 1: 10101010
  – Disk 3, block 1: 00111000
• ... and 1 parity disk
  – Disk 4, block 1: 01100010
Note:
– The sum over each column (data plus parity) is always an even # of 1's
– The mod-2 sum can recover any single missing row (e.g., a lost logical block)
Using Mod-2 Sums for Error Recovery
– Suppose we have:
  – Disk 1, block 1: 11110000
  – Disk 2, block 1: ????????
  – Disk 3, block 1: 00111000
  – Disk 4, block 1: 01100010 (parity)
– The mod-2 sum for block 1 over disks 1, 3, 4 recovers Disk 2, block 1: 10101010
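The recovery step is just a bitwise XOR of the surviving blocks; the bit strings below are the ones from the example above.

```python
# Recover a lost block as the mod-2 (XOR) sum of the surviving blocks.

def xor_blocks(*blocks: str) -> str:
    return "".join(str(sum(int(b[i]) for b in blocks) % 2)
                   for i in range(len(blocks[0])))

disk1, disk3, parity = "11110000", "00111000", "01100010"
print(xor_blocks(disk1, disk3, parity))  # 10101010, the missing disk-2 block
```

As a sanity check, XOR-ing all four blocks (the recovered one included) gives all zeros, which is exactly the even-column-sum property stated above.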
Disk Arrays
• RAID Level 5 (Striping)
  – Like level 4, but with balanced read & write load

[Figure: blocks A, B, C, D striped across the disks, with a parity partition on each disk]
Disk Arrays
• RAID Level 6 (error-correcting codes): more powerful, can recover from more than one disk crash.