Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 CSG131 Summary Donghui Zhang

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CSG131 Summary

Donghui Zhang


What we learned Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation,

NEUStore SB-tree, R-tree.


Components of a Disk

Platters

The platters spin (say, 90rps).

Spindle

The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder (imaginary!).

Disk head

Arm movement

Arm assembly

Only one head reads/writes at any one time.

Tracks

Sector

Block size is a multiple of sector size (which is fixed).


Accessing a Disk Page

Time to access (read/write) a disk block: seek time (moving arms to position disk head on track) rotational delay (waiting for block to rotate under head) transfer time (actually moving data to/from disk surface)

Seek time and rotational delay dominate. Seek time varies from about 1 to 20msec Rotational delay varies from 0 to 10msec Transfer rate is about 1msec per 4KB page

Key to lower I/O cost: reduce seek/rotation delays! Hardware vs. software solutions?


Index Types A primary index is an index which

controls the actual storage of a table. Typically this index is built using the primary key of the table. A data entry is one record of the table.

A secondary index is an index which is built using some other attribute(s). A data entry contains a set of RIDs.


An Example3325 6632

0123

Donghui Zhang

33 617-373-2177

0222

James Bond

55 617-324-6685

2418

Betty Wu

21 951-666-8888332

5Jessica Greene

21

417-332-6659

3891

Frank Gates

33

433-222-1956

6632

Tim Varner

55 617-333-6782

9688

Stephanie Chen

33 951-243-0104

10

20

30

primary index

secondary index21: {10, 20}33: {10, 20, 30}55: {10, 30}


Buffer Management in a DBMS

Data must be in RAM for DBMS to operate on it! For each page, dirty bit, pin count.

DB

MAIN MEMORY

DISK

disk page

free frame

Page Requests from Higher Levels

BUFFER POOL

choice of frame dictatedby replacement policy


Page Formats: Variable Length Records

Can move records on page without changing rid; so, attractive for fixed-length records too.

Page iRid = (i,N)

Rid = (i,2)

Rid = (i,1)

Pointerto startof freespace

SLOT DIRECTORY

N . . . 2 120 16 24 N

# slots



NEUStore SB-tree, R-tree


Example B+ Tree

2* 3*

Root

17

24 30

14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

135

7*5* 8*


B+ Trees in Practice

Typical order: 100. Typical fill-factor: 67%. average fanout = 133

Can often hold top levels in buffer pool: Level 1 = 1 page = 8 KB Level 2 = 133 pages = 1 MB Level 3 = 17,689 pages = 145 MB Level 4 = 2,352,637 pages = 19 GB With 1 MB buffer, can locate one record

in 19 GB (or 0.3 billion records) in two I/Os!


B+-tree index Structure Search Insert Delete

Bulk-loading a B+-tree Aggregation Query SB-tree





Extendible Hashing Directory is array of size 4. To find bucket for r, take

last `global depth’ # bits of h(r); we denote r by h(r). If h(r) = 5 = binary 101,

it is in bucket pointed to by 01.

Insert: If bucket is full, split it (allocate new page, re-distribute).

If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)

13*00

01

10

11

2

2

2

2

2

LOCAL DEPTH

GLOBAL DEPTH

DIRECTORY

Bucket A

Bucket B

Bucket C

Bucket D

DATA PAGES

10*

1* 21*

4* 12* 32* 16*

15* 7* 19*

5*


Insert h(r)=20 (Causes Doubling)

20*

00

01

10

11

2 2

2

2

LOCAL DEPTH 2

2

DIRECTORY

GLOBAL DEPTHBucket A

Bucket B

Bucket C

Bucket D

Bucket A2(`split image'of Bucket A)

1* 5* 21*13*

32*16*

10*

15* 7* 19*

4* 12*

19*

2

2

2

000

001

010

011

100

101

110

111

3

3

3DIRECTORY

Bucket A

Bucket B

Bucket C

Bucket D

Bucket A2(`split image'of Bucket A)

32*

1* 5* 21*13*

16*

10*

15* 7*

4* 20*12*

LOCAL DEPTH

GLOBAL DEPTH


Example of Linear Hashing

On split, hLevel+1 is used to re-distribute entries.

0hh

1

(This infois for illustrationonly!)

Level=0, N=4

00

01

10

11

000

001

010

011

(The actual contentsof the linear hashedfile)

Next=0

PRIMARYPAGES

Data entry rwith h(r)=5

Primary bucket page

44* 36*32*

25*9* 5*

14* 18*10*30*

31*35* 11*7*

0hh

1

Level=0

00

01

10

11

000

001

010

011

Next=1

PRIMARYPAGES

44* 36*

32*

25*9* 5*

14* 18*10*30*

31*35* 11*7*

OVERFLOWPAGES

43*

00100

Let hLevel=(a*key+b) % 4hLevel+1=(a*key+b) % 8

If insert 22……


Example of Linear Hashing

0hh

1

Level=0

00

01

10

11

000

001

010

011

Next=2

PRIMARYPAGES

44* 36*

32*

25*9*

14* 18*10*30*

31*35* 11*7*

OVERFLOWPAGES

43*

00100

Let hLevel=(a*key+b) % 4hLevel+1=(a*key+b) % 8

22*

5*01101

Important notice: after buckets 010 and 011are split, Next moves back To 000!





General External Merge Sort

To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages. Produce sorted

runs of B pages each. Pass 2, …, etc.: merge B-1 runs.

N B/

B Main memory buffers

INPUT 1

INPUT B-1

OUTPUT

DiskDisk

INPUT 2

. . . . . .

. . .

More than 3 buffer pages. How can we utilize them?


Cost of External Merge Sort

Number of passes: Cost = 2N * (# of passes) E.g., with 5 buffer pages, to sort 108 page

file: Pass 0: = 22 sorted runs of 5 pages

each (last run is only 3 pages) Pass 1: = 6 sorted runs of 20 pages

each (last run is only 8 pages) Pass 2: 2 sorted runs, 80 pages and 28 pages Pass 3: Sorted file of 108 pages

Note: for merge join, no need to write result.

1 1 log /B N B

108 5/

22 4/





parse

convert

apply laws

estimate result sizes

consider physical plans estimate costs

pick best

execute

{P1,P2,…..}

{(P1,C1),(P2,C2)...}

Pi

answerSQL query

parse tree

logical query plan

“ improved” l.q.p

l.q.p. +sizes

statistics

query compiling

query optimization

query execution


Query Compiling Convert

Input: a parse tree. Output: a logical query plan.

Algorithm: followed by . E.Name(E.SSN<5000 AND E.Age>50(E) ) Alternatively, a l.q.p tree.

SELECT E.NameFROM Emp EWHERE E.SSN<5000 AND E.Age>50

Emp E

E.SSN<5000 AND E.Age>50

E.Name


Query Optimization Consider Physical Plans

Associate each RA operator with an implementation scheme. Multiple implementation schemes? Enumerate all.


Emp E


E.Name

Plan 1 (always work!)

scan

on-the-fly




Emp E


E.Name

Plan 2

range search in SSN index

on-the-fly




Emp E


E.Name

Plan 3

range search in Age index, follow pointers to SSN index

on-the-fly


Query Optimization Estimate Costs sample input

Assume for table E: Schema = (SSN: int, Name: string, Age: int, Salary: int) T(E) = 100 tuples. For attribute SSN:

• V(E, SSN)=100, min(E, SSN)=0000, max(E, SSN)=9999 For attribute Age:

• V(E, Age)=20, min(E, Age)=21, max(E, Age)=60 Primary index on SSN: 3 level B+-tree, 50 leaf nodes. Secondary index on Age: 2 level B+-tree, 10 leaf nodes, every

leaf entry points to 3.5 pageIDs (on average). Assumptions: all B+-tree roots are pinned. Can reach the

first leaf page of a B+-tree directly. Memory buffer size: 2 pages.



Query Optimization Estimate Costs

Cost = 50. (The primary index has 50 leaf nodes. Assume we can reach the first leaf page of a B+-tree directly.)


Emp E


E.Name

Plan 1 (always work!)

scan

on-the-fly




Emp E


E.Name

Plan 2

range search in SSN index

on-the-fly

Cost = 25. SSN<5000 selects half of the employees, so 50/2=25 leaf nodes.

Note: if condition is E.SSN>5000, needs 1 more I/O.




Emp E


E.Name

Plan 3

range search in Age index, follow pointers to SSN index

on-the-fly

Cost = 10/4 + 20/4 * 3.5 = 21.

#I/Os in the Age index #I/Os in the SSN index


Query Optimization Pick Best


physical plan I/O cost

Plan 1: scan 50

Plan 2: range search SSN index 25

Plan 3: range search Age index 21

Pick!


Query Compiling Convert

Algorithm: then then .

E.Name. D.Dname(E.Did=D.Did AND E.SSN<5000 AND D.budget=1000(ED) ) The l.q.p tree:

SELECT E.Name, D.DnameFROM Emp E, Dept DWHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000

Emp E

E.Did=D.Did AND E.SSN<5000 AND D.budget=1000

E.Name, D.Dname

Dept D


Query Compiling Apply Laws

Always always: (try to) replace with ! Also, push down.

SELECT E.Name, D.DnameFROM Emp E, Dept DWHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000

Emp E

E.SSN<5000

E.Name, D.Dname

Dept D

D.budget=1000


Four Join Algorithms

Iteration join (nested loop join) Merge join Hash join Join with index


Example E D over common attribute Did

E: T(E)=10,000 primary index on SSN, 3 levels. |E|= 1,000 leaf nodes.

D: T(D)=5,000 primary index on Did. 3 levels. |D| = 500 leaf nodes.

Memory available = 101 blocks


• Iteration JoinUse our memory(1) Read 100 blocks of D(2) Read all of E (using 1 block) + join(3) Repeat until done

• I/O cost = |D| + |D|/100 * |E| =

500 + 5*1000 = 5,500.


Merge Join Cost

Recall that |E|=1000, |D|=500. And |D| is already sorted on Did.

External sort E: pass 0, by reading and writing E, produces a file with 10 sorted runs. Another read is enough.

No need to write! Can pipeline to join operator.

Cost = 3*1000 + 500 = 3,500.


Simple example hash: even/odd

R1 R2 Buckets2 5 Even 4 4 R1 R23 12 Odd: 5 38 139 8

1114

2 4 8 4 12 8 14

3 5 9 5 3 13 11


Hash Join Cost

Read + write both E and D for partitioning, then read to join.

Cost = 3 * (1000 + 500) = 4,500.


Join with index (Conceptually)

For each r E do

Find the corresponding D tuple by probing index.

Assuming the root is pinned in memory,Cost = |E| + T(E)*2 = 1000 + 10,000*2 = 21,000.





ACID Properties of Transactions

Atomicity: all actions are carried out, or none. Consistency: each transaction preserves the

consistency of the database if executed by itself. (The users make sure of this. E.g. transfer money…)

Isolation: transactions are isolated from the effect of concurrently scheduling other transactions.

Durability: the effect of a committed transaction should last, even if system crash before all changed are flushed to disk.


Scheduling Transactions

Serial schedule: Schedule that does not interleave the actions of different transactions.

Equivalent schedules: For any database state, the effect (on the set of objects in the database) of executing the first schedule is identical to the effect of executing the second schedule.

Serializable schedule: A schedule that is equivalent to some serial execution of the transactions.

(Note: If each transaction preserves consistency, every serializable schedule preserves consistency. )


Example

Conflict serializable serializable. Serializable conflict serializable

T1: R(A) W(A), CommitT2: W(A), CommitT3: W(A), Commit

T1 T2 precedence graph:1. a node for each

transaction2. an arc from Ti to Tj if an

action in Ti precedes andconflicts with an action inTj.

T3


Deadlock Prevention

Assign priorities based on timestamps. Assume Ti wants a lock that Tj holds. Two policies are possible: Wait-Die: It Ti has higher priority, Ti waits

for Tj; otherwise Ti aborts Wound-wait: If Ti has higher priority, Tj

aborts; otherwise Ti waits If a transaction re-starts, make sure it

has its original timestamp


Multiple-Granularity Locks

Why? If a transaction needs to scan all records in a table, do we really want to have a lock on all tuples individually? Significant locking overhead!

Put a single lock on the table!

Tuples

Tables

Pages

Database

contains

A lock on a nodeimplicitly locksall decendents.


Allowed Sharings

-- IS IX

--

IS

IX

S X

S

X


Tree Locking Algorithm

Search: Start at root and go down; repeatedly, S lock child then unlock parent.

Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe: If child is safe, release all locks on ancestors.

Safe node: Node such that changes will not propagate up beyond this node. Inserts: Node is not full. Deletes: Node is not half-empty.


Example

ROOT

A

B

C

D E

F

G H I

20

35

20*

38 44

22* 23* 24* 35* 36* 38* 41* 44*

Do:1) Search 38*2) Delete 38*3) Insert 45*4) Insert 25*

23


Kung-Robinson Model

Xacts have three phases: READ: Xacts read from the database,

but make changes to private copies of objects.

VALIDATE: Check for conflicts. WRITE: Make local copies of changes

public.ROOT

old

new

modifiedobjects


Timestamp CC

Idea: Give each object a read-timestamp (RTS) and a write-timestamp (WTS), give each Xact a timestamp (TS) when it begins: If action ai of Xact Ti conflicts with

action aj of Xact Tj, and TS(Ti) < TS(Tj), then ai must occur before aj. Otherwise, restart violating Xact.





The Big Picture: What’s Stored Where

DB

Data pageseachwith apageLSN

Xact TablelastLSNstatus

Dirty Page TablerecLSN

flushedLSN

RAM

prevLSNXIDtype

lengthpageID

offsetbefore-imageafter-image

LogRecords

LOG

master record


WAL & the Log

Each log record has a unique Log Sequence Number (LSN). LSNs always increasing.

Each data page contains a pageLSN. The LSN of the most recent log record

for an update to that page. System keeps track of flushedLSN.

The max LSN flushed so far. WAL: Before a page is written,

pageLSN flushedLSN

LSNs

DB

pageLSNs

RAM

flushedLSN

pageLSN

Log recordsflushed to disk

“ Log tail” in RAM


Normal Execution of an Xact

Series of reads & writes, followed by commit or abort. We will assume that write is atomic on disk.

• In practice, additional details to deal with non-atomic writes.

Strict 2PL. STEAL, NO-FORCE buffer management, with

Write-Ahead Logging.


Simple Transaction Abort

For now, consider an explicit abort of a Xact. No crash involved.

We want to “play back” the log in reverse order, UNDOing updates. Get lastLSN of Xact from Xact table. Can follow chain of log records backward via

the prevLSN field. Before starting UNDO, write an Abort log record.

• For recovering from crash during UNDO!


Transaction Commit

Write commit record to log. All log records up to Xact’s lastLSN are

flushed. Guarantees that flushedLSN lastLSN. Note that log flushes are sequential,

synchronous writes to disk. Many log records per log page.

Commit() returns. Write end record to log.


Crash Recovery: Big Picture

Start from a checkpoint (found via master record).

Three phases. Need to:– Figure out which Xacts

committed since checkpoint, which failed (Analysis).

– REDO all actions. (repeat history)

– UNDO effects of failed Xacts.

Oldest log rec. of Xact active at crash

Smallest recLSN in dirty page table after Analysis

Last chkpt

CRASH

A R U

Documents

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 CSG131 Summary Donghui Zhang