159
CS 245 Notes 4 1 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

Embed Size (px)

Citation preview

Page 1: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 1

CS 245: Database System Principles

Notes 4: Indexing

Hector Garcia-Molina

Page 2: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 2

Indexing & Hashing

value

Chapter 4

value

record

Page 3: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 3

Topics

• Conventional indexes• B-trees• Hashing schemes

Page 4: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 4

Sequential File

2010

4030

6050

8070

10090

Page 5: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 5

Sequential File

2010

4030

6050

8070

10090

Dense Index

10203040

50607080

90100110120

Page 6: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 6

Sequential File

2010

4030

6050

8070

10090

Sparse Index

10305070

90110130150

170190210230

Page 7: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 7

Sequential File

2010

4030

6050

8070

10090

Sparse 2nd level

10305070

90110130150

170190210230

1090

170250

330410490570

Page 8: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 8

• Comment:{FILE,INDEX} may be contiguous

or not (blocks chained)

Page 9: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 9

Question:

• Can we build a dense, 2nd level index for a dense index?

Page 10: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 10

Notes on pointers:

(1) Block pointer (sparse index) can be smaller than record pointer

BP

RP

Page 11: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 11

(2) If file is contiguous, then we can omitpointers (i.e., compute them)

Notes on pointers:

Page 12: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 12

K1

K3

K4

K2

R1

R2

R3

R4

Page 13: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 13

K1

K3

K4

K2

R1

R2

R3

R4

say:1024 Bper block

• if we want K3 block: get it at offset (3-1)1024 = 2048 bytes

Page 14: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 14

Sparse vs. Dense Tradeoff

• Sparse: Less index space per record can keep more of

index in memory• Dense: Can tell if any record exists

without accessing file

(Later: – sparse better for insertions– dense needed for secondary indexes)

Page 15: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 15

Terms

• Index sequential file• Search key ( primary key)• Primary index (on Sequencing field)• Secondary index• Dense index (all Search Key values in)• Sparse index• Multi-level index

Page 16: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 16

Next:

• Duplicate keys

• Deletion/Insertion

• Secondary indexes

Page 17: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 17

Duplicate keys

1010

2010

3020

3030

4540

Page 18: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 18

1010

2010

3020

3030

4540

10101020

20303030

1010

2010

3020

3030

4540

10101020

20303030

Dense index, one way to implement?

Duplicate keys

Page 19: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 19

1010

2010

3020

3030

4540

10203040

Dense index, better way?

Duplicate keys

Page 20: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 20

1010

2010

3020

3030

4540

10102030

Sparse index, one way?

Duplicate keys

Page 21: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 21

1010

2010

3020

3030

4540

10102030

Sparse index, one way?

Duplicate keys

care

ful if lookin

gfo

r 2

0 o

r 3

0!

Page 22: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 22

1010

2010

3020

3030

4540

10203030

Sparse index, another way?

Duplicate keys

– place first new key from block

Page 23: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 23

1010

2010

3020

3030

4540

10203030

Sparse index, another way?

Duplicate keys

– place first new key from block

shouldthis be40?

Page 24: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 24

Duplicate values, primary index

• Index may point to first instance ofeach value only

File Index

Summary

aaa

b

Page 25: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 25

Deletion from sparse index

2010

4030

6050

8070

10305070

90110130150

Page 26: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 26

Deletion from sparse index

2010

4030

6050

8070

10305070

90110130150

– delete record 40

Page 27: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 27

Deletion from sparse index

2010

4030

6050

8070

10305070

90110130150

– delete record 40

Page 28: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 28

Deletion from sparse index

2010

4030

6050

8070

10305070

90110130150

– delete record 30

Page 29: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 29

Deletion from sparse index

2010

4030

6050

8070

10305070

90110130150

– delete record 30

4040

Page 30: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 30

Deletion from sparse index

2010

4030

6050

8070

10305070

90110130150

– delete records 30 & 40

Page 31: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 31

Deletion from sparse index

2010

4030

6050

8070

10305070

90110130150

– delete records 30 & 40

Page 32: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 32

Deletion from sparse index

2010

4030

6050

8070

10305070

90110130150

– delete records 30 & 40

5070

Page 33: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 33

Deletion from dense index

2010

4030

6050

8070

10203040

50607080

Page 34: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 34

Deletion from dense index

2010

4030

6050

8070

10203040

50607080

– delete record 30

Page 35: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 35

Deletion from dense index

2010

4030

6050

8070

10203040

50607080

– delete record 30

40

Page 36: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 36

Deletion from dense index

2010

4030

6050

8070

10203040

50607080

– delete record 30

4040

Page 37: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 37

Insertion, sparse index case

2010

30

5040

60

10304060

Page 38: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 38

Insertion, sparse index case

2010

30

5040

60

10304060

– insert record 34

Page 39: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 39

Insertion, sparse index case

2010

30

5040

60

10304060

– insert record 34

34

• our lucky day! we have free space where we need it!

Page 40: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 40

Insertion, sparse index case

2010

30

5040

60

10304060

– insert record 15

Page 41: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 41

Insertion, sparse index case

2010

30

5040

60

10304060

– insert record 15

15

2030

20

Page 42: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 42

Insertion, sparse index case

2010

30

5040

60

10304060

– insert record 15

15

2030

20

• Illustrated: Immediate reorganization• Variation:

– insert new block (chained file)– update index

Page 43: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 43

Insertion, sparse index case

2010

30

5040

60

10304060

– insert record 25

Page 44: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 44

Insertion, sparse index case

2010

30

5040

60

10304060

– insert record 25

25

overflow blocks(reorganize later...)

Page 45: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 45

Insertion, dense index case

• Similar

• Often more expensive . . .

Page 46: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 46

Secondary indexesSequencefield

5030

7020

4080

10100

6090

Page 47: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 47

Secondary indexesSequencefield

5030

7020

4080

10100

6090

• Sparse index

302080

100

90...

Page 48: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 48

Secondary indexesSequencefield

5030

7020

4080

10100

6090

• Sparse index

302080

100

90...

does not make sense!

Page 49: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 49

Secondary indexesSequencefield

5030

7020

4080

10100

6090

• Dense index

Page 50: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 50

Secondary indexesSequencefield

5030

7020

4080

10100

6090

• Dense index10203040

506070...

Page 51: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 51

Secondary indexesSequencefield

5030

7020

4080

10100

6090

• Dense index10203040

506070...

105090...

sparsehighlevel

Page 52: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 52

With secondary indexes:

• Lowest level is dense• Other levels are sparse

Also: Pointers are record pointers

(not block pointers; not computed)

Page 53: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 53

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

Page 54: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 54

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10101020

20304040

4040...

one option...

Page 55: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 55

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10101020

20304040

4040...

one option...

Problem:excess overhead!

• disk space• search time

Page 56: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 56

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10

another option...

4030

20

Page 57: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 57

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10

another option...

4030

20Problem:variable sizerecords inindex!

Page 58: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 58

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10203040

5060...

Another idea (suggested in class):Chain records with same key?

Page 59: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 59

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10203040

5060...

Another idea (suggested in class):Chain records with same key?

Problems:• Need to add fields to records• Need to follow chain to know records

Page 60: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 60

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10203040

5060...

buckets

Page 61: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 61

Why “bucket” idea is useful

Indexes RecordsName: primary EMP

(name,dept,floor,...)

Dept: secondaryFloor: secondary

Page 62: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 62

Query: Get employees in

(Toy Dept) ^ (2nd floor)

Dept. index EMP Floor index

Toy 2nd

Page 63: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 63

Query: Get employees in

(Toy Dept) ^ (2nd floor)

Dept. index EMP Floor index

Toy 2nd

Intersect toy bucket and 2nd Floor

bucket to get set of matching EMP’s

Page 64: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 64

This idea used in text information retrieval

Documents

...the cat is fat ...

...was raining cats and dogs...

...Fido the dog ...

Page 65: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 65

This idea used in text information retrieval

Documents

...the cat is fat ...

...was raining cats and dogs...

...Fido the dog ...

Inverted lists

cat

dog

Page 66: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 66

IR QUERIES

• Find articles with “cat” and “dog”• Find articles with “cat” or “dog”• Find articles with “cat” and not “dog”

Page 67: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 67

IR QUERIES

• Find articles with “cat” and “dog”• Find articles with “cat” or “dog”• Find articles with “cat” and not “dog”

• Find articles with “cat” in title• Find articles with “cat” and “dog”

within 5 words

Page 68: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 68

Common technique: more info in inverted

list

cat Title 5

Title 100

Author 10

Abstract 57

Title 12

d3d2

d1

dog

typeposit

ion

locatio

n

Page 69: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 69

Posting: an entry in inverted list.Represents occurrence

ofterm in article

Size of a list: 1 Rare words or (in postings) miss-spellings

106 Common words

Size of a posting: 10-15 bits (compressed)

Page 70: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 70

IR DISCUSSION

• Stop words• Truncation• Thesaurus• Full text vs. Abstracts• Vector model

Page 71: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 71

Vector space model

w1 w2 w3 w4 w5 w6 w7 …

DOC = <1 0 0 1 1 0 0 …>

Query= <0 0 1 1 0 0 0 …>

Page 72: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 72

Vector space model

w1 w2 w3 w4 w5 w6 w7 …

DOC = <1 0 0 1 1 0 0 …>

Query= <0 0 1 1 0 0 0 …>

PRODUCT = 1 + ……. = score

Page 73: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 73

• Tricks to weigh scores + normalize

e.g.: Match on common word not asuseful as match on rare

words...

Page 74: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 74

• How to process V.S. Queries?

w1 w2 w3 w4 w5 w6 …Q = < 0 0 0 1 1 0 … >

Page 75: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 75

• Try Stanford Libraries• Try Google, Yahoo, ...

Page 76: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 76

Summary so far

• Conventional index– Basic Ideas: sparse, dense, multi-

level…– Duplicate Keys– Deletion/Insertion– Secondary indexes

– Buckets of Postings List

Page 77: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 77

Conventional indexes

Advantage:- Simple- Index is sequential file

good for scans

Disadvantage:- Inserts expensive,

and/or- Lose sequentiality &

balance

Page 78: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 78

Example Index (sequential)

continuous

free space

102030

405060

708090

Page 79: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 79

Example Index (sequential)

continuous

free space

102030

405060

708090

39313536

323834

33

overflow area(not sequential)

Page 80: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 80

Outline:

• Conventional indexes• B-Trees NEXT• Hashing schemes

Page 81: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 81

• NEXT: Another type of index– Give up on sequentiality of index– Try to get “balance”

Page 82: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 82

Root

B+Tree Example n=3

100

120

150

180

30

3 5 11

30

35

100

101

110

120

130

150

156

179

180

200

Page 83: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 83

Sample non-leaf

to keys to keys to keys to keys

< 57 57 k<81 81k<95 95

57

81

95

Page 84: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 84

Sample leaf node:

From non-leaf node

to next leafin

sequence5

7

81

95

To r

eco

rd

wit

h k

ey 5

7

To r

eco

rd

wit

h k

ey 8

1

To r

eco

rd

wit

h k

ey 8

5

Page 85: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 85

In textbook’s notationn=3

Leaf:

Non-leaf:

30

35

30

30 35

30

Page 86: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 86

Size of nodes: n+1 pointersn keys

(fixed)

Page 87: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 87

Don’t want nodes to be too empty

• Use at least

Non-leaf: (n+1)/2pointers

Leaf: (n+1)/2 pointers to data

Page 88: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 88

Full nodemin. node

Non-leaf

Leaf

n=3

12

01

50

18

0

30

3 5 11

30

35

counts

even if

null

Page 89: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 89

B+tree rules tree of order n

(1) All leaves at same lowest level(balanced tree)

(2) Pointers in leaves point to records except for “sequence pointer”

Page 90: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 90

(3) Number of pointers/keys for B+tree

Non-leaf(non-root) n+1 n (n+1)/2 (n+1)/2- 1

Leaf(non-root) n+1 n

Root n+1 n 1 1

Max Max Min Min ptrs keys ptrsdata keys

(n+1)/2 (n+1)/2

Page 91: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 91

Insert into B+tree

(a) simple case– space available in leaf

(b) leaf overflow(c) non-leaf overflow(d) new root

Page 92: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 92

(a) Insert key = 32 n=33 5 11

30

31

30

100

Page 93: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 93

(a) Insert key = 32 n=33 5 11

30

31

30

100

32

Page 94: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 94

(a) Insert key = 7 n=3

3 5 11

30

31

30

100

Page 95: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 95

(a) Insert key = 7 n=3

3 5 11

30

31

30

100

3 5

7

Page 96: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 96

(a) Insert key = 7 n=3

3 5 11

30

31

30

100

3 5

7

7

Page 97: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 97

(c) Insert key = 160 n=3

10

0

120

150

180

150

156

179

180

200

Page 98: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 98

(c) Insert key = 160 n=3

10

0

120

150

180

150

156

179

180

200

160

179

Page 99: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 99

(c) Insert key = 160 n=3

10

0

120

150

180

150

156

179

180

200

18

0

160

179

Page 100: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 100

(c) Insert key = 160 n=3

10

0

120

150

180

150

156

179

180

200

160

18

0

160

179

Page 101: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 101

(d) New root, insert 45 n=3

10

20

30

1 2 3 10

12

20

25

30

32

40

Page 102: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 102

(d) New root, insert 45 n=3

10

20

30

1 2 3 10

12

20

25

30

32

40

40

45

Page 103: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 103

(d) New root, insert 45 n=3

10

20

30

1 2 3 10

12

20

25

30

32

40

40

45

40

Page 104: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 104

(d) New root, insert 45 n=3

10

20

30

1 2 3 10

12

20

25

30

32

40

40

45

40

30new root

Page 105: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 105

(a) Simple case - no example

(b) Coalesce with neighbor (sibling)

(c) Re-distribute keys(d) Cases (b) or (c) at non-leaf

Deletion from B+tree

Page 106: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 106

(b) Coalesce with sibling– Delete 50

10

40

100

10

20

30

40

50

n=4

Page 107: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 107

(b) Coalesce with sibling– Delete 50

10

40

100

10

20

30

40

50

n=4

40

Page 108: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 108

(c) Redistribute keys– Delete 50

10

40

100

10

20

30

35

40

50

n=4

Page 109: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 109

(c) Redistribute keys– Delete 50

10

40

100

10

20

30

35

40

50

n=4

35

35

Page 110: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 110

40

45

30

37

25

26

20

22

10

141 3

10

20

30

40

(d) Non-leaf coalese– Delete 37

n=4

25

Page 111: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 111

40

45

30

37

25

26

20

22

10

141 3

10

20

30

40

(d) Non-leaf coalese– Delete 37

n=4

30

25

Page 112: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 112

40

45

30

37

25

26

20

22

10

141 3

10

20

30

40

(d) Non-leaf coalese– Delete 37

n=4

40

30

25

Page 113: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 113

40

45

30

37

25

26

20

22

10

141 3

10

20

30

40

(d) Non-leaf coalese– Delete 37

n=4

40

30

25

25

new root

Page 114: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 114

B+tree deletions in practice

– Often, coalescing is not implemented– Too hard and not worth it!

Page 115: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 115

Comparison: B-trees vs. static indexed sequential

fileRef #1: Held & Stonebraker

“B-Trees Re-examined”CACM, Feb. 1978

Page 116: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 116

Ref # 1 claims:- Concurrency control harder in B-Trees

- B-tree consumes more space

For their comparison:block = 512 byteskey = pointer = 4 bytes4 data records per block

Page 117: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 117

Example: 1 block static index

127 keys

(127+1)4 = 512 Bytes-> pointers in index implicit! up to 127

blocks

k1

k2

k3

k1

k2

k3

1 datablock

Page 118: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 118

Example: 1 block B-tree

63 keys

63x(4+4)+8 = 512 Bytes-> pointers needed in B-tree up to 63

blocks because index is blocksnot contiguous

k1

k2

...

k63

k1

k2

k3

1 datablock

next

-

Page 119: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 119

Size comparison Ref. #1Size comparison Ref. #1

Static Index B-tree# data # datablocks height blocks

height

2 -> 127 2 2 -> 63 2128 -> 16,129 3 64 -> 3968 316,130 -> 2,048,383 4 3969 -> 250,047 4

250,048 -> 15,752,961 5

Page 120: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 120

Ref. #1 analysis claims

• For an 8,000 block file,after 32,000 inserts

after 16,000 lookups Static index saves enough accesses

to allow for reorganization

Page 121: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 121

Ref. #1 analysis claims

• For an 8,000 block file,after 32,000 inserts

after 16,000 lookups Static index saves enough accesses

to allow for reorganization

Ref. #1 conclusion Static index better!!

Page 122: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 122

Ref #2: M. Stonebraker, “Retrospective on a

database system,” TODS, June 1980

Ref. #2 conclusion B-trees better!!

Page 123: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 123

• DBA does not know when to reorganize• DBA does not know how full to load

pages of new index

Ref. #2 conclusion B-trees better!!

Page 124: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 124

• Buffering– B-tree: has fixed buffer requirements– Static index: must read several

overflow blocks to be efficient (large & variable size buffers needed for this)

Ref. #2 conclusion B-trees better!!

Page 125: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 125

• Speaking of buffering… Is LRU a good policy for B+tree

buffers?

Page 126: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 126

• Speaking of buffering… Is LRU a good policy for B+tree

buffers? Of course not! Should try to keep root in memory

at all times(and perhaps some nodes from second level)

Page 127: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 127

Interesting problem:

For B+tree, how large should n be?

n is number of keys / node

Page 128: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 128

Sample assumptions:(1) Time to read node from disk is

(S+Tn) msec.

Page 129: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 129

Sample assumptions:(1) Time to read node from disk is

(S+Tn) msec.(2) Once block in memory, use

binary search to locate key:(a + b LOG2

n) msec.For some constants a,b; Assume a <<

S

Page 130: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 130

Sample assumptions:(1) Time to read node from disk is

(S+Tn) msec.(2) Once block in memory, use

binary search to locate key:(a + b LOG2

n) msec.For some constants a,b; Assume a <<

S(3) Assume B+tree is full, i.e.,

# nodes to examine is LOGn N where N = # records

Page 131: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 131

Can get: f(n) = time to find a record

f(n)

nopt n

Page 132: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 132

FIND nopt by f’(n) = 0

Answer is nopt = “few hundred”

(see homework for details)

Page 133: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 133

FIND nopt by f’(n) = 0

Answer is nopt = “few hundred”

(see homework for details)

What happens to nopt as

• Disk gets faster?• CPU get faster?

Page 134: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

Exercise

• f(n)= lognN*[S+T*n+a+b*log2(n)]

CS 245 Notes 4 134

S= 14000

T= 0.2

b= 0.002

a= 0

N= 10000000

Page 135: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

N=10 million records

CS 245 Notes 4 135

S= 14000T= 0.2b= 0.002a= 0

N=1000000

0

times in microseconds

Page 136: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

N=100 million records

CS 245 Notes 4 136

S= 14000T= 0.2b= 0.002a= 0

N=1000000

0

times in microseconds

Page 137: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 137

Variation on B+tree: B-tree (no +)• Idea:

– Avoid duplicate keys– Have record pointers in non-leaf nodes

Page 138: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 138

to record to record to record with K1 with K2 with K3

to keys to keys to keys to keys < K1 K1<x<K2 K2<x<k3 >k3

K1 P1 K2 P2 K3 P3

Page 139: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 139

B-tree example n=2

65

125

145

165

85

105

25

45

10

20

30

40

110

120

90

100

70

80

170

180

50

60

130

140

150

160

Page 140: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 140

B-tree example n=2

65

125

145

165

85

105

25

45

10

20

30

40

110

120

90

100

70

80

170

180

50

60

130

140

150

160

• sequence pointers not useful now! (but keep space for simplicity)

Page 141: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 141

Note on inserts

• Say we insert record with key = 25

10

20

30 n=3

leaf

Page 142: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 142

Note on inserts

• Say we insert record with key = 25

10

20

30 n=3

leaf

10

– 20 –

25

30

• Afterwards:

Page 143: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 143

So, for B-trees:

MAX MINTree Rec Keys Tree Rec KeysPtrs Ptrs Ptrs Ptrs

Non-leafnon-root n+1 n n (n+1)/2 (n+1)/2-1

(n+1)/2-1Leafnon-root 1 n n 1 n/2 n/2

Rootnon-leafn+1 n n 2 1 1

RootLeaf 1 n n 1 1 1

Page 144: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 144

Tradeoffs:

B-trees have faster lookup than B+trees

in B-tree, non-leaf & leaf different sizes in B-tree, deletion more complicated

Page 145: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 145

Tradeoffs:

B-trees have faster lookup than B+trees

in B-tree, non-leaf & leaf different sizes in B-tree, deletion more complicated

B+trees preferred!

Page 146: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 146

But note:

• If blocks are fixed size (due to disk and buffering restrictions)

Then lookup for B+tree is actually better!!

Page 147: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 147

Example:- Pointers 4 bytes- Keys 4 bytes- Blocks 100 bytes (just example)- Look at full 2 level tree

Page 148: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 148

Root has 8 keys + 8 record pointers+ 9 son pointers

= 8x4 + 8x4 + 9x4 = 100 bytes

B-tree:

Page 149: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 149

Root has 8 keys + 8 record pointers+ 9 son pointers

= 8x4 + 8x4 + 9x4 = 100 bytes

B-tree:

Each of 9 sons: 12 rec. pointers (+12 keys)

= 12x(4+4) + 4 = 100 bytes

Page 150: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 150

Root has 8 keys + 8 record pointers+ 9 son pointers

= 8x4 + 8x4 + 9x4 = 100 bytes

B-tree:

Each of 9 sons: 12 rec. pointers (+12 keys)

= 12x(4+4) + 4 = 100 bytes2-level B-tree, Max # records =

12x9 + 8 = 116

Page 151: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 151

Root has 12 keys + 13 son pointers= 12x4 + 13x4 = 100 bytes

B+tree:

Page 152: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 152

Root has 12 keys + 13 son pointers= 12x4 + 13x4 = 100 bytes

B+tree:

Each of 13 sons: 12 rec. ptrs (+12 keys)

= 12x(4 +4) + 4 = 100 bytes

Page 153: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 153

Root has 12 keys + 13 son pointers= 12x4 + 13x4 = 100 bytes

B+tree:

Each of 13 sons: 12 rec. ptrs (+12 keys)

= 12x(4 +4) + 4 = 100 bytes2-level B+tree, Max # records

= 13x12 = 156

Page 154: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 154

So...

ooooooooooooo ooooooooo 156 records 108 records

Total = 116

B+ B

8 records

Page 155: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 155

So...

ooooooooooooo ooooooooo 156 records 108 records

Total = 116

B+ B

8 records

• Conclusion:– For fixed block size,– B+ tree is better because it is bushier

Page 156: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 156

An Interesting Problem...

• What is a good index structure when:– records tend to be inserted with keys

that are larger than existing values?(e.g., banking records with growing data/time)

– we want to remove older data

Page 157: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 157

One Solution: Multiple Indexes

• Example: I1, I2

day days indexed days indexed I1 I210 1,2,3,4,5 6,7,8,9,1011 11,2,3,4,5 6,7,8,9,1012 11,12,3,4,5 6,7,8,9,1013 11,12,13,4,5 6,7,8,9,10

•advantage: deletions/insertions from smaller index•disadvantage: query multiple indexes

Page 158: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 158

Another Solution (Wave Indexes)

day I1 I2 I3 I410 1,2,3 4,5,6 7,8,9 1011 1,2,3 4,5,6 7,8,9 10,1112 1,2,3 4,5,6 7,8,9 10,11, 1213 13 4,5,6 7,8,9 10,11, 1214 13,14 4,5,6 7,8,9 10,11, 1215 13,14,15 4,5,6 7,8,9 10,11, 1216 13,14,15 16 7,8,9 10,11, 12

•advantage: no deletions•disadvantage: approximate windows

Page 159: CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245 Notes 4 159

Outline/summary

• Conventional Indexes• Sparse vs. dense• Primary vs. secondary

• B trees• B+trees vs. B-trees• B+trees vs. indexed sequential

• Hashing schemes --> Next