70
1 CS143: Index

CS143: Index

  • Upload
    illias

  • View
    94

  • Download
    2

Embed Size (px)

DESCRIPTION

CS143: Index. Book Chapters: (4 th ) 12.1-3, 12.5-8 (5 th ) 12.1-3, 12.6-8, 12.10. Topics to Learn. Important concepts Dense index vs. sparse index Primary index vs. secondary index (= clustering index vs. non-clustering index) Tree-based vs. hash-based index Tree-based index - PowerPoint PPT Presentation

Citation preview

Page 1: CS143: Index

1

CS143: Index

Page 2: CS143: Index

2

Topics to Learn

• Important concepts– Dense index vs. sparse index– Primary index vs. secondary index

(= clustering index vs. non-clustering index)– Tree-based vs. hash-based index

• Tree-based index– Indexed sequential file– B+-tree

• Hash-based index– Static hashing– Extendible hashing

Page 3: CS143: Index

3

Basic Problem

• SELECT *FROM StudentWHERE sid = 40

• How can we answer the query?

sid name GPA

20 Elaine 3.2

70 Peter 2.6

40 Susan 3.7

Page 4: CS143: Index

4

Random-Order File

• How do we find sid=40?

sid name GPA

20 Susan 3.5

60 James 1.7

70 Peter 2.6

40 Elaine 3.9

30 Christy 2.9

Page 5: CS143: Index

5

Sequential File

• Table sequenced by sid. Find sid=40?sid name GPA

20 Susan 3.5

30 James 1.7

40 Peter 2.6

50 Elaine 3.9

60 Christy 2.9

Page 6: CS143: Index

6

Binary Search

• 100,000 records• Q: How many blocks to read?

• Any better way?– In a library, how do we find a book?

Page 7: CS143: Index

7

Basic Idea

• Build an “index” on the table– An auxiliary structure to help us

locate a record given a “key”20

60

10

40

80

40

Page 8: CS143: Index

8

Dense, Primary Index

• Primary index (clustering index)– Index on the search key

• Dense index– (key, pointer) pair for every

record

• Find the key from index and follow pointer– Maybe through binary search

• Q: Why dense index?– Isn’t binary search on the file

the same?

2010

4030

6050

8070

10090

Dense Index

10203040

50607080

90100110120

Sequential File

Page 9: CS143: Index

9

Why Dense Index?

• Example– 10,000,000 records (900-bytes/rec)– 4-byte search key, 4-byte pointer– 4096-byte block. Unspanned tuples

• Q: How many blocks for table (how big)?

• Q: How many blocks for index (how big)?

Page 10: CS143: Index

10

Sparse, Primary Index

• Sparse index– (key, pointer) pair per

every “block”– (key, pointer) pair points

to the first record in the block

• Q: How can we find 60?

Sequential File

2010

4030

6050

8070

10090

Sparse Index

10305070

90110130150

Page 11: CS143: Index

11

Multi-level index

Sequential File

2010

4030

6050

8070

10090

Sparse 2nd level10305070

90110130150

170190210230

1090

170250

330410490570

1st level

Q: Why multi-level index?

Q: Does dense, 2nd level index make sense?

Page 12: CS143: Index

12

Secondary (non-clustering) Index

• Secondary (non-clustering) index– When tuples in the table

are not ordered by the index search key

• Index on a non-search-key for sequential file

• Unordered file

• Q: What index?– Does sparse index make

sense?

Sequencefield

5030

7020

4080

10100

6090

Page 13: CS143: Index

13

Sparse and secondary index?

5030

7020

4080

10100

6090

302080

100

90...

Page 14: CS143: Index

14

Secondary index

5030

7020

4080

10100

6090

10203040

506070...

105090...

sparseHigh level

• First level is always dense• Sparse from the second level

Page 15: CS143: Index

15

Important terms

• Dense index vs. sparse index• Primary index vs. secondary index

– Clustering index vs. non-clustering index

• Multi-level index• Indexed sequential file

– Sometimes called ISAM (indexed sequential access method)

• Search key ( primary key)

Page 16: CS143: Index

16

Insertion

2010

30

5040

60

10304060

Insert 35

Q: Do we need to update higher-level index?

Page 17: CS143: Index

17

Insertion

2010

30

5040

60

10304060

Insert 15 (overflow)

Q: Do we need to update higher-level index?

Page 18: CS143: Index

18

Insertion

2010

30

5040

60

10304060

Insert 15 (redistribute)

Q: Do we need to update higher-level index?

Page 19: CS143: Index

19

Potential performance problemAfter many insertions…

102030

405060

708090

39313536

323834

33

overflow pages(not sequential)

Main index

Page 20: CS143: Index

20

Traditional Index (ISAM)

• Advantage– Simple– Sequential blocks

• Disadvantage– Not suitable for updates– Becomes ugly (loses sequentiality and

balance) over time

Page 21: CS143: Index

21

B+Tree

• Most popular index structure in RDBMS

• Advantage– Suitable for dynamic updates– Balanced– Minimum space usage guarantee

• Disadvantage– Non-sequential index blocks

Page 22: CS143: Index

22

B+Tree Example (n=3)

20 30 50 80 90 70

50 80

70

Leaf

Non leaf

root

20 Susan 2.7

30 James 3.6

50 Peter 1.8

… … …

...

......

Balanced: All leaf nodes are at the same level

Page 23: CS143: Index

23

• n: max # of pointers in a node• All pointers (except the last one) point to tuples • At least half of the pointers are used. (more precisely, (n+1)/2 pointers)

Sample Leaf Node (n=3)

20 30

From a non-leaf node

Last pointer: to the next leaf node

20 Susan 2.7

30 James 3.6

50 Peter 1.8

… … …

points to tuple

Page 24: CS143: Index

24

• Points to the nodes one-level below- No direct pointers to tuples

• At least half of the ptrs used (precisely, n/2)- except root, where at least 2 ptrs used

Sample Non-leaf Node (n=3)

23 56

To keys23 k<56

To keys56 k

To keysk<23

Page 25: CS143: Index

25

• Find a greater key and follow the link on the left (Algorithm: Figure 12.10 on textbook)

• Find 30, 60, 70?

Search on B+tree

20 30 50 80 90 70

50 80

70

Page 26: CS143: Index

26

Nodes are never too empty

• Use at leastNon-leaf: n/2 pointersLeaf: (n+1)/2 pointers

full node min. node

Non-leaf

Leaf

n=45 8 10 5

5 8 10 5 8

Page 27: CS143: Index

27

Non-leaf(non-root) n n-1 n/2 n/2-1

Leaf(non-root) n n-1

Root n n-1 2 1

Max Max Min Min Ptrs keys ptrs keys

(n+1)/2 (n-1)/2

Number of Ptrs/Keys for B+tree

Page 28: CS143: Index

28

(a) simple case (no overflow)(b) leaf overflow(c) non-leaf overflow(d) new root

B+Tree Insertion

Page 29: CS143: Index

29

• Insert 60

Insertion (Simple Case)

20 30 50 80 90 70

50 80

70

Page 30: CS143: Index

30

• Insert 55

• No space to store 55

Insertion (Leaf Overflow)

20 30 50 60 80 90 70

50 80

70

Page 31: CS143: Index

31

• Insert 55

Insertion (Leaf Overflow)

20 30 50 55 80 90 70

50 60 80

70

60

• Q: After split, leaf nodes always half full?

No overflow. Stop

Page 32: CS143: Index

32

Insertion (Non-leaf Overflow)

• Insert 52

20 30 50 55 60

50 60

70

Leaf overflow. Split and copy the first key of the new node

Page 33: CS143: Index

33

50 60

Insertion (Non-leaf Overflow)

• Insert 52

20 30 50 52 60

55 70

55

No overflow. Stop

Q: After split, non-leaf at least half full?

Page 34: CS143: Index

34

Insertion (New Root Node)

• Insert 25

20 30 50 55 60

50 60

Page 35: CS143: Index

35

60 30

50

Insertion (New Root Node)

• Insert 25

20 25 50 55 60 30

• Q: At least 2 ptrs at root?

Page 36: CS143: Index

36

B+Tree Insertion

• Leaf node overflow– The first key of the new node is

copied to the parent

• Non-leaf node overflow– The middle key is moved to the

parent

• Detailed algorithm: Figure 12.13

Page 37: CS143: Index

37

B+Tree Deletion

(a) Simple case (no underflow)(b) Leaf node, coalesce with neighbor (c) Leaf node, redistribute with neighbor(d) Non-leaf node, coalesce with neighbor(e) Non-leaf node, redistribute with neighbor

In the examples, n = 4– Underflow for non-leaf when fewer than n/2 = 2 ptrs– Underflow for leaf when fewer than (n+1)/2 = 3 ptrs– Nodes are labeled as a, b, c, d, …

Page 38: CS143: Index

38

(a) Simple case

• Delete 25

20 25 30 40 50

20 40 60 a

b c d e

Page 39: CS143: Index

39

(b) Coalesce with sibling (leaf)

• Delete 50

20 30 40 50

20 40 60

60b c d e

a

Page 40: CS143: Index

40

(b) Coalesce with sibling (leaf)

• Delete 50– Check underflow at a. Min 2 ptrs, currently 3

20 30 40

20 60

60b c e

a

Page 41: CS143: Index

41

(c) Redistribute (leaf)

• Delete 50

20 25 30 40 50

20 40 60

60b c d e

a

Page 42: CS143: Index

42

(c) Redistribute (leaf)

• Delete 50– No underflow at a. Done.

20 25 30 40

20 40 60

60b c d e

a 30

Page 43: CS143: Index

43

(d) Coalesce (non-leaf)

• Delete 20– Underflow! Merge d with e.

• Move everything in the right to the left

10 20 30 40

50 90

50 60 70

30 70

a

b c

d e f g

Page 44: CS143: Index

44

(d) Coalesce (non-leaf)

• Delete 20– Underflow at a? Min 2 ptrs. Currently 2. Done.

10 30 40

90

50 60 70

50 70

a

b

d f g

Page 45: CS143: Index

45

(e) Redistribute (non-leaf)

• Delete 20– Underflow! Merge d with e.

10 20 30 40

50 99

50 60 70

30 70 90 97

a

b c

d e f g

Page 46: CS143: Index

46

(e) Redistribute (non-leaf)

• Delete 20– Underflow at a? Min 2 ptrs, currently 3. Done

10 30 40

90 99

50 60 70

50 70 97

a

b c

d f g

Page 47: CS143: Index

47

Important Points

• Remember: – For leaf node merging, we delete the mid-key

from the parent– For non-leaf node merging/redistribution, we

pull down the mid-key from their parent.

• Exact algorithm: Figure 12.17• In practice

– Coalescing is often not implemented• Too hard and not worth it

Page 48: CS143: Index

48

Where does n come from?

• n determined by– Size of a node– Size of search key– Size of an index pointer

• Q: 1024B node, 10B key, 8B ptr n?

Page 49: CS143: Index

49

Question on B+tree

• SELECT *FROM StudentWHERE sid > 60?

20 30 50 60 80 90 70

50 80

70

Page 50: CS143: Index

50

Summary on tree index

• Issues to consider– Sparse vs. dense– Primary (clustering) vs. secondary (non-

clustering)

• Indexed sequential file (ISAM)– Simple algorithm. Sequential blocks– Not suitable for dynamic environment

• B+trees– Balanced, minimum space guarantee– Insertion, deletion algorithms

Page 51: CS143: Index

51

Index Creation in SQL

• CREATE INDEX <indexname> ON <table>(<attr>,<attr>,…)

• Example– CREATE INDEX stidx ONStudent(sid)• Creates a B+tree on the attributes• Speeds up lookup on sid

Page 52: CS143: Index

52

Primary (Clustering) Index

• MySQL: – Primary key becomes the clustering index

• DB2: – CREATE INDEX idx ON Student(sid) CLUSTER– Tuples in the table are sequenced by sid

• Oracle: Index-Organized Table (IOT)– CREATE TABLE T (

...) ORGANIZATION INDEX

– B+tree on primary key– Tuples are stored at the leaf nodes of B+tree

• Periodic reorganization may still be necessary to improve range scan performance

Page 53: CS143: Index

53

Next topic

• Hash index– Static hashing– Extendible hashing

Page 54: CS143: Index

54

What is a Hash Table?

• Hash Table– Hash function

• h(k): key integer [0…n]• e.g., h(‘Susan’) = 7

– Array for keys: T[0…n]– Given a key k, store it in

T[h(k)]0

1 Neil

2

3 James

4 Susan

5

h(Susan) = 4h(James) = 3h(Neil) = 1

Page 55: CS143: Index

55

Hashing for DBMS(Static Hashing)

(key, record)

.

.

Disk blocks (buckets)

search key h(key)

0

1

2

3

4

Page 56: CS143: Index

56

Overflow and Chaining

• Insert h(a) = 1h(b) = 2h(c) = 1h(d) = 0h(e) = 1

• Deleteh(b) = 2h(c) = 1

0

1

2

3

a

bc

d

e

Page 57: CS143: Index

57

Major Problem of Static Hashing

• How to cope with growth?– Data tends to grow in size– Overflow blocks unavoidable

102030

405060

708090

39313536

323834

33

overflow blockshash buckets

Page 58: CS143: Index

58

(a) Use i of b bits output by hash function

Extendible Hashing(two ideas)

00110101h(K)

b

use i grows over time

Page 59: CS143: Index

59

(b) Use directory that maintains pointers to hash buckets (indirection)

Extendible Hashing(two ideas)

.

.

.

.

ce

hash bucketdirectory

h(c)

Page 60: CS143: Index

60

Example• h(k) is 4 bits; 2 keys/bucket

Insert 0111 1

1

0001

1001

1100

10

1

i =

i =

i =

0111

Page 61: CS143: Index

61

Example

Insert 1010 (overflow) 1

1

0001

1001

1100

0111

i =

i =

10

1

i =

Increase i of the bucket. Split it.

Page 62: CS143: Index

62

Example

Insert 0000 1

2

0001

1001

1010

0111

2110

0

00

01

10

11

2i =

Split bucket and increase i

Page 63: CS143: Index

63

Insert 0011 (double directory)

2

2

0111

1001

1010

2110

0

00

01

10

11

2i =

20000

0001

Split bucket, increase i, redistribute keys

Page 64: CS143: Index

64

Extendible Hashing: Deletion

• Two optionsa) No merging of bucketsb) Merge buckets and shrink directory

if possible

Page 65: CS143: Index

65

Delete 1010 (merge buckets and shrink directory)

10001

21001

21100

00

01

10

11

2i =a

b

c

1010

Page 66: CS143: Index

66

Bucket Merge Condition

• Bucket merge condition– Bucket i’s are the same– First (i-1) bits of the hash key are the

same

• Directory shrink condition– All bucket i’s are smaller than the

directory i

Page 67: CS143: Index

67

Questions on Extendible Hashing

• Can we provide minimum space guarantee?

Page 68: CS143: Index

68

Space Waste

2

1

40001

40000

0000

4i =

3

Page 69: CS143: Index

69

Hash index summary

• Static hashing– Overflow and chaining

• Extendible hashing– Can handle growing files

• No periodic reorganizations

– Indirection• Up to 2 disk accesses to access a key

– Directory doubles in size• Not too bad if the data is not too large

Page 70: CS143: Index

70

Hashing vs. Tree

• Can an extendible-hash index support?SELECTFROM RWHERE R.A > 5

• Which one is better, B+tree or Extendible hashing?

SELECTFROM RWHERE R.A = 5