BTrees & Sorting 11/3. Announcements I hope you had a great Halloween. Regrade requests were due a few minutes ago…

BTrees & Sorting

11/3

Announcements

• I hope you had a great Halloween.

• Regrade requests were due a few minutes ago…

Indexing

“If you don’t find it in the index, look very carefully through the entire catalog” -- Sears, Roebuck, and Co., Consumers Guide, 1897

Index Motivation

• A file contains some records, say products.

• We want faster access to those records.
  – E.g., give me all products made by Sony.

• Intuition: build a second file that organizes the records “by product” to make this faster.
  – NB: we don’t always have to build a second file.

Indexes

• An index on a file speeds up selections on the search key fields for the index.
  – Search key properties:
    • Any subset of fields
    • Is not the same as the key of a relation

Product(name, maker, price)

On which attributes would you build indexes?

More precisely

• An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.

Product(name, maker, price) Sample queries?

Indexing is one of the most important facilities a database provides for performance.

Operations on an Index

• Search: given a key, find all records.
  – More sophisticated variants as well. Why?

• Insert/Remove entries.
  – Bulk load. Why?

The real difference between structures: the costs of these operations determine which index you pick and why.

Data File with Several Index Files

Data file:

Name  Age  Sal
Bob   12   10
Cal   11   80
Joe   12   20
Sue   13   75

Index data entries (composite keys in dictionary order):

<Age, Sal>: (11, 80), (12, 10), (12, 20), (13, 75)
<Sal, Age>: (10, 12), (20, 12), (75, 13), (80, 11)
<Age>: 11, 12, 12, 13
<Sal>: 10, 20, 75, 80

Equality query: Age = 12 and Sal = 90?

Range query: Age = 5 and Sal > 5?
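To see why dictionary order matters for composite keys, here is a minimal Python sketch. It assumes the <Age, Sal> data entries are kept as a sorted list of tuples; the function names are hypothetical and not from the slides. Both an equality query and an "Age = a and Sal > s" query come out as one contiguous slice of the sorted entries.

```python
from bisect import bisect_left, bisect_right

# Data entries of the <Age, Sal> index, kept in dictionary (tuple) order.
age_sal = sorted([(12, 10), (11, 80), (12, 20), (13, 75)])

def equality(entries, age, sal):
    """Entries with Age = age and Sal = sal: one contiguous slice."""
    lo = bisect_left(entries, (age, sal))
    hi = bisect_right(entries, (age, sal))
    return entries[lo:hi]

def age_eq_sal_gt(entries, age, sal):
    """Entries with Age = age and Sal > sal: also one contiguous slice."""
    lo = bisect_right(entries, (age, sal))
    hi = bisect_right(entries, (age, float("inf")))
    return entries[lo:hi]

print(equality(age_sal, 12, 20))       # the single matching entry
print(equality(age_sal, 12, 90))       # no match: empty list
print(age_eq_sal_gt(age_sal, 12, 10))  # entries with Age = 12 and Sal > 10
```

With a <Sal, Age> index the same "Age = a and Sal > s" query would not map to a single slice, because entries with the same Age are no longer adjacent.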

High-Level View of Index Structures

Outline

• Btrees
  – Very good for range queries, sorted data
  – Some old databases only implemented Btrees

• Hash Tables
  – There are variants of this basic structure to deal with IO

The data structures we present here are “IO aware”

B+ Trees

• Search trees
  – B does not mean binary!

• Idea in B Trees:
  – Make 1 node = 1 physical page
  – Balanced, height-adjusted tree (not the B either)

• Idea in B+ Trees:
  – Make leaves into a linked list (range queries)

Each node has >= d and <= 2d keys (except root)

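B+ Trees Basics

Parameter d = the degree

Internal nodes. Example node with keys 30, 120, 240:
  – 1st subtree: keys k < 30
  – 2nd subtree: keys 30 <= k < 120
  – 3rd subtree: keys 120 <= k < 240
  – 4th subtree: keys 240 <= k

Leaf nodes. Each leaf has >= d and <= 2d keys. Example leaf with keys 40, 50, 60: each key points to its data record (40, 50, 60), and the leaf also points to the next leaf.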

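B+ Tree Example

d = 2

Root: 80
Internal nodes: [20 60] (keys < 80), [100 120 140] (keys >= 80)
Leaves (linked left to right): [10 15 18] → [20 30 40 50] → [60 65] → [80 85 90]
Data records: 10 15 18 20 30 40 50 60 65 80 85 90

1. No data in internal nodes.

2. Links between leaf pages.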

Searching a B+ Tree

• Exact key values:
  – Start at the root
  – Proceed down, to the leaf

• Range queries:
  – As above
  – Then sequential traversal

Exact key value:
Select name
From people
Where age = 25

Range query:
Select name
From people
Where 20 <= age and age <= 30

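B+ Tree Example: search for K = 30

Root: 80
Internal nodes: [20 60], [100 120 140]
Leaves: [10 15 18] → [20 30 40 50] → [60 65] → [80 85 90]

1. At the root: 30 < 80, so go left.
2. At node [20 60]: 30 is in [20, 60), so follow the middle pointer.
3. At the leaf [20 30 40 50]: to the data!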

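B+ Tree Example: range search for K in [30, 85]

Root: 80
Internal nodes: [20 60], [100 120 140]
Leaves: [10 15 18] → [20 30 40 50] → [60 65] → [80 85 90]

1. At the root: 30 < 80, so go left.
2. At node [20 60]: 30 is in [20, 60).
3. At the leaf [20 30 40 50]: to the data! Then use those leaf pointers to scan right until the keys pass 85.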
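The two searches walked through above can be sketched in Python. This is an illustration under assumptions, not the lecture's implementation: the `Leaf` and `Internal` classes and function names are hypothetical, the leaf boundaries are inferred from the separator keys in the example, and the leaves to the right of 90 (not shown on the slide) are left as empty placeholders.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted keys stored in this leaf
        self.next = None      # pointer to the next leaf (linked list)

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # len(children) == len(keys) + 1

def find_leaf(node, key):
    """Start at the root and proceed down to the leaf that may hold key."""
    while isinstance(node, Internal):
        # Child i covers keys[i-1] <= k < keys[i].
        node = node.children[bisect_right(node.keys, key)]
    return node

def contains(root, key):
    return key in find_leaf(root, key).keys

def range_search(root, lo, hi):
    """Find the leaf for lo, then traverse leaf pointers until keys pass hi."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

# Example tree (d = 2). The last three leaves are hypothetical placeholders.
leaves = [Leaf(k) for k in
          ([10, 15, 18], [20, 30, 40, 50], [60, 65], [80, 85, 90], [], [], [])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([80], [Internal([20, 60], leaves[:3]),
                       Internal([100, 120, 140], leaves[3:])])

print(contains(root, 30))
print(range_search(root, 30, 85))
```

The point search touches one node per level; the range search pays that once and then only follows leaf pointers.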

B+ Tree Design

• How large is d?

• Example:
  – Key size = 4 bytes
  – Pointer size = 8 bytes
  – Block size = 4096 bytes

• 2d x 4 + (2d+1) x 8 <= 4096

• d = 170

The observable universe contains ≈ 10^80 atoms. What is the height of a B+ tree that indexes it?

NB: Oracle allows 64k pages

A TiB is 2^40 bytes. What is the height to index it with 64k pages?
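The arithmetic behind these questions can be checked with a short Python sketch. The assumptions are mine, not the slides': every node is completely full (fanout 2d + 1), "height" means the number of levels needed for the fanout to reach the item count, and the TiB case uses one index entry per byte purely for illustration. The function names are hypothetical.

```python
def max_degree(block_size, key_size, pointer_size):
    """Largest d with 2d*key_size + (2d+1)*pointer_size <= block_size."""
    return (block_size - pointer_size) // (2 * (key_size + pointer_size))

def levels_needed(fanout, n_items):
    """Smallest h with fanout**h >= n_items (exact integer arithmetic)."""
    h, reach = 0, 1
    while reach < n_items:
        reach *= fanout
        h += 1
    return h

d_4k = max_degree(4096, 4, 8)        # 4 KiB blocks
d_64k = max_degree(64 * 1024, 4, 8)  # 64 KiB blocks

print(d_4k, levels_needed(2 * d_4k + 1, 10**80))
print(d_64k, levels_needed(2 * d_64k + 1, 2**40))
```

Under these assumptions the height grows only logarithmically in the number of items, which is why even absurdly large inputs need only a few dozen levels.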

B+ Trees in Practice

• Typical order: 100. Typical fill-factor: 67%.
  – Average fanout = 133

• Typical capacities:
  – Height 4: 133^4 = 312,900,721 records
  – Height 3: 133^3 = 2,352,637 records

• Top levels of tree sit in the buffer pool:
  – Level 1 = 1 page = 8 Kbytes
  – Level 2 = 133 pages = 1 Mbyte
  – Level 3 = 17,689 pages = 133 MBytes

Typically, pay for one IO!

Insert!

Insertion in a B+ Tree

Insert (K, P)

• Find leaf where K belongs, insert
• If no overflow (2d keys or less), halt
• If overflow (2d+1 keys), split node, insert in parent:

  Before: keys K1 K2 K3 K4 K5, pointers P0 P1 P2 P3 P4 P5
  After:  left node with keys K1 K2, pointers P0 P1 P2
          right node with keys K4 K5, pointers P3 P4 P5
          (K3, pointer to the right node) goes to the parent

• If leaf, keep K3 too in right node
• When root splits, new root has 1 key only
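The split rule above can be sketched in Python on the key lists alone. This is a simplified illustration with hypothetical function names; pointers and the recursive insert into the parent are left out.

```python
def split_leaf(keys, d):
    """Split an overflowing leaf (2d+1 sorted keys).

    The middle key is copied up to the parent and also kept in the right leaf.
    Returns (left_keys, separator_for_parent, right_keys).
    """
    assert len(keys) == 2 * d + 1
    left, right = keys[:d], keys[d:]
    return left, right[0], right

def split_internal(keys, d):
    """Split an overflowing internal node (2d+1 sorted keys).

    The middle key moves up to the parent and is kept in neither half.
    Returns (left_keys, separator_for_parent, right_keys).
    """
    assert len(keys) == 2 * d + 1
    return keys[:d], keys[d], keys[d + 1:]

print(split_leaf([20, 25, 30, 40, 50], d=2))
print(split_internal(["K1", "K2", "K3", "K4", "K5"], d=2))
```

Both halves end up with at least d keys, so the occupancy invariant holds after the split.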


Insert K=19

Root: 80
Internal nodes: [20 60], [100 120 140]
Leaves: [10 15 18] → [20 30 40 50] → [60 65] → [80 85 90]


After insertion

Root: 80
Internal nodes: [20 60], [100 120 140]
Leaves: [10 15 18 19] → [20 30 40 50] → [60 65] → [80 85 90]


Now insert 25

Root: 80
Internal nodes: [20 60], [100 120 140]
Leaves: [10 15 18 19] → [20 30 40 50] → [60 65] → [80 85 90]


After insertion

Root: 80
Internal nodes: [20 60], [100 120 140]
Leaves: [10 15 18 19] → [20 25 30 40 50] → [60 65] → [80 85 90]


But now have to split!

The leaf [20 25 30 40 50] holds 2d+1 = 5 keys, so it overflows.


After the split

Root: 80
Internal nodes: [20 30 60], [100 120 140]
Leaves: [10 15 18 19] → [20 25] → [30 40 50] → [60 65] → [80 85 90]

Key concepts (exam)

• How to search in a B+tree – which pages are touched

• Understanding the impact of various design decisions.

• Will not ask for the details of the insert algorithm, but do need to know it remains balanced.

Clustered Indexes

Index Classification

An index is clustered if its data entries are ordered in the same way as the underlying data records.

Clustered vs. Unclustered Index

[Figure: two panels, CLUSTERED and UNCLUSTERED. In each, index entries (index file) direct the search to the data entries, which point to data records (data file).]

Clustered (or not) dramatically impacts cost

A Simple Cost Model

Operations on an Index

Search: given a key, find all records.
  – More sophisticated variants as well.

The real difference between structures: the costs of these operations determine which index you pick and why.

Cost Model for Our Analysis

We ignore CPU costs, for simplicity:
  – N: the number of records
  – R: the number of records per page

Measure the number of page I/Os

Goal: good enough to show the overall trends…

Clustered v. Unclustered

Fanout of the tree is F. Range query for M entries (100 per page).

IOs to search for a single item?

Clustered
  – Traversal of the tree: log_F(1.5N)
  – Range search query: log_F(1.5N) + ceil(M/100)

Unclustered
  – Traversal of the tree: log_F(1.5N)
  – Range search query: log_F(1.5N) + M

Which of these IOs are random/sequential?

Plugging in Some numbers

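Clustered: log_F(1.5N) + ceil(M/100) ~ 1 random IO (10 ms), followed by sequential reads

Unclustered: log_F(1.5N) + M random IOs (M x 10 ms)

If M is 1, then there is no difference! If M is 100,000 records, the unclustered index pays about 100,000 x 10 ms = 1,000 s (roughly 17 minutes) of random IO vs. about 10 ms of random IO for the clustered one.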
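The cost model above is simple enough to write down as code. The following Python sketch makes some assumptions that are mine: the tree traversal is rounded up to whole levels, the 10 ms random IO figure is taken from the slide, and the function names are hypothetical.

```python
import math

def tree_levels(n_records, fanout):
    """Smallest h with fanout**h >= 1.5 * N, i.e. ceil(log_F(1.5N))."""
    target = math.ceil(1.5 * n_records)
    levels, reach = 0, 1
    while reach < target:
        reach *= fanout
        levels += 1
    return levels

def range_query_ios(n_records, fanout, m, clustered, entries_per_page=100):
    """Page IOs for a range query returning m entries."""
    traversal = tree_levels(n_records, fanout)
    if clustered:
        # Matching records sit next to each other: ceil(m / entries_per_page) pages.
        return traversal + -(-m // entries_per_page)
    # Each matching record may live on a different page.
    return traversal + m

RANDOM_IO_SECONDS = 0.010  # 10 ms per random IO, as on the slide

n, f, m = 1_000_000, 133, 100_000
print(range_query_ios(n, f, m, clustered=True))
print(range_query_ios(n, f, m, clustered=False))
print(m * RANDOM_IO_SECONDS / 60, "minutes of random IO if unclustered")
```

For M = 1 the two formulas give the same count, which matches the observation that clustering only matters once a query returns many records.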

Takeaway

• B+ trees are a workhorse index.

• You can write down a cost model.– Databases actually do this!

• Clustered v. unclustered is a big deal.

Sorting.

Why Sort?

• Data requested in sorted order – e.g., find students in increasing GPA order

• Sorting is first step in bulk loading B+ tree index.

A classic problem in computer science!

More reasons to sort…

• Sorting useful for eliminating duplicate copies in a collection of records (Why?)

• Sort-merge join algorithm involves sorting.

• Problem: sort 1 TB of data with 1 GB of RAM.
  – Why not use virtual memory?

Do people care?

A sort benchmark bears his name:

http://sortbenchmark.org/

Simplified External Sorts.

Two Ideas behind external sort

• I/O optimized: long sequential disk access

• Observation: Merging sorted files is easy

Sort small sets in memory, then merge.

Phase I: Sort (buffer with 3 pages)

1. Read a page into main memory: 44,10,33,55
2. Sort it! (Quicksort): 10,33,44,55
3. Write it back.

Likewise, 18,8,5,30 becomes 5,8,18,30.

Phase 1, per page: 2 IOs (1 read, 1 write)

End: all pages sorted.

Phase II: Merge

Sorted pages on disk: 10,33,44,55 and 5,8,18,30

1. Read pages: load both pages into main memory (two of the three buffer pages).

2. Merge: repeatedly move the smaller front element into the third (output) page:
   5
   5,8
   5,8,10
   5,8,10,18 (3rd page is filled)

3. Write back: write the output page 5,8,10,18 to disk.
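The merge step can be sketched in Python as follows. This is a simplified in-memory illustration with hypothetical names: each run is a list of sorted pages, one page of each run is consumed at a time, and a full output page is "written back" by appending it to a list.

```python
def merge_two_runs(run_a, run_b, page_size):
    """Merge two sorted runs (lists of pages) into one run of full pages."""
    def stream(run):
        for page in run:      # one input buffer page at a time
            yield from page

    a, b = stream(run_a), stream(run_b)
    x, y = next(a, None), next(b, None)  # assumes values are never None
    out_page, written = [], []
    while x is not None or y is not None:
        if y is None or (x is not None and x <= y):
            out_page.append(x)
            x = next(a, None)
        else:
            out_page.append(y)
            y = next(b, None)
        if len(out_page) == page_size:   # output page is filled: write back
            written.append(out_page)
            out_page = []
    if out_page:                         # flush a partially filled last page
        written.append(out_page)
    return written

print(merge_two_runs([[10, 33, 44, 55]], [[5, 8, 18, 30]], page_size=4))
```

Only three pages are ever needed at once (two inputs, one output), no matter how long the runs are.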

3 Buffer Pages Sort

Keep on Merging!

Merged output: 5,8,10,18 followed by 30,33,44,55

Now, runs of length 2. If our file has 16 pages, what is next?


Two-Way External Merge Sort

Each pass we read + write each page in file.

N pages in the file => the number of passes = ceil(log2 N) + 1

So total cost is: 2N (ceil(log2 N) + 1)

Idea: Divide and conquer: sort subfiles and merge

Input file (7 pages):   3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 → 1-page runs:   3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 → 2-page runs:   2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2
PASS 2 → 4-page runs:   2,3,4,4,6,7,8,9 | 1,2,3,5,6
PASS 3 → 8-page runs:   1,2,2,3,3,4,4,5,6,6,7,8,9
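The passes in this example can be reproduced with a small Python simulation. It is a sketch under assumptions, not a real external sort: everything stays in memory, the input must be non-empty, and IO is only counted, charging one read and one write per page per pass as the cost formula does. The function name is hypothetical.

```python
from heapq import merge

def two_way_external_sort(pages):
    """Simulate two-way external merge sort over a non-empty list of pages.

    Returns (sorted_values, number_of_passes, page_ios).
    """
    n = len(pages)
    runs = [sorted(page) for page in pages]  # pass 0: sort each page on its own
    passes, ios = 1, 2 * n                   # every pass reads + writes N pages
    while len(runs) > 1:
        # Merge runs pairwise; an unpaired last run is carried over as is.
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes += 1
        ios += 2 * n
    return runs[0], passes, ios

pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
values, passes, ios = two_way_external_sort(pages)
print(values)
print(passes, ios)
```

For the 7-page file this takes 4 passes, matching ceil(log2 7) + 1, and 2 x 7 x 4 = 56 page IOs.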

More Buffer Pages Sort

[Figure: main memory holding the pages 18,8,5,30 and 44,10,33,55]

What if we have B+1 buffer pages?

  – 1st pass: runs of length B+1
  – Merge phase: B-way merge
  – Sort IOs: 2N(1 + log_B(N/(B+1)))

Number of Passes of External Sort

N               B=3   B=5   B=9   B=17   B=129   B=257
100               7     4     3      2       1       1
1,000            10     5     4      3       2       2
10,000           13     7     5      4       2       2
100,000          17     9     6      5       3       3
1,000,000        20    10     7      5       3       3
10,000,000       23    12     8      6       4       3
100,000,000      26    14     9      7       4       4
1,000,000,000    30    15    10      8       5       4

Engineer’s rule of thumb: You sort in 3 passes
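The pass counts in the table can be recomputed with a few lines of Python. One caveat, stated as an assumption: the numbers appear to match the convention where B is the total number of buffer pages (so pass 0 makes runs of B pages and each later pass is a (B-1)-way merge), which is off by one from the "B+1 buffer pages" wording used for the formula. The function name is hypothetical.

```python
def external_sort_passes(n_pages, buffers):
    """Passes needed to sort n_pages, assuming `buffers` total buffer pages."""
    runs = -(-n_pages // buffers)         # ceil(N / B) sorted runs after pass 0
    passes = 1
    while runs > 1:
        runs = -(-runs // (buffers - 1))  # one (B-1)-way merge pass
        passes += 1
    return passes

for n in (100, 1_000, 10_000):
    print(n, [external_sort_passes(n, b) for b in (3, 5, 9, 17, 129, 257)])
```

With a few hundred buffer pages, even a billion-page file needs only 4 or 5 passes, which is where the rule of thumb comes from.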

Other Optimizations

Can get twice as long runs
  – Tournament sort (used in Postgres)

Can improve IO performance by using bigger buffers to “prefetch” or “double buffer”
  – Prefetch: hide latency
  – Bigger batch sizes: amortize expensive sequential reads and writes
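One well-known way to get longer initial runs is replacement selection, a heap-based relative of tournament sort. The Python sketch below is a swapped-in illustration of that idea, not Postgres's code; the function name and the `memory` parameter (number of records that fit in memory) are hypothetical. On random input the runs average about twice the memory size.

```python
import heapq

_DONE = object()  # sentinel marking the end of the input

def replacement_selection(items, memory):
    """Split items into sorted runs using a heap of at most `memory` records."""
    assert memory >= 1
    it = iter(items)
    heap = []
    for x in it:                  # fill memory with the first records
        heap.append(x)
        if len(heap) == memory:
            break
    heapq.heapify(heap)

    runs, current, frozen = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)  # output the smallest record to the current run
        nxt = next(it, _DONE)
        if nxt is not _DONE:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # can still join the current run
            else:
                frozen.append(nxt)         # too small: held for the next run
        if not heap:              # current run is finished
            runs.append(current)
            current = []
            heap, frozen = frozen, []
            heapq.heapify(heap)
    return runs

print(replacement_selection([44, 10, 33, 55, 18, 8, 5, 30], memory=4))
```

In this small example the first run holds 5 records even though only 4 fit in memory at once, because a record that arrives after the run has started can still join it if it is not smaller than the last record written.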
