43
I/O-Algorithms Lars Arge University of Aarhus March 1, 2005

I/O-Algorithms Lars Arge University of Aarhus March 1, 2005

  • View
    216

  • Download
    2

Embed Size (px)

Citation preview

I/O-Algorithms

Lars Arge

University of Aarhus

March 1, 2005

Lars Arge

I/O-algorithms

2

I/O-Model

• Parameters

N = # elements in problem instance

B = # elements that fits in disk block

M = # elements that fits in main memory

T = # output size in searching problem

• I/O: Movement of block between memory and disk

D

P

M

Block I/O

Lars Arge

I/O-algorithms

3

Fundamental Bounds [AV88]

Internal External

• Scanning: N

• Sorting: N log N

• Permuting

– List rank

• Searching: NBlog

BN

BN

BMlog

BN

log,minBN

BN

BMNN

N2log

Lars Arge

I/O-algorithms

4

Data Structures until now

• One-dimensional searching

– B-trees: Node degree (B) queries in

Rebalancing using split/fuse updates in

– Weight-balanced B-tress: Weight rather than degree constraint

Ω(w(v)) updates below v between rebalancing operations

– Persistent B-trees: N O(N/B) space

Update in current version in

Search in all previous versions in

)(log BT

B NO )(log NO B

)(log BT

B NO )(log NO B

Lars Arge

I/O-algorithms

5

Data structures until now

• Last week: “dimension 1.5” searching:

– Interval stabbing: External interval tree

O(N/B) space, query, update

• We utilized a number of tools/techniques

– B-trees, weight-balanced B-trees, persistent B-trees

– Logarithmic method

– Global rebuilding

– Construction using buffer technique

x

)(log NO B)(log BT

B NO

Lars Arge

I/O-algorithms

6

Interval Management• External interval tree:

– Fan-out weight-balanced B-tree on endpoints

– Intervals stored in O(B) secondary structure in each internal node

– Query efficiency using filtering

– Bootstrapping used to avoid O(B) search cost in each node

* Size O(B2) underflow structure in each node

* Constructed using sweep and persistent B-tree

* Dynamic using global rebuilding

$m$ blocksv)( B

v

)( B

Lars Arge

I/O-algorithms

7

3-Sided Range Searching• Interval management corresponds to simple form of 2d range search

• More general problem: Dynamic 3-sidede range searching

– Maintain set of points in plane such

that given query (q1, q2, q3), all points

(x,y) with q1 x q2 and y q3 can

be found efficiently

(x,x)

(x1,x2)

x

x1 x2

q3

q2q1

Lars Arge

I/O-algorithms

8

3-Sided Range Searching : Static Solution• Construction: Sweep top-down inserting x in persistent B-tree at (x,y)

– O(N/B) space

– I/O construction using buffer technique

• Query (q1, q2, q3): Perform range query with [q1,q2] in B-tree at q3

– I/Os

• Dynamic?

q3

q2q1

)(log BT

B NO

)log( NO BBN

Lars Arge

I/O-algorithms

9

• Base tree on x-coordinates with nodes augmented with points

• Heap on y-coordinates

– Decreasing y values on root-leaf path

– (x,y) on path from root to leaf holding x

– If v holds point then parent(v) holds point

Internal Priority Search Tree9

16.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

Lars Arge

I/O-algorithms

10

• Linear space

• Insert of (x,y) (assuming fixed x-coordinate set):

– Compare y with y-coordinate in root

– Smaller: Recursively insert (x,y) in subtree on path to x

– Bigger: Insert in root and recursively insert old point in subtree

O(log N) update

Internal Priority Search Tree9

16.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

Insert (10,21) 10,21

Lars Arge

I/O-algorithms

11

Internal Priority Search Tree

• Query with (q1, q2, q3) starting at root v:

– Report point in v if satisfying query

– Visit both children of v if point reported

– Always visit child(s) of v on path(s) to q1 and q2

O(log N+T) query

916.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

4

194

Lars Arge

I/O-algorithms

12

• Natural idea: Block tree

• Problem:

– I/Os to follow paths to to q1 and q2

– But O(T) I/Os may be used to visit other nodes (“overshooting”)

query

Externalizing Priority Search Tree9

16.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

)(log NO B

)(log TNO B

Lars Arge

I/O-algorithms

13

Externalizing Priority Search Tree

• Solution idea:

– Store B points in each node * O(B2) points stored in each supernode

* B output points can pay for “overshooting”

– Bootstrapping:

* Store O(B2) points in each supernode in static structure

916.20

1619,9

1313,3

1920,3

45,6

59,4

11,2

201916139544,1

1

Lars Arge

I/O-algorithms

14

External Priority Search Tree• Base tree: Weight-balanced B-tree with branching parameter 1/4B

and leaf parameter B on x-coordinates

• Points in “heap order”:

– Root stores B top points for each of the child slabs

– Remaining points stored recursively

• Points in each node stored in “O(B2)-structure”

– Persistent B-tree structure for static problem

Linear space

)(B

)(B

Lars Arge

I/O-algorithms

15

External Priority Search Tree

• Query with (q1, q2, q3) starting at root v:

– Query O(B2)-structure and report points satisfying query

– Visit child v if

* v on path to q1 or q2

* All points corresponding to v satisfy query

Lars Arge

I/O-algorithms

16

External Priority Search Tree• Analysis:

– I/Os used to visit node v

– nodes on path to q1 or q2

– For each node v not on path to q1 or q2 visited, B points reported in parent(v)

query

)1()(log 2B

TB

TB

vv OBO )(log NO B

)(log BT

B NO

Lars Arge

I/O-algorithms

17

External Priority Search Tree• Insert (x,y) (ignoring insert in base tree - rebalancing):

– Find relevant node v:

* Query O(B2)-structure to find

B points in root corresponding

to node u on path to x

* If y smaller than y-coordinates

of all B points then recursively

search in u

– Insert (x,y) in O(B2)-structure of v

– If O(B2)-structure contains >B points for child u, remove lowest point and insert recursively in u

• Delete: Similarly

u

Lars Arge

I/O-algorithms

18

• Analysis:

– Query visits nodes

– O(B2)-structure queried/updated in each node

* One query

* One insert and one delete

• O(B2)-structure analysis:

– Query:

– Update in O(1) I/Os using update

block and global rebuilding

I/Os

External Priority Search Tree

u

)(log NO B

)1()/(log 2 OBBBO B

)(log NO B

Lars Arge

I/O-algorithms

19

Dynamic Base Tree• Deletion:

– Delete point as previously

– Delete x-coordinate from base

tree using global rebuilding

I/Os amortized

• Insertion:

– Insert x-coordinate in base tree

and rebalance (using splits)

– Insert point as previously

• Split: Boundary in v becomes boundary in parent(v)

)(log NO B

v

v’’v’

Lars Arge

I/O-algorithms

20

Dynamic Base Tree• Split: When v splits B new points needed in parent(v)

• One point obtained from v’ (v’’) using “bubble-up” operation:

– Find top point p in v’

– Insert p in O(B2)-structure

– Remove p from O(B2)-structure of v’

– Recursively bubble-up point to v’

• Bubble-up in I/Os

– Follow one path from v to leaf

– Uses O(1) I/O in each node

Split in I/Os

v’’v’

))((log vwO B

))(())(log( vwOvwBO B

Lars Arge

I/O-algorithms

21

Dynamic Base Tree• O(1) amortized split cost:

– Cost: O(w(v))

– Weight balanced base tree: inserts below v between splits

• External Priority Search Tree

– Space: O(N/B)

– Query:

– Updates: I/Os amortized

• Amortization can be removed from update bound in several ways

– Utilizing lazy rebuilding

))(( vw

)(log NO B

)(log BT

B NO

v’’v’

Lars Arge

I/O-algorithms

22

Two-Dimensional Range Search• We have now discussed structures for special cases of two-

dimensional range searching

– Space: O(N/B)

– Query:

– Updates:

• Cannot be obtained for general 2d range searching:

– query requires space

– space requires query

q3

q2q1q

q

q3

q2q1

q4

)(loglog

logN

NBN

BB

BΩ)(log NO cB

)(BNO )( B

N

)(log NO B

)(log BT

B NO

Lars Arge

I/O-algorithms

23

• Base tree: Weight balanced tree with branching parameter

and leaf parameter B on x-coordinates

height

• Points below each node stored in 4 linear space secondary structures:

– “Right” priority search tree

– “Left” priority search tree

– B-tree on y-coordinates

– Interval tree

space

External Range Tree

)(loglog

logN

N

BB

BO

)(log NB

)(loglog

logN

NBN

BB

BO

)(log NB

Lars Arge

I/O-algorithms

24

• Secondary interval tree structure:

– Connect points in each slab in y-order

– Project obtained segments in y-axis

– Intervals stored in interval tree

* Interval augmented with pointer to corresponding points in y-coordinate B-tree in corresponding child node

External Range Tree

)(log NB

Lars Arge

I/O-algorithms

25

• Query with (q1, q2, q3 , q4) answered in top node with q1 and q2 in different slabs v1 and v2

• Points in slab v1

– Found with 3-sided query in v1

using right priority search tree

• Points in slab v2

– Found with 3-sided query in v2

using left priority search tree

• Points in slabs between v1 and v2

– Answer stabbing query with q3 using interval tree

first point above q3 in each of the slabs

– Find points using y-coordinate B-tree in slabs

External Range Tree

)(log NO B

)(log NO B

)(log NB

v1 v2

Lars Arge

I/O-algorithms

26

External Range Tree• Query analysis:

– I/Os to find relevant node

– I/Os to answer two 3-sided queries

– I/Os to query interval tree

– I/Os to traverse B-trees

I/Os

)(log NO B

)(log BT

B NO )(log)(log log NONO BB

NB

B )(log B

TB NO )(log NO B

)(log BT

B NO )(log NB

v1 v2

Lars Arge

I/O-algorithms

27

External Range Tree• Insert:

– Insert x-coordinate in weight-balanced B-tree

* Split of v can be performed in I/Os

I/Os

– Update secondary structures in all nodes on one root-leaf path

* Update priority search trees

* Update interval tree

* Update B-tree

I/Os

• Delete:

– Similar and using global rebuilding

)(loglog

logN

N

BB

BO)(

logloglog2

NN

BB

BO

))(log)(( vwvwO B

)(loglog

log2

NN

BB

BO

)(log NB

v1 v2

Lars Arge

I/O-algorithms

28

Summary: External Range Tree• 2d range searching in space

– I/O query

– I/O update

• Optimal among query structures

)(loglog

log2

NN

BB

BO

)(log BT

B NO )(

logloglog

NN

BN

BB

BO

)(log BT

B NO

q3

q2q1

q4

Lars Arge

I/O-algorithms

29

kdB-tree

• kd-tree:

– Recursive subdivision of point-set into two half using vertical/horizontal line

– Horizontal line on even levels, vertical on uneven levels

– One point in each leaf

Linear space and logarithmic height

Lars Arge

I/O-algorithms

30

kd-Tree: Query

• Query

– Recursively visit nodes corresponding to regions intersecting query

– Report point in trees/nodes completely contained in query

• Query analysis

– Horizontal line intersect Q(N) = 2+2Q(N/4) = regions

– Query covers T regions I/Os worst-case

)( NO

)( TNO

Lars Arge

I/O-algorithms

31

kdB-tree

• KdB-tree:

– Blocking of kd-tree but with B point in each leaf

• Query as before

– Analysis as before except that each region now contains B points

I/O query)( B

TB

NO

Lars Arge

I/O-algorithms

32

Construction of kdB-tree

• Simple algorithm

– Find median of y-coordinates (construct root)

– Distribute point based on median

– Recursively build subtrees

• Idea in improved algorithm [AAPV01]

– Construct levels at a time using O(N/B) I/Os

– Recursively build subtrees

)log( 2 BN

BNO

)log(BN

BN

BMO

BMlog

Lars Arge

I/O-algorithms

33

Construction of kdB-tree• Sort N points by x- and by y-coordinates using I/Os

• Building levels ( nodes) in O(N/B) I/Os:

1. Construct by grid

with points in each slab

2. Count number of points in each

grid cell and store in memory

3. Find slab s with median x-coordinate

4. Scan slab s to find median x-coordinate and construct node

5. Split slab containing median x-coordinate and update counts

6. Recurse on each side of median x-coordinate using grid (step 3) Grid grows to during algorithm Each node constructed in I/Os

BMlog

)log(BN

BN

BMO

BM

BM

BM

N

BM

)( BM

BM

BM

BM

))/(( BNO BM

Lars Arge

I/O-algorithms

34

kdB-tree

• kdB-tree:

– Linear space

– Query in I/Os

– Construction in I/Os

• Dynamic?

– Deletes in I/Os using global rebuilding

– Insertions?

)( BT

BNO

)log()log( NOO BBN

BN

BN

BM

)(log NO B

Lars Arge

I/O-algorithms

35

Insertions using Logarithmic Method• Partition pointset S into subsets S0, S1, … Slog N, |Si| = 2i or |Si| = 0

• Build kdB-tree Di on Si

• Operations:

– Query: Query each Di

– Insert: Find first empty Di and construct Di out of

elements in S0,S1, … Si-1

– I/Os per moved point

– Point moved O(log N) times

I/Os amortized

– Delete: Delete from relevant Di (ignore in log-method)

..................................

0 2222 1 2 log N

iij

j 221 10

)()(log

02

BT

BN

BTN

i B OO ii

)2log( 2 iBB

iO )2log( 1 i

BBO

)(log)log2log( 21 NONO Bi

BB

)(log NO B

Lars Arge

I/O-algorithms

36

O-Tree Structure• O-tree:

– B-tree on vertical slabs

– B-tree on horizontal slabs in each vertical slab

– kdB-tree on points in each leaf

NBBN log

)log( NBBN

)log()( 2)log( 2 NB BN

NBB

N

NBBN 2log

)log( NBBN

NB B2log

Lars Arge

I/O-algorithms

37

O-Tree Query• Perform rangesearch with q1 and q2 in vertical B-tree

– Query all kdB-trees in leaves of two horizontal B-trees with x-interval intersected but not spanned by query

– Perform rangesearch with q3 and q4 horizontal B-trees with x-interval spanned by query

* Query all kdB-trees with range intersected by query

NBBN log

NBBN 2log

NB B2log

Lars Arge

I/O-algorithms

38

O-Tree Query Analysis• Vertical B-tree query:

• Query of all kdB-trees in leaves of two horizontal B-trees:

• Query horizontal B-trees:

• Query kdB-trees not completely in query

• Query in kdB-trees completely

contained in query:

I/Os

)())log((log BN

BBN

B ONO

)()()log()log( 2BT

BN

BT

BBBN OOBNBONO

)log( NO BBN

)())log((log)log( BN

BBN

BBBN ONONO

)()()log()log(2 2BT

BN

BT

BBBN OOBNBONO

)(BTO

)(BT

BNO

)log(2 NO BBN

Lars Arge

I/O-algorithms

39

O-Tree Update• Insert:

– Search in vertical B-tree: I/Os

– Search in horizontal B-tree: I/Os

– Insert in kdB-tree: I/Os

• Use global rebuilding when structures grow too big/small

– B-trees not contain elements

– kdB-trees not contain elements

I/Os

• Deletes can be handled

in I/Os similarly

)log( NBBN

)log( 2 NB B

)(log NO B

)(log NO B

)(log))log((log 22 NONBO BBB

)(log NO B

)(log NO B

Lars Arge

I/O-algorithms

40

Summary: O-Tree• 2d range searching in linear space

– I/O query

– I/O update

• Optimal among structures

using linear space

• Can be extended to work in d-dimensions

with optimal query bound

q3

q2q1

q4

)(log NO B

)(BT

BNO

))((11

BT

BN dO

Lars Arge

I/O-algorithms

41

Summary: 3 and 4-sided Range Search• 3-sided 2d range searching: External priority search tree

– query, space, update

• General (4-sided) 2d range searching:

– External range tree: query, space,

update

– O-tree: query, space, update

q3

q2q1

q3

q2q1

q4

)(loglog

logN

NBN

BB

BO)(log BT

B NO

)(BNO)( B

TB

NO

)(log NO B)(log BT

B NO

)(log NO B

)(loglog

log2

NN

BB

BO

)(BNO

Lars Arge

I/O-algorithms

42

Techniques (one final time)• Tools:

– B-trees

– Persistent B-trees

– Buffer trees

– Logarithmic method

– Weight-balanced B-trees

– Global rebuilding

• Techniques:

– Bootstrapping

– Filtering

q3

q2q1

q3

q2q1

q4

(x,x)

Lars Arge

I/O-algorithms

43

Other results• Many other results for e.g.

– Higher dimensional range searching

– Range counting

– Halfspace (and other special cases) of range searching

– Structures for moving objects

– Proximity queries

– Structures for objects other than points (bounding rectangles)

• Many heuristic structures in database community

• Implementation efforts:

– LEDA-SM (MPI)

– TPIE (Duke/Aarhus)