Top-k Queries on Temporal Data
Feifei Li¹, Ke Yi², Wangchao Le¹
¹Florida State University, ²Hong Kong University of Science & Technology

Page 1: Top-k Queries on Temporal Data

Feifei Li¹, Ke Yi², Wangchao Le¹

¹Florida State University, ²Hong Kong University of Science & Technology

Top-k Queries on Temporal Data

Page 2: Top-k Queries on Temporal Data

Temporal data are data that change over time.

Typical examples:
- stock traces
- objects' trajectories

Problem Def.

[Figure: a time series, plotted as score against time.]

Page 3: Top-k Queries on Temporal Data

For efficient storage, indexing, and querying, time series are often represented as piecewise linear functions, each called a Piecewise Linear Approximation (PLA).

Problem Def.

[Figure: a time series and its PLA, both plotted as score against time.]

Each PLA is called an object.

[Figure: a PLA object with 4 line segments.]

Page 4: Top-k Queries on Temporal Data

Ranking queries on temporal data: top-k queries at time instants.

Problem Def. (cont.)

Given a set of PLA objects {o_i | i = 1, …, n}, a time instant t, and k, a top-k/t query retrieves the k objects that have the highest scores at time instant t.
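
The definition above can be made concrete with a brute-force baseline. The following is only a sketch: the class `PLA` and the function `top_k_at` are hypothetical names, each object is assumed to be given as at least two (time, score) breakpoints with distinct times, and every object is scanned, which is exactly the cost an index is meant to avoid.

```python
from bisect import bisect_right
import heapq


class PLA:
    """A piecewise linear object given by (time, score) breakpoints (assumed format)."""

    def __init__(self, name, points):
        self.name = name
        self.points = sorted(points)              # breakpoints ordered by time
        self.times = [t for t, _ in self.points]

    def score(self, t):
        """Score at time t by linear interpolation, or None outside the time span."""
        if t < self.times[0] or t > self.times[-1]:
            return None
        # Index of the right endpoint of the line segment that covers t.
        j = min(bisect_right(self.times, t), len(self.times) - 1)
        (t0, y0), (t1, y1) = self.points[j - 1], self.points[j]
        return y0 + (y1 - y0) * (t - t0) / (t1 - t0)


def top_k_at(objects, t, k):
    """Brute-force top-k/t query: score every object at t, keep the k highest."""
    scored = []
    for o in objects:
        s = o.score(t)
        if s is not None:
            scored.append((s, o.name))
    return [name for _, name in heapq.nlargest(k, scored)]


objects = [
    PLA("o1", [(0, 0), (10, 10)]),
    PLA("o2", [(0, 8), (10, 2)]),
    PLA("o3", [(0, 5), (5, 5), (10, 0)]),
]
print(top_k_at(objects, t=2, k=2))
```

The scan touches all n objects for every query, so it serves only as a correctness reference for the indexed approaches discussed next.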

Page 5: Top-k Queries on Temporal Data

Use an R-tree. R-tree revisited:
- Indexes multi-dimensional information
- Linear space
- Branch and bound with a priority queue
- Does NOT have a worst-case query cost guarantee (linear scan in the worst case)

Treat an object as a trajectory
- Break up each trajectory into its segments
- Build the R-tree on these segments

Use a kNN query at time t
- Add an artificial query point that is high enough (example in next slide)

State of the Art

Page 6: Top-k Queries on Temporal Data

kNN query at time t using the R-tree
- Use the minimum snapshot distance (MinSTDist), the distance from the query point q measured along time instant t

State of the Art (cont.)

Branch & bound with MinSTDist
- Stop when there are k objects in the priority queue whose MinSTDist is smaller than that of every unseen object

Page 7: Top-k Queries on Temporal Data

Efficiency of the R-tree based approach
- Linear space consumption
- Handles queries on higher-dimensional problems

Deficiency of the R-tree based approach
- No worst-case performance guarantee (build, query)
- Current commercial DBMSs have limited support for the R-tree

State of the Art (cont.)

Page 8: Top-k Queries on Temporal Data

We propose the seb-tree, the Sampled Envelope B-tree.

Simplicity
- The B-tree is the only building block, so it is easy to integrate into commercial DBMSs

Optimal query performance
- Answers a top-k/t query in a logarithmic number of I/Os in expectation

Handles updates
- 99.5% of updates end up as simple insertions/deletions
- Only 0.5% of updates need to lock and modify a larger portion of the B-tree

Size & construction
- Occupies near-linear space
- Requires near-linear time to build

Our contribution

Page 9: Top-k Queries on Temporal Data

Let S be a set of N line segments in the plane.

Build a series of random samples of S
- Define l+1 independent sampling ratios p_i (0 ≤ i ≤ l)
- Sample S with ratio p_i
- This gives a sampled set S_i and an unsampled set US_i
- In total, l+1 pairs of S_i and US_i

How to decide l and p_i?
- l is chosen according to kmax, the highest possible k
- p_i is a geometrically decreasing series: p_i = 1/(2^i B), i = 0, 1, …, l, where B is the number of segments that can be held in a disk block

Seb-tree (rand. sampling)
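
The sampling step can be sketched as follows. This is an illustration under assumptions, not the authors' code: `build_samples` is a hypothetical name, the number of levels is passed in directly because the slide's rule for choosing l is not reproduced here, and each segment is kept independently with probability p_i = 1/(2^i B).

```python
import random


def build_samples(segments, B, levels, seed=0):
    """Return a list of (p_i, S_i, US_i) for i = 0 .. levels.

    Each segment is placed in S_i independently with probability
    p_i = 1 / (2**i * B); all other segments go to the unsampled set US_i.
    """
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    result = []
    for i in range(levels + 1):
        p = 1.0 / (2 ** i * B)
        sampled, unsampled = [], []
        for s in segments:
            (sampled if rng.random() < p else unsampled).append(s)
        result.append((p, sampled, unsampled))
    return result


samples = build_samples(list(range(1000)), B=4, levels=3)
for p, s_i, us_i in samples:
    print(p, len(s_i), len(us_i))
```

Every level samples from the full set S, so the l+1 samples are independent of one another, as the slide requires.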

Page 10: Top-k Queries on Temporal Data

For each sample S_i, compute its upper envelope env_i
- What is the upper envelope? At every time t, it is the highest score reached by any segment of S_i.

The upper envelope can be computed in near-linear time (1989).

Seb-tree (the upper envelope)

[Figure: a random sampled set S_i.]

[Figure: S_i and its upper envelope env_i.]
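
To make the notion of an upper envelope concrete, here is a deliberately naive sketch. It is not the near-linear algorithm the slide refers to: it collects every endpoint and pairwise crossing as a candidate breakpoint and checks which segment is on top between consecutive breakpoints. The function names and the segment format `((x1, y1), (x2, y2))` with `x1 < x2` are assumptions made for illustration.

```python
def value_at(seg, x):
    """Height of a non-vertical segment ((x1, y1), (x2, y2)) at abscissa x."""
    (x1, y1), (x2, y2) = seg
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)


def crossing_x(a, b):
    """x where the supporting lines of a and b cross, or None if they are parallel."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    sa = (ay2 - ay1) / (ax2 - ax1)
    sb = (by2 - by1) / (bx2 - bx1)
    if sa == sb:
        return None
    return (by1 - ay1 + sa * ax1 - sb * bx1) / (sa - sb)


def upper_envelope(segments):
    """Return the envelope as pieces (x_start, x_end, index of the top segment)."""
    xs = set()
    for s in segments:
        xs.update((s[0][0], s[1][0]))
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            x = crossing_x(a, b)
            # Keep the crossing only if it lies inside both segments' x-ranges.
            if x is not None and max(a[0][0], b[0][0]) <= x <= min(a[1][0], b[1][0]):
                xs.add(x)
    xs = sorted(xs)
    pieces = []
    for lo, hi in zip(xs, xs[1:]):
        mid = (lo + hi) / 2  # the top segment cannot change strictly inside (lo, hi)
        live = [j for j, s in enumerate(segments) if s[0][0] <= mid <= s[1][0]]
        if not live:
            continue
        top = max(live, key=lambda j: value_at(segments[j], mid))
        if pieces and pieces[-1][2] == top and pieces[-1][1] == lo:
            pieces[-1] = (pieces[-1][0], hi, top)  # extend the previous piece
        else:
            pieces.append((lo, hi, top))
    return pieces


segs = [((0, 0), (4, 4)), ((0, 4), (4, 0))]
print(upper_envelope(segs))
```

The boundaries between consecutive pieces are the envelope vertices used by the decomposition on the next slide.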

Page 11: Top-k Queries on Temporal Data

For each vertex on env_i
- shoot up a vertical line
- if it is an endpoint of a segment, also shoot down until it hits another segment or score = 0

This results in the trapezoidal decomposition of S_i: D(S_i).

Seb-tree (the trapezoidal decomp.)

[Figure: S_i and its upper envelope env_i.]

[Figure: S_i and its decomposition.]

Page 12: Top-k Queries on Temporal Data

Conflict
- consider a trapezoid ∆ from some D(S_i) and a segment s ∈ US_i
- we say s conflicts with ∆ if s intersects ∆

Conflict list
- for each ∆, find all s ∈ US_i that conflict with it (do we need to consider s ∈ S_i?)
- collect all such segments into a list, named the conflict list C(∆)

Seb-tree (the conflict list)

[Figure: a trapezoid ∆ crossed by the segments s_a, s_b, s_c, s_d, s_e.]

C(∆) = {s_a, s_b, s_c, s_d, s_e}
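
A conflict test can be sketched under a simplifying assumption: the trapezoid is taken to be bounded below by one piece of the envelope and unbounded above, which is only one of the trapezoid shapes the decomposition produces. The names `Trapezoid`, `conflicts` and `conflict_list` are hypothetical. Because both the segment and the bottom edge are linear, their difference is linear too, so checking the two ends of the shared x-range is enough.

```python
from typing import NamedTuple


class Trapezoid(NamedTuple):
    """Assumed shape: x-span [x_left, x_right], linear bottom edge, open above."""
    x_left: float
    x_right: float
    y_left: float    # height of the bottom edge at x_left
    y_right: float   # height of the bottom edge at x_right


def line_y(x1, y1, x2, y2, x):
    """Height at x of the line through (x1, y1) and (x2, y2), with x1 != x2."""
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)


def conflicts(seg, trap):
    """True if segment ((x1, y1), (x2, y2)) rises above the trapezoid's bottom edge."""
    (x1, y1), (x2, y2) = seg
    lo, hi = max(x1, trap.x_left), min(x2, trap.x_right)
    if lo > hi:
        return False  # the x-ranges do not overlap
    for x in (lo, hi):
        seg_y = line_y(x1, y1, x2, y2, x)
        floor_y = line_y(trap.x_left, trap.y_left, trap.x_right, trap.y_right, x)
        if seg_y > floor_y:
            return True
    return False


def conflict_list(unsampled, trap):
    """C(trap): all unsampled segments that conflict with the trapezoid."""
    return [s for s in unsampled if conflicts(s, trap)]


trap = Trapezoid(0, 4, 2, 2)
a = ((1, 1), (3, 3))   # climbs above the bottom edge
b = ((0, 0), (4, 1))   # stays below it
c = ((5, 9), (6, 9))   # outside the x-span
print(conflict_list([a, b, c], trap))
```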

Page 13: Top-k Queries on Temporal Data

Let ∆_1, ∆_2, …, ∆_t be the trapezoids of D(S_i) from left to right
- sorted by the starting x value of each ∆

Build a B-tree T_i on C(∆_1), C(∆_2), …, C(∆_t) in this order.

Build a B-tree for each level of sampling
- in total we have l+1 B-trees

Seb-tree (the index)

Page 14: Top-k Queries on Temporal Data

Lemma 1 (1989): E(|C(∆)|) = O(1/p).
By Lemma 1, for a ∆ on level i, E(|C(∆)|) = O(2^i B).

Lemma 2 (1986): There are O(n α(n)) vertices on the upper envelope of n line segments in the plane, where α(n) is the inverse Ackermann function, which can be treated as a constant for all practical input sizes.
- S_i has an expected O((N / (2^i B)) · α(N/B)) trapezoids
- B-tree T_i occupies O(N · α(N/B)) blocks

Summing this over the l+1 B-trees gives the size of the seb-tree.

Size of seb-tree
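
As a worked step, Lemma 1 can be specialized to the sampling ratio used at level i. Only quantities already defined on the slides appear here; the expected sample size follows from sampling each of the N segments independently with probability p_i.

```latex
\begin{align*}
p_i &= \frac{1}{2^i B}, \\
\mathrm{E}\bigl[\,|C(\Delta)|\,\bigr] &= O\!\left(\frac{1}{p_i}\right) = O\!\left(2^i B\right), \\
\mathrm{E}\bigl[\,|S_i|\,\bigr] &= N p_i = \frac{N}{2^i B}.
\end{align*}
```

For instance, with B = 100 and i = 3, 1/p_i = 800, so a conflict list at that level is expected to hold on the order of 800 segments.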

Page 15: Top-k Queries on Temporal Data

Each line segment might intersect multiple trapezoids.

How to build the conflict lists efficiently?
- Hierarchical decomposition
- Conflict lists can be built in near-linear time

More on seb-tree

Page 16: Top-k Queries on Temporal Data

Let L_0 be the set of segments in S_i. We then build a gradation

L_0 ⊇ L_1 ⊇ … ⊇ L_λ

where L_j is a ½ sample of L_(j-1) and λ = O(log|L_0|).

The hierarchical decomposition

[Figure: levels L_0, L_1 and L_2 of the gradation.]
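
The gradation can be sketched in a few lines. This is an illustration only: `build_gradation` is a hypothetical name, and the number of levels is fixed at λ = ⌈log2 |L_0|⌉, which is one assumed way to satisfy the bound λ = O(log|L_0|) stated on the slide.

```python
import math
import random


def build_gradation(L0, seed=0):
    """Return [L_0, L_1, ..., L_lambda], each level a 1/2 sample of the previous one."""
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    lam = max(1, math.ceil(math.log2(len(L0))))  # assumed choice of lambda
    levels = [list(L0)]
    for _ in range(lam):
        # Keep each segment of the previous level independently with probability 1/2.
        levels.append([s for s in levels[-1] if rng.random() < 0.5])
    return levels


levels = build_gradation(list(range(16)))
print([len(level) for level in levels])
```

Each level is drawn from the one below it, so the sets are nested, which is what lets a search descend from the coarsest level to L_0.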

Page 17: Top-k Queries on Temporal Data

For each L_j, we build its trapezoidal decomposition D(L_j).

The hierarchical decomposition

[Figure: levels L_0, L_1 and L_2 of the hierarchical decomposition.]

Page 18: Top-k Queries on Temporal Data

We further partition D(L_j) with the vertical dividing lines from the higher levels D(L_(j+1)), …, D(L_λ).

The hierarchical decomposition

[Figure: levels L_0, L_1 and L_2 of the hierarchical decomposition.]

Page 19: Top-k Queries on Temporal Data

Store all trapezoids in this hierarchy in a tree (HDT).

The hierarchical decomposition

[Figure: levels L_0, L_1 and L_2 of the hierarchical decomposition.]

Page 20: Top-k Queries on Temporal Data

To decide which C(∆) a line segment belongs to at L_0, we search top-down from L_λ, visiting a ∆ if and only if the segment intersects it.

The hierarchical decomposition

[Figure: segments seg1 and seg2 located top-down through levels L_2, L_1 and L_0.]
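
The top-down search can be sketched with a heavily simplified tree. Here each trapezoid is reduced to its x-span and the intersection test to an overlap of x-ranges, which is an assumption made to keep the example short; a real test would also compare heights. The names `Node`, `locate` and the node labels are hypothetical.

```python
class Node:
    """A trapezoid in the HDT, reduced here to its x-span (simplifying assumption)."""

    def __init__(self, name, x_lo, x_hi, children=()):
        self.name = name
        self.x_lo = x_lo
        self.x_hi = x_hi
        self.children = list(children)


def locate(node, seg_lo, seg_hi):
    """Names of the bottom-level trapezoids whose x-span meets [seg_lo, seg_hi].

    A node is visited only if the segment meets it, so whole subtrees are skipped.
    """
    if seg_hi < node.x_lo or seg_lo > node.x_hi:
        return []
    if not node.children:
        return [node.name]
    hits = []
    for child in node.children:
        hits.extend(locate(child, seg_lo, seg_hi))
    return hits


# Three levels: one trapezoid at the top, two in the middle, four at L_0.
leaves = [Node("t0", 0, 2), Node("t1", 2, 4), Node("t2", 4, 6), Node("t3", 6, 8)]
root = Node("root", 0, 8, [Node("m0", 0, 4, leaves[:2]), Node("m1", 4, 8, leaves[2:])])
print(locate(root, 1, 3))
```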

Page 21: Top-k Queries on Temporal Data

For a particular level S_i, the decomposition has a height of λ = O(log|S_i|).

For a segment s, the time spent visiting the HDT is proportional to the size of the HDT.

At L_j, a conflict list has an expected size E(|C(∆)|) = O(2^(i+j) B). Since |L_j| = O(N / (2^(i+j) B)), there are O(|L_j| α(|L_j|)) trapezoids in D(L_j), so D(L_j) has an expected size of O(N · α(N/B) · log(N/B)).

The total time is the sum of this cost over all l+1 HDTs.

Cost on building conflict lists

Page 22: Top-k Queries on Temporal Data

Querying the seb-tree is simple (a single for-loop)
- Given k and a time instant t, initialize i = 0
  1. Using B-tree T_i, do a point search to find the ∆ whose x-span contains t, and read its conflict list C(∆)
  2. If at least k segments in C(∆) intersect time t, return the top-k segments; else if i < l, set i = i+1 and repeat step 1
  3. Otherwise, scan the entire S to find the top-k segments
- An improvement: instead of starting at i = 0, start directly at level i = log(k/B) (because 2^i B needs to be larger than k)

Query on seb-tree
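
The three steps can be sketched in memory as follows. This is a sketch under assumptions rather than the authors' implementation: each level is a sorted list of trapezoid start values plus the trapezoids themselves, a binary search stands in for the B-tree point search, segments are tuples `(name, x1, y1, x2, y2)`, and the levels in the example are hand-built so that each conflict list contains the true top scorers. All function names are hypothetical.

```python
import bisect
import heapq


def score_at(seg, t):
    """Score of a segment (name, x1, y1, x2, y2) at time t."""
    _, x1, y1, x2, y2 = seg
    return y1 + (y2 - y1) * (t - x1) / (x2 - x1)


def live_at(segs, t):
    """Segments whose time span contains t."""
    return [s for s in segs if s[1] <= t <= s[3]]


def top_k(segs, t, k):
    best = heapq.nlargest(k, live_at(segs, t), key=lambda s: score_at(s, t))
    return [s[0] for s in best]


def query(levels, all_segments, t, k, start_level=0):
    """Return (names of the top-k segments at t, level that answered or None)."""
    for i in range(start_level, len(levels)):
        starts, trapezoids = levels[i]
        # Step 1: point search for the trapezoid whose x-span contains t.
        pos = bisect.bisect_right(starts, t) - 1
        if pos < 0:
            continue
        x_lo, x_hi, conflict = trapezoids[pos]
        if t > x_hi:
            continue
        # Step 2: answer from the conflict list if it holds at least k live segments.
        if len(live_at(conflict, t)) >= k:
            return top_k(conflict, t, k), i
    # Step 3: no level could answer, so scan the whole data set.
    return top_k(all_segments, t, k), None


a, b = ("a", 0, 10, 10, 10), ("b", 0, 8, 10, 8)
c, d = ("c", 0, 6, 10, 6), ("d", 0, 1, 10, 1)
S = [a, b, c, d]
levels = [
    ([0], [(0, 10, [a])]),        # level 0: one trapezoid, small conflict list
    ([0], [(0, 10, [a, b, c])]),  # level 1: one trapezoid, larger conflict list
]
print(query(levels, S, t=5, k=2))
```

Passing a larger `start_level` corresponds to the improvement on the slide of skipping levels whose conflict lists are expected to be too small for the requested k.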

Page 23: Top-k Queries on Temporal Data

The query performance guarantee comes from the B-tree.

For any query, the seb-tree index can find the top-k/t segments in expected O(log_B N + k/B) I/Os.

The probability that the seb-tree needs to trigger a brute-force scan is less than B/N, and scanning the whole data set needs O(N/B) I/Os, so the scan adds only O(1) in expectation to the total query cost.

Query cost
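
The contribution of the fallback scan can be written out as a one-line expectation, using only the two quantities stated on the slide (scan probability below B/N, scan cost O(N/B)):

```latex
\begin{align*}
\mathrm{E}[\text{scan cost}]
  &= \Pr[\text{scan}] \cdot O\!\left(\frac{N}{B}\right)
   < \frac{B}{N} \cdot O\!\left(\frac{N}{B}\right) = O(1), \\
\mathrm{E}[\text{query cost}]
  &= O\!\left(\log_B N + \frac{k}{B}\right) + O(1)
   = O\!\left(\log_B N + \frac{k}{B}\right).
\end{align*}
```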

Page 24: Top-k Queries on Temporal Data

Recall that to build the B-tree at level i, we need to
- take a 1/(2^i B) sample of S to get S_i
- build the trapezoidal decomposition D(S_i)
- store the conflict lists in the level-i B-tree

Given a new segment s
- if it changes none of D(L_0), …, D(L_λ), simply follow the HDT to find where s belongs
- if it does change one of D(L_0), …, D(L_λ), we need to rebuild a larger portion of the seb-tree

Deletion can be handled similarly.

Updating the seb-tree

Page 25: Top-k Queries on Temporal Data

Based on Lemma 1: one expects to see O(1/p) conflicting segments for any trapezoid on level S_i, where p is the sampling rate 1/(2^i B).

To avoid expensive I/O, we define a threshold λ: when |C(∆)| > λ · O(1/p), simply don't store the conflict list (and skip it at query time).

In practice, λ = 3 or 4.

Space-query tradeoff

[Figure: a trapezoid ∆ and its conflicting segments s_a, s_b, s_c, s_d, s_e, with |C(∆)| = O(1/p).]
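
The threshold rule amounts to a single comparison. In the sketch below, `should_store` is a hypothetical helper, and the constant hidden in O(1/p) is assumed to be 1, so the cutoff is λ · 2^i · B.

```python
def should_store(conflict_size, i, B, lam=3):
    """Decide whether a level-i conflict list is small enough to store.

    The expected size is taken as 1/p_i = 2**i * B (hidden constant assumed
    to be 1); lists larger than lam times that are not stored.
    """
    expected = 2 ** i * B
    return conflict_size <= lam * expected


print(should_store(300, i=0, B=100))   # at the cutoff: stored
print(should_store(301, i=0, B=100))   # above the cutoff: skipped
```

A query that reaches a skipped trapezoid simply moves on to the next level, trading a little query time for a smaller index.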

Page 26: Top-k Queries on Temporal Data

How does the seb-tree behave when
1) the number of time series changes
2) the deviation of the time series changes
3) the threshold λ changes
4) kmax changes

Comparison with the R-tree

Experiment

Page 27: Top-k Queries on Temporal Data

Index size & construction time

Experiment (1)

Page 28: Top-k Queries on Temporal Data

Query cost

Experiment (2)

Page 29: Top-k Queries on Temporal Data

Effect of Kmax

Experiment (3)

Page 30: Top-k Queries on Temporal Data

- Studied ranking queries on temporal data
- Proposed the seb-tree, which
  - takes near-linear time to construct
  - occupies near-linear space
  - supports dynamic updates efficiently
  - employs the B-tree as its only building block

Conclusion