Top-k Queries on Temporal Data

Feifei Li¹, Ke Yi², Wangchao Le¹

¹Florida State University, ²Hong Kong University of Science & Technology


Temporal data: data that change over time.

Typical examples:

- stock traces
- objects’ trajectories

Problem Def.

[Figure: a time series plotted with axes Time and Score.]

For efficient storage, indexing, and querying, time series are often represented as piecewise linear functions; each such representation is called a Piecewise Linear Approximation (PLA).

Problem Def.

[Figure: plots with axes Time and Score; caption: a PLA object with 4 line segments.]

Each PLA is called an object.

Ranking queries on temporal data: top-k queries on time instants.

Problem Def. (cont.)

Given a set of PLA objects {oi | i = 1, …, n}, a time instant t, and an integer k, a top-k/t query retrieves the k objects that have the highest scores at time instant t.
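To make the definition concrete, here is a minimal brute-force sketch of a top-k/t query. It is not the indexed method discussed later; the breakpoint-list representation of a PLA and the names `score_at` and `top_k_at` are assumptions made for illustration.

```python
from bisect import bisect_right
import heapq


def score_at(pla, t):
    """Score of one PLA object at time t, by linear interpolation.

    `pla` is a list of (time, score) breakpoints with strictly increasing
    times (an assumed representation). Returns None when t lies outside
    the object's time span.
    """
    times = [p[0] for p in pla]
    if t < times[0] or t > times[-1]:
        return None
    j = bisect_right(times, t) - 1
    if j == len(pla) - 1:  # t is exactly the last breakpoint
        return pla[-1][1]
    (t0, s0), (t1, s1) = pla[j], pla[j + 1]
    return s0 + (s1 - s0) * (t - t0) / (t1 - t0)


def top_k_at(objects, t, k):
    """Brute-force top-k/t: ids of the k objects with the highest score at t."""
    scored = []
    for oid, pla in objects.items():
        s = score_at(pla, t)
        if s is not None:
            scored.append((s, oid))
    # nlargest returns the pairs in descending score order
    return [oid for _, oid in heapq.nlargest(k, scored)]


objects = {
    "a": [(0, 1), (10, 11)],
    "b": [(0, 5), (10, 5)],
    "c": [(0, 10), (10, 0)],
}
print(top_k_at(objects, 2, 2))
print(top_k_at(objects, 8, 2))
```

This scan touches every object, which is the linear cost that an index is meant to avoid.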

Use an R-tree. R-tree revisited:

- Indexes multi-dimensional information
- Linear space
- Branch and bound with a priority queue
- Does NOT have a worst-case query cost guarantee (linear scan in the worst case)

Treat an object as a trajectory:

- Break up each trajectory into segments
- Build the R-tree on the segments

Use a kNN query at time t:

- Add an artificial query point that is high enough (example in next slide)

State of the Art

kNN query at time t using the R-tree:

- Use the minimum snapshot distance (MinSTDist), the distance from q measured along time instant t.

State of the Art (cont.)

Branch & bound with MinSTDist:

- Stop when there are k objects in the priority queue whose MinSTDist is smaller than that of every unseen object.
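The branch-and-bound idea can be sketched without a real R-tree. In the sketch below, each group of segments stands in for one leaf node with a bounding box, the query point is placed at an assumed score `q_score` above every segment, and a priority queue orders nodes and segments by their distance from the query point along time t. The flat node layout and all function names are assumptions for illustration, not the structure used in the talk.

```python
import heapq
from itertools import count


def seg_score(seg, t):
    """Score of segment ((t0, s0), (t1, s1)) at time t, or None if t is outside it."""
    (t0, s0), (t1, s1) = seg
    if not (t0 <= t <= t1):
        return None
    if t1 == t0:
        return max(s0, s1)
    return s0 + (s1 - s0) * (t - t0) / (t1 - t0)


def bounding_box(segs):
    """Axis-aligned box (t_min, t_max, s_min, s_max) around a list of segments."""
    ts = [p[0] for seg in segs for p in seg]
    ss = [p[1] for seg in segs for p in seg]
    return min(ts), max(ts), min(ss), max(ss)


def knn_top_k(groups, t, k, q_score):
    """Best-first search; q_score must be at least as high as every score.

    `groups` is a list of leaf nodes, each a list of (object_id, segment).
    """
    tie = count()  # tie-breaker so the heap never compares payloads
    heap = []
    for group in groups:
        t_min, t_max, _, s_max = bounding_box([seg for _, seg in group])
        if t_min <= t <= t_max:
            # Lower bound on the distance of anything inside this node.
            heapq.heappush(heap, (q_score - s_max, next(tie), "node", group))
    result = []
    while heap and len(result) < k:
        _, _, kind, payload = heapq.heappop(heap)
        if kind == "node":
            for oid, seg in payload:
                s = seg_score(seg, t)
                if s is not None:
                    heapq.heappush(heap, (q_score - s, next(tie), "obj", oid))
        elif payload not in result:
            # A popped object is closer than everything still in the queue.
            result.append(payload)
    return result


groups = [
    [("a", ((0, 1), (10, 11))), ("b", ((0, 5), (10, 5)))],
    [("c", ((0, 10), (10, 0)))],
]
print(knn_top_k(groups, 2, 2, q_score=100))
```

Because a node's bound never exceeds the true distance of the segments inside it, the search can stop as soon as k objects have been popped.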

Efficiency of the R-tree based approach:

- Linear space consumption
- Handles queries on higher-dimensional problems

Deficiency of the R-tree based approach:

- No worst-case performance guarantee (build, query)
- Current commercial DBMSs have limited support for R-trees

State of the Art (cont.)

We propose the seb-tree, the Sampled Envelope B-tree.

Simplicity:

- The B-tree is the only building block, so it is easy to integrate into commercial DBMSs

Optimal query performance:

- Answers a top-k/t query in a logarithmic number of I/Os in expectation

Handles updates:

- 99.5% of updates end up as simple insertions/deletions
- Only 0.5% of updates need to lock and modify a larger portion of the B-tree

Size & construction:

- Occupies near-linear space
- Requires near-linear time to build

Our contribution

Let S be a set of N line segments in the plane.

Build a series of random samples of S:

- Define l+1 independent sampling ratios pi (0≤i≤l)
- Sample S with ratio pi
- This gives a sampled set Si and an unsampled set USi
- In total, l+1 groups of Si and USi

How to decide l and pi?

- l = [formula not preserved], where kmax is the highest possible k
- pi is a geometrically decreasing series: pi = 1/(2^i B), i = 0, 1, …, l, where B is the number of segments that can be held in a disk block
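The sampling step can be sketched as follows, assuming each segment is kept independently with probability pi = 1/(2^i B). The function name `sample_levels` and the `seed` parameter are hypothetical, and l is passed in by the caller because its formula is not given above.

```python
import random


def sample_levels(segments, B, l, seed=0):
    """Return one (p_i, S_i, US_i) triple per level i = 0, ..., l.

    Each level is an independent random sample of `segments`, where a
    segment is kept with probability p_i = 1 / (2**i * B).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    levels = []
    for i in range(l + 1):
        p = 1.0 / (2 ** i * B)
        sampled, unsampled = [], []
        for seg in segments:
            (sampled if rng.random() < p else unsampled).append(seg)
        levels.append((p, sampled, unsampled))
    return levels


levels = sample_levels(list(range(1000)), B=4, l=3)
for p, s_i, us_i in levels:
    print(p, len(s_i), len(us_i))
```

Higher levels use smaller ratios, so their samples are sparser and their unsampled sets larger.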

Seb-tree (rand. sampling)

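For each sample Si, compute its upper envelope envi.

- What is an upper envelope? At each time, it is the highest score attained by any segment of Si (the part of Si visible from above).

The upper envelope can be computed in near-linear time (1989).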
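As an illustration of what the upper envelope is, the sketch below computes it naively: it collects all segment endpoints and pairwise crossing points, then records which segment is highest between consecutive x-values. This is a quadratic-time stand-in, not the near-linear algorithm cited above; the function names and the ((x0, y0), (x1, y1)) segment representation with x0 < x1 are assumptions.

```python
def value_at(seg, x):
    """Height of a non-vertical segment ((x0, y0), (x1, y1)) at x."""
    (x0, y0), (x1, y1) = seg
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def intersection_x(a, b):
    """x where the supporting lines of a and b cross, or None if parallel."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    ma = (ay1 - ay0) / (ax1 - ax0)
    mb = (by1 - by0) / (bx1 - bx0)
    if ma == mb:
        return None
    return (by0 - ay0 + ma * ax0 - mb * bx0) / (ma - mb)


def upper_envelope(segments):
    """Return pieces (x_lo, x_hi, index of the top segment, or None for a gap)."""
    xs = {x for seg in segments for (x, _) in seg}
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            x = intersection_x(a, b)
            # Keep the crossing only if it lies inside both segments' x-spans.
            if x is not None and max(a[0][0], b[0][0]) < x < min(a[1][0], b[1][0]):
                xs.add(x)
    xs = sorted(xs)
    pieces = []
    for lo, hi in zip(xs, xs[1:]):
        mid = (lo + hi) / 2
        live = [i for i, s in enumerate(segments) if s[0][0] <= mid <= s[1][0]]
        top = max(live, key=lambda i: value_at(segments[i], mid)) if live else None
        if pieces and pieces[-1][2] == top:
            pieces[-1] = (pieces[-1][0], hi, top)  # extend the previous piece
        else:
            pieces.append((lo, hi, top))
    return pieces


crossing = [((0, 0), (10, 10)), ((0, 10), (10, 0))]
print(upper_envelope(crossing))
```

The x-values where the top segment changes are the envelope vertices.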

Seb-tree (the upper envelope)

A random sampled set Si

Si and its upper envelope envi

For each vertex on envi:

- shoot up a vertical line
- if it is an endpoint of a segment, also shoot down until it hits another segment or score=0.

This results in the trapezoidal decomposition of Si: D(Si).

Seb-tree (the trapezoidal decomp.)

Si and its upper envelope envi

Si and its decomposition

Conflict:

- consider a trapezoid ∆ from some D(Si) and a segment s ∈ USi
- we say s conflicts with ∆ if s intersects ∆

Conflict list:

- for each ∆, find all s ∈ USi that conflict with it (do we need to consider s ∈ Si?)
- collect all such segments into a list, called the conflict list C(∆)
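The conflict test can be sketched under one explicit assumption about the geometry: each trapezoid is taken to be the region over an x-span [x_lo, x_hi] that lies above a bottom edge (an envelope segment, or score=0 where no sampled segment is present) and is unbounded above. The tuple layout and function names below are hypothetical.

```python
def value_at(seg, x):
    """Height of a non-vertical segment ((x0, y0), (x1, y1)) at x."""
    (x0, y0), (x1, y1) = seg
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def conflicts(seg, trap):
    """True if seg rises above the trapezoid's bottom edge within its x-span.

    `trap` is (x_lo, x_hi, bottom); `bottom` is a segment covering
    [x_lo, x_hi], or None to mean the line score = 0 (assumed layout).
    """
    x_lo, x_hi, bottom = trap
    lo = max(x_lo, seg[0][0])
    hi = min(x_hi, seg[1][0])
    if lo > hi:  # no overlap in x
        return False

    def floor(x):
        return 0.0 if bottom is None else value_at(bottom, x)

    # Both curves are linear on [lo, hi], so checking the two ends suffices.
    return value_at(seg, lo) > floor(lo) or value_at(seg, hi) > floor(hi)


def conflict_lists(traps, unsampled):
    """One list of conflicting unsampled segments per trapezoid."""
    return [[s for s in unsampled if conflicts(s, trap)] for trap in traps]


trap = (0, 10, ((0, 5), (10, 5)))
above = ((2, 6), (4, 7))
below = ((2, 1), (4, 2))
crossing = ((0, 0), (10, 10))
print(conflict_lists([trap], [above, below, crossing]))
```

Brute-forcing every trapezoid against every unsampled segment, as done here, is the cost that the hierarchical decomposition is introduced to avoid.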

Seb-tree (the conflict list)

[Figure: a trapezoid ∆ intersected by segments Sa, Sb, Sc, Sd, Se.]

C(∆)= {Sa, Sb, Sc, Sd, Se}

Let ∆1, ∆2, …, ∆t be the trapezoids of D(Si) from left to right:

- sorted by the starting x value of ∆

Build a B-tree Ti on C(∆1), C(∆2), …, C(∆t) in order

Build a B-tree for each level of sampling:

- in total, we have l+1 B-trees
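Because the trapezoids are keyed by their starting x value, finding the one that contains a time t is a predecessor search. The sketch below uses a sorted in-memory list with `bisect` as a stand-in for the on-disk B-tree; the class name `LevelIndex` and its input layout are assumptions.

```python
from bisect import bisect_right


class LevelIndex:
    """In-memory stand-in for one level's B-tree Ti."""

    def __init__(self, traps_with_lists):
        # traps_with_lists: iterable of ((x_lo, x_hi), conflict_list)
        self.items = sorted(traps_with_lists, key=lambda item: item[0][0])
        self.starts = [span[0] for span, _ in self.items]

    def lookup(self, t):
        """Conflict list of the trapezoid whose x-span contains t, or None."""
        j = bisect_right(self.starts, t) - 1  # last trapezoid starting at or before t
        if j < 0:
            return None
        (x_lo, x_hi), conflict_list = self.items[j]
        return conflict_list if x_lo <= t <= x_hi else None


index = LevelIndex([((5, 10), ["c"]), ((0, 5), ["a", "b"])])
print(index.lookup(3))
print(index.lookup(7))
```

A real B-tree performs the same predecessor search in O(log_B N) I/Os and then reads the conflict list stored with that key.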

Seb-tree (the index)

Lemma 1 (1989): E(|C(∆)|)=O(1/p).

By Lemma 1, for a ∆ on level i, E(|C(∆)|)=O(2^i B).

Lemma 2 (1986): There are O(n*α(n)) vertices on the upper envelope of n line segments in the plane, where α(n) is the inverse Ackermann function and can be treated as a constant for all practical input sizes.

- Si has an expected O(1/2^i * N/B * α(N/B)) trapezoids
- B-tree Ti occupies O(N*α(N/B)) blocks.

Size of seb-tree: for all the B-trees together, the size of the seb-tree is [formula not preserved].

Size of seb-tree

Each line segment might intersect multiple trapezoids.

How to build the conflict lists efficiently?

- Hierarchical decomposition
- Conflict lists can be built in near-linear time.

More on seb-tree

Let L0 be the set of segments in Si. We then build a gradation

L0 ⊇ L1 ⊇ … ⊇ Lλ

where Lj is a ½ sampling of Lj-1 and λ=O(log|L0|).

The hierarchical decomposition

[Figure: levels L0, L1, L2 of the gradation.]

For each Lj, we build its trapezoidal decomposition D(Lj)


We further partition D(Lj) with the vertical dividing lines from the higher levels D(Lj+1), …, D(Lλ).


Store all trapezoids in this hierarchy in a tree (HDT).



To determine which C(∆) a line segment belongs to at L0, we search top-down from Lλ, visiting a ∆ if and only if the segment intersects it.
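The top-down search can be sketched as a pruned tree traversal. The node type `Trap`, the function `locate`, and the default x-span overlap test are all hypothetical; a fuller version would plug in the actual segment-trapezoid intersection test in place of `overlaps_span`.

```python
from dataclasses import dataclass, field


@dataclass
class Trap:
    """One trapezoid in the hierarchy; leaves are the trapezoids at L0."""
    name: str
    x_lo: float
    x_hi: float
    children: list = field(default_factory=list)


def overlaps_span(seg, trap):
    """Simplified intersection test: compares x-spans only (an assumption)."""
    return seg[0][0] <= trap.x_hi and trap.x_lo <= seg[1][0]


def locate(seg, node, intersects=overlaps_span):
    """Names of the leaf trapezoids that seg intersects.

    A subtree is entered only if seg intersects its root trapezoid.
    """
    if not intersects(seg, node):
        return []
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(locate(seg, child, intersects))
    return found


root = Trap("r", 0, 8, [
    Trap("a", 0, 4, [Trap("a1", 0, 2), Trap("a2", 2, 4)]),
    Trap("b", 4, 8, [Trap("b1", 4, 6), Trap("b2", 6, 8)]),
])
print(locate(((5, 1), (7, 2)), root))
```

Each leaf returned corresponds to one conflict list that the segment must be added to.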


[Figure: segments seg1 and seg2 located top-down through levels L0, L1, L2; labels a, b, d, e, f, g.]

For a particular level Si, the decomposition has a height of λ=O(log|Si|).

For a segment s, the time spent visiting the HDT is proportional to the size of the HDT, which is [formula not preserved].

- At Lj, a conflict list has an expected size E(|C(∆)|)=O(2^(i+j) B)
- |Lj|=O(N/(2^(i+j) B)), and there are O(|Lj|α(|Lj|)) trapezoids in D(Lj),
- so D(Lj) has an expected size of O(N*α(N/B)*log(N/B))

The total time spent on all l+1 HDTs is [formula not preserved].

Cost of building conflict lists

Query on seb-tree is simple (a single for-loop). Given k and a time instant t, initialize i=0:

1. Use B-tree Ti to do a point search and find the ∆ whose x-span contains t; read its conflict list C(∆).
2. If at least k segments in C(∆) intersect t, return the top-k segments; else if i<l, set i=i+1 and repeat step 1.
3. Otherwise, scan the entire S to find the top-k segments.

- An improvement: instead of letting i=0 at the first step, we can directly start at level i=log(k/B) (because 2^i B needs to be larger than k).
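The control flow of the three steps can be sketched as below. The conflict lists are taken as given, a linear search stands in for the B-tree point search, and the starting level is rounded up with a ceiling; those choices and all names are assumptions for illustration.

```python
import heapq
import math


def value_at(seg, t):
    """Score of a non-vertical segment ((t0, s0), (t1, s1)) at time t."""
    (t0, s0), (t1, s1) = seg
    return s0 + (s1 - s0) * (t - t0) / (t1 - t0)


def covers(seg, t):
    return seg[0][0] <= t <= seg[1][0]


def find_conflict_list(level, t):
    """Stand-in for the B-tree point search on one level."""
    for (x_lo, x_hi), conflict_list in level:
        if x_lo <= t <= x_hi:
            return conflict_list
    return None


def seb_query(levels, all_segments, t, k, B):
    """Top-k segments at time t, following steps 1-3 of the query procedure."""
    # Improvement from the text: skip levels whose lists are expected to be too short.
    start = max(0, math.ceil(math.log2(k / B)))
    for i in range(start, len(levels)):
        conflict_list = find_conflict_list(levels[i], t)       # step 1
        if conflict_list is None:
            continue
        live = [s for s in conflict_list if covers(s, t)]
        if len(live) >= k:                                     # step 2
            return heapq.nlargest(k, live, key=lambda s: value_at(s, t))
    live = [s for s in all_segments if covers(s, t)]           # step 3: full scan
    return heapq.nlargest(k, live, key=lambda s: value_at(s, t))


a, b = ((0, 1), (10, 1)), ((0, 2), (10, 2))
c, d = ((0, 3), (10, 3)), ((0, 4), (10, 4))
levels = [[((0, 10), [d])], [((0, 10), [c, d])]]  # hand-built toy conflict lists
print(seb_query(levels, [a, b, c, d], t=5, k=2, B=1))
```

The answer is only as good as the conflict lists it is handed; the sketch shows the level-by-level fallback, not how those lists are built.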

Query on seb-tree

The query performance guarantee comes from the B-tree.

For any query, the seb-tree index can find the top-k/t segments in an expected O(log_B N + k/B) I/Os.

The probability that the seb-tree needs to trigger a brute-force scan is less than B/N, and scanning the whole data set needs O(N/B) I/Os, so this adds only O(1) to the expected query cost.
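The O(1) term follows from a one-line expectation bound. Written out, with T as an assumed name for the query cost in I/Os and taking the stated O(log_B N + k/B) as the cost of the B-tree lookups:

```latex
\mathbb{E}[T]
  \;\le\; O\!\left(\log_B N + \frac{k}{B}\right)
        + \Pr[\text{scan}] \cdot O\!\left(\frac{N}{B}\right)
  \;<\; O\!\left(\log_B N + \frac{k}{B}\right)
        + \frac{B}{N} \cdot O\!\left(\frac{N}{B}\right)
  \;=\; O\!\left(\log_B N + \frac{k}{B}\right) + O(1).
```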

Query cost

Recall that to build the B-tree at level i, we need to:

- take a 1/(2^i B) sampling of S to get Si
- build a trapezoidal decomposition D(Si)
- store the conflict lists in the level-i B-tree

Given a new segment s:

- if it changes none of D(L0), …, D(Lλ), then simply follow the HDT to find where s belongs
- if it does change one of D(L0), …, D(Lλ), then we need to rebuild a larger portion of the seb-tree

Deletion can be handled similarly.

Updating the seb-tree

Based on Lemma 1: one expects to see O(1/p) conflicting segments for any trapezoid on level Si, where p is the sampling rate 1/(2^i B).

To avoid expensive I/O, we define a threshold λ: when |C(∆)| > λ·O(1/p), simply don’t store it (the query skips it).

In practice, λ=3 or 4.
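The thresholding rule can be sketched as a filter over one level's conflict lists. The constant hidden in O(1/p) is not given, so the sketch assumes it is 1 and uses λ/p as the cutoff; the function name and the dictionary layout are hypothetical.

```python
def prune_conflict_lists(conflict_lists, p, lam=3):
    """Drop conflict lists longer than lam / p.

    `conflict_lists` maps a trapezoid id to its list of segments. A value
    of None in the result means "not stored", so a query that lands on
    that trapezoid moves on to the next level.
    """
    limit = lam / p  # assumes the constant inside O(1/p) is 1
    return {
        trap_id: (segs if len(segs) <= limit else None)
        for trap_id, segs in conflict_lists.items()
    }


pruned = prune_conflict_lists(
    {"d1": [1, 2, 3], "d2": list(range(20))}, p=0.25, lam=3
)
print(pruned)
```

A larger λ stores more lists, which costs space but lets more queries finish at an early level.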

Space-query tradeoff

[Figure: a trapezoid ∆ intersected by segments Sa, Sb, Sc, Sd, Se, with |C(∆)|=O(1/p).]

How will the seb-tree behave when …

1) the number of time series changes
2) the deviation of the time series changes
3) the threshold λ changes
4) Kmax changes

Compare to R-tree

Experiment

Index size & construction time

Experiment (1)

Query cost

Experiment (2)

Effect of Kmax

Experiment (3)

Study ranking queries on temporal data.

Propose the seb-tree:

- Takes near-linear time to construct
- Occupies near-linear space
- Supports dynamic updates efficiently
- Employs the B-tree as its only building block

Conclusion
