Minimal Probing: Supporting Expensive Predicates for Top-k Queries
Kevin C. Chang, Seung-won Hwang
Univ. of Illinois at Urbana-Champaign


Context: Top-k Queries

Ranked queries return top-k results, unlike Boolean queries.

Crucial for retrieving data by “soft” conditions:
– relevance: e.g., text search engines
– similarity: e.g., multimedia databases
– preference: e.g., e-commerce product search

Example scenario: preference query for finding a house:
– select h.id from house h
  where new(age), cheap(price, size), large(size)
  order by min(new,cheap,large) stop after 5

Query annotations: new, cheap, and large are predicates; min is the scoring function; k = 5 is the retrieval size.

Observation: Crucial to support expensive predicates

Problem: Expensive Predicates

Expensive predicates
– no pre-computed indexes for zero-time sorted access
– need a probe to evaluate each object (similar to sequential scan)

Unified abstraction for:
– user-defined functions: functional extensibility
  query conditions can be arbitrary, user-specific, e.g., cheap(price,size)
– external predicates: data extensibility
  source interface may require one probe per object, e.g., safe(zip) accesses the crime rate from apbnews.com
– fuzzy joins
  associations of relations can be arbitrary, e.g., close(house.zip, park.zip)

Existing algorithms require sorted access, as provided by search predicates.

To “simulate” sorted access for expensive predicates, they require complete probing
– are these probes necessary?

Goal: Minimize probe cost

Current Limitations: “Sort-Merge” Framework

F = min(new,cheap,large), k = 1

Sort step (one sorted list per predicate):
– new (search predicate): a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
– cheap (expensive predicate): d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
– large (expensive predicate): b:0.90, d:0.90, e:0.80, a:0.75, c:0.20

Merge step: merge algorithm over the three sorted lists

Top-k output: b:0.78
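To make the cost of complete probing concrete, the sketch below counts the probes spent when every object is probed on both expensive predicates before ranking. The scores are those of the five-object example above; the function name `complete_probing_top_k` and the dictionary layout are illustrative assumptions, not part of the original framework.

```python
import heapq

# Scores from the example: `new` supports sorted access, while
# `cheap` and `large` are expensive and cost one probe per object.
NEW = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
CHEAP = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}
LARGE = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}


def complete_probing_top_k(k):
    """Probe every object on every expensive predicate, then rank by min."""
    probes = 0
    scores = {}
    for obj, new_score in NEW.items():
        cheap_score = CHEAP[obj]  # one probe
        large_score = LARGE[obj]  # one probe
        probes += 2
        scores[obj] = min(new_score, cheap_score, large_score)
    top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return top, probes


top, probes = complete_probing_top_k(1)
print(top, probes)  # [('b', 0.78)] 10
```

With five objects and two expensive predicates, complete probing always spends 10 probes, whatever the value of k.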

Motivation: Solution Space

Assume sequential probing.

Algorithm skeleton:
do:
  schedule next obj o, pred p
  probe pr(o,p)
until (top-k identified)

Figure: grid of objects (a, b, c) against predicates (p1, p2, p3); each cell is a candidate probe.

Our framework: Separate, Global Predicate Scheduling

Two important decisions on the framework:

Separate predicate scheduling
– scheduling as a separate “optimization” phase before probing
– avoids run-time scheduling overhead

Global predicate scheduling
– scheduling based on global information (predicate selectivities)
– there is no per-object information to justify per-object scheduling
– avoids per-object scheduling overhead

Simple framework and algorithm
– and efficient!
– essentially allows an A* framework for a given predicate schedule
– enables formal analysis: optimality, scalability

Simple Framework

Algorithm skeleton:
find global schedule H
do:
  schedule next obj o
  probe pr(o, next(o,H))
until (top-k identified)

Figure: grid of objects (a, b, c) against predicates, probed in the order H=(p1,p2,p3).

Challenges for Minimizing Probing

Predicate scheduling before probing
– how to identify the best H?

Object scheduling during probing
– how to find the next object to probe, for achieving “minimal probing” with respect to H?

Algorithm skeleton:
find global schedule H        ?
do:
  schedule next obj o         ?
  probe pr(o, next(o,H))
until (top-k identified)

Challenge 1: Object Scheduling

Goal: Perform only necessary probes

Necessary probes:
– A probe is necessary if the top-k answers cannot be determined by any algorithm without it, regardless of the outcomes of other probes.

Question 1: Given a probe pr(o, next(o,H)), how to determine if it is necessary?

Probe-optimal algorithm
– An algorithm is probe-optimal if it performs only the necessary probes.

Question 2: How to identify necessary probes in order to design such an algorithm?

k=1, F=min(x,p1,p2); suppose H=(p1,p2)

Question 1: Is this Probe Necessary?

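Is the probe on b necessary?

OID   x     p1   p2   F=min(x,p1,p2)
a     0.9   1    1    0.9   (top 1)
b     0.8   ?    ?    at most 0.8
c     0.7   1    1    0.7
d     0.6   1    1    0.6
e     0.5   1    1    0.5

Maybe not! If a ends up scoring 0.9, a is the top 1 and b (at most 0.8) never needs to be probed.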

k=1, F=min(x,p1,p2); suppose H=(p1,p2)

Theorem: Probe pr(o,p) is absolutely necessary, if o is among the current top-k in terms of ceiling scores.

Question 1: Is this Probe Necessary?

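Is the probe on a necessary?

OID   x     p1   p2   F=min(x,p1,p2)
a     0.9   ?    ?    at most 0.9   (top 1?)
b     0.8   1    1    0.8
c     0.7   1    1    0.7
d     0.6   1    1    0.6
e     0.5   1    1    0.5

Necessary! Even if every other object scores 1 on p1 and p2, a's ceiling score of 0.9 still exceeds b's 0.8, so the top 1 cannot be decided without probing a.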
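The condition in the theorem can be checked with a few lines of code. The following is a minimal sketch, assuming ceiling scores are kept in a dictionary; the function name `in_current_top_k` and the tie-breaking by object id are assumptions made for illustration.

```python
import heapq


def in_current_top_k(obj, ceilings, k):
    """Return True if `obj` ranks among the k highest ceiling scores.

    Ties are broken by object id here; that tie-breaker is an assumption.
    """
    top = heapq.nlargest(k, ceilings.items(), key=lambda item: (item[1], item[0]))
    return obj in {name for name, _ in top}


# Ceiling scores before any probe: F = min(x, p1, p2) is at most x.
ceilings = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
print(in_current_top_k("a", ceilings, k=1))  # True: the next probe on a is necessary
print(in_current_top_k("b", ceilings, k=1))  # False: the probe on b may be avoidable
```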

Question 2: Probe-optimal object scheduling

Objects in current top-k must be further probed

Probe-optimal object scheduling: Algorithm MPro
– use a priority queue with ceiling scores as priorities

Trace (k=1, H=(p1,p2)); each step shows the queue, then the probe performed on its head:
1. a:0.9, b:0.8, c:0.7, d:0.6, e:0.5; pr(a,p1) = 0.85
2. a:0.85, b:0.8, c:0.7, d:0.6, e:0.5; pr(a,p2) = 0.75
3. b:0.8, a:0.75, c:0.7, d:0.6, e:0.5; pr(b,p1) = 0.78
4. b:0.78, a:0.75, c:0.7, d:0.6, e:0.5; pr(b,p2) = 0.90
5. b:0.78, a:0.75, c:0.7, d:0.6, e:0.5; b is fully probed, so top 1 = b:0.78
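The queue-based scheduling traced above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes F = min over all predicates, sequential probing, and a `probe` callback; the function name `mpro`, its signature, and the score tables are hypothetical.

```python
import heapq


def mpro(x_scores, probe, schedule, k):
    """Return (top-k list, number of probes) for F = min(x, p1, ..., pn).

    x_scores: dict of object -> score on the search predicate.
    probe:    callable (obj, pred) -> score; each call counts as one probe.
    schedule: global predicate schedule H, e.g. ("p1", "p2").
    """
    # Max-heap via negated ceiling; entry = (-ceiling, obj, predicates done).
    queue = [(-score, obj, 0) for obj, score in x_scores.items()]
    heapq.heapify(queue)
    results, probes = [], 0
    while queue and len(results) < k:
        neg_ceiling, obj, done = heapq.heappop(queue)
        if done == len(schedule):
            # Fully probed and still on top: the ceiling is the final score.
            results.append((obj, -neg_ceiling))
            continue
        value = probe(obj, schedule[done])
        probes += 1
        new_ceiling = min(-neg_ceiling, value)
        heapq.heappush(queue, (-new_ceiling, obj, done + 1))
    return results, probes


X = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
P = {
    "p1": {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70},
    "p2": {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80},
}
top, probes = mpro(X, lambda obj, pred: P[pred][obj], ("p1", "p2"), k=1)
print(top, probes)  # [('b', 0.78)] 4
```

On this data the sketch performs the same four probes as the trace (a on p1 and p2, then b on p1 and p2) and never touches c, d, or e.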

Challenge 2: Predicate Scheduling

Scheduling problem
– find the minimal-cost schedule among all permutations of the predicates

Challenges
– selectivity estimation:
  dynamic predicates
  aggregate selectivities (context-dependent)
– scheduling computation: NP-hard

Our approach:
– on-line sampling to estimate selectivities
– greedy selection to schedule predicates

0.1% sampling achieves almost the best schedule
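One way to picture sampling plus greedy selection is the heuristic below. It is only a sketch under stated assumptions, not the procedure from the paper: it assumes F = min, a sample whose scores have already been probed, a threshold `theta` taken as the k-th best final score within the sample, and a greedy rule that picks the predicate filtering the most surviving objects per unit cost. The name `greedy_schedule` and all parameters are hypothetical.

```python
import heapq


def greedy_schedule(sample, predicates, cost, k_sample):
    """Greedily order predicates using scores observed on a sample.

    sample:     dict of object -> dict with key "x" plus one key per predicate.
    predicates: iterable of expensive predicate names.
    cost:       dict of predicate -> per-probe cost.
    k_sample:   retrieval size scaled down to the sample.
    """
    final = {o: min(s.values()) for o, s in sample.items()}
    # Threshold: the k_sample-th best final score within the sample.
    theta = heapq.nlargest(k_sample, final.values())[-1]
    # Objects whose ceiling still reaches theta need further probing.
    survivors = {o for o, s in sample.items() if s["x"] >= theta}
    remaining, schedule = list(predicates), []
    while remaining:
        def benefit(p):
            # Survivors that p would push below the threshold, per unit cost.
            filtered = sum(1 for o in survivors if sample[o][p] < theta)
            return filtered / cost[p]
        best = max(remaining, key=benefit)
        schedule.append(best)
        remaining.remove(best)
        survivors = {o for o in survivors if sample[o][best] >= theta}
    return schedule


sample = {
    "a": {"x": 0.90, "p1": 0.85, "p2": 0.75},
    "b": {"x": 0.80, "p1": 0.78, "p2": 0.90},
    "c": {"x": 0.70, "p1": 0.75, "p2": 0.20},
    "d": {"x": 0.60, "p1": 0.90, "p2": 0.90},
    "e": {"x": 0.50, "p1": 0.70, "p2": 0.80},
}
schedule = greedy_schedule(sample, ["p1", "p2"], {"p1": 1.0, "p2": 1.0}, k_sample=1)
print(schedule)  # ['p2', 'p1']
```

The survivor set shrinks after each choice, so the selectivity of a predicate is measured in the context of the predicates already scheduled, which is one reading of the context-dependent aggregate selectivities listed among the challenges.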

Experiment Results

Practical performance of MPro
– cost proportional to the retrieval size k
– significant speedup for small k

Impact of performance factors
– database size: sublinear cost scalability
– score distribution and scoring function: see paper

Chart annotations: 6 hours vs. 2 min

Demo : House Search

Data: All houses on sale in Illinois (N=20990)
– from www.realtor.com
– objects: house(id, price, size, bed, bath, zip, city)

Query: F = Average(n, c, r)
– n (nearcity): close to Chicago
– c (cheap): “reasonable” price for its size
– r (roomy): prefer 4-6 rooms

Summary of Contributions (more in the paper)

Abstraction:
– for user-defined, external, and fuzzy join predicates

Framework and algorithm:
– sampling-based global scheduling
– probe-optimal algorithm MPro
– extensions of MPro: fuzzy joins, parallel MPro, approximation

Principles/Theorems:
– necessary-probe principle
– probe-optimality of MPro
– analytical scalability of MPro

Extensive experiments

Thank You!

Parallel MPro: Overview

Probe-parallel MPro
– Perform k necessary probes concurrently
– Up to k-fold speedup

Data-parallel MPro
– Partition data into s chunks
– Up to s-fold speedup

Figure: one MPro instance per chunk; their results feed a Merge step that outputs the top-k.
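The data-parallel variant can be illustrated as follows. This is a sketch under assumptions: `local_top_k` is a brute-force stand-in for running MPro on one chunk, the scores are illustrative final scores, and a thread pool is used only to show the structure of the partition and merge steps. The merge is correct because every global top-k object is also among the top-k of its own chunk.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def local_top_k(chunk, k):
    """Stand-in for running MPro on one chunk: returns its k best (obj, score)."""
    return heapq.nlargest(k, chunk, key=lambda item: item[1])


def data_parallel_top_k(scored_objects, k, s):
    """Split the data into s chunks, rank each chunk, then merge the results."""
    chunks = [scored_objects[i::s] for i in range(s)]
    with ThreadPoolExecutor(max_workers=s) as pool:
        partial = pool.map(lambda chunk: local_top_k(chunk, k), chunks)
        candidates = [item for part in partial for item in part]
    # Merge step: the global top-k is drawn from the local top-k lists.
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


data = [("a", 0.75), ("b", 0.78), ("c", 0.20), ("d", 0.60), ("e", 0.50)]
print(data_parallel_top_k(data, k=2, s=3))  # [('b', 0.78), ('a', 0.75)]
```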

Scalability

Figure: scalability plots; panel labels: (k=100, N=1000), (k=1000, N=10000), (k=10000, N=100000); N=1000, N=10000, N=100000.

Comparison

Figure: time (sec) versus k, with k at 1, 10, 100 and time ticks from 0.01 to 100; legend: probe, scheduling.
