Minimal Probing: Supporting Expensive Predicates for Top-k Queries
Kevin C. Chang, Seung-won Hwang
Univ. of Illinois at Urbana-Champaign


Context: Top-k Queries

Ranked queries return the top-k results, unlike Boolean queries. They are crucial for retrieving data by “soft” conditions:
– relevance: e.g., text search engines
– similarity: e.g., multimedia databases
– preference: e.g., e-commerce product search

Example scenario: a preference query for finding a house:
    select h.id from house h
    where new(age), cheap(price, size), large(size)
    order by min(new, cheap, large) stop after 5

In this query, new, cheap, and large are the predicates, min is the scoring function, and 5 is the retrieval size k.

Observation: It is crucial to support expensive predicates.


Problem: Expensive Predicates

Expensive predicates
– no pre-computed indexes for zero-time sorted access
– need a probe to evaluate each object (similar to a sequential scan)

Unified abstraction for:
– user-defined functions: functional extensibility
  query conditions can be arbitrary and user-specific, e.g., cheap(price, size)
– external predicates: data extensibility
  the source interface may require one probe per object, e.g., safe(zip) accesses the crime rate from apbnews.com
– fuzzy joins
  associations of relations can be arbitrary, e.g., close(house.zip, park.zip)


Current Limitations: “Sort-Merge” Framework

The framework requires sorted access on search predicates.

To “simulate” sorted access on expensive predicates, it requires complete probing
– are these probes necessary?

Goal: Minimize probe cost

Example: F = min(new, cheap, large), k = 1

Sort step (one sorted list per predicate):
new (search predicate):      a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
cheap (expensive predicate): d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
large (expensive predicate): b:0.90, d:0.90, e:0.80, a:0.75, c:0.20

Merge step: merge algorithm

Top-k output: b:0.78
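As a point of comparison, the complete-probing baseline of this slide can be sketched in Python. This is a minimal illustration and not the authors' code: the dictionaries hold the slide's example scores, and the helper names `probe` and `complete_probing_top_k` are hypothetical.

```python
import heapq

# Example scores from the slide. NEW is the search predicate (sorted access);
# CHEAP and LARGE stand in for expensive predicates that need one probe each.
NEW = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
CHEAP = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}
LARGE = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}

probe_count = 0


def probe(table, oid):
    """Hypothetical probe: look up one expensive score and count the call."""
    global probe_count
    probe_count += 1
    return table[oid]


def complete_probing_top_k(k):
    """Probe every expensive predicate of every object, then rank by min."""
    scores = {
        oid: min(NEW[oid], probe(CHEAP, oid), probe(LARGE, oid))
        for oid in NEW
    }
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


top = complete_probing_top_k(1)
print(top, probe_count)
```

On the slide's data this spends 10 probes (5 objects times 2 expensive predicates) to return b with score 0.78, which is the cost that minimal probing aims to reduce.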


Motivation: Solution Space

Assume sequential probing.

Algorithm skeleton:
    do:
        schedule next obj o, pred p
        probe pr(o,p)
    until (top-k identified)

[Figure: grid of probes, with objects a, b, c as rows and predicates p1, p2, p3 as columns]


Our framework: Separate, Global Predicate Scheduling

Two important decisions on the framework:

Separate predicate scheduling
– scheduling as a separate “optimization” phase before probing
– avoids run-time scheduling overhead

Global predicate scheduling
– scheduling based on global information (predicate selectivities)
– no per-object information is available to justify per-object scheduling
– avoids per-object scheduling overhead

Simple framework and algorithm, and efficient!
– allows an essentially A* framework for a given predicate schedule
– enables formal analysis: optimality, scalability


Simple Framework

Separate, global predicate scheduling

Algorithm skeleton:
    find global schedule H
    do:
        schedule next obj o
        probe pr(o, next(o,H))
    until (top-k identified)

[Figure: grid of probes, with objects a, b, c as rows and predicates p1, p2, p3 as columns, for the schedule H=(p1,p2,p3)]


Challenges for Minimizing Probing

Predicate scheduling before probing
– how to identify the best H?

Object scheduling during probing
– how to find the next object to probe, to achieve “minimal probing” with respect to H?

Algorithm skeleton:
    find global schedule H
    do:
        schedule next obj o
        probe pr(o, next(o,H))
    until (top-k identified)


Challenge 1: Object Scheduling

Goal: Perform only necessary probes

Necessary probes:
– A probe is necessary if the top-k answers cannot be determined by any algorithm without it, regardless of the outcomes of other probes.
Question 1: Given a probe pr(o, next(o,H)), how do we determine whether it is necessary?

Probe-optimal algorithm:
– An algorithm is probe-optimal if it performs only the necessary probes.
Question 2: How do we identify necessary probes in order to design such an algorithm?


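Question 1: Is this Probe Necessary?

k=1, F=min(x,p1,p2); suppose H=(p1,p2)

Unprobed predicates are taken at their maximal score 1, so the last column is an upper bound on each object's final score.

OID  x    p1  p2  F=min(x,p1,p2)
a    0.9  1   1   0.9  (top 1)
b    0.8  ?       0.8  (Maybe not!)
c    0.7  1   1   0.7
d    0.6  1   1   0.6
e    0.5  1   1   0.5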


Question 1: Is this Probe Necessary?

k=1, F=min(x,p1,p2); suppose H=(p1,p2)

OID  x    p1  p2  F=min(x,p1,p2)
a    0.9  ?       0.9  (top 1? Necessary!)
b    0.8  1   1   0.8
c    0.7  1   1   0.7
d    0.6  1   1   0.6
e    0.5  1   1   0.5

Theorem: Probe pr(o,p) is absolutely necessary if o is among the current top-k in terms of ceiling scores.
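The ceiling scores used by the theorem can be sketched in Python as follows. This is a minimal illustration under stated assumptions: scores lie in [0, 1], the scoring function is monotonic, and the helper names `ceiling_score` and `top_k_by_ceiling` are hypothetical.

```python
def ceiling_score(scoring, known, predicates, max_score=1.0):
    """Upper bound of a monotonic scoring function: every predicate that
    has not been probed yet is replaced by the maximal score."""
    return scoring([known.get(p, max_score) for p in predicates])


def top_k_by_ceiling(ceilings, k):
    """Objects currently in the top-k by ceiling score. By the theorem,
    the next probe of such an object (if any remains) is necessary."""
    return sorted(ceilings, key=ceilings.get, reverse=True)[:k]


PREDICATES = ("x", "p1", "p2")

# Only x is known so far, as in the slide's table.
known = {
    "a": {"x": 0.9},
    "b": {"x": 0.8},
    "c": {"x": 0.7},
    "d": {"x": 0.6},
    "e": {"x": 0.5},
}

ceilings = {oid: ceiling_score(min, vals, PREDICATES) for oid, vals in known.items()}
print(top_k_by_ceiling(ceilings, k=1))
```

With k=1 only object a is in the current top-k, so its next probe is necessary, while a probe on b may turn out to be avoidable.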


Question 2: Probe-optimal object scheduling

Objects in the current top-k must be further probed.

Probe-optimal object scheduling: Algorithm MPro
– use a priority queue with ceiling scores as priorities

Step  Probe             Priority queue after the probe
0     (initial)         a:0.9, b:0.8, c:0.7, d:0.6, e:0.5
1     pr(a,p1) = 0.85   a:0.85, b:0.8, c:0.7, d:0.6, e:0.5
2     pr(a,p2) = 0.75   b:0.8, a:0.75, c:0.7, d:0.6, e:0.5
3     pr(b,p1) = 0.78   b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
4     pr(b,p2) = 0.90   b:0.78, a:0.75, c:0.7, d:0.6, e:0.5

After step 4, b is fully probed and still at the head of the queue, so the top 1 is b:0.78.
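The priority-queue idea on this slide can be sketched in Python. This is a hedged illustration and not the authors' implementation: the function name `mpro`, the in-memory dictionaries standing in for probes, and the tie-breaking by object id are assumptions.

```python
import heapq


def mpro(x_scores, expensive, schedule, k, scoring=min):
    """Probe only objects at the head of a ceiling-score priority queue.
    Returns (top-k list, log of probes performed)."""
    # heapq is a min-heap, so ceiling scores are stored negated.
    # Entry: (-ceiling, oid, known scores, number of probes done).
    heap = [
        (-scoring([x] + [1.0] * len(schedule)), oid, [x], 0)
        for oid, x in x_scores.items()
    ]
    heapq.heapify(heap)
    results, log = [], []
    while heap and len(results) < k:
        neg_ceiling, oid, known, done = heapq.heappop(heap)
        if done == len(schedule):
            # Fully probed and still on top: the score is final.
            results.append((oid, -neg_ceiling))
            continue
        pred = schedule[done]
        value = expensive[pred][oid]  # stands in for the probe pr(oid, pred)
        log.append((oid, pred, value))
        known = known + [value]
        ceiling = scoring(known + [1.0] * (len(schedule) - done - 1))
        heapq.heappush(heap, (-ceiling, oid, known, done + 1))
    return results, log


X = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
EXPENSIVE = {
    "p1": {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70},
    "p2": {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80},
}

top, probes = mpro(X, EXPENSIVE, schedule=("p1", "p2"), k=1)
print(top)
print(probes)
```

On the example data the sketch performs the same four probes as the slide, pr(a,p1), pr(a,p2), pr(b,p1), pr(b,p2), and returns b with score 0.78; objects c, d, and e are never probed.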


Challenge 2: Predicate Scheduling

Scheduling problem
– find the minimal-cost schedule among all permutations

Challenges
– selectivity estimation:
  dynamic predicates
  aggregate selectivities (context-dependent)
– scheduling computation: NP-hard

Our approach:
– on-line sampling to estimate selectivities
– greedy selection to schedule predicates

0.1% sampling achieves almost the best schedule
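One way to combine sampling with greedy selection is sketched below in Python. The slide does not spell out the selection rule, so everything specific here is an assumption for illustration: the sample is fully probed, the k-th best sample score serves as a threshold, and each greedy step picks the predicate that prunes the most surviving sample objects per unit cost. The names `greedy_schedule`, `costs`, and `k_sample` are hypothetical.

```python
def greedy_schedule(sample, costs, k_sample, scoring=min):
    """sample: {oid: {pred: score}}, fully probed, including search predicate 'x'.
    costs: {pred: probe cost} for the expensive predicates.
    Returns a predicate order chosen greedily (assumed rule, see lead-in)."""
    all_preds = ["x"] + sorted(costs)

    def ceiling(scores, probed):
        # Unprobed predicates count as the maximal score 1.0.
        return scoring([scores[p] if p in probed else 1.0 for p in all_preds])

    # Threshold estimate: the k-th best final score within the sample.
    finals = sorted((ceiling(s, all_preds) for s in sample.values()), reverse=True)
    theta = finals[k_sample - 1]

    probed, schedule, remaining = ["x"], [], sorted(costs)
    while remaining:
        # Objects whose ceiling still reaches the threshold need more probes.
        alive = [s for s in sample.values() if ceiling(s, probed) >= theta]

        def pruned_per_cost(p):
            pruned = sum(ceiling(s, probed + [p]) < theta for s in alive)
            return pruned / costs[p]

        best = max(remaining, key=pruned_per_cost)
        schedule.append(best)
        probed.append(best)
        remaining.remove(best)
    return schedule


# Toy "sample": the five example objects, with unit probe costs.
SAMPLE = {
    "a": {"x": 0.9, "p1": 0.85, "p2": 0.75},
    "b": {"x": 0.8, "p1": 0.78, "p2": 0.90},
    "c": {"x": 0.7, "p1": 0.75, "p2": 0.20},
    "d": {"x": 0.6, "p1": 0.90, "p2": 0.90},
    "e": {"x": 0.5, "p1": 0.70, "p2": 0.80},
}

print(greedy_schedule(SAMPLE, {"p1": 1.0, "p2": 1.0}, k_sample=1))
```

On this toy sample the rule places p2 first, because probing p2 immediately rules out object a, whereas p1 rules out nothing.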


Experiment Results

Practical performance of MPro
– cost proportional to the retrieval size k
– significant speedup for small k

Impact of performance factors
– database size: sublinear cost scalability
– score distribution and scoring function: see paper

[Figure callouts: 6 hours, 2 min]


Demo: House Search

Data: all houses on sale in Illinois (N=20990)
– from www.realtor.com
– objects: house(id, price, size, bed, bath, zip, city)

Query: F = Average(n, c, r)
– n (nearcity): close to Chicago
– c (cheap): “reasonable” price for its size
– r (roomy): prefer 4-6 rooms


Summary of Contributions (more in the paper)

Abstraction:
– for user-defined, external, and fuzzy join predicates

Framework and algorithm:
– sampling-based global scheduling
– probe-optimal algorithm MPro
– extensions of MPro: fuzzy joins, parallel MPro, approximation

Principles/Theorems:
– necessary-probe principle
– probe-optimality of MPro
– analytical scalability of MPro

Extensive experiments


Thank You!


Parallel MPro: Overview

Probe-parallel MPro
– perform k necessary probes concurrently
– up to k-fold speedup

Data-parallel MPro
– partition the data into s chunks
– up to s-fold speedup

[Figure: data chunks processed by separate MPro instances, whose results are merged into the top-k]
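The data-parallel variant can be sketched in Python as a partition, local top-k, and merge pipeline. This is an illustration under assumptions, not the authors' design: `local_top_k` is a stand-in that ranks already-computed scores instead of running MPro on its chunk, and the thread pool and round-robin partitioning are choices made here.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def local_top_k(chunk, k):
    """Stand-in for running MPro on one chunk: here it simply ranks
    already-computed (oid, score) pairs."""
    return heapq.nlargest(k, chunk, key=lambda pair: pair[1])


def data_parallel_top_k(scored, k, s):
    """Partition into s chunks, find each chunk's top-k, then merge."""
    chunks = [scored[i::s] for i in range(s)]  # round-robin partition
    with ThreadPoolExecutor(max_workers=s) as pool:
        partial = list(pool.map(lambda chunk: local_top_k(chunk, k), chunks))
    merged = [pair for part in partial for pair in part]
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])


# Final scores of the five example objects under F = min(x, p1, p2).
SCORED = [("a", 0.75), ("b", 0.78), ("c", 0.20), ("d", 0.60), ("e", 0.50)]

print(data_parallel_top_k(SCORED, k=2, s=3))
```

The merge is correct because every member of the global top-k is also in the top-k of its own chunk, so taking the top-k of the union of the local results loses nothing.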


Scalability

[Figure: scalability plots; panel labels: k=100, N=1000; k=1000, N=10000; k=10000, N=100000; and N=1000, N=10000, N=100000]


Comparison

[Figure: time (sec), with axis ticks from 0.01 to 100, plotted against k from 1 to 100; legend: probe, scheduling]