Minimal Probing: Supporting Expensive Predicates for Top-k Queries
Kevin C. Chang, Seung-won Hwang
Univ. of Illinois at Urbana-Champaign
Context: Top-k Queries
Ranked queries return top-k results, unlike Boolean queries.
Crucial for retrieving data by “soft” conditions
– relevance: e.g., text search engines
– similarity: e.g., multimedia databases
– preference: e.g., e-commerce product search
Example scenario: preference query for finding a house:
select h.id from house h
where new(age), cheap(price, size), large(size)
order by min(new,cheap,large) stop after 5
(new, cheap, large: predicates; min(new,cheap,large): scoring function; 5: k, the retrieval size)
Observation: Crucial to support expensive predicates
Problem: Expensive Predicates
Expensive predicates
– no pre-computed indexes for zero-time sorted access
– need a probe to evaluate each object (similar to sequential scan)
Unified abstraction for:
– user-defined functions: functional extensibility
  query conditions can be arbitrary, user-specific, e.g., cheap(price,size)
– external predicates: data extensibility
  source interface may require one probe per object, e.g., safe(zip) accesses the crime rate from apbnews.com
– fuzzy joins
  associations of relations can be arbitrary, e.g., close(house.zip, park.zip)
Require sorted access of search predicates.
To “simulate” sorted access, require complete probing
– are these probes necessary?
Goal: Minimize probe cost
Current Limitations: “Sort-Merge” Framework
[Figure: sort-merge framework for F = min(new,cheap,large), k = 1; new is a search predicate, cheap and large are expensive predicates]
Sort step: one sorted list per predicate
d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
b:0.90, d:0.90, e:0.80, a:0.75, c:0.20
a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
Merge step: merge algorithm combines the sorted lists
Top-k output: b:0.78
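The cost of this framework can be sketched as follows: every object is probed on every expensive predicate just to build the sorted lists. The scores come from the example above. The names `X`, `P1`, `P2`, the assignment of each list to a column, and the function `sort_merge_top_k` are illustrative assumptions, and the merge step is reduced to a plain sort.

```python
# Scores from the slides' running example. Column names and the mapping of
# each sorted list to a column are assumptions made for illustration.
X = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}   # search predicate
P1 = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}  # expensive predicate
P2 = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}  # expensive predicate


def sort_merge_top_k(k):
    """Complete probing: evaluate every expensive predicate on every object."""
    probes = 0
    probed = {"p1": {}, "p2": {}}
    for name, source in (("p1", P1), ("p2", P2)):
        for oid in X:
            probed[name][oid] = source[oid]  # one probe per (object, predicate)
            probes += 1
    # Simplified merge: combine all scores with F = min and sort the result.
    final = {oid: min(X[oid], probed["p1"][oid], probed["p2"][oid]) for oid in X}
    top = sorted(final.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return top, probes


top, probes = sort_merge_top_k(1)
print(top, probes)
```

With five objects and two expensive predicates, this baseline always spends ten probes, whatever k is.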
Motivation: Solution Space
Assume sequential probing:
Algorithm skeleton:
do:
  schedule next obj o, pred p
  probe pr(o,p)
until (top-k identified)
[Figure: grid of objects (a, b, c) against predicates (p1, p2, p3)]
Our framework: Separate, Global Predicate Scheduling
Two important decisions on framework:
Separate predicate scheduling
– scheduling as separate “optimization” phase before probing
– avoid run-time scheduling overhead
Global predicate scheduling
– scheduling based on global info (predicate selectivities)
– lack of per-object information to justify per-object scheduling
– avoid per-object scheduling overhead
Simple framework and algorithm
– and efficient!
– allow essentially A* framework, for given predicate schedule
– enable formal analysis: optimality, scalability
Separate, global predicate scheduling
Simple Framework
Algorithm skeleton:
find global schedule H
do:
  schedule next obj o
  probe pr(o, next(o,H))
until (top-k identified)
[Figure: grid of objects (a, b, c) against predicates (p1, p2, p3), with H=(p1,p2,p3)]
Challenges for Minimizing Probing
Predicate scheduling before probing
– how to identify the best H?
Object scheduling during probing
– how to find next object to probe, for achieving “minimal probing” with respect to H?
Algorithm skeleton (open steps marked with ?):
find global schedule H  ?
do:
  schedule next obj o  ?
  probe pr(o, next(o,H))
until (top-k identified)
Challenge 1: Object Scheduling
Goal: Perform only necessary probes
Necessary probes:
– A probe is necessary if top-k answers cannot be determined by any algorithm without it, regardless of the outcomes of other probes.
Question 1: Given a probe pr(o, next(o,H)), how to determine if it is necessary?
Probe-optimal algorithm
– An algorithm is probe-optimal if it performs only the necessary probes.
Question 2: How to identify necessary probes in order to design such an algorithm?
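Question 1: Is this Probe Necessary?
k=1, F=min(x,p1,p2); suppose H=(p1,p2)
Is a probe on b necessary? (? = not yet probed; the other rows show one possible outcome)
OID  x    p1  p2  F=min(x,p1,p2)
a    0.9  1   1   0.9   (top 1)
b    0.8  ?   ?   at most 0.8
c    0.7  1   1   0.7
d    0.6  1   1   0.6
e    0.5  1   1   0.5
Maybe not! If a's probes both return 1, a is the top-1 answer with 0.9, and b (at most 0.8) never needs to be probed.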
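Question 1: Is this Probe Necessary?
k=1, F=min(x,p1,p2); suppose H=(p1,p2)
Is a probe on a necessary? (? = not yet probed; the other rows show their best possible outcome)
OID  x    p1  p2  F=min(x,p1,p2)
a    0.9  ?   ?   at most 0.9   (top 1?)
b    0.8  1   1   0.8
c    0.7  1   1   0.7
d    0.6  1   1   0.6
e    0.5  1   1   0.5
Necessary! Even if every other object reaches its ceiling score, a's ceiling (0.9) is still the highest, so the top-1 answer cannot be decided without probing a.
Theorem: Probe pr(o,p) is absolutely necessary, if o is among the current top-k in terms of ceiling scores.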
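The theorem can be checked mechanically with a minimal sketch, assuming F = min and scores in [0, 1], so that an object's ceiling score is simply the minimum of the scores known so far (unprobed predicates count as 1). The helper names `ceiling`, `current_top_k`, and `probe_is_necessary` are hypothetical.

```python
def ceiling(known_scores):
    """Upper bound on F = min(...) when each unprobed predicate is assumed to be 1."""
    return min(known_scores)


def current_top_k(state, k):
    """Object ids of the k highest ceiling scores."""
    return sorted(state, key=lambda oid: ceiling(state[oid]), reverse=True)[:k]


def probe_is_necessary(state, oid, k):
    """Necessary-probe test: oid is among the current top-k by ceiling score."""
    return oid in current_top_k(state, k)


# Only the search predicate x is known at the start (values from the slides).
state = {"a": [0.9], "b": [0.8], "c": [0.7], "d": [0.6], "e": [0.5]}
print(probe_is_necessary(state, "a", 1))  # a holds the highest ceiling
print(probe_is_necessary(state, "b", 1))  # b does not, so its probe can wait
```

The test only says which probes are necessary right now; a probe that fails it may become necessary later, once other objects' ceilings drop.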
Question 2: Probe-optimal object scheduling
Objects in current top-k must be further probed
Probe-optimal object scheduling: Algorithm MPro
– use a priority queue with ceiling scores as priorities
Trace (k=1, H=(p1,p2)):
Step  Probe            Priority queue after the probe
0     (initial)        a:0.9, b:0.8, c:0.7, d:0.6, e:0.5
1     pr(a,p1) = 0.85  a:0.85, b:0.8, c:0.7, d:0.6, e:0.5
2     pr(a,p2) = 0.75  b:0.8, a:0.75, c:0.7, d:0.6, e:0.5
3     pr(b,p1) = 0.78  b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
4     pr(b,p2) = 0.90  b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
top 1: b:0.78
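The trace above can be reproduced with a small sketch of the priority-queue loop. It assumes F = min with scores in [0, 1], and simulates expensive probes by dictionary lookups. The function name `mpro`, its signature, and the dictionaries `X` and `PREDS` are illustrative, not the authors' code.

```python
import heapq

# Running example from the slides; dictionary lookups stand in for real probes.
X = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
PREDS = {
    "p1": {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70},
    "p2": {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80},
}


def mpro(x, preds, schedule, k):
    """Return (top-k results, list of probes performed) for F = min."""
    # Heap entries: (negated ceiling score, object id, predicates probed so far).
    heap = [(-score, oid, 0) for oid, score in x.items()]
    heapq.heapify(heap)
    results, probes = [], []
    while heap and len(results) < k:
        neg_ceiling, oid, done = heapq.heappop(heap)
        if done == len(schedule):
            # Fully probed and still on top: the ceiling is the exact score.
            results.append((oid, -neg_ceiling))
            continue
        pred = schedule[done]          # next unprobed predicate of oid under H
        score = preds[pred][oid]       # the (simulated) expensive probe
        probes.append((oid, pred, score))
        heapq.heappush(heap, (-min(-neg_ceiling, score), oid, done + 1))
    return results, probes


results, probes = mpro(X, PREDS, ("p1", "p2"), k=1)
print(results)
print(probes)
```

On this data the loop performs four probes (two on a, two on b), against the ten required by complete probing.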
Challenge 2: Predicate Scheduling
Scheduling problem
– find minimal cost schedule from permutations
Challenges
– selectivity estimation:
  dynamic predicates
  aggregate selectivities (context-dependent)
– scheduling computation: NP-hard
Our approach:
– on-line sampling to estimate selectivities
– greedy selection to schedule predicates
0.1% sampling achieves almost the best schedule
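One way to picture sampling plus greedy selection is sketched below. The slides do not spell out the greedy criterion, so this is an assumption: at each step pick the predicate that, per unit cost, removes the most sampled objects whose ceiling score is still at or above a threshold `theta`. The names `greedy_schedule`, `SAMPLE`, `cost`, and `theta` are hypothetical, and the tiny "sample" is just the running example.

```python
# Hypothetical sample: each sampled object has been probed on every predicate.
SAMPLE = {
    "a": {"x": 0.9, "p1": 0.85, "p2": 0.75},
    "b": {"x": 0.8, "p1": 0.78, "p2": 0.90},
    "c": {"x": 0.7, "p1": 0.75, "p2": 0.20},
    "d": {"x": 0.6, "p1": 0.90, "p2": 0.90},
    "e": {"x": 0.5, "p1": 0.70, "p2": 0.80},
}


def greedy_schedule(sample, predicates, cost, theta):
    """Greedy heuristic (an assumption, not the paper's exact procedure)."""
    known = {oid: [row["x"]] for oid, row in sample.items()}
    remaining, schedule = list(predicates), []

    def survivors(extra=None):
        # Count objects whose ceiling (F = min) stays >= theta,
        # optionally after also evaluating predicate `extra`.
        count = 0
        for oid, scores in known.items():
            values = scores + ([sample[oid][extra]] if extra else [])
            if min(values) >= theta:
                count += 1
        return count

    while remaining:
        before = survivors()
        # Prefer the predicate that filters the most survivors per unit cost.
        best = max(remaining, key=lambda p: (before - survivors(p)) / cost[p])
        schedule.append(best)
        remaining.remove(best)
        for oid in known:
            known[oid].append(sample[oid][best])
    return schedule


print(greedy_schedule(SAMPLE, ["p1", "p2"], {"p1": 1.0, "p2": 1.0}, theta=0.78))
```

On this toy sample p2 filters out a while p1 filters nothing, so the heuristic places p2 first; a real run would draw the sample at random and estimate `theta` from it.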
Experiment Results
Practical performance of MPro
– cost proportional to the retrieval size k
– significant speedup for small k
Impact of performance factors
– database size: sublinear cost scalability
– score distribution and scoring function: see paper
[Figure labels: 6 hour, 2 min]
Demo : House Search
Data: All houses on sale in Illinois (N=20990)
– from www.realtor.com
– objects: house(id, price, size, bed, bath, zip, city)
Query: F = Average(n, c, r)
– n nearcity: close to Chicago
– c cheap: “reasonable” price for its size
– r roomy: prefer 4-6 rooms
Summary of Contributions (more in the paper)
Abstraction:
– for user-defined, external, and fuzzy join predicates
Framework and algorithm:
– sampling-based global scheduling
– probe-optimal algorithm MPro
– extensions of MPro: fuzzy joins, parallel MPro, approximation
Principles/Theorems:
– necessary-probe principle
– probe-optimality of MPro
– analytical scalability of MPro
Extensive experiments
Thank You!
Parallel MPro: Overview
Probe-parallel MPro
– Probe k necessary probes concurrently
– Up to k-fold speedup
Data-parallel MPro
– Partition data into s chunks
– Up to s-fold speedup
[Figure: several MPro instances, one per chunk, feed a Merge step that outputs the top-k]
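The data-parallel variant can be sketched as partition, per-chunk top-k, then merge. To keep the example short, a plain `heapq.nlargest` over final scores stands in for running MPro on each chunk, and the chunks are processed sequentially. The function names and the round-robin partitioning are assumptions.

```python
import heapq

# Final scores F = min(x, p1, p2) of the running example.
OBJECTS = [("a", 0.75), ("b", 0.78), ("c", 0.20), ("d", 0.60), ("e", 0.50)]


def top_k(pairs, k):
    """Stand-in for one MPro instance: the k highest-scoring (oid, score) pairs."""
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])


def data_parallel_top_k(objects, k, s):
    """Partition into s chunks, take the top-k of each, then merge."""
    chunks = [objects[i::s] for i in range(s)]           # round-robin partition
    partial = [top_k(chunk, k) for chunk in chunks]      # one worker per chunk
    merged = [pair for part in partial for pair in part]
    return top_k(merged, k)


print(data_parallel_top_k(OBJECTS, k=2, s=3))
```

The merge is correct because any object in the global top-k must also be in the top-k of its own chunk.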
Scalability
[Figure: scalability plots for k=100, N=1000; k=1000, N=10000; k=10000, N=100000]
Comparison
[Figure: time (sec), axis ticks from 0.01 to 100, versus k (1, 10, 100), split into probe and scheduling time]