Supporting Ranking in Queries: Score-based Paradigm. Russell Greenspan, CS 411, Spring 2004.


Page 1

Supporting Ranking in Queries: Score-based Paradigm

Russell Greenspan, CS 411, Spring 2004

Page 2

Supporting Ranking in Queries – Talk Outline

What
Why
How
– "Out-of-the-box" support
– "Smart" top-k processing

Page 3

Ranking in Queries – What is ranking in queries?

A mechanism to return only the top-k results
– Closest matches to user-specified boolean criteria
– Scoring results based on user-specified predicates

SELECT Address
FROM HousesForSale
ORDER BY Best(Size, Price)

Express similarity, relevance, or preference to a given query

Page 4

What is ranking in queries? – Definitions

Intuitive
– Output an ordered list of k items such that the list includes only those items whose scored rank is greater than that of the items not included

Formal
– "Given retrieval size k and scoring function F, a ranked query returns a list K of k objects (i.e. |K| = k) with query scores, sorted in a descending order, such that F(t1, ..., tn)[u] > F(t1, ..., tn)[v] for all u in K and all v not in K." [Chang, Hwang, 2002]
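To make the formal definition concrete, here is a minimal Python check of the property, where F(u) stands for F(t1, ..., tn)[u] (function and variable names are illustrative, not from the paper):

def is_valid_ranked_result(K, all_objects, F, k):
    # |K| = k, and every object in K strictly outscores every object outside K
    if len(K) != k:
        return False
    outside = [obj for obj in all_objects if obj not in K]
    return all(F(u) > F(v) for u in K for v in outside)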

Page 5

What is ranking in queries? – Differences from traditional queries

How does this differ from traditional queries?
– Traditional queries:
  Do not stop processing until all results are computed
  Do not focus on ranking tuples to best match the input query
– Traditional boolean queries:
  Do not return "close" matches
  Can "over" or "under" match, producing too few or too many results

Page 6

Ranking in Queries – Why use ranking in queries?

Exact matches not required
– Often, something "close enough" satisfies a user's demands

Fuzzy matches desired
– Multimedia/image matching, where the very nature of the query does not involve an exact match

Avoid unnecessary computations
– Find the "best" answers quickly, as opposed to all answers

Page 7

Ranking in Queries – How do we execute ranked queries?

"Out-of-the-box" support
– Perform the query as any other, then sort and return only the first k rows
– Why is this bad?
  Lots of unnecessary processing
  Waste of resources on intermediate results
  If the scoring function is expensive, could result in computation of unneeded scores

Can we do better?

Page 8

How do we execute ranked queries? – "Smart" Ranked Query Execution

Query Processing
– Try to achieve a significant reduction in query execution time
– Use mid-query (i.e. as the query executes) techniques to optimize the query plan for top-k results
– Consider the minimal number of tuples necessary to return k results

Scoring Predicate
– Consider the expense of the scoring function in determining the optimal query plan

Page 9

"Smart" Ranked Query Execution – Two Areas of Research Focus

Top-k processing
– Reducing the number of tuples considered at each intermediate step
  Assumes minimal work is needed to retrieve items sorted by score (i.e. indexes on simple attributes)

Rank function
– Reducing the number of calls to the ranking function
  Assumes the rank calculation is expensive
– Implementing unusual ranking functions

Page 10

"Smart" Ranked Query Execution – Research and Techniques

Reducing the number of tuples considered
– Middleware/Multimedia
  Garlic [Fagin, 1999]
  CHITRA [Nepal, Ramakrishna, 1999]
– Relational
  STOP Operator [Carey, Kossmann, 1997]
  Probabilistic [Donjerkovic, Ramakrishnan, 1999]
  Statistical [Chaudhuri, Gravano, 1999]

Reducing the number of calls to the ranking function
– MPro [Chang, Hwang, 2002]

Implementing unusual ranking functions
– AutoRank [Agrawal, Chaudhuri, Das, Gionis, 2003]

Page 11

"Smart" Ranked Querying (Middleware) – Garlic [Fagin, 1999]

Integrates data from different database systems or non-database data servers
– Relational query set vs. "sorted list"

Example: "Return the reddest covers of Beatles' albums", i.e. (Artist = 'Beatles') AND (AlbumColor LIKE 'red'), where artists are stored relationally and album colors in a multimedia database

Assign a grade to each object
– Boolean: grade is either 0 or 1
– Fuzzy: value 0 <= x <= 1 indicating closeness
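To illustrate the two kinds of grades in Python (the fuzzy measure below is an assumption invented for this example, not Garlic's actual image metric):

def boolean_grade(matches):
    # A boolean predicate grades exactly 0 or 1
    return 1.0 if matches else 0.0

def fuzzy_redness(r, g, b):
    # One possible closeness grade in [0, 1]: the red share of total intensity
    total = r + g + b
    return r / total if total else 0.0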

Page 12

Garlic [Fagin, 1999] – Rank Processing Methods

How do we combine two fuzzy values to retrieve the top-k objects?
– Inefficient
  Consider graded sets of all objects by color and shape
  Compute the combined score for every object, then output the top k objects
– Efficient
  Retrieve objects (sorted by grade) from each subsystem until there are at least k of the same objects in each set
  Compute the combined score for each of these k objects

Page 13

Garlic [Fagin, 1999] – Example Query

Example (combined scoring function = x * y):
Return Top 2 where Color = 'red' AND Shape = 'round'

Object | Roundness | Redness
A      | .6        | .2
B      | .8        | .6
C      | .3        | .1
D      | .2        | .8
E      | .9        | .3
F      | .1        | .5
G      | .7        | .9
H      | .4        | .3

Page 14

Garlic [Fagin, 1999] – Inefficient vs. Efficient Processing

Inefficient
– Calculate the combined score for every object
– Sort by score
– Return the top k objects: {G, B}

Object | Score
A      | .12
B      | .48
C      | .03
D      | .16
E      | .27
F      | .05
G      | .63
H      | .12

Page 15

Garlic [Fagin, 1999] – Inefficient vs. Efficient Processing

Efficient (Fagin's A0 algorithm)
– Consider ordered members from each set until there are k of the same objects in each set
  A1 = {G(.9), D(.8), B(.6)}
  A2 = {E(.9), B(.8), G(.7)}
– Calculate the combined score for each of the k common objects
  G = .9 * .7 = .63
  B = .6 * .8 = .48
– Return these objects ordered by combined score: {G, B}
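A minimal Python sketch of the procedure as described above, run on the roundness/redness tables from the example (this follows the slide's simplification, not Fagin's full A0, which also probes every object seen under sorted access):

roundness = {'A': .6, 'B': .8, 'C': .3, 'D': .2, 'E': .9, 'F': .1, 'G': .7, 'H': .4}
redness   = {'A': .2, 'B': .6, 'C': .1, 'D': .8, 'E': .3, 'F': .5, 'G': .9, 'H': .3}

def a0_top_k(grade_lists, combine, k):
    # Sorted access into each list in parallel, one object per round,
    # until at least k objects have been seen in every list
    ordered = [sorted(g.items(), key=lambda kv: -kv[1]) for g in grade_lists]
    seen = [set() for _ in grade_lists]
    common = set()
    for depth in range(len(ordered[0])):
        for s, lst in zip(seen, ordered):
            s.add(lst[depth][0])
        common = set.intersection(*seen)
        if len(common) >= k:
            break
    # Random access for the remaining grades of each common object
    scored = [(obj, combine(*(g[obj] for g in grade_lists))) for obj in common]
    return sorted(scored, key=lambda pair: -pair[1])[:k]

print(a0_top_k([roundness, redness], lambda x, y: x * y, 2))  # G (~.63), then B (~.48)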

Page 16

Garlic [Fagin, 1999] – Conclusions

Why is this more efficient?
– Incurs the expense of the scoring function k times, as opposed to n times (where n is the total number of items)
– Accesses each subsystem at least k and at most n times, as opposed to exactly n times

Page 17

"Smart" Ranked Querying (Middleware) – CHITRA [Nepal, Ramakrishna, 1999]

Expands on Fagin's Garlic work by proposing a new "multi-step" processing algorithm

Experimental results show a 50% improvement

Page 18

CHITRA [Nepal, Ramakrishna, 1999] – "Multi-step" Algorithm

In each iteration, consider the next sorted item x from each subsystem i

Perform random access into every other subsystem to obtain the other grades of x

Add an object to the result set if its combined grade reaches the threshold; quit when we have k objects
– The threshold is the combined grade of the sorted-access items considered in that iteration

Page 19

CHITRA [Nepal, Ramakrishna, 1999] – Example Query

Back to our example:
Return Top 2 where Color = 'red' AND Shape = 'round'

Object | Roundness | Redness
A      | .6        | .2
B      | .8        | .6
C      | .3        | .1
D      | .2        | .8
E      | .9        | .3
F      | .1        | .5
G      | .7        | .9
H      | .4        | .3

Page 20

CHITRA [Nepal, Ramakrishna, 1999] – Example Scoring Function Results

Consider two scoring functions as examples: min[x, y] and [x * y].

min[x, y]:
Iter. | Items                      | Grades                                     | Threshold        | Result set
1     | i1 = {G(.9)}, i2 = {E(.9)} | G = min[.9, .7] = .7, E = min[.9, .3] = .3 | min[.9, .9] = .9 |
2     | i1 = {D(.8)}, i2 = {B(.8)} | D = min[.8, .2] = .2, B = min[.8, .6] = .6 | min[.8, .8] = .8 |
3     | i1 = {B(.6)}, i2 = {G(.7)} | B = min[.6, .8] = .6, G = min[.7, .9] = .7 | min[.6, .7] = .6 | {G, B}

[x * y]:
Iter. | Items                      | Grades                                 | Threshold     | Result set
1     | i1 = {G(.9)}, i2 = {E(.9)} | G = .9 * .7 = .63, E = .9 * .3 = .27   | .9 * .9 = .81 |
2     | i1 = {D(.8)}, i2 = {B(.8)} | D = .8 * .2 = .16, B = .8 * .6 = .48   | .8 * .8 = .64 |
3     | i1 = {B(.6)}, i2 = {G(.7)} | B = .6 * .8 = .48, G = .7 * .9 = .63   | .6 * .7 = .42 | {G, B}
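Here is a minimal Python sketch of the multi-step loop, reusing the roundness/redness dictionaries from the earlier A0 sketch; it reproduces the iteration tables above:

def multistep_top_k(grade_lists, combine, k):
    ordered = [sorted(g.items(), key=lambda kv: -kv[1]) for g in grade_lists]
    results = {}
    for depth in range(len(ordered[0])):
        frontier = [lst[depth] for lst in ordered]       # one sorted access per list
        threshold = combine(*(grade for _, grade in frontier))
        for obj, _ in frontier:                          # random access for the other grades
            results.setdefault(obj, combine(*(g[obj] for g in grade_lists)))
        final = [(o, s) for o, s in results.items() if s >= threshold]
        if len(final) >= k:
            return sorted(final, key=lambda pair: -pair[1])[:k]

print(multistep_top_k([roundness, redness], min, 2))  # stops at iteration 3 with {G, B}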

Page 21

CHITRA [Nepal, Ramakrishna, 1999] – Conclusions

Why is this more efficient?
– Requires fewer accesses to each subsystem

How do we know this algorithm is correct?
– Proof by contradiction
  Assume an excluded object z that should have been included in place of some included object y, i.e. Rank(z) > Rank(y)
  For Rank(z) > Rank(y) to hold, either:
  – y must have at least one subsystem rank smaller than all subsystem ranks of z, or
  – z must have at least one subsystem rank greater than all subsystem ranks of y
  However, since Rank(z) < Threshold and Rank(y) >= Threshold, Rank(z) cannot be greater than Rank(y), a contradiction

Page 22

"Smart" Ranked Querying (Relational) – STOP Operator [Carey, et al, 1997]

Specifies an extension to the SQL-92 standard to allow a limit on the cardinality of the result
– STOP AFTER

Return a subset of results from each section of the query plan

Implement with a STOP operator
– STOP(N, D, E), where N is the number of desired tuples, D is the sort directive [asc, desc, none], and E is the sort expression
– Heuristically determine when and how to apply it
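As a standalone sketch of what STOP(N, D, E) computes (the paper's contribution is the heuristics for placing the operator in a plan, which are not modeled here):

import heapq
from itertools import islice

def stop(tuples, n, directive='none', expr=None):
    # STOP(N, D, E): emit at most N tuples; with a sort directive,
    # emit the N best tuples under sort expression E
    if directive == 'none':
        return list(islice(tuples, n))
    if directive == 'desc':
        return heapq.nlargest(n, tuples, key=expr)
    return heapq.nsmallest(n, tuples, key=expr)

# e.g. the 10 highest-paid employees: stop(emps, 10, 'desc', lambda e: e.salary)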

Page 23

STOP Operator [Carey, et al, 1997] – Example query plans

Fig. a shows a traditional join plan
– Join all of EMP to DEPT, sort, output the top k

Fig. b shows a plan with STOP operators inserted
– Based on cardinality estimates, only 20 rows of EMP need be joined with 30 rows of DEPT to produce a top-k of 10

Page 24

STOP Operator [Carey, et al, 1997] – Conservative Heuristic

Ensures that every tuple in each intermediate result is guaranteed to generate at least one tuple of the overall query result

Advantages
– No restarts from intermediate processing returning fewer than k results
– Intermediate STOP operators take their N value from the overall query's k value

Disadvantages
– Only inserts STOP operators where all remaining predicates are non-reductive (cannot be used with multi-way joins)

Page 25

STOP Operator [Carey, et al, 1997] – Aggressive Heuristic

Applies the STOP operator wherever it may be beneficial, reducing intermediate results to a greater degree

Chooses the N value using cardinality estimates

Requires a RESTART operator when intermediate processing returns too few results
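A sketch of the restart loop, assuming a hypothetical run_plan(n) that executes the plan with an intermediate STOP(n) pushed down:

def run_with_restart(run_plan, k, total_rows, estimate_factor=1.0):
    # Guess an intermediate cardinality from optimizer estimates; if fewer
    # than k rows survive, RESTART with a (much) larger cutoff
    n = min(total_rows, max(k, int(k * estimate_factor)))
    while True:
        rows = run_plan(n)
        if len(rows) >= k or n == total_rows:
            return rows[:k]
        n = min(total_rows, n * 10)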

Page 26

STOP Operator [Carey, et al, 1997] – Experimental Results

Which heuristic is better?
– Depends on cardinality, the expense of processing intermediate results, the accuracy of prediction, etc.
– With a low expense of processing intermediate results, experiments show aggressive overestimation performs best:

Traditional | Conservative | Aggressive, Underestimate (1/10) | Aggressive, Overestimate (10)
128.3 sec   | 63.9 sec     | 63.1 sec                         | 18.5 sec

Page 27

STOP Operator [Carey, et al, 1997] – Experimental Results

Performance vs. traditional ("out-of-the-box") processing shows benefits in both indexed and non-indexed situations

Page 28

"Smart" Ranked Querying (Relational) – Probabilistic [Donjerkovic, et al, 1999]

Introduces the idea of a 'selection cutoff' to produce the top k results without requiring a sort

Quantifies the risk of fewer than k results being generated, using inherent database statistics
– "List the top 10 paid employees" becomes "List the employees whose salary is greater than x", where x is determined by the distribution of employees' salaries
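A sketch of how the cutoff x might be guessed from a histogram on Salary (illustrative only; the paper additionally quantifies the risk that the cutoff returns fewer than k rows):

def guess_cutoff(buckets, k):
    # buckets: (low, high, row_count) triples over Salary; scan from the
    # highest bucket down until about k rows are expected above the cutoff
    expected = 0
    for low, high, count in sorted(buckets, key=lambda b: -b[1]):
        expected += count
        if expected >= k:
            return low   # then: SELECT ... FROM Emp WHERE Salary > low
    return min(b[0] for b in buckets)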

Page 29

Probabilistic [Donjerkovic, et al, 1999] – Comparison with STOP Operator

In theory, it is likely to be cheaper to simply select the necessary intermediate rows using the cutoff (fig. b) than to perform a sort and return the top k (fig. a)

Page 30

Probabilistic [Donjerkovic, et al, 1999] – Implementation

Leverages the same statistics used by the traditional query optimizer to guess the cutoff
– Histograms
– Selectivity factors

Page 31

Probabilistic [Donjerkovic, et al, 1999] – Performance

For a simple query with no indexes (return the k highest-paid employees, no index on the 'Salary' attribute), easily outperforms the traditional approach (scan, sort, return top k)

Also benefits JOIN queries, due to the complexity of estimating join selectivity

Page 32

"Smart" Ranked Querying (Relational) – Statistical [Chaudhuri, Gravano, 1999]

Expands the probabilistic model

Maps rank queries into boolean range queries

Works with a variety of scoring functions, including Min, Euclidean, and Sum

Page 33

Statistical [Chaudhuri, Gravano, 1999] – Expansion of the probabilistic model

Considers multiple levels of 'selection cutoff', here referred to as 'search score' (Sq)
– NoRestarts: score low enough to guarantee no restarts are needed
– Restarts: score high enough that restarts might result
– Intermediate: score between NoRestarts and Restarts

Page 34

Statistical [Chaudhuri, Gravano, 1999] – Implementation

Determine Sq from histograms
– Choose bounding tuples in each bucket to ensure NoRestarts (fig. a), or tight tuples to minimize selection but potentially require Restarts (fig. b)

Page 35

Statistical [Chaudhuri, Gravano, 1999] – Implementation

Determine the relational query that retrieves all tuples scoring above Sq
– Compute the n-rectangle bounding such tuples
– SELECT *
  FROM R
  WHERE (a1 <= A1 <= b1) AND ... AND (an <= An <= bn)

Compute the score for all returned tuples

Output the top-k tuples with score > Sq, or rerun the query with a lower search score
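As an illustration for the Min scoring function, assuming a per-attribute similarity of the form s_i(t) = 1 - |t.Ai - qi| (an assumption made here for concreteness), a tuple scores at least Sq exactly when every attribute lies within 1 - Sq of the query point, which yields the n-rectangle directly:

def range_query_for_min(q, sq):
    # Min score >= Sq iff every attribute is within 1 - Sq of the query point
    delta = 1 - sq
    clauses = " AND ".join(
        f"({qi - delta:.2f} <= A{i} AND A{i} <= {qi + delta:.2f})"
        for i, qi in enumerate(q, start=1))
    return f"SELECT * FROM R WHERE {clauses}"

print(range_query_for_min([0.5, 0.7], sq=0.8))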

Page 36

Statistical [Chaudhuri, Gravano, 1999] – Expansion of Fagin's model

Expands Fagin's ideas to relational queries
– Substitute a 'search score' query to determine the top tuples for each subsystem
– Use the NoRestarts strategy to ensure that expensive re-querying is avoided

Page 37

"Smart" Ranked Querying (Rank) – MPro [Chang, Hwang, 2002]

Extends top-k querying to expensive (monotonic only) predicates
– As opposed to other work, which assumes the expense of score calculation is minimal

Attempts to minimize the number of scores calculated
– Considers only necessary probes, i.e. only those calculations without which the top-k results cannot be found

Page 38

MPro [Chang, Hwang, 2002] – Determining if a probe is necessary

An object's lowest calculated score so far is its "ceiling score": no further probe can raise the object's final score above it

If the ceiling score falls below a top-k object's complete score, the object is ruled out and no further calculations on it need be performed

Simple example:
– Consider a scoring function like Min, with the top-1 result desired
– If we know object A's combined score over F(x) and F(y) is .8, and we calculate object B's score with respect to F(x) to be .7, then B's score with respect to F(y) need not be calculated (its Min value cannot exceed .7)

Page 39

MPro [Chang, Hwang, 2002] – Determining all necessary probes

Only objects with ceiling scores in the top k need be further evaluated

If objects are kept sorted by current ceiling score:
– For any object u in the top-k slots, its next probe is necessary

Page 40

MPro [Chang, Hwang, 2002] – Minimal Probes Algorithm (MPro)

Priority queue initialization
– Evaluate each object over the first predicate (same as sequentially accessing objects sorted by x)

Necessary probing
– Request from the queue the object with the highest ceiling score
– Evaluate the object over the next predicate y
– Update its ceiling score and reinsert it into the queue

Stop when at least k objects have been completely scored (and output these objects)
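A minimal Python sketch of this loop, assuming Min as the monotonic combining function (so an object's ceiling score is simply the minimum of the predicate scores probed so far):

import heapq

def mpro_top_k(objects, predicates, k):
    # Initialize: probe every object with the first predicate; the heap is
    # ordered by descending ceiling score (negated for Python's min-heap)
    heap = [(-predicates[0](obj), 1, obj) for obj in objects]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        neg_ceiling, probed, obj = heapq.heappop(heap)
        if probed == len(predicates):
            # Fully probed and still on top: its score is final and safe to output
            results.append((obj, -neg_ceiling))
        else:
            # The necessary probe: evaluate the next predicate, lower the ceiling
            ceiling = min(-neg_ceiling, predicates[probed](obj))
            heapq.heappush(heap, (-ceiling, probed + 1, obj))
    return results  # objects assumed orderable, for heap tie-breaking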

Page 41

MPro [Chang, Hwang, 2002] – Further Applications

Incremental results
– Output the top k, then resume processing where it left off for the next k as the user requests

Fuzzy joins
– Treat the join predicate in the same manner

Parallel processing
– Distribute necessary probes across multiple servers
– Distribute the data, calculate the top n over each chunk, merge the results

Page 42

MPro [Chang, Hwang, 2002] – Experimental Results

On the experimental dataset, over 96% of complete probes were found to be unnecessary

Elapsed time improved significantly, from 21009 to 408 seconds for k = 10

Page 43

"Smart" Ranked Querying (Rank) – AutoRank [Agrawal, et al, 2003]

Considers ranking of relational attributes in a manner similar to Information Retrieval (IR)
– IDF Similarity
  Extends TF-IDF based on the frequency of occurrence of attribute values
– QF Similarity
  Uses the database workload to determine how frequently attributes and attribute values are referenced
  A "poor man's relevance feedback"
– ITA
  An index-based top-k algorithm that exploits the above ranking functions

Page 44

AutoRank [Agrawal, et al, 2003] – IDF Similarity

Extend TF (term frequency)
– IR: frequency of terms in a document
– Relational: frequency of values for an attribute

Extend IDF (inverse document frequency)
– IR: total documents / documents containing the term
– Relational: total tuples / tuples where attribute = value

For tuples matching the queried value, the IDF similarity is the attribute's IDF for that value, and 0 otherwise
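The relational analogy, sketched in Python (the ratio is left un-logged to mirror the slide; IR systems usually log-scale it):

from collections import Counter

def relational_idf(column_values):
    # idf(v) = total tuples / tuples where attribute = v
    counts = Counter(column_values)
    n = len(column_values)
    return {v: n / c for v, c in counts.items()}

def idf_similarity(tuple_value, query_value, idf):
    # Nonzero only for tuples that match the queried value
    return idf.get(query_value, 0.0) if tuple_value == query_value else 0.0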

Page 45

AutoRank [Agrawal, et al, 2003] – QF Similarity

Consider the problem with IDF when the desired result is also the most frequent
– In a realty database where homes built in the last three years are most desired, the few entries for old homes (with higher IDF) would be ranked "top"

Instead, use the frequency with which attribute values occur in executed queries to determine ranking (by examining the workload)

Workload analysis can be extended to draw comparative conclusions from attribute values queried together
– Assume similarity between 'Honda' and 'Toyota' if users frequently look for cars from either of these manufacturers
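A sketch of the workload side, assuming the workload has been reduced to a flat list of attribute values mentioned in past queries (the normalization by the most-asked value is an assumption):

from collections import Counter

def qf_table(workload_values):
    # qf(v): how often value v was asked for, relative to the most-asked value
    counts = Counter(workload_values)
    top = max(counts.values())
    return {v: c / top for v, c in counts.items()}

# e.g. qf_table(['Honda', 'Toyota', 'Honda', 'BMW']) -> {'Honda': 1.0, 'Toyota': 0.5, 'BMW': 0.5}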

Page 46

AutoRank [Agrawal, et al, 2003] – Implementation

Store approximate representations of the IDF and QF values using a smooth function
– Minimal storage required
– IDF and QF values can be quickly retrieved at runtime

ITA (Index-based Threshold Algorithm)
– Uses available, existing indexes (B+ trees)
– Defines the threshold by computing the best possible tuple among the data not yet examined
– Stops processing when the similarity of this tuple is no greater than the similarity of the lowest-ranking tuple in the top-k buffer

Page 47

AutoRank [Agrawal, et al, 2003] – Experimental Results

Used a large realtor database from http://homeadvisor.microsoft.com and MS SQL Server

Measured result quality via user studies
– For each test query, asked users to identify relevant and irrelevant tuples, then compared the results of QF and IDF queries against the users' responses

ITA was judged more efficient than SQL Server's top-k operator when indexes exist

Page 48

Conclusions

Clearly, an exciting and worthwhile field

Research has gone in several directions, but all of it shares roots in Fagin's and Carey's work

Combines many areas of computer science
– Artificial Intelligence (fuzzy logic)
– Information Retrieval

Page 49

The Future

Implementation by major RDBMS vendors
– Microsoft should be among the first to revamp its top-k operator, as in-house research [Agrawal, et al, 2003] has provided a smarter, faster technique

Explore more complex ranking functions that cannot easily be mapped to range queries or used with indexes

Page 50

References

M. J. Carey, D. Kossmann. On Saying "Enough Already!" in SQL. SIGMOD Conference 1997: 219-230.

D. Donjerkovic, R. Ramakrishnan. Probabilistic Optimization of Top N Queries. VLDB 1999: 411-422.

R. Fagin. Combining Fuzzy Information from Multiple Systems. PODS 1996: 216-226.

S. Nepal, M. V. Ramakrishna. Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29.

S. Chaudhuri, L. Gravano. Evaluating Top-k Selection Queries. VLDB 1999: 397-410.

K. C. Chang, S. Hwang. Minimal Probing: Supporting Expensive Predicates for Top-k Queries. SIGMOD Conference 2002: 346-357.

S. Agrawal, S. Chaudhuri, G. Das, A. Gionis. Automated Ranking of Database Query Results. CIDR 2003.