Supporting Ranking in Queries: Score-based Paradigm. Russell Greenspan, CS 411, Spring 2004.


Page 1

Supporting Ranking in Queries: Score-based Paradigm

Russell Greenspan, CS 411, Spring 2004

Page 2

Supporting Ranking in Queries – Talk Outline

What
Why
How
– "Out-of-the-box" support
– "Smart" top-k processing

Page 3

Ranking in Queries – What is ranking in queries?

A mechanism to return only the top-k results
– Closest matches to user-specified boolean criteria
– Scoring results based on user-specified predicates

SELECT Address
FROM HousesForSale
ORDER BY Best(Size, Price)

Express similarity, relevance, or preference to a given query

Page 4

What is ranking in queries? – Definitions

Intuitive
– Output an ordered list of k items such that the list includes only those items whose scored rank is greater than that of the items not included

Formal
– "Given retrieval size k and scoring function F, a ranked query returns a list K of k objects (i.e. |K| = k) with query scores, sorted in a descending order, such that F(t1, ..., tn)[u] > F(t1, ..., tn)[v] for all u in K and all v not in K." [Chang, Hwang, 2002]
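To make the formal definition concrete, here is a minimal Python check of the property, where F(u) stands for F(t1, ..., tn)[u] (function and variable names are illustrative, not from the paper):

def is_valid_ranked_result(K, all_objects, F, k):
    # |K| = k, and every object in K strictly outscores every object outside K
    if len(K) != k:
        return False
    outside = [obj for obj in all_objects if obj not in K]
    return all(F(u) > F(v) for u in K for v in outside)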

Page 5

What is ranking in queries? – Differences from traditional queries

How does this differ from traditional queries?
– Traditional queries:
  Do not stop processing until all results are computed
  Do not focus on ranking tuples to best match the input query
– Traditional boolean queries:
  Do not return "close" matches
  Can "over" or "under" match, producing too few or too many results

Page 6

Ranking in Queries – Why use ranking in queries?

Exact matches not required
– Often, something "close enough" satisfies a user's demands

Fuzzy matches desired
– Multimedia/image matching, where the very nature of the query does not involve an exact match

Avoid unnecessary computations
– Find the "best" answers quickly, as opposed to all answers

Page 7

Ranking in Queries – How do we execute ranked queries?

"Out-of-the-box" support
– Perform the query as any other, then sort and return only the first k rows
– Why is this bad?
  Lots of unnecessary processing
  Waste of resources on intermediate results
  If the scoring function is expensive, could result in computation of unneeded scores

Can we do better?

Page 8

How do we execute ranked queries? – "Smart" Ranked Query Execution

Query Processing
– Try to achieve a significant reduction in query execution time
– Use mid-query (i.e. as the query executes) techniques to optimize the query plan for top-k results
– Consider the minimal number of tuples necessary to return k results

Scoring Predicate
– Consider the expense of the scoring function in determining the optimal query plan

Page 9

"Smart" Ranked Query Execution – Two Areas of Research Focus

Top-k processing
– Reducing the number of tuples considered at each intermediate step
  Assumes minimal work is needed to retrieve items sorted by score (i.e. indexes on simple attributes)

Rank function
– Reducing the number of calls to the ranking function
  Assumes the rank calculation is expensive
– Implementing unusual ranking functions

Page 10

"Smart" Ranked Query Execution – Research and Techniques

Reducing the number of tuples considered
– Middleware/Multimedia
  Garlic [Fagin, 1999]
  CHITRA [Nepal, Ramakrishna, 1999]
– Relational
  STOP Operator [Carey, Kossmann, 1997]
  Probabilistic [Donjerkovic, Ramakrishnan, 1999]
  Statistical [Chaudhuri, Gravano, 1999]

Reducing the number of calls to the ranking function
– MPro [Chang, Hwang, 2002]

Implementing unusual ranking functions
– AutoRank [Agrawal, Chaudhuri, Das, Gionis, 2003]

Page 11

"Smart" Ranked Querying (Middleware) – Garlic [Fagin, 1999]

Integrates data from different database systems or non-database data servers
– Relational query set vs. "sorted list"

Example: "Return the reddest covers of Beatles' albums", i.e. (Artist = 'Beatles') AND (AlbumColor LIKE 'red'), where artists are stored relationally and album colors in a multimedia database

Assign a grade to each object
– Boolean: grade is either 0 or 1
– Fuzzy: value 0 <= x <= 1 indicating closeness
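To illustrate the two kinds of grades in Python (the fuzzy measure below is an assumption invented for this example, not Garlic's actual image metric):

def boolean_grade(matches):
    # A boolean predicate grades exactly 0 or 1
    return 1.0 if matches else 0.0

def fuzzy_redness(r, g, b):
    # One possible closeness grade in [0, 1]: the red share of total intensity
    total = r + g + b
    return r / total if total else 0.0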

Page 12

Garlic [Fagin, 1999] – Rank Processing Methods

How do we combine two fuzzy values to retrieve the top-k objects?
– Inefficient
  Consider graded sets of all objects by color and shape
  Compute the combined score for every object, then output the top k objects
– Efficient
  Retrieve objects (sorted by grade) from each subsystem until there are at least k of the same objects in each set
  Compute the combined score for each of these k objects

Page 13

Garlic [Fagin, 1999] – Example Query

Example (combined scoring function = x * y):
Return Top 2 where Color = 'red' AND Shape = 'round'

Object | Roundness | Redness
A      | .6        | .2
B      | .8        | .6
C      | .3        | .1
D      | .2        | .8
E      | .9        | .3
F      | .1        | .5
G      | .7        | .9
H      | .4        | .3

Page 14

Garlic [Fagin, 1999] – Inefficient vs. Efficient Processing

Inefficient
– Calculate the combined score for every object
– Sort by score
– Return the top k objects: {G, B}

Object | Score
A      | .12
B      | .48
C      | .03
D      | .16
E      | .27
F      | .05
G      | .63
H      | .12

Page 15

Garlic [Fagin, 1999] – Inefficient vs. Efficient Processing

Efficient (Fagin's A0 algorithm)
– Consider ordered members from each set until there are k of the same objects in each set
  A1 = {G(.9), D(.8), B(.6)}
  A2 = {E(.9), B(.8), G(.7)}
– Calculate the combined score for each of the k common objects
  G = .9 * .7 = .63
  B = .6 * .8 = .48
– Return these objects ordered by combined score: {G, B}
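A minimal Python sketch of the procedure as described above, run on the roundness/redness tables from the example (this follows the slide's simplification, not Fagin's full A0, which also probes every object seen under sorted access):

roundness = {'A': .6, 'B': .8, 'C': .3, 'D': .2, 'E': .9, 'F': .1, 'G': .7, 'H': .4}
redness   = {'A': .2, 'B': .6, 'C': .1, 'D': .8, 'E': .3, 'F': .5, 'G': .9, 'H': .3}

def a0_top_k(grade_lists, combine, k):
    # Sorted access into each list in parallel, one object per round,
    # until at least k objects have been seen in every list
    ordered = [sorted(g.items(), key=lambda kv: -kv[1]) for g in grade_lists]
    seen = [set() for _ in grade_lists]
    common = set()
    for depth in range(len(ordered[0])):
        for s, lst in zip(seen, ordered):
            s.add(lst[depth][0])
        common = set.intersection(*seen)
        if len(common) >= k:
            break
    # Random access for the remaining grades of each common object
    scored = [(obj, combine(*(g[obj] for g in grade_lists))) for obj in common]
    return sorted(scored, key=lambda pair: -pair[1])[:k]

print(a0_top_k([roundness, redness], lambda x, y: x * y, 2))  # G (~.63), then B (~.48)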

Page 16

Garlic [Fagin, 1999] – Conclusions

Why is this more efficient?
– Incurs the expense of the scoring function k times, as opposed to n times (where n is the total number of items)
– Accesses each subsystem at least k and at most n times, as opposed to exactly n times

Page 17

"Smart" Ranked Querying (Middleware) – CHITRA [Nepal, Ramakrishna, 1999]

Expands on Fagin's Garlic work by proposing a new "multi-step" processing algorithm

Experimental results show a 50% improvement

Page 18

CHITRA [Nepal, Ramakrishna, 1999] – "Multi-step" Algorithm

In each iteration, consider the next sorted item x from each subsystem i

Perform random access into every other subsystem to obtain the other grades of x

Add an object to the result set if its combined grade reaches the threshold; quit when we have k objects
– The threshold is the combined grade of the sorted-access items considered in that iteration

Page 19

CHITRA [Nepal, Ramakrishna, 1999] – Example Query

Back to our example:
Return Top 2 where Color = 'red' AND Shape = 'round'

Object | Roundness | Redness
A      | .6        | .2
B      | .8        | .6
C      | .3        | .1
D      | .2        | .8
E      | .9        | .3
F      | .1        | .5
G      | .7        | .9
H      | .4        | .3

Page 20

CHITRA [Nepal, Ramakrishna, 1999] – Example Scoring Function Results

Consider two scoring functions as examples: min[x, y] and [x * y].

min[x, y]:
Iter. | Items                      | Grades                                     | Threshold        | Result set
1     | i1 = {G(.9)}, i2 = {E(.9)} | G = min[.9, .7] = .7, E = min[.9, .3] = .3 | min[.9, .9] = .9 |
2     | i1 = {D(.8)}, i2 = {B(.8)} | D = min[.8, .2] = .2, B = min[.8, .6] = .6 | min[.8, .8] = .8 |
3     | i1 = {B(.6)}, i2 = {G(.7)} | B = min[.6, .8] = .6, G = min[.7, .9] = .7 | min[.6, .7] = .6 | {G, B}

[x * y]:
Iter. | Items                      | Grades                                 | Threshold     | Result set
1     | i1 = {G(.9)}, i2 = {E(.9)} | G = .9 * .7 = .63, E = .9 * .3 = .27   | .9 * .9 = .81 |
2     | i1 = {D(.8)}, i2 = {B(.8)} | D = .8 * .2 = .16, B = .8 * .6 = .48   | .8 * .8 = .64 |
3     | i1 = {B(.6)}, i2 = {G(.7)} | B = .6 * .8 = .48, G = .7 * .9 = .63   | .6 * .7 = .42 | {G, B}
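Here is a minimal Python sketch of the multi-step loop, reusing the roundness/redness dictionaries from the earlier A0 sketch; it reproduces the iteration tables above:

def multistep_top_k(grade_lists, combine, k):
    ordered = [sorted(g.items(), key=lambda kv: -kv[1]) for g in grade_lists]
    results = {}
    for depth in range(len(ordered[0])):
        frontier = [lst[depth] for lst in ordered]       # one sorted access per list
        threshold = combine(*(grade for _, grade in frontier))
        for obj, _ in frontier:                          # random access for the other grades
            results.setdefault(obj, combine(*(g[obj] for g in grade_lists)))
        final = [(o, s) for o, s in results.items() if s >= threshold]
        if len(final) >= k:
            return sorted(final, key=lambda pair: -pair[1])[:k]

print(multistep_top_k([roundness, redness], min, 2))  # stops at iteration 3 with {G, B}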

Page 21

CHITRA [Nepal, Ramakrishna, 1999] – Conclusions

Why is this more efficient?
– Requires fewer accesses to each subsystem

How do we know this algorithm is correct?
– Proof by contradiction
  Assume an excluded object z that should have been included in place of some included object y, i.e. Rank(z) > Rank(y)
  For Rank(z) > Rank(y) to hold, either:
  – y must have at least one subsystem rank smaller than all subsystem ranks of z, or
  – z must have at least one subsystem rank greater than all subsystem ranks of y
  However, since Rank(z) < Threshold and Rank(y) >= Threshold, Rank(z) cannot be greater than Rank(y), a contradiction

Page 22

"Smart" Ranked Querying (Relational) – STOP Operator [Carey, et al, 1997]

Specifies an extension to the SQL-92 standard to allow a limit on the cardinality of the result
– STOP AFTER

Return a subset of results from each section of the query plan

Implement with a STOP operator
– STOP(N, D, E), where N is the number of desired tuples, D is the sort directive [asc, desc, none], and E is the sort expression
– Heuristically determine when and how to apply it
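As a standalone sketch of what STOP(N, D, E) computes (the paper's contribution is the heuristics for placing the operator in a plan, which are not modeled here):

import heapq
from itertools import islice

def stop(tuples, n, directive='none', expr=None):
    # STOP(N, D, E): emit at most N tuples; with a sort directive,
    # emit the N best tuples under sort expression E
    if directive == 'none':
        return list(islice(tuples, n))
    if directive == 'desc':
        return heapq.nlargest(n, tuples, key=expr)
    return heapq.nsmallest(n, tuples, key=expr)

# e.g. the 10 highest-paid employees: stop(emps, 10, 'desc', lambda e: e.salary)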

Page 23

STOP Operator [Carey, et al, 1997] – Example query plans

Fig. a shows a traditional join plan
– Join all of EMP to DEPT, sort, output the top k

Fig. b shows a plan with STOP operators inserted
– Based on cardinality estimates, only 20 rows of EMP need be joined with 30 rows of DEPT to produce a top-k of 10

Page 24

STOP Operator [Carey, et al, 1997] – Conservative Heuristic

Ensures that every tuple in each intermediate result is guaranteed to generate at least one tuple of the overall query result

Advantages
– No restarts from intermediate processing returning fewer than k results
– Intermediate STOP operators take their N value from the overall query's k value

Disadvantages
– Only inserts STOP operators where all remaining predicates are non-reductive (cannot be used with multi-way joins)

Page 25

STOP Operator [Carey, et al, 1997] – Aggressive Heuristic

Applies the STOP operator wherever it may be beneficial, reducing intermediate results to a greater degree

Chooses the N value using cardinality estimates

Requires a RESTART operator when intermediate processing returns too few results
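A sketch of the restart loop, assuming a hypothetical run_plan(n) that executes the plan with an intermediate STOP(n) pushed down:

def run_with_restart(run_plan, k, total_rows, estimate_factor=1.0):
    # Guess an intermediate cardinality from optimizer estimates; if fewer
    # than k rows survive, RESTART with a (much) larger cutoff
    n = min(total_rows, max(k, int(k * estimate_factor)))
    while True:
        rows = run_plan(n)
        if len(rows) >= k or n == total_rows:
            return rows[:k]
        n = min(total_rows, n * 10)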

Page 26

STOP Operator [Carey, et al, 1997] – Experimental Results

Which heuristic is better?
– Depends on cardinality, the expense of processing intermediate results, the accuracy of prediction, etc.
– With a low expense of processing intermediate results, experiments show aggressive overestimation performs best:

Traditional | Conservative | Aggressive, Underestimate (1/10) | Aggressive, Overestimate (10)
128.3 sec   | 63.9 sec     | 63.1 sec                         | 18.5 sec

Page 27

STOP Operator [Carey, et al, 1997] – Experimental Results

Performance vs. traditional ("out-of-the-box") processing shows benefits in both indexed and non-indexed situations

Page 28

"Smart" Ranked Querying (Relational) – Probabilistic [Donjerkovic, et al, 1999]

Introduces the idea of a 'selection cutoff' to produce the top k results without requiring a sort

Quantifies the risk of fewer than k results being generated, using inherent database statistics
– "List the top 10 paid employees" becomes "List the employees whose salary is greater than x", where x is determined by the distribution of employees' salaries
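A sketch of how the cutoff x might be guessed from a histogram on Salary (illustrative only; the paper additionally quantifies the risk that the cutoff returns fewer than k rows):

def guess_cutoff(buckets, k):
    # buckets: (low, high, row_count) triples over Salary; scan from the
    # highest bucket down until about k rows are expected above the cutoff
    expected = 0
    for low, high, count in sorted(buckets, key=lambda b: -b[1]):
        expected += count
        if expected >= k:
            return low   # then: SELECT ... FROM Emp WHERE Salary > low
    return min(b[0] for b in buckets)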

Page 29

Probabilistic [Donjerkovic, et al, 1999] – Comparison with STOP Operator

In theory, it is likely to be cheaper to simply select the necessary intermediate rows using the cutoff (fig. b) than to perform a sort and return the top k (fig. a)

Page 30

Probabilistic [Donjerkovic, et al, 1999] – Implementation

Leverages the same statistics used by the traditional query optimizer to guess the cutoff
– Histograms
– Selectivity factors

Page 31

Probabilistic [Donjerkovic, et al, 1999] – Performance

For a simple query with no indexes (return the k highest-paid employees, no index on the 'Salary' attribute), easily outperforms the traditional approach (scan, sort, return top k)

Also benefits JOIN queries, due to the complexity of estimating join selectivity

Page 32

"Smart" Ranked Querying (Relational) – Statistical [Chaudhuri, Gravano, 1999]

Expands the probabilistic model

Maps rank queries into boolean range queries

Works with a variety of scoring functions, including Min, Euclidean, and Sum

Page 33

Statistical [Chaudhuri, Gravano, 1999] – Expansion of the probabilistic model

Considers multiple levels of 'selection cutoff', here referred to as 'search score' (Sq)
– NoRestarts: score low enough to guarantee no restarts are needed
– Restarts: score high enough that restarts might result
– Intermediate: score between NoRestarts and Restarts

Page 34

Statistical [Chaudhuri, Gravano, 1999] – Implementation

Determine Sq from histograms
– Choose bounding tuples in each bucket to ensure NoRestarts (fig. a), or tight tuples to minimize selection but potentially require Restarts (fig. b)

Page 35

Statistical [Chaudhuri, Gravano, 1999] – Implementation

Determine the relational query that retrieves all tuples scoring above Sq
– Compute the n-rectangle bounding such tuples
– SELECT *
  FROM R
  WHERE (a1 <= A1 <= b1) AND ... AND (an <= An <= bn)

Compute the score for all returned tuples

Output the top-k tuples with score > Sq, or rerun the query with a lower search score
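As an illustration for the Min scoring function, assuming a per-attribute similarity of the form s_i(t) = 1 - |t.Ai - qi| (an assumption made here for concreteness), a tuple scores at least Sq exactly when every attribute lies within 1 - Sq of the query point, which yields the n-rectangle directly:

def range_query_for_min(q, sq):
    # Min score >= Sq iff every attribute is within 1 - Sq of the query point
    delta = 1 - sq
    clauses = " AND ".join(
        f"({qi - delta:.2f} <= A{i} AND A{i} <= {qi + delta:.2f})"
        for i, qi in enumerate(q, start=1))
    return f"SELECT * FROM R WHERE {clauses}"

print(range_query_for_min([0.5, 0.7], sq=0.8))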

Page 36

Statistical [Chaudhuri, Gravano, 1999] – Expansion of Fagin's model

Expands Fagin's ideas to relational queries
– Substitute a 'search score' query to determine the top tuples for each subsystem
– Use the NoRestarts strategy to ensure that expensive re-querying is avoided

Page 37

"Smart" Ranked Querying (Rank) – MPro [Chang, Hwang, 2002]

Extends top-k querying to expensive (monotonic only) predicates
– As opposed to other work, which assumes the expense of score calculation is minimal

Attempts to minimize the number of scores calculated
– Considers only necessary probes, i.e. only those calculations without which the top-k results cannot be found

Page 38

MPro [Chang, Hwang, 2002] – Determining if a probe is necessary

An object's lowest calculated score so far is its "ceiling score": no further probe can raise the object's final score above it

If the ceiling score falls below a top-k object's complete score, the object is ruled out and no further calculations on it need be performed

Simple example:
– Consider a scoring function like Min, with the top-1 result desired
– If we know object A's combined score over F(x) and F(y) is .8, and we calculate object B's score with respect to F(x) to be .7, then B's score with respect to F(y) need not be calculated (its Min value cannot exceed .7)

Page 39

MPro [Chang, Hwang, 2002] – Determining all necessary probes

Only objects with ceiling scores in the top k need be further evaluated

If objects are kept sorted by current ceiling score:
– For any object u in the top-k slots, its next probe is necessary

Page 40

MPro [Chang, Hwang, 2002] – Minimal Probes Algorithm (MPro)

Priority queue initialization
– Evaluate each object over the first predicate (same as sequentially accessing objects sorted by x)

Necessary probing
– Request from the queue the object with the highest ceiling score
– Evaluate the object over the next predicate y
– Update its ceiling score and reinsert it into the queue

Stop when at least k objects have been completely scored (and output these objects)
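A minimal Python sketch of this loop, assuming Min as the monotonic combining function (so an object's ceiling score is simply the minimum of the predicate scores probed so far):

import heapq

def mpro_top_k(objects, predicates, k):
    # Initialize: probe every object with the first predicate; the heap is
    # ordered by descending ceiling score (negated for Python's min-heap)
    heap = [(-predicates[0](obj), 1, obj) for obj in objects]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        neg_ceiling, probed, obj = heapq.heappop(heap)
        if probed == len(predicates):
            # Fully probed and still on top: its score is final and safe to output
            results.append((obj, -neg_ceiling))
        else:
            # The necessary probe: evaluate the next predicate, lower the ceiling
            ceiling = min(-neg_ceiling, predicates[probed](obj))
            heapq.heappush(heap, (-ceiling, probed + 1, obj))
    return results  # objects assumed orderable, for heap tie-breaking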

Page 41

MPro [Chang, Hwang, 2002] – Further Applications

Incremental results
– Output the top k, then resume processing where it left off for the next k as the user requests

Fuzzy joins
– Treat the join predicate in the same manner

Parallel processing
– Distribute necessary probes across multiple servers
– Distribute the data, calculate the top n over each chunk, merge the results

Page 42

MPro [Chang, Hwang, 2002] – Experimental Results

On the experimental dataset, over 96% of complete probes were found to be unnecessary

Elapsed time improved significantly, from 21009 to 408 seconds for k = 10

Page 43

"Smart" Ranked Querying (Rank) – AutoRank [Agrawal, et al, 2003]

Considers ranking of relational attributes in a manner similar to Information Retrieval (IR)
– IDF Similarity
  Extends TF-IDF based on the frequency of occurrence of attribute values
– QF Similarity
  Uses the database workload to determine how frequently attributes and attribute values are referenced
  A "poor man's relevance feedback"
– ITA
  An index-based top-k algorithm that exploits the above ranking functions

Page 44

AutoRank [Agrawal, et al, 2003] – IDF Similarity

Extend TF (term frequency)
– IR: frequency of terms in a document
– Relational: frequency of values for an attribute

Extend IDF (inverse document frequency)
– IR: total documents / documents containing the term
– Relational: total tuples / tuples where attribute = value

For tuples matching the queried value, the IDF similarity is the attribute's IDF for that value, and 0 otherwise
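The relational analogy, sketched in Python (the ratio is left un-logged to mirror the slide; IR systems usually log-scale it):

from collections import Counter

def relational_idf(column_values):
    # idf(v) = total tuples / tuples where attribute = v
    counts = Counter(column_values)
    n = len(column_values)
    return {v: n / c for v, c in counts.items()}

def idf_similarity(tuple_value, query_value, idf):
    # Nonzero only for tuples that match the queried value
    return idf.get(query_value, 0.0) if tuple_value == query_value else 0.0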

Page 45

AutoRank [Agrawal, et al, 2003] – QF Similarity

Consider the problem with IDF when the desired result is also the most frequent
– In a realty database where homes built in the last three years are most desired, the few entries for old homes (with higher IDF) would be ranked "top"

Instead, use the frequency with which attribute values occur in executed queries to determine ranking (by examining the workload)

Workload analysis can be extended to draw comparative conclusions from attribute values queried together
– Assume similarity between 'Honda' and 'Toyota' if users frequently look for cars from either of these manufacturers
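A sketch of the workload side, assuming the workload has been reduced to a flat list of attribute values mentioned in past queries (the normalization by the most-asked value is an assumption):

from collections import Counter

def qf_table(workload_values):
    # qf(v): how often value v was asked for, relative to the most-asked value
    counts = Counter(workload_values)
    top = max(counts.values())
    return {v: c / top for v, c in counts.items()}

# e.g. qf_table(['Honda', 'Toyota', 'Honda', 'BMW']) -> {'Honda': 1.0, 'Toyota': 0.5, 'BMW': 0.5}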

Page 46

AutoRank [Agrawal, et al, 2003] – Implementation

Store approximate representations of the IDF and QF values using a smooth function
– Minimal storage required
– IDF and QF values can be quickly retrieved at runtime

ITA (Index-based Threshold Algorithm)
– Uses available, existing indexes (B+ trees)
– Defines the threshold by computing the best possible tuple among the data not yet examined
– Stops processing when the similarity of this tuple is no greater than the similarity of the lowest-ranking tuple in the top-k buffer

Page 47

AutoRank [Agrawal, et al, 2003] – Experimental Results

Used a large realtor database from http://homeadvisor.microsoft.com and MS SQL Server

Measured result quality via user studies
– For each test query, asked users to identify relevant and irrelevant tuples, then compared the results of QF and IDF queries against the users' responses

ITA was judged more efficient than SQL Server's top-k operator when indexes exist

Page 48

Conclusions

Clearly, an exciting and worthwhile field

Research has gone in several directions, but all of it shares roots in Fagin's and Carey's work

Combines many areas of computer science
– Artificial Intelligence (fuzzy logic)
– Information Retrieval

Page 49

The Future

Implementation by major RDBMS vendors
– Microsoft should be among the first to revamp its top-k operator, as in-house research [Agrawal, et al, 2003] has provided a smarter, faster technique

Explore more complex ranking functions that cannot easily be mapped to range queries or used with indexes

Page 50

References

M. J. Carey, D. Kossmann. On Saying "Enough Already!" in SQL. SIGMOD Conference 1997: 219-230.

D. Donjerkovic, R. Ramakrishnan. Probabilistic Optimization of Top N Queries. VLDB 1999: 411-422.

R. Fagin. Combining Fuzzy Information from Multiple Systems. PODS 1996: 216-226.

S. Nepal, M. V. Ramakrishna. Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29.

S. Chaudhuri, L. Gravano. Evaluating Top-k Selection Queries. VLDB 1999: 397-410.

K. C. Chang, S. Hwang. Minimal Probing: Supporting Expensive Predicates for Top-k Queries. SIGMOD Conference 2002: 346-357.

S. Agrawal, S. Chaudhuri, G. Das, A. Gionis. Automated Ranking of Database Query Results. CIDR 2003.