1
Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by Raghunath Ravi
Sivaramakrishnan Subramani
CSE@UTA
2
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
3
Motivation
Many-answers problem
Two alternative solutions:
◦ Query reformulation
◦ Automatic ranking
Apply the probabilistic model from IR to DB tuple ranking
4
Example – Realtor Database
House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too Many Results!
Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable
5
Rank According to Unspecified Attributes
Score of a Result Tuple t depends on
Global Score: Global Importance of Unspecified Attribute Values [CIDR2003]
◦ E.g., Newer Houses are generally preferred
Conditional Score: Correlations between Specified and Unspecified Attribute Values
◦ E.g., Waterfront ⇒ BoatDock
◦ E.g., Many Bedrooms ⇒ Good School District
7
Key Problems
Given a Query Q:
How to combine the Global and Conditional Scores into a Ranking Function?
◦ Use Probabilistic Information Retrieval (PIR).
How to calculate the Global and Conditional Scores?
◦ Use Query Workload and Data.
9
System Architecture
11
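PIR Review
Bayes' Rule:
$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$
Product Rule:
$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$
Scoring function:
$$Score(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$
Document (Tuple) t, Query Q
R: Relevant Documents
$\bar{R} = D - R$: Irrelevant Documents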
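The last step of the score drops the factor p(R)/p(R̄) because it is the same for every tuple, so it cannot change the ranking. The following minimal sketch illustrates this with made-up numbers; the tuple names, likelihood values, and the prior `p_r` are all hypothetical.

```python
def odds_score(p_t_given_r, p_t_given_not_r, p_r):
    """Posterior odds p(R|t) / p(not R|t) via Bayes' rule; p(t) cancels out."""
    return (p_t_given_r * p_r) / (p_t_given_not_r * (1.0 - p_r))


def likelihood_ratio(p_t_given_r, p_t_given_not_r):
    """Simplified score p(t|R) / p(t|not R), proportional to the odds."""
    return p_t_given_r / p_t_given_not_r


# Hypothetical likelihoods (p(t|R), p(t|not R)) for three tuples.
likelihoods = {"t1": (0.30, 0.10), "t2": (0.20, 0.40), "t3": (0.05, 0.01)}
p_r = 0.2  # hypothetical prior probability of relevance

by_odds = sorted(
    likelihoods, key=lambda t: odds_score(*likelihoods[t], p_r), reverse=True
)
by_ratio = sorted(
    likelihoods, key=lambda t: likelihood_ratio(*likelihoods[t]), reverse=True
)
print(by_odds, by_ratio)
```

Both orderings agree, which is why the ranking function only needs the likelihood ratio.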
12
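Adaptation of PIR to DB
Tuple t is considered as a document
Partition t into t(X), the values of the attributes specified in the query, and t(Y), the values of the unspecified attributes
t(X) and t(Y) are written as X and Y
Derive from the initial scoring function until the final ranking function is obtained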
13
Preliminary Derivation
14
Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$
$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$
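Under this assumption, the probability of a whole set of values given a condition C factorizes into per-value probabilities. A minimal sketch follows; the function name `p_values_given_c` and the probabilities in `atomic_given_c` are hypothetical and only illustrate the product.

```python
from math import prod


def p_values_given_c(values, atomic):
    """p(X | C) as the product over x in X of p(x | C).

    `atomic` maps each (attribute, value) pair to its probability given C.
    """
    return prod(atomic[v] for v in values)


# Hypothetical per-value probabilities given some condition C.
atomic_given_c = {("City", "Seattle"): 0.4, ("Waterfront", True): 0.1}

p_x = p_values_given_c(
    [("City", "Seattle"), ("Waterfront", True)], atomic_given_c
)
print(p_x)
```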
15
Continuing Derivation
16
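Pre-computing Atomic Probabilities in Ranking Function
$$Score(t) \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}$$
Use Workload
◦ $p(y \mid W)$: relative frequency of y in W
◦ $p(x \mid y, W)$: (# of tuples in W that contain x and y) / (# of tuples in W that contain y)
Use Data
◦ $p(y \mid D)$: relative frequency of y in D
◦ $p(x \mid y, D)$: (# of tuples in D that contain x and y) / (# of tuples in D that contain y)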
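The atomic probabilities can be estimated by simple counting. The sketch below is one possible way to do it, assuming both the database D and the workload W are represented as lists of attribute-to-value dictionaries (each workload entry holding the values a past query specified), and taking the conditional p(x | y, ·) as the fraction of entries containing y that also contain x. The helper names and the toy rows are hypothetical.

```python
def contains(row, item):
    """True if the row has the given (attribute, value) pair."""
    attr, val = item
    return row.get(attr) == val


def rel_freq(rows, y):
    """Estimate p(y | rows) as the relative frequency of y."""
    return sum(contains(r, y) for r in rows) / len(rows)


def cond_freq(rows, x, y):
    """Estimate p(x | y, rows) among the rows that contain y."""
    with_y = [r for r in rows if contains(r, y)]
    if not with_y:
        return 0.0
    return sum(contains(r, x) for r in with_y) / len(with_y)


# Toy database D and workload W (hypothetical contents).
D = [
    {"City": "Seattle", "Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "Waterfront": True, "BoatDock": False},
    {"City": "Seattle", "Waterfront": False, "BoatDock": False},
    {"City": "Bellevue", "Waterfront": False, "BoatDock": False},
]
W = [
    {"Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "BoatDock": True},
    {"City": "Seattle"},
    {"Waterfront": True},
]

x, y = ("Waterfront", True), ("BoatDock", True)
print(rel_freq(W, y), rel_freq(D, y), cond_freq(W, x, y), cond_freq(D, x, y))
```

In this toy data BoatDock is requested more often in the workload than it occurs in the data, which is the kind of signal the global score is meant to capture.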
18
Architecture of Ranking Systems
19
Scan Algorithm
Preprocessing - Atomic Probabilities Module
◦ Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y
Execution
◦ Select tuples that satisfy the query
◦ Scan and compute the score for each result tuple
◦ Return the top-K tuples
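The execution phase can be sketched as follows. This is an illustration under assumptions, not the presenters' implementation: it assumes the score of a tuple is the product, over its unspecified values y, of P(y | W) / P(y | D), multiplied by P(x | y, W) / P(x | y, D) for every specified value x. The function names, the dictionary-based lookup tables, and all numbers are hypothetical.

```python
import heapq


def score(row, query, p_y_w, p_y_d, p_xy_w, p_xy_d):
    """Score one result tuple from pre-computed atomic probabilities."""
    xs = list(query.items())                                # specified values
    ys = [(a, v) for a, v in row.items() if a not in query]  # unspecified
    s = 1.0
    for y in ys:
        s *= p_y_w[y] / p_y_d[y]                 # global part
        for x in xs:
            s *= p_xy_w[(x, y)] / p_xy_d[(x, y)]  # conditional part
    return s


def scan_top_k(rows, query, k, *tables):
    """Select matching tuples, score each one, return the top-k tuple ids."""
    matches = [
        tid for tid, r in rows.items()
        if all(r.get(a) == v for a, v in query.items())
    ]
    return heapq.nlargest(
        k, matches, key=lambda tid: score(rows[tid], query, *tables)
    )


# Hypothetical tuples (keyed by tuple id) and atomic probability tables.
rows = {
    1: {"Waterfront": True, "BoatDock": False},
    2: {"Waterfront": True, "BoatDock": True},
    3: {"Waterfront": False, "BoatDock": True},
}
query = {"Waterfront": True}
x = ("Waterfront", True)
p_y_w = {("BoatDock", True): 0.5, ("BoatDock", False): 0.1}
p_y_d = {("BoatDock", True): 0.25, ("BoatDock", False): 0.75}
p_xy_w = {(x, ("BoatDock", True)): 0.5, (x, ("BoatDock", False)): 0.2}
p_xy_d = {(x, ("BoatDock", True)): 1.0, (x, ("BoatDock", False)): 0.4}

top = scan_top_k(rows, query, 2, p_y_w, p_y_d, p_xy_w, p_xy_d)
print(top)
```

Every matching tuple is scored, so the cost grows with the size of the answer set.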
20
Beyond Scan Algorithm
Scan algorithm is InefficientMany tuples in the answer set
Another extremePre-compute top-K tuples for all possible queriesStill infeasible in practice
Trade-off solutionPre-compute ranked lists of tuples for all possible atomic queriesAt query time, merge ranked lists to get top-K tuples
21
Output from Index Module
CondList Cx
◦ {AttName, AttVal, TID, CondScore}
◦ B+ tree index on (AttName, AttVal, CondScore)
GlobList Gx
◦ {AttName, AttVal, TID, GlobScore}
◦ B+ tree index on (AttName, AttVal, GlobScore)
22
Index Module
23
Preprocessing Component
Preprocessing
◦ For each distinct value x of the database, calculate and store the Conditional (Cx) and the Global (Gx) lists as follows
◦ For each tuple t containing x, calculate the conditional score and the global score of t and add them to Cx and Gx respectively
◦ Sort Cx, Gx by decreasing scores
Execution
◦ Query Q: X1=x1 AND … AND Xs=xs
◦ Execute the Threshold Algorithm [Fag01] on the following lists: Cx1,…,Cxs, and Gxb, where Gxb is the shortest list among Gx1,…,Gxs
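The merge step can be sketched with a simplified in-memory version of the Threshold Algorithm. This sketch assumes each list holds (TID, score) pairs sorted by decreasing positive score, that the overall score of a tuple is the product of its scores across all lists, and that a tuple missing from any list does not satisfy the query. The function name and the example lists are hypothetical.

```python
import heapq
from math import prod


def threshold_top_k(lists, k):
    """Return the ids of the k best tuples under product aggregation."""
    lookup = [dict(lst) for lst in lists]  # random access by tuple id
    seen = set()
    top = []                               # min-heap of (score, tid)
    # Every matching tuple appears in all lists, so once the shortest
    # list is exhausted all candidates have been seen.
    for depth in range(min(len(lst) for lst in lists)):
        for lst in lists:                  # sorted access, one step per list
            tid, _ = lst[depth]
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in table for table in lookup):
                total = prod(table[tid] for table in lookup)
                if len(top) < k:
                    heapq.heappush(top, (total, tid))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, tid))
        # No unseen tuple can score above the product of the last seen scores.
        threshold = prod(lst[depth][1] for lst in lists)
        if len(top) == k and top[0][0] >= threshold:
            break
    return [tid for _, tid in sorted(top, reverse=True)]


# Hypothetical lists for a query with two specified values.
c_x1 = [(2, 0.9), (1, 0.5), (3, 0.4), (4, 0.1)]
c_x2 = [(1, 0.8), (2, 0.7), (3, 0.2)]
g_xb = [(3, 2.0), (2, 1.5), (1, 1.0)]

best = threshold_top_k([c_x1, c_x2, g_xb], 2)
print(best)
```

The early exit is what lets the merge stop before reading the lists in full, unlike the scan approach, which scores every result tuple.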
24
List Merge Algorithm
26
Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)
Software and Hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, connected to RDBMS through DAO
27
Quality Experiments
Conducted on Seattle Homes and Movies tables
Collect a workload from users
Compare the Conditional Ranking Method in the paper with the Global Method [CIDR03]
28
Quality Experiment-Average Precision
For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
Let each user mark 10 tuples in Hi as most relevant to Qi
Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
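One plausible reading of "how closely" is the fraction of the algorithm's returned tuples that the user also marked as relevant. The sketch below computes that overlap; the function name and the tuple ids are hypothetical, and the exact measure used in the experiments may differ.

```python
def precision(marked, returned):
    """Fraction of returned tuples that the user marked as relevant."""
    returned = list(returned)
    return len(set(marked) & set(returned)) / len(returned)


# Hypothetical tuple ids: the user marks 10 tuples, the algorithm returns 10.
user_marked = set(range(10))
algorithm_top10 = [0, 1, 2, 3, 4, 5, 6, 20, 21, 22]

p = precision(user_marked, algorithm_top10)
print(p)
```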
29
Quality Experiment- Fraction of Users Preferring Each Algorithm
5 new queries
Users were given the top-5 results
30
Performance Experiments
Datasets
Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432
Compare 2 Algorithms:
◦ Scan algorithm
◦ List Merge algorithm
31
Performance Experiments – Pre-computation Time
32
Performance Experiments – Execution Time
34
Conclusions – Future Work
Conclusions
◦ Completely automated approach for the Many-Answers Problem which leverages data and workload statistics and correlations
◦ Based on PIR
Drawbacks
◦ Multiple-table queries
◦ Non-categorical attributes
Future Work
◦ Empty-Answer Problem
◦ Handle plain text attributes
35
Questions?