Probabilistic Information Retrieval Approach for Ranking of Database Query Results Authors: SURAJIT CHAUDHURI, GAUTAM DAS, VAGELIS HRISTIDIS, GERHARD WEIKUM Presenter: Ketaki Gadre


Introduction (1/4)
- Database query retrieval model: the Boolean model
- Many-Answers Problem: too many tuples are returned when the query is not very selective

Introduction (2/4)
- Realtor database: each tuple represents a home for sale in the US
- Attributes: TID, Price, City, SchoolDistrict, View, Pool, BoatDock
- Query: select * from Homes where City=Seattle and View=Waterfront

Introduction (3/4)
- The Many-Answers Problem in IR
- Query reformulation techniques: the user is prompted to refine the query
- Automatic ranking: query results are ranked by degree of relevance

Introduction (4/4)
- The Many-Answers Problem in databases
- Query: select * from Homes where City=Seattle and View=Waterfront
- Automatic ranking: preferable to first return homes with other desirable attributes, such as a good SchoolDistrict, a BoatDock, etc.

Approach (1/3)
- Look beyond the attributes specified in the query
- The ranking function of a tuple is based on:
  - Global score: captures the global importance of unspecified attribute values
  - Conditional score: captures the strengths of dependencies between specified and unspecified attribute values

Approach (2/3)
- Query: select * from Homes where City=Seattle and View=Waterfront
- SchoolDistrict=Excellent is globally desirable, so such homes get a high rank
- BoatDock=Yes: people desiring a waterfront are likely to also desire a boat dock, so such homes get a high rank

Approach (3/3)
- Challenge: translate these intuitions into a principled ranking function
- Proposed solution: probabilistic IR (PIR)
- Why PIR? It can be extended to model data dependencies and correlations

Outline
- Adaptation of PIR to structured data
- Special cases
- Generalizations
- Implementation
- Experiments

Probabilistic IR: Overview
- User information need vs. document collection
- Document representation and query representation
- How to match the two?

Some Probability Formulae
- Bayes' Rule: p(a|b) = p(b|a) p(a) / p(b)
- Product Rule: p(a,b) = p(a|b) p(b)

Probabilistic IR: Notations

Probabilistic IR: Ranking Function

How to adapt the Probabilistic IR model for structured databases?

Notations

Types of Queries

Adaptation for Structured Data (1/2), (2/2)

Limited Independence Assumption

Presence of Functional Dependencies
- If attributes are related through FDs, the equation can be derived without making the limited independence assumption

Functional Dependencies

Presence of FDs (1/4) through (4/4)

- R, the set of relevant tuples, is unknown
- How to estimate p(t|R)?

Workload-Based Estimation (1/6) through (6/6)
- The resulting score factors into a Global part and a Conditional part
- A variant of the equation accounts for functional dependencies
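The global-times-conditional scoring idea above can be sketched in a few lines of Python. This is only a hedged illustration of the factored score, not the paper's implementation: the frequency tables and all names (`score`, `p_w`, `p_d`, `p_cond_w`, `p_cond_d`) are hypothetical stand-ins for the workload-based and data-based probability estimates.

```python
from functools import reduce

def score(tuple_vals, specified, p_w, p_d, p_cond_w, p_cond_d):
    """Score a tuple as (global part) x (conditional part).

    tuple_vals: dict attr -> value for the whole tuple
    specified:  set of attributes fixed by the query (the X values)
    p_w[y], p_d[y]: estimated probability of unspecified value y
        in the workload W and in the data D
    p_cond_w[(x, y)], p_cond_d[(x, y)]: estimated probability of
        specified value x given unspecified value y, in W and in D
    """
    unspecified = [v for a, v in tuple_vals.items() if a not in specified]
    spec_vals = [v for a, v in tuple_vals.items() if a in specified]
    # Global part: how popular each unspecified value is overall.
    g = reduce(lambda acc, y: acc * p_w[y] / p_d[y], unspecified, 1.0)
    # Conditional part: how strongly the specified values co-occur
    # with each unspecified value.
    c = 1.0
    for x in spec_vals:
        for y in unspecified:
            c *= p_cond_w[(x, y)] / p_cond_d[(x, y)]
    return g * c
```

With these stand-in tables, a tuple whose unspecified values are popular in the workload relative to the data scores higher, which is the intuition behind ranking waterfront homes with boat docks first.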

Calculating Atomic Probabilities (1/2), (2/2)

Special Cases
- Ranking function in the absence of a workload
- Ranking function assuming no dependencies between attributes

Ranking Function in the Absence of a Workload

Ranking Function Assuming No Dependencies Between Attributes
- Make the independence assumption between all attributes

Generalizations

IN Queries
- Recall the equation for the score

IN Conditions in the Query (1/4)
- Score function with IN conditions in the query

IN Conditions in the Query (2/4)
- Score function for point queries

IN Conditions in the Query (3/4), (4/4)
- An extra factor needs to be multiplied in
- The equivalent equation again factors into Global and Conditional parts

IN Conditions in the Workload
- Conceptually expand the workload: split each IN query into point queries
- Query: City IN (Bellevue, Redmond, Carnation) AND Price IN (High, Moderate)
- Split into 3 × 2 = 6 point queries
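The IN-to-point-query expansion described above is just a Cartesian product over the value lists. A minimal sketch (the function name `split_in_query` is illustrative, not from the paper):

```python
from itertools import product

def split_in_query(conditions):
    """Expand a query with IN conditions into its point queries.

    conditions: dict attr -> list of allowed values.
    Returns one dict per point query (the Cartesian product).
    """
    attrs = list(conditions)
    return [dict(zip(attrs, combo))
            for combo in product(*(conditions[a] for a in attrs))]

q = {"City": ["Bellevue", "Redmond", "Carnation"],
     "Price": ["High", "Moderate"]}
point_queries = split_in_query(q)
# 3 cities x 2 price levels = 6 point queries
```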

Numeric Attributes (1/2)
- Query: Age BETWEEN (5, 10) AND Sqft BETWEEN (2500, 3000)
- Simple approach: treat each numerical value as a categorical value and convert queries with range conditions to queries with IN conditions
- Problem: many distinct values are not adequately represented in the workload

Numeric Attributes (2/2)
- Solution: discretize the numerical domain into buckets, then treat it as categorical data
- A histogram-based bucketing technique is used
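As a rough sketch of the discretization step, here is simple equi-depth bucketing; the paper's histogram-based technique may differ in its details, and these function names are illustrative:

```python
def equi_depth_buckets(values, num_buckets):
    """Split numeric values into buckets of roughly equal tuple
    counts and return the bucket boundaries."""
    vals = sorted(values)
    n = len(vals)
    return [vals[i * n // num_buckets] for i in range(1, num_buckets)]

def bucket_of(value, bounds):
    """Map a numeric value to its bucket index, turning the numeric
    attribute into a categorical one."""
    return sum(value >= b for b in bounds)
```

A range condition such as Sqft BETWEEN (2500, 3000) can then be rewritten as an IN condition over the bucket indices it overlaps.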

Multi-table Databases
- A database can have multiple tables logically connected through foreign keys
- So, create a logical view representing the join of all these tables; this view contains all the attributes of interest
- Apply the ranking methodology to this view

Implementation: Architecture of the Ranking System
- A preprocessing component
- An intermediate knowledge representation layer
- A query processing component

Architecture of the Ranking Function
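The join-view idea can be sketched with sqlite3. The table and column names below are invented for illustration and are not the paper's schema:

```python
import sqlite3

# Two tables connected by a foreign key, collapsed into one logical
# view over which the ranking methodology would then be applied.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Homes(tid INTEGER PRIMARY KEY, city TEXT, agent_id INTEGER);
    CREATE TABLE Agents(agent_id INTEGER PRIMARY KEY, agency TEXT);
    INSERT INTO Homes VALUES (1, 'Seattle', 10), (2, 'Kirkland', 11);
    INSERT INTO Agents VALUES (10, 'Acme'), (11, 'Best');
    CREATE VIEW HomesView AS
        SELECT h.tid, h.city, a.agency
        FROM Homes h JOIN Agents a ON h.agent_id = a.agent_id;
""")
rows = conn.execute("SELECT city, agency FROM HomesView ORDER BY tid").fetchall()
```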

(Figure source: paper)

Preprocessing: Atomic Probabilities Module (1/2), (2/2)
- For numerical attributes, compute histograms
- Stored as database tables in the intermediate knowledge representation layer
- Appropriate indexes are built to enable easy retrieval

Ranking Algorithm (1/2)
- Naïve algorithm (Scan algorithm):
  - Select all the tuples that satisfy the query
  - Scan and compute the score for each such tuple using the information in the intermediate layer
  - Return the top-k tuples
- Inefficient for the Many-Answers Problem
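The naïve Scan algorithm can be sketched as follows; `score_fn` stands in for the statistics-based score, and all names are illustrative:

```python
import heapq

def scan_rank(tuples, query, score_fn, k):
    """Naive Scan algorithm: select every tuple satisfying the
    query, score each one, and keep the top-k. Simple, but slow
    precisely in the Many-Answers case, when many tuples match."""
    matches = (t for t in tuples
               if all(t.get(a) == v for a, v in query.items()))
    return heapq.nlargest(k, matches, key=score_fn)
```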

Ranking Algorithm (2/2)
- The other extreme: pre-compute the top-k tuples for all possible queries, and at query time simply return the appropriate result set
- Infeasible due to combinatorial explosion

How can we obtain a reasonable balance between pre-processing and query processing?

Ranking Algorithm: Approach (1/2), (2/2)
- Adapt a top-k algorithm: the Threshold Algorithm (TA)
- As the number of sorted lists increases, TA's performance rapidly deteriorates
- The number of sorted lists in this ranking function depends on the total number of attributes in the database, which would cause major performance problems
- Need pre-computed data structures that enable an efficient adaptation of TA

Preprocessing: Index Module (1/2), (2/2)

Preprocessing: Index Module Algorithm
(Figure source: paper)

Computation of Score (1/3) through (3/3)
- The score splits into Global and Conditional parts

Query Processing
- Naïve Scan algorithm
- List Merge algorithm: an adaptation of TA whose efficiency depends on the data structures pre-computed by the Index Module

List Merge Algorithm (1/2), (2/2)
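To show the shape of the TA-style merge, here is a generic Threshold Algorithm sketch. This is not the paper's List Merge algorithm, just a minimal TA over per-attribute sorted lists, with a product aggregate (monotone here because the score's factors are positive); all names are illustrative:

```python
import heapq

def threshold_algorithm(lists, scores, k):
    """Generic TA sketch.

    lists:  dict attr -> list of tids in descending order of that
            attribute's partial score (sorted access)
    scores: dict attr -> dict tid -> partial score (random access)
    k:      number of results wanted
    Returns up to k (total_score, tid) pairs, best first.
    """
    seen, top = set(), []          # top is a min-heap of size <= k
    depth = 0
    while depth < max(len(l) for l in lists.values()):
        last = {}                  # last partial score seen per list
        for attr, lst in lists.items():
            if depth >= len(lst):
                continue
            tid = lst[depth]
            last[attr] = scores[attr][tid]
            if tid not in seen:
                seen.add(tid)
                total = 1.0        # aggregate = product of factors
                for a in lists:
                    total *= scores[a][tid]
                heapq.heappush(top, (total, tid))
                if len(top) > k:
                    heapq.heappop(top)
        # Threshold: best possible score of any unseen tuple.
        threshold = 1.0
        for v in last.values():
            threshold *= v
        if len(top) == k and top[0][0] >= threshold:
            break                  # no unseen tuple can enter top-k
        depth += 1
    return sorted(top, reverse=True)
```

The early stop is what lets the merge avoid scanning every matching tuple, which is the point of adapting TA for the Many-Answers case.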

Limited Available Space (1/2)
- Assumption: there is enough space available to build the conditional and global lists
- What if space is an expensive resource?

Limited Available Space (2/2)

Evaluating IN and Range Queries
- IN query: split each IN query into a set of point queries, evaluate them as usual, and merge the results
- Numeric range query: convert it into an IN query by replacing the range with buckets, then evaluate it as an IN query

Experiments
- Preparing an experimental setup for testing ranking quality was extremely challenging: no standard benchmarks are available
- So, user studies were conducted to evaluate the rankings

Experiments: Data Sets (1/2)
- MSN HomeAdvisor database: a table of homes for sale in the US, with a mix of categorical and numeric attributes
- Attributes: Price, Year, City, Bedrooms, Bathrooms, Sqft, Garage, etc.
- Also a subset of the HomeAdvisor database, consisting only of homes sold in the Seattle area, to evaluate the effect of database size

Experiments: Data Sets (2/2)
- Internet Movie Database: a table of movies
- Attributes: Title, Year, Genre, Director, etc.

Sizes of Datasets

  Table          NumTuples   Database Size (MB)   Experiments
  Seattle Homes  17,463      1.936                Performance, Quality
  U.S. Homes     1,380,762   140.432              Performance
  Movies         1,446       Less than 1          Quality

Experiments: Setup
- Microsoft SQL Server 2000 RDBMS
- 1 GB of RAM
- Language: C#

Quality Experiments
- Evaluated the quality of three different ranking methods:
  - The paper's ranking method, called Conditional
  - The ranking method described in Agrawal et al. [2003], called Global
  - A baseline Random algorithm
- The surveys involved 14 employees of Microsoft Research

Experimental Setup for the Seattle Homes Table
- Created several different profiles of home buyers, e.g. singles, middle-class families, rich retirees, etc.
- Users behave like these home buyers and post queries against the database; this is how the workload is collected
- Collected several hundred queries, then ran the proposed ranking algorithm on this workload

Experimental Setup for the Movies Table
- Created several different profiles of people watching movies, e.g. people interested in comedies from the 1980s
- Users post queries against the database
- Collected several hundred queries, then ran the proposed ranking algorithm on this workload

Examples of Ranking Results (1/2)
- Query: Bedrooms=4 AND City=Kirkland AND Price=Expensive
- Conditional ranked homes with waterfront views the highest
- Global ranked homes in good school districts the highest

Examples of Ranking Results (2/2)
- Expected: a good school district is globally more popular
- But a waterfront view is more desirable than a good school district for very rich people

- Random produced quite irrelevant results in most cases

Ranking Evaluation
- Two surveys
- First survey: compared the rankings against user rankings, using standard precision/recall metrics
- Second survey: asked users to rate which algorithm's rankings they preferred

First Survey (1/6)
- For each dataset and each test query, generated a set of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
- Presented the queries along with the 30 tuples to each user
- Each user marked the 10 tuples most relevant to the query
- Measured how the 10 tuples marked as relevant by the user matched the 10 tuples returned by each algorithm

First Survey (2/6)
- Recall: the fraction of relevant documents that are retrieved
- Precision: the fraction of retrieved documents that are relevant

- Since the number of relevant tuples and the number of retrieved tuples are both 10, recall and precision are equal

First Survey (3/6)
- IN and Range queries for the Seattle Homes dataset:
  - Q1: Bedrooms=4 AND City IN (Redmond, Kirkland, Bellevue)
  - Q2: City IN (Redmond, Kirkland, Bellevue) AND Price BETWEEN ($700K, $1000K)
  - Q3: Price BETWEEN ($700K, $1000K)
  - Q4: School=1 AND Price BETWEEN ($100K, $200K)

First Survey (4/6)
- The quality of the Conditional ranking was superior to Global
- Random was significantly worse than either

First Survey (5/6)
- Point queries for the Movies dataset:
  - Q1: Genre=Thriller AND Certificate=PG-13
  - Q2: YearMade=1980 AND Certificate=PG-13
  - Q3: Certificate=G AND Sound=Mono
  - Q4: Actor1=Dreyfuss, Richard
  - Q5: Genre=Sci-Fi

First Survey (6/6)
- The quality of the Conditional ranking was superior to Global
- Random was worse than either

Second Survey (1/5)
- Users were given the top-5 results of the three ranking methods for five queries
- Asked to choose which rankings they preferred

Second Survey (2/5)
- IN and Range queries for the Seattle Homes dataset:
  - Q1: Bedrooms=4 AND City IN (Redmond, Kirkland, Bellevue)
  - Q2: City IN (Bellevue, Kirkland) AND Price BETWEEN ($700K, $1000K)
  - Q3: Price BETWEEN ($500K, $700K) AND Bedrooms=4 AND Year>1990
  - Q4: City=Seattle AND Year>1990
  - Q5: City=Seattle AND Bedrooms=2 AND Price=500K

Second Survey (3/5)
- Conditional generally produces rankings of higher quality than Global

Second Survey (4/5)
- Point queries for the Movies dataset:
  - Q1: YearMade=1980 AND Genre=Thriller
  - Q2: Actor1=De Niro, Robert
  - Q3: YearMade=1990 AND Genre=Thriller
  - Q4: YearMade=1995 AND Genre=Comedy
  - Q5: YearMade=1970 AND Genre=Western

Second Survey (5/5)
- These experiments indicate that the Conditional ranking approach gives good results
- But much larger-scale user studies should be done

Performance Experiments
- Compared the performance of the various implementations of the Conditional algorithm: List Merge, its space-saving variants, and Scan
- Used the Seattle Homes and U.S. Homes datasets
- Results are on point queries

Preprocessing Time and Space
- The preprocessing of the List Merge algorithm is dominated by the Index Module
- Time and space scale linearly with the table size

Time and space required to build all the conditional and global lists:

  Dataset        List Building Time   List Size
  Seattle Homes  1,500 ms             7.8 MB
  U.S. Homes     80,000 ms            457.6 MB

Space-Saving Variations (1/3)
- Considered 10 queries with selection conditions that specify two attributes, and measured their execution times
- Compared algorithms:
  - LM: List Merge with all lists available
  - LMM: List Merge where the lists for one of the two specified attributes are missing, which halves the space
  - Scan

Space-Saving Variations (2/3)
- Times averaged over the 10 queries

Execution times for the Seattle Homes dataset

Space-Saving Variations (3/3)
- List Merge and its variations are preferable

Execution times for the U.S. Homes dataset:

  NumSelectedTuples   LM Time (ms)   Scan Time (ms)
  350                 800            6,515
  2,000               700            39,234
  5,000               600            115,282
  30,000              550            566,516
  80,000              500            3,806,531

Varying the Number of Specified Attributes

Varying K in Top-k
- Selected queries with two attributes, which returned about 500 results

Conclusion
- A completely automated approach for the Many-Answers Problem
- Uses data and workload statistics and correlations
- Ranking functions are based on PIR, adapted for structured data
- Experimental results demonstrate the efficiency and quality of the ranking system