Ranked Queries over sources with Boolean Query Interfaces without Ranking Support

Vagelis Hristidis, Florida International University
Yuheng Hu, Arizona State University
Panos Ipeirotis, New York University

Motivation: PubMed (and USPTO, and LinkedIn, and…)

• PubMed offers ranking only by date, author, title, or journal
• Users usually prefer ranking by relevance
– Measured by an IR ranking function, such as tf-idf

Problem Definition
• Input
– Query Q contains terms t1, …, tn
– Database D contains documents d1, …, dm
• Output
– Top-k documents ranked according to a relevance score function
• Example of ranking function: tf.idf
• Baseline: submit a disjunctive query with all query keywords, retrieve all the documents, re-rank locally
• Problem with the baseline method: too many results!
– “immunodeficiency virus structure”: 1,451,446 results
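The baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the in-memory `DOCS` collection, the helper names, and the particular tf.idf variant (raw term count times log(N/df)) are all assumptions made for the example. A real source would also not expose document frequencies this way.

```python
import math
from collections import Counter

# Hypothetical toy collection standing in for the remote database.
DOCS = {
    1: "virus structure",
    2: "immunodeficiency virus",
    3: "protein structure",
    4: "gene expression",
}

def boolean_or(terms):
    """Emulate a disjunctive Boolean query: ids of docs containing any term."""
    wanted = set(terms)
    return [d for d, text in DOCS.items() if wanted & set(text.split())]

def tf_idf(doc_id, terms):
    """Assumed scoring variant: sum over query terms of tf * log(N / df)."""
    counts = Counter(DOCS[doc_id].split())
    score = 0.0
    for t in terms:
        df = sum(1 for text in DOCS.values() if t in text.split())
        if df:
            score += counts[t] * math.log(len(DOCS) / df)
    return score

def baseline_top_k(terms, k):
    """Retrieve every match of the OR query, then re-rank locally."""
    matches = boolean_or(terms)
    return sorted(matches, key=lambda d: tf_idf(d, terms), reverse=True)[:k]

top2 = baseline_top_k(["immunodeficiency", "virus", "structure"], 2)
```

The cost of this approach is dominated by `boolean_or`: every matching document has to be fetched before any ranking can happen, which is what makes it impractical at the result sizes quoted above.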

Query Relaxation Approach

• A tf.idf query has OR semantics
• Using queries with AND semantics returns promising documents earlier on
• Gradual query relaxation allows fast execution
• Key questions:
– Which (conjunctive) queries to execute?
– When to stop?
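To make the space of candidate queries concrete, the sketch below enumerates every conjunctive relaxation of a query, from the most restrictive (all terms) to the least restrictive (single terms). The function names and the `AND`-joined query syntax are assumptions for illustration; the slides do not specify an enumeration order.

```python
from itertools import combinations

def relaxations(terms):
    """Yield conjunctive sub-queries, most restrictive first."""
    for size in range(len(terms), 0, -1):
        for subset in combinations(terms, size):
            yield subset

def to_boolean(subset):
    """Render a term subset as a conjunctive query string (assumed syntax)."""
    return " AND ".join(subset)

candidates = [to_boolean(q) for q in relaxations(["immunodeficiency", "virus", "structure"])]
```

A query with n terms has 2^n − 1 such relaxations, so choosing which ones to execute and when to stop matters when the source imposes a daily quota.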

Problem Setting and Challenges
• Boolean query interface (e.g., PubMed)
• Limited data access through web service (quota per day)
• No useful ranking functions
• No indices to rely on
• No statistics exported from database

Probabilistic Approach
• Document score (formula not recoverable from the slide text; its annotations were):
– idf (easy part)
– tf (challenging part)
• Estimate tf (and scores) probabilistically:
– The tf of a term in a database tends to follow a Poisson distribution, with a parameter specific to that term in that database
– Document scores also follow a Poisson distribution
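The Poisson assumption for term frequencies can be illustrated with a short sketch. It assumes the standard Poisson probability mass function and the standard maximum-likelihood estimate of its parameter (the sample mean) for a random sample; the function names and the sample values are made up for the example.

```python
import math

def poisson_pmf(x, lam):
    """Pr{tf = x} when tf ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def estimate_lambda(tf_sample):
    """Maximum-likelihood estimate of lam from a random sample of tf values."""
    return sum(tf_sample) / len(tf_sample)

# Hypothetical tf values of one term across a random sample of documents.
sample = [0, 1, 2, 1]
lam = estimate_lambda(sample)
prob_absent = poisson_pmf(0, lam)  # probability that a document lacks the term
```

Note that the sample mean is only appropriate when the documents are sampled at random, which is not the case for documents returned by keyword queries.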

Probabilistic top-k with query relaxation

• Querying strategy
– How to pick a good query candidate?
– A good query should have good “benefit”
• Benefit: the probability that a document in the results of relaxed query q is in the top-k

Pr{Score_Q(D,q) > τ}

– q: the query candidate
– τ: the k-th highest score so far
• The score follows a Poisson distribution, which is a function of the λ parameters of the query terms in Q
• We choose the query candidate q with maximum probability
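As a rough sketch of how such a benefit could be evaluated, the code below computes a Poisson tail probability and picks the candidate with the largest one. It makes strong simplifying assumptions that are not from the slides: scores are treated as integer-valued, and each candidate query is given a single, already-known Poisson rate for the scores of its results (`score_rates`, a hypothetical input). How the authors derive that rate from the per-term λ parameters is not shown in the source.

```python
import math

def poisson_tail(lam, tau):
    """Pr{X > tau} for X ~ Poisson(lam)."""
    upto = math.floor(tau)
    if upto < 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** x / math.factorial(x) for x in range(upto + 1))
    return max(0.0, 1.0 - cdf)

def best_candidate(score_rates, tau):
    """Pick the candidate query whose results most likely beat the threshold tau."""
    return max(score_rates, key=lambda q: poisson_tail(score_rates[q], tau))

# Hypothetical rates for two relaxations of a query.
rates = {("virus", "structure"): 3.0, ("virus",): 1.0}
choice = best_candidate(rates, tau=2)
```

As τ rises during execution, every tail probability shrinks, which is what eventually allows the process to stop.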

Estimation of Poisson Parameters
• Sample-based estimation: fetch documents, construct a sample, use estimates from the sample
– Needs a very large sample for reliable estimates
• Query-based estimation: combine sampling and query execution
– Every query generates a sample and provides candidate top-k docs
– Main challenge: adjust estimates to compensate for querying bias (we are looking for the top-k documents, not performing random sampling)

Query-based Sampling
• The document sample returned for each query is not random!
• The sample is “conditional” on the query terms (which are guaranteed to appear)
– Estimates must account for the fact that queries are designed to find the top-k, not to sample at random
• Without correction, estimates are significantly off
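One textbook way to account for a term that is guaranteed to appear is to model its observed tf as a zero-truncated Poisson and solve for λ from the observed mean. The sketch below does this by bisection. It is an assumption chosen for illustration, not necessarily the correction the authors use, and it addresses only the "term must appear" conditioning, not any other source of bias.

```python
import math

def truncated_mean(lam):
    """Mean of a zero-truncated Poisson: lam / (1 - exp(-lam))."""
    return lam / -math.expm1(-lam)

def corrected_lambda(observed_mean, iterations=100):
    """Solve truncated_mean(lam) = observed_mean for lam by bisection.

    observed_mean is the average tf of a query term over documents that
    were returned because they contain that term (so tf >= 1 in each).
    """
    if observed_mean <= 1.0:
        return 0.0  # every returned document had tf == 1: lam is near zero
    lo, hi = 1e-9, observed_mean  # the truncated mean always exceeds lam
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if truncated_mean(mid) < observed_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

naive = 2.313        # hypothetical mean tf observed in a query result
lam = corrected_lambda(naive)
```

Using the observed mean directly would overestimate λ, because documents with tf = 0 can never show up in the result of a query that requires the term.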

Top-k algorithm using query relaxation
1. Send a conjunctive query with all terms to the database
2. Update statistics for each term using estimates from the biased sample
3. Compute the benefit of each possible query relaxation
4. If the benefit (i.e., the probability of finding a top-k document) is below the threshold, stop; else go to step 1, now sending the relaxation with the highest benefit
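The control flow of these four steps can be sketched as a loop. Everything pluggable here is a hypothetical stand-in: `run_query`, `score`, `update_stats`, and `benefit` are callables supplied by the caller, the toy `DOCS` collection is invented, and the toy benefit (fraction of query terms kept) merely takes the place of the probability Pr{Score > τ}.

```python
from itertools import combinations

def relaxations(terms):
    """All conjunctive sub-queries, most restrictive first."""
    for size in range(len(terms), 0, -1):
        yield from combinations(terms, size)

def top_k_by_relaxation(terms, k, run_query, score, update_stats, benefit, threshold):
    seen, executed = {}, set()
    query = tuple(terms)                      # step 1: start with all terms
    while query is not None:
        docs = run_query(query)               # conjunctive Boolean query
        executed.add(query)
        for doc in docs:
            seen[doc] = score(doc, terms)     # local score against the full query
        update_stats(query, docs)             # step 2: refresh term statistics
        ranked = sorted(seen.values(), reverse=True)
        tau = ranked[k - 1] if len(ranked) >= k else 0.0
        scored = [(benefit(q, tau), q)        # step 3: benefit of each relaxation
                  for q in relaxations(terms) if q not in executed]
        best_benefit, best_query = max(scored, default=(0.0, None))
        # step 4: stop when even the best relaxation is unpromising
        query = best_query if best_benefit >= threshold else None
    return sorted(seen, key=seen.get, reverse=True)[:k]

# Toy setup: doc id -> term frequencies (hypothetical data).
DOCS = {"d1": {"a": 2, "b": 1, "c": 1}, "d2": {"a": 1, "b": 2}, "d3": {"c": 1}, "d4": {"a": 1}}
issued = []

def run_query(query):
    issued.append(query)
    return [d for d, tf in DOCS.items() if all(t in tf for t in query)]

def score(doc, terms):
    return sum(DOCS[doc].get(t, 0) for t in terms)

result = top_k_by_relaxation(["a", "b", "c"], 2, run_query, score,
                             update_stats=lambda q, docs: None,
                             benefit=lambda q, tau: len(q) / 3,
                             threshold=0.5)
```

In this toy run the loop issues the full query and the three two-term relaxations, then stops before any single-term query because their stand-in benefit falls below the threshold.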

Experiments
• Datasets
– PubMed
– TREC
• Quality measure
– Spearman’s footrule
• Algorithms
– Baseline
– Summary-based
– Query-based

Experiments: Quality
• Footrule distance measured relative to the baseline (baseline = retrieve everything, fetch locally, re-rank)
• Lower values are better
• Query-based sampling is consistently better than the alternatives
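For reference, the basic form of Spearman's footrule between two rankings of the same items is the sum of the absolute differences in each item's position. The sketch below implements that basic form only; how the experiments handle top-k lists that contain different documents, or whether they normalize the distance, is not stated in the slides.

```python
def footrule(ranking_a, ranking_b):
    """Sum of absolute position differences between two rankings of the same items."""
    if set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items")
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))

identical = footrule(["d1", "d2", "d3"], ["d1", "d2", "d3"])
reversed_order = footrule(["d1", "d2", "d3"], ["d3", "d2", "d1"])
```

A distance of zero means the approximate ranking matches the baseline ranking exactly.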

Experiments: Efficiency
• Measured the number of documents retrieved, the number of queries, and the execution time of the alternative techniques

Conclusion
• Technique for top-k queries on top of document databases without ranking support
• Introduction of an exploration-exploitation framework for building the necessary statistics on the fly, during query execution
• Order-of-magnitude efficiency improvements, with small losses in quality

Thank you!

Questions?
