Compact Query Term Selection Using Topically Related Text

Date: 2013/10/09
Source: SIGIR'13
Authors: K. Tamsin Maxwell, W. Bruce Croft
Advisor: Dr. Jia-ling Koh
Speaker: Shun-Chen Cheng

Outline
  Introduction
  The PhRank Algorithm
    Graph Construction
    Edge Weight
    Random Walk
    Vertex weights
    Term ranking
  Diversity filter
  Experiment
  Conclusions

Introduction

Query: Locations of volcanic activity which occurred within the present day boundaries of the U.S. and its territories.

Long queries contain words that are peripheral or shared across many topics, so expansion is prone to query drift.

Past work: jointly optimizes weights and term selection using both global statistics and local syntactic features.

Shortcomings of past work:
  Fails to detect or differentiate informative terms
  Doesn't reflect local query context
  Doesn't identify all the informative relations

Goal: a novel term ranking algorithm, PhRank, that extends work on Markov chain frameworks for query expansion to select compact and focused terms from within a query itself.

Principles for Term Selection
An informative word:
  Is informative relative to a query: it accurately represents the meaning of the query.
  Is related to other informative words: if one index term is good at discriminating relevant from non-relevant documents, then any closely associated index term is also likely to be good at this.
  Contains informative words: all terms must contain informative words.
  Is discriminative in the retrieval collection: a term that occurs many times within a small number of documents gives a pronounced relevance signal.

Graph Construction
C: retrieval collection and English Wikipedia
Example
  Q: a b
  Top k documents: d1, d2 (if k = 2)
  N (neighborhood set): {d0, d1, d2}
  d0: the query, encoded as a document
  Graph G

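  d1: c b e
  d2: a f b
  Vertices of G: a, b, c, e, f

Edge Weight
The weight of the edge between stems i and j combines:
  the counts of co-occurrence of stems i and j in windows of size 2 and 10 in N
  the probability, given Q, of the document in which stems i and j co-occur
  an idf-style weight: factor r confirms the importance of a connection between i and j in N

Random Walk
Example with three nodes (1, 2, 3) and transition matrix

$$H = \begin{pmatrix} 0.009 & 0.9 & 0.01 \\ 0.1 & 0.8 & 0.1 \\ 0.6 & 0.005 & 0.395 \end{pmatrix}$$

If the walk starts from node 1 at time = 0, the distribution over nodes at time = 1 is

$$\begin{pmatrix} 1 & 0 & 0 \end{pmatrix} H = \begin{pmatrix} 0.009 & 0.9 & 0.01 \end{pmatrix}$$

so the probability that it walks to node 3 at time = 1 is 0.01.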
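As a rough illustration of the graph construction, edge weighting and random walk steps, the following Python sketch builds an affinity graph for the toy neighborhood set (d0 = "a b", d1 = "c b e", d2 = "a f b") and walks it. The uniform P(d|Q), the mixing parameter `lam`, the exact form of the idf-style factor r, the row normalization, and all function names are assumptions made for this sketch; the slides only name the ingredients.

```python
import math
from collections import Counter


def cooccurrence(tokens, window):
    """Count unordered pairs of distinct stems that fall within `window` consecutive tokens."""
    counts = Counter()
    for start, head in enumerate(tokens):
        for other in tokens[start + 1:start + window]:
            if other != head:
                counts[frozenset((head, other))] += 1
    return counts


def edge_weights(docs, lam=0.6):
    """docs: list of (tokens, P(d|Q)). Returns {frozenset({i, j}): weight}."""
    mixed, c2_total = Counter(), Counter()
    for tokens, p_doc in docs:
        c2, c10 = cooccurrence(tokens, 2), cooccurrence(tokens, 10)
        c2_total.update(c2)
        for pair, count10 in c10.items():  # every window-2 pair is also a window-10 pair
            mixed[pair] += p_doc * (lam * c2[pair] + (1 - lam) * count10)
    total = sum(c2_total.values())
    # Assumed idf-style factor r: pairs that are adjacent less often get a larger boost.
    return {pair: math.log2(1 + total / (1 + c2_total[pair])) * value
            for pair, value in mixed.items()}


def transition_matrix(weights, vertices):
    """Row-normalize the symmetric edge weights into a transition matrix H."""
    index = {v: n for n, v in enumerate(vertices)}
    H = [[0.0] * len(vertices) for _ in vertices]
    for pair, w in weights.items():
        i, j = (index[v] for v in pair)
        H[i][j] = H[j][i] = w
    return [[x / (sum(row) or 1.0) for x in row] for row in H]


def walk(H, pi, steps):
    """Apply pi <- pi H for `steps` steps and return the resulting distribution."""
    n = len(H)
    for _ in range(steps):
        pi = [sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]
    return pi


# Toy neighborhood set with an assumed uniform P(d|Q).
docs = [("a b".split(), 1 / 3), ("c b e".split(), 1 / 3), ("a f b".split(), 1 / 3)]
vertices = sorted({t for tokens, _ in docs for t in tokens})
H = transition_matrix(edge_weights(docs), vertices)
pi = walk(H, [1 / len(vertices)] * len(vertices), 50)

# One step on the three-node matrix from the Random Walk example, starting at node 1.
H_slide = [[0.009, 0.9, 0.01], [0.1, 0.8, 0.1], [0.6, 0.005, 0.395]]
one_step = walk(H_slide, [1.0, 0.0, 0.0], 1)  # equals the first row of H_slide
```

Keeping `walk` separate from the graph-building code means the same function can be run on any row-stochastic matrix, including the hand-written three-node example.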

Vertex weights
Factor s balances exhaustivity with global saliency to identify stems that are poor discriminators between relevant and non-relevant documents. It uses:
  the frequency of a word w_n in N, averaged over k + 1 documents and normalized by the maximum average frequency of any term in N
  df_wn, the number of documents in C containing w_n

TREC query #840: Give the definition, locations, or characteristics of geysers.
=> "definition geysers" is not more informative than "geysers" alone

Example (|N| = 3, |C| = 35, maximum average frequency of any term in N = 4)
  w_n = geysers: average frequency in N = 12/3, df_wn = 3
  w_n = definition: average frequency in N = 2/3, df_wn = 1
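To make the geysers / definition example concrete, here is a small Python sketch of factor s. The slides give the inputs but not the formula, so the product form and the idf term `log2(|C| / (1 + df))` are assumptions, and `vertex_factor` is a hypothetical helper name.

```python
import math


def vertex_factor(avg_freq, max_avg_freq, doc_freq, collection_size):
    """s = (normalized average frequency in N) * idf, with an assumed idf form."""
    normalized_freq = avg_freq / max_avg_freq
    idf = math.log2(collection_size / (1 + doc_freq))  # assumption, not taken from the slides
    return normalized_freq * idf


# Inputs from the example: |C| = 35, maximum average frequency in N = 4.
s_geysers = vertex_factor(avg_freq=12 / 3, max_avg_freq=4, doc_freq=3, collection_size=35)
s_definition = vertex_factor(avg_freq=2 / 3, max_avg_freq=4, doc_freq=1, collection_size=35)
```

Under these assumptions "geysers" receives a larger factor than "definition": its normalized frequency in N is six times higher, while the two idf terms are of similar size.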

Term ranking
Input: all combinations of 1-3 words in a query that are not stopwords.
Output: ranked list sorted by f(x, Q) score.
To avoid a bias towards longer terms, a term x is scored by averaging the affinity scores for its component words.
Factor z_x represents the degree to which the term is discriminative in a collection; it uses the frequency of x_e in C.

Example (terms from the introductory query on volcanic activity)
  Term x = volcanic boundaries
  Term x = volcanic U.S.
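The term ranking step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the stopword list, the shortened query, the affinity scores and the constant `z` are all made up, and `candidate_terms` and `rank_terms` are hypothetical names.

```python
from itertools import combinations


def candidate_terms(query_words, stopwords, max_len=3):
    """All combinations of 1..max_len distinct non-stopword query words, in query order."""
    words = []
    for w in query_words:
        if w not in stopwords and w not in words:
            words.append(w)
    return [combo for n in range(1, max_len + 1) for combo in combinations(words, n)]


def rank_terms(terms, affinity, z):
    """Score each term by z(x) times the mean affinity of its words; best first."""
    scored = [(z(x) * sum(affinity[w] for w in x) / len(x), x) for x in terms]
    return sorted(scored, key=lambda item: item[0], reverse=True)


stopwords = {"of", "the"}  # tiny illustrative stopword list
query = "locations of volcanic activity boundaries".split()  # shortened query, for illustration
affinity = {"locations": 0.2, "volcanic": 0.4, "activity": 0.3, "boundaries": 0.1}  # made-up scores
ranked = rank_terms(candidate_terms(query, stopwords), affinity, z=lambda x: 1.0)
```

Because the score is a mean rather than a sum, adding a weak word to a strong one lowers the score: with a constant `z`, "volcanic boundaries" scores below "volcanic" alone, so any advantage for a longer term has to come from `z`.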

Diversity filter
PhRank often assigns a high rank to multi-word terms that contain only one highly informative word.
For example, for the query "the destruction of Pan Am Flight 103 over Lockerbie, Scotland":
  the term "pan flight 103" is informative
  "pan" is uninformative by itself

Example
Way 1: ranked list "... birth rate china ... birth rate"
  "birth rate" is discarded, on the assumption that the longer term better represents the information need.
Way 2: ranked list "... declining birth, birth rate ... declining birth rate"
  "declining birth rate" is discarded, on the assumption that the shorter terms better represent the information need and the longer term is redundant.
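One possible reading of the two diversity-filter rules, written as a Python sketch. The exact matching conditions are assumptions (the slides give only the two examples): Way 1 is taken to mean "all words appear in one higher-ranked kept term", and Way 2 "higher-ranked kept shorter terms together cover every word". `diversity_filter` is a hypothetical name.

```python
def diversity_filter(ranked_terms):
    """Filter a best-first list of terms (each a string of space-separated words)."""
    kept = []
    for term in ranked_terms:
        words = set(term.split())
        kept_sets = [set(k.split()) for k in kept]
        # Way 1: the words are contained in a higher-ranked term, so drop the shorter term.
        if any(words <= k for k in kept_sets):
            continue
        # Way 2: higher-ranked shorter terms together cover all words, so drop the longer term.
        covered = set().union(*[k for k in kept_sets if k < words])
        if covered == words:
            continue
        kept.append(term)
    return kept


way_1 = diversity_filter(["birth rate china", "birth rate"])
way_2 = diversity_filter(["declining birth", "birth rate", "declining birth rate"])
```

Comparing only against terms that were kept, rather than against every higher-ranked term, is a design choice of this sketch; it prevents an already discarded term from causing further removals.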

Experiment
Dataset
[Table: datasets]
F: excluded from features; T: included in features

[Table: TREC description topics]
[Table: TREC title queries]

Conclusions
  The paper presents PhRank, a novel term ranking algorithm that extends work on Markov chain frameworks for query expansion to select focused and succinct terms from within a query.
  For all collections, around 26% of queries have more than 5% decrease in MAP compared to SD.
  Efficiency considerations surrounding the time to construct an affinity graph may be ameliorated by off-line indexing to precompute a language model for each document in a collection.
