CS621 : Artificial Intelligence
Pushpak BhattacharyyaCSE Dept., IIT Bombay
Lecture 27: Towards more intelligent search
Desired Features of the Search Engines
• Meaning based
  – More relevant results
• Multilingual
  – Query in English, e.g.
  – Fetch document in Hindi, e.g.
  – Show it in English
Precision (P) and Recall (R)
• Tradeoff between P and R
[Figure: two overlapping sets, Actual (A) and Obtained (O); their intersection is the shaded area (S)]

P = S/O
R = S/A
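The two ratios above are plain set operations; a minimal Python sketch (the document ids d1…d7 are illustrative, not from the lecture):

```python
# Precision and recall from the retrieved set O and the relevant set A.
def precision_recall(obtained: set, actual: set) -> tuple:
    s = obtained & actual                              # S: relevant AND retrieved
    p = len(s) / len(obtained) if obtained else 0.0    # P = S/O
    r = len(s) / len(actual) if actual else 0.0        # R = S/A
    return p, r

# 4 retrieved, 5 relevant, 2 in common: P = 0.5, R = 0.4
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5", "d6", "d7"})
```

The tradeoff is visible here: retrieving more documents can only grow S and A's coverage (raising R) while usually shrinking the fraction S/O (lowering P).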
The UNL System: An Overview
Building blocks of UNL
• Universal Words (UWs)
• Relations
• Attributes
• Knowledge Base
UNL Graph
[Figure: UNL graph for "He forwarded the mail to the minister." The entry node forward(icl>send), with attributes @entry and @past, is linked by agt to he(icl>person), by obj to mail(icl>collection).@def, and by gol to minister(icl>person).@def]

He forwarded the mail to the minister.
UNL Expression
agt (forward(icl>send).@entry.@past, he(icl>person))
obj (forward(icl>send).@entry.@past, mail(icl>collection).@def)
gol (forward(icl>send).@entry.@past, minister(icl>person).@def)
Universal Word (UW)
• vocabulary of UNL
• represents a concept
  – Basic UW (an English word/compound word/phrase with no restrictions or Constraint List)
  – Restricted UW (with a Constraint List)
• Examples:
  – "crane(icl>device)" (noun)
  – "crane(icl>bird)" (noun)
  – "crane(icl>do)" (verb: "crane the neck")
Desirable features of UWs
• Expressibility: able to represent any concept in a language
• Economy: only enough to disambiguate the head word
• Formal situatedness: every UW should be defined in the UNL Knowledge-Base
UNL Knowledge Base
• A semantic network comprising every possible UW
• A lattice structure
Enconversion
[Figure: Input Sentence/Query → UNL expression]

Enconversion process
• Analysis at 3 levels
  – Morphological
  – Syntactic
  – Semantic
• Crucial role of disambiguation
  – Sense (I bank with the bank on the river bank)
  – Part of speech
  – Attachment (I saw the boy with a telescope)
Deconversion
[Figure: UNL expression → Output sentence]
Deconversion process
• Syntax Planning
• Case marking
• Morphology
[Figure: Deconversion example. The UNL graph has win (@entry, @past) linked by agt to Brazil, ptn to Japan, and obj to match. The word list "Braajila jaapaan mecha jiit" is generated into the Hindi sentence "Braajila ne jaapaan ke saatha mecha jiitaa" (Brazil won the match with Japan)]
Application: meaning based multilingual search
Top Level Description of the Methodology
• Documents represented as meaning graphs
• Queries converted to meaning graphs
• Matching on meaning graphs
• Retrieved document (a collection of meaning graphs) displayed in the language of interest
System Constituents: 1/2
• Search Front
  – Crawler
  – Indexer (3 level)
    • On Expressions
    • On Concepts
    • On Keywords
System Constituents: 2/2
• Language Front
  – EnConverter (analyses sentence to UNL)
  – DeConverter (generates sentence from UNL)
  – Stemmer and Morphology Analyser
  – Parser
  – Word Sense Disambiguator
• Needs wordnets
Overall Architecture

[Figure: the query passes through stemmers, WSD, query expansion and the Enconverter to produce UNL. The HTML corpus is enconverted and indexed (a UNL index plus a Lucene keyword index). Matching then cascades: complete UNL match, else partial UNL match, else UW match, else keyword search; on each "yes" branch the retrieved UNL documents are deconverted for display as search results]
Failsafe Search Strategy
Indexing and Failsafe Search Strategy: 1/2
• The indexer creates a three level indexing in the form of
a. UNL expressions (phrasal and sentential concepts)
b. Universal Words (lexical concepts)
c. Keywords/Stem Words (Using Stemmers & Lucene)
Indexing and Failsafe Search Strategy: 2/2
• This enables a failsafe search strategy:
- Complete expression matching, else
- Partial expression matching, else
- Universal Word (UW) matching, else
- Search on Keywords/Stem Words
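The cascade above falls through to the next level only when the current one returns nothing; a minimal Python sketch (the matcher functions and document ids are illustrative stand-ins for the three-level index lookups):

```python
# Failsafe search: try each matching level in order, fall through on empty.
def failsafe_search(query, matchers):
    """`matchers` is an ordered list of functions query -> list of doc ids."""
    for match in matchers:
        results = match(query)
        if results:            # first non-empty level wins
            return results
    return []

# Toy levels: complete expression, partial expression, UW, keyword fallback.
levels = [
    lambda q: [],                                     # complete match fails
    lambda q: [],                                     # partial match fails
    lambda q: ["agro4"] if "farmer" in q else [],     # UW match
    lambda q: ["agro1", "agro2"],                     # keyword/stem fallback
]
print(failsafe_search("farmer subsidy", levels))      # -> ['agro4']
```

The ordering encodes the precision preference: semantically exact matches are returned when available, and the keyword level guarantees the search never comes back empty-handed for indexed terms.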
Indexing

• UNL Document Index
  – Keeps information about each UNL Document
• UNL Index
  – Stores the actual index of UNL Expressions
UNL Document Index fields:

Field      Description
docid      UNL document id
orilink    Link to original document
language   Language of original document
numlines   Number of sentences in the document
UNL Index fields:

Field    Description
rel      Stores the relation of a UNL expression
uw1      Stores the first Universal Word of a relation
uw2      Stores the second Universal Word of a relation
uwid1    Stores the id of uw1
uwid2    Stores the id of uw2
docid    UNL document id in which the above fields occur
sent     Sentence number in which the above fields occur
Index of UNL Expressions

rel   uw1                                        uw2                          postings
mod   support(icl>help)                          financial(mod<thing)         agro4:1
mod   performance(icl>operation)                 agriculture(icl>activity)    agro7:20, agro4:1
mod   government(icl>governmental organization)  australian(mod<thing)        agro2:12, agro4:1
and   strategy(icl>idea)                         trade(icl>activity)          agro4:1
…     …                                          …                            …

• Each entry (UNL Expression (rel, uw1, uw2)) points to the pair of document id and sentence number where it occurs.
• Sample UNL Expressions:

mod:02(support(icl>help):4T, financial(mod<thing):4J)
mod:01(government(icl>governmental organization):5L.@def, australian(mod<thing):5A)
and:04(strategy(icl>idea):1D.@entry.@pl, trade(icl>activity):16)
mod:05(performance(icl>operation):2B.@entry.@topic, agriculture(icl>activity):2X)
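Reading the (rel, uw1, uw2) triple that the index stores out of such an expression string takes only a small parser; a Python sketch (scope-id suffixes like `:4T` and attributes like `.@def` are kept attached to each UW here):

```python
# Rough parser for sample UNL expressions such as
#   mod:02(support(icl>help):4T, financial(mod<thing):4J)
def parse_unl_expression(expr: str):
    head, body = expr.split("(", 1)
    rel = head.split(":")[0].strip()      # "mod:02" -> "mod"
    body = body.rsplit(")", 1)[0]         # strip the outer closing paren
    depth, split_at = 0, None
    for i, ch in enumerate(body):         # find the top-level comma
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            split_at = i
            break
    if split_at is None:
        raise ValueError("not a binary UNL relation: " + expr)
    uw1, uw2 = body[:split_at].strip(), body[split_at + 1:].strip()
    return rel, uw1, uw2

print(parse_unl_expression(
    "mod:02(support(icl>help):4T, financial(mod<thing):4J)"))
# -> ('mod', 'support(icl>help):4T', 'financial(mod<thing):4J')
```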
Index of UWs

• Each UW points to the pairs of document id and sentence number where it occurs.

UWs: support(icl>help), financial(mod<thing), government(icl>governmental organization), performance(icl>operation), indian(aoj>thing), marketing(icl>commerce), …

[Table: each of the above UWs followed by its (docid: sentences) postings, e.g. agro4: 1; agro7: 20; agro5: 15; agro2: 8,16…; agro4: 1,22,26…; agro5: 3,4,5…; agro6: 3,5,22…; agro7: 3,21,32…]
Sophisticated matching
• Complete set of expressions matching
• Weighted expression matching
• Partial set of expressions matching
• Complete UW matching
• Headword matching (equivalent to keyword)
• Restriction matching
• Attribute matching
Keyword-based Matching needs morphology for Indian languages

[Figure: the user input "ganne" is stemmed to "gannaa", which is looked up in the Lucene index; the output is all documents containing gannoM, gannaa, ganne]
Multilingual Keyword Search

• A UW dictionary based approach
• Given a query in a language, generates a multilingual query using UW dictionaries
• Example:
  – Monolingual Query: Farmer
  – Multilingual Query: Farmer, किसान (Hindi), शेतकरी (Marathi)
• Provides multilingual capability at the keyword level to the search engine

[Figure: the query passes through a preprocessor and stemmer; the Multilingual Keyword Generator uses the UW dictionary database to produce the multilingual query]
Experimentation
• Chosen Domain: Agriculture
• Languages: English, Hindi, Marathi
• Document base: Pesticides and Diseases
• Word order sensitive
  – Money lenders exploit farmers vs. farmers exploit money lenders
• For CLIR: Tested on
  – Hindi and Marathi query retrieval from English
  – Display in Hindi/Marathi
System Interface
• Agricultural Search Engine
Wordnet Sub-graph (Hindi)

[Figure: the Hindi WordNet sub-graph around गाय (cow). The synset गाय/गऊ (SYNONYMY) carries the gloss "a horned, herbivorous, female quadruped famous for its milk: 'Hindus call the cow Gau Mata and worship her'". HYPERNYMY links it upward to चौपाया (quadruped) and स्तनपायी जंतु (mammal); MERONYMY gives parts such as थन (udder) and दुम (tail); ABILITY links to जुगाली करना (to chew the cud); ANTONYMY links to बैल (ox); a POLYSEMY edge points to the figurative sense "a very meek person: 'He is a cow; he quietly accepts whatever he is told'"; HYPONYMY and GLOSS edges are also labelled]
Wordnet Sub-graph (Marathi)

[Figure: the Marathi WordNet sub-graph around घोडा (horse). The synset घोडा/अश्व (SYNONYMY) carries the gloss "a quadruped animal used for carrying loads, pulling carts or riding: 'horses from the ancient herds of Arabia are famous'". HYPERNYMY links it to सस्तन प्राणी (mammal); HYPONYMY gives breeds such as भीमथडी and अरबी; HOLONYMY, MERONYMY and GLOSS edges are also labelled; a POLYSEMY edge points to the chess sense "घोडा: a piece in the game of chess: 'the knight moves two and a half squares'"]
Semantically Relatable Set (SRS) Based Search
(Please look up publications under www.cse.iitb.ac.in/~pb for descriptions of SRS and SRS based search)
What is SRS
• SRSs are UNL expressions without the semantic relations
• E.g., "the first non-white president of USA":
  – (the, president)
  – (president, of, USA)
  – (first, president)
  – (non-white, president)
SRS Based matching
• Complete SRS match
  – All the SRSs of the query should match with the SRSs of the sentence
• Partial SRS match
  – Not all of the query SRSs need to match the sentence SRSs
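Treating SRSs as tuples, the two match modes reduce to subset and intersection tests; a Python sketch using the "first non-white president of USA" example:

```python
# Complete vs. partial SRS matching over sets of SRS tuples.
def complete_srs_match(query_srss, sentence_srss) -> bool:
    """Every SRS of the query appears among the sentence's SRSs."""
    return set(query_srss) <= set(sentence_srss)

def partial_srs_match(query_srss, sentence_srss) -> bool:
    """At least one query SRS appears among the sentence's SRSs."""
    return bool(set(query_srss) & set(sentence_srss))

sent = {("the", "president"), ("president", "of", "USA"),
        ("first", "president"), ("non-white", "president")}
q = {("president", "of", "USA"), ("first", "president")}
assert complete_srs_match(q, sent)                                   # all match
assert partial_srs_match({("first", "president"), ("x", "y")}, sent) # one matches
```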
System Architecture
Experimental Setup

• Text Retrieval Conference (TREC) data was used.
• TREC provides the gold standard for queries and relevant documents:

Query Number   Document-ID      Relevance Score
8              WSJ911010-0114   1
8              WSJ911011-0085   0
21             AP880304-0049    1
21             AP880304-0192    0

Table: Relevance Judgments in TREC

• We chose 1919 documents and the first 250 queries.
  – Mostly from the AP newswire, Wall Street Journal and the Ziff data.
Experiment Process
• Lucene with tf-idf as the keyword based search engine (baseline)
• SRS based search as the competing method
• Compared both search methods on various parameters
Precision Comparison
• Shows that SRS search filters out non-relevant documents much more effectively than the keyword based tf-idf search.
Recall Comparison
• tf-idf consistently outperforms the SRS search engine here.
Mean Average Precision (MAP) Comparison
• MAP combines recall and precision oriented aspects and is sensitive to the entire ranking.
• SRS Search could not perform well here because of its low recall.
AvgP = (1/R) Σ_{r=1}^{N} P(r) · rel(r)

(R: number of relevant documents for the query; N: number of retrieved documents; P(r): precision at rank r; rel(r): 1 if the document at rank r is relevant, else 0. MAP averages AvgP over all queries.)
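Computed over a ranked list, the formula accumulates precision only at the relevant ranks; a Python sketch (the document ids are illustrative):

```python
# Non-interpolated average precision: (1/R) * sum over ranks of P(r)*rel(r).
def average_precision(ranked_ids, relevant: set) -> float:
    hits, total = 0, 0.0
    for r, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            total += hits / r          # P(r) evaluated at each relevant rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs) -> float:
    """`runs`: list of (ranked_ids, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant docs at ranks 1 and 3, R = 2: AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision(["d1", "d9", "d2"], {"d1", "d2"})
```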
Reasons for poor Recall: Word Divergence 1/2
• Inflectional Morphology Divergence
  – Query: "child abuse"
    • Query SRS: (child, abuse)
  – Sentence: "children are abused"
    • Sentence SRS: (children, abused)
• Derivational Morphology Divergence
  – Query: "debt rescheduling"
    • Query SRS: (debt, rescheduling)
  – Sentence: "rescheduling of debt"
    • Sentence SRS: (rescheduling, of, debt)
  – Query: "polluted water"
    • Query SRS: (polluted, water)
  – Sentence: "water pollution has increased in the city"
    • Sentence SRS: (water, pollution)
Reasons for poor Recall: Word Divergence 2/2
• Synonymy Divergence
  – Query: "antitrust cases"
    • Query SRS: (antitrust, cases)
  – Sentence: "An antitrust lawsuit was charged today"
    • Sentence SRS: (antitrust, lawsuit)
• Hypernymy Divergence
  – Query has keyword "car", while the document has keyword "automobile".
• Hyponymy Divergence
  – Query can be "car" whereas the document might contain "minicar".
Physical Separation Divergence

• Query: "antitrust lawsuit"
  – Query SRS: (antitrust, lawsuit)
• Sentence: "The federal lawsuit represents the largest antitrust action"
  – Sentence SRSs: (lawsuit, represents), (represents, action), (antitrust, action)
Solutions for Divergences
Solution to Morphological Divergence
• Stemming
  – All words in the document and the query SRSs are stemmed before matching.
  – Gets the base form based on WordNet, while keeping the tag of the word unchanged.
    • children_NN stemmed to child_NN, but childish_JJ not stemmed to child_NN
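The key constraint is that stemming stays inside the word's own part of speech; a minimal Python sketch with a toy exception list standing in for the WordNet morphological analyser (the irregular-noun table and suffix rule are assumptions, not the system's actual resource):

```python
# Tag-preserving stemming: reduce to base form only within the word's PoS.
IRREGULAR_NOUNS = {"children": "child", "mice": "mouse"}  # toy exception list

def stem_tagged(word: str, tag: str) -> str:
    if tag == "NN":
        if word in IRREGULAR_NOUNS:
            return IRREGULAR_NOUNS[word]
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]                 # crude plural stripping
    return word                              # childish_JJ stays unchanged

assert stem_tagged("children", "NN") == "child"
assert stem_tagged("childish", "JJ") == "childish"
```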
Solution to Synonymy-Hypernymy-Hyponymy Divergence
• Find related words from the WordNet
• Algorithm Outline
  1. Get synonyms
  2. Get hypernyms up to depth 2
  3. Get hyponyms up to depth 2
  4. Repeat steps 1, 2 and 3 for all synonyms
  5. All these words are the related words
• Found related words for all words in the corpus (nouns and verbs)
• Calculated similarity between a word and its related words
SRS Tuning
• Deals with the "Other Divergences" problem.
• Enriches the SRSs in the corpus.
• Adds new SRSs by applying augment rules on existing SRSs.
Sample Rules I
• Rule: (N1, N2) => (N2(J), N1)
  – Sentence: "water pollution"
  – Sentence SRS: (water_N, pollution_N)
  – Tuned SRS: (polluted_J, water_N)
Sample Rules II
• Rule: (V, N) => (N, V(N))
  – Sentence: "destroy city"
  – Sentence SRS: (destroy_V, city_N)
  – Augmented SRS: (city_N, destruction_N)
Sample Rules III

• Rule: (N1, of, N2) => (N2, N1)
  – Sentence: "rescheduling of debt"
  – Sentence SRS: (rescheduling_N, of, debt_N)
  – Augmented SRS: (debt_N, rescheduling_N)
• Rule: (N1, of, N2) => (N2(J), N1)
  – Sentence: "cup of gold"
  – Sentence SRS: (cup_N, of, gold_N)
  – Augmented SRS: (golden_J, cup_N)
Sample Rules IV

• Rule: (V, for, N) => (N, V(N))
  – Sentence: "applied for a certificate"
  – Sentence SRS: (applied_V, for, certificate_N)
  – Augmented SRS: (certificate_N, application_N)
• Rule: (J, for, N-ANIMATE) => (N, J(N))
  – Sentence: "famous for her painting"
  – Sentence SRS: (famous_J, for, painting_N)
  – Augmented SRS: (painting_N, fame_N)
  – Sentence: "It is good for John"
  – Sentence SRS: (good_J, for, John_N)
  – Augmented SRS: (John_N, goodness_N) ✗ (blocked: the rule requires a non-animate noun)
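As a sketch of how such augment rules fire, here is the (N1, of, N2) => (N2, N1) rule from Sample Rules III in Python. Only this structural rule is shown; rules that derive a new word form (pollution_N -> polluted_J) would additionally need a morphological resource:

```python
# SRS tuning: apply an augment rule to an existing SRS to add a new one.
def augment_of_rule(srs):
    """(rescheduling_N, of, debt_N) -> [(debt_N, rescheduling_N)]"""
    if (len(srs) == 3 and srs[1] == "of"
            and srs[0].endswith("_N") and srs[2].endswith("_N")):
        return [(srs[2], srs[0])]
    return []                       # rule does not fire

assert augment_of_rule(("rescheduling_N", "of", "debt_N")) == [("debt_N", "rescheduling_N")]
assert augment_of_rule(("famous_J", "for", "painting_N")) == []   # wrong pattern
```

In the full system each rule's output would be added alongside the corpus SRSs before matching, which is what lets a query SRS like (debt, rescheduling) hit a sentence containing "rescheduling of debt".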
Getting the Derived Form Using the Porter Stemmer
• Let the word be "national_J". We want the noun form.
• Step 1. Get the stem using Porter
  – "national" -> "nat"
• Step 2. Get all nouns from WordNet which start with "nat"
  – "nature", "natural", "nation", "nationhood", "native", etc.
• Step 3. Get the words which have the largest lexicographical match with "national"
  – "nation", "nationhood"
• Choose any one of them
  – "nation_N"
New System Architecture
New Formulation for Sentence Relevance

r(s) = max_{srs ∈ q} ( weight(srs) · max_{srs' ∈ s} t(srs, srs') )

where the SRS Similarity t() is calculated as

t(srs, srs') = t(cw1, cw1') · equal(fw, fw') · t(cw2, cw2')

• t(w1, w2) is calculated using the similarity measure discussed.
• t(cw1, cw1') and equal(fw, fw') become 1 while matching (FW, CW)s and (CW, CW)s respectively.
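A minimal Python sketch of this scoring, with a hypothetical word-similarity table standing in for the WordNet-based measure (the weights and similarity values are illustrative):

```python
# Sentence relevance from SRS pairs (cw1, cw2); function-word slots are
# omitted for brevity, so equal(fw, fw') is treated as 1 throughout.
WORD_SIM = {("case", "case"): 1.0, ("case", "lawsuit"): 0.8}   # assumed table

def t_word(w1, w2):
    return WORD_SIM.get((w1, w2), WORD_SIM.get((w2, w1), 1.0 if w1 == w2 else 0.0))

def t_srs(srs, srs_prime):
    """t(srs, srs') = t(cw1, cw1') * t(cw2, cw2')"""
    return t_word(srs[0], srs_prime[0]) * t_word(srs[1], srs_prime[1])

def relevance(query_srss, sentence_srss, weight):
    """max over query SRSs of weight(srs) * best match inside the sentence."""
    return max(weight(srs) * max(t_srs(srs, sp) for sp in sentence_srss)
               for srs in query_srss)

score = relevance([("antitrust", "case")],
                  [("antitrust", "lawsuit"), ("federal", "court")],
                  weight=lambda srs: 1.0)
# t(antitrust, antitrust)=1.0 times t(case, lawsuit)=0.8 -> score 0.8
```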
Recall Comparison
– Dramatic improvement in recall
Precision Comparison
• Drop observed in precision, but still higher than TF-IDF
• Non-relevant documents effectively filtered out
Mean Average Precision (MAP) Comparison
• MAP better than TF-IDF
Highlight of Results
• Recall of the enhanced system dramatically improved from 0.102 to 0.362 (comparable to TF-IDF)
• Significant rise in MAP (0.149 from 0.054)
Searching with Enriched Information and Structures
Verma Kamaljeet S. M.Tech Project, 2008
(advised by Prof. Pushpak Bhattacharyya)
Motivation
• Language Modeling Approach is popular
  – Solid theoretical foundations
  – Promising empirical retrieval performance
• Semantic Smoothing incorporates synonym, context and sense information to produce more accurate results
• SRSs (Semantically Relatable Sequences)
  – Are usually unambiguous and should give precise results
  – Ideal candidates for Semantic Smoothing
SRS Language Model
• Two Goals
  – Incorporate synonym and sense information in the model
    • e.g. query: "case", documents: "lawsuit"
  – Use the contextual information present in SRS tuples
    • e.g. SRS "instrument case":
      p(container | "instrument case") > p(lawsuit | "instrument case")
    • SRS "antitrust case":
      p(lawsuit | "antitrust case") > p(container | "antitrust case")
SRS Language Model
• Query
  – q = (q1, q2, ….., qn)
• Corpus C
  – documents d1, d2, …..
• Key Notion
  – Document to Query translation, or Query Generation: p(q|d)
• Query terms assumed to be independent:

p(q|d) = Π_j p(q_j|d)
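In practice the product is computed as a sum of logs; a Python sketch (the document model and the floor value for unseen terms are illustrative, and a real system would smooth against the collection model instead of flooring):

```python
import math

# Term-independence assumption: log p(q|d) = sum_j log p(q_j|d).
def query_log_likelihood(query_terms, doc_model: dict, floor=1e-9) -> float:
    """`doc_model` maps term -> p(term|d); `floor` avoids log(0)."""
    return sum(math.log(doc_model.get(t, floor)) for t in query_terms)

d = {"antitrust": 0.02, "lawsuit": 0.01}
ll = query_log_likelihood(["antitrust", "lawsuit"], d)
```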
High Level System Architecture

[Figure: raw documents pass through the SRS Generator to become SRS documents; the Indexer builds the SRS Index and the Word Index; the Translation Probability Estimator builds the Translation Matrix; the Searcher answers queries over these indexes, and the Evaluator scores the results]
Indexing
• Word Index
  – Stop words are removed
  – Stemming done (Porter Stemmer)
• SRS Index
  – Useful SRSs are indexed
[Figure: the SRS vocabulary V_SRS (SRS1, SRS2, SRS3, …) and the word vocabulary V_w (w1, w2, w3, …) are both linked to the document set V_d (D1, D2, …)]
SRS pruning
• Goal
  – Identification of "good" SRSs
• PoS tag based pruning
  – SRSs with 2 <= length <= 4 kept
  – Starting with NN, ending with NN
  – Starting with NN/JJ, ending with NN
  – Starting with NN/JJ/DT, ending with NN
• Results
  – Did not improve
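The widest of the filters above (start with NN/JJ/DT, end with NN, length 2-4) can be sketched as a Python predicate; the word_TAG encoding is assumed for illustration:

```python
# PoS-based SRS pruning: keep tuples of length 2-4 whose first word is
# tagged NN/JJ/DT and whose last word is tagged NN.
def keep_srs(srs) -> bool:
    if not 2 <= len(srs) <= 4:
        return False
    first_tag = srs[0].rsplit("_", 1)[-1]
    last_tag = srs[-1].rsplit("_", 1)[-1]
    return first_tag in {"NN", "JJ", "DT"} and last_tag == "NN"

assert keep_srs(("antitrust_JJ", "case_NN"))        # noun-phrase-like: keep
assert not keep_srs(("destroy_VB", "city_NN"))      # verb-initial: prune
```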
Translation Probabilities
• EM Training
  – Starts with an initial guess
  – Iteratively improves the guess, increasing the likelihood until it converges
• Update Equations
E-step:  p^(n)(w) = (1-α) p^(n)(w|θ_k) / [ (1-α) p^(n)(w|θ_k) + α p(w|C) ]

M-step:  p^(n+1)(w|θ_k) = c(w, D_k) p^(n)(w) / Σ_i c(w_i, D_k) p^(n)(w_i)
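A toy run of this EM loop in Python (the mixture weight alpha, the counts and the background probabilities are invented for illustration):

```python
# EM for one SRS's translation model: the E-step computes the posterior
# that a term was generated by the topic model rather than the background;
# the M-step re-normalises by the posterior-weighted counts.
def em_translation(counts: dict, background: dict, alpha=0.5, iters=20):
    vocab = list(counts)
    p = {w: 1.0 / len(vocab) for w in vocab}          # uniform initial guess
    for _ in range(iters):
        post = {w: (1 - alpha) * p[w] /
                   ((1 - alpha) * p[w] + alpha * background[w]) for w in vocab}
        norm = sum(counts[w] * post[w] for w in vocab)
        p = {w: counts[w] * post[w] / norm for w in vocab}
    return p

p = em_translation({"space": 30, "program": 25, "the": 100},
                   {"space": 0.01, "program": 0.01, "the": 0.5})
# The stopword's share ends up well below its raw relative frequency
# (100/155 ≈ 0.65), since the background model absorbs part of it.
```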
SRS Translation Probabilities
SRS: {space, program}

Term        Prob.
space       0.0266
program     0.0229
launch      0.0169
technology  0.0161
orbit       0.0148
astronaut   0.0148
mission     0.0139
NASA        0.0136
satellite   0.0134
earth       0.0132
Experimental Setup

• Document Collection: TREC AP89
  – 84,678 documents
  – 145,349 distinct words
  – 180.1 unique words per document on average
• Criteria
  – TREC collections are popular and many published results exist
• Queries: TREC queries 1-50
  – TREC queries have title, description, narrative and concept sections
  – Used only the title section
Experimental Setup
• Language Modeling Toolkit: Dragon Toolkit
  – Java based toolkit
  – Has implemented baseline models
  – Designed to support language modeling IR
Results
Comparison of the SRS Model to the Okapi, Two-Stage and MWE Models
Model MAP Recall P@10 P@100
Okapi 0.186 1627 0.259 0.139
Two-Stage 0.187 1623 0.259 0.139
MWE 0.204 1809 0.272 0.142
SRS 0.205 1836 0.262 0.150
Results
Comparison of the SRS Model with the MWE Model at λ = 1
Metric MWE SRS Improv.
MAP 0.077 0.098 +27.27%
Recall 1289 1413 +9.62%
P@10 0.130 0.168 +29.23%
P@100 0.091 0.104 +14.29%
Results
Metric MWE SRS SRS+MWE Improv. vs. MWE Improv. vs. SRS
MAP 0.204 0.205 0.217 +6.37% +5.85%
Recall 1809 1836 1865 +3.1% +1.58%
P@10 0.272 0.262 0.277 +1.84% +5.73%
P@100 0.142 0.150 0.153 +7.75% +2.0%
Some important tasks done
• Made simple yet effective changes to the SRS Generation module
  – Raw document to SRS document conversion time reduced from 5-10 minutes to 40 seconds - 2 minutes
  – Increased testing corpus size from 1818 to more than 84,000 documents
• Proposed and implemented an entirely new searching strategy
  – SRS Based Context-Sensitive Semantic Smoothing
  – Decent SRS pruning module to identify "good" SRSs
  – Results were the best amongst all the Language Modeling approaches (Two-Stage, Word Translation, MWE Topic Signature)
Conclusions
• Novel approach to Context-sensitive Semantic smoothing
• Semantically Relatable Sequences (SRSs) are used
  – Semantically related, not necessarily consecutive words
  – Context leads to more accurate results
  – Not all SRSs are useful indexing units (SRS Pruning)
• Mixture model of SRS Translation Model and Two stage language model is effective
• NLP inspired patterns in Language Modeling approach hold the promise of better IR performance
Future Work
• SRS Pruning
  – Other schemes like tf-idf
• Complex combination of MWE and SRSs
  – Mixture model of MWE, SRSs and baseline models
• Learning of the mixture weight coefficient
• Improvement of the SRS generation
• Experimenting with other NLP patterns
Thank You
References
1. J. Lafferty and C. Zhai, “Document Language Models, Query Models, and Risk Minimization for Information Retrieval,” Proc. 24th Ann. Int'l ACM Conf. Research and Development in Information Retrieval (SIGIR '01), pp. 111-119, 2001.
2. X. Zhou, X. Hu, X. Zhang, "Topic Signature Language Models for Ad hoc Retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 9, pp. 1276-1287, 2007.
3. R. Mohanty, M.K. Prasad, L. Narayanswamy and P. Bhattacharyya, “Semantically Relatable Sequences in the Context of Interlingua Based Machine Translation”, International Conference on Natural Language Processing, 2007.
4. S. Khaitan, K. Verma, R. Mohanty and P. Bhattacharyya, “Exploiting Semantic Proximity for Information Retrieval”, IJCAI 2007 Workshop on Cross Lingual Information Access, 2007.
5. J. Ponte and W.B. Croft, “A Language Modeling Approach to Information Retrieval,” Proc. 21st Ann. Int'l ACM Conf. Research and Development in Information Retrieval (SIGIR '98), pp. 275-281, 1998.
6. C. Zhai and J. Lafferty, “A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval,” Proc. 24th Ann. Int'l ACM Conf. Research and Development in Information Retrieval (SIGIR '01), pp. 334-342, 2001.
7. C. Zhai and J. Lafferty, “Model-Based Feedback in the Language Modeling Approach to Information Retrieval,” Proc. 10th Int'l Conf. Information and Knowledge Management (CIKM '01), pp. 403-410, 2001.
8. C. Zhai and J. Lafferty, “Two-Stage Language Models for Information Retrieval,” Proc. ACM Conf. Research and Development in Information Retrieval (SIGIR '02), 2002.
9. A. Berger and J. Lafferty, “Information Retrieval as Statistical Translation,” Proc. 22nd Ann. Int'l ACM Conf. Research and Development in Information Retrieval (SIGIR '99), pp. 222-229, 1999.
10. A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Royal Statistical Soc., vol. 39, pp. 1-38, 1977.
11. X. Zhou, X. Hu, X. Zhang, X. Lin, and I.-Y. Song, “Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR,” Proc. 29th Ann. Int'l ACM Conf. Research and Development on Information Retrieval (SIGIR '06), pp.70-77, Aug. 2006.
12. X. Zhou, X. Zhang, and X. Hu, “The Dragon Toolkit Developer Guide,” Data Mining and Bioinformatics Laboratory, Drexel Univ., http://www.dragontoolkit.org/tutorial.pdf, 2007.
13. S.E. Robertson et al., “Okapi at TREC-4,” Proc. Fourth Text Retrieval Conf. (TREC '95), 1995.
14. F. Smadja, “Retrieving Collocations from Text: Xtract,” Computational Linguistics, vol. 19, no. 1, pp. 143-177, 1993.
Presentation End
Translation Probabilities
• D_k: set of documents containing SRS_k
• Not all terms in D_k center on SRS_k
  – Some terms address the issue of other SRSs
  – Some represent the background information
• Generative model similar to [7] is used
  – Mixture model of the SRS translation model and the background collection model
  – θ is the set of parameters of the model of SRS_k

p(w) = (1-α) p(w|θ_k) + α p(w|C)
Translation Probabilities
• Log-Likelihood of generating D_k:

log p(D_k|θ) = Σ_w c(w, D_k) log p(w|θ)

• c(w, D_k) is the frequency of the term w in D_k
• Goal
  – Estimate the translation probabilities by maximizing the log-likelihood
  – Expectation Maximization
Non-Interpolated Average Precision
• Formula:

AvgP = (1/|Rel|) Σ_{D ∈ Rel} |{D' ∈ Rel : r(D') ≤ r(D)}| / r(D)

• Where r(D) is the rank of the document D and Rel is the set of relevant documents for a query Q.
• To obtain the MAP score, we average the non-interpolated average precision across all the queries of the collection.
Two Stage Language Model & Okapi

• TSLM Formula:

p(Q|D) = Π_q [ (1-λ) · (tf(q,D) + μ p(q|C)) / (|D| + μ) + λ p(q|C) ]

• Okapi Model:

sim(Q,D) = Σ_q log( (N - df(q) + 0.5) / (df(q) + 0.5) ) · tf(q,D) / (0.5 + 1.5 · |D|/avg_dl + tf(q,D))

• tf(q,D) is the term frequency of q in document D
• df(q) is the document frequency of q
• N is the number of documents; avg_dl is the average document length in the collection
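Both baselines can be sketched directly from the formulas; a Python version with invented toy statistics (λ, μ, the term counts and the collection probabilities are illustrative, not the experiment's actual settings):

```python
import math

# Two-stage LM: Dirichlet-smoothed document model mixed with the background.
def two_stage_logprob(query, tf, doc_len, p_c, lam=0.7, mu=800):
    """log p(Q|D) summed over query terms."""
    return sum(math.log((1 - lam) * (tf.get(q, 0) + mu * p_c[q]) / (doc_len + mu)
                        + lam * p_c[q]) for q in query)

# Okapi similarity: idf-style weight times a length-normalised, saturated tf.
def okapi_sim(query, tf, doc_len, df, n_docs, avg_dl):
    return sum(math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5))
               * tf.get(q, 0) / (0.5 + 1.5 * doc_len / avg_dl + tf.get(q, 0))
               for q in query)

tf = {"antitrust": 3, "lawsuit": 1}
p_c = {"antitrust": 0.001, "lawsuit": 0.002}
lp = two_stage_logprob(["antitrust", "lawsuit"], tf, 120, p_c)
sim = okapi_sim(["antitrust", "lawsuit"], tf, 120,
                df={"antitrust": 40, "lawsuit": 90}, n_docs=84678, avg_dl=180.1)
```

The two scores are not comparable across models (one is a log-probability, the other an additive similarity); each is used only to rank documents within its own run.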