Finding Similar Questions in Large Question and Answer Archives
Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee
ACM CIKM '05
Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
December 13, 2011
Question Answering from Frequently Asked Question Files
Robin D. Burke, Kristian J. Hammond, Vladimir Kulyukin, Steven L. Lytinen, Noriko Tomuro, and Scott Schoenberg
AI Magazine, Summer 1997
What is FAQ Finder?
• Matches answers to questions already asked in a site’s FAQ file
• 4 Assumptions:
1. Information is in QA format
2. All information needed to determine the relevance of a QA pair can be found in the QA pair itself
3. The question half of the QA pair is most relevant for matching to the user's question
4. Broad, shallow knowledge of language is sufficient for question matching
How Does It Work?
• Uses the SMART IR system to narrow the focus to relevant FAQ files
• Iterates through the QA pairs in a FAQ file, comparing each against the user's question and computing a score using 3 metrics:
– Statistical term-vector similarity score t
– Semantic similarity score s
– Coverage score c
m = (tT + sS + cC) / (T + S + C)

T, S, and C are constant weights that adjust the system's reliance on each metric.
Calculating Similarity
• QA pair represented as a term vector with significance values for each term in the pair
• Significance value = tfidf
– n (term frequency) = # of times the term appears in the QA pair
– m = # of QA pairs in the file in which the term appears
– M = total # of QA pairs in the file
– tfidf = n × log(M/m)
• Evaluates the relative rarity of a term within documents
– Used as a factor to weight the frequency of the term in a document
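The significance computation above can be sketched in Python (an illustrative reconstruction, not FAQ Finder's actual code; the toy FAQ data is invented):

```python
import math

def tfidf(term, qa_pair, faq_file):
    """Significance value of a term within one QA pair.

    qa_pair  -- list of terms in the QA pair
    faq_file -- list of QA pairs (each a list of terms) in the FAQ file
    """
    n = qa_pair.count(term)                          # term frequency in the QA pair
    M = len(faq_file)                                # total QA pairs in the file
    m = sum(1 for pair in faq_file if term in pair)  # QA pairs containing the term
    if m == 0:
        return 0.0
    return n * math.log(M / m)                       # tfidf = n * log(M/m)

# Toy FAQ file: three QA pairs, each reduced to a bag of terms
faq = [["reboot", "system", "press"],
       ["futon", "plans", "wood"],
       ["reboot", "crash"]]
score = tfidf("reboot", faq[0], faq)   # appears in 2 of 3 pairs -> low rarity weight
```

A term appearing in every QA pair gets log(M/M) = 0, so it contributes nothing to the match, which is exactly the behavior the slide describes.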
Nuances
• Many ways to express the same question
– In large documents, synonymous terms tend to co-occur, so wording variations have little effect
• However, FAQ Finder matches on a small # of terms, so the system needs a means of matching synonyms
– How do I reboot my system?
– What do I do when my computer crashes?
– Such causal relationships are resolved with WordNet
WordNet
• Semantic network of English words
• Provides relations between words and synonym sets, and between synonym sets themselves
• FAQ Finder utilizes WordNet through a marker-passing algorithm
– Compares each word in the user's question to each word in the FAQ file question
WordNet (cont…)
• Not a single semantic network; different sub-networks exist for nouns, verbs, etc.
• Syntactically ambiguous words (e.g., "run") appear in more than one network
• Simply relying on the default word sense worked as well as any more sophisticated technique
Evaluating Performance
• Corpus from the log file of the system's use (May–Dec 1996)
• 241 questions used
• Manual scan found 138 answered questions and 103 unanswered
• Assumes there is a single correct QA pair per question
• Because this task differs from the conventional IR problem, recall and precision must be redefined
Why Redefine Recall & Precision?
• RECALL – typically the % of the relevant documents for a query that are retrieved
• PRECISION – typically the % of retrieved documents that are relevant
• With only one correct document, these are not independent
• e.g., a query returns 5 QA pairs:
– FAQ Finder achieves either 100% recall and 20% precision, OR 0% recall and 0% precision
– If no answer exists, precision = 0% and recall is undefined
Redefining Recall & Precision
• Recall_new = % of questions for which FAQ Finder returns the correct answer when one exists
– Does not penalize the system if >1 correct answer exists (unlike the original definition)
• Instead of precision, calculate rejection
• Rejection = % of questions with no answer in the file that FAQ Finder correctly reports as unanswered
– Adjusted by setting a cutoff point on the minimum allowable match score
• There is still a tradeoff between rejection and recall
– Rejection threshold too high: some correct answers are eliminated
– Rejection threshold too low: incorrect answers are given to the user when no answer exists
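The redefined metrics can be sketched as follows (a minimal illustration, assuming each logged question carries a best match score, whether an answer exists in the file, and whether the correct pair was among the results; all field names are hypothetical):

```python
def evaluate(queries, threshold):
    """Redefined recall and rejection for FAQ Finder-style evaluation.

    Each query is a dict with:
      score      -- the system's best match score
      has_answer -- whether a correct QA pair exists in the file
      correct    -- whether the returned set contains the correct pair
    """
    answered = [q for q in queries if q["has_answer"]]
    unanswerable = [q for q in queries if not q["has_answer"]]

    # Recall: % of answerable questions where the correct answer is returned
    # and survives the rejection cutoff.
    hits = sum(1 for q in answered if q["correct"] and q["score"] >= threshold)
    recall = hits / len(answered) if answered else 0.0

    # Rejection: % of unanswerable questions the system correctly rejects
    # (best score falls below the cutoff).
    rejected = sum(1 for q in unanswerable if q["score"] < threshold)
    rejection = rejected / len(unanswerable) if unanswerable else 0.0
    return recall, rejection

queries = [
    {"score": 0.9, "has_answer": True,  "correct": True},
    {"score": 0.4, "has_answer": True,  "correct": True},   # cut off below
    {"score": 0.2, "has_answer": False, "correct": False},
]
recall, rejection = evaluate(queries, threshold=0.5)
```

Raising `threshold` here directly reproduces the tradeoff on the slide: the second query's correct answer is eliminated, lowering recall, while the unanswerable third query is correctly rejected.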
Results
• The correct file appears 88% of the time within the top 5 files returned, 48% of the time in first position
– Equates to 88% recall, 23% precision
• System confidently returns garbage when there is no correct answer in the file
Ablation Study
• Evaluation of the different components in the matching scheme by disabling each in turn:
1. QA pairs selected randomly from the FAQ file
2. Coverage score used by itself
3. Semantic scores from WordNet used by themselves
4. Term-vector comparison used in isolation
Conditions’ Contributions
• WordNet and the statistical technique both contribute strongly
• Their combination yields results that are better than either individually
Where FAQ Finder Fails
• Biggest cause of failures: undue weight given to semantically useless words
– "Where can I find woodworking plans for a futon?"
– "woodworking" is weighted as strongly as "futon"
– "futon" should be much more important inside the woodworking FAQ than "woodworking", which applies to everything there
• Other problem: violation of the assumptions about FAQ files
Conclusion
• When there is an existing collection of questions and answers, question answering can be reduced to matching new questions against QA pairs
• The power of the approach comes from FAQ Finder's use of highly organized knowledge sources that are designed to answer commonly asked questions
Citing Paper’s Objectives
• Find questions in the archive semantically similar to the user's question
• Resolve:
– Two questions that have the same meaning may use very different wording
– Similarity measures developed for document retrieval work poorly when there is little word overlap
Approaches Toward The Word Mismatch Problem
1. Use knowledge databases such as machine-readable dictionaries (the requirement from the first paper)
– Current quality and structure are insufficient
2. Employ manual rules and templates
– Expensive and hard to scale to large collections
3. Use statistical techniques from IR and natural language processing
– Most promising given enough training data
Problems with the Statistical Approach
• Need: a large # of semantically similar but lexically different sentence or question pairs
– No such collection exists at large scale
• Researchers artificially generate such collections through methods like translation followed by reverse translation
• This paper proposes an automatic way of building collections of semantically similar questions from existing Q&A archives
Question & Answer Archives
• Naver – leading portal site in South Korea (example below)
• Avg length of question title field = 5.8 words
• Avg question body = 49 words
• Avg answer = 179 words
• Made 2 test collections from the archive:
– A: 6.8M QA pairs across all categories
– B: 68K QA pairs from the "Computer Novice" category
Question Title: How to make multi-booting systems?
Question Body: I am using Windows 98. I'd like to multi-boot with Windows XP. How can I do this?
Answer: You must partition your hard disk, then install Windows 98 first. If there is no problem with Windows 98, then install Windows XP on…
• Need: sets of topics with relevance judgments
– 2 sets of 50 QA pairs randomly selected
• First set from Collection A, chosen across all categories
• Second set from Collection B, chosen from the "Computer Novice" category
• Each pair converted to a topic:
– QTITLE → short query
– QBODY → long query
– Answer → supplemental query (used only in the relevance judgment procedure)
Find Relevant QA Pairs
• Given a topic, employ the TREC pooling technique
• 18 different retrieval results generated by varying the retrieval algorithm, query type, and search field
• Retrieval models such as Okapi BM25, query likelihood, and the overlap coefficient used
• Pooled the top 20 QA pairs from each run, then made manual relevance judgments
– A QA pair is considered relevant as long as it is semantically identical or very similar to the query
– If no relevant QA pairs were found for a given topic, manually browse the collection to find ≥1 QA pair
• Result: 785 relevant QA pairs for A, 1,557 for B
Verifying Field Importance
• Previous research: similarity between questions is more important than similarity between questions and answers in the FAQ retrieval task
• Exp. 1: Search only the QTitle field
• Exp. 2: Only QBody
• Exp. 3: Only Answer
• For all experiments, use the query likelihood model with Dirichlet smoothing and Okapi BM25
Regardless of retrieval model, the best performance comes from searching the question title field; the performance gaps for the other fields are significant.
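The query likelihood baseline with Dirichlet smoothing can be sketched as follows (a standard textbook formulation, not the paper's code; the toy titles and the small mu value are illustrative, not the paper's settings):

```python
import math
from collections import Counter

def query_likelihood_dirichlet(query, doc, collection_words, mu=2000):
    """log P(Q|D) with Dirichlet smoothing:
       P(w|D) = (c(w, D) + mu * P(w|C)) / (|D| + mu)
    query, doc       -- lists of words
    collection_words -- all words in the collection, for the background model P(w|C)
    """
    coll = Counter(collection_words)
    coll_len = len(collection_words)
    d = Counter(doc)
    score = 0.0
    for w in query:
        p_c = coll[w] / coll_len
        p = (d[w] + mu * p_c) / (len(doc) + mu)
        if p == 0:
            return float("-inf")   # term unseen anywhere in the collection
        score += math.log(p)
    return score

# Toy title-field search: the title containing the query terms should win
query = ["multi", "booting"]
title1 = ["how", "to", "make", "multi", "booting", "systems"]
title2 = ["how", "to", "partition", "a", "hard", "disk"]
collection_words = title1 + title2
s1 = query_likelihood_dirichlet(query, title1, collection_words, mu=10)
s2 = query_likelihood_dirichlet(query, title2, collection_words, mu=10)
```

Because titles are short (5.8 words on average), the smoothing term dominates unless a title actually contains the query words, which is why title-field search is so sensitive to lexical mismatch.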
Collecting Semantically Similar Questions
• Many people don't search to see if a question has already been asked, so they ask a semantically similar question
• Assumption: if two answers are similar, then the corresponding questions are semantically similar but lexically different
• "I'd like to insert music into Powerpoint." ↔ "How can I link sounds in Powerpoint?"
• "How can I shut down my system in Dos-mode?" ↔ "How to turn off computers in Dos-mode?"
• "Photo transfer from cell phones to computers." ↔ "How to move photos taken by cell phones."
Sample semantically similar questions with little word overlap
Algorithm
• Consider 4 popular document similarity measures:
1. Cosine similarity with the vector space model
2. Negative KL divergence between language models
3. Output score of the query likelihood model
4. Score of the Okapi model
Finding a Similarity Measure: The Cosine Similarity Model
• Length of answers varies considerably
– Some very short (factoids)
– Others very long (copied and pasted from the web)
• Any similarity measure affected by length is not appropriate
Finding a Similarity Measure: Negative KL Divergence & Okapi
• Values are not symmetric and are not probabilities
– A pair of answers with a higher negative KL divergence than another pair does not necessarily have a stronger semantic connection
• Hard to rank pairs
• The Okapi model has similar problems
Finding a Similarity Measure: Query Likelihood Model
• Score is a probability
• Can be compared across different answer pairs
• Scores are NOT symmetric
Overcoming Problems
• Using ranks instead of scores was more effective
– If answer A retrieves answer B at rank r1, and answer B retrieves answer A at rank r2, then the similarity between the two answers is the reverse harmonic mean of the two ranks:

sim(A, B) = (1/2) × (1/r1 + 1/r2)

– Use the query likelihood model to calculate the initial ranks
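The reverse harmonic mean is simple enough to transcribe directly into code (a direct restatement of the formula above, nothing more):

```python
def rank_similarity(r1, r2):
    """Reverse harmonic mean of the two retrieval ranks:
       sim(A, B) = (1/2) * (1/r1 + 1/r2)
    r1 -- rank at which answer A retrieves answer B
    r2 -- rank at which answer B retrieves answer A
    Ranks are 1-based, so the score lies in (0, 1].
    """
    return 0.5 * (1.0 / r1 + 1.0 / r2)

best = rank_similarity(1, 1)    # mutual top-ranked answers score 1.0
mixed = rank_similarity(1, 4)   # asymmetric retrieval scores lower
```

Because the measure only uses ranks, it sidesteps the non-symmetric, non-comparable raw scores of the query likelihood model described on the previous slide.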
Experiments & Results
• 68,000 × 67,999 / 2 possible answer pairs from the 68,000 Q&A pairs in Collection B
• All pairs ranked using the similarity measure above
• Empirically set the threshold to 0.005
– Used to judge whether a pair is related or not
– Higher threshold = smaller but better-quality collections
– To acquire enough training samples, the threshold cannot be too high
• 331,965 pairs have a score above the threshold
Word Translation Probabilities
• The question pair collection is treated as a parallel corpus
• IBM Model 1 [1]:
– Does not require any linguistic knowledge of the source/target language; treats every word alignment equally
– Probability of translating source word s into target word t:

P(t|s) = λ_s Σ_{i=1}^{N} c(t|s; J_i)

– λ_s = normalization factor, so that the probabilities sum to 1
– N = # of training samples
– J_i = the i-th pair in the training set
Word Translation Probabilities (cont)
• {s1, …, sn} = words in the source sentence of J_i

• #(t, J_i) = number of times t occurs in J_i

• The counts still require the old translation probabilities:

c(t|s; J_i) = P(t|s) / (P(t|s1) + … + P(t|sn)) × #(t, J_i) × #(s, J_i)

• Initialize the translation probabilities with random values, then estimate new translation probabilities
– Repeat until the probabilities converge
– The procedure always converges to the same final solution [1]
[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra and R. L. Mercer. The mathematics of statistical machine translation: parameter estimation. Comput. Linguis., 19(2):263-311, 1993.
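The EM procedure above can be sketched in Python (a simplified IBM Model 1: uniform initialization instead of the random initialization described above, and no NULL word; the toy question pairs are invented for illustration):

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """EM training of IBM Model 1 translation probabilities P(t|s).

    pairs -- list of (source_words, target_words) question pairs
    Returns a nested dict: prob[s][t] = P(t|s).
    """
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    init = 1.0 / len(tgt_vocab)
    prob = {s: {t: init for t in tgt_vocab} for s in src_vocab}

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected c(t|s)
        total = defaultdict(float)
        for src, tgt in pairs:
            for t in tgt:
                # Denominator P(t|s1) + ... + P(t|sn) over the source sentence
                denom = sum(prob[s][t] for s in src)
                for s in src:
                    c = prob[s][t] / denom       # fractional alignment count
                    count[s][t] += c
                    total[s] += c
        # M-step: normalize so that sum_t P(t|s) = 1 (the lambda_s factor)
        for s in src_vocab:
            for t in tgt_vocab:
                prob[s][t] = count[s][t] / total[s] if total[s] else 0.0
    return prob

# Toy "parallel corpus" of semantically similar question pairs
pairs = [(["reboot", "computer"], ["restart", "pc"]),
         (["reboot", "system"], ["restart", "machine"])]
p = train_ibm_model1(pairs, iterations=5)
# "reboot" co-occurs with "restart" in both pairs, so EM should
# concentrate its probability mass on that translation
```

Word multiplicities (#(t, J_i) and #(s, J_i) in the formula) are handled implicitly here by iterating over word tokens rather than types.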
Experiments & Results(Word Translation)
• Removed stop words
• The collection of 331,965 question pairs was duplicated by switching source and target parts, then used as input
• As expected, the most similar word to a given word is usually the word itself
• Found semantic relationships: e.g., "bmp" is similar to "jpg" and "gif"
Question Retrieval
• How do we get from question titles and word translation probabilities to retrieval?
• Similarity between query Q and document D:

sim(Q, D) = P(Q|D) = Π_{w ∈ Q} P(w|D)

• To avoid zero probabilities and estimate more accurate language models, smooth the document model with the collection model, where P_ml(w|D) and P_ml(w|C) are the maximum likelihood probabilities of term w being generated from document D and collection C:

P(w|D) = (1 − λ) P_ml(w|D) + λ P_ml(w|C)

• In the translation model, this is converted to:

P(w|D) = (1 − λ) Σ_{t ∈ D} T(w|t) P_ml(t|D) + λ P_ml(w|C)
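The translation-based document model can be sketched as follows (an illustrative implementation of the formulas above; the toy translation table, documents, and λ = 0.5 are invented for the example, not the paper's trained values):

```python
import math

def p_ml(w, words):
    """Maximum likelihood probability of w in a bag of words."""
    return words.count(w) / len(words)

def translation_lm_score(query, doc, collection_words, T, lam=0.5):
    """log P(Q|D) under the translation-based language model:
       P(w|D) = (1-lam) * sum_{t in D} T(w|t) * P_ml(t|D) + lam * P_ml(w|C)
    T is a nested dict, T[t][w] = T(w|t): the probability of "translating"
    document word t into query word w (e.g., learned via IBM Model 1).
    """
    score = 0.0
    for w in query:
        trans = sum(T.get(t, {}).get(w, 0.0) * p_ml(t, doc) for t in set(doc))
        p = (1 - lam) * trans + lam * p_ml(w, collection_words)
        if p == 0:
            return float("-inf")
        score += math.log(p)
    return score

# Toy setup: "reboot" translates to "restart", so a title about rebooting
# can match the query "restart" despite zero word overlap
collection_words = ["reboot", "computer", "shutdown", "printer", "restart"]
doc1 = ["reboot", "computer"]
doc2 = ["shutdown", "printer"]
T = {"reboot": {"restart": 0.6, "reboot": 0.4},
     "computer": {"pc": 0.5, "computer": 0.5}}
query = ["restart"]
s1 = translation_lm_score(query, doc1, collection_words, T)
s2 = translation_lm_score(query, doc2, collection_words, T)
```

With λ = 0 this collapses to pure translation matching; with T(w|t) = 1 only when w = t it collapses to the smoothed query likelihood model above.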
Experiments & Results(Question Retrieval)
• 50 short queries from Collection B, searching only the title field
• Similarities between query Q and question titles calculated
• Compare the model's performance with the vector space model with cosine similarity, Okapi BM25, and the query likelihood language model
Experiments & Results cont…(Question Retrieval)
Model              Cosine   LM      Okapi   Trans
MAP                0.183    0.258   0.251   0.314
R-Precision @ 5    0.368    0.492   0.476   0.520
R-Precision @ 10   0.310    0.456   0.436   0.480
• The translation-based approach outperforms the other baseline models at all recall levels
• QL and Okapi show comparable performance
• In all evaluations, the approach outperforms the other models
Conclusions and Seminal Paper Relevance
• A retrieval model based on translation probabilities learned from the archive significantly outperforms other approaches at finding semantically similar questions despite lexical mismatch
• Using translation probabilities and determining the similarity of answers is a much more robust approach to resolving similar QA pairs, with fewer prerequisites on the corpus
References
• Burke, R. D., Hammond, K. J., Kulyukin, V. A., Lytinen, S. L., Tomuro, N., & Schoenberg, S. (1997). Question answering from frequently asked question files: Experience with the FAQ finder system (Tech. Rep.). Chicago, IL, USA.
• Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM international conference on Information and knowledge management (CIKM '05). ACM, New York, NY, USA, 84-90.