A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies
Preslav Nakov and Marti Hearst
Computer Science Division and SIMS
University of California, Berkeley
Supported by NSF DBI-0317510 and a gift from Genentech
Overview
Web as a corpus: n-gram frequencies
Concern: instability of the n-gram estimates
We study the impact of this variability on a particular task:
  across time
  across search engines
  with and without a language filter
  with and without inflections
Introduction
(Banko & Brill, 2001), "Scaling to Very Very Large Corpora for Natural Language Disambiguation", ACL 2001
Simple task: choose from a set of commonly confused words for a given context, e.g. {principle, principal}
Training data comes for free, assuming correct usage in the raw text.
Log-linear improvement in accuracy, even up to a billion words
=> Getting more data is better than fine-tuning algorithms.
Today the obvious source of very large data is the Web.
Web as a Corpus
Machine Translation: (Grefenstette 98; Resnik 99; Cao & Li 02; Way & Gough 03)
Question Answering: (Dumais et al. 02; Soricut & Brill 04)
Word Sense Disambiguation: (Mihalcea & Moldovan 99; Rigau et al. 02; Santamaría et al. 03; Zahariev 04)
Extraction of Semantic Relations: (Chklovski & Pantel 04; Szpektor et al. 04; Shinzato & Torisawa 04)
Anaphora Resolution: (Modjeska et al. 03)
Prepositional Phrase Attachment: (Volk 01; Calvo & Gelbukh 03; Nakov & Hearst 05)
Language Modeling: (Zhu & Rosenfeld 01; Keller & Lapata 03)
Page Hits as a Proxy for n-gram Frequencies
Plausibility: (Keller & Lapata 03) demonstrate a high correlation between:
  page hits and corpus bigram frequencies
  page hits and human plausibility judgments
Web as a baseline (Lapata & Keller 05): machine translation candidate selection, spelling correction, adjective ordering, article generation, noun compound bracketing, noun compound interpretation, countability detection, and prepositional phrase attachment
More than a baseline: state-of-the-art results for noun compound bracketing (Nakov & Hearst 05)
Web Count Problems (1): Page hits are not really n-gram frequencies
This may be OK (Keller & Lapata, 2003)
The Web lacks linguistic annotation; we cannot query for patterns such as:
  stem cells VERB PREPOSITION brain
  protein synthesis' inhibition
Pr(health|care) = #("health care") / #(care)
  "health" is a noun, but "care" is both a verb and a noun
  the two words can be adjacent by chance
  they can come from different sentences
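
For concreteness, here is a minimal sketch of how such an estimate could be computed from raw page hits; the counts below are made up for illustration, since real hits vary by engine and day (which is exactly the instability studied here):

# Hypothetical page-hit counts (illustrative only).
hits = {
    '"health care"': 234_000_000,  # exact-phrase query
    'care': 871_000_000,           # single-word query
}

# Page-hit proxy for Pr(health | care): the fraction of pages matching
# "care" that also match the exact phrase "health care".
pr_health_given_care = hits['"health care"'] / hits['care']
print(f"Pr(health|care) ~ {pr_health_given_care:.3f}")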
Web Count Problems (2): Instability of the n-gram counts
Dynamics over time
Query inconsistencies:
  indexes spread across multiple machines
  multiple (inconsistent) index copies
  search engine "dancing" (tool at: http://www.seochat.com/googledance)
Problem: Web experiments are not reproducible.
Web Count Problems (3): Rounding of page hits
Exact estimates: MSN always; Google and Yahoo for small numbers only
Possible reasons for rounding:
  exact counts are not necessary for typical users
  expensive to compute: distributed index, constant changes
Under high loads, search engines probably sample from their indexes.
The Task
Problem: What is the impact of n-gram variability (inconsistencies, rounding, etc.)?
Approach: measure NOT the absolute n-gram variability, BUT its effect on a real task:
  noun compound bracketing, which allows for the use of n-grams of different lengths
Our Particular Task:
Noun Compound Bracketing
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ]  (left bracketing)
(b) [ liver [ cell line ] ]      (right bracketing)
In (a), the antibody targets the liver cell; in (b), the cell line is derived from the liver.
Related Work
Marcus (1980), Pustejovsky et al. (1993), Resnik (1993): adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
Lauer (1995): dependency model: Pr(w1|w2) vs. Pr(w1|w3)
Keller & Lapata (2004): use the Web, unigrams and bigrams
Girju et al. (2005): supervised model; bracketing in context; requires WordNet senses to be given
Here Pr(wi|wj) is the probability that wi precedes (modifies) wj.
Adjacency & Dependency (1)
right bracketing: [ w1 [ w2 w3 ] ]
  w2w3 is a compound, modified by w1: home health care
  or w1 and w2 independently modify w3: adult male rat
left bracketing: [ [ w1 w2 ] w3 ]
  only one modificational choice is possible: law enforcement officer
Adjacency & Dependency (2)
right bracketing: [ w1 [ w2 w3 ] ]
  w2w3 is a compound (modified by w1)
  or w1 and w2 independently modify w3
adjacency model: is w2w3 a compound (vs. w1w2 being a compound)?
dependency model: does w1 modify w3 (vs. w1 modifying w2)?
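
A minimal sketch of the two decision rules, using the Pr(wi|wj) = #(wi wj) / #(wj) convention above; the count() stub and its numbers are made up for illustration:

# Illustrative page-hit counts (made up for the example).
HITS = {"liver cell": 137_000, "cell line": 1_860_000, "liver line": 900,
        "cell": 355_000_000, "line": 912_000_000}

def count(phrase):
    """Stand-in for an exact-phrase page-hit query."""
    return HITS.get(phrase, 1)  # smooth unseen phrases to avoid division by zero

def bracket(w1, w2, w3, model="adjacency"):
    """Return 'left' or 'right' bracketing for the compound 'w1 w2 w3'."""
    def pr(wi, wj):
        # Pr(wi|wj): probability that wi precedes (modifies) wj
        return count(f"{wi} {wj}") / count(wj)

    if model == "adjacency":
        left, right = pr(w1, w2), pr(w2, w3)  # is w1w2 or w2w3 the compound?
    else:  # dependency
        left, right = pr(w1, w2), pr(w1, w3)  # does w1 modify w2 or w3?
    return "left" if left > right else "right"

print(bracket("liver", "cell", "line"))  # -> 'right' with these counts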
Probabilities: Why?
Why should we use (a) Pr(w1w2|w2) rather than (b) Pr(w2w1|w1)?
Keller & Lapata (2004) compare the two empirically:
  AltaVista queries: (a) 70.49%, (b) 68.85%
  British National Corpus: (a) 63.11%, (b) 65.57%
Association Models: χ² (Chi Squared)
A = #(wi, wj)
B = #(wi) − #(wi, wj)
C = #(wj) − #(wi, wj)
D = N − (A + B + C)
N = A + B + C + D = 8 trillion (8 billion Web pages × 1,000 words each)
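
A minimal sketch of the resulting score, using the standard 2×2 χ² formula over these four cells (the function and argument names are mine):

def chi_squared(n_ij, n_i, n_j, n=8_000_000_000 * 1_000):
    """2x2 chi-squared association score from n-gram counts.

    n_ij: co-occurrence count #(wi, wj)
    n_i, n_j: unigram counts #(wi), #(wj)
    n: total word count; 8 billion pages x 1,000 words = 8 trillion
    """
    a = n_ij             # wi and wj together
    b = n_i - n_ij       # wi without wj
    c = n_j - n_ij       # wj without wi
    d = n - (a + b + c)  # neither
    # chi^2 = N * (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D))
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))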
Web-derived Surface Features: Possessive Marker
Attached to the first word: brain's stem cell => right
Attached to the second word: brain stem's cell => left
We can query directly for possessives: search engines drop the possessive marker but keep the s.
Still, we cannot query for "brain stems' cell".
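
As a sketch, the feature reduces to a count comparison (count() is the page-hit stub from the earlier bracketing sketch; the helper name is mine):

def possessive_vote(w1, w2, w3):
    """Vote from possessive placement; None means the feature abstains."""
    right = count(f"{w1}'s {w2} {w3}")  # brain's stem cell -> right
    left = count(f"{w1} {w2}'s {w3}")   # brain stem's cell -> left
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None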
Other Web-derived Features: Abbreviation
After the second word: tumor necrosis (TN) factor => left
After the third word: tumor necrosis factor (NF) => right
We query for, e.g., "tumor necrosis tn factor"
Problems: Roman digits (IV, vii), US states (CA), short words (me)
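
A sketch of this feature with the guards above; the filter sets are illustrative stand-ins, not the paper's exact lists, and count() is again the page-hit stub:

import re

ROMAN = re.compile(r"^[ivxlcdm]+$", re.IGNORECASE)  # IV, vii, ...
STATES = {"ca", "ny", "tx"}                         # illustrative subset
SHORT_WORDS = {"me", "is", "an"}                    # illustrative subset

def plausible_abbrev(abbrev):
    """Filter out strings that look like abbreviations but are not."""
    a = abbrev.lower()
    return not (ROMAN.match(a) or a in STATES or a in SHORT_WORDS)

def abbreviation_vote(w1, w2, w3):
    ab12 = w1[0] + w2[0]  # abbreviates w1 w2, e.g. "tn"
    ab23 = w2[0] + w3[0]  # abbreviates w2 w3, e.g. "nf"
    left = count(f"{w1} {w2} {ab12} {w3}") if plausible_abbrev(ab12) else 0
    right = count(f"{w1} {w2} {w3} {ab23}") if plausible_abbrev(ab23) else 0
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None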
Other Web-derived Features: Concatenation
Consider health care reform:
  healthcare: 79,500,000
  carereform: 269
  healthreform: 812
Adjacency model: healthcare vs. carereform
Dependency model: healthcare vs. healthreform
Triples (adjacency): "healthcare reform" vs. "health carereform"
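
The same vote pattern applies; a sketch of the pairwise version, again over the count() stub:

def concatenation_vote(w1, w2, w3, model="adjacency"):
    """Compare concatenated forms; a frequent w1+w2 concatenation
    (e.g. 'healthcare') is evidence for left bracketing."""
    left = count(w1 + w2)      # healthcare
    if model == "adjacency":
        right = count(w2 + w3) # carereform
    else:  # dependency: w1 attaching to w3 favors right bracketing
        right = count(w1 + w3) # healthreform
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None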
Other Web-derived Features: Reorder
Reorderings of "health care reform" keep the hypothesized constituent intact:
  "care reform health" => right
  "reform health care" => left
Other Web-derived Features: Internal Inflection Variability
First word varies: bone mineral density ~ bones mineral density => right
Second word varies: bone mineral density ~ bone minerals density => left
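
A sketch of this feature; the study generates inflections with Carroll's morphological tools (see the experimental setup below), so the naive pluralize() here is only a stand-in:

def pluralize(word):
    """Naive stand-in for a real morphological generator."""
    return word + ("es" if word.endswith(("s", "x", "ch", "sh")) else "s")

def inflection_vote(w1, w2, w3):
    # Variability of the first word suggests right bracketing;
    # variability of the second word suggests left bracketing.
    right = count(f"{pluralize(w1)} {w2} {w3}")  # bones mineral density
    left = count(f"{w1} {pluralize(w2)} {w3}")   # bone minerals density
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None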
Experiments
Lauer (1995) dataset: 244 noun compounds (NCs) from Grolier's encyclopedia
Inter-annotator agreement: 81.5%
Exact-phrase queries (minimum frequency: 5)
Inflections: Carroll's morphological tools
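
The precision/recall numbers on the following slides can be read off an evaluation loop like this sketch, assuming (as the parenthesized recall suggests) that recall is used in the coverage sense, i.e. the fraction of compounds on which the model makes a decision:

def evaluate(dataset, predict):
    """Precision/recall for a model that may abstain.

    dataset: list of ((w1, w2, w3), gold) pairs, gold in {'left', 'right'}
    predict: returns 'left', 'right', or None (abstain, e.g. when counts
             fall below the minimum frequency of 5)
    """
    answered = correct = 0
    for (w1, w2, w3), gold in dataset:
        guess = predict(w1, w2, w3)
        if guess is None:
            continue  # abstentions lower recall, not precision
        answered += 1
        correct += guess == gold
    precision = correct / answered if answered else 0.0
    recall = answered / len(dataset) if dataset else 0.0  # coverage
    return precision, recall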
Experiments
4 dimensions: time, search engine, language filter, inflected forms
Comparison over time (P): Google
Precision (in %) for any language, no inflections. Average recall is shown in parentheses.
Varying time intervals, in case index changes happen periodically
Comparison over time (P): MSN
Precision (in %) for any language, no inflections. Average recall is shown in parentheses.
Some of the differences are statistically significant.
Experiments
4 dimensions: time, search engine, language filter, inflected forms
Comparison by search engine (P): for 6/6/2005
Precision (in %) for any language, no inflections, for 6/6/2005. Average recall is shown in parentheses.
Some of the differences are statistically significant.
Comparison by search engine (R): for 6/6/2005
Recall (in %) for any language, no inflections, for 6/6/2005.
Not much variability in recall (but Google has the biggest index).
Experiments
4 dimensions: time, search engine, language filter, inflected forms
Comparison by language (P): any language vs. English
Precision (in %), no inflections, for 6/6/2005.
Minor, inconsistent impact on precision.
Comparison by language (R): any language vs. English
Recall (in %), no inflections, for 6/6/2005.
Minor, consistent drop in recall.
Experiments
4 dimensions: time, search engine, language filter, inflected forms
Comparison by search engine (P): inflections
Precision (in %), any language, for 6/6/2005.
Minor, inconsistent impact on precision.
Comparison by search engine (R): inflections
Recall (in %), any language, for 6/6/2005.
Minor, consistent improvement in recall.
Conclusions and Future Work
Good news: n-gram variability does not have a statistically significant impact on performance (for our task).
Future work: other NLP tasks; other languages.
The End
Thank you!