A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies
Preslav Nakov and Marti Hearst
Computer Science Division and SIMS
University of California, Berkeley
Supported by NSF DBI-0317510 and a gift from Genentech
Overview
Web as a corpus: n-gram frequencies
Concern: instability of the n-gram estimates
We study the impact of this variability on a particular task:
  across time
  across search engines
  with and without a language filter
  with and without inflections
Introduction
(Banko & Brill, 2001), "Scaling to Very Very Large Corpora for Natural Language Disambiguation", ACL 2001
Simple task: choose from a set of commonly confused words for a given context, e.g. {principle, principal}
Training data comes for free, assuming correct usage in the raw text.
Log-linear improvement in accuracy, even up to a billion words
=> Getting more data is better than fine-tuning algorithms.
Today the obvious source of very large data is the Web.
Web as a Corpus
Machine Translation: (Grefenstette 98; Resnik 99; Cao & Li 02; Way & Gough 03)
Question Answering: (Dumais et al. 02; Soricut & Brill 04)
Word Sense Disambiguation: (Mihalcea & Moldovan 99; Rigau et al. 02; Santamaría et al. 03; Zahariev 04)
Extraction of Semantic Relations: (Chklovski & Pantel 04; Szpektor et al. 04; Shinzato & Torisawa 04)
Anaphora Resolution: (Modjeska et al. 03)
Prepositional Phrase Attachment: (Volk 01; Calvo & Gelbukh 03; Nakov & Hearst 05)
Language Modeling: (Zhu & Rosenfeld 01; Keller & Lapata 03)
Page Hits as a Proxy for n-gram Frequencies
Plausibility: (Keller & Lapata 03) demonstrate a high correlation between:
  page hits and corpus bigram frequencies
  page hits and human plausibility judgments
Web as a baseline (Lapata & Keller 05): machine translation candidate selection, spelling correction, adjective ordering, article generation, noun compound bracketing, noun compound interpretation, countability detection, and prepositional phrase attachment
More than a baseline: state-of-the-art results for noun compound bracketing (Nakov & Hearst 05)
Web Count Problems (1): Page hits are not really n-gram frequencies
This may be OK (Keller & Lapata, 2003)
The Web lacks linguistic annotation; we cannot query for patterns such as:
  stem cells VERB PREPOSITION brain
  protein synthesis' inhibition
Pr(health|care) = #("health care") / #(care)
  "health" is a noun, but "care" is both a verb and a noun
  the two words can be adjacent by chance
  they can come from different sentences
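
For concreteness, here is a minimal sketch of how such an estimate could be computed from raw page hits; the counts below are made up for illustration, since real hits vary by engine and day (which is exactly the instability studied here):

# Hypothetical page-hit counts (illustrative only).
hits = {
    '"health care"': 234_000_000,  # exact-phrase query
    'care': 871_000_000,           # single-word query
}

# Page-hit proxy for Pr(health | care): the fraction of pages matching
# "care" that also match the exact phrase "health care".
pr_health_given_care = hits['"health care"'] / hits['care']
print(f"Pr(health|care) ~ {pr_health_given_care:.3f}")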
Web Count Problems (2): Instability of the n-gram counts
Dynamics over time
Query inconsistencies:
  indexes spread across multiple machines
  multiple (inconsistent) index copies
  search engine "dancing" (tool at: http://www.seochat.com/googledance)
Problem: Web experiments are not reproducible.
Web Count Problems (3): Rounding of page hits
Exact estimates: MSN always; Google and Yahoo for small numbers only
Possible reasons for rounding:
  exact counts are not necessary for typical users
  expensive to compute: distributed index, constant changes
Under high loads, search engines probably sample from their indexes.
The Task
Problem: What is the impact of n-gram variability (inconsistencies, rounding, etc.)?
Approach: measure NOT the absolute n-gram variability, BUT its effect on a real task:
  noun compound bracketing, which allows for the use of n-grams of different lengths
Our Particular Task:
Noun Compound Bracketing
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ]  (left bracketing)
(b) [ liver [ cell line ] ]      (right bracketing)
In (a), the antibody targets the liver cell; in (b), the cell line is derived from the liver.
Related Work
Marcus (1980), Pustejovsky et al. (1993), Resnik (1993): adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
Lauer (1995): dependency model: Pr(w1|w2) vs. Pr(w1|w3)
Keller & Lapata (2004): use the Web, unigrams and bigrams
Girju et al. (2005): supervised model; bracketing in context; requires WordNet senses to be given
Here Pr(wi|wj) is the probability that wi precedes (modifies) wj.
Adjacency & Dependency (1)
right bracketing: [ w1 [ w2 w3 ] ]
  w2w3 is a compound, modified by w1: home health care
  or w1 and w2 independently modify w3: adult male rat
left bracketing: [ [ w1 w2 ] w3 ]
  only one modificational choice is possible: law enforcement officer
Adjacency & Dependency (2)
right bracketing: [ w1 [ w2 w3 ] ]
  w2w3 is a compound (modified by w1)
  or w1 and w2 independently modify w3
adjacency model: is w2w3 a compound (vs. w1w2 being a compound)?
dependency model: does w1 modify w3 (vs. w1 modifying w2)?
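
A minimal sketch of the two decision rules, using the Pr(wi|wj) = #(wi wj) / #(wj) convention above; the count() stub and its numbers are made up for illustration:

# Illustrative page-hit counts (made up for the example).
HITS = {"liver cell": 137_000, "cell line": 1_860_000, "liver line": 900,
        "cell": 355_000_000, "line": 912_000_000}

def count(phrase):
    """Stand-in for an exact-phrase page-hit query."""
    return HITS.get(phrase, 1)  # smooth unseen phrases to avoid division by zero

def bracket(w1, w2, w3, model="adjacency"):
    """Return 'left' or 'right' bracketing for the compound 'w1 w2 w3'."""
    def pr(wi, wj):
        # Pr(wi|wj): probability that wi precedes (modifies) wj
        return count(f"{wi} {wj}") / count(wj)

    if model == "adjacency":
        left, right = pr(w1, w2), pr(w2, w3)  # is w1w2 or w2w3 the compound?
    else:  # dependency
        left, right = pr(w1, w2), pr(w1, w3)  # does w1 modify w2 or w3?
    return "left" if left > right else "right"

print(bracket("liver", "cell", "line"))  # -> 'right' with these counts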
Probabilities: Why?
Why should we use (a) Pr(w1w2|w2) rather than (b) Pr(w2w1|w1)?
Keller & Lapata (2004) compare the two empirically:
  AltaVista queries: (a) 70.49%, (b) 68.85%
  British National Corpus: (a) 63.11%, (b) 65.57%
Association Models: χ² (Chi Squared)
A = #(wi, wj)
B = #(wi) − #(wi, wj)
C = #(wj) − #(wi, wj)
D = N − (A + B + C)
N = A + B + C + D = 8 trillion (8 billion Web pages × 1,000 words each)
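
A minimal sketch of the resulting score, using the standard 2×2 χ² formula over these four cells (the function and argument names are mine):

def chi_squared(n_ij, n_i, n_j, n=8_000_000_000 * 1_000):
    """2x2 chi-squared association score from n-gram counts.

    n_ij: co-occurrence count #(wi, wj)
    n_i, n_j: unigram counts #(wi), #(wj)
    n: total word count; 8 billion pages x 1,000 words = 8 trillion
    """
    a = n_ij             # wi and wj together
    b = n_i - n_ij       # wi without wj
    c = n_j - n_ij       # wj without wi
    d = n - (a + b + c)  # neither
    # chi^2 = N * (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D))
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))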
Web-derived Surface Features: Possessive Marker
Attached to the first word: brain's stem cell => right
Attached to the second word: brain stem's cell => left
We can query directly for possessives: search engines drop the possessive marker but keep the s.
Still, we cannot query for "brain stems' cell".
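
As a sketch, the feature reduces to a count comparison (count() is the page-hit stub from the earlier bracketing sketch; the helper name is mine):

def possessive_vote(w1, w2, w3):
    """Vote from possessive placement; None means the feature abstains."""
    right = count(f"{w1}'s {w2} {w3}")  # brain's stem cell -> right
    left = count(f"{w1} {w2}'s {w3}")   # brain stem's cell -> left
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None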
Other Web-derived Features: Abbreviation
After the second word: tumor necrosis (TN) factor => left
After the third word: tumor necrosis factor (NF) => right
We query for, e.g., "tumor necrosis tn factor"
Problems: Roman digits (IV, vii), US states (CA), short words (me)
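
A sketch of this feature with the guards above; the filter sets are illustrative stand-ins, not the paper's exact lists, and count() is again the page-hit stub:

import re

ROMAN = re.compile(r"^[ivxlcdm]+$", re.IGNORECASE)  # IV, vii, ...
STATES = {"ca", "ny", "tx"}                         # illustrative subset
SHORT_WORDS = {"me", "is", "an"}                    # illustrative subset

def plausible_abbrev(abbrev):
    """Filter out strings that look like abbreviations but are not."""
    a = abbrev.lower()
    return not (ROMAN.match(a) or a in STATES or a in SHORT_WORDS)

def abbreviation_vote(w1, w2, w3):
    ab12 = w1[0] + w2[0]  # abbreviates w1 w2, e.g. "tn"
    ab23 = w2[0] + w3[0]  # abbreviates w2 w3, e.g. "nf"
    left = count(f"{w1} {w2} {ab12} {w3}") if plausible_abbrev(ab12) else 0
    right = count(f"{w1} {w2} {w3} {ab23}") if plausible_abbrev(ab23) else 0
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None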
Other Web-derived Features: Concatenation
Consider health care reform:
  healthcare: 79,500,000
  carereform: 269
  healthreform: 812
Adjacency model: healthcare vs. carereform
Dependency model: healthcare vs. healthreform
Triples (adjacency): "healthcare reform" vs. "health carereform"
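
The same vote pattern applies; a sketch of the pairwise version, again over the count() stub:

def concatenation_vote(w1, w2, w3, model="adjacency"):
    """Compare concatenated forms; a frequent w1+w2 concatenation
    (e.g. 'healthcare') is evidence for left bracketing."""
    left = count(w1 + w2)      # healthcare
    if model == "adjacency":
        right = count(w2 + w3) # carereform
    else:  # dependency: w1 attaching to w3 favors right bracketing
        right = count(w1 + w3) # healthreform
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None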
Other Web-derived Features: Reorder
Reorderings of "health care reform" keep the hypothesized constituent intact:
  "care reform health" => right
  "reform health care" => left
Other Web-derived Features: Internal Inflection Variability
First word varies: bone mineral density ~ bones mineral density => right
Second word varies: bone mineral density ~ bone minerals density => left
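
A sketch of this feature; the study generates inflections with Carroll's morphological tools (see the experimental setup below), so the naive pluralize() here is only a stand-in:

def pluralize(word):
    """Naive stand-in for a real morphological generator."""
    return word + ("es" if word.endswith(("s", "x", "ch", "sh")) else "s")

def inflection_vote(w1, w2, w3):
    # Variability of the first word suggests right bracketing;
    # variability of the second word suggests left bracketing.
    right = count(f"{pluralize(w1)} {w2} {w3}")  # bones mineral density
    left = count(f"{w1} {pluralize(w2)} {w3}")   # bone minerals density
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None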
Experiments
Lauer (1995) dataset: 244 noun compounds (NCs) from Grolier's encyclopedia
Inter-annotator agreement: 81.5%
Exact-phrase queries (minimum frequency: 5)
Inflections: Carroll's morphological tools
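
The precision/recall numbers on the following slides can be read off an evaluation loop like this sketch, assuming (as the parenthesized recall suggests) that recall is used in the coverage sense, i.e. the fraction of compounds on which the model makes a decision:

def evaluate(dataset, predict):
    """Precision/recall for a model that may abstain.

    dataset: list of ((w1, w2, w3), gold) pairs, gold in {'left', 'right'}
    predict: returns 'left', 'right', or None (abstain, e.g. when counts
             fall below the minimum frequency of 5)
    """
    answered = correct = 0
    for (w1, w2, w3), gold in dataset:
        guess = predict(w1, w2, w3)
        if guess is None:
            continue  # abstentions lower recall, not precision
        answered += 1
        correct += guess == gold
    precision = correct / answered if answered else 0.0
    recall = answered / len(dataset) if dataset else 0.0  # coverage
    return precision, recall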
Experiments
4 dimensions: time, search engine, language filter, inflected forms
Comparison over time (P): Google
Precision (in %) for any language, no inflections. Average recall is shown in parentheses.
Varying time intervals, in case index changes happen periodically
Comparison over time (P): MSN
Precision (in %) for any language, no inflections. Average recall is shown in parentheses.
Some of the differences are statistically significant.
Experiments
4 dimensions: time, search engine, language filter, inflected forms
Comparison by search engine (P): for 6/6/2005
Precision (in %) for any language, no inflections, for 6/6/2005. Average recall is shown in parentheses.
Some of the differences are statistically significant.
Comparison by search engine (R): for 6/6/2005
Recall (in %) for any language, no inflections, for 6/6/2005.
Not much variability in recall (but Google has the biggest index).
Experiments
4 dimensions: time, search engine, language filter, inflected forms
Comparison by language (P): any language vs. English
Precision (in %), no inflections, for 6/6/2005.
Minor, inconsistent impact on precision.
Comparison by language (R): any language vs. English
Recall (in %), no inflections, for 6/6/2005.
Minor, consistent drop in recall.
Experiments
4 dimensions: time, search engine, language filter, inflected forms
Comparison by search engine (P): inflections
Precision (in %), any language, for 6/6/2005.
Minor, inconsistent impact on precision.
Comparison by search engine (R): inflections
Recall (in %), any language, for 6/6/2005.
Minor, consistent improvement in recall.
Conclusions and Future Work
Good news: n-gram variability does not have a statistically significant impact on performance (for our task).
Future work: other NLP tasks; other languages.
The End
Thank you!