
Linguistic research with large annotated web corpora

Felix Bildhauer and Roland Schäfer

German Grammar and Linguistics (FU Berlin)

HPSG 2013 pre-conference tutorial

August 26, Berlin


Overview

Why use web corpora?

Web corpus construction: Overview

Data collection

Boilerplate detection

Document filtering

Duplication

Linguistic post-processing

Evaluation

Felix Bildhauer and Roland Schäfer 2013, German Grammar and Linguistics (FU Berlin)

Why use web corpora?

Why corpora from the web?

Size: alleviates data sparseness problems with rare phenomena

Content: registers not found in traditional corpora (forum discussions, blogs, fan-fiction etc.)

Replicability: research with a static corpus vs. research based on search engine results

Availability: data can be locally processed with preferred tools

Cost: (virtually) free


Some potential drawbacks of web corpora

Noise: from properties of web documents and from imperfect processing

Metadata (of the kind linguists are interested in): not encoded in web documents in a reliable way

Web corpus construction: Overview

Web corpus construction: workflow

Data collection

↓

Removal of markup, scripts etc.

↓

Detection/removal of "boilerplate"

↓

Detection/removal of "non-texts"

↓

Detection/removal of (near-)duplicates

↓

Linguistic post-processing

These steps involve design decisions which affect the properties of the final corpus.


Workflow II

Data collection: Which sampling procedure/crawling strategy?

↓

Removal of markup, scripts etc.

↓

Detection/removal of "boilerplate": What should count as "boilerplate"?

↓

Detection/removal of "non-texts": Which documents contribute "good text"?

↓

Detection/removal of (near-)duplicates: Which amount of duplication is acceptable?

↓

Linguistic post-processing: How do we treat non-standard orthography, spelling errors, non-words etc.?

Data collection

Links

The web consists of pages and links between them.

Each page has an in-degree (no. of links to that page) and an out-degree (no. of links on that page).

Pages with a higher in-degree are easier to find, but not necessarily the more interesting ones for corpus construction.
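To make the two notions concrete, here is a minimal sketch that counts in-degree and out-degree from a list of links. The page names and the `degrees` helper are made up for illustration and are not part of the tutorial.

```python
from collections import defaultdict

def degrees(links):
    """Return (in_degree, out_degree) dicts for a list of (source, target) links."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in links:
        out_degree[source] += 1   # one more link ON the source page
        in_degree[target] += 1    # one more link TO the target page
        # make sure every page shows up in both tables, even with degree 0
        in_degree[source] += 0
        out_degree[target] += 0
    return dict(in_degree), dict(out_degree)

# Hypothetical toy web with four pages.
toy_links = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_deg, out_deg = degrees(toy_links)
print(in_deg)
print(out_deg)
```

In this toy graph, page `c` is the easiest to find (three links point to it), while `a` and `d` have in-degree 0 and can only be reached if they are seed pages.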


The structure of the web

[Figure: components of the web graph (IN, SCC, OUT, TUBE, TENDRIL); Manning et al., 2009, 427]

Contrary to common intuition, Broder et al. [2000] found that the IN, OUT, SCC, and TENDRIL components are not extremely different in size.

A more detailed report on the sizes: Serrano et al. [2007].


Inaccessible web content (deep web)

explicitly hidden from spiders (Robots Exclusion, intentional fooling/spoofing)

requiring login (social networks, some forums, intranet gateways)

in-degree = 0

in-degree > 0, but too far away from any seed URL (which are mostly search-engine indexed pages; depends on crawling strategy)


Static and dynamic web content

By definition, static web content is any page which is not generated by a database/content generation system, but edited manually.

Today, most content is dynamic (CMS, blogs, etc.), and the static/dynamic distinction should not play a great role in crawling for corpus construction.

The problem is rather the separation of (static or dynamic) linguistically relevant from irrelevant content (time tables, product listings, etc.).


Top-level domains and national dialects

It is common practice to crawl national TLDs for content in the corresponding national language.

Cook and Hirst [2012] conclude that the TLDs .ca and .uk indeed represent national variants of English.

Problems with TLD crawls:

In some countries, .com has a high prestige: http://elpais.com, http://www.lavanguardia.com/, http://www.marca.com/

TLDs of countries with more than one official language (e.g., Indonesia, Spain) yield relatively fewer results in the target language.


Breadth-first crawling

Common practice (e.g., WaCky, COW): follow all links on a page

Pro:

effective: yields lots of data within a short time

open source software available

Con:

very susceptible to page rank: no uniform random sample

unknown sampling bias
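The "follow all links on a page" strategy can be sketched as a queue-based traversal. This is a toy illustration over an in-memory link table, not the crawler used for WaCky or COW; `TOY_WEB`, `get_links` and `max_pages` are assumed names.

```python
from collections import deque

def breadth_first_crawl(seeds, get_links, max_pages):
    """Visit pages in breadth-first order, following all links on each page."""
    queue = deque(seeds)
    seen = set(seeds)          # pages already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for target in get_links(url):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

# Hypothetical link structure standing in for real HTTP fetches.
TOY_WEB = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": ["seed"]}
order = breadth_first_crawl(["seed"], lambda u: TOY_WEB.get(u, []), max_pages=10)
print(order)
```

Pages close to the seeds and pages with many incoming links enter the queue early, which is one way to see why the resulting sample is not uniform.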


Alternative: Random walks

Not yet used for web corpora: follow one link on a page

Pro:

known sampling bias: can be corrected for

uniform random sample of web documents

Con:

much less effective than BF

probably not suitable for giga-token corpora
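For contrast with breadth-first crawling, a plain random walk follows a single randomly chosen link per page. The sketch below shows only the walk itself; the bias correction mentioned above is not implemented, and the dead-end policy (jump back to the start page) is an assumption made for this example.

```python
import random

def random_walk(start, get_links, steps, rng):
    """Follow one randomly chosen link per page for a fixed number of steps."""
    path = [start]
    current = start
    for _ in range(steps):
        targets = get_links(current)
        if targets:
            current = rng.choice(targets)
        else:
            # dead end: restart from the start page (assumed policy)
            current = start
        path.append(current)
    return path

# Hypothetical link structure standing in for real HTTP fetches.
TOY_WEB = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": ["seed"]}
rng = random.Random(0)  # fixed seed so that the walk is repeatable
path = random_walk("seed", lambda u: TOY_WEB.get(u, []), steps=20, rng=rng)
print(path)
```

Each step yields at most one new page, which illustrates why this strategy collects data far more slowly than following all links.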

Boilerplate detection

Boilerplate is text accompanying the main text on a web page:

navigation bars

banners

advertisements

other layout elements

copyright notes

Boilerplate is typically not text produced by a person on a particular occasion, but rather

generated by content management systems

similar or identical on many web pages of the same website

similar or identical across websites


Why detect boilerplate?

Boilerplate elements bias the frequency of linguistic items in the final corpus.

One of the most frequent tokens in an experimental German web corpus: mehr 'more', as in read more...

One of the most frequent sentences in an experimental English web corpus: You are not allowed to post new content in the forum.


Text (I-III)


What should count as boilerplate? (I-X)


Automatic detection of boilerplate

Boilerplate must be detected automatically.

Machine learning techniques:

Manually annotate a number of paragraphs: boilerplate yes/no

Calculate a number of features from each paragraph, e.g., number of text characters, number of HTML tags, position in HTML document

Train a classifier to reproduce the human's decisions

We use an artificial neural network (multilayer perceptron).
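As an illustration of the feature step, the sketch below computes the three example features named above for each paragraph. It is not the authors' feature set or code; the regular expression, the function name and the sample paragraphs are assumptions, and a real system would use an HTML parser and then feed such vectors to a classifier.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, good enough for a sketch

def paragraph_features(html_paragraph, index, total):
    """Return simple features for the paragraph at position `index` of `total`."""
    tags = TAG_RE.findall(html_paragraph)
    text = TAG_RE.sub("", html_paragraph).strip()
    return {
        "text_chars": len(text),                          # number of text characters
        "html_tags": len(tags),                           # number of HTML tags
        "relative_position": index / max(total - 1, 1),   # 0.0 = top, 1.0 = bottom
    }

# Hypothetical paragraphs: a navigation bar and a sentence of running text.
paragraphs = [
    '<a href="/">Home</a> <a href="/news">News</a>',
    "<p>Yesterday, I wrote a letter to my parents.</p>",
]
features = [paragraph_features(p, i, len(paragraphs)) for i, p in enumerate(paragraphs)]
for f in features:
    print(f)
```

The navigation bar has many tags and few characters, whereas the running text shows the opposite pattern; a classifier trained on annotated paragraphs can pick up on contrasts of this kind.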


Automatic detection of boilerplate (II)

Automatic detection is not perfect:

some boilerplate will not be discovered (recall < 1)

some paragraphs will be mistakenly classified as boilerplate (precision < 1)

Keep this in mind when working with web corpora, and double-check any implausible frequency figures.
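Precision and recall can be computed from gold annotations and classifier output as follows. The labels below are invented for illustration and do not come from the tutorial's data.

```python
def precision_recall(gold, predicted):
    """gold, predicted: parallel lists of booleans (True = boilerplate)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # boilerplate, found
    fp = sum(1 for g, p in pairs if not g and p)    # good text, wrongly flagged
    fn = sum(1 for g, p in pairs if g and not p)    # boilerplate, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical annotations for six paragraphs.
gold      = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]
precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Here one boilerplate paragraph is missed (recall 2/3) and one good paragraph is wrongly flagged (precision 2/3).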


Boilerplate: removal vs. flagging

"Classic" approach (e.g., WaCky, COW2012): remove boilerplate.

Alternative: do not remove, but flag (possibly with a confidence score)

Pro: Users do not have to rely on:

the corpus designer's definition of boilerplate

the performance of an automatic classifier

Con:

increase in corpus size

Document filtering

Ideally, the final corpus should contain only "good" documents.

"Good":

only documents in the target language

documents containing predominantly text (i.e., coherent and connected text)

This excludes certain document types:

lists (e.g., company names, vocabulary items)

tag clouds

etc.


Example: a "good" document


Distinguishing the "good" from the "bad" (I)

Classification has to be performed automatically

ML algorithms usually need manually annotated training data

But: the decision is difficult even for humans and arbitrary to some extent

Experiment: 3 raters, 1000 random documents from German web corpus, using 5-point scale:

−2, −1: document should not be in the corpus

1, 2: document should be in the corpus

0: undecided/document might or might not be in the corpus


Distinguishing the "good" from the "bad" (II)

Results (after a training phase of rating 100 documents together, with several hours of discussion of borderline cases):

statistic      early 500   late 500   all 1,000
raw            0.566       0.300      0.433
κ (raw)        0.397       0.303      0.367
ICC(C, 1)      0.756       0.679      0.725
raw (r ≥ 0)    0.900       0.762      0.831
raw (r ≥ 1)    0.820       0.674      0.747
κ (r ≥ 0)      0.673       0.625      0.660
κ (r ≥ 1)      0.585       0.555      0.598
κ (r ≥ 2)      0.546       0.354      0.498
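The following sketch shows how statistics of this kind can be computed for three raters. The slides do not say which κ variant was used, so Fleiss' kappa is an assumption here, as is the reading of "raw" as the share of documents on which all raters agree and of "r ≥ t" as binarizing each rating at threshold t. The ratings are invented.

```python
from collections import Counter

def raw_agreement(ratings):
    """Share of items on which all raters gave the same label."""
    return sum(1 for item in ratings if len(set(item)) == 1) / len(ratings)

def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` is a list of tuples with one label per rater.

    Undefined (division by zero) if all ratings fall into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # proportion of agreeing rater pairs for this item
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_observed = agreement_sum / n_items
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    return (p_observed - p_expected) / (1 - p_expected)

def binarize(ratings, threshold):
    """Turn ratings on the 5-point scale into accept/reject decisions at r >= threshold."""
    return [tuple(r >= threshold for r in item) for item in ratings]

# Hypothetical ratings by three raters on the scale -2..2.
ratings = [(2, 2, 2), (1, 2, 0), (-1, -2, -2), (0, 1, -1), (-2, -2, -2), (1, 1, 1)]
print(raw_agreement(ratings))
print(raw_agreement(binarize(ratings, 0)))
print(fleiss_kappa(binarize(ratings, 0)))
```

As in the table, agreement rises once the five-point ratings are collapsed into a binary decision at a threshold.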

"Good" vs. "bad" documents: lists (I-III)


Acceptable results?

values below 0.68 [Krippendorff, 1980]

even considering criticism of Krippendorff's magic number [Carletta, 1996, Bayerl and Paul, 2011]: uncomfortably low for "gold standard"

more confusion on late data (lower overall quality)

worse: disagreement between corpus designers

acceptance at the threshold ≥ 0: A: 78.4%, R: 73.8%, S: 84.9%


General method and idea

simple metric with known properties

language-independent, unsupervised...

does not involve an obviously difficult design decision

strategy for cleansing: high recall for everyone, accept mediocre precision

for retained documents: use as annotation in final corpus, allow corpus users to "set" precision


Implementation

based on "frequent/short word" method in language identification [Grefenstette, 1995]

similar to WaCky [Baroni et al., 2009], but without manually compiled lists of function words

totally unsupervised procedure for crawled data predominantly in a single language (TLD crawl):

training: get weighted mean and standard deviation of relative frequencies of the most frequent words ("profile")

production: calculate for the top m of them the "standardized" negative deviation for each document

clamped and added up: the Badness score
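One possible reading of the training and production steps is sketched below. The slides leave the details open, so the weighting by document length, the clamp value, and all function and parameter names are assumptions, and the tiny sample corpus is invented. Documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def train_profile(documents, n_words):
    """Return {word: (weighted mean, weighted sd)} for the n most frequent words."""
    overall = Counter()
    for doc in documents:
        overall.update(doc)
    total_tokens = sum(len(doc) for doc in documents)
    profile = {}
    for word, _ in overall.most_common(n_words):
        # relative frequency per document, weighted by document length (assumed)
        freqs = [(doc.count(word) / len(doc), len(doc)) for doc in documents]
        mean = sum(f * w for f, w in freqs) / total_tokens
        variance = sum(w * (f - mean) ** 2 for f, w in freqs) / total_tokens
        profile[word] = (mean, math.sqrt(variance))
    return profile

def badness(tokens, profile, top_m, clamp=5.0):
    """Sum of clamped standardized negative deviations for the top m profile words."""
    score = 0.0
    for word, (mean, sd) in list(profile.items())[:top_m]:
        if sd == 0:
            continue  # no variation observed in training; skip this word
        freq = tokens.count(word) / len(tokens)
        z = (mean - freq) / sd             # positive if the word is rarer than expected
        score += min(max(z, 0.0), clamp)   # only negative deviations count, clamped
    return score

# Hypothetical training documents and two test documents.
training = [
    "the cat sat on the mat and the dog sat on the rug".split(),
    "the bird and the fish saw the cat on the mat".split(),
    "a dog and a cat sat on the mat in the sun".split(),
]
profile = train_profile(training, n_words=5)
good = "the cat and the dog sat on the mat".split()
bad = "login register sitemap contact imprint".split()
print(badness(good, profile, top_m=3), badness(bad, profile, top_m=3))
```

A document that lacks the most frequent words of the language, such as a list of menu labels, receives a high score, while running text stays close to zero.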

Duplication

Duplicate sentences in a corpus concordance

Query: made a donation


Near duplication and inclusion

Many web pages are not perfect duplicates of others, but just very similar.

One typical source: slightly edited texts from news agencies in several newspapers.

Also, some web pages fully contain other web pages (inclusion).

The inclusion situation is made worse by typical blog and content management systems.


Failure of perfect duplicate hashing/fingerprinting

D_1: Yesterday, I wrote a letter to my parents.

D_2: Yesterday I wrote a letter to my parents.

Hashes (e.g., SHA1, which is very strong, maybe too strong for the given task):

12d952 23 3869faead1e1aa6e02b98256e35f8

5ad8468f b45fe07ef420fd87 d360087514d1f1
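The effect is easy to reproduce with Python's standard `hashlib` module. The digests printed by this sketch need not match the ones shown above, since the exact bytes that were hashed for the slide (encoding, trailing newline) are unknown.

```python
import hashlib

d1 = "Yesterday, I wrote a letter to my parents."
d2 = "Yesterday I wrote a letter to my parents."

# SHA1 digests as 40-character hexadecimal strings
h1 = hashlib.sha1(d1.encode("utf-8")).hexdigest()
h2 = hashlib.sha1(d2.encode("utf-8")).hexdigest()

print(h1)
print(h2)
print(h1 == h2)  # a single missing comma yields a completely different digest
```

Exact hashing therefore finds only perfect duplicates and says nothing about how similar two documents are.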


Solution, in principle: Jaccard coefficients

One can measure the similarity of two documents as the Jaccard coefficient of the sets of their n-grams [Manning et al., 2009, 61, 438].

Procedure:

Create the documents' word/token n-grams.

Calculate the Jaccard coefficient of the two sets.

If it is above a certain threshold, delete the shorter of the two documents.


Example: High similarity with JC over token bigrams I

Yesterday, I wrote a letter to my parents.

FP(D_1) = {(yesterday;,), (,;i), (i;wrote), (wrote;a), (a;letter), (letter;to), (to;my), (my;parents), (parents;.)}

Yesterday I wrote a letter to my parents.

FP(D_2) = {(yesterday;i), (i;wrote), (wrote;a), (a;letter), (letter;to), (to;my), (my;parents), (parents;.)}


Example: High similarity with JC over token bigrams II

FP(D_1) = {(yesterday;,), (,;i), (i;wrote), (wrote;a), (a;letter), (letter;to), (to;my), (my;parents), (parents;.)}

FP(D_2) = {(yesterday;i), (i;wrote), (wrote;a), (a;letter), (letter;to), (to;my), (my;parents), (parents;.)}

$$J(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|} = \frac{7}{10} = 0.7$$


Duplication

Example: Low similarity with JC over token bigrams

$FP(D_2)$ = {(yesterday; i), (i; wrote), (wrote; a), (a; letter), (letter; to), (to; my), (my; parents), (parents; .)}

$D_3$ = Yesterday I read a book to my niece.

$FP(D_3)$ = {(yesterday; i), (i; read), (read; a), (a; book), (book; to), (to; my), (my; niece), (niece; .)}

$$J(D_2, D_3) = \frac{|D_2 \cap D_3|}{|D_2 \cup D_3|} = \frac{2}{14} \approx 0.14$$


Duplication

Can it be done?

10,000,000 documents

$\frac{10{,}000{,}000^2}{2} = 5 \times 10^{13}$ comparisons

At 1 µs per comparison: ≈ 579 days

But 1 µs per comparison is faster than available processors can manage.
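The estimate above can be checked with a few lines of arithmetic. The sketch uses the exact number of unordered pairs, n(n-1)/2, which is practically the same as the slide's n²/2 at this scale; the cost of 1 µs per comparison is the slide's own (optimistic) assumption.

```python
n_docs = 10_000_000
pairs = n_docs * (n_docs - 1) // 2   # exact number of unordered document pairs
seconds = pairs * 1e-6               # assumed cost: 1 microsecond per comparison
days = seconds / 86_400              # 86,400 seconds per day
print(f"{pairs:.1e} comparisons, about {days:.0f} days")
```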

Solution:

Do not calculate the JC for each pair of documents.

Instead, estimate the JC [Broder et al., 1997].

The technique is known as w-shingling.
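One common way to estimate a Jaccard coefficient from compact document signatures is MinHash. The following is a minimal sketch of that idea. It is a simplified stand-in that uses salted hash functions; it is not the exact procedure of Broder et al. [1997] and not the authors' tool. The names `shingles`, `signature` and `estimate_jc`, the shingle size and the number of hash functions are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(tokens: List[str], w: int = 2) -> Set[Shingle]:
    """Return the set of w-shingles (token w-grams) of a document."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def _hash(shingle: Shingle, seed: int) -> int:
    """Deterministic 64-bit hash of a shingle, salted with a seed."""
    data = f"{seed}\x00{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(shingle_set: Iterable[Shingle], num_hashes: int = 128) -> List[int]:
    """MinHash signature: the minimum hash value under each salted hash.

    Assumes a non-empty shingle set.
    """
    items = list(shingle_set)
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimate_jc(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions, an estimate of the JC."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


d1 = "yesterday , i wrote a letter to my parents .".split()
d2 = "yesterday i wrote a letter to my parents .".split()
print(estimate_jc(signature(shingles(d1)), signature(shingles(d2))))
```

The printed value is an estimate that scatters around the true coefficient, and more hash functions reduce the variance. Signatures have a fixed size per document, which is what makes it feasible to avoid exact set comparisons for every pair.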


Duplication

Near-duplicate detection with w-shingling

w-shingling does not tell us what amount of duplication is acceptable in a corpus.

Setting a particular JC as a threshold for keeping/discarding a document is ultimately an arbitrary choice.

It is a design decision by the corpus builders.

Web corpora will probably always contain a certain amount of duplication.

Many other corpora do, too.

Strategies to cope with duplication:

If compatible with the research question: work with sentence-wise uniq'ed corpora.

Or work with unique (not too short) concordance lines.
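Sentence-wise uniq'ing can be sketched as follows. This is a minimal illustration that assumes the corpus is available as an iterable of sentence strings. The normalization used for comparison (collapsing whitespace, lowercasing) is an assumption, and the function name is hypothetical.

```python
from typing import Iterable, Iterator


def unique_sentences(sentences: Iterable[str]) -> Iterator[str]:
    """Yield each sentence the first time it occurs, preserving corpus order."""
    seen = set()
    for sentence in sentences:
        # Assumed normalization: collapse whitespace and ignore case.
        key = " ".join(sentence.split()).lower()
        if key not in seen:
            seen.add(key)
            yield sentence


corpus = ["I wrote a letter.", "i wrote  a letter.", "I read a book."]
print(list(unique_sentences(corpus)))
```

The same function could be applied to concordance lines instead of sentences. For very large corpora, storing a hash of each key instead of the key itself would reduce memory use.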


Linguistic post-processing

We are here...

Why use web corpora?

Web corpus construction: Overview

Data collection

Boilerplate detection

Document filtering

Duplication

Linguistic post-processing

Evaluation


Linguistic post-processing

Noise in web corpora: sources

1. Properties of web documents, e.g.

   quasi-spontaneous writing situations with (often) casual language, typos, non-standard spellings, improper use of whitespace and punctuation

   text contributed by non-native speakers

   a large lexicon with many out-of-vocabulary items

2. The structure of the WWW and its technologies, e.g.

   duplicated text, boilerplate

3. Shortcomings in post-processing, e.g.

   HTML/code stripping, tokenization


Linguistic post-processing

Example: spelling variants

Correct form (frequency): übernimmt (297440)

Misspelled form          N    Edit distance
ubernimmt                81   1
überninmmt               6    1
übernimmnt               1    1
überniemt                6    1
öbernimmt                6    1
übernimnmt               5    1
übernimmet               11   1
überniehmt               2    2
überniemt                6    1
übernihmt                17   1
überniiiiiiiimmmmmmmt    1    12
überniommt               4    1
übernimrnt               4    2
überniummt               2    1
überninnt                2    2
übernimmtt               2    1
...
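The edit distances in the table can be reproduced with a standard dynamic-programming implementation. The sketch below assumes that "edit distance" means plain Levenshtein distance (insertions, deletions and substitutions, each at cost 1); the slides do not say which variant was used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("übernimmt", "ubernimmt"))   # 1
print(edit_distance("übernimmt", "übernimrnt"))  # 2
```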


Linguistic post-processing

Example: text produced by non-native speakers


Linguistic post-processing

Noise in linguistic post-processing

Most available natural language processing tools expect standard written language as input.

Many tokenizers expect proper use of whitespace, casing and punctuation.

Part-of-speech taggers expect properly tokenized text as input.

POS taggers usually perform worse on unknown items (tokenization errors, misspellings, out-of-vocabulary items).

Proper tokenization and POS tagging are normally the basis for higher-level linguistic annotation (parsing, sense disambiguation etc.).


Linguistic post-processing

Should noise be "normalized away"?

What exactly is "noise" in the first place?

There is no general definition of "noise" that suits all purposes.

What counts as noise in one task may be valuable data in another task.


Linguistic post-processing

Can noise be "normalized away"?

In large web corpora, noise would have to be eliminated automatically, without human interaction.

This is viable in some cases, e.g. correcting run-together words:

a lot.I don't think → a lot. I don't think

But it is not an easy task in other cases, e.g. spelling correction:

I didn't spend much time with Bibble. → Bubble?
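The run-together case can be handled by a simple rule, sketched here under a deliberately narrow assumption: a lowercase letter, a sentence-final punctuation mark and an uppercase letter in direct succession indicate a missing space. The pattern and the function name are illustrative only. The rule misfires on dotted names such as "Node.Js", which shows why even the "viable" cases need care.

```python
import re

# Lowercase letter + sentence-final punctuation, directly followed by an uppercase letter.
_RUN_TOGETHER = re.compile(r"([a-z][.!?])([A-Z])")


def split_run_together(text: str) -> str:
    """Insert the missing space between run-together sentences."""
    return _RUN_TOGETHER.sub(r"\1 \2", text)


print(split_run_together("a lot.I don't think"))  # a lot. I don't think
```

Spelling correction has no comparably simple rule: deciding whether "Bibble" is a typo or a proper name requires knowledge that a pattern cannot supply.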


Linguistic post-processing

Non-destructive normalization

Our approach: do as much as possible non-destructively (i.e. do not alter the original data).

Normalization-as-annotation

But there are limits to this approach imposed by the indexing/query architecture.


Linguistic post-processing

Non-destructive orthographic normalization

Original text: spelling error, POS-tagging error, no lemma

Word          POS   Lemma
The           DT    the
FA            NP    FA
does          VBZ   do
abosolutley   JJ    <unknown>
nothing       NN    nothing
to            TO    to
help          VB    help
Clubs         NNS   club

Normalized text: standard spelling, correct POS and lemma, original spelling preserved as metadata

Word          POS   Lemma
The           DT    the
FA            NP    FA
does          VBZ   do
<norm from="abosolutley">
absolutely    ADV   absolutely
</norm>
nothing       NN    nothing
to            TO    to
help          VB    help
Clubs         NNS   club
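The normalization-as-annotation idea can be sketched as a step that runs before tagging: the corrected form replaces the token, and the original spelling is kept in a `<norm from="...">` element, as in the example above. The function name and the correction dictionary are hypothetical; how corrections are actually obtained is not covered by this sketch.

```python
from typing import Dict, List
from xml.sax.saxutils import quoteattr


def annotate(tokens: List[str], corrections: Dict[str, str]) -> List[str]:
    """Return one-token-per-line output with corrected tokens wrapped in <norm>."""
    lines = []
    for token in tokens:
        if token in corrections:
            # Keep the original spelling as an attribute; emit the corrected form.
            lines.append(f"<norm from={quoteattr(token)}>")
            lines.append(corrections[token])
            lines.append("</norm>")
        else:
            lines.append(token)
    return lines


tokens = "The FA does abosolutley nothing to help Clubs".split()
print("\n".join(annotate(tokens, {"abosolutley": "absolutely"})))
```

Because the tagger then sees "absolutely" rather than "abosolutley", it has a chance of assigning a sensible POS tag and lemma, while queries can still retrieve the original spelling from the attribute.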


Linguistic post-processing

Quality of linguistic annotation in web corpora

Automatic annotation may be a concern, e.g.

Do not take type counts in web corpora at face value. There are many hapax legomena due to tokenization issues.

POS tagging of web corpora does not quite reach accuracy levels in the upper 90% range, but many POS taggers do not reach such accuracy levels on traditional, out-of-domain texts either.

But:

Depending on the specific research question, some or all of these problems may be irrelevant.

Anyway, in many scenarios, there is no alternative to working with web corpora.

And we (and others) are working to improve annotation quality.
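The effect of tokenization errors on type counts is easy to see in a toy example. The sketch below counts types and hapax legomena in a token list; the function name and the sample tokens are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def type_stats(tokens: List[str]) -> Tuple[int, int]:
    """Return (number of types, number of hapax legomena)."""
    counts = Counter(tokens)
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return len(counts), hapaxes


# "lot.I" is a tokenization error that surfaces as an extra type and hapax.
tokens = ["a", "lot.I", "do", "n't", "think", "a", "lot", "do"]
print(type_stats(tokens))
```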


Evaluation

We are here...

Why use web corpora?

Web corpus construction: Overview

Data collection

Boilerplate detection

Document filtering

Duplication

Linguistic post-processing

Evaluation


Evaluation

What's in a web corpus

Crawled corpus:

No stratified sampling of documents.

Exact composition of the final corpus is not known beforehand.

Establish corpus composition after corpus construction.


Evaluation

Web document classification

COWCat: classification scheme based on Sharoff [2006]

Five dimensions: Authorship, Mode, Audience, Aim, Domain

Annotation guidelines available from:

http://hpsg.fu-berlin.de/cow/files/cowcat2013.pdf


Evaluation

UKCOW2012: Mode

Type            %      95% CI (±%)
Written         90.5   2.8
Spoken          0.7    0.8
Quasi-Spont.    6.6    2.4
Blogmix         2.2    1.4
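The slides do not state how the confidence intervals were computed or how many documents were annotated. As a point of reference only, the sketch below shows the textbook normal-approximation (Wald) half-width for a proportion. Both the method and the sample size `n` are assumptions, not details taken from the tutorial.

```python
import math


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation CI for a proportion, in percentage points.

    p: observed proportion (0..1); n: sample size (hypothetical here);
    z: 1.96 corresponds to a 95% interval.
    """
    return 100 * z * math.sqrt(p * (1 - p) / n)


# Illustrative call: a proportion of 0.5 observed in a hypothetical sample of 100.
print(round(ci_half_width(0.5, 100), 1))  # 9.8
```

The half-width shrinks with the square root of the sample size, so halving the interval requires roughly four times as many annotated documents.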


Evaluation

UKCOW2012: Audience

Type            %      95% CI (±%)
General         87.4   3.2
Informed        7.3    2.5
Professional    5.3    2.2


Evaluation

UKCOW2012: Authorship

Type              %      95% CI (±%)
Single, female    7.8    2.6
Single, male      21.6   4.0
Multiple          13.8   3.3
Corporate         26.0   4.2
Unknown           30.8   4.5


Evaluation

UKCOW2012: Aim

Type              %      95% CI (±%)
Recommendation    10.2   2.9
Instruction       1.9    1.3
Information       10.7   3.0
Discussion        75.5   4.2
Fiction           1.7    1.2


Evaluation

Extrinsic evaluation: Biemann et al. [2013]

Task-oriented evaluation: collocation extraction

Work by Stefan Evert

Different corpora (web and traditional) compared against a gold standard

Name               Size (tokens)   Corpus type        POS tagged   Lemmatized   Basic unit
BNC                0.1 G           reference corpus   ?            ?            text
WP500              0.2 G           Wikipedia          ?            ?            fragment
Wackypedia         1.0 G           Wikipedia          ?            ?            article
ukWaC              2.1 G           web corpus         ?            ?            web page
WebBase            3.3 G           web corpus         ?            ?            paragraph
UKCOW              4.0 G           web corpus         ?            ?            sentence
LCC                0.9 G           web corpus         ?            ?            sentence
LCC (f ≥ k)        0.9 G           web n-grams        ?            ?            n-gram
Web1T5 (f ≥ 40)    1000.0 G        web n-grams        ?            ?            n-gram


Evaluation

Verb-particle combinations: Biemann et al. [2013] (II)

Gold standard: 3,078 English verb-particle combinations, manually classified as non-compositional (carry on, knock out) or compositional (bring together, peer out).

Extracted co-occurrence counts for the word pairs (3-word span to the right of the verb).

Clean, annotated web corpora seem to be a valid replacement for a traditional reference corpus such as the BNC.

Diversity may be a relevant factor: web corpora of more than 1G words are needed to rival the 100M-word BNC.

Web1T5: sheer size cannot make up for "messy" content and lack of annotation.
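Counting co-occurrences in a 3-word span to the right of the verb can be sketched as follows. This is a simplified illustration, not the evaluation code from Biemann et al. [2013]: it matches plain word forms against hypothetical verb and particle lists, whereas a real extraction would presumably rely on POS tags and lemmas where the corpus provides them.

```python
from collections import Counter
from typing import List, Set


def count_verb_particle(tokens: List[str], verbs: Set[str],
                        particles: Set[str], span: int = 3) -> Counter:
    """Count (verb, particle) pairs where the particle occurs within
    `span` tokens to the right of the verb."""
    counts: Counter = Counter()
    for i, token in enumerate(tokens):
        if token in verbs:
            for other in tokens[i + 1:i + 1 + span]:
                if other in particles:
                    counts[(token, other)] += 1
    return counts


tokens = "they carry the work on and knock him out".split()
print(count_verb_particle(tokens, {"carry", "knock"}, {"on", "out"}))
```

Counts like these would then feed an association measure whose ranking is compared against the gold standard.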


Evaluation

References I

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209-226, 2009.

P. S. Bayerl and K. I. Paul. What determines inter-coder agreement in manual annotations? A meta-analytic investigation. Computational Linguistics, 37(4):699-725, 2011.

C. Biemann, F. Bildhauer, S. Evert, D. Goldhahn, U. Quasthoff, R. Schäfer, J. Simon, L. Swiezinski, and T. Zesch. Scalable construction of high-quality web corpora. Special issue of JLCL, 2013.

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. In Proceedings of the 9th International World Wide Web conference on Computer Networks: The International Journal of Computer and Telecommunications Networking, pages 309-320. North-Holland Publishing Co, 2000.

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Technical Note 1997-115, SRC, Palo Alto, July 25 1997.

J. Carletta. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254, 1996.

P. Cook and G. Hirst. Do web-corpora from top-level domains represent national varieties of English? In Proceedings of the 11th International Conference on the Statistical Analysis of Textual Data, pages 281-293, Liège, 2012.

G. Grefenstette. Comparing two language identification schemes. In Proceedings of the 3rd International Conference on Statistical Analysis of Textual Data (JADT 1995), pages 263-268, Rome, 1995.

K. Krippendorff. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, 1980.


Evaluation

References II

C. Manning, P. Raghavan, and H. Schütze. An Introduction to Information Retrieval. CUP, Cambridge, 2009.

M. A. Serrano, A. Maguitman, M. Boguñá, S. Fortunato, and A. Vespignani. Decoding the structure of the WWW: A comparative analysis of Web crawls. ACM Trans. Web, 1(2), 2007.

S. Sharoff. Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini, editors, WaCky! Working papers on the Web as Corpus, pages 63-98. GEDIT, Bologna, 2006.
