Off-line (and On-line) Text Analysis for Computational Lexicography

Hannah Kermes
IMS Universität Stuttgart



Introduction

• Motivation
  • computational lexicography
  • corpus linguistics
• Approaches to text analysis
  • symbolic vs. probabilistic approaches
  • hand-written vs. learned
  • on-line queries vs. chunking vs. full parsing
• Requirements
  • for the extraction tool
  • for the corpus annotation
• classical chunking


Motivation

• maintenance of consistency and completeness within lexica → computer-assisted methods
• lexical engineering → scalable lexicographic work process; processes reproducible on large amounts of text
• statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research
• full parsers are not robust enough
→ need for analysis tools that meet the specific needs of corpus linguistic studies


Dictionaries

• for human use
  • printed monolingual dictionaries
  • electronic dictionaries
• machine readable dictionaries for NLP applications


Printed monolingual dictionaries

• intend to cover most important semantic and syntactic aspects
• maintenance of consistency and completeness is a problem:
  • information is missing
  • entries are incomplete
  • information is not consistent
  • language changes have to be covered


Electronic dictionaries

• enormous amounts of information can be stored in a compact format

• search engines allow for easy and fast access to desired data

• users can choose how much and what kind of information they are interested in

• reference corpus as additional knowledge source


Machine readable dictionaries

• NLP applications need detailed and consistent information about words
  • detailed morphological information
  • subcategorization frames of verbs, adjectives, nouns
  • specific syntactic information
  • selectional preferences
  • collocations
  • idiomatic usage


Information needed

• syntactic information
  • subcategorization patterns
• semantic information
  • selectional preferences, collocations
  • synonyms
  • multi-word units
  • lexical classes
• morphological information
  • case, number, gender
  • compounding and derivation


Requirements for the tool

• it has to work on unrestricted text
• shortcomings in the grammar should not lead to a complete failure to parse
• no manual checking should be required
• should provide a clearly defined interface
• annotation should follow linguistic standards


Requirements for the annotation

• head lemma
• morpho-syntactic information
• lexical-semantic information
• structural and textual information
• hierarchical representation


A corpus linguistic approach


Hypothesis

The better and more detailed the off-line annotation, the better and faster the on-line extraction.

However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.


Three different dimensions

• type of grammar
  • symbolic grammar
  • probabilistic grammar
• type of grammar development
  • hand-written grammar
  • learning methods
• depth of analysis
  • analysis on token level only
  • full parsing
  • partial parsing


Symbolic approaches

+ precise rules can be formulated
+ lexical knowledge can be included
+ results can be predicted and controlled
- sometimes not sufficient to solve ambiguities
- only phenomena which are explicit in the grammar can be dealt with


Unification-based grammars

• usually complex grammars
• model the hierarchical structure of language
• handle attachment ambiguities
• determine relations among constituents and their grammatical function
• extensive use of lexical information
• richness and complexity of rules do not only solve ambiguities, but produce them as well
• usually large number of possible analyses


Context-free Grammars (CFG)

• formal grammars consisting of a set of recursive rewriting rules

• small and modular grammar
• minimal interaction among rules
• parsing process usually fast
• covers only basic aspects of language
• robustness rules are used to overcome shortcomings in the grammar


Probabilistic approaches

+ supervised or unsupervised training of rules
+ all possible analyses are produced
+ no need for comprehensive lexical or linguistic knowledge
+ rules can be left underspecified
- depend on the training corpus
- high-frequency phenomena are preferred over low-frequency phenomena


Probabilistic context-free grammar

• CFG rules enriched by probability
• make use of underspecification
• not as fast as CFG
• special case: head lexicalized context-free grammar
  • unsupervised
  • grammar rules are indexed by the lemma of the syntactic head
  • extraction is performed on the rule set rather than on the annotated corpus
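As an illustration of "CFG rules enriched by probability", the following minimal sketch scores one derivation as the product of the probabilities of the rules it uses. The rule set and all probability values are hypothetical toy numbers, not taken from any grammar mentioned in the slides.

```python
from math import prod

# Hypothetical toy rule probabilities (assumption of this sketch).
# For each left-hand side the probabilities sum to 1.
RULE_PROB = {
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("Det", "AP", "N")): 0.4,
    ("AP", ("Adj",)): 0.7,
    ("AP", ("Adv", "Adj")): 0.3,
}

def derivation_probability(rules):
    """Probability of a derivation = product of its rule probabilities."""
    return prod(RULE_PROB[rule] for rule in rules)

# Derivation for a phrase of the shape Det Adj N:
# NP -> Det AP N, then AP -> Adj
derivation = [("NP", ("Det", "AP", "N")), ("AP", ("Adj",))]
p = derivation_probability(derivation)
print(f"{p:.2f}")
```

A parser built on such a grammar would compare these scores across competing analyses and prefer the most probable one, which is why frequent phenomena win over rare ones.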


Hand-written rules

+ good control of the rule system
+ negative evidence can be taken into account
- depends heavily on the expertise of the grammar writer


Learning grammar rules

+ infer grammar from text corpora
+ extensional syntactic descriptions (annotations) are turned into intensional descriptions (rules)
+ optimal or suboptimal training data
+ new resources in the form of text corpora can be exploited
+ more or less independent of the knowledge of the grammar developer
- depends heavily on the learning corpus
- needs an annotated, well-balanced corpus


Memory-based learning

• special case of learning
• most prominent is data-oriented parsing (DOP)
  • fragments are stored and as such replace the grammar
  • language generation and analysis are performed by combining the memorized fragments
  • needs structurally annotated corpus
  • the training corpus has great impact on the performance of the system
  • highly sensitive to suboptimal data
  • needs large storage capacity


Annotation on token level

+ usually a form of pattern matching
+ completely flexible
+ does not depend on previous syntactic analysis
+ easily adaptable to different text types
- full syntactic analysis has to be performed by extraction queries
- queries can become rather complex
- often restricted to simple contexts


Full Parsing

+ provides rich and detailed information about structures, relations and functions
+ extraction queries simply have to collect the annotated information
- slow parsing speed
- lack of robustness
- depends heavily on prerequisite lexical information
- ambiguous output


Chunking

+ relatively simple grammar rules
+ no need for extensive linguistic and lexicographic information
+ robust
- usually non-hierarchical and non-recursive structures
- annotated structures are simple and convey less information


Classical chunk definition

• Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template

• Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head


State-of-the-art systems

• CASS parser
  • finite-state cascades
  • flat, non-recursive structures
  • small lexicon (tag-fixes)
  • information about the head is given as an attribute
• Conexor
  • symbolic constraint grammar parser
  • full-fledged grammar for English (ENGCG)
  • German:
    • simple, non-recursive structure
    • no lexical information available
    • head lemma indicated by a special tag


State-of-the-art systems

• KaRoParse
  • top-down bottom-up parser
  • includes recursion
  • internal structure is flat and non-hierarchical
  • no agreement or lexical information
• Schiehlen's chunker
  • symbolic context free grammar
  • recursion
  • no head lemma or lexical-semantic information
  • needs optimally tokenized text (including MWL recognition)


State-of-the-art systems

• Chunkie
  • uses TnT-tagger to assign tree fragments to sequences of PoS-tags
  • recursion in pre-head position (maximal depth of three)
  • head lemma information, yet no agreement or lexical information
• Cascaded Markov Models
  • stochastic context free grammar rules
  • several layers, each layer serving as input to the next
  • hierarchical phrases, including complex recursion
  • head lemma information, yet no agreement or lexical information


Problems for extraction

• Kübler and Hinrichs (2001): focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances.


An example

1. [PC mit kleinen ], [PC über die Köpfe ] [NC der Apostel ] [NC gesetzten Flammen ]
   with small, above the heads the apostles set flames

2. [PP mit [NP [AP kleinen ], [AP über [NP die Köpfe [NP der Apostel ] ] gesetzten ] Flammen ] ]
   with small, above the heads the apostles set flames

'with small flames set above the heads of the apostles'


Problems for extraction

• four NCs instead of only one NP
• AN-pair:
  + gesetzten + Flammen
  - kleine + Flammen
• NN-pair Köpfe + Apostel needs agreement information
• VN-pair setzen + Flammen needs information about the deverbal character of gesetzten
→ a more complex analysis is needed
→ PCs and NCs need to be combined


Simple solution

PP → PC (PC|NC)*

• theoretical motivation?
• rule covers this particular example, other examples might need additional rules
• rule is vague and largely underspecified → not very reliable
• internal structure is mainly left opaque
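The simple rule, a PC followed by any number of PCs or NCs, can be written as a regular expression over chunk labels. This is only a sketch of the idea: the function name and the list-of-labels representation are assumptions made here, not part of any system described in the slides.

```python
import re

# PP -> PC (PC|NC)* expressed over a space-separated sequence of chunk labels.
SIMPLE_PP = re.compile(r"PC(?: (?:PC|NC))*")

def is_simple_pp(chunk_labels):
    """True if the whole label sequence is licensed by the simple PP rule."""
    return SIMPLE_PP.fullmatch(" ".join(chunk_labels)) is not None

# The example from the slides: mit kleinen / über die Köpfe / der Apostel / gesetzten Flammen
print(is_simple_pp(["PC", "PC", "NC", "NC"]))
# The rule accepts any other mix of PCs and NCs after the first PC just as readily.
print(is_simple_pp(["PC", "NC", "PC", "NC", "NC"]))
```

The second call shows why the rule is underspecified: it says nothing about which chunk modifies which, so the internal structure stays opaque.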


Complex solution

1. NP → NC NCgen
2. PP → preposition NP
3. AP → PP adjective
4. NP → AP* noun


Complex solution

• solution for this particular example only
• large number of rules needed
• rules have to be repeated for every instance of a complex phrase
→ in order to support extractions, the classic chunk concept has to be extended


Conclusion

Chunking

• flat non-recursive structures
• simple grammar
• robust and efficient
• non-ambiguous output

Full Parsing

• full hierarchical representation
• complex grammar
• not very robust
• ambiguous output

YAC


Conclusion

• recursive chunking → workable compromise between depth of analysis and robustness
• extracted data show correlation between
  • collocational preference
  • subcategorization frames
  • semantic classes of adjectives
  • to a certain extent distributional preferences


General Concept

• a recursive chunker for unrestricted German text
• technical framework
  • CWB
  • CQP
  • output formats
  • advantages of the architecture
• general framework of YAC
  • linguistic coverage
  • feature annotation
  • chunking process


A recursive chunker for unrestricted German text

• recursive chunker for unrestricted German text
• fully automatic analysis
• main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora


General aspects

• based on a symbolic regular expression grammar
• grammar rules written in CQP
• basis:
  • tokenization
  • PoS-tagging
  • lemmatization
  • agreement information
• tools and resources named on the slide: Tree Tagger, IMSLex


A typical chunker

• robust – works on unrestricted text
• works fully automatically
• does not provide full but partial analysis of text
• no highly ambiguous attachment decisions are made


YAC goes beyond

• extends the chunk definition of Abney
  1. recursive embedding
  2. post-head embedding
• provides additional information about annotated chunks
  1. head lemma
  2. agreement information
  3. lexical-semantic and structural properties


Extended chunk definition

A chunk is a continuous part of an intra-clausal constituent including recursion and pre-head as well as post-head modifiers but no PP-attachment or sentential elements.


Technical Framework

[Diagram: corpus, grammar rules, and lexicon feed the Perl scripts, which perform rule application, annotation of results, and post-processing]


Technical framework - CQP

• regular expression matching on token and annotation strings
• tests for membership in user-specific word lists
• feature set operations
• constraints to specify dependencies


Perl-Scripts

• invocation of CQP
• processing of the results
• annotation of the results into the corpus


Postprocessing

• values can be checked
• values can be changed
• values can be compared
• range of structures can be changed


Output formats

• CQP format, used for:
  • interactive grammar development
  • parsing
  • extraction
• an XML format, used for:
  • hierarchy building
  • extraction
  • data exchange


Advantages of the system

• efficient work even with large corpora
• modular query language
• interactive grammar development
• powerful post-processing of rules


Linguistic coverage

• Adverbial phrases (AdvP)
  a) schön stark (beautifully strong)
  b) daher (from there); irgendwoher (from anywhere)
  c) heim (home); querfeldein (cross-country)
  d) innen (inside); überall (everywhere)
  e) "sehr bald" (very soon)
  f) jetzt (now); damals (at that time)


Linguistic coverage

• Adjectival phrases (AP)
  a) möglich (possible)
  b) schreiend lila (screamingly purple)
  c) rund zwei Meter hohe
     around two meter high
  d) über die Köpfe der Apostel gesetzten
     above the heads of the apostles set
     'set above the heads of the apostles'


Linguistic coverage

• Noun phrases (NP)
  a) Oktober (October); er (he)
  b) 4,9 Milliarden Euro
     4.9 billion Euros
  c) "Frankensteins Fluch"
     "Frankenstein's curse"
  d) kleine, über die Köpfe der Apostel gesetzten Flammen
     small, above the heads of the apostles set flames
     'small flames set above the heads of the apostles'


Linguistic coverage

• Prepositional phrases (PP)
  a) davon (thereof)
  b) zwischen Basel und St. Moritz
     between Basel and St. Moritz
  c) mit kleinen, über die Köpfe der Apostel gesetzten Flammen
     with small, above the heads of the apostles set flames
     'with small flames set above the heads of the apostles'


Linguistic coverage

• Verbal complexes (VC)
  a) gemunkelt (rumored)
  b) muß gerechnet werden
     has counted to be
     'has to be counted'
  c) zu bekommen
     to get
  d) bekommen zu haben
     gotten to have
     'to have gotten'


Linguistic coverage

• Clauses (CL)
  a) … , daß selbst Ravel sich amüsiert hätte.
     … , that even Ravel himself enjoyed had.
     '… , that even Ravel would have enjoyed.'
  b) … , die man in der griechischen Tragödie findet.
     … , which one in the Greek tragedy finds.
     '… , which one finds in the Greek tragedy.'


Linguistic coverage

• Clauses (CL)
  a) … , Instrumente selbst zu bauen.
     … , instruments oneself to build.
     '… , to build instruments oneself.'
  b) … , um einen Kaffee zu trinken.
     … , in order a coffee to drink.
     '… , in order to drink a coffee.'


Feature annotation

• head lemma
• morpho-syntactic information
• lexical-semantic properties


Feature annotation

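feature              AdvP  AP  NP  PP  VC  CL
lexical-semantic      X    X   X   X   X   X
head lemma            X    X   X   X   X   X
agreement info        X for three categories (column positions lost in extraction)
verbal head lemma     X for one category (column position lost in extraction)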


Head lemma

• lemma attribute at the head position
• normally a single token
• multi-word proper nouns have a multi-token head lemma
• a separated verbal prefix is included in the head lemma of the VC
  kommt … an → ankommen (arrive)
• head lemma of PP: preposition:noun
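The two composite cases above can be sketched as small helper functions. The function names and the string-based representation are hypothetical, chosen here only to illustrate the conventions stated on the slide.

```python
def vc_head_lemma(verb_lemma, separated_prefix=None):
    """Head lemma of a verbal complex; a separated prefix is re-attached."""
    return (separated_prefix or "") + verb_lemma

def pp_head_lemma(preposition_lemma, noun_lemma):
    """Head lemma of a PP in the form preposition:noun."""
    return f"{preposition_lemma}:{noun_lemma}"

# kommt ... an: verb lemma "kommen" plus separated prefix "an"
print(vc_head_lemma("kommen", "an"))
# auf dem Giebel: preposition "auf", noun lemma "Giebel"
print(pp_head_lemma("auf", "Giebel"))
```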


Morpho-syntactic information

• intersection of the morpho-syntactic information of relevant elements

• invariant elements are not considered
• no guessing involved to solve ambiguities
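The intersection step can be sketched with plain set operations. The readings for den and Platz are copied from the agreement example for "den vierten Platz"; the list for vierten is abbreviated to a handful of its readings to keep the sketch short, and the function names are assumptions made here.

```python
def parse_agr(annotation):
    """Split a '|'-delimited agreement string into a set of readings."""
    return {reading for reading in annotation.split("|") if reading}

def intersect_agr(*annotations):
    """Keep only the readings shared by all elements; nothing is guessed."""
    sets = [parse_agr(a) for a in annotations]
    return set.intersection(*sets) if sets else set()

den = "|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|"
# Abbreviated: only a few of the readings of "vierten" are listed here.
vierten = "|Akk:M:Sg:Def|Akk:M:Sg:Ind|Dat:M:Pl:Def|Dat:M:Sg:Def|Nom:M:Pl:Def|"
platz = ("|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|"
         "Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|")

np_agr = intersect_agr(den, vierten, platz)
print(np_agr)  # {'Akk:M:Sg:Def'}
```

If the intersection comes out with more than one reading, all of them are kept, which matches the "no guessing" principle.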


Agreement Information

den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|

Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Gen:F:Pl:Def|Gen:F:Pl:Ind|Gen:F:Sg:Def|Gen:F:Sg:Ind|Gen:M:Pl:Def|Gen:M:Pl:Ind|Gen:M:Sg:Def|Gen:M:Sg:Ind|Gen:M:Sg:Nil|Gen:N:Pl:Def|Gen:N:Pl:Ind|Gen:N:Sg:Def|Gen:N:Sg:Ind|Gen:N:Sg:Nil|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr>

<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>


Agreement Information

den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|

Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr>

<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>


Agreement Information

den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|

Dat:N:Pl:Def| <ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|

Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr>

<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>


Agreement Information

<np_agr |Akk:M:Sg:Def|>
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|

Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr>

<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>

</np_agr>

<np_agr |Akk:M:Sg:Def|>


Lexical-semantic properties

• important for parsing as well as for extraction
• properties can be triggers for specific internal structures, functions, and usages
• properties inherent in the corpus
  • PoS-tags
    Johann  Sebastian  Bach
    NE      NE         NE
  • text markers
    "Wilhelm  Meisters  Lehrjahre"
    NE        NN        NN


Lexical-semantic properties

• properties determined by external knowledge sources (lexica, ontologies, word lists)
  • locality:
    hier (here); dort (there); Stuttgart
  • temporality:
    Jahr (year); damals (at that time)
  • derivation:
    gesetzten (set) → deverbal adjective


Lexical-semantic properties

• structural information
  • complex embeddings

[AP [PP über die Köpfe der Apostel ] gesetzten ]
above the heads of the apostles set
'set above the heads of the apostles'

[AP [NP der "Inkatha"-Partei ] angehörenden ]
to the Inkatha-party belonging
'belonging to the Inkatha-party'


Some properties of NPs

card cardinal noun

meas measure noun

ne named entity

quot NP in quotation marks

street street address

temp temporal noun

date date

pron pronominal NP


Other lexical-semantic properties

• VC with separated prefix: pref
  Er kommt an (he arrives)
• PP with contracted preposition and article: fus
  am Bahnhof (at the station)
• complex APs embedding PPs: pp
  über die Köpfe der Apostel gesetzten
  above the heads of the apostles set
  'set above the heads of the apostles'
• AP with deverbal adjectives: vder


Chunking process

[Diagram: the corpus passes through the First Level, Second Level, and Third Level in turn; a Lexicon is also shown]


First level

• basic (non-recursive) chunks
• chunks with specific internal structure
  a) Ende September (end of September)
  b) Jahre später (years later)
  c) 21. Juli 2003
  d) Johann Sebastian Bach
• lexical information is introduced
  • within the rules themselves
  • within the Perl scripts


Advantages

• specific rules do not interact with main parsing rules

• additional (e.g. domain specific) rules can be included easily

• main parsing rules can be kept simple
• number of main parsing rules can be kept small


Second level

• main parsing level
• relatively simple and general rules
  a) AP → AdvP? (PP|NP)* AC
  b) NP → Determiner? Cardinal? AP* NC
  c) PP → Preposition (NP|AdvP)
• complex (recursive) structures are built in several iterations
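How a few simple rules can build recursive structures over several iterations can be sketched as repeated regular-expression rewriting of a label sequence. This is a toy model in Python, not the CQP grammar itself. Two points are assumptions of the sketch: the order in which rules are tried within one pass, and letting the NP head slot accept an already built NP so that a later pass can extend it.

```python
import re

# (left-hand side, right-hand side as a regex over space-terminated labels)
RULES = [
    ("NP", r"(?:Determiner )?(?:Cardinal )?(?:AP )*(?:NC|NP) "),
    ("PP", r"Preposition (?:NP|AdvP) "),
    ("AP", r"(?:AdvP )?(?:(?:PP|NP) )*AC "),
]

def chunk(labels, max_iterations=10):
    """Apply all rules repeatedly until the label sequence stops changing.

    Returns the list of label sequences, one per pass that changed something,
    starting with the input.
    """
    s = "".join(label + " " for label in labels)
    history = [s.split()]
    for _ in range(max_iterations):
        before = s
        for lhs, rhs in RULES:
            # (?<!\S) ensures a match starts at the beginning of a label
            s = re.sub(r"(?<!\S)" + rhs, lhs + " ", s)
        if s == before:
            break
        history.append(s.split())
    return history

# eine für den Anwender verständliche Sprache
steps = chunk(["Determiner", "Preposition", "Determiner", "NC", "AC", "NC"])
for step in steps:
    print(step)
```

In the first pass the inner NP, the PP, and the AP that embeds the PP are built; the second pass wraps determiner, AP, and noun into the outer NP. The rules stay as simple as on the slide, and the depth comes from the embedding.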


Rule blocks


Second level

- complexity of phrases is achieved by the embedding of complex structures rather than by complex rules

a) [NP eine [AP verständliche ] Sprache ]
   an understandable language

b) [NP eine [AP für den Anwender verständliche ] Sprache ]
   a for the user understandable language
   'a language understandable for the user'


Second level

a) [PP auf [NP dem Giebel ] ]
   on top of the gable

b) [PP auf [NP dem westwärts gerichteten Giebel des heute im barocken Gewande erscheinenden Gotteshauses ] ]
   on top of the westwards pointed gable of the today in baroque garment appearing Lord's house


Third level

• chunks of related but different categories can be subsumed under one category
  • NPs with determiner (NP)
  • NPs without determiner (NCC)
  • base noun chunks (NC)
  → NP
• coordination of maximal chunks
• decisions are made which need full recursive chunks
  • adverbially and predicatively used adjectives can only be differentiated by the actual usage
    adverbially used AP → AdvP

Page 77: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Hierarchy building

• the resulting structures of all parsing stages are collected and stored in XML files

• after the parsing process, the collected structures are combined into a hierarchical structure

• only the largest instance of a structure (sharing the same head) is taken into account

Page 78: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Hierarchy building

a) [NP Faszination ]
   fascination

b) [NP gewisse Faszination des Schattens ]
   certain fascination of the shadow

c) [NP eine gewisse Faszination des Schattens ]
   a certain fascination of the shadow

d) [NP des Schattens ]
   of the shadow

e) [NP eine gewisse Faszination [NP des Schattens ] ]
   a certain fascination of the shadow
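
The selection step illustrated in a) to e) can be sketched in Python as follows. This is only an illustration under stated assumptions: chunks are given as hypothetical `(start, end, label, head)` tuples with an exclusive end and the token index of the head, the function names are invented, and the kept chunks are assumed to nest properly (no crossing brackets). The slides do not specify how the actual system represents or combines the collected structures beyond storing them in XML files.

```python
def largest_per_head(chunks):
    """Keep, for each (label, head) pair, only the longest collected chunk."""
    best = {}
    for start, end, label, head in chunks:
        key = (label, head)
        if key not in best or end - start > best[key][1] - best[key][0]:
            best[key] = (start, end)
    # Sort so that outer chunks come before the chunks they contain.
    return sorted(((s, e, lab) for (lab, _), (s, e) in best.items()),
                  key=lambda c: (c[0], -c[1]))


def bracket(tokens, kept):
    """Render properly nested chunks as a bracketed string."""
    opens, closes = {}, {}
    for start, end, label in kept:
        opens.setdefault(start, []).append(label)
        closes[end] = closes.get(end, 0) + 1
    out = []
    for i, token in enumerate(tokens):
        out.extend(f"[{label}" for label in opens.get(i, []))
        out.append(token)
        out.extend("]" * closes.get(i + 1, 0))
    return " ".join(out)


tokens = ["eine", "gewisse", "Faszination", "des", "Schattens"]
collected = [
    (2, 3, "NP", 2),  # a) Faszination
    (1, 5, "NP", 2),  # b) gewisse Faszination des Schattens
    (0, 5, "NP", 2),  # c) eine gewisse Faszination des Schattens
    (3, 5, "NP", 4),  # d) des Schattens
]
result = bracket(tokens, largest_per_head(collected))
```

Candidates a) to c) share the head "Faszination", so only c) survives; d) has a different head and is kept as an embedded NP, which gives the structure shown in e).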

Page 79: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Evaluation on automatic PoS-tags

      all chunks            maximal chunks
      precision  recall     precision  recall
NP    89.93      91.67      89.43      91.68
PP    94.05      89.67      94.04      89.65
AP    84.24      89.25      83.67      89.59
VC    -          -          97.72      96.62

Page 80: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Evaluation on ideal PoS-tags

      all chunks            maximal chunks
      precision  recall     precision  recall
NP    96.36      96.51      95.55      96.47
PP    98.08      96.51      98.07      96.50
AP    96.39      97.50      96.12      97.45
VC    -          -          99.01      98.59
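
For readers unfamiliar with the two measures, the following Python sketch shows how precision and recall are commonly computed for chunking output. The slides do not state the matching criterion used in this evaluation; the sketch assumes that a predicted chunk counts as correct only if its boundaries and label exactly match a gold-standard chunk. The function name and the toy data are hypothetical and unrelated to the figures above.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) in percent for two collections of chunks."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # exact match of start, end and label
    precision = 100.0 * correct / len(predicted) if predicted else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: (start, end, label) with exclusive end.
gold = {(0, 2, "NP"), (2, 5, "PP"), (3, 5, "NP"), (5, 6, "VC")}
predicted = {(0, 2, "NP"), (2, 5, "PP"), (4, 5, "NP"), (5, 6, "VC"), (6, 7, "AP")}
p, r = precision_recall(predicted, gold)
```

Precision is the share of predicted chunks that are correct, recall the share of gold-standard chunks that were found.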

Page 81: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Extraction

• Advantages of the system

• Goal

• Sample Extraction

Page 82: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Advantages of the system

• efficient work even with large corpora

• modular query language

• interactive grammar development

• powerful post-processing of rules

Page 83: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Goal

• provide a fine-grained syntactic classification of the extracted data at the level of

• subcategorization

• scrambling

• adjectives subcategorizing clauses

• combinatory preferences with verbs

• syntactic behavior

Page 84: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Target data

• predicative(-like) constructions

Es war klar, daß ...
It was clear, that ...

• ... with adverbial pronoun

Er ist davon überzeugt, daß ...
He is of it convinced, that ...

• ... with reflexive pronoun

Es zeigt sich deutlich, daß ...
It shows itself clear, that ...

Page 85: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Target data

• ... with infinitival clauses

Es ist möglich, ihn zu besuchen.
It is possible, him to visit.

• ... with clause in topicalized position

Daß ..., ist klar.
That ..., is clear.

Ihn zu besuchen, ist möglich.
Him to visit, is possible.

Page 86: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Sample query

adjective + verb + finite clause

VC AP CL

Page 87: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Sample query

adjective + verb + finite clause

VC APpred CLfin

Page 88: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Sample query

adjective + verb + finite clause

VC Adjuncts* APpred CLfin

Page 89: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Sample query

adjective + verb + finite clause

VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
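
To make the pattern concrete, the query can be imitated in Python as a regular expression over a sequence of chunk labels. This is a rough analogy only: the slides describe queries in the system's own query language, whereas this sketch assumes that each sentence has already been reduced to a flat list of top-level chunk labels, and the name `matches_query` is hypothetical.

```python
import re

# VC, any number of adjuncts, a predicative AP, then a finite clause.
# The lookarounds ensure that only whole labels are matched.
QUERY = re.compile(
    r"(?<!\S)VC (?:(?:AdvP|PP|NPtemp|CLrel) )*APpred CLfin(?!\S)"
)


def matches_query(labels):
    """Return True if the pattern occurs in the sequence of chunk labels."""
    return QUERY.search(" ".join(labels)) is not None


# "Es war klar, daß ..." reduced to chunk labels (simplified).
plain = matches_query(["NP", "VC", "APpred", "CLfin"])
# Hypothetical sequence with two adjuncts between verb complex and adjective.
with_adjuncts = matches_query(["NP", "VC", "AdvP", "PP", "APpred", "CLfin"])
# An ordinary NP between verb complex and adjective is not an allowed adjunct.
blocked = matches_query(["NP", "VC", "NP", "APpred", "CLfin"])
```

Restricting the intervening material to adjunct-like chunks keeps the verb complex and the predicative adjective related while still allowing for word order variation.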

Page 90: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

adjective + verb + finite clause

sein bleiben machen werden

fraglich 326 34 3

unklar 320 103

klar 225 41 30

offen 228 40

möglich 160 30 2

wichtig 180 2

deutlich 5 97 34

total 1500 177 168 75
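
A table of this kind can be derived from the query results by counting adjective and verb lemma pairs. The Python sketch below shows one way to do this; the input format (one `(adjective, verb)` pair per match), the function name and the toy pairs are assumptions for illustration and do not reproduce the corpus counts above.

```python
from collections import Counter


def cross_tabulate(pairs):
    """Turn (adjective, verb) pairs into a nested table of co-occurrence counts."""
    table = {}
    for (adjective, verb), count in Counter(pairs).items():
        table.setdefault(adjective, {})[verb] = count
    return table


# Hypothetical extraction results, one pair per matched sentence.
pairs = [("klar", "sein"), ("klar", "werden"), ("klar", "sein"), ("deutlich", "machen")]
table = cross_tabulate(pairs)
```

Cells for combinations that never occur are simply absent, which corresponds to the empty cells in the table.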

Page 92: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Topicalized finite clause

adjective + verb + finite clause

CLfin VC (AdvP|PP|NPtemp|CLrel)* APpred

Page 93: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

adjective + verb + finite clause

            fincl_ex  fincl_top  total
fraglich    91        335        426
unklar      13        413        426
klar        221       159        380
offen       19        266        285
möglich     207       4          211
wichtig     192       9          201
deutlich    139       22         161

Page 95: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

adjective + verb + infinitival clause

sein fallen haben werden machen

bereit 431 4 6

schwer 162 221 108 33 26

möglich 532 40 35

schwierig 245 93 12

leicht 120 59 31 8 16

nötig 112 48 2 7

erforderlich 102 1 15

total 1708 280 195 183 111

Page 97: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

low-frequency adjective + verb + infinitival clause

stehen bringen haben sein

frei 35 4

satt 19 10

fertig 24 1

Page 98: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

low-frequency adjective + verb + clause

stehen bringen haben sein

frei 37 6

satt 27 11

fertig 26 1

Page 99: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

adjective subcategorization

• APs with PP complements embedded in NPs

Die [AP dafür erforderlichen] 300 000 Mark
The for this needed 300 000 Marks
'The 300 000 Marks needed for this'

Der [AP auf Sport spezialisierte] Journalist
The on sports specialised journalist
'The journalist specialising in sports'

Page 100: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

multiword units and abbreviations

• chunks/phrases in brackets or quotes

• multiword units
„Teenage Mutant Hero Turtle“
(FC Italia Frankfurt)

• abbreviations
Deutscher Aktienindex (Dax)
Stickstoffdioxyd (NO2)

Page 101: IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes

Conclusion

• recursive chunking is a workable compromise between depth of analysis and robustness

• extracted data show correlation between

• collocational preference

• subcategorization frames

• semantic classes of adjectives

• to a certain extent distributional preferences