
Department of Informatics and Mathematics

Chair for Distributed Information Systems

Prof. Dr. Harald Kosch

Enhancing Text Tokenization with Semantic Annotations
using Natural Language Processing and Text Mining Techniques

Raphael Pigulla

June 22, 2009

37544, [email protected]

Diploma thesis supervised by Dipl.-Ing. Günther Hölbling

Advisor: Prof. Dr. Harald Kosch
2nd Advisor: Dr. Bernhard Sick


Declaration of Authorship

I hereby declare that I have written this thesis independently, that I have not submitted it elsewhere for examination purposes, and that I have used no aids other than those indicated. Passages taken verbatim or in substance from other sources are marked as such.

Passau, June 22, 2009

Raphael Pigulla


Acknowledgements

I would like to extend my sincerest gratitude to everyone who has helped me with this thesis. I thank my advisor Günther Hölbling for his assistance and support, and Prof. Dr. Harald Kosch for providing me with the opportunity to work in this interesting and challenging field of research.

I wish to thank Walesia Bernard, Paul Maier and Andrew Bromba for their much appreciated comments and suggestions. Thanks go out to everyone who took the time to participate in the web survey and contributed valuable data pivotal for various aspects of this work. Finally, I am deeply obliged to my family for their ceaseless support and patience.


Abstract

Extracting semantic information from natural language texts is a task that encompasses many different concepts. Implementations are often tailored towards specific textual domains. In this thesis, we propose a generally applicable approach to augmenting text with semantic annotations.

The system described in this thesis uses shallow semantic chunking and a heuristic based on part-of-speech tags to generate a list of word tuples descriptive for a given text. The input is first tokenized and each word is annotated with one or more meanings using GermaNet. Neighboring tokens that match certain structural patterns are combined into chunks. From the thus structured text, descriptive word tuples are extracted using part-of-speech tags and other lexical and grammatical features.

Our evaluation against a manually annotated test corpus has shown that this approach is robust to texts of varying styles and genres. The semantic annotations can be used in conjunction with standard text mining techniques to improve the performance of conventional search engines and recommender systems.


Contents

1. Introduction
   1.1. Outline
   1.2. Related Work
   1.3. Explanation of Important Terms
   1.4. Remarks

2. Tokenization
   2.1. Problems of Tokenizing Natural Language
        2.1.1. Word Segmentation
        2.1.2. Sentence Detection
        2.1.3. Sublanguages
   2.2. Tokenization Methods
        2.2.1. Dictionary-Based
        2.2.2. N-Grams
        2.2.3. Perceptron Learning
        2.2.4. Search-Based

3. Natural Language Processing and Text Mining Techniques
   3.1. Applications of NLP
   3.2. Morphological Techniques
        3.2.1. Stemming
        3.2.2. Lemmatization
        3.2.3. Compound Splitting
   3.3. Stop Words
   3.4. Part-of-Speech Tagging
   3.5. Named Entity Recognition
   3.6. Coreference Resolution
   3.7. Ontologies
        3.7.1. Structural Description
        3.7.2. Formalization
        3.7.3. Measuring Semantic Relatedness
   3.8. Word Sense Disambiguation
   3.9. Frame Semantics


4. Assessment of Existing Libraries and Toolkits
   4.1. Part-of-Speech Taggers
        4.1.1. OpenNLP
        4.1.2. Stanford NLP
        4.1.3. TreeTagger
   4.2. Lemmatizers
        4.2.1. Morphy
        4.2.2. LemmaGen
        4.2.3. LemServer
   4.3. Ontologies
        4.3.1. GermaNet
        4.3.2. OpenThesaurus
   4.4. Frameworks
        4.4.1. GATE
        4.4.2. UIMA
   4.5. Miscellaneous
        4.5.1. Shalmaneser
        4.5.2. JWordSplitter

5. Implementation
   5.1. Preprocessing
   5.2. Tokenization and PoS-Tagging
   5.3. Semantic Analysis
        5.3.1. Lemmatization
        5.3.2. Compound Splitting
        5.3.3. Ontology Integration
        5.3.4. Word Sense Disambiguation
        5.3.5. Chunking
   5.4. Extraction of Descriptive Terms
        5.4.1. Tuple Generation
        5.4.2. Search-Space Reduction
        5.4.3. Scoring
   5.5. Result Set

6. Evaluation
   6.1. Particularities of EPGs
   6.2. The Test Corpus and Preliminary Evaluation
   6.3. Component Performance
        6.3.1. Compound Splitting
        6.3.2. Word Sense Disambiguation
        6.3.3. Chunking
   6.4. Scoring
   6.5. Overall Performance
   6.6. Unresolved Issues


7. Conclusion and Future Work

A. The STTS tag set
B. Example output of Sliver


List of Figures

2.1. A token hierarchy
2.2. Naive sentence detection algorithm

3.1. An exemplary ontology
3.2. Application of a semantic frame
3.3. Relations between semantic frames

4.1. Excerpt of LemmaGen's rule set for German
4.2. LemServer architecture
4.3. The Apache UIMA project

5.1. The Sliver processing pipeline
5.2. Algorithm for word sense disambiguation
5.3. Example of semantic chunking
5.4. Basic algorithm for heuristic tuple generation
5.5. Comparison of lexicographic and pos-weighted distances

6.1. Distribution of description length
6.2. Distribution of pos-tags in the test corpus
6.3. Screenshot of the Sliver web application
6.4. Perceived relevance of pos-tag combinations
6.5. Perceived semantic connectedness
6.6. Performance of heuristic tuple generation


List of Tables

4.1. Performance of evaluated pos-taggers
4.2. Performance of evaluated lemmatizers
4.3. Coverage and relational density of evaluated ontologies

5.1. Overview of chunking patterns

6.1. Performance of word sense disambiguation
6.2. Performance of chunkers


Chapter 1.

Introduction

In 2008, the average German household received over 70 different TV stations [Süd09]. With a conservative estimate of one program per hour, this totals over 1,600 programs per day, every day. Finding shows of interest has become a cumbersome and oftentimes unmanageable task for the user. Intelligent recommender systems integrated into a user's TV can monitor his or her behavior and provide information about relevant broadcasts in real time, or record them autonomously while the user is away. The acceptance of such systems by the user is directly connected to their performance. The performance, in turn, depends fundamentally on the recommender's ability to relate shows on a semantic level, beyond the genres, actor names and keywords provided by electronic program guides (EPG).

This thesis proposes a system to extract important semantic concepts from program descriptions in a fully automated fashion. We have approached this problem by combining various techniques from the fields of Natural Language Processing (NLP) and Text Mining, described in detail in this work. Through a series of processing steps, the input text is gradually augmented with semantic annotations that provide rich grounds for other systems to make hand-tailored recommendations on a just-in-time basis.


1.1. Outline

This thesis is structured as follows. In the rest of this chapter we will first give a brief overview of related techniques and applications. We will then establish some common terminology that will be used throughout this work. Terminology only required in a specific context is introduced as needed within the respective chapters.

The second chapter gives an introduction to the principles of tokenization in a broader context before discussing their applicability to NLP. Several distinct ways of implementing a tokenizer for natural language texts will be presented and illustrated with concrete examples. Chapter three gives an overview of some important and most commonly used facets of Natural Language Processing and Text Mining, starting with elementary word-based techniques and ending with sophisticated methods that combine multiple aspects. Chapter four details our assessment of existing libraries, toolkits and frameworks that we considered for integration into our system; they will be evaluated in terms of performance and applicability to our aim. In chapter five we describe the implementation of our system and how the previously evaluated techniques were employed. Chapter six introduces the evaluation setup and reviews the performance of the individual components and the application as a whole. Finally, a conclusion is drawn in chapter seven.

1.2. Related Work

A system structurally similar to our implementation was built by Zesch and Gurevych [ZG06]. The purpose of that system, however, differs significantly from ours. It is meant as a tool to automatically generate pairs of words that can be used to evaluate measures of semantic relatedness. Unlike our document-based approach, it works on an entire corpus and creates tuples of grammatically homogeneous words which are then meant to be rated by a human in terms of similarity.

Latent Semantic Indexing (LSI) is a related technique that can be used to quantify the semantic relatedness between texts. However, it is patented and computationally expensive. Its runtime scales disproportionately with the number of texts to be indexed, and the addition of a new text requires a complete re-run of the algorithm on the entire corpus. This limits the practical applicability of LSI to rapidly growing text corpora like EPGs. Furthermore, Rehder et al. concluded that Latent Semantic Analysis, the technique underlying LSI, does not perform well on texts with fewer than 200 words [RSW+98]. The majority of EPG entries are considerably shorter.

Less directly related is the field of Automatic Text Summarization. Here, the idea is to capture a text's most important aspects in order to automatically generate abstracts or summaries. This is not applicable to our work for two reasons. First, the texts to be processed are very short and concise to begin with. Second, little is gained in terms of processability, since a summary is computationally no easier to handle than the full text from which it originated.

1.3. Explanation of Important Terms

corpus: A corpus (pl. corpora) is a large set of texts that is often annotated with additional information, depending on its intended use. For instance, the Bible could be considered a corpus with verse annotations. Well-tended corpora are frequently used in NLP to train machine learning systems. They are also called training corpora.

relevant document: A document within a corpus is called relevant with respect to a query if it should be returned by that query. In the trivial case of a full-text search for the string money, a document would be relevant if and only if it contained the word money.

relevant word: In the context of this work, a relevant word is a word that has an intrinsic and accessible meaning, or sense. Intrinsic means the word is meaningful in and by itself. For instance, the word nonetheless has no innate sense and is as such considered irrelevant. To be relevant, a word's meaning also has to be accessible to the system. This commonly means that the word can be found in a program's look-up dictionary, whatever that may be in a concrete implementation.

concept: We use the term concept to describe a pair of words that is descriptive for a text. The decision whether such a word tuple is descriptive or not is of course very subjective. In this work, we use the loose definition of meaningful to a human in the sense that a list of concepts can form a "mental image", e.g. (prince, enchanted), (frog, kiss), (spoiled, princess).


tf.idf: The term frequency/inverse document frequency is a simple way to weigh the significance of a term in a corpus. The tf.idf score grows with the frequency of a word within a document, but is reduced for terms that are common across the entire corpus. It is thus a measure of the specificity of a term for a given text. More details on the mathematical background can be found in [SB88].
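Concretely, one common formulation (a representative variant, not necessarily the exact weighting used in [SB88]) is

$$\operatorname{tf.idf}(t, d) = \operatorname{tf}(t, d) \cdot \log \frac{N}{\operatorname{df}(t)}$$

where $\operatorname{tf}(t, d)$ is the number of occurrences of term $t$ in document $d$, $\operatorname{df}(t)$ the number of documents containing $t$, and $N$ the total number of documents in the corpus.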

precision: The precision of an IR system is the percentage of relevant documents among all retrieved documents for a query. In other words, a precision of 1 means that only relevant documents were returned, while a precision of 0 means none of the returned documents were relevant.

recall: The ratio of returned relevant documents to all relevant documents is the recall. That is to say, a high recall means the result set contains most of the correct answers, and a low recall means only few relevant documents were returned. It is clear that neither precision nor recall alone is a meaningful measure. If the entire corpus is returned, for instance, the recall is 100% while the precision will be poor.

f-measure: The f-measure (or f-score) is a combination of both recall and precision. It is commonly used to describe the performance of an IR system with a single value. The f-score is defined as $F = 2 \cdot (p \cdot r)/(p + r)$. Variations emphasizing either recall or precision are also common.
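To make the three measures concrete, consider the following minimal Python sketch (our own illustration; the function name is ours):

    def precision_recall_f1(retrieved, relevant):
        """Precision, recall and balanced f-measure for one query."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        p = hits / len(retrieved) if retrieved else 0.0
        r = hits / len(relevant) if relevant else 0.0
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return p, r, f

    # Returning too many documents yields perfect recall but poor precision:
    print(precision_recall_f1(retrieved={1, 2, 3, 4, 5}, relevant={2, 4}))
    # (0.4, 1.0, 0.5714...)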

synset: A synset (sometimes also called a synonym ring) is a set of semantically equivalent elements. In linguistics, a synset is a set of synonymous words, i.e. words that share the same meaning. An example of this is the set {father, dad, daddy}.

homonym: A homonym is, in a sense, the opposite of a synonym. Two words with distinct meanings are said to be homonymous if they are spelled and pronounced identically. The word plane is a homonym with the meanings of aircraft and surface, among others.

homograph: Words that are spelled the same but pronounced differently are homographs, e.g. the record and to record. Because in written language every homograph is also a homonym, we will not always strictly distinguish between the two.

polyseme: A homograph or homonym whose meanings are all closely related is called a polyseme. The word screen, with the conceptually related meanings of wire mesh and display, is an example of this.


morpheme: The smallest linguistically meaningful unit is called a morpheme. While morphemes have a meaning, they often cannot stand by themselves. In English, for example, the morphemic suffix -ish (with the meaning of to some degree) does not occur by itself, but can be appended to adjectives or nouns ("Is he big? - No, but biggish.").

1.4. Remarks

While most techniques and issues discussed in this work are, to a greater or lesser extent, applicable to all languages, this is not always the case. Wherever possible, we try to use examples in English so the reading flow is not interrupted. Where this is not sensible, we provide English translations, though these may not preserve the illustrative point of the German example.


Chapter 2.

Tokenization

Working with textual input is one of the most frequently performed tasks in computing, be it natural language, the source code of a program, or something in between like a domain-specific language such as SQL. In any case, the first step is to transform the concrete textual representation into something suitable for computational processing. The task of grouping semantically meaningful sequences of elements into tokens is called lexical analysis or tokenization.

The "classical" use of tokenization is in the context of compilers. The tokenizer reads the input sequentially, character by character, and creates a sequence of tokens that is used to build the parse tree. Being an essential part of this process, tokenization is oftentimes taken for granted and rarely given much thought as to how it is actually implemented. This is partially due to the fact that most commonly parsed data are by design trivial to tokenize, such as source code, arithmetic expressions or command line arguments.

However, as soon as one leaves the world of well-defined and computer-friendly grammars, things quickly become more complex. While "traditional" tokenization can in most cases easily be dealt with by finite state machines or regular expressions, more sophisticated techniques are needed when working with texts in natural language.

In the first section of this chapter we will discuss how the concept of tokenization can be applied to Natural Language Processing and what challenges arise when doing so. The second section introduces a variety of techniques to effectively tokenize natural language.


2.1. Problems of Tokenizing Natural Language

In natural language it may seem intuitively clear what a token is: a word. But actually agreeing on a concise, universal definition is not at all trivial, or even feasible. This is because different views on the same problem may require varying degrees of abstraction. It is often helpful to interpret these different levels of granularity as a hierarchy, if possible.

To illustrate, consider figure 2.1. The most basic level is the binary representation of the data, where each byte could be considered a token. One level above that, up to four bytes represent a Unicode code point, or character. A sequence of characters forms a word token, and so on. The combination of semantically related tokens is called chunking. Its practical application will be discussed in section 5.3.5.

Figure 2.1.: A token hierarchy

It is possible that a given syntax requires a token which does not possess any innate semantic value. This is the case with blanks, as can be seen in figure 2.1 on the transition from the character to the word level. Another example are padding bytes in the UTF-32 character encoding¹.
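The hierarchy can be made tangible with a few lines of Python (our own illustration, mirroring the levels of figure 2.1):

    text = "Naïve test"
    byte_tokens = text.encode("utf-8")  # level 1: bytes (the ï occupies two bytes)
    char_tokens = list(text)            # level 2: Unicode code points
    word_tokens = text.split(" ")       # level 3: words (the blank yields no token)
    print(len(byte_tokens), len(char_tokens), len(word_tokens))  # 11 10 2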

There are several other "layers" that could be added to this hierarchy, such as phonemes (the smallest units of sound one can utter) or syllables. However, not all linguistic elements follow this strict compositional pattern. For instance, morphemes are not always assembled from syllables: the syllabic composition of the German word zerlegen is zer·le·gen, while its morphemic one is zer·leg·en.

It becomes clear that the definition of a token depends on the application at hand. In any case, a token is considered atomic in the context in which it was defined.

¹ The byte representation in figure 2.1 is in UTF-8, which is a variable-length encoding using 1 to 4 bytes as needed.


In Natural Language Processing, tokenization commonly refers to the initial process of splitting the textual input into its most basic relevant parts, namely words and punctuation marks.

It’s never just a game when you’re winning .

At this stage, no semantic information is available and the tokenizer should make no assumptions about the contents of the text. In practice, however, this can often not be avoided, especially when ambiguities need to be resolved. Section 2.2 will discuss this in more detail.

The complete process of tokenization can be divided into two distinct steps. First, the individual sentences have to be identified. This is done through sentence boundary disambiguation. Second, word segmentation splits the "content" of each sentence into separate tokens. In practice, this is commonly done in reverse order, since sentence detection depends on the analysis of its tokens.

2.1.1. Word Segmentation

The task of dividing a sequence of characters into its component words is called word segmentation. From a European or American point of view this is fairly straightforward, as it can largely be dealt with by splitting a string by the space character. However, most morphemic languages (i.e. languages where a single symbol represents an entire morpheme) lack such a distinct delimiter, for instance Chinese or Thai. In these cases, word segmentation turns out to be a rather difficult problem and more sophisticated approaches are required².

Real-world applications processing German language texts are oftentimes faced with another issue. In part due to the steadily increasing adoption of Anglicisms, the blank is now frequently being used to (incorrectly) separate nominal compounds. This does not only occur in colloquial speech but can also appear as a result of Machine Translation (e.g. grape juice → Trauben Saft). Because languages generally allow for the use of two consecutive nouns ("Er wollte aus den Trauben Saft machen"), the tokenizer cannot decide which interpretation is correct by syntactic information alone.

² More precisely, it is not the lack of a distinct delimiter per se that is the problem. The real issue is the lack of unambiguous word boundary indicators.


In addition to this, it is often mandatory to detect and split contractions, i.e. the shortening of a word by omitting internal characters (I am → I'm or going to → gonna). While these contractions follow fairly strict rules in written language, they can be difficult to detect and handle if the input has come from a speech recognition system.

2.1.2. Sentence Detection

For a human reader, deciding where one sentence ends and the next begins is an almost trivial task. However, most of what we subconsciously factor in when making this decision is not easily quantifiable: reading experience, context knowledge and subtle semantics. Formalizing this process of sentence detection or sentence boundary disambiguation has proven to be more difficult than one might expect.

Although a sentence is commonly ended by a period, the reverse is not always true. A period can also be a decimal point, denote an abbreviation or ellipsis, or be part of an emoticon or URL (or possibly a combination thereof). The actual distribution obviously varies greatly between corpora. It has been estimated that almost 50% of all periods in the Wall Street Journal corpus are due to their use in abbreviations, while it is less than 10% in the Brown corpus [SFK99]. Even further complications arise with the use of colloquial speech, embedded quotations or some rhetorical devices.

    for (i := 0; i < t.length; i++) do
        if (t[i] = ".") then
            if in_set(abbr, t[i-1]) then continue;
            if capitalized(t[i+1]) then mark_sentence_end(i);
        end;
    end;

Figure 2.2.: Naive sentence detection (simplified)

Given a list t of tokens and a set abbr of known abbreviations, a crude rule-based approach like the one in figure 2.2 could be employed. This is reported to achieve an average accuracy of approximately 95% on standard text corpora. Given the fact that humans agree unanimously in almost each and every case, this seems much less impressive. Furthermore, errors made in an early stage of an analysis propagate rapidly throughout subsequent steps and can significantly impact the overall result. Grammatical analysis in particular, such as part-of-speech tagging as described in section 3.4, depends heavily on accurate sentence detection.
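For illustration, a direct (and equally naive) Python transcription of figure 2.2; the token list and the set of known abbreviations are assumed to be given:

    def naive_sentence_ends(tokens, abbreviations):
        """Return indices of period tokens that (naively) end a sentence."""
        ends = []
        for i, token in enumerate(tokens):
            if token != ".":
                continue
            # A period following a known abbreviation is not a boundary.
            if i > 0 and tokens[i - 1] in abbreviations:
                continue
            # A capitalized successor token suggests a sentence boundary.
            if i + 1 < len(tokens) and tokens[i + 1][0].isupper():
                ends.append(i)
        return ends

    tokens = "I saw Dr . Smith . He waved .".split()
    print(naive_sentence_ends(tokens, abbreviations={"Dr"}))
    # [5] -- note that the final period is missed, one of many simplifications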


While simple rule-based approaches or regular expressions can work reasonably well for the text genres they were initially designed for, they naturally do not generalize well beyond them. For this reason, more flexible and domain-independent algorithms were devised that can adapt to any corpus by training them appropriately. Among the most commonly chosen machine learning techniques for this task are neural networks [Pal94] and maximum entropy models [RR97]. Current implementations achieve a precision of 98.5% and higher.

2.1.3. Sublanguages

Although trainable systems notably alleviate the need for genre-specific tokenizers, there are still cases in which they do not perform adequately. Texts from very restricted domains often resort to a sublanguage unique to that domain.

In biomedical literature, for instance, the structure and inconsistent use of the same term poses a serious problem for Information Retrieval systems. To illustrate, consider a gene symbol like MIP-1-alpha, which can be written as MIP-1alpha or (MIP)-1 alpha, among others. A generic tokenizer, even if it is trained specifically on such a corpus, cannot adapt well enough and will in many cases perform poorly for two reasons. First, it will produce different sets of tokens for the same logical entity, depending on the concrete textual input. Second, the tokens it does generate will most likely not reflect the true constituents of the gene. In the aforementioned example the individual parts of the gene are (MIP-1, alpha). Generic tokenization, however, is more likely to produce a segmentation such as (MIP, 1, alpha). As a result, implementing domain-specific tokenizers that incorporate expert knowledge can improve performance significantly [JZ07].

2.2. Tokenization Methods

As previously mentioned, the problem of tokenizing German and English texts can be considered solved for all practical purposes. For other languages this is still a topic of research. The Special Interest Group of the Association for Computational Linguistics[1] (SIGHAN) regularly organizes a "Chinese Language Processing Bakeoff" with word segmentation being a core discipline. This section will use Chinese as a concrete example to illustrate the difficulties one faces when tokenizing input that lacks a distinct word delimiter. However, these challenges are by no means limited to processing Asian languages. The very same issues arise in the context of speech recognition systems, where a continuous stream of sound has to be segmented, or in OCR applications.

In Western languages, words are made of individual characters that have no meaning by themselves. This implies directly that a word, as a "sequence of meaningless characters", is atomic with respect to its meaning. In Chinese, on the other hand, the situation is different. Any Chinese character, or symbol, has an intrinsic meaning. When multiple characters form a word, the meaning of the whole does not necessarily reflect the meaning of its parts. As a consequence, it can be very difficult to determine which characters belong together. Consider the following sequence of symbols as part of a sentence:

$$S = (t_1, t_2, t_3, t_4, t_5)$$

Such a sequence can potentially harbor two kinds of ambiguities. First, assume both the entire sequence $S$ as well as a partition $S_1 = (t_1, t_2)$ and $S_2 = (t_3, t_4, t_5)$ of $S$ form valid words. That is to say, the word $S$ is a set of other words. This is called a conjunctive ambiguity in $S$. At first glance, this seems to be a form of compounding (see section 3.2.3). However, the key difference is that it happens on the grammatical level, while compounds are a purely semantic issue³.

A second, less common way in which $S$ can be ambivalent to segment is if overlapping subsequences can be found. For instance, let $S_1 = (t_1, t_2, t_3)$ and $S_2 = (t_3, t_4, t_5)$, so that $S$, $S_1$ and $S_2$ are all valid words. In this case, $S$ is said to possess disjunctive ambiguity.

We will now discuss four different techniques that can be employed to perform word segmentation on texts with no clear word boundary indicators. With the exception of dictionary-based tokenizers, all of the following methods are based on collocations. A collocation is a sequence of words that occurs more frequently than would be expected by chance alone. For instance, a speech recognition system could use collocation data to distinguish "grade A beef" from "gray day beef", because the latter words have a lower probability of occurring in sequence. Once collected, this information can be harnessed in various ways to build sophisticated tokenizers.

³ Consider the concatenated sequence of words "thedragonflies", which can be segmented as either "the dragonflies" or "the dragon flies", with inherently different repercussions on both the grammatical structure and the semantics implied by it.


2.2.1. Dictionary-Based

One of the most obvious ways of performing tokenization is to use a dictionary that defines the set of all valid words. The tokenizer can then match the input characters against the dictionary to determine which sequences form a valid word and which do not. Implementations of dictionary-based methods can be classified by the following three characteristics (see [WK92]):

• direction
  The direction specifies whether the tokenizer will, given a start character, search to the right (forward) or in the opposite direction (backward). For reasons of simplicity, we will ignore the issue of left-to-right and right-to-left languages, such as German and Arabic, respectively.

• greediness
  If a tokenizer returns the minimum matching, it is called greedy. Likewise, a tokenizer that returns the maximum matching is ungreedy. In other words, greedy algorithms are satisfied with local optima, whereas ungreedy ones try to find global optima.

• omission/addition
  The tokenizer can either start with the full sequence of characters and successively omit elements, or start with the empty sequence and gradually add characters. This is called omission-based or addition-based, respectively.

It is obvious that these characteristics are not orthogonal. For instance, it makes little sense to implement a greedy addition-based tokenizer, as it would practically always stop after reading the first input character. One possible configuration is sketched below.
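As a minimal sketch of one such configuration, the following Python fragment implements a forward, addition-based tokenizer using the widespread longest-match heuristic (the toy dictionary and the function name are ours):

    def forward_longest_match(text, dictionary):
        """Repeatedly take the longest dictionary word starting at the
        current position (forward direction, addition-based)."""
        longest = max(map(len, dictionary))
        tokens, i = [], 0
        while i < len(text):
            for length in range(min(longest, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in dictionary:
                    tokens.append(candidate)  # unknown single characters pass through
                    i += length
                    break
        return tokens

    print(forward_longest_match("thedragonflies",
                                {"the", "dragon", "dragonflies", "flies"}))
    # ['the', 'dragonflies'] -- the disjunctive ambiguity is resolved
    # in favor of the longer match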

Given an adequately large dictionary, this class of tokenizers was shown to achieve an identification rate as high as 98%. The dependency on a dictionary, however, leads to poor tokenization of texts that contain new words, and leaves very little room for incorporating domain-specific expert knowledge.

As a side note, an interesting thing to remark about dictionary-based tokenizers is that they are "doing things backwards". That is to say, they take a list of words to generate tokenization rules, when it should be the other way around.


2.2.2. N-Grams

Given a sequence $S$, an n-gram of that sequence is a continuous subsequence of length $n$. An n-gram based tokenizer is first trained with a pre-tokenized text corpus. Based on that data, a probabilistic model is built that can be used to predict the next token of a given sequence. In other words, an n-gram tokenizer calculates the probability $P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_{i-n+1})$ to decide what the next token should be, based on past experience.

Suppose an OCR system encounters the word often preceded by the words the and power. If trained on scientific texts, the model will predict that the sequence (the, power, of) (as in "five to the power of ten") is statistically more probable than the sequence (the, power, often) (as in "the power often goes out"). Another typical application of this technique is in speech recognition to distinguish oronyms (i.e. words that sound alike but are spelled differently), or in spam detection systems [Zdz05].
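A minimal bigram (n = 2) sketch of this idea in Python, with a toy training corpus of our own choosing:

    from collections import Counter, defaultdict

    def train_bigrams(corpus_tokens):
        """Count, for every token, which tokens follow it."""
        following = defaultdict(Counter)
        for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
            following[prev][nxt] += 1
        return following

    def probability(model, prev, nxt):
        """Maximum likelihood estimate of P(nxt | prev)."""
        total = sum(model[prev].values())
        return model[prev][nxt] / total if total else 0.0

    model = train_bigrams("the power of ten is greater than the power of two".split())
    print(probability(model, "power", "of"))     # 1.0 in this toy corpus
    print(probability(model, "power", "often"))  # 0.0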

Real-world implementations, such as the n-gram tokenizer proposed in [KXW03], achieve a recall of up to 98% on morphemic languages. However, a common criticism of this approach is that it is essentially problem-agnostic, i.e. it does not incorporate any linguistic knowledge and operates purely on an abstract mathematical level. As a result, n-gram tokenizers generally do not handle out-of-vocabulary words well and are, as such, no better than dictionary-based methods, at least on a theoretical level.

2.2.3. Perceptron Learning

Li et al. have successfully used a Perceptron with Uneven Margins for the word segmentation of Chinese [LMBC05]. The key idea is to classify each character in a language as either a single-character word or a character that occurs at the beginning, middle or end of a multi-character word. For each of these cases a classifier was trained on a hand-annotated corpus using the one-vs-all paradigm. A sliding window of size five was used to generate the input for each perceptron. In other words, the input consisted of a center character and the two characters both preceding and following it. This was done so collocations could be recognized in order to improve accuracy. The resulting tokenizer was shown to achieve f-measures between 92.7% and 95.6% on the four SIGHAN test corpora.

A major advantage of this approach over the other techniques, besides its relative simplicity, is that it is character-based. Given a large enough training corpus, the classifiers are exhaustive and can cover the entire set of characters of a given language, whereas word-based systems are naturally limited by their dictionary (either the implicit dictionary of the training corpus, or the explicit look-up dictionary). As a consequence, this class of tokenizers is significantly more robust against previously unknown words.
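To make the classification scheme concrete, the following sketch (ours, not the original implementation from [LMBC05]) shows the four-class character labeling and the size-five sliding window used to generate a classifier's input:

    def character_labels(words):
        """Label each character as a (S)ingle-character word or as the
        (B)eginning, (M)iddle or (E)nd of a multi-character word."""
        labels = []
        for word in words:
            if len(word) == 1:
                labels.append("S")
            else:
                labels.extend("B" + "M" * (len(word) - 2) + "E")
        return labels

    def window_features(chars, i, size=5):
        """The center character plus the two characters before and after it."""
        pad = size // 2
        padded = ["<s>"] * pad + list(chars) + ["</s>"] * pad
        return padded[i:i + size]

    words = ["ab", "c", "def"]       # a pre-segmented training sentence
    print(character_labels(words))   # ['B', 'E', 'S', 'B', 'M', 'E']
    print(window_features("".join(words), 2))  # ['a', 'b', 'c', 'd', 'e']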

2.2.4. Search-Based

An interesting alternative to conventional word segmentation methods was proposed by Wang et al., using common web search engines [WQL07]. The tokenization process is split into three steps: segment collecting, segment scoring and segment selection.

After an initial partitioning by punctuation marks, the resulting segments are submitted to a search engine such as Google or Yahoo. The returned results are analyzed with respect to which parts of the segment appear together. To illustrate, consider the contrived example of the English search query q = "Jude has a second major in criminal law"⁴. Running the query will show that the sequences criminal law, second major and jude law appear most frequently within the search results. The set of all returned segments is $S \subseteq \mathcal{P}(q)$.

⁴ In this example, we assume that we already have word tokens and are now trying to segment on a semantic level, i.e. we do chunking. While this is not the intended application of the algorithm, the same principles can be applied and it serves the illustrative purpose nicely.

In the second step, all segments in $S$ are ranked using a scoring function $\sigma : \mathcal{P}(q) \to \mathbb{R}$. An obvious function to use is the segment frequency, i.e. the ratio between the number of search results that contained the segment and the total number of results. Wang et al. [WQL07] have also evaluated a Support Vector Machine-based scorer, which in the end yielded a higher recall but slightly lower precision, and an f-measure of around 88%.

A subset $s \subseteq S$ is called a valid segmentation if it can reconstruct the original query, that is to say, if $s$ is a partition of $q$. Note that the word order is important, so the segment jude law from the above example would never be part of a valid segmentation. The set of all valid segmentations of $q$ is $\tilde{S}(q) \subseteq \mathcal{P}(S)$. The final result $R$ is then the valid segmentation with the highest average score:

$$R(q) = \operatorname*{arg\,max}_{\tilde{s} \in \tilde{S}(q)} \left( \frac{1}{|\tilde{s}|} \sum_{s \in \tilde{s}} \sigma(s) \right)$$

While not much research has been done on search-based tokenization so far, it appears to be a promising approach that has several advantages over traditional tokenization strategies. In particular, it is neither dictionary-bound nor does it require training. One downside is that current search engines often perform optimizations and query modifications that are unfavorable for this method, e.g. stop words are removed from the query or some form of stemming is done (see the following chapter). Using a search engine specifically tailored to this purpose could further improve performance.
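A brute-force sketch of the selection step under the above definitions: given scored segments, enumerate all valid (order-preserving) segmentations of the query and return the one with the highest average score. The segments and σ-values are toy data of our own; a real implementation would obtain both from search results:

    def valid_segmentations(query, segments):
        """All ways to cover the query, in order, with known segments."""
        if not query:
            return [[]]
        results = []
        for length in range(1, len(query) + 1):
            head = tuple(query[:length])
            if head in segments:
                for rest in valid_segmentations(query[length:], segments):
                    results.append([head] + rest)
        return results

    def best_segmentation(query, scores):
        """The valid segmentation with the highest average sigma-score."""
        candidates = valid_segmentations(query, set(scores))
        return max(candidates,
                   key=lambda s: sum(scores[seg] for seg in s) / len(s))

    scores = {("criminal", "law"): 0.9, ("second", "major"): 0.8,
              ("criminal",): 0.3, ("law",): 0.3, ("second",): 0.2, ("major",): 0.2}
    print(best_segmentation(["second", "major", "criminal", "law"], scores))
    # [('second', 'major'), ('criminal', 'law')]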

In this chapter we have introduced the concept of tokenization and illustrated its practical use in NLP applications. It is clear that inaccurate tokenization can have many negative consequences. Tokens represent the "physical" structure of a text; errors made here will propagate quickly throughout subsequent analytical steps. For instance, if a word is not tokenized correctly, dictionary look-ups will fail. Likewise, if a comma or period is missed, grammatical analysis is negatively affected. In the following chapter we will assume a correct tokenization. Given the relative simplicity with which German can be tokenized, this is not too bold an assumption to make.


Chapter 3.

Natural Language Processing and Text Mining Techniques

Programming languages are in most cases easy to parse due to their context-free grammar and well-defined syntax. Texts in natural language, on the other hand, are very complex and oftentimes ambiguous in both syntax and semantics. The science of recognizing, processing and understanding human language is subsumed under the term Natural Language Processing or Computational Linguistics. It is an interdisciplinary field that combines computer science, logic, mathematics, linguistics and others. Natural Language Processing encompasses a wide range of methodologies, ranging from speech recognition to the automated extraction of semantic features.

The term Data Mining describes the discovery of novel and useful information in vast amounts of data. Text Mining is the application of Data Mining techniques to the domain of natural language texts. The pivotal difference here is that, unlike in traditional Data Mining applications, the data is not already more or less rigorously structured, and thus not easily processable for computation.

For this reason, Natural Language Processing and Text Mining often go hand in hand. The textual input is first processed by NLP tools, then transformed into a representation suitable for the employment of Text Mining algorithms. The result of these algorithms is often fed back into the NLP system and the cycle starts anew.

In this chapter, we will first briefly motivate the importance of NLP in real-world applications. This is followed by a discussion of the most commonly utilized NLP techniques. Of particular interest here are their technical interdependencies, how they relate to one another on a global level, their fundamental limitations and their concrete applicability in the context of our work. Precise linguistic terminology is introduced as needed.


3.1. Applications of NLP

Humans are constantly exposed to natural language, both in written and spoken form. With computers permeating more and more aspects of our everyday lives, processing language in an intuitive and efficient manner becomes ever more important. Some of the most established real-world areas of application are:

• Word Processing
  One of the first areas into which NLP techniques found their way was Word Processing. Features such as context-sensitive thesauri and automated grammar and spell checking are now standard in any modern word processor.

• Information Retrieval
  The exponential growth of knowledge makes finding information increasingly difficult. Traditional methods like full-text searches have issues of scalability and no longer meet the demands of handling the vast number of documents in current corpora.

• Machine Translation
  The automated translation of text was one of the initial motivations for NLP and is still a subject of great interest. It is often regarded as the one application by whose success the entire field of research can (or should) be measured.

• Speech Recognition and Synthesis
  The recognition and synthesis of language forms the most natural interface between human and machine. It has become an integral part of barrier-free applications, guidance systems and automated directory assistance.

• Automatic Summarization
  Summarizing a text, by either reducing it to its most relevant phrases or by creating an abstract from scratch, is a valuable tool when large amounts of text need to be surveyed, e.g. when performing automated news aggregation from various external sources or outlining the contents of the latest scientific papers.

• Sentiment Analysis
  Services like Metacritic[2] utilize NLP to autonomously evaluate reviews and classify them as either favorable or unfavorable. Manufacturers can employ techniques like this to gather feedback about their products in a fully automated fashion.


3.2. Morphological Techniques

The form a word takes typically changes depending on the grammatical context in which it is used. The manner in which a language handles grammatical relations and relational categories such as case, mood, voice, tense, aspect, person, number and gender is called inflection. The analysis and description of these inflectional rules is called morphology. For instance, the German verb fliegen (eng. to fly) can, among others, take the forms flöge and geflogen. All inflected forms are representations of the same abstract morphological unit called its lexeme¹. It is evident that while the inflection can potentially carry useful information, it complicates the identification of a word's lexeme and thus its intrinsic meaning.

¹ A lexeme is, in a sense, the abstract class of a word, while the inflections could be seen as concrete instantiations of that class.

It is important to realize that a lexeme is a theoretical construct and has, as such, no innate concrete textual representation. In practice, a lexeme is generally typified by its canonical (or "dictionary") form, or lemma (see section 3.2.2). Consequently, correctly inferring the lexeme of a given word is an important step, as it abstracts from its textual representation and allows for processing on the semantic level.

3.2.1. Stemming

The part of a word that is common to all its inflected variants is called a stem. The stem itself is often not a genuine word. Consider the set {independent, independence, independently}, whose common stem is independ (the trailing e is usually omitted).

Originally, the primary application of stemming was in Information Retrieval, where its use has led to significant improvements in the recall of search results with only minor impacts on precision, as concluded by Kraaij and Pohlmann [KP96]. Grouping similar words together is also very useful when techniques like tf.idf are to be employed, because it decreases the spread of relevant terms across the search-space. In that context, it is of no importance that the stem itself is not meaningful.

Some stemmers work purely algorithmically in the sense that they have a fixed set of rules and do not respect the peculiarities of a given language. These stemmers do not perform well in cases where the stem of a word varies between its inflections. Words with this characteristic are called suppletives. This happens when one inflected form is of a different historical origin than another. Examples of this are irregular verbs (to go, went, gone), but it can also appear with other word classes, mostly adjectives (good, better, best) or nouns (person, people), though the latter is rarely found in English or German. However, even simple derivational forms frequently cause problems. Most stemmers do not map closely related words like cohere and cohesion to the same stem because they (rightfully) consider the common stem coh not to be valid.

Conceptually, stemmers can be divided into two classes. A light (or weak) stemmer conflates only very closely related words such as (replace, replacement, replacing). This conservative strategy often leads to under-stemming, i.e. words not being grouped together when they should be. Heavy (or strong) stemmers merge words more aggressively. This is more prone to mapping unrelated words to the same stem, for example (divisor, division, dividend). This is called over-stemming.

The first stemming algorithm was proposed by Julie Beth Lovins in [Lov68]. Since then, a variety of stemming methods has been developed:

Brute Force: The stem is retrieved by matching the word against a static look-up table which relates inflected forms to their stem. While being comparatively inflexible, this approach has the advantage of being able to reliably stem any known word, including suppletive forms.

Affix Stripping: By removing common affixes (i.e. pre- and suffixes) according to a given set of rules, the word is successively reduced to its stem. The correct identification of affixes is often problematic. For instance, in uncommon the affix un- should be removed, but in understand it must not. This class of stemmers cannot handle suppletives.

Probabilistic Stemming: Probabilistic stemmers are trained on a set of inflected words of which the correct stem is known. A stochastic model is built that can then be applied to unknown words in order to deduce their most likely stem, i.e. the stem for which the expected error is minimal.

Hybrids: The precision of a stemming algorithm can be improved by combining multiple approaches. For example, an affix-stripping system can be improved by adding a list of all irregular verbs and the most commonly used suppletives. This can significantly increase its overall performance without restricting its versatility.
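As an illustration of the affix-stripping idea, consider this deliberately small Python sketch (a handful of ad-hoc rules of our own, nowhere near the coverage of a real stemmer such as Porter's):

    SUFFIX_RULES = [           # ordered: longer, more specific suffixes first
        ("fulness", "ful"),
        ("ational", "ate"),
        ("ization", "ize"),
        ("ing", ""),
        ("ed", ""),
        ("ly", ""),
        ("s", ""),
    ]

    def strip_suffix(word, min_stem_len=3):
        """Apply the first matching rule; the minimum stem length is a
        crude guard against over-stemming."""
        for suffix, replacement in SUFFIX_RULES:
            if word.endswith(suffix):
                stem = word[:len(word) - len(suffix)] + replacement
                if len(stem) >= min_stem_len:
                    return stem
        return word

    print([strip_suffix(w) for w in ["relational", "hopefulness", "jumping", "went"]])
    # ['relate', 'hopeful', 'jump', 'went'] -- the suppletive form is left untouched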


Languages where words are inflected by stringing individual affixes together (so-called agglutinative languages) are well suited for stemmers. An entirely agglutinative language would have distinct morphemes for each of its morphological paradigms, such as case, gender, tense, number or mood. Examples of strongly agglutinative languages are Hungarian, Georgian and Finnish. Constructed languages also commonly fall into this category, for instance Esperanto. The higher the grade of inflectional regularity, the easier it is to systematically extract a set of stemming rules and algorithmically apply them to previously unknown words.

Conversely, if inflectional affixes are merged, or if a single affix marks multiple grammatical categories, the language is called fusional. For instance, in the Latin word medicinarum (eng. medicine) the atomic suffix -arum marks both the plural and the genitive. Fusional languages are much more difficult to stem correctly because the inflecting morphemes are harder to identify. In this case, the use of a dictionary is mandatory.

At the time of writing, the most commonly used stemmer for the English language is the affix-stripping based Porter Stemmer, see [Por80]. Implementations are freely available at the Snowball[3] project. A brief comparison of stemming algorithms as well as a discussion of means to compare them quantitatively can be found in [FZ98].

Stemming naturally entails a loss of information by reducing a word to an artificial root form, losing a significant amount of its semantic value in the process. Moreover, in practice, different algorithms often yield different stems for the same input and cannot be used interchangeably. As a result, it is impossible to reliably infer a word's lexeme from its stem. Since the lexeme is essential for further analysis, stemming is of very limited use in this work.

3.2.2. Lemmatization

Lemmatization is a technique conceptually similar to stemming with the pivotal difference that a token is reduced to its “dictionary form”, or lemma. For instance, the inflected words seeks, seeking and sought are all lemmatized to the same canonical form seek.

To do this correctly, the lemmatizer (unlike a stemmer) employs contextual information such as pos-tags (see section 3.4). In doing so, it can distinguish inflectional homonyms like Stahl (eng. steel) and stahl (past indicative of to steal), whose lemmas are Stahl and stehlen, respectively, with entirely different meanings.
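The following sketch illustrates this disambiguation step under the assumption of a full-form dictionary that maps inflected forms to candidate lemmas together with their lexical class; the dictionary entries, tag names and types are invented for illustration:

import java.util.List;
import java.util.Map;

// Minimal sketch of pos-tag based lemma selection. The dictionary
// contents and the Candidate type are illustrative assumptions.
public class ToyLemmatizer {

    record Candidate(String lemma, String posTag) {}

    // Hypothetical full-form dictionary: inflected form -> candidate lemmas.
    private static final Map<String, List<Candidate>> DICT = Map.of(
            "stahl", List.of(
                    new Candidate("Stahl", "NN"),      // the noun "steel"
                    new Candidate("stehlen", "VVFIN")  // past tense of "to steal"
            ));

    // Pick the candidate whose lexical class matches the tagger's output.
    public static String lemmatize(String token, String posTag) {
        for (Candidate c : DICT.getOrDefault(token.toLowerCase(), List.of())) {
            if (c.posTag().equals(posTag)) {
                return c.lemma();
            }
        }
        return token; // unknown word: fall back to the surface form
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("stahl", "VVFIN")); // stehlen
        System.out.println(lemmatize("Stahl", "NN"));    // Stahl
    }
}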

Aside from its higher overall precision, the primary advantage of lemmatization over stemming lies in the fact that the lemma, as opposed to the stem, preserves most of a word's semantic value, is intelligible for a human, and allows for immediate dictionary look-ups. On the downside, it is clear that lemmatization is generally slower and requires more complex preprocessing to provide the necessary contextual information.

The way in which lemmatization is done is even more language-dependent than stemming. Therefore, no single algorithm is available, and implementations tailored specifically to one particular language (or closely related families of languages) are required. Developing these is in most cases considered impractical due to the expense and the need for the involvement of linguistic experts. A more viable alternative is to deploy machine learning techniques. The most commonly used approach is based on Ripple Down Rules as described by Plisson et al. [PLME08]. This will be discussed in more detail in section 4.2.2.

3.2.3. Compound Splitting

Natural languages are constantly evolving; new words are incessantly being created while others slowly become obsolete. One of the most common ways to form new words is the combination of existing ones into compounds, or polymorphemes. More formally, a compound is a word that is comprised of multiple stems. For analytical purposes it is often desirable to reverse this process in order to reduce new, unknown expressions to familiar ones. The complexity of decompounding varies from language to language. Most Germanic languages, for example, allow for the formation of ad-hoc compounds. English, on the other hand, is comparatively unproblematic because newly formed compounds are not concatenated seamlessly and are therefore trivial to dissect.

One designated element of a compound is its head. The head determines the compound's case, number and grammatical gender (if applicable). For instance, the head of the German compound Büroklammer (eng. paper clip) is Klammer (singular, feminine). The compound is thus also singular and feminine. The other constituent is not inflected for gender or number.

From a semantic point of view, compounds can be classified into four major categories:

endocentric, exocentric, copulative and appositional.

Endocentric compounds are characterized by the semantic predominance of the head. The compound's primary meaning, as defined by the head, is narrowed down by the remaining parts (its modifiers). In other words, the whole compound is a special case (or “subclass”) of its head element. For instance, the word Kellertür (eng. basement door) is a special kind of door, namely the one that leads to the cellar.

Exocentric (or possessive) compounds resemble a predicate-argument type structure. They stand in a has-a relationship with a semantic head that is not explicitly mentioned. As a result, the meaning of the compound is oftentimes not deducible from its constituents. An example of this is the adjective blue-blooded that describes people of noble descent who oftentimes had superficial veins and untanned skin, giving the impression of blue blood. The unexpressed head here is person. Exocentric compounds are often used metaphorically.

Copulative (or primary) compounds have no clearly identifiable head, in the sense that neither part carries significantly more semantic meaning than the other. The compound is in principle an enumeration of several independent elements, and the meaning of the whole is usually not directly connected with its individual parts. This type of compound is rarely found in German or English. A contrived example of this class of compounds is the word for the German state Schleswig-Holstein.

Appositional compounds are made of two equipollent but often contradictory parts (e.g. singer-songwriter, African-American). The compound is a hyponym of each constituent and thus inherits all their individual meanings.

Decomposition of endocentric, copulative and appositional compounds generally yields sensible results because each of the individual parts contributes to the meaning of the whole, albeit to varying degrees. Exocentric compounds, on the other hand, are problematic because the semantics of the entire word is not the sum of its parts. It is apparent that it is impossible to automatically determine to which category a given compound belongs. However, exocentric compounds are typically idiomatic phrases and not generated ad-hoc in everyday speech. They can thus usually be found in dictionaries and do not need to be decompounded.

Traditionally, compound splitting was used in the context of automated hyphenation, i.e. its primary purpose was to identify the most suitable points at which a word could be broken over two lines. A compound that is hyphenated into its semantic constituents is considerably easier to read and understand than other syntactically correct segmentations (consider gutter-ball and gut-terball). In other words, decompounding can be understood as a specialized form of hyphenation with the added restriction that all constituents are genuine words themselves.

The question whether a given word is a compound, and if so, whether a valid segmentation exists, is generally a difficult one. Moreover, if multiple decompositions can be found it is often not clear which one is correct, even to a human reader with knowledge of the context in which it is being used. The word Druckerzeugnis can be split into Druck·erzeugnis (eng. print·product) as well as the slightly unorthodox Drucker·zeugnis (eng. printer·certificate), both of which are valid and in the same general domain of meaning. More commonly, erroneous decomposition leads to syntactically sound but nonsensical results, e.g. See·lachse (eng. sea·salmon) and Seel·achse (eng. soul·axis). If there is a chance of ambiguity, it is often better not to decompound in order to avoid semantic distortion (conservative decompounding). A more in-depth analysis of this problem with regard to sense-conveying hyphenation can be found in [BN85].

One could assume that it is advisable to split compounds whenever an unambiguous decomposition is possible. However, compounds that are already semantically relevant should not be dissected as there is only little (if any) information to be gained. In reality, the decomposition of this type of compound will more often than not only have a negative effect on the analysis. Consider the germanized Anglicism Teenager that already has a distinct innate meaning. The correct decomposition is Teen·ager and yields little additional information. However, forcibly dissecting it in a German context results in the nonsensical segmentation Tee·nager (eng. tea·rodent) which retains none of the original meaning.

For these reasons, decompounding is a technique that can be very prone to semantic errors if applied too liberally. If done cautiously, however, it can be a useful tool for establishing semantic information for previously unknown words.
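A conservative splitting policy can be sketched as follows: a word is only split if it is not itself a dictionary entry and exactly one segmentation into dictionary words exists. The dictionary contents and class names below are illustrative assumptions, not part of our actual system:

import java.util.List;
import java.util.Set;

// Sketch of conservative decompounding: try every split position and
// keep a split only if both parts are dictionary words and the split
// is unambiguous. Dictionary contents are illustrative assumptions.
public class ToyDecompounder {

    private static final Set<String> DICT = Set.of(
            "keller", "tür", "büro", "klammer", "teenager");

    public static List<String> split(String word) {
        String w = word.toLowerCase();
        if (DICT.contains(w)) {
            return List.of(word); // already a known word: do not split
        }
        List<String> best = List.of(word); // default: leave intact
        int found = 0;
        for (int i = 2; i <= w.length() - 2; i++) {
            String head = w.substring(0, i);
            String tail = w.substring(i);
            if (DICT.contains(head) && DICT.contains(tail)) {
                best = List.of(head, tail);
                found++;
            }
        }
        // Conservative rule: with zero or multiple valid splits, do nothing.
        return (found == 1) ? best : List.of(word);
    }

    public static void main(String[] args) {
        System.out.println(split("Kellertür")); // [keller, tür]
        System.out.println(split("Teenager"));  // [Teenager] -- dictionary word,
        // so the nonsensical split Tee·nager is never attempted
    }
}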

3.3. Stop Words

Every language contains stop words or noise words, i.e. words that occur too frequently to be of any specificity or do not possess a significant innate meaning. Put differently, stop words are words that can be left out with little to no impact on the overall informational value of the text.

Whether a given word constitutes noise depends on the area of application and cannot be defined universally. For instance, in the medical domain, the term patient is highly common and could be considered a stop word, while in other domains it might be important. Articles, conjunctions, pronouns, pre- and postpositions are typically considered irrelevant regardless of the domain.
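In practice, stop word removal is a simple set-membership filter over the token stream. A minimal sketch with a tiny, illustrative German stop word list (real lists contain a few hundred entries and may be extended with domain-specific terms):

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Minimal stop word filter; the word list is an illustrative sample.
public class StopWordFilter {

    private static final Set<String> STOP_WORDS = Set.of(
            "der", "die", "das", "und", "in", "mit", "für");

    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("Der", "Grund", "für", "sein", "Verhalten")));
        // -> [Grund, sein, Verhalten]
    }
}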

3.4. Part-of-Speech Tagging

Each word of a sentence is implicitly assigned a lexical category or part-of-speech (pos). This is not to be confused with a word's grammatical function such as subject or object. A pos-tagger is a program that, given a tokenized sentence, infers each token's lexical class and tags it accordingly. The most basic categories in the English language are nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions and interjections. For practical purposes it is often helpful to use a more fine-grained categorization like the STTS tag set (see appendix A). Because natural languages are not context-free, the decision to which class a given word belongs can normally not be made without analyzing the sentence in its entirety. For instance, drink can be both a noun (“I need a drink”) and a verb (“I need to drink”). At worst, even a sentence as a whole can be ambiguous (“Fruit flies like a banana”).

Using the commonly adopted notation of 〈word〉/〈pos〉, the output of a pos-tagger using the Penn Treebank tag set as described in [San90] could look something like this:

There/EX ’s/VBZ no/DT place/NN like/IN home/NN ./SENT

Here, EX stands for existential-there, VBZ for verb-3rd-person, DT for determiner, NN for normal-noun, IN for preposition and SENT for end-of-sentence.
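The 〈word〉/〈pos〉 notation is easy to process programmatically; a small sketch that parses such output into (token, tag) pairs (class and type names are illustrative):

import java.util.ArrayList;
import java.util.List;

// Parses <word>/<pos> output into (token, tag) pairs. Splitting on the
// last slash keeps tokens that themselves contain a slash intact.
public class TaggedTextParser {

    record TaggedToken(String token, String tag) {}

    public static List<TaggedToken> parse(String line) {
        List<TaggedToken> result = new ArrayList<>();
        for (String chunk : line.trim().split("\\s+")) {
            int slash = chunk.lastIndexOf('/');
            result.add(new TaggedToken(
                    chunk.substring(0, slash),
                    chunk.substring(slash + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        parse("There/EX 's/VBZ no/DT place/NN like/IN home/NN ./SENT")
                .forEach(t -> System.out.println(t.token() + " -> " + t.tag()));
    }
}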

A multitude of different techniques to approach this problem have been developed. A more in-depth discussion of the methods outlined below can be found in [WM06].

Rule-based Purely rule-based taggers as first proposed by Klein and Simmons [KS63] and Greene and Rubin [GR71] are based on a two-phase algorithm. In the first phase, each word is assigned a list of possible pos-tags based on a dictionary. In the second phase, these lists are narrowed down by successively applying a predefined set of rules until each list cannot be shortened any further (ideally to a size of one). More recent implementations have refined this approach [RV95].

Rule-based systems do not require a training corpus, but they are very complex, require linguistic experts to implement and are not transferable to other languages.

Stochastic Stochastic taggers are based on the observation that some pos-combinations are more likely to occur than others (e.g. a verb followed by a noun is less probable than an article followed by a noun). These probabilities are derived from analyzing pre-annotated text corpora such as the Brown Corpus[4], the Negra Corpus[5] or the British National Corpus[6].

Transformation-based learning Transformation-based taggers as suggested in [Bri92] combine rule-based and stochastic means. The key element here is that the tagger adopts new rules during its learning phase and can resort to morphological rules when it encounters new words not found in its internal dictionary. It has been shown in [RS95] that the Brill tagger is up to ten times faster than purely stochastic approaches.

Decision Trees Stochastic taggers typically assign an epsilon probability to any pos-sequence that did not occur in their training corpus. This is because they cannot distinguish between grammatically impossible sequences and sequences that are merely missing from the training data by chance. As a result, they usually require large training corpora to perform adequately. To alleviate this problem, taggers based on binary decision trees were developed [Sch94].

It was shown that taggers can reliably achieve accuracies of 93% and higher [Der89], [Cha97] and can perform as well as 97.5% for German [Sch95]. This seemingly high percentage still implies an average error about every 40 words, and this is under “perfect” conditions where the test and training texts come from the same corpus. It appears, however, that progress is asymptotically reaching a limit; there has been only very little progress in the past 10 to 15 years [Gie08]. In real-world applications where training and actual data vary greatly, the number of expected errors is typically higher.

3.5. Named Entity Recognition

In Information Extraction it is often necessary to identify names of locations, persons or organizations that occur within a text (this is often extended to include temporal and numeric expressions as well). This process is called Named Entity Recognition (NER). Generally, NER systems annotate the text by assigning tags to individual (or a set of consecutive) tokens. The typical output of such a system looks like this:

〈Arthur Dent〉TYPE=PERSON worked for the 〈UN〉TYPE=ORGANIZATION since 〈2007〉TYPE=DATE and visited the 〈Czech Republic〉TYPE=LOCATION frequently.

This example follows the ENAMEX[7] tag set presented at the Message Understanding Conference MUC-6. Others have since been introduced, like the more fine-grained BBN[8] or Sekine[9] tag sets which contain approximately 90 and 200 types and subtypes, respectively. Most implementations are based on Maximum Entropy [CN03] or Hidden Markov Models [KSNM03]. In any case, they usually require additional resources like dictionaries and gazetteers (geographical directories) to work effectively.

Although NER systems have achieved f-measure scores of up to 93.39% in the Message Understanding Conference MUC-7, it was concluded by Poibeau and Kosseim [PK00] that performance depends heavily on the adaptation to the characteristics of a given corpus, including domain-specific grammar and the availability of suitable gazetteers. Because of this, Named Entity Recognition is difficult to adopt in applications like ours that are not domain-restricted. Furthermore, our work is predominantly based on the analysis of synsets which inherently contain semantic information exceeding that of most NER-tags. A token that is recognized as a named entity but has no synset associated with it cannot be processed further by our system. However, it can still be of use for other purposes such as cross-referencing specific named entities across multiple texts.

3.6. Coreference Resolution

Within a text there are often many distinct terms that refer to the same logical entity, their referent or antecedent. The process of identifying these words and grouping them unambiguously into equivalence classes is called coreference resolution (CR).

To illustrate, consider the following excerpt:

Obama flew to Berlin where he talked with Merkel about it. After exchanging pleasantries with the German chancellor, the President returned home.

To a human reader with the necessary common knowledge it is obvious that {Obama, he, President} and {chancellor, Merkel} form equivalence classes; their elements corefer to the same entity.

It is to be distinguished between endophoric and exophoric references. Endophora have an intralinguistic antecedent, i.e. they refer to an entity that is explicitly mentioned in the text and can, by a human, easily be identified, even without much contextual knowledge. In the sentence above, he is such an endophoric term. By contrast, exophora refer to entities that are textually undisclosed. In other words, their referent cannot be deduced from the text alone. In the aforementioned example, it is an exophora that cannot be resolved.

Depending on their positioning relative to their referent, references can further be classified as either anaphoric or cataphoric. An anaphora is a backward reference to a previously mentioned entity (“Chris sat down, he was tired”). Conversely, a cataphora is a forward reference that is resolved at a later time (“He was tired, so Chris sat down”).

Even in situations that can be considered grammatically simple it is oftentimes difficult, if not impossible, to identify an endophora's referent with certainty:

(1) The baby played with the cat. It meowed.

(2) The baby played with the cat. It giggled.

(3) The baby played with the cat. It was cute.

In the first case, it clearly refers to the cat. In the second case, the referent is most likely the baby, although one cannot be entirely sure. In the last case, it is not at all evident whether it is a reference to the baby, the cat, or the act of them playing together.

If references are strung together they can form coreference chains of theoretically arbitrary length. Considering the fact that such a chain can consist of proper and common nouns (that do not even need to be synonymous), pronouns, substantive adjectives and many others, it becomes clear that coreference resolution is a task of high complexity and ambiguity.

Extensive research has been done in this field with seminal contributions by Lappin and Leass [LL94], McCarthy and Lehnert [ML95] and Soon et al. [SNL01], among others. Today, statistical methods yield the best results [GHC98]. They commonly infer the most probable referent by considering semantic, syntactic and lexical features that are largely orthogonal to each other, such as gender, part-of-speech, animacy, number, grammatical function and many others. The minimum-edit-distance feature proposed by Strube et al. [SRM02] seems very promising as it matches our area of application in both average text length and domain independence.

The practical use of CR algorithms, however, poses a considerable challenge. Research implementations typically assume the availability of the required set of features, and given that assumption, they perform reasonably well. However, in real-world applications, these features are commonly not readily available and their acquisition requires very complex and computationally expensive preprocessing. For instance, the grammatical function feature depends on a deep grammatical analysis which is of substantial difficulty for morphologically rich languages like German (see [Pav05] for a brief discussion). In addition, most approaches are either language dependent, limited to a specific subset of the problem (e.g. pronouns) or are specific to a certain domain and do not generalize well. In the past, research has focused mainly on the English language. For this and other reasons discussed in more detail in section 6.1, significant restrictions are imposed on the applicability of available solutions and CR in general for this work.

3.7. Ontologies

An ontology is an “explicit specification of a conceptualization” [GG93], or less formally, it is a model of a domain that describes what objects and concepts exist, the properties they possess and how they are related to each other. In real-world applications, the complexity of ontologies grows rapidly with the size of the domain they cover. Consequently, they are usually limited to very specific areas of expertise. Typical examples are the SIOC Core[10] ontology for describing information from online communities on the Semantic Web, the Open Biomedical Ontologies Foundry[11] or the Gene Ontology[12] for gene product properties.

3.7.1. Structural Description

The structure of lexical semantics can be interpreted as an ontology: sets of synonymous lexemes form concepts (or synsets)² which are related through lexical relations like hypernymy, meronymy or antonymy. Such a lexical ontology is called a word net. The predominant implementation and de-facto standard of a word net for the English language is WordNet[13] from the University of Princeton. The term “WordNet” has been widely adopted in literature and is being used interchangeably with “word net”.

Hypernymy/Hyponymy The terms hypernymy and hyponymy describe a generalization and specialization relationship between two nouns. The word panda is a hyponym of animal and conversely, animal is a hypernym of panda.

Troponymy A verb that describes more precisely the manner of doing something by substituting another verb of more generalized meaning is called a troponym (e.g. trample is a troponym of walk).

Holonymy/Meronymy Holonymy and meronymy describe the two opposing views on a part-whole relation. For instance, branch is a meronym of tree, and forest is a holonym of tree.

Antonymy Two concepts with opposite meanings are antonyms of each other, so high is an antonym of low and vice versa.

Entailment A verb that necessarily implies another is said to entail it. The word snore entails the concept of sleep. Troponymy is a special case of entailment, the difference being that troponyms are always temporally coextensive³.

Causation If one action is the logical cause of another, a causation between the two exists. The pair show and see forms such a relationship. This differs from an entailment in that the verbs of a causation relation have different referents: in “Peter shows Lois the map” the referent is Peter, but in “Lois sees it” it is Lois. This kind of connection is comparatively rare in Western languages.

²These are not synonyms in a strictly linguistic sense. A synset is rather to be understood as a set of words that can be used interchangeably in most, but not necessarily all contexts.

³Consider jump → hover and trample → walk, where → denotes entailment. At any given point in time where someone tramples, he also walks, but there is only a very brief moment during a jump where a person hovers. Entailment can even be “temporally reversed”, as in the case of succeed → try.

Figure 3.1.: An exemplary ontology centered around the word eagle

The precise meanings of these relations are not universally agreed upon and they oftentimes differ subtly. A more in-depth discussion with respect to WordNet can be found in [MBF+90].

Figure 3.1 shows part of an exemplary word net originating from the word eagle and three of its most common meanings (the military airplane, the bird, and the golf term). Note that only a subset of possible relations is shown. In particular, holonymy and hypernymy relations are omitted as they are defined implicitly by the existence of meronymy and hyponymy relations, respectively. For practical reasons, an artificial root node entity (sometimes written as ⊤) is commonly introduced.

In addition to these relations, a word net often contains further annotations such as a gloss (a brief and concise description or definition of a word's meaning) or a short list of sentences to exemplify the use of that word.

3.7.2. Formalization

In order to work with a word net algorithmically, it is necessary to formalize its structure and properties. The nature of an ontology, namely it being a set of nodes linked by a number of interconnections, suggests its interpretation as a graph. Given the set S of all synsets and a finite set R so that

R = {r₁, r₂, …, rₙ}  with  rᵢ ⊆ S × S,

a word net forms an edge-labeled multigraph G

G := (S, E)  with  E := ⋃_{r ∈ R} r

In practice, G is commonly directed but can also include undirected edges. The set R typically consists of (but is not limited to) the relations described in the previous section. Unless a synset is taken to be synonymous to itself, G is commonly loop-free.

Depending on the implementation, different elements of R possess different properties. For example, antonymy is a symmetric relation whereas meronymy clearly is not. Likewise, entailment is transitive but meronymy generally is not: a hand is part of a goalkeeper, who in turn is part of a team, but a hand is hardly part of a team. This is because “meronymy” is actually a summation of semantically related but logically different relationships, such as part-of, member-of and made-of.

For subgraphs G_R′ of G defined by

G_R′ := (S, E′)  with  R′ ⊆ R  and  E′ := ⋃_{r ∈ R′} r

special properties can be inferred by examining the properties of the relations included in R′. The subgraph G_{holonymy}, for instance, is directed and acyclic; G_{antonymy} is undirected and 1-regular. This of course depends on the concrete implementation, in this case whether antonyms are modeled by one undirected or by two directed, but opposite edges. In the latter case, it is directed and 2-regular. While the existence of ⊤ guarantees G's connectedness, G_R′ generally is disconnected.

The acyclic graph

G_sub := G_{hyponymy, hypernymy}  with  ⊤ ∈ S

is called the subsumption hierarchy. Because a synset can be the hyponym of multiple other concepts (comparable to multiple inheritance), G_sub is not a tree. For instance, manager is a hyponym of both organism and causal agent.

This formalization of a word net allows for the definition of various metrics. Of particular interest is the measure of semantic relatedness, or its inverse, the semantic distance. This is based on the idea that each edge represents a semantic connection between two concepts. The shorter the distance between two nodes, the more closely they are related. It is to be noted that a high degree of semantic relatedness does not necessarily imply a strong similarity. Antonyms, for instance, are adjacent nodes but (by definition) conceptual opposites. A measure for semantic similarity can commonly be obtained by restricting a measure for relatedness to the subsumption hierarchy.

3.7.3. Measuring Semantic Relatedness

This section briefly introduces three common measures for the semantic distance between synsets in an ontology. A more in-depth discussion can be found in [BH06]. It is assumed that the word net graph is restricted to its subsumption hierarchy G_sub.

Definition: lso
Let G_R′ be an acyclic, directed word net graph and C ⊆ S. The lowest superordinate (also most specific common subsumer or join) of C is then defined as the supremum of C with respect to the partial order induced on S by R′:

lso: P(S) → S,  lso(C) := sup(C)

It is guaranteed that lso(C) ∈ S if C ≠ ∅ and ⊤ ∈ S.

Definition: distance
Let G_R′ be a word net graph and c₁, c₂ ∈ S arbitrary. The distance of c₁ and c₂ is then defined by dist: S × S → ℕ₀ as the number of edges on the shortest path from c₁ to c₂. This path always exists if ⊤ ∈ S.

Definition: depth
Let G_R′ be a word net graph, ⊤ ∈ S and c ∈ S arbitrary. The depth of c is then defined by its distance from the root of the graph:

depth(c) := dist(c, ⊤)
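Both dist and depth can be computed with a plain breadth-first search over the graph. The following sketch treats synsets as strings and assumes an adjacency map; all names are illustrative and do not correspond to any word net API:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of dist and depth on a word net graph via breadth-first search.
// Synsets are plain strings; the adjacency map is an assumption.
public class WordNetMetrics {

    static final String ROOT = "entity"; // the artificial root ⊤

    // Shortest-path distance (number of edges) between two synsets.
    static int dist(Map<String, List<String>> graph, String from, String to) {
        Map<String, Integer> seen = new HashMap<>(Map.of(from, 0));
        Deque<String> queue = new ArrayDeque<>(List.of(from));
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (node.equals(to)) return seen.get(node);
            for (String next : graph.getOrDefault(node, List.of())) {
                if (!seen.containsKey(next)) {
                    seen.put(next, seen.get(node) + 1);
                    queue.add(next);
                }
            }
        }
        return Integer.MAX_VALUE; // unreachable (cannot happen if ⊤ ∈ S)
    }

    // depth(c) := dist(c, ⊤)
    static int depth(Map<String, List<String>> graph, String c) {
        return dist(graph, c, ROOT);
    }
}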

Wu-Palmer The Wu-Palmer similarity measure was proposed in [PW95]. The numerator is used as a scaling factor.

sim_WP(c₁, c₂) = 2 × depth(lso(c₁, c₂)) / ( dist(c₁, lso(c₁, c₂)) + dist(c₂, lso(c₁, c₂)) + 2 × depth(lso(c₁, c₂)) )

Leacock-Chodorow Leacock and Chodorow proposed the following formula in [LCM98]:

sim_LC(c₁, c₂) = −log( dist(c₁, c₂) / (2 × max_{c ∈ S} depth(c)) )

Lin Unlike Wu-Palmer and Leacock-Chodorow, Lin's universal similarity measure [Lin98] is based on probabilities derived from word frequencies measured in a training corpus.

sim_L(c₁, c₂) = 2 × log p(lso(c₁, c₂)) / ( log p(c₁) + log p(c₂) )

While Lin's measure is commonly presumed to give better results than Wu-Palmer and Leacock-Chodorow, its dependency on word frequencies in a corpus limits its applicability for our work. We have thus opted for the largely domain-independent Leacock-Chodorow similarity whenever such a measure was needed.
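Given the dist and depth helpers sketched earlier, the Leacock-Chodorow measure reduces to a one-line formula; the example below uses the natural logarithm and invented example values, and the names are assumptions rather than any library's API:

// Sketch of the Leacock-Chodorow similarity on top of precomputed
// dist and depth values. maxDepth is the depth of the deepest synset.
public class LeacockChodorow {

    // sim_LC(c1, c2) = -log( dist(c1, c2) / (2 * maxDepth) )
    static double similarity(int dist, int maxDepth) {
        if (dist == 0) dist = 1; // identical synsets: avoid log(0)
        return -Math.log((double) dist / (2.0 * maxDepth));
    }

    public static void main(String[] args) {
        // e.g. two synsets four edges apart in a hierarchy of depth 18
        System.out.println(similarity(4, 18)); // ~ 2.197
    }
}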

3.8. Word Sense Disambiguation

In written language, homographs pose a serious challenge to NLP applications since an erroneous interpretation can have a significant impact on the overall semantics. Word Sense Disambiguation (WSD) is the process of distinguishing a word's intended meaning from other lexicographically possible, but semantically less likely meanings. As with decompounding and coreference resolution, even humans are oftentimes not able to identify the intended meaning of a homograph with absolute certainty [MM01].

In many cases, the knowledge of a word's lexical class can be sufficient to infer its concrete meaning:

Why did the dog bark at the tree?

In this case, bark is a verb and clearly refers to the sound a dog makes and not the part of a plant, albeit the occurrence of the word tree might suggest as much.

Generally, however, disambiguation requires the analysis of the (immediate) context in which a word occurs. The disambiguation is then done by cross-referencing words that appear in the context with words that have a high statistical probability of occurring in its proximity. For instance, if the word spring is found in the proximity of {river, fish}, it is likely to refer to a water source and not to the mechanical device or the first season of the year.

These probabilities are commonly derived from sense-annotated corpora such as the DSO corpus[14] or SemCor[15]. The first implementations, however, were based on using a word's dictionary definition to compare shared vocabulary (Lesk [Les86]). This was later complemented by the inclusion of WordNet (Banerjee and Pedersen [BP02]) and has led to purely ontology-based strategies (Li et al. [LSM95], Fragos et al. [FMS03]). An interesting alternative was proposed by Mihalcea [Mih07], suggesting the use of Wikipedia[16] as a sense-annotated corpus.
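The core of such overlap-based approaches can be sketched in a few lines: each sense of an ambiguous word is associated with a signature of typical co-occurring words, and the sense with the largest overlap with the actual context wins. The sense names and signatures below are invented for illustration:

import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified Lesk-style disambiguation: pick the sense whose signature
// overlaps most with the context words. Signatures are illustrative.
public class ToyWsd {

    private static final Map<String, Set<String>> SENSES = Map.of(
            "spring/water_source", Set.of("river", "fish", "water", "well"),
            "spring/season",       Set.of("summer", "winter", "march", "bloom"),
            "spring/device",       Set.of("coil", "metal", "mattress", "tension"));

    public static String disambiguate(List<String> context) {
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, Set<String>> sense : SENSES.entrySet()) {
            int overlap = 0;
            for (String word : context) {
                if (sense.getValue().contains(word.toLowerCase())) overlap++;
            }
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = sense.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(disambiguate(List.of("river", "fish")));
        // -> spring/water_source
    }
}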

3.9. Frame Semantics

The techniques discussed so far operated locally and did not yet interrelate tokens with one another. However, in order to understand natural language on a higher level, it is necessary to establish semantic connections between different parts of a sentence. This process is called shallow semantic parsing⁴. One approach to this is to examine verbs as the central element of expressions in natural language. In linguistics, verbs can be classified by their valence, i.e. the number of “arguments” they take [Vol00].

⁴“Shallow” in the sense that it tries to capture only certain semantic aspects. Complete understanding of a text requires deep semantic parsing which is substantially more complex.

avalent A verb that takes no argument or a pleonastic pronoun (“dummy argument”): It snows.

monovalent A verb that takes one argument which is typically the subject: Sally cried.

divalent A verb that takes two arguments, usually the subject and one object: She pushed him.

trivalent A verb that takes three arguments (subject and two objects): Brian bought Lois some flowers.

tetravalent A verb that takes four arguments: Brian bought Lois some flowers for her birthday.

The valence of a verb can vary depending on its semantics (as in the last two cases in the examples above). A verb generally requires all its arguments in a well-formed sentence, although sometimes an argument can be omitted: “I married” instead of “I married her”. This is called valency reduction. Conversely, if an additional argument is added to a verb, valency expansion takes place. For instance, in colloquial English, “He scared me” can be expanded to “He scared the bejeezus out of me”.

Each argument assumes a certain semantic role. The process of meeting another person (as defined by the verb to meet) could be described by the following “formula”:

to meet 〈someone〉_object at 〈somewhere〉_location for 〈something〉_purpose

This construct is called a case frame and was first introduced in Charles Fillmore's Case Grammar framework [Fil68]. This idea was later developed into Frame Semantics [Fil82] and implemented in the FrameNet project[17]. Figure 3.2 illustrates schematically how the elements of a sentence can be mapped onto a semantic frame. Dotted boxes denote optional arguments.

Frames are not isolated entities. They are themselves structured and complexly interconnected through a variety of relationships. For instance, the frame eclipse is a very specialized subclass of the frame hiding_objects. Frame relations can be visualized using the FrameGrapher[18] website. Figure 3.3 shows an excerpt of FrameNet centered around the transfer of goods.

Figure 3.2.: Application of a semantic frame

At the time of writing, the FrameNet project contains over 800 frames exemplified in more than 135,000 annotated sentences. Localized adaptations are available for Spanish[19], Japanese[20] and German[21]. The process of mapping parts of a sentence to the appropriate roles of a semantic frame is called semantic role labeling. Several algorithms for this purpose have been proposed [GJ02], [SM04].

Figure 3.3.: Relations between semantic frames

For our application, we decided not to employ semantic frames for various reasons. First, the German variant of FrameNet is significantly smaller than its archetype and consequently covers far fewer concepts. Second, frame semantics are primarily aimed at aiding in Machine Translation and Information Extraction for strongly structured text corpora such as newspaper articles or scientific papers. As a result, existing systems generally do not cope well with informal language. And finally, there was no suitable implementation available (see section 4.5.1).

Chapter 4. Assessment of Existing Libraries and Toolkits

The field of Natural Language Processing appeals to a wide range of people, both professionals and hobby programmers. This has led to a multitude of available tools, platforms and frameworks in a variety of programming languages and for a variety of natural languages. The high level of diversity, complexity and ambiguity of natural language oftentimes causes software with congruent feature sets and similar theoretical functioning to perform very differently under real-world conditions. It is this very reason that necessitates the evaluation of each software solution on real-world data.

We have assessed a number of libraries for various facets of NLP. The evaluation focused on performance, ease of integration, licensing aspects and applicability to our problem domain. Whenever possible, we evaluated the tools against our test corpus to attain results close to the real application for which they are intended. The test corpus was manually annotated with part-of-speech tags, word senses and lemmas. It is discussed in detail in the context of the overall system evaluation in chapter 6. The link to the website of each evaluated tool can be found through the endnotes in the respective sections.

In this chapter, a distinction will often be made between a word and a relevant word. A relevant word is a word that can be found in GermaNet and has, as such, at least one synset assigned to it. So in this context, “relevant” means that a word carries a semantic value which is accessible to our system through the ontology.

4.1. Part-of-Speech Taggers

We evaluated OpenNLP in version 1.4.2, the tagger of the Stanford NLP group in version 1.6 and the TreeTagger in version 3.2. All three taggers combine the steps of tokenization and part-of-speech tagging and use the STTS tag set[22] (see appendix A).

In theory, sharing a common tag set allows the taggers to be used interchangeably. However, in practice, all implementations have particularities that need to be taken into account. For instance, some taggers assign individual tags to certain punctuation marks such as colons or quotation marks, while others do not and leave them with the word that precedes them. In the latter case some minor postprocessing is necessary.

4.1.1. OpenNLP

OpenNLP[23] is a Java-based open source project licensed under the GNU Lesser General Public License. It houses a variety of NLP tools, most notably tokenizers and part-of-speech taggers for English, German, Spanish and Thai as well as NER- and CR-models for English only. Most tools are based on the maximum entropy model described in [BPP96] and [Rat98] which is also available as a standalone tool.

The sentence boundary detection of OpenNLP for German did show some weakness during our evaluation. It occasionally “missed” the end of a sentence and incorrectly concatenated consecutive sentences. It is not clear what causes this behavior, especially considering that OpenNLP correctly handles all common edge cases, i.e. periods that are part of a decimal number, an abbreviation or an ellipsis. Oddly, the incorrect splitting did not impact the performance of the pos-tagging. In any case, manually splitting sentences at the end-of-sentence tags resolved this issue without further difficulties.

4.1.2. Stanford NLP

The Stanford Natural Language Processing Group[24] has implemented a Java version of the maximum entropy-based log-linear part-of-speech tagger described in [TM00] and [TKMS03]. It is licensed under the GNU General Public License and comes with fully trained tagger models for various languages, including German. For lack of a distinct name, we will refer to this implementation as Stanford.

Sentence detection was almost flawless except for a few cases where a period as part of an abbreviation caused an incorrect sentence split. The part-of-speech tagging itself was generally solid but did occasionally show some weaknesses. For instance, opening and closing quotation marks are consistently tagged as CARD (number) and VVFIN (finite verb), respectively.

During our evaluation we encountered two critical issues. First, the internal tokenizer inexplicably trips over hyphenated compounds if the second element contains an umlaut. For instance, the word Hollywood-Rückkehr (eng. the return to Hollywood) is tokenized and tagged as Hollywood-R/NN ückkehr/ADJD, yielding unusable results.

Second, the performance degraded drastically under certain conditions. The following issues were identified as causes:

- Extended sequences of consecutive nouns, e.g. “United States Senator Thomas Andrew Daschle always supported Obama.”

- Ellipses, particularly within a sentence, e.g. “But then... it got worse!”

- Dashes and quotation marks, e.g. “He said ‘No problem - I swear!’ and ran away.”

On several occasions, the performance degraded for no apparent reason, leading to erratic variation in throughput. The resulting problems were significant, as can be seen in table 4.1. In some extreme cases the throughput dropped from several hundred to less than one word per second.

4.1.3. TreeTagger

The TreeTagger[25] is a closed-source, language-independent part-of-speech tagger and lemmatizer. It is written in C and released under a proprietary license that allows free use for evaluation and research purposes. TreeTagger has been successfully used to tag a wide variety of languages (German, English, Spanish, Dutch, Russian and Chinese, among others) and was shown to achieve an accuracy of over 95% on the Penn Treebank corpus [Sch94]. Binaries are available for Linux, Microsoft Windows, Sun Solaris and MacOS.

TreeTagger had one noteworthy particularity unfavorable for our application: occasionally verbs or prepositions are mistagged as nouns or named entities. For instance, the sentence “Der Grund für sein Verhalten (...)” (eng. “The reason for his behavior (...)”) is tagged as

Der/ART Grund/NN für/NN sein/PPOSAT Verhalten/NN (...)

where the preposition für (eng. for) is tagged as a noun. Aside from this, it showed no significant shortcomings in our evaluation.

Remarks

We evaluated the speed of each tagger as well as its accuracy for sentence detection (a_sd) and tagging (a_pos). On our test corpus, the accuracy of the sentence detection a_sd was extremely high for all taggers¹. It was only one very specific edge case, involving an extended enumeration of proper nouns and abbreviations, that caused problems for all implementations. As discussed above, the Stanford tagger showed serious performance problems in terms of speed. Its comparatively poor accuracy a_pos is primarily due to the nature of our test corpus, which had an above-average percentage of cases with which the pre-trained model could not cope, namely a lot of parentheses and quotation marks.

Tagger     | total runtime | speed          | a_sd  | a_pos
Stanford   | 23.70 min     | 0.1 texts/sec  | 99.9% | 93.2%
OpenNLP    | 0.27 min      | 9.3 texts/sec  | 99.8% | 95.8%
TreeTagger | 0.19 min      | 13.2 texts/sec | 99.9% | 95.6%

Table 4.1.: Performance of evaluated pos-taggers

The TreeTagger was the fastest of all taggers, in all likelihood because it is a native C application and was being run from the command line and not through a Java wrapper. The OpenNLP implementation was fast and achieved a slightly higher accuracy.

In conclusion, it can be said that all three taggers had issues of varying gravity. We opted for the OpenNLP library because it performs consistently well, is liberally licensed and can be used natively from Java.

¹This comparison was done using the mentioned workaround for sentence splitting for OpenNLP.

4.2. Lemmatizers

German is a complex and inflectionally rich language that is difficult to lemmatize. Partially as a consequence of this, relatively few lemmatizers are available in comparison to English. We have evaluated Morphy 1.1, LemmaGen 2.0 and LemServer 1.02. All three lemmatizers are stand-alone applications that take a single word as input and return a list of possible lemmas.

4.2.1. Morphy

Morphy[26] is a freely available closed-source tool for morphological analysis, described in detail in [Lez98]. While it runs under Microsoft Windows only, the internal dictionary can be exported as an SQL dump and used independently. The exported data is then licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

A notable disadvantage of Morphy is that its dictionary has, for the most part, not been adapted to the German orthography reform of 1996. The development of Morphy ceased in late 1999, and it was included primarily for comparative reasons.

Morphy maps approximately 430,000 inflectional forms to 90,000 distinct lemmas. It contains no additional information such as part-of-speech tags or probabilities of the occurrences of the various inflected forms.

4.2.2. LemmaGen

LemmaGen[27] is a C++ implementation of the lemmatization algorithm based on Ripple Down Rules as described in [MJ07]. It is published under the GNU Lesser General Public License. The lemmatization rules used by LemmaGen are acquired during a learning phase and stored in a parameter file. Pre-trained parameter files are available for 14 different languages, including German and English.

A Ripple Down Rule system is similar to a decision tree in that each node n has two sons n1 and n2 that represent an IF n THEN n1 EXCEPT n2 type structure. This rule tree is built during a learning phase where new rules are added and existing rules are subdivided and refined with exceptions as more training data comes in.

To illustrate, consider training LemmaGen with the following set of pairs of inflected forms and their lemmas:

(laughed→laugh, looked→look, smiled→smile, agreed→agree, steed→steed)

After processing the first two elements, the system would deduce that simply removing the suffix ed is sufficient for lemmatization:

if hasSfx('ed') then removeSfx('ed')

The next two elements violate this rule because the lemmas of smiled and agreed are not smil and agre, respectively. Hence, two exceptions are added to the rule:

if hasSfx('ed') then removeSfx('ed') except
    if hasSfx('led') then removeSfx('d')
    if hasSfx('eed') then removeSfx('d')

This set of rules would correctly lemmatize the so far unknown word freed, but it would still fail for the final word steed and lemmatize it to stee. To cover this case, an “exception to the exception” is needed:

if hasSfx('ed') then removeSfx('ed') except
    if hasSfx('led') then removeSfx('d')
    if hasSfx('eed') then removeSfx('d') except
        if hasSfx('teed')

The above example nicely illustrates the hierarchical structure of the rules and how new rules “ripple down” through it. In practice, the tree is not deeper than eight to ten levels, depending on the language, and the lemmatizer is thus extremely fast. However, it is clear that it does not generalize well to previously unknown words due to the lack of grammatical rules and heuristics. In the above example, the previously unknown word bleed would incorrectly be lemmatized to blee. LemmaGen can display its set of rules; figure 4.1 shows an excerpt of its rule set for German. It is often interesting to retrace how a specific result was inferred. For instance, the German rule set contains an odd exception that lemmatizes the phrase en to Einkaufspark. This is most likely due to faulty training data.
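The IF-THEN-EXCEPT evaluation itself is compact; the following Java sketch mirrors the laughed/steed rule tree from the example above. It is a toy re-implementation of the idea, not LemmaGen's actual data structures:

// Sketch of a Ripple Down Rule node with IF-THEN-EXCEPT semantics:
// a matching exception overrides its parent rule. Rules mirror the
// laughed/steed example above and are illustrative only.
public class RdrNode {

    final String suffix;       // condition: word ends with this suffix
    final int strip;           // action: number of characters to remove
    final RdrNode[] exceptions;

    RdrNode(String suffix, int strip, RdrNode... exceptions) {
        this.suffix = suffix;
        this.strip = strip;
        this.exceptions = exceptions;
    }

    String apply(String word) {
        if (!word.endsWith(suffix)) return null; // rule does not fire
        for (RdrNode ex : exceptions) {
            String result = ex.apply(word);
            if (result != null) return result;   // exception takes precedence
        }
        return word.substring(0, word.length() - strip);
    }

    public static void main(String[] args) {
        RdrNode rules = new RdrNode("ed", 2,
                new RdrNode("led", 1),
                new RdrNode("eed", 1, new RdrNode("teed", 0)));
        System.out.println(rules.apply("laughed")); // laugh
        System.out.println(rules.apply("smiled"));  // smile
        System.out.println(rules.apply("steed"));   // steed
    }
}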

|---> RULE:( suffix("ser") transform("er"-->"") except(8) ); {:

| |---> RULE:( suffix("eser") transform(""-->"") );

| |---> RULE:( suffix("iser") transform("r"-->"") except(4) ); {:

| | |---> RULE:( suffix("aiser") transform(""-->"") );

| | |---> RULE:( suffix("Reiser") transform("er"-->"") );

| | |---> RULE:( suffix("riser") transform(""-->"") );

| | ‘---> RULE:( suffix("ziser") transform("er"-->"") ); :}

| |

| |---> RULE:( suffix("sser") transform(""-->"") except(2) ); {:

| | |---> RULE:( suffix("besser") transform("besser"-->"gut") );

| | ‘---> RULE:( suffix("össer") transform("össer"-->"oß") ); :}

| |

| |---> RULE:( suffix("äuser") transform("äuser"-->"aus") );

| |---> RULE:( suffix("äser") transform("äser"-->"as") );

| ‘---> RULE:( suffix("öser") transform("er"-->"") except(2) ); {:

| ‘---> RULE:( suffix("böser") transform("r"-->"") ); :}

| :}

Figure 4.1.: Excerpt of LemmaGen's rule set for German

As the algorithm outlined above indicates, LemmaGen performs purely lexical lemmatization and does not take potentially available additional information into account, such as part-of-speech tags. It does, however, return the statistically most likely lemma if there were multiple possibilities during the learning phase. On the other hand, this approach is language independent and can be applied to a large variety of languages.

4.2.3. LemServer

LemServer is a part of the RuPosTagger[28] NLP application and is published under the GNU General Public License. It is implemented in C++ and runs on any POSIX-compliant operating system, including Microsoft Windows with Cygwin[29]. A Java wrapper for accessing it via XML-RPC is available and our evaluation has shown that it performs reasonably fast.

LemServer employs a hybrid approach by utilizing both a set of grammatical heuristics and a dictionary of over 220,000 words. It returns a list of all possible lemmas with their associated lexical class. For instance, for the word Liegen two lemmas are returned: liegen/VER (eng. lie/VERB) and Liege/SUB (eng. couch/NOUN). This result can then be compared with the part-of-speech tag assigned to the original word by the tagger in order to choose the lemma most likely to be correct. This requires a tag mapping from LemServer's proprietary tag set to the STTS tag set.
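The selection step can be sketched as follows; note that the tag mapping shown here is deliberately oversimplified (a real mapping is one-to-many) and the Candidate type and all names are assumptions, not LemServer's actual API:

import java.util.List;
import java.util.Map;

// Sketch of matching lemma candidates against the STTS tag assigned by
// the pos-tagger. The tag mapping and types are illustrative only.
public class LemmaSelector {

    record Candidate(String lemma, String lemServerTag) {}

    // Hypothetical, simplified mapping to STTS classes.
    private static final Map<String, String> TAG_TO_STTS = Map.of(
            "VER", "VVFIN",
            "SUB", "NN");

    static String select(List<Candidate> candidates, String sttsTag) {
        for (Candidate c : candidates) {
            if (sttsTag.equals(TAG_TO_STTS.get(c.lemServerTag()))) {
                return c.lemma();
            }
        }
        return candidates.get(0).lemma(); // fall back to the first candidate
    }

    public static void main(String[] args) {
        List<Candidate> liegen = List.of(
                new Candidate("liegen", "VER"),
                new Candidate("Liege", "SUB"));
        System.out.println(select(liegen, "VVFIN")); // liegen
        System.out.println(select(liegen, "NN"));    // Liege
    }
}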

Figure 4.2.: LemServer architecture

Remarks

The results are split into two categories, the global accuracy a (percentage of correctly lemmatized words) and the accuracy a_r of correctly lemmatized relevant words. This distinction is significant because many lemmatization errors occur with uncommon words for which the lemmatizers were not trained. However, most of these words are not part of the word net we used, in which case the wrong lemma does not matter in our particular case. Our analysis has shown that LemServer achieves the best results and outperforms its competitors by a significant margin. This is not surprising considering that neither LemmaGen nor Morphy utilizes part-of-speech tags.

Lemmatizer | a     | a_r   | runtime | words/sec
LemmaGen   | 86.2% | 89.3% | 1.3s    | 12307
Morphy     | 79.5% | 93.9% | 23.2s   | 689
LemServer  | 94.5% | 97.5% | 6.8s    | 2353

Table 4.2.: Performance of evaluated lemmatizers

The total runtime listed in table 4.2 should not be taken at face value: LemmaGen is a command-line application, LemServer was accessed via XML-RPC and Morphy required a large number of database queries. In any case, both LemmaGen and LemServer were fast enough for practical use. The performance of Morphy could be increased by transforming the data into a format that can be queried locally without resorting to a database.

The decision for LemServer was clear-cut; it had the highest accuracy, was easy to integrate and its dictionary and parameter files are continuously being updated[30]. Its speed could be improved by implementing a native wrapper and accessing it directly via the Java Native Interface (JNI).

4.3. Ontologies

A lexical ontology is the central source of semantic meaning in our application. For this reason, the quality and density of the word net was crucial. While there are a variety of ontologies for specialized fields such as the Standard Thesaurus Wirtschaft[31] for economics, there are, to our knowledge, only two that attempt to cover the entire German language: GermaNet and OpenThesaurus.

4.3.1. GermaNet

GermaNet[32] is an ontology developed and maintained by the University of Tübingen. It is available free of charge for academic users, but a licensing agreement has to be signed.

Being very similar in concept and structure to WordNet, it contains more than 80,000

distinct lexical units grouped in approximately 58,000 synsets. These synsets are inter-

connected with roughly 80,000 conceptual and lexical relations. GermaNet models the

following relations: hypernymy/hyponymy, meronymy/holonymy, entailment/entailed,

causation/caused and association. It is distributed in XML format; APIs allowing access to all relevant features are available for various programming languages.

4.3.2. OpenThesaurus

OpenThesaurus[33] is an open source word net that is being used as a thesaurus for

OpenOffice[34] and KWord[35]. It contains about 64,000 entries assigned to 26,000 synsets

that are connected only with hyponymy/hypernymy-relationships. The data is available

from the official website as a daily updated SQL dump. However, it suffers from some

inconsistencies in terms of character encodings and formatting.

It is clear that the intended purpose of OpenThesaurus is not to be a complete and richly interrelated ontology, but, as the name indicates, to be a thesaurus. It is

thus limited to a subsumption hierarchy which greatly restricts its use for any other

purpose. In particular, measures for semantic relatedness, as described in section 3.7.3,

do not work well if essential relations like antonymy and holonymy/meronymy are not

available.


Remarks

For this evaluation, we focused on the coverage c and relational density r of the nouns

and verbs in our test corpus. A noun or verb is considered “covered” by the ontology if it

is associated with at least one synset. The relational density was computed by averaging

the number of inter-synset relations of all synsets the noun or verb is a part of.

Ontology        c_nouns   r_nouns   c_verbs   r_verbs
GermaNet        85.6%     10.05     96.3%     4.13
OpenThesaurus   79.0%      4.87     95.2%     1.21

Table 4.3.: Coverage and relational density of evaluated ontologies

As table 4.3 shows, GermaNet clearly outperforms OpenThesaurus in our comparison. It

has a better coverage and a significantly higher relational density for both nouns and

verbs. Consequently, for applications where no domain-specific ontology is available,

GermaNet is the only viable choice.

4.4. Frameworks

At the time of writing there were two major Java frameworks for NLP applications:

GATE and Apache UIMA. We evaluated both in order to decide whether our appli-

cation could benefit from using a framework or if it could even be implemented as a

plug-in.

4.4.1. GATE

GATE[36] (General Architecture for Text Engineering) is a Java framework for NLP

applications developed by the NLP Group of the University of Sheffield. It employs a

component-based architecture where individual components are combined into a pro-

cessing pipeline (often simply called an application). The input is sent through the

pipeline, each component receiving the output of its predecessor. This naturally implies

a strictly sequential processing which can in some cases be suboptimal [MS05]. GATE

is licensed under the GNU Lesser General Public License.


A GATE component is either a Language Resource, a Processing Resource or a Visual

Resource. Language Resources are centered around data, i.e. they provide an inter-

face to query external sources like ontologies, dictionaries or text corpora. By contrast,

Processing Resources encapsulate algorithmic components such as pos-taggers or coref-

erence resolvers. Visual Resources can extend GATE's GUI application and provide an interface for viewing, editing and visualizing various aspects of the processing chain.

The set of all available plugins is called CREOLE (Collection of Reusable Objects for

Language Engineering). A wide variety of commercial and free plugins are available,

ranging from stemmers to components that build text corpora on-the-fly from Google

queries.

4.4.2. UIMA

The Unstructured Information Management Architecture UIMA[37] is an OASIS[38]-

approved standard for the analysis of unstructured data and knowledge discovery and

was originally developed by IBM. Apache UIMA2 is an open source implementation

of this specification licensed under the Apache License v2.0.

The structure of a UIMA application is in principle similar to that of GATE in that a pro-

cessing pipeline is built from separate, reusable components called Analysis Engines.

Analysis Engines can be written in Java or C++ as well as Perl, Python and TCL

(through SWIG[39]). They are packaged and redistributed as a single Processing Engine

Archive (PEAR) file. Figure 4.3 gives a schematic overview of the various aspects that

are a part of the Apache UIMA project. In contrast to GATE, which is intended to

process exclusively textual input, UIMA was designed as a unified framework for trans-

forming any kind of unstructured data into structured data. For instance, it could be

employed in an application that extracts geographical information from video streams

and cross-references them with current news feeds, all within the same framework. The

processing of texts in natural language is only one possible area of application.

2The terms “UIMA” and “Apache UIMA” are commonly used synonymously and no distinction is made between the specification and its, at this time, only implementation.


Figure 4.3.: The Apache UIMA project (Source: project website)

Remarks

The decision for or against the employment of a framework commonly depends on

whether the gain in flexibility and abstraction is more important than the increased

overhead entailed by it. The architectural infrastructure provided by such frameworks

naturally means a higher overall complexity and a potential performance loss.

Both GATE and Apache UIMA are mature and powerful frameworks. The tradeoff outlined above was to be weighed against the constraints discussed at the beginning of this chapter, primarily ease of integration and low overall coupling. We decided

against the use of a major framework as they offered no clearly identifiable benefit to us

but would have introduced many new dependencies and additional layers.

4.5. Miscellaneous

The following two tools that we evaluated did not fit into the previous categories: Shalmaneser, a toolkit for assigning semantic role labels, and JWordSplitter, a library for decompounding German nouns.


4.5.1. Shalmaneser

Shalmaneser[40] is a shallow semantic parser for assigning roles to frame elements using

the SALSA[41] corpus (see section 3.9). It is implemented in Ruby and licensed under

the GNU General Public License. A detailed description can be found in [EP06a] and

[EP06b].

There are several practical issues with Shalmaneser. First, it seems to be no longer actively maintained, the latest version dating back to early 2007. Second, it

runs, despite being written in Ruby, exclusively on Linux. In addition, it has various

dependencies including a MySQL database and Amit Dubey’s Sleepy parser which is

no longer publicly available.

For these reasons, Shalmaneser was not a viable option for integration into our system.

However, for comparative reasons, we decided to examine the SALSA corpus and how it might be applicable to this work. An analysis of the verbs modeled in the corpus has shown that it was only able to cover approximately 27% of all verbs that occurred

in our test corpus and was thus sparse at best. The semantic frames defined in the

SALSA corpus are based exclusively on formal newspaper articles. It is questionable

if any program trained on that corpus could successfully be used on descriptions from

EPGs.

4.5.2. JWordSplitter

JWordSplitter[42] by Daniel Naber is a small open source Java library for splitting German compounds. The input is matched against an internal word list

to identify its constituents. It does take epenthesis3 into account. Despite being an

algorithmically simplistic approach, it has improved our results.

In our test corpus, approximately 45% of all compounds were successfully split. The

other 55% consisted of compounds where at least one part was not considered relevant.

This was either because the word itself was unknown or because a known word was

inflected in such a way that it could not be correctly lemmatized, e.g. Ess·störung (eng.

eating disorder) where Ess is a shortened form of Essen (eng. food) which does not occur

in this inflective form on its own.

3An epenthesis is the addition of a sound when forming compounds. Consider compounding Verwendung and Zweck to Verwendungszweck, which requires the addition of an s due to the implicit declension of Verwendung to the genitive.


Chapter 5.

Implementation

The goal of this work is to develop a complete system for analyzing program descriptions

written in natural language. Through a series of processing steps, the input is analyzed

and transformed into a sequence of tokens which are augmented with semantic informa-

tion. At the end of the process stands a structured summary of all identified entities,

concepts, and keywords. That information could then be used to compare TV shows and

movies on a semantic level or to improve existing recommender systems, among other

things. The extracted concepts can be utilized to aid the user interactively in finding

programs similar in content and setting.

Figure 5.1.: The Sliver processing pipeline

The software is called Sliver and is a pure Java implementation with an emphasis on

a simple architecture and a high degree of modularity and extensibility. Our implemen-

tation was designed to be incorporated into a larger software system, but it can also be


used as a stand-alone library with few external dependencies and minimal configuration

required. This chapter will illustrate various aspects of Sliver’s implementation and

describe the inner workings of our application in detail. Figure 5.1 shows a schematic

of Sliver's analysis pipeline and gives a broad overview of the individual processing

steps and their subcomponents.

The purpose of our system is to analyze text in natural language and extract terms and

phrases descriptive for it. This process can be broken down into three distinct phases:

• Basic Tokenization

The first phase is the basic tokenization of the textual input. This includes some

general preprocessing of the text and modifications specific to the used tokenizer

and part-of-speech tagger.

• Semantic Analysis

During this phase, the tokens are analyzed and augmented with semantic informa-

tion. This is the central phase and consists of several steps that process various

aspects of the tokens.

• Tuple Generation

Based on the information from the previous two phases, the system now tries to

identify and weigh the most descriptive tokens and phrases (token tuples).

Each phase can be subdivided into several smaller steps. In the following sections, each

of these steps will be discussed in more detail.

5.1. Preprocessing

This step subsumes all actions necessary to process an input document. This includes

the following steps:

• Reading the input data

Depending on the data source, this means parsing XML files, connecting to a database or any other means of acquiring the data. In any case, the end result

must be a single string.


• Replacing special characters

GermaNet uses underscores to concatenate multi-word expressions. For instance,

the phrase log out has to be queried as log_out. For this reason, the text should not

contain this character. However, natural language text rarely contains underscores,

so in practice, this poses no restriction on the format of the input.

• Performing tokenizer-specific modifications

While most tokenizers require little preprocessing (if any), some do. Stanford, for instance, had massive problems processing URLs. Issues like these

should be addressed during this step.

For our test data, preprocessing was minimal. In a real-world application, this can be

more complex and require any number of steps commonly involved in processing textual

data, such as handling different character encodings or removing markup.
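As a minimal sketch, a preprocessing step along these lines might suffice for simple inputs; the markup-stripping expression and the class name are illustrative, and a production system would use a proper parser for its input format.

    public class Preprocessor {

        /** Normalizes raw input to a single plain-text string. */
        public static String preprocess(String raw) {
            return raw
                .replaceAll("<[^>]+>", " ")  // strip markup remnants, if any
                .replace('_', ' ')           // '_' is reserved for GermaNet multi-word queries
                .replaceAll("\\s+", " ")     // collapse whitespace and line breaks
                .trim();
        }
    }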

5.2. Tokenization and PoS-Tagging

After preprocessing is complete, the tokenizer and part-of-speech tagger parse the text

and generates a sequence of BasicTokens. We will use the Java-like notation of t.pos,

t.lemma and so forth to refer to the annotations of a given token t.

All part-of-speech taggers that we evaluated combine tokenization and pos-tagging in

either one single or two strongly linked steps. As discussed in section 4.1.1, some postpro-

cessing is required for OpenNLP to account for its sometimes faulty sentence detection.

The sentence splitting returned by OpenNLP is reexamined and adjusted as needed.

This step is essentially a conversion from the tagger-specific format to our own represen-

tation. This implies a conversion to the STTS tag set, should the part-of-speech tagger

use a different one.

5.3. Semantic Analysis

During the semantic analysis, the BasicTokens are analyzed and augmented with addi-

tional information. This phase consists of five steps: lemmatization, compound splitting,

integration of GermaNet, word sense disambiguation and semantic chunking.


5.3.1. Lemmatization

The value of each token is sent to the lemmatizer and the lemma is assigned to the token.

LemServer returns multiple values if the lemma cannot be determined unambiguously.

The part-of-speech tag of the token is then used to infer the lemma most likely to be

correct. If the ambiguity cannot be resolved this way, the first lemma is used as a

fallback. LemServer always returns a lemma; if it does not recognize the input, it

returns it unaltered.
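The selection logic can be sketched as follows. The candidate format, the class name and the tag mapping (reduced to three illustrative entries) are simplifications of the actual implementation.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LemmaSelector {

        // Hypothetical mapping from LemServer's tag set to coarse STTS prefixes.
        private static final Map<String, String> TAG_MAP = new HashMap<String, String>();
        static {
            TAG_MAP.put("VER", "V");    // verbs: VVFIN, VVINF, VAFIN, ...
            TAG_MAP.put("SUB", "N");    // nouns: NN, NE
            TAG_MAP.put("ADJ", "ADJ");  // adjectives: ADJA, ADJD
        }

        /**
         * Picks the candidate whose lexical class agrees with the STTS tag
         * assigned by the tagger; candidates use the format "lemma/CLASS",
         * e.g. "liegen/VER".
         */
        public static String selectLemma(List<String> candidates, String sttsTag) {
            for (String candidate : candidates) {
                String[] parts = candidate.split("/");
                String sttsPrefix = TAG_MAP.get(parts[1]);
                if (sttsPrefix != null && sttsTag.startsWith(sttsPrefix)) {
                    return parts[0];
                }
            }
            return candidates.get(0).split("/")[0];   // fallback: first lemma
        }
    }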

5.3.2. Compound Splitting

As outlined in section 3.2.3, splitting words into their constituents is in practice a rather

difficult task. Besides the problems already discussed, further problems arise when

part-of-speech taggers make mistakes, for instance by labeling a verb or an adjective

as a noun. When that happens, the decompounding algorithm oftentimes produces

nonsensical results that affect the following processing steps negatively. For this reason,

we employ a very conservative approach. A word w is only decompounded if all of the

following conditions are met:

(1) w is tagged as a noun

(2) w is capitalized

(3) w itself is not relevant

(4) all lemmatized constituents of w are relevant

Constraints one and two ensure that only nouns are decompounded. The fraction of

compound verbs or adjectives is for practical purposes negligible. Constraint three keeps nouns for which a more specific synset is already known from being decompounded. The

final constraint acts as a safeguard against incorrect decomposition. If all constituents

are genuine nouns, the compound was most likely endocentric and decomposition is likely

to yield sensible results. While these constraints naturally only permit comparatively

few decompoundings, the total error is also kept to a minimum. This will be discussed

further in section 6.3.2.
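The four conditions translate into a guard like the following sketch, in which the splitter, word net and lemmatizer interfaces are simplified placeholders rather than our actual component APIs.

    import java.util.List;

    public class CompoundGuard {

        // Simplified placeholder interfaces.
        interface Splitter   { List<String> split(String word); }
        interface Ontology   { boolean isRelevant(String lemma); }
        interface Lemmatizer { String lemmatize(String word); }

        /** Checks the four conditions from section 5.3.2 before decompounding. */
        public static boolean shouldDecompound(String value, String posTag,
                String lemma, Splitter splitter, Ontology net, Lemmatizer lem) {
            if (!"NN".equals(posTag)) return false;                     // (1) tagged as a noun
            if (!Character.isUpperCase(value.charAt(0))) return false;  // (2) capitalized
            if (net.isRelevant(lemma)) return false;                    // (3) w itself not relevant
            List<String> parts = splitter.split(value);                 // e.g. via JWordSplitter
            if (parts.size() < 2) return false;                         // nothing was split
            for (String part : parts) {
                if (!net.isRelevant(lem.lemmatize(part))) return false; // (4) all constituents relevant
            }
            return true;
        }
    }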


5.3.3. Ontology Integration

Linking a token to an entity in an ontology is a key element of establishing semantic

information. In a word net, a token is assigned to one or more synsets, depending on

how ambiguous it is. The large number of interconnections between synsets provides a rich amount of semantic information that can be harnessed. We will now illustrate how

GermaNet was integrated into our system and what issues arose in doing so.

During this discussion, the terms meaning, sense and synset will be used synonymously,

even though they differ subtly in theory. A meaning (or sense) commonly refers to

something concrete, although possibly abstract. The term addition, for example, has an

intangible but concise meaning. GermaNet, on the other hand, contains synsets whose

only purpose is to complete the subsumption hierarchy. These “artificial synsets” have

no representation in the real language. For instance, neither English nor German has

words that mean to move in a specific direction (as a hypernym of rise and plummet)

or unspecified movement (as a hypernym of arrive and stop).

Nouns, verbs and adjectives are covered extensively by GermaNet. Other word classes, such as prepositions, conjunctions or interjections, are not part of GermaNet as

they carry no intrinsic meaning. In that sense, they are noise or stop words. It could

be argued that the semantics of negations should not be ignored. To a human, it may

matter whether the kitten was killed or the kitten was not killed. However, for the

purpose of information extraction this distinction is basically irrelevant. Any concept is

innately related to its negation, so the differentiation between the two is of little overall

importance.

GermaNet can be queried directly through a Java API. Given a word’s lemma, a list

of all synsets that the word is part of is returned. It is important to realize that the

number of returned synsets is often much higher than one would intuitively suspect.

This is because GermaNet contains a large number of polysemes (i.e. closely related

but distinctly modeled word senses). For instance, the verb gehen (eng. to go) is a part

of 15 different synsets that sometimes differ only slightly in meaning. It is not always

apparent, even to a human, which meaning is most appropriate in the context from which the word originated. In most cases, several meanings can be considered adequate.
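The following sketch shows such a lookup; the package name, constructor and data path follow recent releases of the official GermaNet Java API and may differ from the version used in this work.

    import java.io.File;
    import java.util.List;
    import de.tuebingen.uni.sfs.germanet.api.GermaNet;
    import de.tuebingen.uni.sfs.germanet.api.Synset;

    public class SynsetLookup {
        public static void main(String[] args) throws Exception {
            // The path to the GermaNet XML distribution is a placeholder.
            GermaNet gnet = new GermaNet(new File("/path/to/germanet/xml"));
            List<Synset> senses = gnet.getSynsets("gehen");
            System.out.println("gehen: " + senses.size() + " senses"); // 15 in our version
        }
    }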

The more the synsets of a word are scattered throughout an ontology, the less semantically related they are; the word's meaning is then fuzzy. Ideally, a word should be completely unambiguous (i.e. have only one synset); its meaning would then be concise. The

next section discusses how the fuzziness of a word’s meaning can be reduced through

sense disambiguation.


5.3.4. Word Sense Disambiguation

As described in section 3.8, algorithms for word sense disambiguation work by comparing

the current context of a word with a “reference context”. This implicitly requires the

existence of such a context, which is commonly given through one of the following:

• Gloss

If a word has a gloss, the words in that gloss can be used as a reference context.

The gloss can be taken from a word net or from an external source, such as a

dictionary. In the latter case, a mapping from the word net synsets to dictionary

entries is required.

• Example Usage

If a list of example uses of a word is available, this can be used as a context. These

examples can either be manually annotated with the precise sense or come from

a corpus whose general domain of meaning is already known, for instance medical

journals.

• Domain Annotation

A domain annotation specifies the general domain of meaning to which a synset

belongs, for instance branch_business and branch_biology. There can still be multiple synsets per annotation, as in bank_business (the financial institution) and bank_business (the building). This allows the quick clustering of words and disambiguation by finding the cluster to which the word in question is closest1.

If no suitable context is available, an alternative is to use the frequency score of a word sense, either by itself as a heuristic or in combination with other methods. The frequency of a sense is an indication of its statistical likelihood to occur. For instance, the word eye

is much more commonly used in the sense of organ than it is in the sense of the eye of

a needle.

The problem is that GermaNet, unlike WordNet, contains neither sense annotations nor frequency scores, and only a minuscule fraction of all entries have a gloss. To overcome

these limitations, we devised an algorithm for word sense disambiguation that operates

exclusively on lexical relations and is thus completely domain-independent. Needless to

1In our case, domain annotations could prove to be particularly useful if they can be mapped onto the genre information that is available for most entries of the EPG.


say, an algorithm with less information at its disposal will necessarily be less accurate

than an algorithm that can utilize more data. This is partially offset by the fact that

for our application, it is not required (or even possible) to identify the sense of a word

uniquely, i.e. reduce the list of its synsets to the size of one. The goal is rather to remove

those synsets that are clearly inadequate. For instance, for the word bank it is important

to distinguish between the “river bank” and the “money bank”. The differentiation

between the abstract financial institution and the concrete building is, in comparison,

much less important.

The algorithm has the following key elements:

• The context of a token t, written as context(t), is the set of words in its immediate

surroundings (excluding the word itself).

• A similarity measure sim for semantic relatedness (see section 3.7.3). sim(s, S′) is the average similarity between a synset s and all synsets in some set S′ ⊆ S, where S denotes the set of all synsets.

• A threshold value τ that determines the minimum similarity score for a synset to be retained, relative to sim(s, S′).

• A merge function that combines synsets that are closely related.

The algorithm is a fixpoint iteration that successively reduces ambiguities until no further

changes occur. The following steps are executed for each ambiguous token t with a set

of synsets t.senses:

- Invoke the merge function to combine senses that are closely related

- Determine the set fp of all tokens in the context of t that are unambiguous

- Calculate the semantic relatedness between each sense of t and all tokens in fp

- Discard all senses of t whose similarity to fp is below the threshold τ

This is repeated until no more senses can be discarded. Figure 5.2 is a more formal

description of this algorithm. The termination of the algorithm is trivial to prove.

Exact parameters for this algorithm and a performance evaluation will be discussed in

more detail in section 6.3.2.


repeat
    foreach t in {ti ∈ tokens | card(ti.senses) > 1} begin
        merge_similar(t.senses);
        fp := {ti ∈ context(t) | card(ti.senses) = 1};
        τ := calculate_threshold(t.senses, fp);
        t.senses := {s ∈ t.senses | sim(s, fp) ≥ τ};
    end;
until not changed;

Figure 5.2.: Algorithm for word sense disambiguation
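For illustration, a Java rendering of figure 5.2 might look like the following sketch. The merge function, the context computation, the similarity measure and the threshold are left abstract, and senses are simplified to plain synset identifiers.

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Set;

    public abstract class SenseDisambiguator {

        static class Token {
            final Set<String> senses = new HashSet<String>(); // synset ids, simplified
        }

        // Left abstract: these depend on the ontology and the chosen measure.
        abstract void mergeSimilar(Set<String> senses);
        abstract List<Token> context(List<Token> all, Token t);
        abstract double sim(String sense, Set<String> fp);
        abstract double threshold(Set<String> senses, Set<String> fp);

        /** Fixpoint iteration of figure 5.2: discard senses until nothing changes. */
        void disambiguate(List<Token> tokens) {
            boolean changed;
            do {
                changed = false;
                for (Token t : tokens) {
                    mergeSimilar(t.senses);
                    if (t.senses.size() <= 1) continue;      // already unambiguous
                    Set<String> fp = new HashSet<String>();  // senses of unambiguous neighbors
                    for (Token c : context(tokens, t)) {
                        if (c.senses.size() == 1) fp.addAll(c.senses);
                    }
                    if (fp.isEmpty()) continue;
                    double tau = threshold(t.senses, fp);
                    for (Iterator<String> it = t.senses.iterator(); it.hasNext();) {
                        if (sim(it.next(), fp) < tau) { it.remove(); changed = true; }
                    }
                }
            } while (changed);
        }
    }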

5.3.5. Chunking

At this point in the processing chain, we have a sequence of tokens, each annotated with

its lemma, its pos-tag and a list of synsets. Such a token is called an AnnotatedToken.

The next step is to combine consecutive tokens that are intrinsically related. This

is called chunking. The intent is to infer semantic relatedness, reflect it in the data

representation and prepare the model for the following step in the processing pipeline.

We will first briefly formalize the concept of a chunk and introduce needed terminology.

Next, we will discuss three different types of chunking and their exact semantics. Finally, we will describe the chunking strategies that were implemented in our

system.

Formalization

Given a sequence T = (t1, t2, . . . , tn) of tokens, a chunk c_i^k ⊆ T is defined as the subsequence (ti, . . . , ti+k−1) of length k, with k > 1. In this work, all chunks are assumed to be mutually disjoint, i.e. given a chunking C = (c_i^k, c_j^l, . . .) of a text, c ∩ d = ∅ for all c, d ∈ C with c ≠ d. In general, ⋃_{c∈C} c ≠ T, so C is not necessarily a partition of T.

In our implementation, a chunk is itself modeled as a token2. To illustrate, consider the text represented by the sequence (t1, t2, t3, t4, t5) where each ti is an AnnotatedToken. Assuming the chunking C = {c_3^2}, the resulting token sequence is then (t1, t2, c_3^2, t5), or (t1, t2, (t3, t4), t5).

2A direct consequence of this approach is that chunks can be nested as deeply as needed and implicitly form a hierarchy. This will be discussed in more detail later on.


A core issue is how the aggregation of tokens is handled with respect to their synsets.

We will be dealing with synsets, sets of synsets, and sets thereof. Confusion can arise

from the fact that a synset is only a set insofar as it consists of multiple lexemes,

but these lexemes are irrelevant here. In this context, a synset is considered atomic.

To avoid confusion, we will in this section primarily be using the term sense instead of

synset. The symbol S denotes the set of all synsets contained in GermaNet.

When a sequence T of tokens is to be combined into a chunk c, a unification function φ : P̂(S) → P(S) is required3. This function maps the multiset of senses of all t ∈ T to a single set of senses. The synsets of c are then defined by

c.synsets := φ(⊎_{t∈T} t.synsets)

It is apparent that the definition of φ needs to reflect the intended semantics of the chunking and cannot be defined globally. For instance, the most intuitive functions φ∪(S) := ⋃_{s∈S} s and φ∩(S) := ⋂_{s∈S} s are only adequate if all the tokens in the chunk have similar meanings to begin with. In any other case, φ∪ will lead to a significant increase in fuzziness, and φ∩ will yield the empty set unless the words were synonyms of each other. Various techniques could be considered to counter this, for instance using

synset expansion (i.e. extending the synsets to include the hypernyms of all their elements). This would guarantee that the intersection is never empty, but would inevitably lead to very

abstract meanings close to the root of the subsumption hierarchy.

Intended Semantics

In the above formalization, we have introduced the concept of chunking, but it was left

open what its implications are. Each token in a chunk has its own value, lemma and

synsets. The question is what the synset of the chunk is and, to a lesser extent, what its value and lemma are4. We will first discuss the principal methods of how chunking can

be done, and then describe various strategies as to when tokens can be aggregated and

what the exact semantics of the aggregate is.

3P̂(M) is the “power multiset”, i.e. the multiset of all subsets of M. This is needed because the multiplicity of a synset may be relevant for a concrete implementation of φ.

4For a semantic analysis the concrete textual representation of a word is only of secondary interest.


Consider the token sequence (t1, t2, t3, t4, t5) where the chunk (t2, t3) was identified.

Chunking can then happen in three different ways.

(1) remove the tokens completely: (t1, t4, t5)

(2) merge the tokens into a single new token: (t1, tc, t4, t5)

(3) combine the tokens into a composite token: (t1, (t2, t3), t4, t5)

In the first case, a sequence of tokens is removed entirely from the system. This should

be done when the sequence has little informational value and its removal does not change

the semantics of the text. φ is trivially defined as φ(S) := ∅. Analogous to the term stop

word, this is called a stop sequence.

The second option assumes that the terms in the chunk can sensibly be represented by

a single token. For this reason, φ∪ and φ∩ can be considered adequate. In practice,

however, such a sequence rarely occurs in natural language. The most likely occurrence

is in the slightly concealed form of enumerations like “He jogs, swims and skydives”.

For this reason, it is called a uniform sequence. It is in our system represented by an

AnnotatedToken.

When a uniform sequence is transformed into an AnnotatedToken, all information of

the individual tokens is lost. In many cases, this is not desirable. The key idea of chunk-

ing a sequence of semantically heterogeneous tokens is to identify a head token that

can represent the entire chunk without significantly changing its semantics or gram-

matical function. Such a sequence is called a head sequence. It is converted into a

CompositeToken that is atomic for the purpose of algorithmic processing, but still re-

tains the original constituents5. Given a head token ti, the unification function is then

commonly defined as φ(S) := ti.synsets.
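A minimal sketch of this token model is given below; field and class names are illustrative and do not necessarily match the actual Sliver classes.

    import java.util.List;
    import java.util.Set;

    class AnnotatedToken {
        String value, lemma, pos;
        Set<String> synsets;
    }

    class CompositeToken extends AnnotatedToken {
        final List<AnnotatedToken> parts;  // the original constituents are retained
        final AnnotatedToken head;

        CompositeToken(List<AnnotatedToken> parts, AnnotatedToken head) {
            this.parts = parts;
            this.head = head;
            this.pos = head.pos;           // the chunk inherits the head's tag
            this.synsets = head.synsets;   // unification: φ(S) := head.synsets
        }
    }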

Patterns

We have defined and implemented six different chunking patterns to improve the per-

formance of our system.

• Attributive Adjective/Noun (A/N)

The part-of-speech combination adjective (ADJA) and noun (NN) form a head

sequence that can be merged. This is because in this particular combination the

5The notation of a CompositeToken consisting of the tokens t1, t2 and t3 with the head t3 is (t1, t2, t3).


adjective is in all but a few edge cases a qualifier for the noun. The resulting

CompositeToken inherits its primary meaning from the noun and can be treated

as such in subsequent steps.

For instance, the sequence (beautiful/ADJA, morning/NN) can be replaced by the

composite (beautiful∼morning/NN) with the primary meaning morning. The two

can be used interchangeably with only minimal alteration of the overall semantics.

• Adverbial Adjective/Verb (A/V)

Similarly to the adjective/noun chunking, an ADJD and a verb can be combined. This is true for all verbs, including auxiliary verbs.

• Annotation (ANN)

Descriptions of movies and series often contain annotations such as by whom a

character is portrayed, for example

But now Dirty Harry (Clint Eastwood) is ready to make your day once again.

The annotation is unwanted for two reasons. First, it is in most cases redundant

because the EPG entry usually already contains an explicit list of actors. Second,

and more importantly, these tokens needlessly add to the distance between the

tokens to the left and right of them. Why this matters will be explained in the next

section. In any event, omitting the entire annotation does not impact the semantics

of the text. Another, less common example of an annotation is a reference to a

specific year or age, such as “Harrison Ford (66) is still a great actor”. Annotations

are immediately discarded and not used for further analysis.

• Named Entities (NES)

Consecutive sequences of NE-tagged tokens are combined into one AnnotatedToken.

That composite inherits NE as its part-of-speech tag. The benefit of this merging

lies in the fact that tokens that are part of the same semantic entity are now

also represented as one element. For instance, it is desirable to treat (Neil/NE

Patrick/NE Harris/NE) as a single token because it stands for a single entity.

There are edge cases where this approach causes semantically incorrect chunking.

Consider the sentence “She told Peter Brian was home” where the named entities

Peter and Brian should not be merged. In practice, however, this style is commonly

avoided in written language and occurs only rarely.


• Phrase (PHR)

A phrase refers to a sequence of words that is recognized by GermaNet and has its

own synset. This can be either a very common named entity or a figure of speech.

Typical examples of this are New York State, Hot Dog or mentally challenged.

When merged, a phrase is replaced by an AnnotatedToken. No information is

lost because the assigned synset contains the meaning of the entire phrase. The

part-of-speech tag is inferred from the lexical class of the assigned synset6.

• Personified Profession (PPR)

If a named entity is qualified with the name of a profession, that entity is called

a personified profession. A typical example of this is Inspector Columbo or Dr.

Beverly Crusher.

The pattern is similar to that of regular named entities, except that it is preceded

by a noun. That noun needs to be a hyponym of an abstract GermaNet concept

that subsumes all professions. The head of the resulting CompositeToken is the

noun as it carries the primary semantic meaning of the composite (the named

entities commonly do not have any synsets assigned to them anyway).

Table 5.1 shows a brief summary of the different chunking patterns. The column “mean-

ing” describes the value of φ, i.e. which synset(s) are assigned to the replacement to-

ken.

Name   Pattern                     Replaced By      PoS    Meaning
A/N    ADJA, ADJA, . . ., NN       CompositeToken   NN     noun
A/V    ADJD, ADJD, . . ., verb     CompositeToken   verb   verb
ANN    ’(’, NE, . . ., NE, ’)’     nothing          -      -
NES    NE, NE, . . .               AnnotatedToken   NE     -
PPR    NN_prof, NE, . . .          CompositeToken   NN     noun
PHR    any, . . .                  AnnotatedToken   gn     gn

Table 5.1.: Overview of chunking patterns

The patterns are executed sequentially and it is obvious that the order of execution is

important. Phrase must be run first in order to identify idioms before other chunking

patterns destroy them. Personified Profession should be run before Named Entity to

avoid unnecessary nesting, and so forth.

6Each synset in GermaNet is classified as either a noun, a verb or an adjective.


One correct order of execution is thus:

1. Phrase and Annotation

2. Personified Profession

3. Named Entities

4. Adverbial Adjective/Verb and Attributive Adjective/Noun

A key characteristic of all the above modifications is that they do not change the semantics of the text except for very rare edge cases, as discussed in section 6.3.3. The grammatical structure is essentially unaltered and the

original text could be, within limits, reconstructed. Figure 5.3 illustrates how chunking

imposes a hierarchical structure on a text.
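As an example of how such a pattern can be realized, the following sketch implements the A/N chunker from table 5.1 on top of the token model sketched above; the actual implementation may differ in detail.

    import java.util.ArrayList;
    import java.util.List;

    public class AdjectiveNounChunker {

        /**
         * A/N pattern: one or more ADJA tokens directly followed by an NN are
         * replaced by a CompositeToken headed by the noun (cf. table 5.1).
         */
        public static List<AnnotatedToken> chunk(List<AnnotatedToken> tokens) {
            List<AnnotatedToken> out = new ArrayList<AnnotatedToken>();
            List<AnnotatedToken> run = new ArrayList<AnnotatedToken>(); // pending ADJA tokens
            for (AnnotatedToken t : tokens) {
                if ("ADJA".equals(t.pos)) {
                    run.add(t);
                } else if ("NN".equals(t.pos) && !run.isEmpty()) {
                    run.add(t);
                    out.add(new CompositeToken(new ArrayList<AnnotatedToken>(run), t));
                    run.clear();
                } else {
                    out.addAll(run);   // pattern broke off: keep the adjectives as-is
                    run.clear();
                    out.add(t);
                }
            }
            out.addAll(run);           // flush a trailing adjective run
            return out;
        }
    }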

Chunking bears some resemblance to fully parsing a sentence. In particular, the hierar-

chy imposed on the token structure often looks similar to a parse tree. However, there

are two differences that distinguish both techniques. First, the nesting of chunks rarely

exceeds two to three levels and is as such much more shallow than a parse tree. Second,

as mentioned in section 5.3.5, chunking is not exhaustive and may ignore some tokens.

A consequence of these differences is that chunking is generally much less error-prone

because it can avoid ambiguous states.


Figure 5.3.: Example of semantic chunking


5.4. Extraction Of Descriptive Terms

So far we have not addressed the question of how the information that we extracted from the text will be structured. Chunking, as described above, often leads to nested tokens in the form of CompositeTokens, which carry rich semantics but are cumbersome to handle.

An easier and more natural way to represent descriptive terms is to resort to simple

pairs of tokens. Such a pair is also called a concept. It is therefore desirable to flatten

the hierarchical structure of composite tokens and transform it into a set of token pairs.

The concept of a head token makes this task almost trivial. Consider the CompositeToken

(beautiful, French, poetry). It can be expanded by pairing the head with each of its qualifiers; we thus get the two tuples (beautiful, poetry) and (French, poetry).
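A sketch of this expansion, again based on the token model outlined in the previous section, might look as follows; the Pair class is an illustrative stand-in for our actual concept representation.

    import java.util.ArrayList;
    import java.util.List;

    public class TupleExpander {

        /** A concept: an ordered pair of tokens, e.g. (beautiful, poetry). */
        public static class Pair {
            public final AnnotatedToken qualifier;
            public final AnnotatedToken head;
            public Pair(AnnotatedToken qualifier, AnnotatedToken head) {
                this.qualifier = qualifier;
                this.head = head;
            }
        }

        /** Flattens a composite by pairing its head with each of its qualifiers. */
        public static List<Pair> expand(CompositeToken c) {
            List<Pair> pairs = new ArrayList<Pair>();
            for (AnnotatedToken part : c.parts) {
                if (part != c.head) {
                    pairs.add(new Pair(part, c.head));
                }
            }
            return pairs;
        }
    }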

5.4.1. Tuple Generation

The expansion of composites alone is not sufficient to cover the better part of a text’s

characteristics. There are other pairs of word classes that are likely to carry semantic

information besides adverb/verb and adjective/noun. For instance, verb/noun pairs are

very descriptive. Oftentimes just a few of these tuples are sufficient to sketch a course of

events. Consider the phrases (break, toy), (slap, sister) and (run, mother). Immediately a

multitude of associations is invoked and only a little imagination is needed to picture what

the whole story could be.

Unlike the sequences examined in the previous section, finding these kinds of token pairs

is very difficult. The word order of a German sentence is extremely flexible, especially

in comparison to English. Consider the following three sentences:

(1) Er geht ein Bier trinken.

(2) Bier zu trinken war sein Hobby.

(3) Trinken wollte er eigentlich nur ein paar Bier.

In all three cases, the essential notion is (trinken, Bier), but the positioning of those words

varies greatly, both in absolute terms as well as relative to each other. Actually finding

this concept is challenging. The only reliable way to do so is to parse the sentence using

a complete grammar for natural language in order to identify grammatical relations. For

real-world applications, this is not a feasible option, if it can even be done at all.


As an alternative to a complex linguistic analysis, we propose a heuristic based on the

assumption that certain part-of-speech combinations are more likely to carry meaningful

information than others. The basic idea of this approach is outlined in the algorithm

in figure 5.4.

    P := ∅;
    foreach s in S, t1 in s begin
        foreach t2 in context(t1) begin
            if pass_filter(t1, t2) then P := P ∪ {(t1, t2)};
        end
    end
    result := rank_pairs(P);

Figure 5.4.: Basic algorithm for heuristic tuple generation

The algorithm essentially calculates the Cartesian product of the tokens of each sentence with itself and then assigns a score to each tuple that reflects the

likelihood of that particular tuple being descriptive for the text. Generating the full

Cartesian product is for practical purposes not desirable and, as will be discussed in the

following section, generally not needed.

5.4.2. Search-Space Reduction

It is not necessary to fully analyze all possible token combinations. A large number of

candidates can be discarded directly if certain conditions are met. We used the following

three exclusion criteria:

pos-combinations The combination of many part-of-speech tags is never relevant, for instance if one of the tags is $, (a punctuation tag). Other combinations are only sensible under very specific and statistically unlikely circumstances and should be ignored. The differentiation is done using a weight matrix that reflects the probability of a pos-combination being of interest.

element irrelevancy For the resulting tuple to be semantically meaningful, it is necessary that both of its elements are themselves meaningful, i.e. that they have a non-empty list of synsets. This automatically reduces the search-space significantly

because GermaNet only covers adjectives, adverbs, nouns and verbs.


element distance In order to further narrow down the number of candidates, the tuple generation is limited to pairing a word only with those in its immediate context instead of all words in the sentence. The context is thus defined as the set of tokens in the

direct neighborhood of the current token. We have used two different measures

when defining a word’s context: the lexicographic distance and the pos-weighted

distance. The former measures the distance in equidistant steps, i.e. each token

has a distance value of one. The latter uses a variable distance value depending

on the pos-tag of the token.

Figure 5.5.: Context of the word Lust with a search radius of 3.0 using lexicographic(top) and pos-weighted distance values (bottom)

Figure 5.5 illustrates how this affects the search radius when computing the context

of the word Lust. If a lexicographic distance is used, the verb hatte is not part of the

context and the combination Lust haben (eng. to feel like doing something) is thus not

included in the search-space. Using the pos-weighted distance extends the radius far

enough. For the sake of clarity, the sentence contains no nested tokens, but the concept

can trivially be applied to include those as well. How the distance values for each tag

were determined and how they improve the tuple generation of our application will be

discussed in section 6.4.
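The following sketch illustrates how such a context could be computed; with all weights equal to 1.0 it degenerates to the lexicographic distance. The class name, the method signature and the default weight are illustrative.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class ContextBuilder {

        /**
         * Collects the tokens within the given search radius around position i.
         * Each token contributes a pos-dependent step width; unknown tags
         * default to 1.0, which corresponds to the lexicographic case.
         */
        public static List<AnnotatedToken> context(List<AnnotatedToken> sentence,
                int i, double radius, Map<String, Double> weights) {
            List<AnnotatedToken> ctx = new ArrayList<AnnotatedToken>();
            double dist = 0.0;
            for (int j = i + 1; j < sentence.size() && dist < radius; j++) {  // to the right
                dist += step(sentence.get(j).pos, weights);
                if (dist <= radius) ctx.add(sentence.get(j));
            }
            dist = 0.0;
            for (int j = i - 1; j >= 0 && dist < radius; j--) {               // to the left
                dist += step(sentence.get(j).pos, weights);
                if (dist <= radius) ctx.add(sentence.get(j));
            }
            return ctx;
        }

        private static double step(String pos, Map<String, Double> weights) {
            Double w = weights.get(pos);
            return (w != null) ? w.doubleValue() : 1.0;
        }
    }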

5.4.3. Scoring

Having generated the set of candidate tuples, the next step is to assign a score to each

candidate using a scoring function σ. The tuples with the highest score are assumed to

be most likely to be descriptive for the text.

Given a tuple t, the value of σ(t) should reflect the general probability of t being descrip-

tive, its conciseness and possibly the concrete values of its constituents. Consider the

part-of-speech combination (CARD, NN) that is generally of little interest, except if the

noun is Celsius or Fahrenheit. In that particular case, the tuple denotes a temperature


and could be considered important. Likewise, concise tuples should score higher than

tuples with very fuzzy meanings. The tuple (Fall, gehen), for instance, has a total of

4 × 15 = 60 possible meanings. Word sense disambiguation mitigates this problem to

an extent but cannot solve it entirely.

Actual implementations of σ can access a wide variety of information, including all se-

mantic annotations and lexical information of the tokens. The following is an incomplete

list of possible parameters:

- the part-of-speech tags

- the lexicographical or pos-weighted distance between the tokens

- the number of synsets

- properties of the synsets (e.g. penalty for very broad terms)

- the positioning of the tokens within the text

Once all tuples are scored, a subset of them is selected and passed on to the final step. This can be done by selecting the top n tuples, by discarding all tuples below a threshold τ, or by a combination of both. Section 6.4 will discuss in detail what

parameters turned out to be most significant and what results different implementations

of σ yielded.
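One possible implementation of σ is sketched below; the formula and the weight lookup are illustrative choices only, not the exact parameters evaluated in chapter 6.

    import java.util.Map;

    public class TupleScorer {

        /**
         * A simple σ: a weight for the pos-combination, damped by the distance
         * of the tokens and by the fuzziness (number of senses) of both
         * elements, so that concise, close pairs score highest.
         */
        public static double score(AnnotatedToken t1, AnnotatedToken t2,
                double distance, Map<String, Double> posWeights) {
            Double w = posWeights.get(t1.pos + "/" + t2.pos);
            if (w == null || w.doubleValue() == 0.0) {
                return 0.0;                      // this combination is never relevant
            }
            // Both tokens passed the irrelevancy filter, so their synset lists
            // are non-empty and the distance is at least 1.
            double fuzziness = t1.synsets.size() * t2.synsets.size();
            return w.doubleValue() / (distance * Math.sqrt(fuzziness));
        }
    }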

5.5. Result Set

The final step in the processing pipeline is the compilation of the actual result set. The

result consists of a list of concepts and a set of keywords. The list of concepts is built by

expanding composite tokens (as described in the beginning of this section on page 65)

and selecting the best candidate tuples. Keywords are subdivided into locations, named

entities and people7. While the explicit listing of named entities provides little additional

value compared to a simple full text search, it can be used to efficiently generate indexes for cross-referencing texts. A complete example output of such a result set can

be seen in appendix B.

7Technically, locations and people are also named entities. The difference is that locations and people have semantic meaning in form of an assigned synset, while for named entities no information is available beyond their textual representation.


Chapter 6.

Evaluation

This chapter will discuss the evaluation of our system. We will first go into some partic-

ularities of electronic program guides, how they differ from “regular” text corpora and

what that implies for this work. Section 6.2 introduces our test corpus and discusses the

results of a preliminary evaluation we have done in order to gather empirical data. In

section 6.3, individual system components are evaluated, namely the compound splitter,

the word sense disambiguation module and the semantic chunkers. The scoring strategy

for heuristic token pairs is discussed in section 6.4. Finally, the system as a whole is

evaluated in section 6.5 and its limitations are discussed.

6.1. Particularities of EPGs

Unlike most text corpora commonly used in Text Mining, the entries of an EPG are

very heterogeneous. In particular the following five aspects are of interest:

• Style

The style in which the program descriptions are written varies greatly, particularly

between genres. For instance, thrillers and mystery stories are commonly written

in a prose-like fashion, history documentaries or political magazines usually follow

a more formal style, and programs aimed at teenagers often make use of lurid

formulations.

As a consequence of this, there are no clearly recognizable patterns for the use of

punctuation, reported speech or other structural aspects. This means that it is


very difficult for the tokenizer and pos-tagger to ensure that these constructs are

handled consistently across the corpus.

• Length

While some descriptions are very comprehensive and exceed 400 words, others are only one or two sentences long. In some cases, the description field only contained a few catchwords. Figure 6.1 illustrates the distribution of text length measured in an EPG dump of roughly 2000 entries, excluding those with empty

descriptions. The average length of a program description is approximately 100

words. About 50% of all EPG entries had no description.

Figure 6.1.: Distribution of description length in number of words

• Quality

Most text corpora to which semantic analysis is applied meet a certain standard,

for instance scientific papers, newspaper articles or books. This means that these

texts are of a high quality, particularly with regard to grammar, spelling and use

of colloquialisms. Parsing texts from these corpora is thus often easier and less

error prone compared to descriptions from EPGs.

• Actuality

Descriptions of newscasts or political discussions often make use of up-to-date vocabulary. This is not limited to geographical locations or names of people currently in the spotlight but includes the latest “buzzwords”. Because public interest shifts

almost on a daily basis, ontologies like GermaNet are hesitant to incorporate

these words and only do so with a delay of several months, if at all. As a

consequence of this, the semantic analysis of these programs can be very difficult.


Figure 6.2.: Distribution of pos-tags in the test corpus: (a) over all words, (b) over relevant words

• Extrinsic Context

Many program descriptions are written under the assumption that the reader is,

to some extent, already familiar with the program. These texts are said to have

an extrinsic context. TV series in particular are prone to this. For instance, a

regular viewer of House M.D. will know that “Thirteen” refers to a person and

that the series takes place in Princeton, New Jersey. Information like this is rarely

mentioned explicitly, and if so, only at the beginning of a series. As such, it is commonly not available to an analyzing system.

6.2. The Test Corpus and Preliminary Evaluation

In order to evaluate the performance of the system objectively, it was necessary to build

a test corpus as a point of reference. We selected 150 texts from various genres with a

length between 50 and 175 words. All texts were tokenized and each token was manually

annotated with its lemma, part-of-speech tag and a list of adequate synsets.

We wanted to gather some empirical data in terms of what combinations of pos-tags are

most commonly considered descriptive, and how the distance between words factors in.

In order to do this, we implemented a website where users were shown a list of twenty

texts from our corpus (see figure 6.3). Given a list of predetermined word pairs, the

user was asked to select the most descriptive pairs for each text, drag-and-drop them

into a separate list and order them by descriptiveness. The set of “candidate pairs”


Figure 6.3.: Screenshot of the Sliver web application

was created using a prototypic implementation of our system. Each word in a sentence was paired with every other word in its immediate surroundings, up to a maximum lexicographic distance of four. This resulted in approximately 58,000 pairs of words, or about 380 pairs per text, which was obviously too much for any user to survey in a

reasonable amount of time.

The difficulty was to reduce the number of candidates to a manageable amount without

anticipating the results. In other words, it was necessary to only filter out pairs where

one could be sufficiently sure that they were not relevant. It was decided to ignore all

tokens that were, among others, tagged as conjunctions, negations, or numbers. Also

excluded were several pos-combinations, for instance pairs of auxiliary verbs. Finally, all words were rejected that could not be found in GermaNet and have, as such, no

accessible semantic value. The number of candidates was thus reduced from 58,000 to

about 4,300 pairs, averaging a manageable 29 tuples per text.

Figure 6.4 shows the distribution of tuples that were perceived as descriptive with respect

to the possible pos-combinations. Subfigure (a) shows the raw data as entered by the

users, subfigure (b) shows the adjusted distribution of pos-combinations after removing

outliers (i.e. combinations that occurred less than five times) and combinations that

we did not consider to be descriptive in our intended sense, for example noun/noun-

combinations. During a period of three weeks, 73 users participated and created a total


of 1286 evaluations, or 8.6 evaluations per text in the test corpus.

t1.pos   t2.pos   p
ADJA     NN      12.9 %
NN       NN       9.9 %
NN       VVFIN    9.1 %
APPR     NN       5.3 %
NN       VAFIN    5.3 %
ART      NN       4.1 %
ADJD     NN       3.3 %
ADJA     VVFIN    2.8 %
NN       VVPP     2.6 %
NN       VVINF    2.4 %
NE       NN       2.2 %
ADJA     ART      1.6 %
NN       PPOSAT   1.6 %
ADJA     VAFIN    1.4 %
ADJA     APPR     1.4 %
VAFIN    VVPP     1.3 %
APPR     VVFIN    1.3 %
other           31.5 %

(a) all

t1.pos   t2.pos   p
ADJA     NN      29.4 %
NN       VVFIN   20.7 %
ADJD     NN       7.5 %
ADJA     VVFIN    6.3 %
NN       VVPP     6.0 %
NN       VVINF    5.6 %
ADJD     VAFIN    2.8 %
ADV      NN       2.8 %
ADJD     VVFIN    2.5 %
VMFIN    VVINF    1.6 %
ADJA     VVPP     1.5 %
ADJA     VVINF    1.4 %
ADV      VVFIN    1.2 %
ADJD     VVPP     1.2 %
ADJD     VVINF    1.2 %
ADJD     ADJD     1.1 %
ADJA     ADV      1.1 %
other            6.3 %

(b) adjusted

Figure 6.4.: Perceived relevance of pos-tag combinations

The interface was

a web 2.0 application written using ExtJS[43] and jQuery[44]. It was designed to be

easily accessible and self-explanatory, especially to users that were not familiar with these

technologies. The backend was implemented in PHP[45] using the Zend Framework[46]

and a MySQL[47] database.

Two conclusions in particular could be drawn from analyzing the user evaluations:

• over 75% of all pairs fall into the same six classes of pos-combinations

• the perceived relatedness between words is directly connected to their distance

We have addressed the first observation by implementing the A/N and A/V chunkers

which enabled us to capture a large quantity of descriptive terms without the need for

complex analysis. Other important classes of noun/verb combinations are harder to recognize because the two words are usually not directly adjacent, as they were in the A/N and A/V patterns.


Figure 6.5.: Empirical correlation between the distance of two words and their perceived semantic connectedness

Figure 6.5 illustrates the empirical correlation between the lexicographic distance of two words and their perceived relatedness. Almost half of all pairs that were considered descriptive by the users were direct neighbors. Almost anything beyond a distance of

three was not considered relevant.

We used this data to determine the parameters for the tuple generation and the scoring function. Using a search radius greater than three words increases the size of the search space rapidly while only marginally improving recall.

6.3. Component Performance

6.3.1. Compound Splitting

Our test corpus contained a total of 639 words that were tagged as nouns but could not

be found in GermaNet, averaging 4.3 per text. This includes wrongly tagged words


as well as foreign words and abbreviations. Out of these, 71.4% satisfied the conditions specified in section 5.3.2 and were marked for decompounding. The evaluation has shown that our set of criteria was tight and led to an accuracy of 99.7%. Only in edge cases was a word wrongly split; for instance, the word Adventure was lemmatized to Adventur and split into Advent·ur.
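This failure mode can be illustrated with a toy sketch of dictionary-based decompounding. The dictionary and function below are purely illustrative and not the actual JWordSplitter logic: a splitter that accepts any segmentation covered by dictionary entries will happily cut an unusual lemma such as Adventur into spurious parts.

DICTIONARY = {"advent", "ur", "haus", "tür"}  # illustrative toy dictionary

def decompound(word):
    # return a list of dictionary parts covering the word, or None
    word = word.lower()
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in DICTIONARY:
            if tail in DICTIONARY:
                return [head, tail]
            rest = decompound(tail)
            if rest:
                return [head] + rest
    return None

print(decompound("Adventur"))  # ['advent', 'ur'] - the wrong split from above
print(decompound("Haustür"))   # ['haus', 'tür']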

6.3.2. Word Sense Disambiguation

For the evaluation of the word sense disambiguation (WSD) algorithm described in section 5.3.4 we chose the following parameters.

Context Size The context of a word consisted of all nouns in the sentence itself and in the sentences immediately preceding and following it. Interestingly, increasing the size of the context window had very little impact on the performance. This is because the texts in our test corpus contained 6.25 sentences on average, so the seemingly small context window already covered approximately half of the entire text.
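As a minimal sketch of this context construction (assuming sentences are already tokenized and pos-tagged; all names are our own):

def context_nouns(sentences, i):
    # context window: the sentence itself plus its direct neighbors
    window = sentences[max(0, i - 1): i + 2]
    # collect all tokens tagged as nouns (NN; NE could be included as well)
    return [word for sent in window for (word, pos) in sent if pos == "NN"]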

Similarity Measure The Leacock-Chodorow similarity measure was used to calculate

the semantic relatedness of the synsets (see 3.7.3). It was extended to include all relations

modeled in GermaNet.
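For reference, the Leacock-Chodorow measure is sim(s1, s2) = -log(len(s1, s2) / (2D)), where len is the shortest path between the two synsets and D is the maximum depth of the hierarchy. A minimal sketch, with an assumed depth and a precomputed path length (neither reflects the actual GermaNet API; the extension to all GermaNet relations only changes how path lengths are computed and is not shown):

import math

MAX_DEPTH = 20  # assumed maximum depth of the subsumption hierarchy

def lch_similarity(path_length, max_depth=MAX_DEPTH):
    # Leacock-Chodorow: -log(len / (2 * D)); higher means more similar
    return -math.log(path_length / (2.0 * max_depth))

print(round(lch_similarity(3), 2))  # synsets three edges apart -> 2.59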

Threshold Value Let the set f_p be defined as the set of unambiguous tokens in the context of t (see figure 5.2). Let α be defined as the minimum average similarity, and β as the maximum average similarity between the synsets of t and f_p:

$$\alpha(t, f_p) := \min_{s \in t.\mathrm{senses}} \bigl(\mathrm{sim}(s, f_p)\bigr) \qquad \beta(t, f_p) := \max_{s \in t.\mathrm{senses}} \bigl(\mathrm{sim}(s, f_p)\bigr)$$

The threshold value τ that we chose for disambiguating the token t is then defined as

the mean of α and β:

$$\tau := \frac{\alpha(t, f_p) + \beta(t, f_p)}{2}$$

In other words, the algorithm discards all senses that are “less than averagely similar to the context”. Other definitions of τ showed only little impact on the overall result in our evaluation.


The size of the interval [α(t, f_p), β(t, f_p)] is an indicator of the confidence with which t can be disambiguated. To decrease error, our implementation required a minimum interval size of δ. Our evaluation has shown that a value of δ = 0.07 yields good results.
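Putting these parameters together, the filtering step can be sketched as follows. The helper sim_to_context, which returns the average similarity of one sense to the unambiguous context tokens, and all other names are our own, not the actual implementation:

def disambiguate(senses, sim_to_context, delta=0.07):
    scores = {s: sim_to_context(s) for s in senses}
    alpha, beta = min(scores.values()), max(scores.values())
    if beta - alpha < delta:
        return list(senses)  # interval too small: not confident, keep all
    tau = (alpha + beta) / 2.0  # mean of minimum and maximum similarity
    return [s for s in senses if scores[s] >= tau]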

Merge Function To preserve as much conciseness as possible, our implementation only merges synsets that are very closely related. More precisely, synsets are only merged if they stand in a parent-child or sibling relationship in the subsumption hierarchy:

$$\mathrm{merge}(S) = \begin{cases} \mathrm{lso}(S) & \text{if } \forall s \in S : \mathrm{dist}(s, \mathrm{lso}(S)) \le 1 \\ S & \text{otherwise} \end{cases} \qquad \text{with } S \in \mathcal{P}(\mathcal{S})$$

If multiple subsets of S can be merged, it is necessary to combine them “bottom-up” so

that the merging can properly chain upwards.
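A sketch of the merge step under these definitions, assuming helpers lso (lowest super-ordinate) and dist (edge distance in the subsumption hierarchy) that are not shown here:

def merge(synsets, lso, dist):
    # collapse a group of synsets into their lowest super-ordinate if every
    # member is at most one edge away from it (parent-child or siblings)
    common = lso(synsets)
    if all(dist(s, common) <= 1 for s in synsets):
        return {common}
    return set(synsets)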

WSD performance

Table 6.1 shows the results of our evaluation using the parameters described above.

Three values were used to measure the performance. If none of the synsets assigned

to a word is adequate, it is considered a complete miss. Almost 4% of all nouns fall

into that category even before WSD is applied. This is due to faulty pos-tagging or

lemmatization. Average purity describes the percentage of synsets that were marked as

adequate. The average ambiguity denotes the average number of synsets assigned to a

noun.

                       default (no WSD)   WSD (no merging)   WSD (merging)
Complete Misses             3.96 %             9.04 %            7.98 %
Average Purity             75.06 %            91.76 %           95.53 %
Average Ambiguity            2.93               1.95              1.72

Table 6.1.: Performance of word sense disambiguation

In 6% of the cases synsets could be merged, thus increasing conciseness. One case

in particular benefited greatly from this: GermaNet assigns 15 synsets to the word

dollar, each representing a concrete currency (US Dollar, Australian Dollar, and so forth).

Merging these to their lowest super-ordinate naturally decreased ambiguity by a great

margin.


The results show that word sense disambiguation has for the most part significantly improved the quality of the annotations in terms of conciseness. However, in some cases the algorithm discarded all adequate meanings and led to awkward results. For instance, the word Sohn (eng. son) is more often than not disambiguated to the sense son of god instead of male child. In these cases, the inappropriate meaning was more densely interconnected within GermaNet, which led to a higher maximum similarity.

6.3.3. Chunking

An important aspect of the implemented chunking patterns was that they should be semantically safe (see section 5.3.5). In other words, the process of chunking should introduce as little error as possible. As the precision values in table 6.2 show, this was achieved in most instances. The majority of errors were due to faulty pos-tags or out-of-vocabulary words, especially in the case of the NES and PPR chunkers.

Name n recall precision f-score

A/N 796 99.4% 98.6% 99.0%

A/V 102 98.0% 97.0% 97.5%

ANN 19 95.0% 100.0% 97.4%

NES 130 92.3% 89.6% 90.8%

PPR 30 79.1% 97.7% 87.4%

PHR 11 - - -

Table 6.2.: Performance of chunkers

The distinction between NN and NE tags was sometimes poor, in particular when a named entity was immediately followed by a noun, or vice versa. Another issue is names that are also regular nouns, which is not at all uncommon in German (e.g. the word Bäcker is a common surname as well as a profession). The performance of the A/N and A/V chunkers was very good. Only in very few edge cases did they produce false positives, e.g. “Er war immer freundlich Tieren gegenüber” (eng. “He was always friendly towards animals”), where the adjective freundlich (eng. friendly) does not belong to the noun Tier (eng. animal). The phrase chunker had only little overall impact, but where it could be applied it often combined three to four words and improved the local performance significantly. Recall and precision were not calculated for it because the definition of what constitutes a phrase and whether it is included in GermaNet is more or less arbitrary.
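For reference, the values in table 6.2 relate in the usual way: precision is tp/(tp+fp), recall is tp/(tp+fn), and the f-score is their harmonic mean, as the following sketch shows.

def f_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.977, 0.791), 3))  # 0.874, the PPR row of table 6.2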


6.4. Scoring

Our evaluation has shown that the predominant parameters for scoring token pairs are the weight matrix, which assigns a “relevancy weight” to a combination of pos-tags, and the distance between the elements of the pair.

Weight Matrix The STTS tag set consists of 55 tags, so the weight matrix has a total of 3025 entries. However, as the data in figure 6.2 suggests, only a little over ten tags are relevant and the matrix is thus very sparse. This is primarily due to the fact that only verbs, nouns and adjectives carry any semantically relevant meaning, which already reduces the number of “interesting” pos-tags to 17. Further analysis has shown that only about 30 combinations of tags can be considered reliable in terms of semantic connectedness.

Our initial assumption was that a real-valued weight matrix would be favorable because it allows for a more fine-grained ranking of the tuples. In practice, however, the decision whether a token pair is descriptive or not has for the most part turned out to be a binary classification problem.

Word Distance and Search Radius Figure 6.5 indicates a strong correlation between the distance of two words and their likelihood of being semantically connected. This directly implies that a search radius of 3 to 4 is sufficient to include most relevant pairs. If a pos-weighted distance is used, where most tags are mapped to a distance value of less than one, a search radius of 2 to 3 is adequate.
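A minimal sketch of the resulting candidate generation: token pairs are drawn from a bounded search radius and scored by a sparse pos-tag weight matrix, discounted by distance. The weight values, the distance discount and all names are illustrative assumptions, not the exact scoring function, and the sketch uses the plain lexicographic distance (a pos-weighted variant would sum per-tag distance values instead).

SEARCH_RADIUS = 3
WEIGHTS = {("ADJA", "NN"): 1.0, ("NN", "VVFIN"): 0.9}  # sparse, illustrative

def generate_pairs(tokens):
    # tokens: list of (word, pos-tag) pairs in sentence order
    for i, (w1, p1) in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + SEARCH_RADIUS, len(tokens))):
            w2, p2 = tokens[j]
            weight = WEIGHTS.get((p1, p2), 0.0)
            if weight > 0.0:
                yield (w1, w2), weight / (j - i)  # discount by distance

pairs = generate_pairs([("alte", "ADJA"), ("Villa", "NN"), ("brennt", "VVFIN")])
print(sorted(pairs, key=lambda p: -p[1]))
# -> [(('alte', 'Villa'), 1.0), (('Villa', 'brennt'), 0.9)]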

Other factors, in particular the ambiguity of a word, its location in the text or properties of its assigned synsets, seem to have no measurable impact on the perceived relevancy of a pair. This is not surprising: for one, users rarely give the ambiguity of a word much thought unless it is really striking. Put differently, ambiguity is an insignificant factor unless it cannot easily be resolved, and texts with little contextual information naturally try to avoid such cases.

Furthermore, while the location of a word in a text can be important in some genres (e.g.

papers where important terms tend to be concentrated in the abstract and conclusion),

we were unable to identify any such correlation for EPGs.


6.5. Overall Performance

Measuring the performance of the system as a whole is not a trivial task, for two reasons. First, there is no concise reference solution to compare against: word pairs some users find very descriptive may seem irrelevant to others. Second, the use of a bounded search radius puts a hard limit on our system's capacity to detect pairs whose constituents are far apart. Given these circumstances, we have evaluated the system's performance within its technical limits. In a second step we analyzed how many important concepts the implementation missed.

Figure 6.6.: Performance of heuristic tuple generation; (a) simple, (b) weighted

Figure 6.6 shows the percentage of hits and misses for all pairs with a score up to the threshold specified on the x-axis. Subfigure (a) uses the lexicographic word distance and a binary weight matrix; subfigure (b) uses pos-weighted distances and a real-valued matrix. As one can see, using a more fine-grained weight matrix and distance measure increased performance by 10%. The empirically best cut-off point is at around 0.85 (the highest score for a token pair is 1.0).

Using that threshold, the heuristic yielded an average of 0.4 concepts per sentence (or 2.2 per text). Given that the heuristic is primarily used to identify pairs containing a verb, this is not surprising. Together with the 6.3 concepts per text from expanding the A/V and A/N chunks, this totals approximately 8.4 concepts per description.

On average, 1.2 important concepts per text were missed because their elements were too far apart and thus beyond the search radius. The primary reason for this is verbal brackets (see next section).


6.6. Unresolved Issues

During our evaluation we encountered several issues that could not be resolved with the

current implementation. The reason for this is that some grammatical and linguistic

constructs lead to a word order that is structurally identical to some of the patterns we

have assumed to be safe for chunking. It is unlikely that the following problems can be

solved without incorporating a deep grammatical analysis.

• Sentence Adverbials

Some adverbs do not refer to a particular word of a sentence, but to the sentence as

a whole. These are called sentence adverbials. Consider the sentence “Surprisingly,

he left early”. In this instance, surprisingly is a comment on the entire sentence.

Sentence adverbials are less problematic in English since they are in most cases

separated by a comma. This is not the case in German. As a result of this, an

adverb is sometimes erroneously paired with a verb in its proximity:

(1) Langsam hatte er genug. (eng. “Slowly, he had had enough.”)

(2) Langsam fuhr er um die Ecke. (eng. “Slowly, he drove around the corner.”)

In the first case, the adverb is a qualifier for the entire sentence, whereas in the

second case it belongs to the verb. The pair (langsam, haben) makes little sense,

while (langsam, fahren) is very descriptive.

• Verbal Brackets

The splitting of verbs into two parts is a feature of German grammar that is very problematic. It is not at all uncommon that two words that are syntactically and semantically related are on opposite ends of a sentence:

Peter riet Lois dann doch von der Reise ab. (eng. “Peter advised Lois against the trip after all.”)

Here, riet and ab are the two elements of the verb abraten (eng. to advise against). This is worsened further by the fact that both riet and ab can also occur independently: “Ab dann riet er nur noch” (eng. “From then on, he only gave advice”).

• Idioms

A more obvious problem is posed by idiomatic phrases such as “to bite the bullet”. Some of these expressions are covered by GermaNet, but the majority are not.


Chapter 7.

Conclusion and Future Work

With several thousand TV shows being broadcast every single day, keeping track of interesting programs is a tedious and time-consuming chore. This has led to the emergence of recommender systems that analyze a user's behavior and suggest programs similar to what he or she has enjoyed watching in the past. To work effectively, such systems need to process and compare program descriptions on a semantic level. Gaining a complete understanding of natural language text is a difficult and by no means solved problem, even more so for the loosely structured and colloquially written texts commonly found in EPGs. In view of this, attaining partial text comprehension is a sensible compromise.

In this thesis we have described and implemented a system for augmenting natural

language text with semantic annotations. Using both semantic chunking and heuristic

tuple generation, we were able to create a list of descriptive word pairs that can be used

to assess semantic relatedness between documents on a per-concept basis.

Our evaluation has shown that the combination of “safe” chunking and heuristic tuple

generation can provide useful information about the contents of a program description

and the people and locations involved. Its performance, however, is limited by the

accuracy of the heuristic and, more directly, by the comprehensiveness of the underlying

ontology. Furthermore, errors made by the pos-tagger and other components as well as

certain grammatical constructs can cause it to miss important descriptive elements. For

this reason, the annotations should be understood as an auxiliary technique that, used

in conjunction with other methods, can improve the performance of applications like

recommender systems.


The performance improvement achieved by using pos-weighted word distances suggests that implementing more advanced techniques could prove worthwhile, such as adaptive word distances that assign weights based on the local context. This would require input from linguistic experts or extensive statistical analysis.

One area in which our application could be improved is the use of a common data source for all dictionary-based components. In the current implementation, LemServer, JWordSplitter and GermaNet all use their own dictionaries; consolidating them into a central, authoritative data pool could lead to more coherent and comprehensible results, and would make it easier to identify why a certain term does not show up in the list of extracted concepts. A promising approach to this issue could be the integration of the hyphenation data maintained by trennmuster.org[48], either as an extension to or a replacement for JWordSplitter.

Another way to integrate the different components more tightly could be the incorporation of the Wiktionary project[49], which provides grammatical annotations, sense annotations, hyphenation points, a list of synonyms and a gloss, among other things. Research needs to be done to estimate the applicability of this data to our system, in particular with respect to vocabulary coverage, completeness of the annotations and how it could be interconnected with GermaNet.

The question whether our implementation could be applied to languages other than German arises naturally. While specific aspects of our system are tailored to German (e.g. decompounding), most methodologies described in this thesis should be applicable to languages that are grammatically and syntactically similar to German. In practice, however, our system relies on the STTS tag set, which is used throughout the entire application. An interesting approach to resolving this limitation and achieving language independence is the employment of tagset drivers as proposed in [Zem08].

In conclusion, it can be said that our system performs well within its intended area of application and yields valuable semantic annotations. While it naturally does not achieve the comprehensiveness of approaches that employ deep semantic parsing or other complex linguistic models, it is fast, extensible and robust to the heterogeneity of descriptions from EPGs, and it leaves various extension points for future projects to enhance and improve this work.


Bibliography

[BH06] Alexander Budanitsky and Graeme Hirst. Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics, 32(1):13–47, 2006.

[BN85] Wilhelm Barth and Heinrich Nirschl. Sichere sinnentsprechende Silbentrennung für die deutsche Sprache. Angewandte Informatik, 27(4):152–159, 1985.

[BP02] Satanjeev Banerjee and Ted Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In CICLing, pages 136–145, 2002.

[BPP96] Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1):39–71, 1996.

[Bri92] Eric Brill. A simple rule-based part of speech tagger, 1992.

[Cha97] Eugene Charniak. Statistical Techniques for Natural Language Parsing. AI Magazine, 18(4):33–44, 1997.

[CN03] Hai Leong Chieu and Hwee Tou Ng. Named entity recognition with a maximum entropy approach. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, pages 160–163, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[Der89] S. J. DeRose. Stochastic methods for resolution of grammatical category ambiguity in inflected and uninflected languages. PhD thesis, Brown University, Providence, RI, USA, 1989.

[EP06a] Katrin Erk and Sebastian Pado. Shalmaneser - A flexible toolbox for semantic role assignment. In Proceedings of LREC 2006, Genoa, Italy, 2006.

[EP06b] Katrin Erk and Sebastian Pado. Shalmaneser - A Toolchain For Shallow Semantic Parsing. In Proceedings of LREC-06, Genoa, Italy, 2006.

[Fil68] Charles Fillmore. The case for case. In Emmon Bach and R. Harms, editors, Universals in Linguistic Theory. Holt, Rinehart, and Winston, New York, 1968.

[Fil82] Charles J. Fillmore. Frame semantics. In Linguistics in the Morning Calm, pages 111–137, 1982.

[FMS03] Kostas Fragos, Yannis Maistros, and Christos Skourlas. Word sense disambiguation using WordNet relations. In Proc. of the 1st Balkan Conference in Informatics, Thessaloniki, 2003.

[FZ98] M. Fuller and J. Zobel. Conflation-based comparison of stemming algorithms. In Proceedings of the Australian Document Computing Symposium, pages 8–13, Sydney, Australia, August 1998.

[GG93] Thomas R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5:199–220, 1993.

[GHC98] Niyu Ge, John Hale, and Eugene Charniak. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 161–170, 1998.

[Gie08] Eugenie Giesbrecht. Evaluation of pos tagging for web as corpus. Master's thesis, University of Osnabrück, 2008.

[GJ02] Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288, 2002.

[GR71] Barbara B. Greene and Gerald M. Rubin. Automatic Grammatical Tagging of English. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island, 1971.

[JZ07] Jing Jiang and ChengXiang Zhai. An empirical study of tokenization strategies for biomedical information retrieval. Inf. Retr., 10(4-5):341–363, 2007.

[KP96] Wessel Kraaij and Renée Pohlmann. Viewing stemming as recall enhancement. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 40–48, 1996.

[KS63] Sheldon Klein and Robert F. Simmons. A Computational Approach to Grammatical Coding of English Words. J. ACM, 10(3):334–347, 1963.

[KSNM03] Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. Named entity recognition with character-level models. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 180–183. Edmonton, Canada, 2003.

[KXW03] Chunyu Kit, Zhiming Xu, and Jonathan J. Webster. Integrating ngram model and case-based learning for Chinese word segmentation. In Proceedings of the second SIGHAN workshop on Chinese language processing, pages 160–163, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[LCM98] Claudia Leacock, Martin Chodorow, and George A. Miller. Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1):147–165, 1998.

[Les86] Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th annual international conference on Systems documentation, pages 24–26, New York, NY, USA, 1986. ACM.

[Lez98] Wolfgang Lezius. A freely available morphological analyzer, disambiguator and context sensitive lemmatizer for German. In Proceedings of the COLING-ACL, pages 743–747, 1998.

[Lin98] Dekang Lin. An information-theoretic definition of similarity. In Jude W. Shavlik, editor, ICML, pages 296–304. Morgan Kaufmann, 1998.

[LL94] Shalom Lappin and Herbert J. Leass. An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4):535–561, 1994.

[LMBC05] Yaoyong Li, Chuanjiang Miao, Kalina Bontcheva, and Hamish Cunningham. Perceptron learning for Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop, Jeju, Korea, pages 154–157, 2005.

[Lov68] Julie B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, June 1968.

[LSM95] Xiaobin Li, Stan Szpakowicz, and Stan Matwin. A WordNet-based algorithm for word sense disambiguation. In IJCAI, pages 1368–1374, 1995.

[MBF+90] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3:235–244, 1990.

[Mih07] Rada Mihalcea. Using Wikipedia for automatic word sense disambiguation. In Candace L. Sidner, Tanja Schultz, Matthew Stone, and ChengXiang Zhai, editors, HLT-NAACL, pages 196–203. The Association for Computational Linguistics, 2007.

[MJ07] M. Jursic, I. Mozetic, and N. Lavrac. Learning Ripple Down Rules for Efficient Lemmatization. In Proc. 10th Intl. Multiconference Information Society, pages 206–209, 2007.

[ML95] Joseph F. McCarthy and Wendy G. Lehnert. Using decision trees for coreference resolution. In IJCAI, pages 1050–1055, 1995.

[MM01] Rada Mihalcea and Dan Moldovan. Automatic generation of a coarse grained WordNet. In Proceedings of the NAACL workshop on WordNet and Other Lexical Resources, Pittsburgh, USA, 2001.

[MS05] Tomasz Marciniak and Michael Strube. Beyond the pipeline: Discrete optimization in NLP. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 136–143, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

[Pal94] David D. Palmer. Satz - an adaptive sentence segmentation system. Technical Report UCB/CSD-94-846, EECS Department, University of California, Berkeley, Dec 1994.

[Pav05] Sladjana Pavlovic. Unsupervised Coreference Resolution for German. Master's thesis, Department of Linguistics, Eberhard-Karls-Universität Tübingen, 2005.

[PK00] Thierry Poibeau and Leila Kosseim. Proper name extraction from non-journalistic texts. In Walter Daelemans, Khalil Sima'an, Jorn Veenstra, and Jakub Zavrel, editors, CLIN, volume 37 of Language and Computers - Studies in Practical Linguistics, pages 144–157. Rodopi, 2000.

[PLME08] Joël Plisson, Nada Lavrac, Dunja Mladenic, and Tomaz Erjavec. Ripple Down Rule learning for automated word lemmatisation. AI Commun., 21(1):15–26, 2008.

[Por80] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[PW95] Martha Stone Palmer and Zhibiao Wu. Verb semantics for English-Chinese translation. Machine Translation, 10(1-2):59–92, 1995.

[Rat98] Adwait Ratnaparkhi. Maximum entropy models for natural language ambiguity resolution. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA, 1998. Supervisor: Mitchell P. Marcus.

[RR97] Jeffrey C. Reynar and Adwait Ratnaparkhi. A Maximum Entropy Approach to Identifying Sentence Boundaries. In ANLP, pages 16–19, 1997.

[RS95] Emmanuel Roche and Yves Schabes. Deterministic part-of-speech tagging with finite state transducers. Computational Linguistics, 21:227–253, 1995.

[RSW+98] Bob Rehder, M. E. Schreiner, Michael B. W. Wolfe, Darrell Laham, Thomas K. Landauer, and Walter Kintsch. Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25:337–354, 1998.

[RV95] Atro Voutilainen. A syntax-based part-of-speech analyser. In EACL-95, pages 157–164, 1995.

[San90] Beatrice Santorini. Part-of-speech tagging guidelines for the Penn Treebank Project. Technical report, Department of Computer and Information Science, University of Pennsylvania, 1990.

[SB88] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[Sch94] Helmut Schmid. Probabilistic part-of-speech tagging using decision trees, 1994.

[Sch95] Helmut Schmid. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop, pages 47–50, 1995.

[Süd09] Mediendaten Südwest. Aktuelle Basisdaten zu TV, Hörfunk, Print, Film und Internet. http://www.mediendaten.de/fernsehen-empfang-sender.html, 2009.

[SFK99] E. Stamatatos, N. Fakotakis, and G. Kokkinakis. Automatic extraction of rules for sentence boundary disambiguation, 1999.

[SM04] Lei Shi and Rada Mihalcea. An algorithm for open text semantic parsing. In Proceedings of the ROMAND 2004 workshop on Robust Methods in Analysis of Natural language Data, 2004.

[SNL01] Wee Meng Soon, Hwee Tou Ng, and Chung Yong Lim. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544, 2001.

[SRM02] Michael Strube, Stefan Rapp, and Christoph Müller. The influence of minimum edit distance on reference resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 312–319, Philadelphia, July 2002. Association for Computational Linguistics.

[TKMS03] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Human Language Technology Conference (HLT-NAACL 2003), 2003.

[TM00] Kristina Toutanova and Christopher Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora (EMNLP/VLC-2000), October 2000.

[Vol00] Johannes Volmert. Grundkurs Sprachwissenschaft. UTB, Stuttgart, 2000.

[WK92] Jonathan J. Webster and Chunyu Kit. Tokenization as the initial phase in NLP. In Proceedings of the 14th conference on Computational linguistics, pages 1106–1110, Morristown, NJ, USA, 1992. Association for Computational Linguistics.

[WM06] René Witte and Jutta Mülle, editors. Text Mining: Wissensgewinnung aus natürlichsprachigen Dokumenten, Interner Bericht 2006-5. Universität Karlsruhe, Fakultät für Informatik, Institut für Programmstrukturen und Datenorganisation (IPD), 2006.

[WQL07] Xin-Jing Wang, Yong Qin, and Wen Liu. A search-based Chinese word segmentation method. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 1129–1130, New York, NY, USA, 2007. ACM.

[Zdz05] Jonathan A. Zdziarski. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. No Starch Press, San Francisco, CA, USA, 2005.

[Zem08] Daniel Zeman. Reusable tagset conversion using tagset drivers. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008.

[ZG06] Torsten Zesch and Iryna Gurevych. Automatically creating datasets for measures of semantic relatedness. In Proceedings of the Workshop on Linguistic Distances, pages 16–24, Sydney, Australia, July 2006. Association for Computational Linguistics.


List of URLs

[1] http://www.sighan.org/

[2] http://www.metacritic.com/

[3] http://snowball.tartarus.org/

[4] http://khnt.aksis.uib.no/icame/manuals/brown/

[5] http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html

[6] http://www.natcorp.ox.ac.uk/

[7] http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html

[8] http://www.ldc.upenn.edu/Catalog/docs/LDC2005T33/BBN-Types-Subtypes.html

[9] http://nlp.cs.nyu.edu/ene/

[10] http://rdfs.org/sioc/spec/

[11] http://www.obofoundry.org/

[12] http://www.geneontology.org/

[13] http://wordnet.princeton.edu/

[14] http://www.ldc.upenn.edu/Catalog/LDC97T12.html

[15] http://www.cse.unt.edu/~rada/downloads.html

[16] http://www.wikipedia.org

[17] http://framenet.icsi.berkeley.edu/

[18] http://framenet.icsi.berkeley.edu/FrameGrapher/grapher.php

[19] http://gemini.uab.es:9080/SFNsite

[20] http://jfn.st.hc.keio.ac.jp/

[21] http://www.laits.utexas.edu/gframenet/

[22] http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/stts.asc

[23] http://opennlp.sourceforge.net/

[24] http://nlp.stanford.edu


[25] http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

[26] http://www.wolfganglezius.de/doku.php?id=public:cl:morphy

[27] http://kt.ijs.si/software/LemmaGen/

[28] http://rupostagger.sourceforge.net/

[29] http://www.cygwin.com/

[30] http://www.aot.ru/download.php

[31] http://www.genios.de

[32] http://www.sfs.uni-tuebingen.de/

[33] http://www.openthesaurus.de/

[34] http://www.openoffice.org

[35] http://koffice.org/

[36] http://gate.ac.uk/

[37] http://incubator.apache.org/uima/

[38] http://www.oasis-open.org/news/oasis-news-2009-03-19.php

[39] http://www.swig.org/

[40] http://coli.uni-saarland.de/projects/salsa/shal/

[41] http://coli.uni-saarland.de/projects/salsa/page.php?id=index-salsa1

[42] http://sourceforge.net/projects/jwordsplitter/

[43] http://extjs.com/

[44] http://jquery.com/

[45] http://www.php.net

[46] http://framework.zend.com

[47] http://www.mysql.com/

[48] http://groups.google.de/group/trennmuster-opensource/

[49] http://www.wiktionary.de/


Index

ambiguity
    conjunctive, 11
    disjunctive, 11
antecedent, see referent
antonym, see antonymy
antonymy, 29

case frame, see frame semantics
case grammar, see frame semantics
causation, 29
chunking, 7, 58
chunking pattern
    adverbial adjective/verb, 61
    annotation, 61
    attributive adjective/noun, 60
    named entities, 61
    personified profession, 62
    phrase, 62
collocations, 11
compound, 21
    appositional, 22
    copulative, primary, 22
    endocentric, 22
    exocentric, possessive, 22
compound splitting, see decompounding
concept, 3, 65
conciseness, 55
context
    extrinsic, 71
coreference chain, 27
coreference resolution, 26
corpus, 3
CR, see coreference resolution

decompounding, 21
distance
    lexicographic, 67
    pos-weighted, 67

entailment, 29
epenthesis, 50

f-measure, 4
f-score, see f-measure
frame semantics, 34
fuzziness, 55

gloss, 30

head sequence, 60
head token, 60
holonym, see holonymy
holonymy, 29
homograph, 4
homonym, 4
hypernym, see hypernymy
hypernymy, 29
hyponym, see hyponymy
hyponymy, 29

inflection, 18

language
    agglutinative, 20
    fusional, 20
    morphemic, 8
lemma, 18
lemmatization, 20
lexeme, 18
lexical category, see part-of-speech
lexical relation, 29
lowest super-ordinate, 32
lso, see lowest super-ordinate

meaning
    concise, see conciseness
    fuzzy, see fuzziness
meronym, see meronymy
meronymy, 29
morpheme, 4
morphology, 18
most specific common subsumer, see lowest super-ordinate

n-gram, 13
named entity recognition, 26
NER, see named entity recognition
noise word, see stop word

ontology, 28
oronym, 13

part-of-speech tagging, 24
polymorpheme, see compound
polyseme, 4
pos, see part-of-speech tagging
precision, 4

qualifier, 61

recall, 4
reference
    anaphoric, 27
    cataphoric, 27
    endophoric, 27
    exophoric, 27
referent, 26
relevant word, 3, 38

scoring function, 67
segment frequency, 14
semantic role labeling, 36
semantic similarity, 32
sentence adverbials, 80
sentence boundary disambiguation, 9
sentence detection, 9
shallow semantic parsing, 34
stem, 18
stemmer, see stemming
    heavy, strong, 19
    light, weak, 19
stemming, 18
stop sequence, 60
stop word, 23
sublanguage, 10
subsumption hierarchy, 32
suppletive, 18
synonym ring, see synset
synonym set, see synset
synset, 4, 29
synset expansion, 59

text corpus, see corpus
tf.idf, 3
token sequence
    head, 60
    stop, 60
    uniform, 60
tokenization
    dictionary-based, 12
    n-gram, 13
    perceptron learning, 13
    search-based, 14
troponym, see troponymy
troponymy, 29

unification function, 59
uniform sequence, 60

valency, 34
    expansion, 35
    reduction, 35

weight matrix, 66
word net, 29
word segmentation, 8
word sense disambiguation, 33
WSD, see word sense disambiguation


Appendix A.

The STTS tag set

Name Meaning

ADJA attributive adjective

ADJD adverbial or predicative adjective

ADV adverb

APPR preposition; left part of a circumposition

APPRART preposition with article

APPO postposition

APZR right part of a circumposition

ART definite or indefinite article

CARD cardinal number (ordinal numbers are tagged as ADJA)

FM foreign-language material

ITJ interjection

KOUI subordinating conjunction with “zu” and infinitive

KOUS subordinating conjunction with clause

KON coordinating conjunction

KOKOM comparative conjunction

NN common noun

NE proper noun

PDS substituting demonstrative pronoun

PDAT attributive demonstrative pronoun

PIS substituting indefinite pronoun

PIAT attributive indefinite pronoun without determiner

PIDAT attributive indefinite pronoun with determiner

PPER irreflexive personal pronoun


PPOSS substituting possessive pronoun

PPOSAT attributive possessive pronoun

PRELS substituting relative pronoun

PRELAT attributive relative pronoun

PRF reflexive personal pronoun

PWS substituting interrogative pronoun

PWAT attributive interrogative pronoun

PWAV adverbial interrogative or relative pronoun

PAV pronominal adverb

PTKZU “zu” before infinitive

PTKNEG negation particle

PTKVZ separated verb particle

PTKANT answer particle

PTKA particle with adjective or adverb

TRUNC first element of a truncated compound

VVFIN finite verb, full

VVIMP imperative, full

VVINF infinitive, full

VVIZU infinitive with “zu”, full

VVPP past participle, full

VAFIN finite verb, auxiliary

VAIMP imperative, auxiliary

VAINF infinitive, auxiliary

VAPP past participle, auxiliary

VMFIN finite verb, modal

VMINF infinitive, modal

VMPP past participle, modal

XY non-word, containing special characters

$, comma

$. sentence-final punctuation

$( other punctuation; sentence-internal


Appendix B.

Example output of Sliver

Program description:

“Shaggy und seine liebenswerte Deutsche Dogge, der riesengroße, aber nicht allzu heldenhafte Scooby-Doo, müssen neue Abenteuer bestehen! Alles fängt damit an, dass Shaggy eine große Summe von seinem spurlos verschwundenen Onkel Albert erbt. Zusammen mit Scooby-Doo zieht er in dessen alte Villa. Hier erfahren die beiden den Grund für das plötzliche Verschwinden des Onkels - und damit beginnt für sie eine weitere Reise in aufregende Abenteuer...”

(eng. “Shaggy and his lovable Great Dane, the huge but not exactly heroic Scooby-Doo, must face new adventures! It all begins when Shaggy inherits a large sum of money from his uncle Albert, who has vanished without a trace. Together with Scooby-Doo he moves into the uncle's old villa. There the two learn the reason for the uncle's sudden disappearance - and with that, another journey into exciting adventures begins...”)



<?xml version="1.0" encoding="ISO-8859-1"?>

<AnalysisResult>

<Locations/>

<NamedEntities>

<NamedEntity name="Albert"/>

<NamedEntity name="Shaggy"/>

<NamedEntity name="Scooby-Doo"/>

</NamedEntities>

<Pairs>

<Pair>

<AnnotatedToken value="Dogge" pos="NN" lemma="Dogge" synsets="nTier.2108"/>

<AnnotatedToken value="liebenswerte" pos="ADJA" lemma="liebenswert" synsets="aVerhalten.11"/>

</Pair>

<Pair>

<AnnotatedToken value="Dogge" pos="NN" lemma="Dogge" synsets="nTier.2108"/>

<AnnotatedToken value="deutsche" pos="ADJA" lemma="deutsch" synsets="aGesellschaft.396"/>

</Pair>

<Pair>

<AnnotatedToken value="Scooby-Doo" pos="NN" lemma="Scooby-Doo" synsets=""/>

<AnnotatedToken value="heldenhafte" pos="ADJA" lemma="heldenhaft" synsets="aVerhalten.282"/>

</Pair>

<Pair>

<AnnotatedToken value="Abenteuer" pos="NN" lemma="Abenteuer" synsets="nGeschehen.676"/>

<AnnotatedToken value="neue" pos="ADJA" lemma="neu" synsets="aZeit.182,aZeit.335"/>

</Pair>

<Pair>

<AnnotatedToken value="Summe" pos="NN" lemma="Summe" synsets="nBesitz.127,nMenge.678"/>

<AnnotatedToken value="große" pos="ADJA" lemma="groß" synsets="aAllgemein.3,aMenge.134"/>

</Pair>

<Pair>

<AnnotatedToken value="Onkel" pos="NN" lemma="Onkel" synsets="nMensch.89"/>

<AnnotatedToken value="verschwundenen" pos="ADJA" lemma="verschwinden"

synsets="vBesitz.456,vVeraenderung.377,vVeraenderung.1482"/>

</Pair>

<Pair>

<AnnotatedToken value="Villa" pos="NN" lemma="Villa" synsets="nArtefakt.3687"/>


<AnnotatedToken value="alte" pos="ADJA" lemma="alt" synsets="aZeit.235,aZeit.329,aZeit.337"/>

</Pair>

<Pair>

<AnnotatedToken value="Verschwinden" pos="NN" lemma="Verschwinden" synsets="nGeschehen.4142"/>

<AnnotatedToken value="plötzliche" pos="ADJA" lemma="plötzlich" synsets="aZeit.29"/>

</Pair>

<Pair>

<AnnotatedToken value="Reise" pos="NN" lemma="Reise" synsets="nGeschehen.690"/>

<AnnotatedToken value="weitere" pos="ADJA" lemma="weiter" synsets="aZeit.43"/>

</Pair>

<Pair>

<AnnotatedToken value="Abenteuer" pos="NN" lemma="Abenteuer" synsets="nGeschehen.676"/>

<AnnotatedToken value="aufregende" pos="ADJA" lemma="aufregend" synsets="aGefuehl.302"/>

</Pair>

</Pairs>

<HeuristicPairs>

<HeuristicPair score="1,000">

<AnnotatedToken value="Abenteuer" pos="NN" lemma="Abenteuer" synsets="nGeschehen.676"/>

<AnnotatedToken value="bestehen" pos="VVINF" lemma="bestehen"

synsets="vAllgemein.11,vAllgemein.310,vGesellschaft.970,vGesellschaft.998,..."/>

</HeuristicPair>

<HeuristicPair score="0,714">

<AnnotatedToken value="erfahren" pos="VVFIN" lemma="erfahren"

synsets="aGeist.88,vBesitz.290,vKognition.257,vKognition.264"/>

<AnnotatedToken value="Grund" pos="NN" lemma="Grund" synsets="nArtefakt.6315,nMotiv.2,nnatGegenstand.3"/>

</HeuristicPair>

<HeuristicPair score="0,650">

<AnnotatedToken value="spurlos" pos="ADJD" lemma="spurlos" synsets="aprivativ.91"/>

<AnnotatedToken value="Onkel" pos="NN" lemma="Onkel" synsets="nMensch.89"/>

</HeuristicPair>

<HeuristicPair score="0,588">

<AnnotatedToken value="Onkels" pos="NN" lemma="Onkel" synsets="nMensch.89"/>

<AnnotatedToken value="beginnt" pos="VVFIN" lemma="beginnen" synsets="vVeraenderung.6,vVeraenderung.95"/>

</HeuristicPair>

<HeuristicPair score="0,500">

<AnnotatedToken value="Summe" pos="NN" lemma="Summe" synsets="nBesitz.127,nMenge.678"/>

<AnnotatedToken value="spurlos" pos="ADJD" lemma="spurlos" synsets="aprivativ.91"/>


</HeuristicPair>

<HeuristicPair score="0,500">

<AnnotatedToken value="Onkel" pos="NN" lemma="Onkel" synsets="nMensch.89"/>

<AnnotatedToken value="erbt" pos="VVFIN" lemma="erben" synsets="vBesitz.374"/>

</HeuristicPair>

<HeuristicPair score="0,500">

<AnnotatedToken value="zieht" pos="VVFIN" lemma="ziehen"

synsets="vKoerperfunktion.65,vKoerperfunktion.298,vKontakt.18,vLokation.273,..."/>

<AnnotatedToken value="Villa" pos="NN" lemma="Villa" synsets="nArtefakt.3687"/>

</HeuristicPair>

<HeuristicPair score="0,500">

<AnnotatedToken value="beginnt" pos="VVFIN" lemma="beginnen" synsets="vVeraenderung.6,vVeraenderung.95"/>

<AnnotatedToken value="Reise" pos="NN" lemma="Reise" synsets="nGeschehen.690"/>

</HeuristicPair>

</HeuristicPairs>

</AnalysisResult>
