Department of Informatics and Mathematics
Chair for Distributed Information Systems
Prof. Dr. Harald Kosch
Enhancing Text Tokenization with Semantic Annotations using
Natural Language Processing and Text Mining Techniques
Raphael Pigulla
June 22, 2009
37544, raphael@pigulla.net
Diploma thesis supervised by Dipl.-Ing. Günther Hölbling
Advisor: Prof. Dr. Harald Kosch
2nd Advisor: Dr. Bernhard Sick
Declaration of Authorship

I hereby declare that I have written this thesis independently, that it has not been
submitted elsewhere for examination purposes, and that I have used no aids other
than those stated. Passages taken verbatim or in substance from other sources are
marked as such.

Passau, June 22, 2009
Raphael Pigulla
Acknowledgements
I would like to extend my sincerest gratitude to everyone who has helped me with this
thesis. I thank my advisor Günther Hölbling for his assistance and support and Prof.
Dr. Harald Kosch for providing me the opportunity to work in this interesting and
challenging field of research.
I wish to thank Walesia Bernard, Paul Maier and Andrew Bromba for their much ap-
preciated comments and suggestions. Thanks go out to everyone who took the time to
participate in the web survey and contributed valuable data pivotal for various aspects
of this work. Finally, I am deeply obliged to my family for the ceaseless support and
patience.
Abstract
Extracting semantic information from natural language texts is a task that encompasses
many different concepts. Implementations are often tailored towards specific textual
domains. In this thesis, we propose a generally applicable approach to augmenting text
with semantic annotations.

The system described in this thesis uses shallow semantic chunking and a heuristic based
on part-of-speech tags to generate a list of word tuples descriptive for a given text. The
input is first tokenized and each word is annotated with one or more meanings using
GermaNet. Neighboring tokens that match certain structural patterns are combined
into chunks. From the thusly structured text, descriptive word tuples are extracted
using part-of-speech tags and other lexical and grammatical features.

Our evaluation against a manually annotated test corpus has shown that this approach
is robust to texts of varying styles and genres. The semantic annotations can be used in
conjunction with standard text mining techniques to improve the performance of
conventional search engines and recommender systems.
Contents
1. Introduction
   1.1. Outline
   1.2. Related Work
   1.3. Explanation of Important Terms
   1.4. Remarks
2. Tokenization
   2.1. Problems of Tokenizing Natural Language
        2.1.1. Word Segmentation
        2.1.2. Sentence Detection
        2.1.3. Sublanguages
   2.2. Tokenization Methods
        2.2.1. Dictionary-Based
        2.2.2. N-Grams
        2.2.3. Perceptron Learning
        2.2.4. Search-Based
3. Natural Language Processing and Text Mining Techniques
   3.1. Applications of NLP
   3.2. Morphological Techniques
        3.2.1. Stemming
        3.2.2. Lemmatization
        3.2.3. Compound Splitting
   3.3. Stop Words
   3.4. Part-of-Speech Tagging
   3.5. Named Entity Recognition
   3.6. Coreference Resolution
   3.7. Ontologies
        3.7.1. Structural Description
        3.7.2. Formalization
        3.7.3. Measuring Semantic Relatedness
   3.8. Word Sense Disambiguation
   3.9. Frame Semantics
4. Assessment of Existing Libraries and Toolkits
   4.1. Part-of-Speech Taggers
        4.1.1. OpenNLP
        4.1.2. Stanford NLP
        4.1.3. TreeTagger
   4.2. Lemmatizers
        4.2.1. Morphy
        4.2.2. LemmaGen
        4.2.3. LemServer
   4.3. Ontologies
        4.3.1. GermaNet
        4.3.2. OpenThesaurus
   4.4. Frameworks
        4.4.1. GATE
        4.4.2. UIMA
   4.5. Miscellaneous
        4.5.1. Shalmaneser
        4.5.2. JWordSplitter
5. Implementation
   5.1. Preprocessing
   5.2. Tokenization and PoS-Tagging
   5.3. Semantic Analysis
        5.3.1. Lemmatization
        5.3.2. Compound Splitting
        5.3.3. Ontology Integration
        5.3.4. Word Sense Disambiguation
        5.3.5. Chunking
   5.4. Extraction of Descriptive Terms
        5.4.1. Tuple Generation
        5.4.2. Search-Space Reduction
        5.4.3. Scoring
   5.5. Result Set
6. Evaluation
   6.1. Particularities of EPGs
   6.2. The Test Corpus and Preliminary Evaluation
   6.3. Component Performance
        6.3.1. Compound Splitting
        6.3.2. Word Sense Disambiguation
        6.3.3. Chunking
   6.4. Scoring
   6.5. Overall Performance
   6.6. Unresolved Issues
7. Conclusion and Future Work
A. The STTS tag set
B. Example output of Sliver
List of Figures
2.1. A token hierarchy
2.2. Naive sentence detection algorithm
3.1. An exemplary ontology
3.2. Application of a semantic frame
3.3. Relations between semantic frames
4.1. Excerpt of LemmaGen's rule set for German
4.2. LemServer architecture
4.3. The Apache UIMA project
5.1. The Sliver processing pipeline
5.2. Algorithm for word sense disambiguation
5.3. Example of semantic chunking
5.4. Basic algorithm for heuristic tuple generation
5.5. Comparison of lexicographic and pos-weighted distances
6.1. Distribution of description length
6.2. Distribution of pos-tags in the test corpus
6.3. Screenshot of the Sliver web application
6.4. Perceived relevance of pos-tag combinations
6.5. Perceived semantic connectedness
6.6. Performance of heuristic tuple generation
List of Tables
4.1. Performance of evaluated pos-taggers
4.2. Performance of evaluated lemmatizers
4.3. Coverage and relational density of evaluated ontologies
5.1. Overview of chunking patterns
6.1. Performance of word sense disambiguation
6.2. Performance of chunkers
Chapter 1.
Introduction
In 2008, the average German household received over 70 different TV stations [Süd09].
With a conservative estimate of one program per hour, this totals over 1,600 programs
per day, every day. Finding shows of interest has become a cumbersome and oftentimes
unmanageable task for the user. Intelligent recommender systems integrated into a user’s
TV can monitor his or her behavior and provide information about relevant broadcasts
in real time or record them autonomously while the user is away. The acceptance of
such systems by the user is directly connected to their performance. The performance,
in turn, depends fundamentally on the recommender’s ability to relate shows on a se-
mantic level and beyond the genres, actor names and keywords provided by electronic
program guides (EPG).
This thesis proposes a system to extract important semantic concepts from program
descriptions in a fully automated fashion. We have approached this problem combining
various techniques from the field of Natural Language Processing (NLP) and Text Min-
ing, described in detail in this work. Through a series of processing steps, the input text
is gradually augmented with semantic annotations that provide rich grounds for other
systems to make hand-tailored recommendations on a just-in-time basis.
1.1. Outline
This thesis is structured as follows. In the rest of this chapter we will first give a brief
overview of related techniques and applications. We will then establish some common
terminology that will be used throughout this work. Terminology only required in a
specific context is introduced as needed within the respective chapters.
The second chapter introduces the principles of tokenization in a broader context
before discussing their applicability to NLP. Several distinct ways of
implementing a tokenizer for natural language texts will be presented and illustrated
with concrete examples. Chapter three gives an overview of some important and most
commonly used facets of Natural Language Processing and Text Mining, starting with
elementary word-based techniques and ending with sophisticated methods that combine
multiple aspects. Chapter four details our assessment of existing libraries, toolkits and
frameworks that we considered for integration into our system. They will be evaluated
in terms of performance and applicability to our aim. In chapter five we describe the
implementation of our system and how the previously evaluated techniques were em-
ployed. Chapter six introduces the evaluation setup and reviews the performances of
the individual components and the application as a whole. Finally, a conclusion is
drawn in chapter seven.
1.2. Related Work
A system structurally similar to our implementation was built by Zesch and Gurevych
[ZG06]. The purpose of that system, however, differs significantly from ours. It is meant
as a tool to automatically generate pairs of words that can be used to evaluate measures
of semantic relatedness. Unlike our document-based approach, it works on an entire
corpus and creates tuples of grammatically homogeneous words which are then meant
to be rated by a human in terms of similarity.
Latent Semantic Analysis (LSA) is a related technique that can be used to quantify the
semantic relatedness between texts. However, it is patented and computationally expensive.
Its runtime scales disproportionately with the number of texts to be indexed and the
addition of a new text requires a complete re-run of the algorithm on the entire corpus.
This limits the practical applicability of LSA to rapidly growing text corpora like EPGs.
Furthermore, it was concluded by Rehder et al. that LSA does not perform well on
texts with fewer than 200 words [RSW+98]. The majority of EPG entries are
considerably shorter.
Less directly related is the field of Automatic Text Summarization. Here, the idea is
to capture a text’s most important aspects in order to automatically generate abstracts
or summaries. This is not applicable to our work for two reasons. First, the texts to
be processed are very short and concise to begin with. Second, little is gained in terms
of processability since a summary is computationally no easier to handle than the full
text from which it originated.
1.3. Explanation of Important Terms
corpus A corpus (pl. corpora) is a large set of texts that is often annotated with
additional information, depending on its intended use. For instance, the Bible
could be considered a corpus with verse annotations. Well-tended corpora are
frequently used in NLP to train machine learning systems. They are also called
training corpora.
relevant document A document within a corpus is called relevant with respect to a
query if it should be returned by that query. In the trivial case of a full-text search
for the string money, a document would be relevant if and only if it contained the
word money.
relevant word In the context of this work, a relevant word is a word that has an intrinsic
and accessible meaning, or sense. Intrinsic means the word is meaningful in and
by itself. For instance, the word nonetheless has no innate sense and is as such
considered irrelevant. To be relevant, a word’s meaning also has to be accessible
to the system. This commonly means that the word can be found in a program’s
look-up dictionary, whatever that may be in a concrete implementation.
concept We use the term concept to describe a pair of words that is descriptive for a
text. The decision whether such a word tuple is descriptive or not is of course very
subjective. In this work, we use the loose definition of meaningful to a human in
the sense that a list of concepts can form a “mental image”, e.g. (prince, enchanted),
(frog, kiss), (spoiled, princess).
tf.idf The term frequency/inverse document frequency is a simple way to weigh the
significance of a term in a corpus. The tf.idf score increases the more frequently
a word appears in a document, but is reduced for terms that are common
across the entire corpus. It is thus a measure of the specificity of a term for a
given text. More details on the mathematical background can be found in [SB88].
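To make the definition concrete, a minimal computation using the plain logarithmic idf variant might look as follows (the toy corpus is purely illustrative; actual weighting schemes vary, cf. [SB88]):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf.idf of `term` in `doc`, where `corpus` is a list of token lists."""
    tf = Counter(doc)[term] / len(doc)            # term frequency within the document
    df = sum(1 for d in corpus if term in d)      # number of documents containing the term
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

corpus = [["frog", "kiss", "prince"],
          ["frog", "pond", "water"],
          ["prince", "castle", "kiss"]]

# "pond" occurs in only one document and is therefore more specific than
# the corpus-wide "frog", so it receives the higher score
print(tf_idf("pond", corpus[1], corpus))
print(tf_idf("frog", corpus[1], corpus))
```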
precision The precision of an IR system is the percentage of relevant documents among
all retrieved documents for a query. In other words, a precision of 1 means that
only relevant documents were returned, while a precision of 0 means none of the
returned documents were relevant.
recall The ratio of returned relevant documents to all relevant documents is the recall.
That is to say, a high recall means the result set contains most of the correct
answers, and a low recall means only few relevant documents were returned. It is
clear that neither precision nor recall alone is a meaningful measure. If the entire
corpus is returned, for instance, the recall is 100% while the precision will be poor.
f-measure The f-measure (or f-score) is a combination of both recall and precision. It
is commonly used to describe the performance of an IR system with a single value.
The f-score is defined as F = 2 · (p · r) / (p + r). Variations emphasizing either recall
or precision are also common.
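All three measures can be computed directly from the sets of retrieved and relevant documents; a minimal sketch with hypothetical document IDs:

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def f_measure(retrieved, relevant):
    """Harmonic mean of precision and recall: F = 2 * (p * r) / (p + r)."""
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 * p * r / (p + r)

relevant  = {"d1", "d2", "d3", "d4"}   # hypothetical document IDs
retrieved = {"d1", "d2", "d5"}

print(precision(retrieved, relevant))  # 2/3: one retrieved document is irrelevant
print(recall(retrieved, relevant))     # 1/2: half of the relevant documents found
print(f_measure(retrieved, relevant))
```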
synset A synset (sometimes also called a synonym ring) is a set of semantically equiv-
alent elements. In linguistics, a synset is a set of synonymous words, i.e. words
that share the same meaning. An example of this is the set {father, dad, daddy}.
homonym A homonym is, in a sense, the opposite of a synonym. Two words with
distinct meanings are said to be homonymous if they are spelled and pronounced
identically. The word plane is a homonym with the meanings of aircraft and
surface, among others.
homograph Words that are spelled the same, but pronounced differently are homographs, e.g. the record and to record. Because in written language every homograph
is also a homonym, we will not always strictly distinguish between the two.
polyseme A homograph or homonym whose meanings are all closely related is called a
polyseme. The word screen with the conceptually related meanings of wire mesh
and display is an example of this.
morpheme The smallest linguistically meaningful unit is called a morpheme. While
morphemes have a meaning, they can often not stand by themselves. In English,
for example, the morphemic suffix -ish (with the meaning of to some degree) does
not occur by itself, but can be appended to adjectives or nouns (“Is he big? - No,
but biggish.”).
1.4. Remarks
While most techniques and issues discussed in this work are, to a greater or lesser extent,
applicable to all languages, this is not necessarily always the case. Wherever possible,
we try to use examples in English so the reading flow is not interrupted. In situations
where this is not sensible, we will provide English translations that may not reflect the
intended illustrative purposes of the German example.
Chapter 2.
Tokenization
Working with textual input is one of the most frequently performed tasks in computing,
be it natural language, the source code of a program or something in between like a
domain specific language such as SQL. At any rate, the first step that needs to be
done is to transform the concrete textual representation into something suitable for
computational processing. The task of grouping semantically meaningful sequences of
elements into tokens is called lexical analysis or tokenization.
The “classical” use of tokenization is in the context of compilers. The tokenizer reads the
input sequentially character by character and creates a sequence of tokens that is used
to build the parse tree. Being an essential part of this process, tokenization is oftentimes
taken for granted and rarely given much thought as to how it is actually implemented.
This is partially due to the fact that most commonly parsed data are by design trivial
to tokenize, such as source code, arithmetic expressions or command line arguments.
However, as soon as one leaves the world of well-defined and computer-friendly grammar,
things quickly become more complex. While “traditional” tokenization can in most cases
easily be dealt with by finite state machines or regular expressions, more sophisticated
techniques are needed when working with texts in natural language.
In the first section of this chapter we will discuss how the concept of tokenization can
be applied to Natural Language Processing and what challenges arise when doing so.
The second section introduces a variety of techniques to effectively tokenize natural
language.
2.1. Problems of Tokenizing Natural Language
In natural language it may seem intuitively clear what a token is: a word. But actually
agreeing on a concise, universal definition is not at all trivial or even feasible. This is
because different views on the same problem may require varying degrees of abstraction.
It is often helpful to interpret these different levels of granularity as a hierarchy, if
possible.
To illustrate, consider figure 2.1. The most basic level is the binary representation of
the data where each byte could be considered a token. One level above that, up to four
bytes represent a Unicode code point, or character. A sequence of characters forms a word
token, and so on. The combination of semantically related tokens is called chunking. Its
practical application will be discussed in section 5.3.5.
Figure 2.1.: A token hierarchy
It is possible that a given syntax requires a token which does not possess any innate
semantic value. This is the case with blanks as can be seen in figure 2.1 on the transition
from the character to the word level. Another example is the padding bytes used in UTF-32
character encoding1.
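The byte-to-character transition of this hierarchy can be observed directly; in UTF-8, for example, a single code point occupies between one and four bytes:

```python
# one Unicode code point occupies 1 to 4 bytes in UTF-8, depending on the character
for ch in ["a", "ä", "€", "𝄞"]:
    print(ch, len(ch.encode("utf-8")))  # prints 1, 2, 3 and 4 respectively
```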
There are several other “layers” that could be added to this hierarchy, such as phonemes
(the smallest unit of sound one can utter) or syllables. However, not all linguistic
elements follow this strict compositional pattern. For instance, morphemes are not
always assembled from syllables: the syllabic composition of the German word zerlegen
is zer·le·gen while its morphemic one is zer·leg·en.
It becomes clear that the definition of a token depends on the application at hand. In
any case, a token is considered atomic in the context in which it was defined.
1 The byte representation in figure 2.1 is in UTF-8, which is a variable-length encoding using 1 to 4 bytes as needed.
In Natural Language Processing, tokenization commonly refers to the initial process of
splitting the textual input into its most basic relevant parts, namely words and punctu-
ation marks.
It’s never just a game when you’re winning .
At this stage, no semantic information is available and the tokenizer should make no
assumptions about the contents of the text. In practice, however, this can often not be
avoided, especially when ambiguities need to be resolved. Section 2.2 will discuss this
in more detail.
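For whitespace-delimited languages such as English, this initial splitting step can be approximated with a single regular expression. The following sketch keeps simple contractions intact and separates punctuation marks; it is deliberately naive, assumes straight apostrophes, and handles none of the ambiguities discussed in this chapter:

```python
import re

def tokenize(text):
    # words (optionally with one internal apostrophe, e.g. "you're")
    # or single punctuation marks
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize("It's never just a game when you're winning."))
# ["It's", 'never', 'just', 'a', 'game', 'when', "you're", 'winning', '.']
```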
The complete process of tokenization can be divided into two distinct steps. First,
the individual sentences have to be identified. This is done through sentence boundary
disambiguation. Second, word segmentation splits the “content” of each sentence into
separate tokens. In practice, this is commonly done in reverse order since sentence
detection depends on the analysis of its tokens.
2.1.1. Word Segmentation
The task of dividing a sequence of characters into its component words is called word
segmentation. From a European or American point of view this is fairly straightforward
as it can largely be dealt with by splitting a string by the space character. However,
most morphemic languages (i.e. languages where a single symbol represents an entire
morpheme) lack such a distinct delimiter, for instance Chinese or Thai. In these cases,
word segmentation turns out to be a rather difficult problem and more sophisticated
approaches are required2.
Real world applications processing German language texts are oftentimes faced with
another issue. In part due to the steadily increasing adaptation of Anglicisms, the blank
is now frequently being used to (incorrectly) separate nominal compounds. This does
not only occur in colloquial speech but can also appear as a result of Machine Translation
(e.g. grape juice → Trauben Saft). Because the grammar generally allows for the use of two
consecutive nouns (“Er wollte aus den Trauben Saft machen”), the tokenizer cannot decide
which interpretation is correct by syntactic information alone.
2 More precisely, it is not the lack of a distinct delimiter per se that is the problem. The real issue is the lack of unambiguous word boundary indicators.
In addition to this, it is often mandatory to detect and split contractions, i.e. the
shortening of a word by omitting internal characters (I am → I’m or going to → gonna).
While these contractions follow fairly strict rules in written language, they can be difficult
to detect and handle if the input has come from a speech recognition system.
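In written text, rule-based contraction handling typically amounts to a lookup table applied via regular expressions; a minimal sketch (the expansion table is illustrative and far from complete, and this simple version discards capitalization):

```python
import re

# illustrative expansion table; a real system would be far more complete
CONTRACTIONS = {"i'm": "i am", "gonna": "going to",
                "don't": "do not", "you're": "you are"}

def expand_contractions(text):
    """Replace known contractions; note that capitalization is not preserved."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm gonna win"))  # i am going to win
```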
2.1.2. Sentence Detection
For a human reader, deciding where one sentence ends and the next begins is an al-
most trivial task. However, most of what we subconsciously factor in when making this
decision is not easily quantifiable: reading experience, context knowledge and subtle
semantics. Formalizing this process of sentence detection or sentence boundary disam-
biguation has proven to be more difficult than one might expect.
Although a sentence is commonly ended by a period, the reverse is not always true. A
period can also be a decimal point, denote an abbreviation or ellipsis, or be part of an
emoticon or URL (or possibly a combination thereof). The actual distribution obviously
varies greatly between corpora. It has been estimated that almost 50% of all periods
in the Wall Street Journal corpus are due to their use in abbreviations while it is less
than 10% in the Brown corpus [SFK99]. Even further complications arise with the use
of colloquial speech, embedded quotations or some rhetorical devices.
for (i := 0; i < t.length; i++) do
    if (t[i] = ".") then
        if in_set(abbr, t[i-1]) then continue;
        if capitalized(t[i+1]) then mark_sentence_end(i);
    end;
end;
Figure 2.2.: Naive sentence detection (simplified)
Given a list t of tokens and a set abbr of known abbreviations, a crude rule-based
approach like in figure 2.2 could be employed. This is reported to achieve an average
accuracy of approximately 95% on standard text corpora. Given the fact that humans
agree unanimously in almost each and every case, this seems much less impressive.
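A direct transcription of the rule-based approach of figure 2.2 into executable code might look as follows (adapted so that abbreviations keep their period attached to the token; the abbreviation set is illustrative):

```python
ABBREVIATIONS = {"Dr.", "Prof.", "etc.", "e.g.", "i.e."}  # illustrative, not complete

def naive_sentence_ends(tokens):
    """Return indices of tokens that end a sentence, following the naive rule."""
    ends = []
    for i, tok in enumerate(tokens):
        if not tok.endswith("."):
            continue
        if tok in ABBREVIATIONS:            # the period belongs to an abbreviation
            continue
        if i + 1 == len(tokens) or tokens[i + 1][0].isupper():
            ends.append(i)                  # next token capitalized (or end of input)
    return ends

tokens = "He met Dr. Smith . They talked .".split()
print(naive_sentence_ends(tokens))  # [4, 7] -- "Dr." is correctly skipped
```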
Furthermore, errors made in an early stage of an analysis propagate rapidly throughout
subsequent steps and can significantly impact the overall result. Grammatical analysis
in particular, such as part-of-speech tagging as described in section 3.4, depends heavily
on accurate sentence detection.
While simple rule-based approaches or regular expressions can work reasonably well for
the text genres they were initially designed for, they naturally do not generalize well
beyond them. For this reason, more flexible and domain-independent algorithms were
devised that can adapt to any corpus by training them appropriately. Among the most
commonly chosen machine learning techniques for this task are neural networks [Pal94]
and maximum entropy models [RR97]. Current implementations achieve a precision of
98.5% and higher.
2.1.3. Sublanguages
Although trainable systems alleviate the need for genre-specific tokenizers notably, there
are still cases in which they do not perform viably. Texts from very restricted domains
often resort to a unique sublanguage for that domain.
In biomedical literature, for instance, the inconsistent notation of one and the same term
poses a serious problem for Information Retrieval systems. To illustrate, consider a gene
symbol like MIP-1-alpha, which can be written as MIP-1alpha or (MIP)-1 alpha, among
others. A generic tokenizer, even if it is trained specifically on such a corpus, cannot
adapt well enough and will in many cases perform poorly for two reasons. First, it will
produce different sets of tokens for the same logical entity, depending on the concrete
textual input. Second, the tokens it does generate will most likely not reflect the true
constituents of the gene. In the aforementioned example the individual parts of the gene
are (MIP-1, alpha). Generic tokenization, however, is more likely to produce a segmenta-
tion such as (MIP, 1, alpha). As a result, implementing domain-specific tokenizers that
incorporate expert knowledge can improve performance significantly [JZ07].
2.2. Tokenization Methods
As previously mentioned, the problem of tokenizing German and English texts can be
considered solved for all practical purposes. For other languages this is still a topic of
research. The Special Interest Group of the Association for Computational Linguistics
(SIGHAN) regularly organizes a “Chinese Language Processing Bakeoff” with word seg-
mentation being a core discipline. This section will use Chinese as a concrete example
to illustrate the difficulties one faces when tokenizing input that lacks a distinct word
delimiter. However, these challenges are by no means limited to processing Asian lan-
guages. The very same issues arise in the context of speech recognition systems where
a continuous stream of sound has to be segmented, or in OCR applications.
In Western languages words are made of individual characters that have no meaning by
themselves. This implies directly that a word, as a “sequence of meaningless characters”,
is atomic with respect to its meaning. In Chinese, on the other hand, the situation is
different. Any Chinese character, or symbol, has an intrinsic meaning. When multiple
characters form a word, the meaning of the whole does not necessarily reflect the mean-
ing of its parts. As a consequence, it can be very difficult to determine which characters
belong together. Consider the following sequence of symbols as part of a sentence:
S = (t1, t2, t3, t4, t5)
Such a sequence can potentially harbor two kinds of ambiguities. First, assume both the
entire sequence S, as well as a partition S1 = (t1, t2) and S2 = (t3, t4, t5) of S form valid
words. That is to say, the word S is composed of other words. This is called a conjunctive
ambiguity in S. At first glance, this seems to be a form of compounding (see section
3.2.3). However, the key difference is that it happens on the grammatical level, while
compounds are a purely semantic issue3.
A second, less common way in which S can be ambiguous to segment is if overlapping
subsequences can be found. For instance, let S1 = (t1, t2, t3) and S2 = (t3, t4, t5) so that
S, S1 and S2 are all valid words. In this case, S is said to possess disjunctive ambiguity.
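Both kinds of ambiguity can be made visible by enumerating all dictionary-valid segmentations of a character sequence. A small recursive sketch, using a toy English analogue:

```python
def segmentations(s, dictionary):
    """All ways to split string s into a sequence of words from `dictionary`."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in dictionary:
            # segment the remainder recursively and prepend the matched word
            for rest in segmentations(s[i:], dictionary):
                results.append([head] + rest)
    return results

words = {"the", "dragon", "flies", "dragonflies"}
print(segmentations("thedragonflies", words))
# [['the', 'dragon', 'flies'], ['the', 'dragonflies']]
```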
We will now discuss four different techniques that can be employed to perform word
segmentation on texts with no clear word boundary indicators. With the exception of
dictionary-based tokenizers, all of the following methods are based on collocations. A
collocation is a sequence of words that occurs more frequently than would be expected
by chance alone. For instance, a speech recognition system could use collocation data
to distinguish “grade A beef” from “gray day beef” because the latter words have a lower
probability to occur in sequence. Once collected, this information can be harnessed in
various ways to build sophisticated tokenizers.
3Consider the concatenated sentence of words “thedragonflies” which can be segmented as either “the dragonflies” or “the dragon flies”, with inherently different repercussions on both the grammatical structure and the semantics implied by it.
Chapter 2. Tokenization
2.2.1. Dictionary-Based
One of the most obvious ways of performing tokenization is to use a dictionary that
defines the set of all valid words. The tokenizer can then match the input characters
against the dictionary to determine which sequences form a valid word and which do
not. Implementations of dictionary-based methods can be classified by the following
three characteristics (see [WK92]):
• direction
The direction specifies whether the tokenizer will, given a start character, search
to the right (forward) or in the opposite direction (backward). For reasons of
simplicity, we will ignore the issue of left-to-right and right-to-left languages, such
as German and Arabic, respectively.
• greediness
If a tokenizer returns the minimum matching, it is called greedy. Likewise, a
tokenizer that returns the maximum matching is ungreedy. In other words, greedy
algorithms are satisfied with local optima whereas ungreedy ones try to find global
optima.
• omission/addition
The tokenizer can either start with the full sequence of characters and successively
omit elements, or start with the empty sequence and gradually add characters.
This is called omission-based or addition-based, respectively.
It is obvious that these characteristics are not orthogonal. For instance, it makes little
sense to implement a greedy addition-based tokenizer as it would practically always stop
after reading the first input character.
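To make this concrete, the following is a minimal sketch of a forward, addition-based, ungreedy dictionary tokenizer; the dictionary and input are the contrived examples from above, and the single-character fallback for unknown input is our own simplifying assumption:

```python
def tokenize_max_match(text, dictionary):
    """Forward, addition-based, ungreedy tokenizer: starting from each
    position, gradually add characters and emit the longest valid word."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character for unknown input
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in dictionary:
                match = text[i:j]  # remember the longest match found so far
        tokens.append(match)
        i += len(match)
    return tokens

words = {"the", "dragon", "dragonflies", "flies"}
print(tokenize_max_match("thedragonflies", words))  # ['the', 'dragonflies']
```

Note how the ungreedy strategy resolves the conjunctive ambiguity of “dragonflies” in favor of the longest dictionary entry; a greedy variant would stop at “dragon” and then fail on “flies” only if that word were missing from the dictionary.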
Given an adequately large dictionary, this class of tokenizers was shown to achieve an
identification rate as high as 98%. The dependency on a dictionary, however, leads
to poor tokenization of texts that contain new words and leaves very little room for
incorporating domain-specific expert knowledge.
As a side note, dictionary-based tokenizers are, in a sense, “doing things backwards”: they take a list of words to generate tokenization rules, when it should be the other way around.
2.2.2. N-Grams
Given a sequence S, an n-gram of that sequence is a continuous subsequence of length
n. An n-gram based tokenizer is first trained with a pre-tokenized text corpus. Based
on that data a probabilistic model is built that can be used to predict the next token
of a given sequence. In other words, an n-gram tokenizer calculates the probability
P (ti | ti−1, ti−2, . . . , ti−(n−1)) to decide what the next token should be based on past experience.
Suppose an OCR system encounters the word often preceded by the words the and power.
If trained on scientific texts, the model will predict that the sequence (the, power, of)
(as in “five to the power of ten”) is statistically more probable than the sequence (the,
power, often) (as in “the power often goes out”). Another typical application for this
technique is in speech recognition to distinguish oronyms (i.e. words that sound alike,
but are spelled differently), or in spam detection systems [Zdz05].
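The counting behind such a model can be sketched as follows; the toy corpus and the choice of n = 3 are illustrative assumptions only:

```python
from collections import Counter, defaultdict

def train_trigrams(tokens):
    """Count which token follows each pair of preceding tokens."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def most_probable_next(counts, a, b):
    """Predict the most frequently observed continuation of (a, b)."""
    following = counts.get((a, b))
    return following.most_common(1)[0][0] if following else None

corpus = "five to the power of ten and two to the power of five".split()
model = train_trigrams(corpus)
print(most_probable_next(model, "the", "power"))  # 'of'
```

A real tokenizer would additionally smooth these counts into probabilities so that unseen sequences do not receive a probability of zero.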
Real-world implementations, such as the n-gram tokenizer proposed in [KXW03], achieve
a recall of up to 98% on morphemic languages. However, a common criticism of this
approach is that it is essentially problem-agnostic, i.e. it does not incorporate any lin-
guistic knowledge and operates purely on an abstract mathematical level. As a result
of this, n-gram tokenizers generally do not handle out-of-vocabulary words well and are,
as such, not better than dictionary-based methods, at least on a theoretical level.
2.2.3. Perceptron Learning
Li et al. have successfully used a Perceptron with Uneven Margins for the word segmen-
tation of Chinese [LMBC05]. The key idea is to classify each character in a language
as either a single-character word or a character that occurs at the beginning, middle
or end of a multi-character word. For each of these cases a classifier was trained on a
hand-annotated corpus using the one-vs-all paradigm. A sliding window of size five was
used to generate the input for each perceptron. In other words, the input consisted of a
center character and the two characters both preceding and following it. This was done
so collocations could be recognized in order to improve accuracy. The resulting tokenizer
was shown to achieve f-measures between 92.7% and 95.6% for the four SIGHAN test
corpora.
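The sliding-window input described above can be sketched as follows; the feature encoding and the padding sentinels are our own illustrative assumptions, not those of [LMBC05]:

```python
def window_features(sentence, i, size=2):
    """Build the classifier input for the character at position i: the
    character itself plus the `size` characters before and after it,
    padded with sentinels at the sentence boundaries."""
    padded = ["<s>"] * size + list(sentence) + ["</s>"] * size
    center = i + size
    return [f"c{k}={padded[center + k]}" for k in range(-size, size + 1)]

print(window_features("ABCDE", 1))  # ['c-2=<s>', 'c-1=A', 'c0=B', 'c1=C', 'c2=D']
```

Each such feature vector would then be fed to the four one-vs-all perceptrons, which decide whether the center character is a single-character word or the beginning, middle or end of a multi-character word.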
A major advantage of this approach over the other techniques besides its relative sim-
plicity is that it is character-based. Given a large enough training corpus, the classifiers
are exhaustive and can cover the entire set of characters of a given language, whereas
word-based systems are naturally limited by their dictionary (either the implicit dictio-
nary of the training corpus, or the explicit look-up dictionary). As a consequence, this
class of tokenizers is significantly more robust against previously unknown words.
2.2.4. Search-Based
An interesting alternative to conventional word segmentation methods was proposed by
Wang et al. using common web search engines [WQL07]. The tokenization process is
split into three steps: segment collecting, segment scoring and segment selection.
After doing an initial partitioning by punctuation marks, the resulting segments are
submitted to a search engine such as Google or Yahoo. The returned results are analyzed
with respect to what parts of the segment appear together. To illustrate, consider the
contrived example of the English search query q =“Jude has a second major in criminal
law”4. Running the query will show that the sequences criminal law, second major and jude
law appear most frequently within the search results. The set of all returned segments
is S ⊆ P(q).
In the second step, all segments in S are ranked using a scoring function σ : P(q)→ R.
An obvious function to use is the segment frequency, i.e. the ratio between the number
of search results that contained the segment and the total number of results. Wang et
al. [WQL07] have also evaluated a Support Vector Machine-based scorer which in the
end yielded a higher recall but slightly lower precision and an f-measure of around 88%.
A subset s ⊆ S is called a valid segmentation if it can reconstruct the original query,
that is to say if s is a partition of q. Note that the word order is important, so the
segment jude law from the above example would never be part of a valid segmentation.
The set of all valid segmentations of q is S̃(q) ⊆ P(S). The final result R is then the valid
segmentation with the highest average score:
R(q) = arg max_{s̃ ∈ S̃(q)} (1/|s̃|) · ∑_{s ∈ s̃} σ(s)
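The selection step can be sketched as an exhaustive search over in-order partitions; the scoring values below are contrived for illustration:

```python
def best_segmentation(query, sigma):
    """Among all in-order partitions of `query` (a tuple of words) into
    scored segments, return the one with the highest average score."""
    best_avg, best_parts = float("-inf"), None

    def search(i, parts):
        nonlocal best_avg, best_parts
        if i == len(query):
            avg = sum(sigma[p] for p in parts) / len(parts)
            if avg > best_avg:
                best_avg, best_parts = avg, parts
            return
        for j in range(i + 1, len(query) + 1):
            segment = query[i:j]
            if segment in sigma:  # only scored segments may be used
                search(j, parts + [segment])

    search(0, [])
    return best_parts

q = ("jude", "has", "a", "second", "major", "in", "criminal", "law")
sigma = {("jude",): 0.2, ("has",): 0.1, ("a",): 0.1, ("second", "major"): 0.8,
         ("in",): 0.1, ("criminal", "law"): 0.9, ("jude", "law"): 0.7}
print(best_segmentation(q, sigma))
```

Because only contiguous slices of the query are considered, the non-adjacent segment jude law can never appear in a result, mirroring the word-order restriction noted above.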
While so far not much research has been done for search-based tokenization, it appears
to be a promising approach that has several advantages over traditional tokenization
4In this example, we assume that we already have word tokens. We are now trying to segment on a semantic level, i.e. we do chunking. While this is not the intended application of the algorithm, the same principles can be applied and it serves the illustrative purpose nicely.
strategies. In particular, it is neither dictionary-bound nor does it require training. One
downside is that current search engines often do optimizations and query modifications
that are unfavorable for this method, e.g. stop words are removed from the query or some
form of stemming is done (see the following chapter). Using a search engine specifically
tailored to this purpose could further improve performance.
In this chapter we have introduced the concept of tokenization and illustrated its prac-
tical use in NLP applications. It is clear that inaccurate tokenization can have many
negative consequences. Tokens represent the “physical” structure of a text; errors made
here will propagate quickly throughout subsequent analytical steps. For instance, if a
word is not tokenized correctly, dictionary look-ups will fail. Likewise, if a comma or
period is missed, grammatical analysis is negatively affected. In the following chapter
we will assume a correct tokenization. Given the relative simplicity with which German
can be tokenized, this is not too bold an assumption to make.
Chapter 3.
Natural Language Processing and Text
Mining Techniques
Programming languages are in most cases easy to parse due to their context-free gram-
mar and well-defined syntax. Texts in natural language, on the other hand, are very com-
plex and oftentimes ambiguous in both syntax and semantics. The science of recognizing,
processing and understanding human language is subsumed under the term Natural Lan-
guage Processing or Computational Linguistics. It is an interdisciplinary field that com-
bines computer science, logic, mathematics, linguistics and others. Natural Language
Processing encompasses a wide range of methodologies ranging from speech recognition
to the automated extraction of semantic features.
The term Data Mining describes the discovery of novel and useful information in vast
amounts of data. Text Mining is the application of Data Mining techniques to the do-
main of natural language texts. The pivotal difference here is that, unlike in traditional
Data Mining applications, the data is not already structured in a more or less rigorous fashion and is thus not readily processable by machines.
For this reason, Natural Language Processing and Text Mining often go hand in hand.
The textual input is first processed by NLP tools, then transformed into an appropriate
representation suitable for the employment of Text Mining algorithms. The result of
these algorithms is often fed back into the NLP system and the cycle starts anew.
In this chapter, we will first briefly motivate the importance of NLP in real-world applications. This is followed by a discussion of the most commonly utilized NLP techniques. Of
particular interest here are their technical interdependencies, how they relate to one
another on a global level, their fundamental limitations and the concrete applicability
in the context of our work. Precise linguistic terminology is introduced as needed.
3.1. Applications of NLP
Humans are constantly exposed to natural language both in written and spoken form.
With computers permeating more and more aspects of our everyday lives, processing
language in an intuitive and efficient manner becomes ever more important. Some of
the most established, real-world areas of application are:
• Word Processing
Word Processing was one of the first areas into which NLP techniques found their way. Features such as context-sensitive thesauri and automated grammar and spell checking are now standard in any modern word processor.
• Information Retrieval
The exponential growth of knowledge makes finding information increasingly difficult. Traditional methods like full-text search scale poorly and no longer meet the demands of handling the vast document collections of current corpora.
• Machine Translation
The automated translation of text was one of the initial motivations for NLP and
is still a subject of great interest. It is often regarded as the one application by
whose success the entire field of research can (or should) be measured.
• Speech Recognition and Synthesis
The recognition and synthesis of language forms the most natural interface between
human and machine. It has become an integral part of barrier-free applications,
guidance systems or automated directory assistance.
• Automatic Summarization
Summarizing a text by either reducing it to its most relevant phrases or by creating
an abstract from scratch is a valuable tool when large amounts of texts need to
be surveyed, e.g. performing automated news aggregation from various external
sources or outlining the contents of the latest scientific papers.
• Sentiment Analysis
Services like Metacritic[2] utilize NLP to autonomously evaluate reviews and clas-
sify them as either favorable or unfavorable. Manufacturers can employ techniques
like this to gather feedback about their products in a fully automated fashion.
3.2. Morphological Techniques
The form a word takes typically changes depending on the grammatical context in which
it is used. The manner in which a language handles grammatical relations and relational
categories such as case, mood, voice, tense, aspect, person, number and gender is called
inflection. The analysis and description of these inflectional rules is called morphology.
For instance, the German verb fliegen (eng. to fly) can, among others, take the form
of flöge and geflogen. All inflected forms are representations of the same abstract mor-
phological unit called its lexeme1. It is evident that while the inflection can potentially
carry useful information, it complicates the identification of a word’s lexeme and thus
its intrinsic meaning.
It is important to realize that a lexeme is a theoretical construct and has, as such, no
innate concrete textual representation. In practice, a lexeme is generally typified by its
canonical (or “dictionary”) form, or lemma (see section 3.2.2). Consequently, correctly
inferring the lexeme of a given word is an important step as it abstracts from its textual
representation and allows for processing on the semantic level.
3.2.1. Stemming
The part of a word that is common to all its inflected variants is called a stem. The
stem itself is often not a genuine word. Consider the set {independent, independence,
independently} whose common stem is independ (the trailing e is usually omitted).
Originally, the primary application of stemming was in Information Retrieval where
its use has led to significant improvements in recall of search results with only minor
impacts on precision as concluded by Kraaij and Pohlmann [KP96]. Grouping similar
words together is also very useful when techniques like tf.idf are to be employed because
it decreases the spread of relevant terms across the search-space. In that context, it is
of no importance that the stem itself is not meaningful.
Some stemmers work purely algorithmically in the sense that they have a fixed set of
rules and do not respect the peculiarities of a given language. These stemmers do not
perform well in cases where the stem of a word varies between its inflections. Words
with this characteristic are called suppletives. This happens when one inflected form is
of a different historical origin than another. Examples of this are irregular verbs (to go,
1A lexeme is, in a sense, the abstract class of a word, while the inflections could be seen as concreteinstantiations of that class.
went, gone), but it can also appear with other word classes, mostly adjectives (good,
better, best) or nouns (person, people), though the latter is rarely found in English or German. However, even simple derivational forms frequently cause problems. Most
stemmers do not map closely related words like cohere and cohesion to the same stem
because they (rightfully) consider the common stem coh not to be valid.
Conceptually, stemmers can be divided into two classes. A light (or weak) stemmer
conflates only very closely related words such as (replace, replacement, replacing). This
conservative strategy often leads to under-stemming, i.e. words are not being grouped
together when they should. Heavy (or strong) stemmers merge words more aggressively.
This is more prone to mapping unrelated words to the same stem, for example (divisor,
division, dividend). This is called over-stemming.
The first stemming algorithm was proposed by Julie Beth Lovins in [Lov68]. Since then,
a variety of stemming methods has been developed:
Brute Force The stem is retrieved by matching the word against a static look-up table
which relates inflected forms to their stem. While being comparatively inflexible,
this approach has the advantage of being able to reliably stem any known word,
including suppletive forms.
Affix Stripping By removing common affixes (i.e. pre- and suffixes), according to a
given set of rules, the word is successively reduced to its stem. The correct iden-
tification of affixes is often problematic. For instance, in uncommon the affix un-
should be removed, but in understand it must not. This class of stemmers cannot
handle suppletives.
Probabilistic Stemming Probabilistic stemmers are trained on a set of inflected words
of which the correct stem is known. A stochastic model is built that can then be
applied to unknown words in order to deduce their most likely stem, i.e. the stem
of which the expected error is minimal.
Hybrids The precision of a stemming algorithm can be improved by combining multiple
approaches. For example, an affix-stripping system can be improved by adding
a list of all irregular verbs and the most commonly used suppletives. This can
significantly increase its overall performance without restricting its versatility.
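The hybrid strategy can be sketched as affix stripping backed by a brute-force exception table; the rules and table entries below are a contrived toy and deliberately not the Porter algorithm:

```python
# Toy suffix rules, applied in order; not a complete or linguistically
# accurate rule set.
SUFFIX_RULES = [("iveness", "ive"), ("ization", "ize"), ("ently", "ent"),
                ("ence", ""), ("ment", ""), ("ing", ""), ("ly", "")]

# Brute-force exceptions for suppletive forms the rules cannot handle.
SUPPLETIVES = {"went": "go", "gone": "go", "better": "good", "best": "good"}

def stem(word):
    """Look up suppletives first, then fall back to affix stripping."""
    if word in SUPPLETIVES:
        return SUPPLETIVES[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(stem("independence"), stem("went"))  # independ go
```

The exception table handles exactly the suppletive forms that defeat purely algorithmic stemmers, at the cost of having to enumerate them by hand.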
Languages where words are inflected by stringing individual affixes together (so-called
agglutinative languages) are well suited for stemmers. An entirely agglutinative lan-
guage would have distinct morphemes for each of its morphological paradigms such as
case, gender, tense, number or mood. Examples of strongly agglutinative languages are
Hungarian, Georgian and Finnish. Constructed languages also commonly fall into this
category, for instance Esperanto. The higher the grade of inflectional regularity, the
easier it is to systematically extract a set of stemming rules and algorithmically apply
them to previously unknown words.
Conversely, if inflectional affixes are merged, or if a single affix marks multiple grammat-
ical categories, then this is called a fusional language. For instance, in the Latin word
medicinarum (eng. medicine) the atomic suffix -arum marks both the plural and the genitive. Fusional languages are much more difficult to stem correctly because the inflecting
morphemes are harder to identify. In this case, the use of a dictionary is mandatory.
At this time, the most commonly used stemmer for the English language is the affix-
stripping based Porter Stemmer, see [Por80]. Implementations are freely available at the
Snowball[3] project. A brief comparison of stemming algorithms as well as a discussion
of means to compare them quantitatively can be found in [FZ98].
Stemming naturally entails a loss of information by reducing a word to an artificial root
form, losing a significant amount of its semantic value in the process. Moreover, in
practice, different algorithms often yield different stems for the same input and cannot
be used interchangeably. As a result of this, it is impossible to reliably infer a word’s
lexeme from its stem. Since this is essential for further analysis, stemming is of very
limited use in this work.
3.2.2. Lemmatization
Lemmatization is a technique conceptually similar to stemming with the pivotal dif-
ference that a token is reduced to its “dictionary form”, or lemma. For instance, the
inflected words seeks, seeking and sought are all lemmatized to the same canonical form
seek.
To do this correctly, the lemmatizer (unlike a stemmer) employs contextual information
such as pos-tags (see section 3.4). In doing so, it can distinguish inflectional homonyms
like Stahl (eng. steel) and stahl (past indicative of to steal) whose lemmas are Stahl and
stehlen, respectively, with entirely different meanings.
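This disambiguation can be sketched as a look-up keyed by token and pos-tag; the table and the tag names below are illustrative assumptions, as real lemmatizers derive such mappings from large morphological resources:

```python
# Hypothetical lemma table keyed by (token, pos-tag); hand-written here
# purely for illustration.
LEMMA_TABLE = {("Stahl", "NN"): "Stahl", ("stahl", "VVFIN"): "stehlen",
               ("seeks", "VBZ"): "seek", ("sought", "VBD"): "seek"}

def lemmatize(token, pos_tag):
    """Look up the lemma of a token, using its pos-tag to disambiguate
    inflectional homonyms; unknown tokens are returned unchanged."""
    return LEMMA_TABLE.get((token, pos_tag), token)

print(lemmatize("stahl", "VVFIN"), lemmatize("Stahl", "NN"))  # stehlen Stahl
```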
Aside from its higher overall precision, the primary advantage of lemmatization over
stemming lies in the fact that the lemma - as opposed to the stem - preserves most of a
word’s semantic value, is intelligible for a human, and allows for immediate dictionary
look-ups. On the downside, it is clear that lemmatization is generally slower and requires
more complex preprocessing to provide the necessary contextual information.
The way in which lemmatization is done is even more language-dependent than stem-
ming. Therefore, no single universal algorithm is available, and implementations tailored specifically to one particular language (or a closely related family of languages) are required. This is in most cases considered impractical due to its cost and the need for the involvement of linguistic experts. A more viable alternative is to deploy machine
learning techniques. The most commonly used approach is based on Ripple Down Rules
as described by Plisson et al. [PLME08]. This will be discussed in more detail in section
4.2.2.
3.2.3. Compound Splitting
Natural languages are constantly evolving, new words are incessantly being created while
others slowly become obsolete. One of the most common ways to form new words is
the combination of existing ones into compounds, or polymorphemes. More formally, a
compound is a word that is comprised of multiple stems. For analytical purposes it is
often desirable to reverse this process in order to reduce new, unknown expressions to
familiar ones. The complexity of decompounding varies from language to language. Most
Germanic languages, for example, allow for the formation of ad-hoc compounds. English,
on the other hand, is comparatively unproblematic because newly formed polymorphemes
are not concatenated seamlessly and therefore trivial to dissect.
One designated element of a compound is its head. The head determines the compound's
case, number and grammatical gender (if applicable). For instance, the head of the
German compound Büroklammer (eng. paper clip) is Klammer (singular, feminine). The
compound is thus also singular and feminine. The gender and number of the other
constituent is not inflected.
From a semantic point of view, compounds can be classified into four major categories:
endocentric, exocentric, copulative and appositional.
Endocentric compounds are characterized by the semantic predominance of their head. The primary meaning as defined by the head is narrowed down by the remaining
parts (its modifiers). In other words, the whole compound is a special case (or
“subclass”) of its head element. For instance, the word Kellertür (eng. basement
door) is a special kind of door, namely the one that leads to the cellar.
Exocentric (or possessive) compounds resemble a predicate-argument type structure.
They stand in a has-a relationship with a semantic head that is not explicitly
mentioned. As a result, the meaning of the compound is oftentimes not deducible
from its constituents. An example of this is the adjective blue-blooded, which describes people of noble descent whose untanned skin often made superficial veins visible, giving the impression of blue blood. The unexpressed head here is person.
Exocentric compounds are often used metaphorically.
Copulative (or primary) compounds have no clearly identifiable head in the sense that
no constituent carries significantly more semantic meaning than the others. The compound is
in principle an enumeration of several independent elements and the meaning of
the whole is usually not directly connected with its individual parts. This type of
compound is rarely found in German or English. A contrived example of this class
of compounds is the word for the German state Schleswig-Holstein.
Appositional compounds are made of two equipollent but often contradictory parts
(e.g. singer-songwriter, African-American). The compound is a hyponym of each
constituent and it thus inherits all their individual meanings.
Decomposition of endocentric, copulative and appositional compounds generally yields
sensible results because each of the individual parts contributes to the meaning of the
whole, albeit to varying degrees. Exocentric compounds, on the other hand, are prob-
lematic because the semantics of the entire word is not a sum of its parts. It is apparent
that it is impossible to automatically determine to which category a given compound
belongs. However, exocentric compounds are typically idiomatic phrases and not gener-
ated ad-hoc in everyday speech. They can thus usually be found in dictionaries and do
not need to be decompounded.
Traditionally, compound splitting was used in the context of automated hyphenation,
i.e. its primary purpose was to identify the most suitable points at which a word could
be broken over two lines. A compound that is hyphenated into its semantic constituents
is considerably easier to read and understand than other syntactically correct segmentations (consider gutter-ball and gut-terball). In other words, decompounding can be
understood as a specialized form of hyphenation with the added restriction that all con-
stituents are genuine words themselves.
The question whether a given word is a compound, and if so, whether a valid segmen-
tation exists, is generally a difficult one. Moreover, if multiple decompositions can be
found it is often not clear which one is correct, even to a human reader with knowledge
of the context in which it is being used. The word Druckerzeugnis can be split into
Druck·erzeugnis (eng. print·product) as well as the slightly unorthodox Drucker·zeugnis
(eng. printer·certificate), both of which are valid and in the same general domain of
meaning. More commonly, erroneous decomposition leads to syntactically sound but
nonsensical results, e.g. See·lachse (eng. sea·salmon) and Seel·achse (eng. soul·axis). If
there is a chance of ambiguity, it is often better not to decompound in order to avoid
semantic distortion (conservative decompounding). A more in-depth analysis of this
problem with regard to sense-conveying hyphenation can be found in [BN85].
One could assume that it is advisable to split compounds whenever an unambiguous
decomposition is possible. However, compounds that are already semantically relevant
should not be dissected as there is only little (if any) information to be gained. In re-
ality, the decomposition of this type of compound will more often than not only have
a negative effect on the analysis. Consider the germanized Anglicism Teenager that al-
ready has a distinct innate meaning. The correct decomposition is Teen·ager and yields
little additional information. However, forcibly dissecting it in a German context results
in the nonsensical segmentation Tee·nager (eng. tea·rodent) which retains none of the
original meaning.
For these reasons, decompounding is a technique that can be very prone to semantic
errors if applied too liberally. If done cautiously, however, it can be a useful tool for
establishing semantic information for previously unknown words.
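The conservative strategy can be sketched as follows; the lexicon is a contrived stand-in for a full dictionary, and for simplicity only two-part splits are considered:

```python
def split_compound(word, lexicon):
    """Conservative decompounding: return a two-part split only if exactly
    one segmentation into dictionary words exists, otherwise None."""
    splits = [(word[:i], word[i:]) for i in range(1, len(word))
              if word[:i].lower() in lexicon and word[i:].lower() in lexicon]
    return splits[0] if len(splits) == 1 else None

lexicon = {"keller", "tür", "druck", "erzeugnis", "drucker", "zeugnis"}
print(split_compound("Kellertür", lexicon))       # ('Keller', 'tür')
print(split_compound("Druckerzeugnis", lexicon))  # None: ambiguous, so skipped
```

Kellertür is split because its segmentation is unique, while the ambiguous Druckerzeugnis is deliberately left intact, exactly as the conservative policy above demands.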
3.3. Stop Words
Every language contains stop words or noise words, i.e. words that occur too frequently
to be of any specificity or do not possess a significant innate meaning. Put differently,
stop words are words that can be left out with little to no impact on the overall infor-
mational value of the text.
Whether a given word qualifies as noise depends on the area of application and cannot
be defined universally. For instance, in the medical domain, the term patient is highly
common and could be considered as a stop word, while in other domains it might be
important. Articles, conjunctions, pronouns, pre- and postpositions are typically con-
sidered irrelevant regardless of the domain.
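Once a stop list has been fixed for the domain, filtering is a straightforward set look-up; the list below is a tiny illustrative sample:

```python
# A tiny, illustrative stop word list; real lists are domain-specific
# and considerably larger.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The power of the crown".split()))  # ['power', 'crown']
```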
3.4. Part-of-Speech Tagging
Each word of a sentence is implicitly assigned a lexical category or part-of-speech (pos).
This is not to be confused with a word’s grammatical function such as subject or object.
A pos-tagger is a program that, given a tokenized sentence, infers each token’s lexical
class and tags it accordingly. The most basic categories in the English language are
nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions and interjections.
For practical purposes it is often helpful to use a more fine-grained categorization like
the STTS tag set (see appendix A). Because natural languages are not context-free,
the decision to which class a given word belongs cannot normally be made without
analyzing the sentence in its entirety. For instance, drink can be both a noun (“I need
a drink”) and a verb (“I need to drink”). At worst, even a sentence as a whole can be
ambiguous (“Fruit flies like a banana”).
Using the commonly adapted notation of 〈word〉 / 〈pos〉, the output of a pos-tagger using
the Penn Treebank tag set as described in [San90] could look something like this:
There/EX ’s/VBZ no/DT place/NN like/IN home/NN ./SENT
Here, EX stands for existential-there, VBZ for verb-3rd-person, DT for determiner, NN
for singular common noun, IN for preposition and SENT for end-of-sentence.
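Parsing this notation back into (word, tag) pairs is a simple right-split, sketched here:

```python
def parse_tagged(line):
    """Split a '<word>/<pos>' annotated sentence into (word, tag) pairs.
    rsplit is used so tokens containing '/' keep only their final tag."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]

tagged = "There/EX 's/VBZ no/DT place/NN like/IN home/NN ./SENT"
print(parse_tagged(tagged)[:2])  # [('There', 'EX'), ("'s", 'VBZ')]
```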
A multitude of different techniques to approach this problem have been developed. A
more in-depth discussion of the methods outlined below can be found in [WM06].
Rule-based Purely rule-based taggers as first proposed by Klein and Simmons [KS63]
and Greene and Rubin [GR71] are based on a two-phase algorithm. In the first
phase, each word is assigned a list of possible pos-tags based on a dictionary. In the
second phase, these lists are narrowed down by successively applying a predefined
set of rules until each list cannot be shortened any further (ideally to a size of one).
More recent implementations have refined this approach [RV95].
Rule-based systems do not require a training corpus, but they are very complex,
require linguistic experts to implement and are not transferable to other languages.
Stochastic Stochastic taggers are based on the observation that some pos-combinations
are more likely to occur than others (e.g. a verb followed by a noun is less probable
than an article followed by a noun). These probabilities are derived from analyzing
pre-annotated text corpora such as the Brown Corpus[4], the Negra Corpus[5] or
the British National Corpus[6].
Transformation-based learning Transformation-based taggers as suggested in [Bri92]
combine rule-based and stochastic means. The key element here is that it adopts
new rules during its learning phase and can resort to morphological rules when
it encounters new words not found in its internal dictionary. It has been shown
in [RS95] that the Brill tagger is up to ten times faster than purely stochastic
approaches.
Decision Trees Stochastic taggers typically assign an epsilon probability to any pos-
sequence that did not occur in their training corpus. This is because they cannot distinguish between grammatically impossible sequences and sequences that merely happened not to occur by chance. As a result, they usually require large training
corpora to perform adequately. To alleviate this problem, taggers based on binary
decision trees were developed [Sch94].
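The core of the stochastic approach, estimating tag-transition probabilities from an annotated corpus, can be sketched as follows; the two toy tag sequences stand in for a real pre-annotated corpus:

```python
from collections import Counter, defaultdict

def transition_probabilities(tagged_sentences):
    """Estimate P(tag_i | tag_{i-1}) from pos-tag sequences, using a
    sentence-start sentinel '<s>'."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        for prev, cur in zip(["<s>"] + tags, tags):
            counts[prev][cur] += 1
    return {prev: {tag: n / sum(c.values()) for tag, n in c.items()}
            for prev, c in counts.items()}

probs = transition_probabilities([["DT", "NN", "VBZ"], ["DT", "JJ", "NN"]])
print(probs["DT"])  # {'NN': 0.5, 'JJ': 0.5}
```

Any transition absent from the training data, e.g. a verb directly following the sentence start here, receives no probability mass at all, which is exactly the sparsity problem the decision-tree approach addresses.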
It was shown that taggers can reliably achieve accuracies of 93% and higher [Der89],
[Cha97] and can perform as well as 97.5% for German [Sch95]. This seemingly high
percentage still implies an error roughly every 40 words, and this is under “perfect”
conditions where the test and training texts come from the same corpus. It appears,
however, that progress is asymptotically reaching a limit. There has been only very little
progress in the past 10 to 15 years [Gie08]. In real world applications where training
and actual data vary greatly, the number of expected errors is typically higher.
3.5. Named Entity Recognition
In Information Extraction it is often necessary to identify names of locations, persons or
organizations that occur within a text (this is often extended to include temporal and
numeric expressions as well). This process is called Named Entity Recognition (NER).
Generally, NER systems annotate the text by assigning tags to individual tokens (or sets
of consecutive tokens). The typical output of such a system looks like this:
〈Arthur Dent〉TYPE=PERSON worked for the 〈UN〉TYPE=ORGANIZATION since
〈2007〉TYPE=DATE and visited the 〈Czech Republic〉TYPE=LOCATION frequently.
This example follows the ENAMEX[7] tag set presented at the Message Understanding
Conference MUC-6. Other tag sets have since been introduced, such as the more
fine-grained BBN[8] and Sekine[9] tag sets, which contain approximately 90 and 200
types and subtypes, respectively. Most implementations are based on Maximum Entropy [CN03] or Hidden
Markov Models [KSNM03]. In any case, they usually require additional resources like
dictionaries and gazetteers (geographical directories) to work effectively.
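As a minimal illustration of how consecutive tokens can be grouped and tagged, the following sketch performs a greedy longest-match lookup against a toy gazetteer. Both the entries and the approach are simplifications; real systems combine such lexicon lookups with statistical models.

```python
# Hypothetical gazetteer mapping token sequences to entity types.
GAZETTEER = {
    ("Arthur", "Dent"): "PERSON",
    ("UN",): "ORGANIZATION",
    ("Czech", "Republic"): "LOCATION",
}

def tag_entities(tokens):
    """Greedy longest-match lookup of consecutive tokens in the gazetteer."""
    result, i = [], 0
    max_len = max(len(k) for k in GAZETTEER)
    while i < len(tokens):
        for n in range(max_len, 0, -1):          # prefer longer matches
            span = tuple(tokens[i:i + n])
            if len(span) == n and span in GAZETTEER:
                result.append((" ".join(span), GAZETTEER[span]))
                i += n
                break
        else:
            result.append((tokens[i], None))     # not a named entity
            i += 1
    return result

tokens = "Arthur Dent worked for the UN".split()
print(tag_entities(tokens))
# [('Arthur Dent', 'PERSON'), ('worked', None), ('for', None),
#  ('the', None), ('UN', 'ORGANIZATION')]
```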
Although NER systems have achieved f-measure scores of up to 93.39% in the Message
Understanding Conference MUC-7, it was concluded by Poibeau and Kosseim [PK00]
that performance depends heavily on the adaptation to the characteristics of a given
corpus, including domain-specific grammar and the availability of suitable gazetteers.
Because of this, Named Entity Recognition is difficult to adapt to applications like ours
that are not domain-restricted. Furthermore, our work is predominantly based on the
analysis of synsets which inherently contain semantic information exceeding that of most
NER-tags. A token that is recognized as a named entity but has no synset associated
with it cannot be processed further by our system. However, it can still be of use for
other purposes such as cross-referencing specific named entities across multiple texts.
3.6. Coreference Resolution
A text often contains many distinct terms that refer to the same logical entity, called
their referent or antecedent. The process of identifying these words and grouping them
unambiguously into equivalence classes is called coreference resolution (CR).
To illustrate, consider the following excerpt:
Obama flew to Berlin where he talked with Merkel about it. After exchanging
pleasantries with the German chancellor the President returned home.
To a human reader with the necessary common knowledge it is obvious that {Obama,
he, President} and {chancellor, Merkel} form equivalence classes: the members of each
class corefer to the same entity.
A distinction is made between endophoric and exophoric references. Endophora have
an intralinguistic antecedent, i.e. they refer to an entity that is explicitly mentioned
in the text and can, by a human, easily be identified even without much contextual
knowledge. In the sentence above, he is such an endophoric term. By contrast, exophora
refer to entities that are textually undisclosed. In other words, their referent cannot be
deduced from the text alone. In the example above, it is an exophora that
cannot be resolved.
Depending on their positioning relative to their referent, references can further be clas-
sified as either anaphoric or cataphoric. An anaphora is a backward reference to a
previously mentioned entity (“Chris sat down, he was tired”). Conversely, a cataphora is
a forward reference that is resolved at a later time (“He was tired, so Chris sat down”).
Even in situations that can be considered grammatically simple it is oftentimes difficult,
if not impossible, to identify an endophora’s referent with certainty:
(1) The baby played with the cat. It meowed.
(2) The baby played with the cat. It giggled.
(3) The baby played with the cat. It was cute.
In the first case, it clearly refers to the cat. In the second case, the referent is most likely
the baby, although one can’t be entirely sure. In the last case, it is not at all evident
whether it is a reference to the baby, the cat, or the act of them playing together.
If references are strung together they can form coreference chains of theoretically
arbitrary length. Considering the fact that such a chain can consist of proper and common
nouns (that do not even need to be synonymous), pronouns, substantive adjectives and
many others, it becomes clear that coreference resolution is a task of high complexity
and ambiguity.
Extensive research has been done in this field with seminal contributions by Lappin
and Leass [LL94], McCarthy and Lehnert [ML95] and Soon et al. [SNL01], among
others. Today, statistical methods yield the best results [GHC98]. They commonly infer the
most probable referent by considering semantic, syntactic and lexical features that
are largely orthogonal to each other, such as gender, part-of-speech, animacy, number,
grammatical function and many others. The minimum-edit-distance feature proposed
by Strube et al. [SRM02] seems very promising as it matches our area of application in
both average text length and domain independence.
The practical use of CR algorithms, however, poses a considerable challenge. Research
implementations typically assume the availability of the required set of features, and
given that assumption, they perform reasonably well. However, in real-world applica-
tions, these features are commonly not readily available and their acquisition requires
very complex and computationally expensive preprocessing. For instance, the grammat-
ical function feature depends on a deep grammatical analysis which is of substantial
difficulty for morphologically rich languages like German (see [Pav05] for a brief discus-
sion). In addition, most approaches are either language dependent, limited to a specific
subset of the problem (e.g. pronouns) or are specific to a certain domain and do not
generalize well. In the past, research has focused mainly on the English language. For
this and other reasons discussed in more detail in section 6.1, significant restrictions are
imposed on the applicability of available solutions and CR in general for this work.
3.7. Ontologies
An ontology is an “explicit specification of a conceptualization” [GG93], or less formally,
it is a model of a domain that describes what objects and concepts exist, the properties
they possess and how they are related to each other. In real-world applications, the
complexity of ontologies grows rapidly with the size of the domain they cover. Conse-
quently, they are usually limited to very specific areas of expertise. Typical examples
are the SIOC Core[10] ontology for describing information from online communities on
the Semantic Web, the Open Biomedical Ontologies Foundry[11] or the Gene Ontology[12]
for gene product properties.
3.7.1. Structural Description
The structure of lexical semantics can be interpreted as an ontology: sets of synony-
mous lexemes form concepts (or synsets)2 which are related through lexical relations
like hypernymy, meronymy or antonymy. Such a lexical ontology is called a word net.
The predominant implementation and de-facto standard of a word net for the English
language is WordNet[13] from Princeton University. The term “WordNet” has
been widely adopted in the literature and is used interchangeably with “word net”.
Hypernymy/Hyponymy The terms hypernymy and hyponymy describe a generalization
and specialization relationship between two nouns. The word panda is a hyponym
of animal and conversely, animal is a hypernym of panda.
Troponymy A verb that describes the manner of an action more precisely than another,
more general verb is called a troponym of that verb (e.g.
trample is a troponym of walk).
Holonymy/Meronymy Holonymy and meronymy describe the two opposing views on
a part-whole relation. For instance, branch is a meronym of tree, and forest is a
holonym of tree.
Antonymy Two concepts with opposite meanings are antonyms of each other, so high
is an antonym of low and vice versa.
Entailment A verb that necessarily implies another is said to entail it. The word
snore entails the concept of sleep. Troponymy is a special case of entailment, the
difference being that troponyms are always temporally coextensive3.
Causation If one action is the logical cause of another, a causation between the two
exists. The pair show and see forms such a relationship. This differs from an entailment
in that the verbs of a causation relation have different referents: “Peter shows Lois
the map” (referent: Peter) but “Lois sees it” (referent: Lois). This kind of connection is
comparatively rare in Western languages.
2 These are not synonyms in a strictly linguistic sense. A synset is rather to be understood as a set of
words that can be used interchangeably in most, but not necessarily all contexts.
3 Consider that jump entails hover and trampling entails walking. At any given point in time where
someone tramples, he also walks. But there is only a very brief moment during a jump where a person
hovers. Entailment can even be “temporally reversed” as in the case of succeed entails try.
Figure 3.1.: An exemplary ontology centered around the word eagle
The precise meanings of these relations are not universally agreed upon and they often-
times differ subtly. A more in-depth discussion with respect to WordNet can be found
in [MBF+90].
Figure 3.1 shows part of an exemplary word net originating from the word eagle and three
of its most common meanings (the military airplane, the bird, and the golf term). Note
that only a subset of possible relations is shown. In particular, holonymy and hypernymy
relations are omitted as they are defined implicitly by the existence of meronymy and
hyponymy relations, respectively. For practical reasons, an artificial root node entity
(sometimes written as ⊤) is commonly introduced.
In addition to these relations, a word net often contains further annotations such as a
gloss (a brief and concise description or definition of a word’s meaning) or a short list
of sentences to exemplify the use of that word.
3.7.2. Formalization
In order to work with a word net algorithmically, it is necessary to formalize its structure
and properties. The nature of an ontology, namely it being a set of nodes linked by a
number of interconnections, suggests its interpretation as a graph. Given the set S of
all synsets and a finite set R so that

R = {r_1, r_2, . . . , r_n} with r_i ⊆ S × S,

a word net forms an edge-labeled multigraph G

G := (S, E) with E := ⋃_{r ∈ R} r
In practice, G is commonly directed but can also include undirected edges. The set R
typically consists of (but is not limited to) the relations described in the previous section.
Unless a synset is taken to be synonymous with itself, G is commonly loop-free.
Depending on the implementation, different elements of R possess different properties.
For example, antonymy is a symmetric relation whereas meronymy clearly is not. Likewise,
entailment is transitive but meronymy generally is not - a hand is part of a goalkeeper
which in turn is part of a team, but a hand is hardly part of a team. This
is because “meronymy” is actually a summation of semantically related but logically
different relationships, such as part-of, member-of and made-of.
For subgraphs G_R′ of G defined by

G_R′ := (S, E′) with R′ ⊆ R and E′ := ⋃_{r ∈ R′} r
special properties can be inferred by examining the properties of the relations included
in R′. The subgraph G_{holonymy}, for instance, is directed and acyclic; G_{antonymy} is
undirected and 1-regular. This of course depends on the concrete implementation, in
this case whether antonyms are modeled by one undirected edge or by two directed, but
opposite edges. In the latter case, the subgraph is directed and 2-regular. While the
existence of ⊤ guarantees G’s connectedness, G_R′ is generally disconnected.
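A minimal sketch of this formalization might represent each relation as a set of ordered synset pairs and derive properties such as symmetry directly from the edge sets. The synset names below are illustrative, not taken from GermaNet or WordNet.

```python
# Minimal sketch of a word net as an edge-labeled multigraph G = (S, E):
# synsets are nodes, each relation r in R is a set of ordered synset pairs.
relations = {
    "hyponymy": {("panda", "animal"), ("eagle", "animal")},
    "meronymy": {("branch", "tree")},
    "antonymy": {("high", "low"), ("low", "high")},  # two directed, opposite edges
}

# E is the union of all relations, with each edge keeping its relation label.
edges = {(u, v, r) for r, pairs in relations.items() for (u, v) in pairs}

def is_symmetric(r):
    """A relation is symmetric iff every edge has its reverse in the same relation."""
    return all((v, u) in relations[r] for (u, v) in relations[r])

print(len(edges))                # 5
print(is_symmetric("antonymy"))  # True
print(is_symmetric("meronymy"))  # False
```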
The acyclic graph

G_sub := G_{hyponymy, hypernymy} with ⊤ ∈ S

is called the subsumption hierarchy. Because a synset can be the hyponym of multiple
other concepts (comparable to multiple inheritance), G_sub is not a tree. For instance,
manager is a hyponym of both organism and causal agent.
This formalization of a word net allows for the definition of various metrics. Of particular
interest is the measure of semantic relatedness, or its inverse, the semantic distance.
This is based on the idea that each edge represents a semantic connection between two
concepts. The shorter the distance between two nodes is, the more closely they are
related. It is to be noted that a high degree of semantic relatedness does not necessarily
imply a strong similarity. Antonyms, for instance, are adjacent nodes but (by definition)
conceptual opposites. A measure for semantic similarity can commonly be obtained by
restricting a measure for relatedness to the subsumption hierarchy.
3.7.3. Measuring Semantic Relatedness
This section briefly introduces three common measures for the semantic distance between
synsets in an ontology. A more in-depth discussion can be found in [BH06]. It is assumed
that the word net graph is restricted to its subsumption hierarchy Gsub.
Definition: lso
Let G_R′ be an acyclic, directed word net graph and C ⊆ S. The lowest superordinate
(also most specific common subsumer or join) of C is then defined as the supremum
of C with respect to the partial order induced on S by R′:

lso : P(S) → S, lso(C) := sup(C)

It is guaranteed that lso(C) ∈ S if C ≠ ∅ and ⊤ ∈ S.
Definition: distance
Let G_R′ be a word net graph and c1, c2 ∈ S arbitrary. The distance of c1 and c2 is
then defined by dist : S × S → ℕ0 as the number of edges on the shortest path from
c1 to c2. This path always exists if ⊤ ∈ S.
Definition: depth
Let G_R′ be a word net graph, ⊤ ∈ S and c ∈ S arbitrary. The depth of c is then
defined by its distance from the root of the graph:

depth(c) := dist(c, ⊤)
Wu-Palmer The Wu-Palmer similarity measure was proposed in [PW95]. The numerator
is used as a scaling factor.

sim_WP(c1, c2) = (2 × depth(lso(c1, c2))) / (dist(c1, lso(c1, c2)) + dist(c2, lso(c1, c2)) + 2 × depth(lso(c1, c2)))
Leacock-Chodorow Leacock and Chodorow proposed the following formula in [LCM98]:

sim_LC(c1, c2) = −log( dist(c1, c2) / (2 × max_{c ∈ S} depth(c)) )
Lin Unlike Wu-Palmer and Leacock-Chodorow, Lin’s universal similarity measure [Lin98]
is based on probabilities derived from word frequencies measured in a training
corpus.

sim_L(c1, c2) = (2 × log p(lso(c1, c2))) / (log p(c1) + log p(c2))
While Lin’s measure is commonly presumed to give better results than Wu-Palmer and
Leacock-Chodorow, its dependency on word frequencies in a corpus limits its applicability
for our work. We have thus opted for the largely domain-independent Leacock-Chodorow
similarity whenever such a measure was needed.
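The definitions of depth, lso and the first two measures can be sketched on a toy subsumption hierarchy as follows. The taxonomy is invented for illustration; Lin's measure is omitted since it requires corpus frequencies.

```python
import math

# Toy subsumption hierarchy (hyponym -> hypernym); "entity" plays the role of
# the artificial root. Illustrative data, not taken from an actual word net.
hypernym = {
    "animal": "entity", "bird": "animal", "mammal": "animal",
    "eagle": "bird", "panda": "mammal",
}

def path_to_root(c):
    """The chain of hypernyms from c up to the root, including both ends."""
    path = [c]
    while c in hypernym:
        c = hypernym[c]
        path.append(c)
    return path

def depth(c):
    return len(path_to_root(c)) - 1          # depth(c) = dist(c, root)

def lso(c1, c2):
    """Lowest superordinate: the first common node on both paths to the root."""
    ancestors = set(path_to_root(c1))
    return next(a for a in path_to_root(c2) if a in ancestors)

def sim_wu_palmer(c1, c2):
    d = depth(lso(c1, c2))
    n1 = depth(c1) - d                       # dist(c1, lso(c1, c2))
    n2 = depth(c2) - d                       # dist(c2, lso(c1, c2))
    return 2 * d / (n1 + n2 + 2 * d)

def sim_leacock_chodorow(c1, c2):
    max_depth = max(depth(c) for c in hypernym)
    dist = depth(c1) + depth(c2) - 2 * depth(lso(c1, c2))
    return -math.log(dist / (2 * max_depth))

print(sim_wu_palmer("eagle", "panda"))   # lso is "animal": 2*1/(2+2+2) = 1/3
print(sim_wu_palmer("eagle", "bird"))    # 0.8
```

Note that distances in the subsumption hierarchy can be computed via the lowest superordinate, as done in `sim_leacock_chodorow`, because the shortest path between two concepts runs through their lso.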
3.8. Word Sense Disambiguation
In written language, homographs pose a serious challenge to NLP applications since an
erroneous interpretation can have a significant impact on the overall semantics. Word
Sense Disambiguation (WSD) is the process of distinguishing a word’s intended meaning
from other lexicographically possible, but semantically less likely meanings. As with
decompounding and coreference resolution, even humans are oftentimes not able to
identify the intended meaning of a homograph with absolute certainty [MM01].
In many cases, the knowledge of a word’s lexical class can be sufficient to infer its
concrete meaning:
Why did the dog bark at the tree?
In this case, bark is a verb and clearly refers to the sound a dog makes and not to the
part of a plant, even though the occurrence of the word tree might suggest the latter.
Generally, however, disambiguation requires the analysis of the (immediate) context in
which a word occurs. The disambiguation is then done by cross-referencing words that
appear in the context with words that have a high statistical probability of occurring in
its proximity. For instance, if the word spring is found in the proximity of {river, fish},
it is likely to refer to a water source and not to the mechanical device or the first season
of the year.
These probabilities are commonly derived from sense-annotated corpora such as the
DSO corpus[14] or SemCor[15]. The first implementations, however, were based on using
a word’s dictionary definition to compare shared vocabulary (Lesk [Les86]). This was
later complemented by the inclusion of WordNet (Banerjee and Pedersen [BP02]) and
has led to purely ontology-based strategies (Li et al. [LSM95], Fragos et al. [FMS03]).
An interesting alternative was proposed by Mihalcea [Mih07], suggesting the use of
Wikipedia[16] as a sense annotated corpus.
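The gloss-overlap idea behind Lesk's original approach can be sketched in a few lines: the sense whose description shares the most words with the ambiguous word's context wins. The senses and glosses for spring below are invented for illustration.

```python
# Hypothetical sense inventory for the word "spring" with invented glosses.
senses = {
    "water source": "a natural flow of water rising from the ground to a river",
    "mechanical device": "an elastic coil of metal used in machines and devices",
    "season": "the season of the year between winter and summer",
}

def disambiguate(context_words, senses):
    """Return the sense whose gloss has the largest word overlap with the context."""
    context = set(context_words)
    return max(senses, key=lambda s: len(context & set(senses[s].split())))

print(disambiguate({"river", "fish", "boat"}, senses))  # water source
```

Real implementations additionally remove stop words, stem or lemmatize both gloss and context, and extend the glosses with those of related synsets.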
3.9. Frame Semantics
The techniques discussed so far operate locally and do not interrelate tokens with
one another. However, in order to understand natural language on a higher level, it is
necessary to establish semantic connections between different parts of a sentence. This
process is called shallow semantic parsing4.
One approach to this is to examine verbs as the central element of expressions in natural
language. In linguistics, verbs can be classified by their valence, i.e. the number of
“arguments” they take [Vol00].
4 “Shallow” in the sense that it tries to capture only certain semantic aspects. Complete understanding
of a text requires deep semantic parsing which is substantially more complex.
avalent A verb that takes no argument or a pleonastic pronoun (“dummy argument”):
It snows.
monovalent A verb that takes one argument which is typically the subject:
Sally cried.
divalent A verb that takes two arguments, usually the subject and one object:
She pushed him.
trivalent A verb that takes three arguments (subject and two objects):
Brian bought Lois some flowers.
tetravalent A verb that takes four arguments:
Brian bought Lois some flowers for her birthday.
The valence of a verb can vary depending on its semantics (as in the last two cases in the
examples above). A verb generally requires all its arguments in a well-formed sentence,
although sometimes an argument can be omitted: “I married” instead of “I married her”.
This is called valency reduction. Conversely, if an additional argument is added to a
verb, valency expansion takes place. For instance, in colloquial English, “He scared me”
can be expanded to “He scared the bejeezus out of me”.
Each argument assumes a certain semantic role. The process of meeting another person
(as defined by the verb to meet) could be described by the following “formula”:
to meet 〈someone〉object at 〈somewhere〉location for 〈something〉purpose
This construct is called a case frame and was first introduced in Charles Fillmore’s Case
Grammar framework [Fil68]. This idea was later developed into Frame Semantics [Fil82]
and implemented in the FrameNet project[17]. Figure 3.2 illustrates schematically how
the elements of a sentence can be mapped onto a semantic frame. Dotted boxes denote
optional arguments.
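A case frame can be thought of as a record with named slots, one per semantic role. The following sketch models the to meet frame from the example above; the role names mirror that example and are not taken from the actual FrameNet inventory.

```python
# A case frame as a record with named role slots; illustrative role names.
meet_frame = {
    "object":   None,   # <someone>   - required
    "location": None,   # <somewhere> - optional
    "purpose":  None,   # <something> - optional
}

def fill_frame(frame, **roles):
    """Map sentence constituents onto the frame's semantic roles."""
    unknown = set(roles) - set(frame)
    if unknown:
        raise ValueError(f"unknown roles: {unknown}")
    return {**frame, **roles}

# "Arthur met Lois at the mall."
print(fill_frame(meet_frame, object="Lois", location="the mall"))
# {'object': 'Lois', 'location': 'the mall', 'purpose': None}
```

Semantic role labeling is precisely the task of deciding, for each constituent of a parsed sentence, which slot of which frame it fills.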
Frames are not isolated entities. They are themselves structured and complexly inter-
connected through a variety of relationships. For instance, the frame eclipse is a very
specialized subclass of the frame hiding_objects. Frame relations can be visualized
using the FrameGrapher[18] website. Figure 3.3 shows an excerpt of FrameNet centered
around the transfer of goods.
Figure 3.2.: Application of a semantic frame
At the time of writing, the FrameNet project contains over 800 frames exemplified in
more than 135,000 annotated sentences. Localized adaptations are available for Span-
ish[19], Japanese[20] and German[21]. The process of mapping parts of a sentence to the
appropriate roles of a semantic frame is called semantic role labeling. Several algorithms
for this purpose have been proposed [GJ02], [SM04].
Figure 3.3.: Relations between semantic frames
For our application, we decided not to employ semantic frames for various reasons.
First, the German variant of FrameNet is significantly smaller than its archetype and
consequently covers much fewer concepts. Second, frame semantics are primarily aimed
at aiding in Machine Translation and Information Extraction for strongly structured
text corpora such as newspaper articles or scientific papers. As a result, existing systems
generally do not cope well with informal language. And finally, there was no suitable
implementation available (see section 4.5.1).
Chapter 4.
Assessment of Existing Libraries and
Toolkits
The field of Natural Language Processing appeals to a wide range of people, both pro-
fessionals and hobby programmers. This has led to a multitude of available tools, plat-
forms and frameworks in a variety of programming languages and for a variety of natural
languages. The high level of diversity, complexity and ambiguity of natural language
oftentimes causes software with congruent feature sets and similar theoretical functioning
to perform very differently under real-world conditions. For this very reason, each
software solution must be evaluated on real-world data.
We have assessed a number of libraries for various facets of NLP. The evaluation
focused on performance, ease of integration, licensing aspects and applicability to our
problem domain. Whenever possible, we evaluated the tools against our test corpus to
attain results close to the real application for which they are intended. The test corpus
was manually annotated with part-of-speech tags, word senses and lemmas. It is dis-
cussed in detail in the context of the overall system evaluation in chapter 6. The link to
the website of each evaluated tool can be found through the endnotes in the respective
sections.
In this chapter, a distinction will often be made between a word and a relevant word. A
relevant word is a word that can be found in GermaNet and has, as such, at least one
synset assigned to it. So in this context, “relevant” means that a word carries a semantic
value which is accessible to our system through the ontology.
4.1. Part-of-Speech Taggers
We evaluated OpenNLP in version 1.4.2, the tagger of the Stanford NLP group in
version 1.6 and the TreeTagger in version 3.2. All three taggers combine the steps of
tokenization and part-of-speech tagging and use the STTS tag set[22] (see appendix A).
In theory, sharing a common tag set allows the taggers to be used interchangeably.
However, in practice, all implementations have particularities that need to be taken into
account. For instance, some taggers assign individual tags to certain punctuations such
as colons or quotation marks, while others do not and leave them with the word that
precedes them. In the latter case some minor postprocessing is necessary.
4.1.1. OpenNLP
OpenNLP[23] is a Java-based open source project licensed under the GNU Lesser Gen-
eral Public License. It houses a variety of NLP tools, most notably tokenizers and
part-of-speech taggers for English, German, Spanish and Thai as well as NER- and CR-
models for English only. Most tools are based on the maximum entropy model described
in [BPP96] and [Rat98] which is also available as a standalone tool.
The sentence boundary detection of OpenNLP for German showed some weaknesses
during our evaluation. It occasionally “missed” the end of a sentence and incorrectly
concatenated consecutive sentences. It is not clear what causes this behavior, especially
considering that OpenNLP correctly handles all common edge cases, i.e. periods that
are part of a decimal number, an abbreviation or ellipses. Oddly, the incorrect splitting
did not impact the performance of the pos-tagging. In any case, manually splitting
sentences at the end-of-sentence tags resolved this issue without further difficulties.
4.1.2. Stanford NLP
The Stanford Natural Language Processing Group[24] has implemented a Java version
of the maximum entropy-based log-linear part-of-speech tagger described in [TM00] and
[TKMS03]. It is licensed under the GNU General Public License and comes with fully
trained tagger models for various languages, including German. For the lack of a distinct
name, we will refer to this implementation as Stanford.
Sentence detection was almost flawless except for a few cases where a period as a part of
an abbreviation caused an incorrect sentence split. The part-of-speech tagging itself was
generally solid but did occasionally show some weaknesses. For instance, opening and
closing quotation marks are consistently tagged as CARD (number) and VVFIN (finite
verb), respectively.
During our evaluation we encountered two critical issues. First, the internal tokenizer
inexplicably trips over hyphenated compounds if the second element contains an umlaut.
For instance, the word Hollywood-Rückkehr (eng. the return to Hollywood) is tokenized
and tagged as Hollywood-R/NN ückkehr/ADJD, yielding unusable results.
Second, the performance degraded drastically under certain conditions. The following
constructs were identified as causes of dramatic performance degradation:
- Extended sequences of consecutive nouns
e.g. “United States Senator Thomas Andrew Daschle always supported Obama.”
- Ellipses, particularly within a sentence
e.g. “But then... it got worse!”
- Dashes and quotation marks
e.g. “He said ‘No problem - I swear!’ and ran away.”
On several occasions, the performance degraded for no apparent reason. This
unpredictable behavior leads to erratic variation in performance. The resulting problems
were significant, as can be seen in table 4.1. In some extreme cases the throughput
dropped from several hundred to less than one word per second.
4.1.3. TreeTagger
The TreeTagger[25] is a closed-source, language independent part-of-speech tagger and
lemmatizer. It is written in C and released under a proprietary license that allows free
use for evaluation and research purposes. TreeTagger has been successfully used to
tag a wide variety of languages (German, English, Spanish, Dutch, Russian and Chinese,
among others) and was shown to achieve an accuracy of over 95% on the Penn-Treebank
corpus [Sch94]. Binaries are available for Linux, Microsoft Windows, Sun Solaris and
MacOS.
TreeTagger had one noteworthy particularity unfavorable for our application: occa-
sionally verbs or prepositions are mistagged as nouns or named entities. For instance,
the sentence “Der Grund für sein Verhalten (...)” (eng. “The reason for his behavior (...)”)
is tagged as
Der/ART Grund/NN für/NN sein/PPOSAT Verhalten/NN (...)
where the preposition für (eng. for) is tagged as a noun. Aside from this, it showed no
significant shortcomings in our evaluation.
Remarks
We evaluated the speed of each tagger as well as its accuracy for sentence detection
(a_sd) and for tagging (a_pos). On our test corpus, the sentence detection accuracy
a_sd was extremely high for all taggers1. Only one very specific edge case involving an
extended enumeration of proper nouns and abbreviations caused problems for all
implementations. As discussed above, the Stanford tagger showed serious performance
problems in terms of speed. Its comparatively poor accuracy a_pos is primarily due to
the nature of our test corpus, which had an above-average percentage of cases with which
the pre-trained model could not cope, namely many parentheses and quotation marks.

Tagger      total runtime   speed           a_sd    a_pos
Stanford    23.70 min       0.1 texts/sec   99.9%   93.2%
OpenNLP     0.27 min        9.3 texts/sec   99.8%   95.8%
TreeTagger  0.19 min        13.2 texts/sec  99.9%   95.6%

Table 4.1.: Performance of evaluated pos-taggers
The TreeTagger was the fastest of all taggers, in all likelihood because it is a native C
application and was being run from the command line and not through a Java wrapper.
The OpenNLP implementation was fast and achieved a slightly higher accuracy.
In conclusion, it can be said that all three taggers had issues of varying gravity. We
opted for the OpenNLP library because it performs consistently well, is liberally
licensed and can be used natively from Java.
1 This comparison was done using the mentioned workaround for sentence splitting for OpenNLP.
4.2. Lemmatizers
German is a complex and inflectionally rich language that is difficult to lemmatize.
Partially as a consequence of this, relatively few lemmatizers are available compared
to English. We have evaluated Morphy 1.1, LemmaGen 2.0 and LemServer 1.02.
All three lemmatizers are stand-alone applications that take a single word as an input
and return a list of possible lemmas.
4.2.1. Morphy
Morphy[26] is a freely available closed-source tool for morphological analysis, described
in detail in [Lez98]. While it runs under Microsoft Windows only, the internal dictionary
can be exported as an SQL dump and used independently. The exported data is then
licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.
A notable disadvantage of Morphy is that its dictionary has, for the most part, not
been adapted to the German orthography reform of 1996. The development of Morphy
ceased in late 1999 and it was included primarily for comparative reasons.
Morphy maps approximately 430,000 inflectional forms to 90,000 distinct lemmas. It
contains no additional information such as part-of-speech tags or probabilities of the
occurrences of the various inflected forms.
4.2.2. LemmaGen
LemmaGen[27] is a C++ implementation of the lemmatization algorithm based on
Ripple Down Rules as described in [MJ07]. It is published under the GNU Lesser
General Public License. The lemmatization rules used by LemmaGen are acquired
during a learning phase and stored in a parameter file. Pre-trained parameter files are
available for 14 different languages, including German and English.
A Ripple Down Rule system is similar to a decision tree in that each node n has two
children n1 and n2 that represent an IF n THEN n1 EXCEPT n2 type structure. This rule
tree is built during a learning phase where new rules are added and existing rules are
subdivided and refined with exceptions as more training data comes in.
To illustrate, consider training LemmaGen with the following set of pairs of inflected
forms and their lemmas:
(laughed→laugh, looked→look, smiled→smile, agreed→agree, steed→steed)
After processing the first two elements, the system would deduce that simply removing
the suffix ed is sufficient for lemmatization:
if hasSfx('ed') then removeSfx('ed')
The next two elements violate this rule because the lemmas of smiled and agreed are not
smil and agre, respectively. Hence, two exceptions are added to the rule:
if hasSfx('ed') then removeSfx('ed') except
    if hasSfx('led') then removeSfx('d')
    if hasSfx('eed') then removeSfx('d')
This set of rules would correctly lemmatize the so far unknown word freed, but it would
still fail for the final word steed and lemmatize it to stee. To cover this case, an “exception
to the exception” is needed:
if hasSfx('ed') then removeSfx('ed') except
    if hasSfx('led') then removeSfx('d')
    if hasSfx('eed') then removeSfx('d') except
        if hasSfx('teed')
The above example illustrates nicely the hierarchical structure of the rules and how new
rules “ripple down” through it. In practice, the tree is no deeper than eight to ten
levels, depending on the language, and the lemmatizer is thus extremely fast. However,
it is clear that it does not generalize well to previously unknown words due to the lack
of grammatical rules and heuristics. In the above example, the previously unknown
word bleed would incorrectly be lemmatized to blee. LemmaGen can display its set of
rules; figure 4.1 shows an excerpt of its rule set for German. It is often interesting to
retrace how a specific result was inferred. For instance, the German ruleset contains an
odd exception that lemmatizes the phrase en to Einkaufspark. This is most likely due to
faulty training data.
|---> RULE:( suffix("ser") transform("er"-->"") except(8) ); {:
| |---> RULE:( suffix("eser") transform(""-->"") );
| |---> RULE:( suffix("iser") transform("r"-->"") except(4) ); {:
| | |---> RULE:( suffix("aiser") transform(""-->"") );
| | |---> RULE:( suffix("Reiser") transform("er"-->"") );
| | |---> RULE:( suffix("riser") transform(""-->"") );
| | ‘---> RULE:( suffix("ziser") transform("er"-->"") ); :}
| |
| |---> RULE:( suffix("sser") transform(""-->"") except(2) ); {:
| | |---> RULE:( suffix("besser") transform("besser"-->"gut") );
| | ‘---> RULE:( suffix("össer") transform("össer"-->"oß") ); :}
| |
| |---> RULE:( suffix("äuser") transform("äuser"-->"aus") );
| |---> RULE:( suffix("äser") transform("äser"-->"as") );
| ‘---> RULE:( suffix("öser") transform("er"-->"") except(2) ); {:
| ‘---> RULE:( suffix("böser") transform("r"-->"") ); :}
| :}
Figure 4.1.: Excerpt of LemmaGen's rule set for German
As the algorithm outlined above indicates, LemmaGen performs purely lexical lemmatization
and does not take potentially available additional information, such as part-of-speech
tags, into account. It does, however, return the statistically most likely lemma
if there were multiple possibilities during the learning phase. On the other hand, this
approach is language independent and can be applied to a large variety of languages.
4.2.3. LemServer
LemServer is a part of the RuPosTagger[28] NLP application and is published under
the GNU General Public License. It is implemented in C++ and runs on any POSIX-
compliant operating system, including Microsoft Windows with Cygwin[29]. A Java
wrapper for accessing it via XML-RPC is available and our evaluation has shown that
it performs reasonably fast.
LemServer employs a hybrid approach by utilizing both a set of grammatical heuristics
and a dictionary of over 220,000 words. It returns a list of all possible lemmas with their
associated lexical class. For instance, for the word Liegen two lemmas are returned:
liegen/VER (eng. lie/VERB) and Liege/SUB (eng. couch/NOUN). This result can then
be compared with the part-of-speech tag assigned to the original word by the tagger in
order to choose the lemma most likely to be correct. This requires a tag mapping from
LemServer’s proprietary tag set to the STTS tag set.
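The tag-based selection can be sketched as follows. Candidate and select are hypothetical names for this illustration, and the mapping from LemServer's proprietary tag set to STTS-derived classes is assumed to have already been applied:

```java
import java.util.List;

public class LemmaSelector {
    /** One lemma candidate as returned by the lemmatizer, e.g. ("liegen", "VER"). */
    public record Candidate(String lemma, String wordClass) {}

    /**
     * Pick the candidate whose word class matches the class derived from the
     * part-of-speech tagger; fall back to the first candidate otherwise.
     */
    public static String select(List<Candidate> candidates, String taggedClass) {
        return candidates.stream()
                .filter(c -> c.wordClass().equals(taggedClass))
                .map(Candidate::lemma)
                .findFirst()
                .orElse(candidates.get(0).lemma());
    }
}
```

For the Liegen example, a noun tag selects Liege/SUB while a verb tag selects liegen/VER; an unexpected tag falls back to the first candidate.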
Figure 4.2.: LemServer architecture
Remarks
The results are split into two categories: the global accuracy a (the percentage of correctly
lemmatized words) and the accuracy a_r of correctly lemmatized relevant words. This
distinction is significant because many lemmatization errors occur with uncommon words
for which the lemmatizers were not trained. However, most of these words are not part of
the word net we used, in which case the wrong lemma does not matter in our particular
case. Our analysis has shown that LemServer achieves the best results and outperforms
its competitors by a significant margin. This is not surprising considering that neither
LemmaGen nor Morphy utilizes part-of-speech tags.

Lemmatizer    a        a_r      runtime    words/sec
LemmaGen      86.2%    89.3%     1.3s      12307
Morphy        79.5%    93.9%    23.2s        689
LemServer     94.5%    97.5%     6.8s       2353

Table 4.2.: Performance of evaluated lemmatizers

The total runtime listed in table 4.2 should not be taken at face value: LemmaGen
is a command-line application, LemServer was accessed via XML-RPC, and Morphy
required a large number of database queries. In any case, both LemmaGen and
LemServer were fast enough
for practical use. The performance of Morphy could be increased by transforming the
data into a format that can be queried locally without resorting to a database.
The decision for LemServer was clear-cut; it had the highest accuracy, was easy to
integrate and its dictionary and parameter files are continuously being updated[30]. Its
speed could be improved by implementing a native wrapper and accessing it directly via
the Java Native Interface (JNI).
4.3. Ontologies
A lexical ontology is the central source of semantic meaning in our application. For
this reason, the quality and density of the word net were crucial. While there are a
variety of ontologies for specialized fields such as the Standard Thesaurus Wirtschaft[31]
for economics, there are, to our knowledge, only two that attempt to cover the entire
German language: GermaNet and OpenThesaurus.
4.3.1. GermaNet
GermaNet[32] is an ontology developed and maintained by the University of Tübingen.
It is available free of charge for academic users, but a licensing agreement has to be
signed.
Being very similar in concept and structure to WordNet, it contains more than 80,000
distinct lexical units grouped in approximately 58,000 synsets. These synsets are inter-
connected with roughly 80,000 conceptual and lexical relations. GermaNet models the
following relations: hypernymy/hyponymy, meronymy/holonymy, entailment/entailed,
causation/caused and association. It is distributed in XML format; APIs providing
access to all relevant features are available for various programming languages.
4.3.2. OpenThesaurus
OpenThesaurus[33] is an open source word net that is being used as a thesaurus for
OpenOffice[34] and KWord[35]. It contains about 64,000 entries assigned to 26,000 synsets
that are connected only through hyponymy/hypernymy relations. The data is available
from the official website as a daily updated SQL dump. However, it suffers from some
inconsistencies in terms of character encodings and formatting.
It is clear that the intended purpose of OpenThesaurus is not to be a complete and
richly interrelated ontology, but - as the name indicates - to be a thesaurus. It is
thus limited to a subsumption hierarchy which greatly restricts its use for any other
purpose. In particular, measures for semantic relatedness, as described in section 3.7.3,
do not work well if essential relations like antonymy and holonymy/meronymy are not
available.
Remarks
For this evaluation, we focused on the coverage c and relational density r of the nouns
and verbs in our test corpus. A noun or verb is considered “covered” by the ontology if it
is associated with at least one synset. The relational density was computed by averaging
the number of inter-synset relations of all synsets the noun or verb is a part of.

Ontology         c_nouns    r_nouns    c_verbs    r_verbs
GermaNet         85.6%      10.05      96.3%      4.13
OpenThesaurus    79.0%       4.87      95.2%      1.21

Table 4.3.: Coverage and relational density of evaluated ontologies

As table 4.3 shows, GermaNet clearly outperforms OpenThesaurus in our comparison. It
has a better coverage and a significantly higher relational density for both nouns and
verbs. Consequently, for applications where no domain-specific ontology is available,
GermaNet is the only viable choice.
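The two measures can be sketched as follows. We assume each word is mapped to the relation counts of its synsets (an empty list meaning the word is not covered); as a simplification, the density here averages over all synsets globally rather than per word:

```java
import java.util.List;
import java.util.Map;

public class OntologyStats {
    /** Fraction of words that are associated with at least one synset. */
    public static double coverage(Map<String, List<Integer>> synsetRelations) {
        long covered = synsetRelations.values().stream()
                .filter(relations -> !relations.isEmpty())
                .count();
        return (double) covered / synsetRelations.size();
    }

    /** Average number of inter-synset relations over all synsets of all words. */
    public static double relationalDensity(Map<String, List<Integer>> synsetRelations) {
        return synsetRelations.values().stream()
                .flatMap(List::stream)
                .mapToInt(Integer::intValue)
                .average()
                .orElse(0.0);
    }
}
```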
4.4. Frameworks
At the time of writing there were two major Java frameworks for NLP applications:
GATE and Apache UIMA. We evaluated both in order to decide whether our appli-
cation could benefit from using a framework or if it could even be implemented as a
plug-in.
4.4.1. GATE
GATE[36] (General Architecture for Text Engineering) is a Java framework for NLP
applications developed by the NLP Group of the University of Sheffield. It employs a
component-based architecture where individual components are combined into a pro-
cessing pipeline (often simply called an application). The input is sent through the
pipeline, each component receiving the output of its predecessor. This naturally implies
a strictly sequential processing which can in some cases be suboptimal [MS05]. GATE
is licensed under the GNU Lesser General Public License.
A GATE component is either a Language Resource, a Processing Resource or a Visual
Resource. Language Resources are centered around data, i.e. they provide an inter-
face to query external sources like ontologies, dictionaries or text corpora. By contrast,
Processing Resources encapsulate algorithmic components such as pos-taggers or coref-
erence resolvers. Visual Resources can extend the GATE’s GUI application and provide
an interface to viewing, editing and visualizing various aspects of the processing chain.
The set of all available plugins is called CREOLE (Collection of Reusable Objects for
Language Engineering). A wide variety of commercial and free plugins are available,
ranging from stemmers to components that build text corpora on-the-fly from Google
queries.
4.4.2. UIMA
The Unstructured Information Management Architecture UIMA[37] is an OASIS[38]-
approved standard for the analysis of unstructured data and knowledge discovery and
was originally developed by IBM. Apache UIMA2 is an open source implementation
of this specification licensed under the Apache License v2.0.
The structure of a UIMA application is in principle similar to GATE in that a
processing pipeline is built from separate, reusable components called Analysis Engines.
Analysis Engines can be written in Java or C++ as well as Perl, Python and TCL
(through SWIG[39]). They are packaged and redistributed as a single Processing Engine
Archive (PEAR) file. Figure 4.3 gives a schematic overview of the various aspects that
are a part of the Apache UIMA project. In contrast to GATE, which is intended to
process exclusively textual input, UIMA was designed as a unified framework for trans-
forming any kind of unstructured data into structured data. For instance, it could be
employed in an application that extracts geographical information from video streams
and cross-references them with current news feeds, all within the same framework. The
processing of texts in natural language is only one possible area of application.
2The terms “UIMA” and “Apache UIMA” are commonly used synonymously; no distinction is made between the specification and its, at the time of writing, only implementation.
Figure 4.3.: The Apache UIMA project (Source: project website)
Remarks
The decision for or against the employment of a framework commonly depends on
whether the gain in flexibility and abstraction is more important than the increased
overhead entailed by it. The architectural infrastructure provided by such frameworks
naturally means a higher overall complexity and a potential performance loss.
Both GATE and Apache UIMA are mature and powerful frameworks. The tradeoff
outlined above was to be weighed against the constraints discussed at the beginning
of this chapter, primarily ease of integration and low overall coupling. We decided
against the use of a major framework as they offered no clearly identifiable benefit to us
but would have introduced many new dependencies and additional layers.
4.5. Miscellaneous
Two further tools that we evaluated did not fit into the previous categories:
Shalmaneser, a toolkit for assigning semantic role labels, and JWordSplitter, a
library for decompounding German nouns.
4.5.1. Shalmaneser
Shalmaneser[40] is a shallow semantic parser for assigning roles to frame elements using
the SALSA[41] corpus (see section 3.9). It is implemented in Ruby and licensed under
the GNU General Public License. A detailed description can be found in [EP06a] and
[EP06b].
There are several practical issues with Shalmaneser. First, it seems to be no
longer actively maintained; the latest version dates back to early 2007. Second, it
runs, despite being written in Ruby, exclusively on Linux. In addition, it has various
dependencies including a MySQL database and Amit Dubey’s Sleepy parser which is
no longer publicly available.
For these reasons, Shalmaneser was not a viable option for integration into our system.
However, for comparative purposes, we decided to examine the SALSA corpus and how
it might be applicable to this work. An analysis of the verbs modeled in the corpus
has shown that it was only able to cover approximately 27% of all verbs that occurred
in our test corpus and was thus sparse at best. The semantic frames defined in the
SALSA corpus are based exclusively on formal newspaper articles. It is questionable
if any program trained on that corpus could successfully be used on descriptions from
EPGs.
4.5.2. JWordSplitter
JWordSplitter[42] by Daniel Naber is a small open source Java library for splitting
German compounds. The input is matched against an internal word list to identify its
constituents. It does take epenthesis3 into account. Despite being an
algorithmically simplistic approach, it has improved our results.
In our test corpus, approximately 45% of all compounds were successfully split. The
other 55% consisted of compounds where at least one part was not considered relevant.
This was either because the word itself was unknown or because a known word was
inflected in such a way that it could not be correctly lemmatized, e.g. Ess·störung (eng.
eating disorder), where Ess is a shortened form of Essen (eng. food) that does not occur
in this inflected form on its own.
3An epenthesis is the addition of a sound when forming compounds. Consider compounding Verwendung and Zweck to Verwendungszweck, which requires the addition of an s due to the implicit declension of Verwendung to the genitive.
Chapter 5.
Implementation
The goal of this work is to develop a complete system for analyzing program descriptions
written in natural language. Through a series of processing steps, the input is analyzed
and transformed into a sequence of tokens which are augmented with semantic information.
The process yields a structured summary of all identified entities,
concepts, and keywords. That information could then be used to compare TV shows and
movies on a semantic level or to improve existing recommender systems, among other
things. The extracted concepts can be utilized to aid the user interactively in finding
programs similar in content and setting.
Figure 5.1.: The Sliver processing pipeline
The software is called Sliver and is a pure Java implementation with an emphasis on
a simple architecture and a high degree of modularity and extensibility. Our implemen-
tation was designed to be incorporated into a larger software system, but it can also be
used as a stand-alone library with few external dependencies and minimal configuration
required. This chapter will illustrate various aspects of Sliver’s implementation and
describe the inner workings of our application in detail. Figure 5.1 shows a schematic
of Sliver’s analysis pipeline and gives a broad overview of the individual processing
steps and their subcomponents.
The purpose of our system is to analyze text in natural language and extract terms and
phrases descriptive of it. This process can be broken down into three distinct phases:
• Basic Tokenization
The first phase is the basic tokenization of the textual input. This includes some
general preprocessing of the text and modifications specific to the used tokenizer
and part-of-speech tagger.
• Semantic Analysis
During this phase, the tokens are analyzed and augmented with semantic informa-
tion. This is the central phase and consists of several steps that process various
aspects of the tokens.
• Tuple Generation
Based on the information from the previous two phases, the system now tries to
identify and weight the most descriptive tokens and phrases (token tuples).
Each phase can be subdivided into several smaller steps. In the following sections, each
of these steps will be discussed in more detail.
5.1. Preprocessing
This step subsumes all actions necessary to process an input document. This includes
the following steps:
• Reading the input data
Depending on the data source, this means parsing XML files, connecting to a
database or any other means of acquiring the data. In any case, the end result
must be a single string.
• Replacing special characters
GermaNet uses underscores to concatenate multi-word expressions. For instance,
the phrase log out has to be queried as log_out. For this reason, the text should not
contain this character. However, natural language text rarely contains underscores,
so in practice, this poses no restriction on the format of the input.
• Performing tokenizer-specific modifications
While most tokenizers require little preprocessing (if any), some need specific
adjustments. Stanford, for instance, had massive problems processing URLs.
Issues like these should be addressed during this step.
For our test data, preprocessing was minimal. In a real-world application, this can be
more complex and require any number of steps commonly involved in processing textual
data, such as handling different character encodings or removing markup.
5.2. Tokenization and PoS-Tagging
After preprocessing is complete, the tokenizer and part-of-speech tagger parses the text
and generates a sequence of BasicTokens. We will use the Java-like notation of t.pos,
t.lemma and so forth to refer to the annotations of a given token t.
All part-of-speech taggers that we evaluated combine tokenization and pos-tagging in
either a single step or two strongly linked steps. As discussed in section 4.1.1, some
postprocessing is required for OpenNLP to account for its sometimes faulty sentence detection.
The sentence splitting returned by OpenNLP is reexamined and adjusted as needed.
This step is essentially a conversion from the tagger-specific format to our own represen-
tation. This implies a conversion to the STTS tag set, should the part-of-speech tagger
use a different one.
5.3. Semantic Analysis
During the semantic analysis, the BasicTokens are analyzed and augmented with addi-
tional information. This phase consists of five steps: lemmatization, compound splitting,
integration of GermaNet, word sense disambiguation and semantic chunking.
5.3.1. Lemmatization
The value of each token is sent to the lemmatizer and the lemma is assigned to the token.
LemServer returns multiple values if the lemma cannot be determined unambiguously.
The part-of-speech tag of the token is then used to infer the lemma most likely to be
correct. If the ambiguity cannot be resolved this way, the first lemma is used as a
fallback. LemServer always returns a lemma; if it does not recognize the input, it
returns it unaltered.
5.3.2. Compound Splitting
As outlined in section 3.2.3, splitting words into their constituents is in practice a rather
difficult task. Besides the problems already discussed, further problems arise when
part-of-speech taggers make mistakes, for instance by labeling a verb or an adjective
as a noun. When that happens, the decompounding algorithm oftentimes produces
nonsensical results that affect the following processing steps negatively. For this reason,
we employ a very conservative approach. A word w is only decompounded if all of the
following conditions are met:
(1) w is tagged as a noun
(2) w is capitalized
(3) w itself is not relevant
(4) all lemmatized constituents of w are relevant
Constraints one and two ensure that only nouns are decompounded. The fraction of
compound verbs or adjectives is for practical purposes negligible. Constraint three keeps
nouns from being decompounded for which a more specific synset is already known. The
final constraint acts as a safeguard against incorrect decomposition. If all constituents
are genuine nouns, the compound was most likely endocentric and decomposition is likely
to yield sensible results. While these constraints naturally only permit comparatively
few decompoundings, the total error is also kept to a minimum. This will be discussed
further in section 6.3.2.
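The four conditions translate directly into a guard function. The names below (shouldSplit, isRelevant) are illustrative; isRelevant stands in for the word-net lookup described above:

```java
import java.util.List;
import java.util.function.Predicate;

public class CompoundGate {
    /**
     * Conservative gate: a word is decompounded only if all four conditions
     * hold. The constituents are assumed to be lemmatized already.
     */
    public static boolean shouldSplit(String word, String posTag,
                                      List<String> lemmatizedParts,
                                      Predicate<String> isRelevant) {
        return posTag.equals("NN")                                // (1) tagged as a noun
                && Character.isUpperCase(word.charAt(0))          // (2) capitalized
                && !isRelevant.test(word)                         // (3) the compound itself is unknown
                && lemmatizedParts.stream().allMatch(isRelevant); // (4) every constituent is known
    }
}
```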
5.3.3. Ontology Integration
Linking a token to an entity in an ontology is a key element of establishing semantic
information. In a word net, a token is assigned to one or more synsets, depending on
how ambiguous it is. The large number of interconnections between synsets provides a
rich amount of semantic information that can be harnessed. We will now illustrate how
GermaNet was integrated into our system and what issues arose in doing so.
During this discussion, the terms meaning, sense and synset will be used synonymously,
even though they differ subtly in theory. A meaning (or sense) commonly refers to
something concrete, although possibly abstract. The term addition, for example, has an
intangible but concise meaning. GermaNet, on the other hand, contains synsets whose
only purpose is to complete the subsumption hierarchy. These “artificial synsets” have
no representation in the real language. For instance, neither English nor German has
words that mean to move into a specific direction (as a hypernym of rise and plummet)
or unspecified movement (as a hypernym of arrive and stop).
Nouns, verbs and adjectives are covered extensively by GermaNet. Other word classes
such as prepositions, conjunctions or interjections, are not a part of GermaNet as
they carry no intrinsic meaning. In that sense, they are noise or stop words. It could
be argued that the semantics of negations should not be ignored. To a human, it may
matter whether the kitten was killed or the kitten was not killed. However, for the
purpose of information extraction this distinction is basically irrelevant. Any concept is
innately related to its negation, so the differentiation between the two is of little overall
importance.
GermaNet can be queried directly through a Java API. Given a word’s lemma, a list
of all synsets that the word is part of is returned. It is important to realize that the
number of returned synsets is often much higher than one would intuitively suspect.
This is because GermaNet contains a large number of polysemes (i.e. closely related
but distinctly modeled word senses). For instance, the verb gehen (eng. to go) is a part
of 15 different synsets that sometimes differ only slightly in meaning. It is not always
apparent, even to a human, which meaning is most appropriate in the context from which
the word originated. In most cases, several meanings can be considered adequate.
The more a word's synsets are scattered throughout an ontology, the less semantically
related they are; the word's meaning is then fuzzy. Ideally, a word should be completely
unambiguous (i.e. it has only one synset), in which case its meaning would be concise. The
next section discusses how the fuzziness of a word’s meaning can be reduced through
sense disambiguation.
5.3.4. Word Sense Disambiguation
As described in section 3.8, algorithms for word sense disambiguation work by comparing
the current context of a word with a “reference context”. This implicitly requires the
existence of such a context, which is commonly given through one of the following:
• Gloss
If a word has a gloss, the words in that gloss can be used as a reference context.
The gloss can be taken from a word net or from an external source, such as a
dictionary. In the latter case, a mapping from the word net synsets to dictionary
entries is required.
• Example Usage
If a list of example uses of a word is available, this can be used as a context. These
examples can either be manually annotated with the precise sense or come from
a corpus whose general domain of meaning is already known, for instance medical
journals.
• Domain Annotation
A domain annotation specifies the general domain of meaning to which a synset
belongs, for instance branch_business and branch_biology. There can still be multiple
synsets per annotation, as in bank_business (the financial institution) and bank_business
(the building). This allows the quick clustering of words and disambiguation by
finding to which cluster the word in question is closest1.
If no suitable context is available, an alternative is the use of the frequency score of a
word sense, either by itself as a heuristic or in combination with other methods. The
frequency of a sense is an indication of its statistical likelihood of occurring. For
instance, the word eye
is much more commonly used in the sense of organ than it is in the sense of the eye of
a needle.
The problem is that GermaNet, unlike WordNet, contains neither sense annotations
nor frequency scores, and only a minuscule fraction of all entries has a gloss. To overcome
these limitations, we devised an algorithm for word sense disambiguation that operates
exclusively on lexical relations and is thus completely domain-independent. Needless to
1In our case, domain annotations could prove to be particularly useful if they can be mapped onto the genre information that is available for most entries of the EPG.
say, an algorithm with less information at its disposal will necessarily be less accurate
than an algorithm that can utilize more data. This is partially offset by the fact that
for our application, it is not required (or even possible) to identify the sense of a word
uniquely, i.e. reduce the list of its synsets to the size of one. The goal is rather to remove
those synsets that are clearly inadequate. For instance, for the word bank it is important
to distinguish between the “river bank” and the “money bank”. The differentiation
between the abstract financial institution and the concrete building is, in comparison,
much less important.
The algorithm has the following key elements:
• The context of a token t, written as context(t), is the set of words in its immediate
surrounding (excluding the word itself).
• A similarity measure sim for semantic relatedness (see section 3.7.3). sim(s, S)
is the average similarity between a synset s and all synsets in some set S of synsets.

• A threshold value τ that determines the minimum similarity score, relative to
sim(s, S), that a synset must reach to be retained.
• A merge function that combines synsets that are closely related.
The algorithm is a fixpoint iteration that successively reduces ambiguities until no further
changes occur. The following steps are executed for each ambiguous token t with a set
of synsets t.senses:
- Invoke the merge function to combine senses that are closely related
- Determine the set fp of all tokens in the context of t that are unambiguous
- Calculate the semantic relatedness between each sense of t and all tokens in fp
- Discard all senses of t whose similarity to fp is below the threshold τ
This is repeated until no more senses can be discarded. Figure 5.2 gives a more formal
description of this algorithm. Termination is trivial to prove: the total number of senses
is finite and strictly decreases in every iteration but the last.
Exact parameters for this algorithm and a performance evaluation will be discussed in
more detail in section 6.3.2.
repeat
    foreach t in {ti ∈ tokens | card(ti.senses) > 1} begin
        merge_similar(t.senses);
        fp := {ti ∈ context(t) | card(ti.senses) = 1};
        τ := calculate_threshold(t.senses, fp);
        t.senses := {s ∈ t.senses | sim(s, fp) ≥ τ};
    end;
until not changed;
Figure 5.2.: Algorithm for word sense disambiguation
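A minimal executable sketch of this fixpoint iteration is shown below. It deviates from the algorithm in two hedged ways: the threshold τ is a fixed parameter instead of being computed by calculate_threshold, and the merge step is omitted. The similarity measure is supplied as a function; Token and disambiguate are illustrative names, not Sliver's actual data model:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

public class SenseFilter {
    /** Token with a mutable set of candidate sense ids. */
    public static class Token {
        public final String lemma;
        public final Set<String> senses;
        public Token(String lemma, Set<String> senses) {
            this.lemma = lemma;
            this.senses = new HashSet<>(senses);
        }
    }

    /**
     * Fixpoint iteration: for every ambiguous token, drop all senses whose
     * average similarity to the senses of unambiguous neighbours falls below
     * the threshold tau; repeat until no further changes occur.
     */
    public static void disambiguate(List<Token> tokens,
                                    BiFunction<String, String, Double> sim,
                                    double tau) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Token t : tokens) {
                if (t.senses.size() <= 1) continue;
                List<String> fp = new ArrayList<>();  // senses of the unambiguous context
                for (Token c : tokens) {
                    if (c != t && c.senses.size() == 1) fp.addAll(c.senses);
                }
                if (fp.isEmpty()) continue;
                Set<String> keep = new HashSet<>();
                for (String s : t.senses) {
                    double avg = fp.stream()
                            .mapToDouble(f -> sim.apply(s, f))
                            .average().orElse(0);
                    if (avg >= tau) keep.add(s);
                }
                // never discard everything: a token must retain at least one sense
                if (!keep.isEmpty() && keep.size() < t.senses.size()) {
                    t.senses.retainAll(keep);
                    changed = true;
                }
            }
        }
    }
}
```

For the bank example: given one unambiguous "financial" context token and a similarity measure that relates the financial senses, the "river bank" sense is discarded while the "money bank" sense survives.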
5.3.5. Chunking
At this point in the processing chain, we have a sequence of tokens, each annotated with
its lemma, its pos-tag and a list of synsets. Such a token is called an AnnotatedToken.
The next step is to combine consecutive tokens that are intrinsically related. This
is called chunking. The intent is to infer semantic relatedness, reflect it in the data
representation and prepare the model for the following step in the processing pipeline.
We will first briefly formalize the concept of a chunk and introduce the needed
terminology. Next, we will discuss three different types of chunking and their exact
semantics. Finally, we will describe the chunking strategies that were implemented
in our system.
Formalization
Given a sequence T = (t1, t2, . . . , tn) of tokens, a chunk c_i^k ⊆ T is defined as the
subsequence (t_i, . . . , t_{i+k-1}) of k > 1 consecutive tokens starting at t_i. In this
work, all chunks are assumed to be mutually disjoint, i.e. given a chunking
C = (c_i^k, c_j^l, . . .) of a text, c ∩ d = ∅ for all c, d ∈ C with c ≠ d. In general,
⋃_{c∈C} c ≠ T, so C is not necessarily a partition of T.
In our implementation, a chunk is itself modeled as a token2. To illustrate, consider the
text represented by the sequence (t1, t2, t3, t4, t5) where each ti is an AnnotatedToken.
Assuming the chunking C = {c_3^2}, the resulting token sequence is then (t1, t2, c_3^2, t5),
or (t1, t2, (t3, t4), t5).
2A direct consequence of this approach is that chunks can be nested as deeply as needed and implicitly form a hierarchy. This will be discussed in more detail later on.
A core issue is how the aggregation of tokens is handled with respect to their synsets.
We will be dealing with synsets, sets of synsets, and sets thereof. Confusion can arise
from the fact that a synset is only a set insofar as it consists of multiple lexemes,
but these lexemes are irrelevant here. In this context, a synset is considered atomic.
To avoid confusion, we will in this section primarily be using the term sense instead of
synset. The symbol S denotes the set of all synsets contained in GermaNet.
When a sequence T of tokens is to be combined into a chunk c, a unification function
φ : P̂(S) → P(S) is required3. This function maps the multiset of senses of all t ∈ T to
a single set of senses. The synsets of c are then defined by

    c.synsets := φ( ⨄_{t ∈ T} t.synsets )
It is apparent that the definition of φ needs to reflect the intended semantics of the
chunking and cannot be defined globally. For instance, the most intuitive functions
φ∪(S) := ⋃_{s ∈ S} s and φ∩(S) := ⋂_{s ∈ S} s are only adequate if all the tokens in the
chunk have similar meanings to begin with. In any other case, φ∪ will lead to a significant
increase in fuzziness, and φ∩ will yield the empty set unless the words were synonyms of
each other. Various techniques could be considered to counter this, for instance using
synset expansion (i.e. extending the synsets to include the hypernyms of all its elements).
This would guarantee that the intersection is never empty, but inevitably lead to very
abstract meanings close to the root of the subsumption hierarchy.
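The two intuitive unification functions can be sketched directly, with synsets represented as sets of sense identifiers:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Unify {
    /** φ∪: the union of all token sense sets; increases fuzziness for heterogeneous chunks. */
    public static Set<String> union(List<Set<String>> senses) {
        Set<String> out = new HashSet<>();
        senses.forEach(out::addAll);
        return out;
    }

    /** φ∩: the intersection; empty unless the tokens are (near-)synonyms. */
    public static Set<String> intersection(List<Set<String>> senses) {
        if (senses.isEmpty()) return Set.of();
        Set<String> out = new HashSet<>(senses.get(0));
        for (Set<String> s : senses) out.retainAll(s);
        return out;
    }
}
```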
Intended Semantics
In the above formalization, we have introduced the concept of chunking, but it was left
open what its implications are. Each token in a chunk has its own value, lemma and
synsets. The question is what the synset of the chunk is and, to a lesser extent, what its
value and lemma is4. We will first discuss the principal methods of how chunking can
be done, and then describe various strategies as to when tokens can be aggregated and
what the exact semantics of the aggregate is.
3P̂(M) is the “power multiset”, i.e. the multiset of all subsets of M. This is needed because the multiplicity of a synset may be relevant for a concrete implementation of φ.
4For a semantic analysis the concrete textual representation of a word is only of secondary interest.
Consider the token sequence (t1, t2, t3, t4, t5) where the chunk (t2, t3) was identified.
Chunking can then happen in three different ways.
(1) remove the tokens completely: (t1, t4, t5)
(2) merge the tokens into a single new token: (t1, tc, t4, t5)
(3) combine the tokens into a composite token: (t1, (t2, t3), t4, t5)
In the first case, a sequence of tokens is removed entirely from the system. This should
be done when the sequence has little informational value and its removal does not change
the semantics of the text. φ is trivially defined as φ(S) := ∅. Analogous to the term stop
word, this is called a stop sequence.
The second option assumes that the terms in the chunk can sensibly be represented by
a single token. For this reason, φ∪ and φ∩ can be considered adequate. In practice,
however, such a sequence rarely occurs in natural language. The most likely occurrence
is in the slightly concealed form of enumerations like “He jogs, swims and skydives”.
For this reason, it is called a uniform sequence. In our system, it is represented by an
AnnotatedToken.
When a uniform sequence is transformed into an AnnotatedToken, all information of
the individual tokens is lost. In many cases, this is not desirable. The key idea of chunk-
ing a sequence of semantically heterogeneous tokens is to identify a head token that
can represent the entire chunk without significantly changing its semantics or gram-
matical function. Such a sequence is called a head sequence. It is converted into a
CompositeToken that is atomic for the purpose of algorithmic processing, but still re-
tains the original constituents5. Given a head token ti, the unification function is then
commonly defined as φ(S) := ti.synsets.
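The three token kinds can be modeled as a small class hierarchy. The following is a minimal sketch; the class names AnnotatedToken and CompositeToken follow the thesis' terminology, while the concrete fields and the synset identifiers are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class AnnotatedToken:
    value: str
    lemma: str
    pos: str
    synsets: Set[str] = field(default_factory=set)

@dataclass
class CompositeToken:
    tokens: List[AnnotatedToken]
    head: AnnotatedToken

    @property
    def synsets(self) -> Set[str]:
        # phi(S) := head.synsets: the head represents the whole chunk
        return self.head.synsets

    @property
    def value(self) -> str:
        return "~".join(t.value for t in self.tokens)

morning = AnnotatedToken("morning", "morning", "NN", {"morning.n.1"})
beautiful = AnnotatedToken("beautiful", "beautiful", "ADJA", {"beautiful.a.1"})
chunk = CompositeToken(tokens=[beautiful, morning], head=morning)
print(chunk.value)    # beautiful~morning
print(chunk.synsets)  # {'morning.n.1'}
```

The composite stays atomic for algorithmic processing but retains its constituents, and its senses are simply those of the head.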
Patterns
We have defined and implemented six different chunking patterns to improve the per-
formance of our system.
• Attributive Adjective/Noun (A/N)
The part-of-speech combination adjective (ADJA) and noun (NN) form a head
sequence that can be merged. This is because in this particular combination the
5The notation of a CompositeToken consisting of the tokens t1, t2 and t3 with the head t3 is (t1, t2, t3).
adjective is in all but a few edge cases a qualifier for the noun. The resulting
CompositeToken inherits its primary meaning from the noun and can be treated
as such in subsequent steps.
For instance, the sequence (beautiful/ADJA, morning/NN) can be replaced by the
composite (beautiful∼morning/NN) with the primary meaning morning. The two
can be used interchangeably with only minimal alteration of the overall semantics.
• Adverbial Adjective/Verb (A/V)
Similar to the adjective/noun chunking, an adverbial adjective (ADJD) and a verb
can be combined. This holds for all verbs, including auxiliary verbs.
• Annotation (ANN)
Descriptions of movies and series often contain annotations such as by whom a
character is portrayed, for example
But now Dirty Harry (Clint Eastwood) is ready to make your day once again.
The annotation is unwanted for two reasons. First, it is in most cases redundant
because the EPG entry usually already contains an explicit list of actors. Second,
and more importantly, these tokens needlessly add to the distance between the
tokens to the left and right of them. Why this matters will be explained in the next
section. In any event, omitting the entire annotation does not impact the semantics
of the text. Another, less common example of an annotation is a reference to a
specific year or age, such as “Harrison Ford (66) is still a great actor”. Annotations
are immediately discarded and not used for further analysis.
• Named Entities (NES)
Consecutive sequences of NE-tagged tokens are combined into one AnnotatedToken.
That composite inherits NE as its part-of-speech tag. The benefit of this merging
lies in the fact that tokens that are part of the same semantic entity are now
also represented as one element. For instance, it is desirable to treat (Neil/NE
Patrick/NE Harris/NE) as a single token because it stands for a single entity.
There are edge cases where this approach causes semantically incorrect chunking.
Consider the sentence “She told Peter Brian was home” where the named entities
Peter and Brian should not be merged. In practice, however, this style is commonly
avoided in written language and occurs only rarely.
• Phrase (PHR)
A phrase refers to a sequence of words that is recognized by GermaNet and has its
own synset. This can be either a very common named entity or a figure of speech.
Typical examples of this are New York State, Hot Dog or mentally challenged.
When merged, a phrase is replaced by an AnnotatedToken. No information is
lost because the assigned synset contains the meaning of the entire phrase. The
part-of-speech tag is inferred from the lexical class of the assigned synset6.
• Personified Profession (PPR)
If a named entity is qualified with the name of a profession, that entity is called
a personified profession. A typical example of this is Inspector Columbo or Dr.
Beverly Crusher.
The pattern is similar to that of regular named entities, except that it is preceded
by a noun. That noun needs to be a hyponym of an abstract GermaNet concept
that subsumes all professions. The head of the resulting CompositeToken is the
noun as it carries the primary semantic meaning of the composite (the named
entities commonly do not have any synsets assigned to them anyway).
Table 5.1 shows a brief summary of the different chunking patterns. The column “meaning”
describes the value of φ, i.e. which synset(s) are assigned to the replacement token.
Name   Pattern                  Replaced By      PoS    Meaning
A/N    ADJA, ADJA, …, NN        CompositeToken   NN     noun
A/V    ADJD, ADJD, …, verb      CompositeToken   verb   verb
ANN    '(', NE, …, NE, ')'      nothing          -      -
NES    NE, NE, …                AnnotatedToken   NE     -
PPR    NNprof, NE, …            CompositeToken   NN     noun
PHR    any, …                   AnnotatedToken   gn     gn

Table 5.1.: Overview of chunking patterns
The patterns are executed sequentially, and the order of execution is important.
Phrase must be run first in order to identify idioms before other chunking
patterns destroy them. Personified Profession should be run before Named Entities to
avoid unnecessary nesting, and so forth.
6Each synset in GermaNet is classified as either a noun, a verb or an adjective.
One correct order of execution is thus:
1. Phrase and Annotation
2. Personified Profession
3. Named Entities
4. Adverbial Adjective/Verb and Attributive Adjective/Noun
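The sequential application can be sketched as a pipeline of pattern functions. The sketch below implements only the ANN stop sequence over hypothetical (value, pos-tag) pairs, using the STTS bracket tag $( for both parentheses; the pattern-as-function design is an assumption, not the thesis' actual architecture:

```python
def apply_patterns(tokens, patterns):
    """Apply chunking patterns in a fixed order; each pattern maps a
    token sequence to a (possibly shorter) token sequence."""
    for pattern in patterns:
        tokens = pattern(tokens)
    return tokens

def drop_annotations(tokens):
    """ANN: remove '(' NE ... NE ')' stop sequences entirely (phi = empty set)."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i][1] == "$(":
            j = i + 1
            while j < len(tokens) and tokens[j][1] == "NE":
                j += 1
            if j > i + 1 and j < len(tokens) and tokens[j][1] == "$(":
                i = j + 1  # skip the whole annotation
                continue
        out.append(tokens[i])
        i += 1
    return out

sent = [("Harry", "NE"), ("(", "$("), ("Clint", "NE"),
        ("Eastwood", "NE"), (")", "$("), ("is", "VAFIN")]
print(apply_patterns(sent, [drop_annotations]))
# [('Harry', 'NE'), ('is', 'VAFIN')]
```

Running the full system would simply pass the six patterns to apply_patterns in the order listed above.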
A key characteristic of all the above modifications is that they do not change the
semantics of the text except in very rare edge cases, as discussed in section 6.3.3. The
grammatical structure is essentially unaltered, and the original text could, within limits,
be reconstructed. Figure 5.3 illustrates how chunking imposes a hierarchical structure
on a text.
Chunking bears some resemblance to fully parsing a sentence. In particular, the hierarchy
imposed on the token structure often looks similar to a parse tree. However, two
differences distinguish the techniques. First, the nesting of chunks rarely
exceeds two to three levels and is thus much shallower than a parse tree. Second,
as mentioned in section 5.3.5, chunking is not exhaustive and may ignore some tokens.
A consequence of these differences is that chunking is generally much less error prone
because it can avoid ambiguous states.
Figure 5.3.: Example of semantic chunking
5.4. Extraction Of Descriptive Terms
So far we have not addressed the question of how the information extracted from
the text will be structured. Chunking, as described above, often leads to nested tokens
in the form of CompositeTokens, which carry rich semantics but are cumbersome to handle.
An easier and more natural way to represent descriptive terms is to resort to simple
pairs of tokens. Such a pair is also called a concept. It is therefore desirable to flatten
the hierarchical structure of composite tokens and transform it into a set of token pairs.
The concept of a head token makes this task almost trivial. Consider the CompositeToken
(beautiful, French, poetry). It can be expanded by pairing the head with each of its
qualifiers; we thus get the two tuples (beautiful, poetry) and (French, poetry).
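The expansion step can be sketched in a few lines, assuming (as in the example above) that the head is notated as the last element of the composite:

```python
def expand_composite(tokens):
    """Pair the head (assumed to be the last element) with each qualifier."""
    *qualifiers, head = tokens
    return [(qualifier, head) for qualifier in qualifiers]

print(expand_composite(("beautiful", "French", "poetry")))
# [('beautiful', 'poetry'), ('French', 'poetry')]
```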
5.4.1. Tuple Generation
The expansion of composites alone is not sufficient to cover the better part of a text’s
characteristics. There are other pairs of word classes that are likely to carry semantic
information besides adverb/verb and adjective/noun. For instance, verb/noun pairs are
very descriptive. Oftentimes just a few of these tuples are sufficient to sketch a course of
events. Consider the phrases (break, toy), (slap, sister) and (run, mother). Immediately,
a multitude of associations is invoked, and it takes very little imagination to picture
what the whole story could be.
Unlike the sequences examined in the previous section, finding these kinds of token pairs
is very difficult. The word order of a German sentence is extremely flexible, especially
in comparison to English. Consider the following three sentences:
(1) Er geht ein Bier trinken.
(2) Bier zu trinken war sein Hobby.
(3) Trinken wollte er eigentlich nur ein paar Bier.
In all three cases, the essential notion is (trinken, Bier), but the positioning of those words
varies greatly, both in absolute terms as well as relative to each other. Actually finding
this concept is challenging. The only reliable way to do so is to parse the sentence using
a complete grammar for natural language in order to identify grammatical relations. For
real-world applications, this is not a feasible option, if it can even be done at all.
As an alternative to a complex linguistic analysis, we propose a heuristic based on the
assumption that certain part-of-speech combinations are more likely to carry meaningful
information than others. The basic idea of this approach is outlined in the algorithm
in figure 5.4.

    P := ∅;
    foreach s in S, t1 in s begin
        foreach t2 in context(t1) begin
            if pass_filter(t1, t2) then P := P ∪ {(t1, t2)};
        end
    end
    result := rank_pairs(P);

Figure 5.4.: Basic algorithm for heuristic tuple generation

The algorithm essentially calculates the Cartesian product of the tokens
of each sentence with itself and then assigns a score to each tuple that reflects the
likelihood of that particular tuple being descriptive for the text. Generating the full
Cartesian product is for practical purposes not desirable and, as will be discussed in the
following section, generally not needed.
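The pseudocode of figure 5.4 translates almost directly into a runnable sketch; the context, filter and ranking functions below are simplified stand-ins for the real implementations, and tokens are hypothetical (position, word, pos-tag) triples:

```python
def generate_tuples(sentences, context, pass_filter, rank_pairs):
    """Pair each token with the tokens in its context, filter the
    pairs, then rank the survivors (figure 5.4)."""
    pairs = set()
    for sentence in sentences:
        for t1 in sentence:
            for t2 in context(t1, sentence):
                if pass_filter(t1, t2):
                    pairs.add((t1, t2))
    return rank_pairs(pairs)

# simplified stand-ins
def context(t1, sentence, radius=3):
    return [t for t in sentence if t is not t1 and abs(t[0] - t1[0]) <= radius]

def pass_filter(t1, t2):
    interesting = {("VVFIN", "NN"), ("NN", "VVFIN")}
    return (t1[2], t2[2]) in interesting

rank_pairs = sorted
sent = [(0, "Er", "PPER"), (1, "trinkt", "VVFIN"),
        (2, "ein", "ART"), (3, "Bier", "NN")]
print(generate_tuples([sent], context, pass_filter, rank_pairs))
```

On this toy sentence only the verb/noun pair (trinkt, Bier) and its mirror survive the filter; all other combinations are discarded.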
5.4.2. Search-Space Reduction
It is not necessary to fully analyze all possible token combinations. A large number of
candidates can be discarded directly if certain conditions are met. We used the following
three exclusion criteria:
pos-combinations The combination of many part-of-speech tags is never relevant, for
instance if one of the tags is $, (punctuation). Other combinations are only
sensible under very specific and statistically unlikely circumstances and should
be ignored. The differentiation is done using a weight matrix that reflects the
probability of a pos-combination being of interest.
element irrelevancy For the resulting tuple to be semantically meaningful, it is nec-
essary that both of its elements are themselves meaningful, i.e. that they have a
non-empty list of synsets. This automatically reduces the search-space significantly
because GermaNet only covers adjectives, adverbs, nouns and verbs.
element distance In order to further narrow down the number of candidates, the tuple
generation is limited to only pair a word with those in its immediate context instead
of all words in the sentence. The context is thus defined as the set of tokens in the
direct neighborhood of the current token. We have used two different measures
when defining a word’s context: the lexicographic distance and the pos-weighted
distance. The former measures the distance in equidistant steps, i.e. each token
has a distance value of one. The latter uses a variable distance value depending
on the pos-tag of the token.
Figure 5.5.: Context of the word Lust with a search radius of 3.0 using lexicographic(top) and pos-weighted distance values (bottom)
Figure 5.5 illustrates how this affects the search radius when computing the context
of the word Lust. If a lexicographic distance is used, the verb hatte is not part of the
context and the combination Lust haben (eng. to feel like doing something) is thus not
included in the search-space. Using the pos-weighted distance extends the radius far
enough. For the sake of clarity, the sentence contains no nested tokens, but the concept
can trivially be applied to include those as well. How the distance values for each tag
were determined and how it improves the tuple generation of our application will be
discussed in section 6.4.
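The pos-weighted context computation can be sketched as follows. The sentence fragment and the distance weights are invented to reproduce the effect shown in figure 5.5: the verb hatte only enters the context of Lust once pos-weighted distances are used.

```python
# hypothetical pos-distance weights; unlisted tags default to 1.0
POS_DISTANCE = {"ADV": 0.5, "PIAT": 0.5}

def context(tokens, index, radius=3.0, weighted=True):
    """Collect the words around tokens[index] whose accumulated distance
    stays within the search radius, walking left and then right."""
    ctx = []
    for step in (-1, 1):
        dist, i = 0.0, index + step
        while 0 <= i < len(tokens):
            dist += POS_DISTANCE.get(tokens[i][1], 1.0) if weighted else 1.0
            if dist > radius:
                break
            ctx.append(tokens[i][0])
            i += step
    return ctx

sent = [("hatte", "VAFIN"), ("eigentlich", "ADV"), ("gar", "ADV"),
        ("keine", "PIAT"), ("Lust", "NN")]
print(context(sent, 4, weighted=False))  # ['keine', 'gar', 'eigentlich']
print(context(sent, 4, weighted=True))   # ['keine', 'gar', 'eigentlich', 'hatte']
```

With equidistant steps, hatte lies at lexicographic distance 4 and falls outside the radius of 3.0; with the weighted measure, the intervening function words contribute only 0.5 each, so hatte is reached at an accumulated distance of 2.5.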
5.4.3. Scoring
Having generated the set of candidate tuples, the next step is to assign a score to each
candidate using a scoring function σ. The tuples with the highest score are assumed to
be most likely to be descriptive for the text.
Given a tuple t, the value of σ(t) should reflect the general probability of t being descrip-
tive, its conciseness and possibly the concrete values of its constituents. Consider the
part-of-speech combination (CARD, NN) that is generally of little interest, except if the
noun is Celsius or Fahrenheit. In that particular case, the tuple denotes a temperature
and could be considered important. Likewise, concise tuples should score higher than
tuples with very fuzzy meanings. The tuple (Fall, gehen), for instance, has a total of
4 × 15 = 60 possible meanings. Word sense disambiguation mitigates this problem to
an extent but cannot solve it entirely.
Actual implementations of σ can access a wide variety of information, including all se-
mantic annotations and lexical information of the tokens. The following is an incomplete
list of possible parameters:
- the part-of-speech tags
- the lexicographical or pos-weighted distance between the tokens
- the number of synsets
- properties of the synsets (e.g. penalty for very broad terms)
- the positioning of the tokens within the text
When all tuples are scored, a subset of them will be selected and passed on to the
final step. This can be done by either selecting the top n tuples, discarding all tuples
below a threshold τ , or a combination of both. Section 6.4 will discuss in detail what
parameters turned out to be most significant and what results different implementations
of σ yielded.
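One possible shape of σ and the subsequent selection step can be sketched as follows. The concrete weighting and the ambiguity penalty are assumptions for illustration; the parameters actually used are the subject of section 6.4.

```python
def sigma(pair, weight_matrix, distance):
    """Hypothetical scoring: the pos-combination weight, damped by word
    distance and by the ambiguity (synset count) of the constituents."""
    t1, t2 = pair
    weight = weight_matrix.get((t1["pos"], t2["pos"]), 0.0)
    ambiguity = len(t1["synsets"]) * len(t2["synsets"])
    return weight / (distance * ambiguity) if ambiguity else 0.0

def select(scored_pairs, n=10, tau=0.85):
    """Keep the top-n pairs whose score reaches the threshold tau."""
    ranked = sorted(scored_pairs, key=lambda ps: ps[1], reverse=True)
    return [(pair, score) for pair, score in ranked[:n] if score >= tau]

weights = {("VVFIN", "NN"): 1.0}  # binary weight matrix, hypothetical
trinken = {"pos": "VVFIN", "synsets": {"drink.v.1"}}
bier = {"pos": "NN", "synsets": {"beer.n.1"}}
print(sigma((trinken, bier), weights, distance=1))  # 1.0
```

A pair of unambiguous, directly adjacent words with a relevant pos-combination scores the maximum of 1.0; more distant or more ambiguous pairs are penalized accordingly.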
5.5. Result Set
The final step in the processing pipeline is the compilation of the actual result set. The
result consists of a list of concepts and a set of keywords. The list of concepts is built by
expanding composite tokens (as described at the beginning of section 5.4)
and selecting the best candidate tuples. Keywords are subdivided into locations, named
entities and people7. While the explicit listing of named entities provides little additional
value compared to a simple full text search, it can be used to efficiently generate
indexes for cross-referencing texts. A complete example output of such a result set can
be seen in appendix B.
7Technically, locations and people are also named entities. The difference is that locations and people have semantic meaning in the form of an assigned synset, while for named entities no information is available beyond their textual representation.
Chapter 6.
Evaluation
This chapter will discuss the evaluation of our system. We will first go into some partic-
ularities of electronic program guides, how they differ from “regular” text corpora and
what that implies for this work. Section 6.2 introduces our test corpus and discusses the
results of a preliminary evaluation we have done in order to gather empirical data. In
section 6.3, individual system components are evaluated, namely the compound splitter,
the word sense disambiguation module and the semantic chunkers. The scoring strategy
for heuristic token pairs is discussed in section 6.4. Finally, the system as a whole is
evaluated in section 6.5 and its limitations are discussed.
6.1. Particularities of EPGs
Unlike most text corpora commonly used in Text Mining, the entries of an EPG are
very heterogeneous. In particular the following five aspects are of interest:
• Style
The style in which the program descriptions are written varies greatly, particularly
between genres. For instance, thrillers and mystery stories are commonly written
in a prose-like fashion, history documentaries or political magazines usually follow
a more formal style, and programs aimed at teenagers often make use of lurid
formulations.
As a consequence of this, there are no clearly recognizable patterns for the use of
punctuation, reported speech or other structural aspects. This means that it is
very difficult for the tokenizer and pos-tagger to ensure that these constructs are
handled consistently across the corpus.
• Length
While some descriptions are very comprehensive and exceed 400 words,
others are only one or two sentences long. In some cases, the description field only
contained a few catchwords. Figure 6.1 illustrates the distribution of text length
measured in an EPG dump of roughly 2000 entries, excluding those with empty
descriptions. The average length of a program description is approximately 100
words. About 50% of all EPG entries had no description.
Figure 6.1.: Distribution of description length in number of words
• Quality
Most text corpora to which semantic analysis is applied meet a certain standard,
for instance scientific papers, newspaper articles or books. This means that these
texts are of a high quality, particularly with regard to grammar, spelling and use
of colloquialisms. Parsing texts from these corpora is thus often easier and less
error prone compared to descriptions from EPGs.
• Actuality
Descriptions of newscasts or political discussions often make use of up-to-date vocabulary.
This is not limited to geographical locations or names of people currently
in the spotlight, but includes the latest “buzzwords”. Because public interest shifts
almost on a daily basis, ontologies like GermaNet are hesitant to incorporate
these words and only do so with a delay of several months, if at all. As a
consequence of this, the semantic analysis of these programs can be very difficult.
(a) over all words (b) over relevant words
Figure 6.2.: Distribution of pos-tags in the test corpus
• Extrinsic Context
Many program descriptions are written under the assumption that the reader is,
to some extent, already familiar with the program. These texts are said to have
an extrinsic context. TV series in particular are prone to this. For instance, a
regular viewer of House M.D. will know that “Thirteen” refers to a person and
that the series takes place in Princeton, New Jersey. Information like this is rarely
mentioned explicitly, and if so, only at the beginning of a series. It is as such
commonly not available to an analyzing system.
6.2. The Test Corpus and Preliminary Evaluation
In order to evaluate the performance of the system objectively, it was necessary to build
a test corpus as a point of reference. We selected 150 texts from various genres with a
length between 50 and 175 words. All texts were tokenized and each token was manually
annotated with its lemma, part-of-speech tag and a list of adequate synsets.
We wanted to gather some empirical data in terms of what combinations of pos-tags are
most commonly considered descriptive, and how the distance between words factors in.
In order to do this, we implemented a website where users were shown a list of twenty
texts from our corpus (see figure 6.3). Given a list of predetermined word pairs, the
user was asked to select the most descriptive pairs for each text, drag-and-drop them
into a separate list and order them by descriptiveness. The set of “candidate pairs”
Figure 6.3.: Screenshot of the Sliver web application
was created using a prototypic implementation of our system. Each word in a sentence
was paired with every other word in its immediate surrounding, up to a maximum
lexicographic distance of four. This resulted in approximately 58,000 pairs of words, or
about 380 pairs per text, which was obviously too much for any user to survey in a
reasonable amount of time.
The difficulty was to reduce the number of candidates to a manageable amount without
anticipating the results. In other words, it was necessary to only filter out pairs where
one could be sufficiently sure that they were not relevant. It was decided to ignore all
tokens that were, among others, tagged as conjunctions, negations, or numbers. Also
excluded were several pos-combinations, for instance pairs of auxiliary verbs. Finally,
all words were rejected that could not be found in GermaNet and had, as such, no
accessible semantic value. The number of candidates was thus reduced from 58,000 to
about 4,300 pairs, averaging a manageable 29 tuples per text.
Figure 6.4 shows the distribution of tuples that were perceived as descriptive with respect
to the possible pos-combinations. Subfigure (a) show the raw data as entered by the
users, subfigure (b) shows the adjusted distribution of pos-combinations after removing
outliers (i.e. combinations that occurred less than five times) and combinations that
we did not consider to be descriptive in our intended sense, for example noun/noun-
combinations. During a period of three weeks, 73 users participated and created a total
(a) all:

t1.pos  t2.pos  p
ADJA    NN      12.9%
NN      NN      9.9%
NN      VVFIN   9.1%
APPR    NN      5.3%
NN      VAFIN   5.3%
ART     NN      4.1%
ADJD    NN      3.3%
ADJA    VVFIN   2.8%
NN      VVPP    2.6%
NN      VVINF   2.4%
NE      NN      2.2%
ADJA    ART     1.6%
NN      PPOSAT  1.6%
ADJA    VAFIN   1.4%
ADJA    APPR    1.4%
VAFIN   VVPP    1.3%
APPR    VVFIN   1.3%
other           31.5%

(b) adjusted:

t1.pos  t2.pos  p
ADJA    NN      29.4%
NN      VVFIN   20.7%
ADJD    NN      7.5%
ADJA    VVFIN   6.3%
NN      VVPP    6.0%
NN      VVINF   5.6%
ADJD    VAFIN   2.8%
ADV     NN      2.8%
ADJD    VVFIN   2.5%
VMFIN   VVINF   1.6%
ADJA    VVPP    1.5%
ADJA    VVINF   1.4%
ADV     VVFIN   1.2%
ADJD    VVPP    1.2%
ADJD    VVINF   1.2%
ADJD    ADJD    1.1%
ADJA    ADV     1.1%
other           6.3%

Figure 6.4.: Perceived relevance of pos-tag combinations
of 1286 evaluations, or 8.6 evaluations per text in the test corpus. The interface was
a web 2.0 application written using ExtJS[43] and jQuery[44]. It was designed to be
easily accessible and self-explanatory, especially for users who were not familiar with these
technologies. The backend was implemented in PHP[45] using the Zend Framework[46]
and a MySQL[47] database.
Two conclusions in particular could be drawn from analyzing the user evaluations:
• over 75% of all pairs fall into the same six classes of pos-combinations
• the perceived relatedness between words is directly connected to their distance
We have addressed the first observation by implementing the A/N and A/V chunkers
which enabled us to capture a large quantity of descriptive terms without the need for
complex analysis. Other important classes of noun/verb combinations are harder to
recognize because the two words are usually not directly adjacent, as they were in
the A/N and A/V patterns.
Figure 6.5.: Empirical correlation between the distance of two words and their perceivedsemantic connectedness
Figure 6.5 illustrates the empirical correlation between the lexicographic distance of two
words and their perceived relatedness. Almost half of all pairs that were considered
descriptive by the users were direct neighbors. Almost anything beyond a distance of
three was not considered relevant.
We used this data to determine the parameters for the tuple generation and the scoring
function. Using a search radius greater than three words will increase the size of the
search space exponentially while only marginally improving recall.
6.3. Component Performance
6.3.1. Compound Splitting
Our test corpus contained a total of 639 words that were tagged as nouns but could not
be found in GermaNet, averaging 4.3 per text. This includes wrongly tagged words
as well as foreign words and abbreviations. Out of these, 71.4% satisfied the conditions
specified in section 5.3.2 and were marked for decompounding. The evaluation has shown
that our set of criteria was tight and led to an accuracy of 99.7%. Only in edge
cases was a word split wrongly; for instance, the word Adventure was lemmatized to
Adventur and split into Advent·ur.
6.3.2. Word Sense Disambiguation
For the evaluation of the word sense disambiguation algorithm (WSD) described in
section 5.3.4 we have chosen the following parameters.
Context Size The context of a word consisted of all nouns in the sentence itself, and
the sentences following and preceding it. Interestingly, increasing the size of the context
window had only very little impact on the performance. This is because the texts in
our test corpus had 6.25 sentences on average, so the seemingly small context window
already covered approximately half of the entire text.
Similarity Measure The Leacock-Chodorow similarity measure was used to calculate
the semantic relatedness of the synsets (see 3.7.3). It was extended to include all relations
modeled in GermaNet.
Threshold Value Let the set fp be defined as the set of unambiguous tokens in the
context of t (see figure 5.2). Let α be defined as the minimum average similarity, and
β as the maximum average similarity between the synsets of t and fp:

    α(t, fp) := min_{s ∈ t.senses} sim(s, fp)
    β(t, fp) := max_{s ∈ t.senses} sim(s, fp)

The threshold value τ that we chose for disambiguating the token t is then defined as
the mean of α and β:

    τ := (α(t, fp) + β(t, fp)) / 2
In other words, the algorithm discards all senses that are “less than averagely similar to
the context”. Other definitions of τ have shown only little impact on the overall result
in our evaluation.
The size of the interval [α(t, fp); β(t, fp)] is an indicator for the confidence with which
t can be disambiguated. To decrease error, our implementation required a minimum
interval size of δ. Our evaluation has shown that a value of δ = 0.07 yields good results.
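The threshold-based filtering can be sketched as follows. The similarity values are invented to mirror the Sohn example discussed later in section 6.3.2; the sense identifiers are hypothetical:

```python
def disambiguate(senses, fixed_points, sim, delta=0.07):
    """Keep only senses at least 'averagely similar' to the unambiguous
    context tokens (fixed points); leave the token untouched when the
    interval [alpha, beta] is too small to decide with confidence."""
    scores = {s: sum(sim(s, f) for f in fixed_points) / len(fixed_points)
              for s in senses}
    alpha, beta = min(scores.values()), max(scores.values())
    if beta - alpha < delta:
        return set(senses)
    tau = (alpha + beta) / 2.0
    return {s for s, v in scores.items() if v >= tau}

# hypothetical similarity values against a single fixed point
sim = lambda s, f: {"son_of_god": 0.9, "male_child": 0.5, "sonny": 0.2}[s]
print(disambiguate({"son_of_god", "male_child", "sonny"}, ["context"], sim))
# {'son_of_god'}
```

Here α = 0.2 and β = 0.9, so τ = 0.55 and only the most densely interconnected sense survives, which also illustrates how an inappropriate but well-connected sense can win.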
Merge Function To preserve as much conciseness as possible, our implementation only
merges synsets that are very closely related. More precisely, synsets are only merged if
they stand in a parent-child or sibling relationship in the subsumption hierarchy:

    merge(S) := lso(S)  if ∀s ∈ S : dist(s, lso(S)) ≤ 1
    merge(S) := S       otherwise

with S ⊆ P(S), where lso(S) denotes the lowest super-ordinate of S.
If multiple subsets of S can be merged, it is necessary to combine them “bottom-up” so
that the merging can properly chain upwards.
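The merge rule can be sketched with a toy hypernym map; the synset identifiers are hypothetical, and the dollar example follows the anecdote in the evaluation below:

```python
# toy hypernym map with hypothetical synset identifiers: child -> parent
PARENT = {"us_dollar": "dollar", "au_dollar": "dollar", "dollar": "currency"}

def dist(s, ancestor):
    """Edge distance from s up to ancestor, or infinity if unrelated."""
    d = 0
    while s is not None and s != ancestor:
        s, d = PARENT.get(s), d + 1
    return d if s == ancestor else float("inf")

def merge(synsets):
    """Collapse a sibling or parent-child group to its lowest super-ordinate."""
    candidates = set(synsets) | ({PARENT.get(s) for s in synsets} - {None})
    for candidate in candidates:
        if all(dist(s, candidate) <= 1 for s in synsets):
            return {candidate}
    return set(synsets)

print(merge({"us_dollar", "au_dollar"}))  # {'dollar'}
```

Siblings collapse to their common parent, a parent-child pair collapses to the parent, and unrelated synsets are left untouched; applying this bottom-up lets the merging chain upwards.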
WSD performance
Table 6.1 shows the results of our evaluation using the parameters described above.
Three values were used to measure the performance. If none of the synsets assigned
to a word is adequate, it is considered a complete miss. Almost 4% of all nouns fall
into that category even before WSD is applied. This is due to faulty pos-tagging or
lemmatization. Average purity describes the percentage of synsets that were marked as
adequate. The average ambiguity denotes the average number of synsets assigned to a
noun.
                      default     WSD            WSD
                      (no WSD)    (no merging)   (merging)
Complete Misses       3.96%       9.04%          7.98%
Average Purity        75.06%      91.76%         95.53%
Average Ambiguity     2.93        1.95           1.72

Table 6.1.: Performance of word sense disambiguation
In 6% of the cases synsets could be merged and thus conciseness increased. One case
in particular benefited greatly from this: GermaNet assigns 15 synsets to the word
dollar, each representing a concrete currency (US Dollar, Australian Dollar, and so forth).
Merging these to their lowest super-ordinate naturally decreased ambiguity by a great
margin.
The results show that word sense disambiguation has for the most part significantly
improved the quality of the annotations in terms of conciseness. However, in some cases,
the algorithm has discarded all adequate meanings and led to awkward results. For
instance, the word Sohn (eng. son) is more often than not disambiguated to the sense son
of god instead of male child. In these cases, the inappropriate meaning was more densely
interconnected within GermaNet, which led to a higher maximum similarity.
6.3.3. Chunking
An important aspect of the implemented chunking patterns was that they should be
semantically safe (see section 5.3.5). In other words, the process of chunking should
introduce as little error as possible. As the precision values in table 6.2 show, this was
in most instances achieved. The majority of errors made were due to faulty pos-tags
or out-of-vocabulary words, especially in the case of the NES and PPR chunker.

Name   n     recall   precision   f-score
A/N    796   99.4%    98.6%       99.0%
A/V    102   98.0%    97.0%       97.5%
ANN    19    95.0%    100.0%      97.4%
NES    130   92.3%    89.6%       90.8%
PPR    30    79.1%    97.7%       87.4%
PHR    11    -        -           -

Table 6.2.: Performance of chunkers

The
distinction between NN and NE tags was sometimes poor, in particular when a named
entity was immediately followed by a noun, or vice versa. Another issue is names
that are also regular nouns, which is not uncommon in German (e.g. the word
Bäcker is a common surname as well as a profession). The performance of the A/N and
A/V chunkers was very good. It was only in very few edge cases that they had false
positives, e.g. “Er war immer freundlich Tieren gegenüber”, where the adjective freundlich
(eng. friendly) does not belong to the noun Tier (eng. animal). The phrase chunker
had little overall impact, but where it could be applied, it often combined three to
four words and improved the local performance significantly. Recall and precision were
not calculated because the definition of what a phrase is and whether it is included in
GermaNet is more or less arbitrary.
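The f-scores in table 6.2 are the harmonic mean of precision and recall; a quick check reproduces two of the rows:

```python
def f_score(precision, recall):
    """F-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# reproducing the A/N and PPR rows of table 6.2
print(round(f_score(0.986, 0.994) * 100, 1))  # 99.0
print(round(f_score(0.977, 0.791) * 100, 1))  # 87.4
```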
6.4. Scoring
Our evaluation has shown that the predominant parameters for scoring token pairs are
the weight matrix that assigns a “relevancy weight” to a combination of pos-tags, and
the distance between the elements of the pair.
Weight Matrix The STTS tag set consists of 55 tags, so the weight matrix has a total of
3025 entries. However, as the data in figure 6.2 suggests, only a little over ten tags are relevant
and the matrix is thus very sparse. This is primarily due to the fact that only verbs,
nouns and adjectives carry any semantically relevant meaning which already reduces the
number of “interesting” pos-tags to 17. Further analysis has shown that only about 30
combinations of tags can be considered reliable in terms of semantic connectedness.
Our initial assumption was that a real-valued weight matrix would be favorable as it
allowed for a more fine-grained ranking of the tuples. In practice, however, the decision
whether a token pair is descriptive or not has for the most part turned out to be a binary
classification problem.
Word Distance and Search Radius Figure 6.5 indicates a strong correlation between
the distance of two words and their likelihood of being semantically connected. This
directly implies that a search radius of 3 to 4 is sufficient to include most relevant pairs.
If a pos-weighted distance is used where most tags are mapped to a distance value of
less than one, a search radius of 2 to 3 is adequate.
Other factors, in particular the ambiguity of a word, its location in the text or properties
of its assigned synsets, seem to have no measurable impact on perceived relevancy.
This is not surprising because, for one, users rarely give the ambiguity of a word much
thought unless it is really striking. Put differently, ambiguity is an insignificant factor
unless it cannot easily be resolved. Naturally, texts with little contextual information
try to avoid these cases.
Furthermore, while the location of a word in a text can be important in some genres (e.g.
papers where important terms tend to be concentrated in the abstract and conclusion),
we were unable to identify any such correlation for EPGs.
Chapter 6. Evaluation
6.5. Overall Performance
Measuring the performance of the system as a whole is not a trivial task, for two
reasons. First, there is no definitive reference solution to compare against: word
pairs some users find very descriptive may seem irrelevant to others. Second, the use
of a bounded search radius puts a hard limit on our system’s capacity to detect pairs
whose constituents are far apart. Given these circumstances, we have evaluated the
system’s performance within its technical limits. In a second step we analyzed how
many important concepts the implementation has missed.
Figure 6.6.: Performance of heuristic tuple generation: (a) simple, (b) weighted
Figure 6.6 shows the percentage of hits and misses for all pairs with a score up to the
threshold specified on the x-axis. Figure (a) uses the lexicographic word distance
and a binary weight matrix; figure (b) uses pos-weighted distances and a real-valued
matrix. As one can see, the more fine-grained weight matrix and distance measure
increase performance by 10%. The empirically best cut-off point is at around 0.85
(the highest possible score for a token pair is 1.0).
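Applying the empirically determined cut-off can be sketched as follows. The pairs and scores are taken from the example output in Appendix B (decimal commas converted to points); the threshold is the value named above.

```python
# Sketch of the cut-off step: keep only token pairs whose score
# reaches the empirically determined threshold of 0.85. Pairs and
# scores follow the Sliver example output in Appendix B.
THRESHOLD = 0.85

scored_pairs = [
    (("Abenteuer", "bestehen"), 1.0),
    (("erfahren", "Grund"), 0.714),
    (("Onkel", "erben"), 0.5),
]

accepted = [pair for pair, score in scored_pairs if score >= THRESHOLD]
print(accepted)  # only the highest-scoring pair survives
```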
Using that threshold, the heuristic yielded an average of 0.4 concepts per sentence (or 2.2
per text). Given that the heuristic is primarily used to identify pairs containing
a verb, this is not surprising. Together with the 6.3 concepts per text obtained from
expanding the A/V and A/N chunks, this totals approximately 8.4 concepts per description.
On average, 1.2 important concepts per text were missed because their elements were
too far apart and thus beyond the search radius. The primary reason for this is verbal
brackets (see next section).
6.6. Unresolved Issues
During our evaluation we encountered several issues that could not be resolved with the
current implementation. The reason for this is that some grammatical and linguistic
constructs lead to a word order that is structurally identical to some of the patterns we
have assumed to be safe for chunking. It is unlikely that the following problems can be
solved without incorporating a deep grammatical analysis.
• Sentence Adverbials
Some adverbs do not refer to a particular word of a sentence, but to the sentence as
a whole. These are called sentence adverbials. Consider the sentence “Surprisingly,
he left early”. In this instance, surprisingly is a comment on the entire sentence.
Sentence adverbials are less problematic in English since they are in most cases
separated by a comma. This is not the case in German. As a result of this, an
adverb is sometimes erroneously paired with a verb in its proximity:
(1) Langsam hatte er genug.
(2) Langsam fuhr er um die Ecke.
In the first case, the adverb is a qualifier for the entire sentence, whereas in the
second case it belongs to the verb. The pair (langsam, haben) makes little sense,
while (langsam, fahren) is very descriptive.
• Verbal Brackets
The splitting of verbs into two parts is a feature of German grammar that is very
problematic. It is not at all uncommon that two words that are syntactically
and semantically related end up on opposite ends of a sentence:
Peter riet Lois dann doch von der Reise ab.
Here, riet and ab are the two elements of the word abraten (eng. to advise against).
The problem is further complicated by the fact that both riet and ab can occur independently:
“Ab dann riet er nur noch”.
• Idioms
A more obvious problem is idiomatic phrases such as “to bite the bullet”. Some of
these expressions are covered by GermaNet, but the majority are not.
Chapter 7.
Conclusion And Future Work
With several thousand TV shows being broadcast every single day, keeping track
of interesting programs is a tedious and time-consuming chore. This has led to the
emergence of recommender systems that analyze a user’s behavior and suggest programs
similar to what he or she has enjoyed watching in the past. To work effectively, it is
necessary to process and compare program descriptions on a semantic level. Gaining a
complete understanding of natural language text is a difficult and by no means solved
problem, even more so for loosely structured and colloquially written texts as commonly
found in EPGs. In view of this, attaining partial text comprehension is a sensible
compromise.
In this thesis we have described and implemented a system for augmenting natural
language text with semantic annotations. Using both semantic chunking and heuristic
tuple generation, we were able to create a list of descriptive word pairs that can be used
to assess semantic relatedness between documents on a per-concept basis.
Our evaluation has shown that the combination of “safe” chunking and heuristic tuple
generation can provide useful information about the contents of a program description
and the people and locations involved. Its performance, however, is limited by the
accuracy of the heuristic and, more directly, by the comprehensiveness of the underlying
ontology. Furthermore, errors made by the pos-tagger and other components as well as
certain grammatical constructs can cause it to miss important descriptive elements. For
this reason, the annotations should be understood as an auxiliary technique that, used
in conjunction with other methods, can improve the performance of applications like
recommender systems.
The performance improvement achieved by using pos-weighted word distances suggests
that implementing more advanced techniques, such as adaptive word distances that
assign weights based on the local context, could prove worthwhile. This would, however,
require the involvement of linguistic experts or extensive statistical analysis.
One area in which our application could be improved is the use of a common data source
for all dictionary-based components. In the current implementation, LemServer,
JWordSplitter and GermaNet all use their own dictionaries. Consolidating them
into a central, authoritative data pool could lead to more coherent and comprehensible
results, and would make it easier to identify why a certain term does not show
up in the list of extracted concepts. A promising approach to this issue could be the
integration of the hyphenation data maintained by trennmuster.org [48], either as an
extension to or a replacement for JWordSplitter.
Another way to integrate the different components more tightly could be the incorpo-
ration of the Wiktionary project [49], which provides grammatical annotations, sense
annotations, hyphenation points, a list of synonyms and a gloss, among other things.
Further research is needed to estimate the applicability of this data to our system,
in particular with respect to vocabulary coverage, the completeness of the annotations
and how it could be interconnected with GermaNet.
The question whether our implementation could be applied to languages other than
German arises naturally. While specific aspects of our system are tailored to German
(e.g. decompounding), most methodologies described in this thesis should be applicable
to languages that are grammatically and syntactically similar to German. In practice,
however, our system relies on the STTS tag set, which is used throughout the entire
application. An interesting approach to resolving this limitation and achieving language
independence is the use of tagset drivers as proposed in [Zem08].
In conclusion, our system performs well within its intended area of application and
yields valuable semantic annotations. While it naturally does not achieve the compre-
hensiveness of approaches that employ deep semantic parsing or other complex linguistic
models, it is fast, extensible and robust to the heterogeneity of descriptions found in
EPGs, and it leaves various extension points for future projects to enhance and improve
upon this work.
Bibliography
[BH06] Alexander Budanitsky and Graeme Hirst. Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics, 32(1):13–47, 2006.
[BN85] Wilhelm Barth and Heinrich Nirschl. Sichere sinnentsprechende Silbentrennung für die deutsche Sprache. Angewandte Informatik, 27(4):152–159, 1985.
[BP02] Satanjeev Banerjee and Ted Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In CICLing, pages 136–145, 2002.
[BPP96] Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1):39–71, 1996.
[Bri92] Eric Brill. A simple rule-based part of speech tagger, 1992.
[Cha97] Eugene Charniak. Statistical Techniques for Natural Language Parsing. AI Magazine, 18(4):33–44, 1997.
[CN03] Hai Leong Chieu and Hwee Tou Ng. Named entity recognition with a maximum entropy approach. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 160–163, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[Der89] S. J. Derose. Stochastic methods for resolution of grammatical category ambiguity in inflected and uninflected languages. PhD thesis, Brown University, Providence, RI, USA, 1989.
[EP06a] Katrin Erk and Sebastian Pado. Shalmaneser - A flexible toolbox for semantic role assignment. In Proceedings of LREC 2006, Genoa, Italy, 2006.
[EP06b] Katrin Erk and Sebastian Pado. Shalmaneser - A Toolchain For Shallow Semantic Parsing. In Proceedings of LREC-06, Genoa, Italy, 2006.
[Fil68] Charles Fillmore. The case for case. In Emmon Bach and R. Harms, editors, Universals in Linguistic Theory. Holt, Rinehart, and Winston, New York, 1968.
[Fil82] Charles J. Fillmore. Frame semantics. In Linguistics in the Morning Calm, pages 111–137, 1982.
[FMS03] Kostas Fragos, Yannis Maistros, and Christos Skourlas. Word sense disambiguation using WordNet relations. In Proc. of the 1st Balkan Conference in Informatics, Thessaloniki, 2003.
[FZ98] M. Fuller and J. Zobel. Conflation-based comparison of stemming algorithms. In Proceedings of the Australian Document Computing Symposium, pages 8–13, Sydney, Australia, August 1998.
[GG93] Thomas R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5:199–220, 1993.
[GHC98] Niyu Ge, John Hale, and Eugene Charniak. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 161–170, 1998.
[Gie08] Eugenie Giesbrecht. Evaluation of pos tagging for web as corpus. Master's thesis, University of Osnabrück, 2008.
[GJ02] Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288, 2002.
[GR71] Barbara B. Greene and Gerald M. Rubin. Automatic Grammatical Tagging of English. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island, 1971.
[JZ07] Jing Jiang and ChengXiang Zhai. An empirical study of tokenization strategies for biomedical information retrieval. Inf. Retr., 10(4-5):341–363, 2007.
[KP96] Wessel Kraaij and Renée Pohlmann. Viewing stemming as recall enhancement. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 40–48, 1996.
[KS63] Sheldon Klein and Robert F. Simmons. A Computational Approach to Grammatical Coding of English Words. J. ACM, 10(3):334–347, 1963.
[KSNM03] Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. Named entity recognition with character-level models. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 180–183. Edmonton, Canada, 2003.
[KXW03] Chunyu Kit, Zhiming Xu, and Jonathan J. Webster. Integrating ngram model and case-based learning for Chinese word segmentation. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 160–163, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[LCM98] Claudia Leacock, Martin Chodorow, and George A. Miller. Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1):147–165, 1998.
[Les86] Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26, New York, NY, USA, 1986. ACM.
[Lez98] Wolfgang Lezius. A freely available morphological analyzer, disambiguator and context sensitive lemmatizer for German. In Proceedings of the COLING-ACL, pages 743–747, 1998.
[Lin98] Dekang Lin. An information-theoretic definition of similarity. In Jude W. Shavlik, editor, ICML, pages 296–304. Morgan Kaufmann, 1998.
[LL94] Shalom Lappin and Herbert J. Leass. An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4):535–561, 1994.
[LMBC05] Yaoyong Li, Chuanjiang Miao, Kalina Bontcheva, and Hamish Cunningham. Perceptron learning for Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop, Jeju, Korea, pages 154–157, 2005.
[Lov68] Julie B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, June 1968.
[LSM95] Xiaobin Li, Stan Szpakowicz, and Stan Matwin. A WordNet-based algorithm for word sense disambiguation. In IJCAI, pages 1368–1374, 1995.
[MBF+90] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. WordNet: An on-line lexical database. International Journal of Lexicography, 3:235–244, 1990.
[Mih07] Rada Mihalcea. Using Wikipedia for automatic word sense disambiguation. In Candace L. Sidner, Tanja Schultz, Matthew Stone, and ChengXiang Zhai, editors, HLT-NAACL, pages 196–203. The Association for Computational Linguistics, 2007.
[MJ07] M. Jursic, I. Mozetic, and N. Lavrac. Learning Ripple Down Rules for Efficient Lemmatization. In Proc. 10th Intl. Multiconference Information Society, pages 206–209, 2007.
[ML95] Joseph F. McCarthy and Wendy G. Lehnert. Using decision trees for coreference resolution. In IJCAI, pages 1050–1055, 1995.
[MM01] Rada Mihalcea and Dan Moldovan. Automatic generation of a coarse grained WordNet. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh, USA, 2001.
[MS05] Tomasz Marciniak and Michael Strube. Beyond the pipeline: Discrete optimization in NLP. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 136–143, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
[Pal94] David D. Palmer. Satz - an adaptive sentence segmentation system. Technical Report UCB/CSD-94-846, EECS Department, University of California, Berkeley, Dec 1994.
[Pav05] Sladjana Pavlovic. Unsupervised Coreference Resolution for German. Master's thesis, Department of Linguistics, Eberhard-Karls-Universität Tübingen, 2005.
[PK00] Thierry Poibeau and Leila Kosseim. Proper name extraction from non-journalistic texts. In Walter Daelemans, Khalil Sima'an, Jorn Veenstra, and Jakub Zavrel, editors, CLIN, volume 37 of Language and Computers - Studies in Practical Linguistics, pages 144–157. Rodopi, 2000.
[PLME08] Joël Plisson, Nada Lavrac, Dunja Mladenic, and Tomaz Erjavec. Ripple Down Rule learning for automated word lemmatisation. AI Commun., 21(1):15–26, 2008.
[Por80] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[PW95] Martha Stone Palmer and Zhibiao Wu. Verb semantics for English-Chinese translation. Machine Translation, 10(1-2):59–92, 1995.
[Rat98] Adwait Ratnaparkhi. Maximum entropy models for natural language ambiguity resolution. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA, 1998. Supervisor: Mitchell P. Marcus.
[RR97] Jeffrey C. Reynar and Adwait Ratnaparkhi. A Maximum Entropy Approach to Identifying Sentence Boundaries. In ANLP, pages 16–19, 1997.
[RS95] Emmanuel Roche and Yves Schabes. Deterministic part-of-speech tagging with finite state transducers. Computational Linguistics, 21:227–253, 1995.
[RSW+98] Bob Rehder, M. E. Schreiner, Michael B. W. Wolfe, Darrell Laham, Thomas K. Landauer, and Walter Kintsch. Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25:337–354, 1998.
[RV95] Atro Voutilainen. A syntax-based part-of-speech analyser. In EACL-95, pages 157–164, 1995.
[San90] Beatrice Santorini. Part-of-speech tagging guidelines for the Penn Treebank Project. Technical report, Department of Computer and Information Science, University of Pennsylvania, 1990.
[SB88] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[Sch94] Helmut Schmid. Probabilistic part-of-speech tagging using decision trees, 1994.
[Sch95] Helmut Schmid. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop, pages 47–50, 1995.
[Süd09] Mediendaten Südwest. Aktuelle Basisdaten zu TV, Hörfunk, Print, Film und Internet. http://www.mediendaten.de/fernsehen-empfang-sender.html, 2009.
[SFK99] E. Stamatatos, N. Fakotakis, and G. Kokkinakis. Automatic extraction of rules for sentence boundary disambiguation, 1999.
[SM04] Lei Shi and Rada Mihalcea. An algorithm for open text semantic parsing. In Proceedings of the ROMAND 2004 Workshop on Robust Methods in Analysis of Natural Language Data, 2004.
[SNL01] Wee Meng Soon, Hwee Tou Ng, and Chung Yong Lim. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544, 2001.
[SRM02] Michael Strube, Stefan Rapp, and Christoph Müller. The influence of minimum edit distance on reference resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 312–319, Philadelphia, July 2002. Association for Computational Linguistics.
[TKMS03] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Human Language Technology Conference (HLT-NAACL 2003), 2003.
[TM00] Kristina Toutanova and Christopher Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora (EMNLP/VLC-2000), October 2000.
[Vol00] Johannes Volmert. Grundkurs Sprachwissenschaft. UTB, Stuttgart, 2000.
[WK92] Jonathan J. Webster and Chunyu Kit. Tokenization as the initial phase in NLP. In Proceedings of the 14th Conference on Computational Linguistics, pages 1106–1110, Morristown, NJ, USA, 1992. Association for Computational Linguistics.
[WM06] René Witte and Jutta Mülle, editors. Text Mining: Wissensgewinnung aus natürlichsprachigen Dokumenten, Interner Bericht 2006-5. Universität Karlsruhe, Fakultät für Informatik, Institut für Programmstrukturen und Datenorganisation (IPD), 2006.
[WQL07] Xin-Jing Wang, Yong Qin, and Wen Liu. A search-based Chinese word segmentation method. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 1129–1130, New York, NY, USA, 2007. ACM.
[Zdz05] Jonathan A. Zdziarski. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. No Starch Press, San Francisco, CA, USA, 2005.
[Zem08] Daniel Zeman. Reusable tagset conversion using tagset drivers. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008.
[ZG06] Torsten Zesch and Iryna Gurevych. Automatically creating datasets for measures of semantic relatedness. In Proceedings of the Workshop on Linguistic Distances, pages 16–24, Sydney, Australia, July 2006. Association for Computational Linguistics.
List of URLs
[1] http://www.sighan.org/
[2] http://www.metacritic.com/
[3] http://snowball.tartarus.org/
[4] http://khnt.aksis.uib.no/icame/manuals/brown/
[5] http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html
[6] http://www.natcorp.ox.ac.uk/
[7] http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html
[8] http://www.ldc.upenn.edu/Catalog/docs/LDC2005T33/BBN-Types-Subtypes.html
[9] http://nlp.cs.nyu.edu/ene/
[10] http://rdfs.org/sioc/spec/
[11] http://www.obofoundry.org/
[12] http://www.geneontology.org/
[13] http://wordnet.princeton.edu/
[14] http://www.ldc.upenn.edu/Catalog/LDC97T12.html
[15] http://www.cse.unt.edu/~rada/downloads.html
[16] http://www.wikipedia.org
[17] http://framenet.icsi.berkeley.edu/
[18] http://framenet.icsi.berkeley.edu/FrameGrapher/grapher.php
[19] http://gemini.uab.es:9080/SFNsite
[20] http://jfn.st.hc.keio.ac.jp/
[21] http://www.laits.utexas.edu/gframenet/
[22] http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/stts.asc
[23] http://opennlp.sourceforge.net/
[24] http://nlp.stanford.edu
[25] http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
[26] http://www.wolfganglezius.de/doku.php?id=public:cl:morphy
[27] http://kt.ijs.si/software/LemmaGen/
[28] http://rupostagger.sourceforge.net/
[29] http://www.cygwin.com/
[30] http://www.aot.ru/download.php
[31] http://www.genios.de
[32] http://www.sfs.uni-tuebingen.de/
[33] http://www.openthesaurus.de/
[34] http://www.openoffice.org
[35] http://koffice.org/
[36] http://gate.ac.uk/
[37] http://incubator.apache.org/uima/
[38] http://www.oasis-open.org/news/oasis-news-2009-03-19.php
[39] http://www.swig.org/
[40] http://coli.uni-saarland.de/projects/salsa/shal/
[41] http://coli.uni-saarland.de/projects/salsa/page.php?id=index-salsa1
[42] http://sourceforge.net/projects/jwordsplitter/
[43] http://extjs.com/
[44] http://jquery.com/
[45] http://www.php.net
[46] http://framework.zend.com
[47] http://www.mysql.com/
[48] http://groups.google.de/group/trennmuster-opensource/
[49] http://www.wiktionary.de/
Index
ambiguity
  conjunctive, 11
  disjunctive, 11
antecedent, see referent
antonym, see antonymy
antonymy, 29
case frame, see frame semantics
case grammar, see frame semantics
causation, 29
chunking, 7, 58
chunking pattern
  adverbial adjective/verb, 61
  annotation, 61
  attributive adjective/noun, 60
  named entities, 61
  personified profession, 62
  phrase, 62
collocations, 11
compound, 21
  appositional, 22
  copulative, primary, 22
  endocentric, 22
  exocentric, possessive, 22
compound splitting, see decompounding
concept, 3, 65
conciseness, 55
context
  extrinsic, 71
coreference chain, 27
coreference resolution, 26
corpus, 3
CR, see coreference resolution
decompounding, 21
distance
  lexicographic, 67
  pos-weighted, 67
entailment, 29
epenthesis, 50
f-measure, 4
f-score, see f-measure
frame semantics, 34
fuzziness, 55
gloss, 30
head sequence, 60
head token, 60
holonym, see holonymy
holonymy, 29
homograph, 4
homonym, 4
hypernym, see hypernymy
hypernymy, 29
hyponym, see hyponymy
hyponymy, 29
inflection, 18
language
  agglutinative, 20
  fusional, 20
  morphemic, 8
lemma, 18
lemmatization, 20
lexeme, 18
lexical category, see part-of-speech
lexical relation, 29
lowest super-ordinate, 32
lso, see lowest super-ordinate
meaning
  concise, see conciseness
  fuzzy, see fuzziness
meronym, see meronymy
meronymy, 29
morpheme, 4
morphology, 18
most specific common subsumer, see lowest super-ordinate
n-gram, 13
named entity recognition, 26
NER, see named entity recognition
noise word, see stop word
ontology, 28
oronym, 13
part-of-speech tagging, 24
polymorpheme, see compound
polyseme, 4
pos, see part-of-speech tagging
precision, 4
qualifier, 61
recall, 4
reference
  anaphoric, 27
  cataphoric, 27
  endophoric, 27
  exophoric, 27
referent, 26
relevant word, 3, 38
scoring function, 67
segment frequency, 14
semantic role labeling, 36
semantic similarity, 32
sentence adverbials, 80
sentence boundary disambiguation, 9
sentence detection, 9
shallow semantic parsing, 34
stem, 18
stemmer, see stemming
  heavy, strong, 19
  light, weak, 19
stemming, 18
stop sequence, 60
stop word, 23
sublanguage, 10
subsumption hierarchy, 32
suppletive, 18
synonym ring, see synset
synonym set, see synset
synset, 4, 29
synset expansion, 59
text corpus, see corpus
tf.idf, 3
token sequence
  head, 60
  stop, 60
  uniform, 60
tokenization
  dictionary-based, 12
  n-gram, 13
  perceptron learning, 13
  search-based, 14
troponym, see troponymy
troponymy, 29
unification function, 59
uniform sequence, 60
valency, 34
  expansion, 35
  reduction, 35
weight matrix, 66
word net, 29
word segmentation, 8
word sense disambiguation, 33
WSD, see word sense disambiguation
Appendix A.
The STTS tag set
Name Meaning
ADJA attributives Adjektiv
ADJD adverbiales oder prädikatives Adjektiv
ADV Adverb
APPR Präposition; Zirkumposition links
APPRART Präposition mit Artikel
APPO Postposition
APZR Zirkumposition rechts
ART bestimmter oder unbestimmter Artikel
CARD Kardinalzahl (Ordinalzahlen sind als ADJA getaggt)
FM Fremdsprachliches Material
ITJ Interjektion
KOUI unterordnende Konjunktion mit “zu” und Infinitiv
KOUS unterordnende Konjunktion mit Satz
KON nebenordnende Konjunktion
KOKOM Vergleichskonjunktion
NN normales Nomen
NE Eigennamen
PDS substituierendes Demonstrativpronomen
PDAT attribuierendes Demonstrativpronomen
PIS substituierendes Indefinitpronomen
PIAT attribuierendes Indefinitpronomen ohne Determiner
PIDAT attribuierendes Indefinitpronomen mit Determiner
PPER irreflexives Personalpronomen
PPOSS substituierendes Possessivpronomen
PPOSAT attribuierendes Possessivpronomen
PRELS substituierendes Relativpronomen
PRELAT attribuierendes Relativpronomen
PRF reflexives Personalpronomen
PWS substituierendes Interrogativpronomen
PWAT attribuierendes Interrogativpronomen
PWAV adverbiales Interrogativ- oder Relativpronomen
PAV Pronominaladverb
PTKZU “zu” vor Infinitiv
PTKNEG Negationspartikel
PTKVZ abgetrennter Verbzusatz
PTKANT Antwortpartikel
PTKA Partikel bei Adjektiv oder Adverb
TRUNC Kompositions-Erstglied
VVFIN finites Verb, voll
VVIMP Imperativ, voll
VVINF Infinitiv, voll
VVIZU Infinitiv mit “zu”, voll
VVPP Partizip Perfekt, voll
VAFIN finites Verb, aux
VAIMP Imperativ, aux
VAINF Infinitiv, aux
VAPP Partizip Perfekt, aux
VMFIN finites Verb, modal
VMINF Infinitiv, modal
VMPP Partizip Perfekt, modal
XY Nichtwort, Sonderzeichen enthaltend
$, Komma
$. Satzbeendende Interpunktion
$( sonstige Satzzeichen; satzintern
Appendix B.
Example output of Sliver
Program description:
“Shaggy und seine liebenswerte Deutsche Dogge, der riesengroße, aber nichtallzu heldenhafte Scooby-Doo, müssen neue Abenteuer bestehen! Alles fängtdamit an, dass Shaggy eine große Summe von seinem spurlos verschwunde-nen Onkel Albert erbt. Zusammen mit Scooby-Doo zieht er in dessen alteVilla. Hier erfahren die beiden den Grund für das plötzliche Verschwindendes Onkels - und damit beginnt für sie eine weitere Reise in aufregende Aben-teuer...”
<?xml version="1.0" encoding="ISO-8859-1"?>
<AnalysisResult>
<Locations/>
<NamedEntities>
<NamedEntity name="Albert"/>
<NamedEntity name="Shaggy"/>
<NamedEntity name="Scooby-Doo"/>
</NamedEntities>
<Pairs>
<Pair>
<AnnotatedToken value="Dogge" pos="NN" lemma="Dogge" synsets="nTier.2108"/>
<AnnotatedToken value="liebenswerte" pos="ADJA" lemma="liebenswert" synsets="aVerhalten.11"/>
</Pair>
<Pair>
<AnnotatedToken value="Dogge" pos="NN" lemma="Dogge" synsets="nTier.2108"/>
<AnnotatedToken value="deutsche" pos="ADJA" lemma="deutsch" synsets="aGesellschaft.396"/>
</Pair>
<Pair>
<AnnotatedToken value="Scooby-Doo" pos="NN" lemma="Scooby-Doo" synsets=""/>
<AnnotatedToken value="heldenhafte" pos="ADJA" lemma="heldenhaft" synsets="aVerhalten.282"/>
</Pair>
<Pair>
<AnnotatedToken value="Abenteuer" pos="NN" lemma="Abenteuer" synsets="nGeschehen.676"/>
<AnnotatedToken value="neue" pos="ADJA" lemma="neu" synsets="aZeit.182,aZeit.335"/>
</Pair>
<Pair>
<AnnotatedToken value="Summe" pos="NN" lemma="Summe" synsets="nBesitz.127,nMenge.678"/>
<AnnotatedToken value="große" pos="ADJA" lemma="groß" synsets="aAllgemein.3,aMenge.134"/>
</Pair>
<Pair>
<AnnotatedToken value="Onkel" pos="NN" lemma="Onkel" synsets="nMensch.89"/>
<AnnotatedToken value="verschwundenen" pos="ADJA" lemma="verschwinden"
synsets="vBesitz.456,vVeraenderung.377,vVeraenderung.1482"/>
</Pair>
<Pair>
<AnnotatedToken value="Villa" pos="NN" lemma="Villa" synsets="nArtefakt.3687"/>
<AnnotatedToken value="alte" pos="ADJA" lemma="alt" synsets="aZeit.235,aZeit.329,aZeit.337"/>
</Pair>
<Pair>
<AnnotatedToken value="Verschwinden" pos="NN" lemma="Verschwinden" synsets="nGeschehen.4142"/>
<AnnotatedToken value="plötzliche" pos="ADJA" lemma="plötzlich" synsets="aZeit.29"/>
</Pair>
<Pair>
<AnnotatedToken value="Reise" pos="NN" lemma="Reise" synsets="nGeschehen.690"/>
<AnnotatedToken value="weitere" pos="ADJA" lemma="weiter" synsets="aZeit.43"/>
</Pair>
<Pair>
<AnnotatedToken value="Abenteuer" pos="NN" lemma="Abenteuer" synsets="nGeschehen.676"/>
<AnnotatedToken value="aufregende" pos="ADJA" lemma="aufregend" synsets="aGefuehl.302"/>
</Pair>
</Pairs>
<HeuristicPairs>
<HeuristicPair score="1,000">
<AnnotatedToken value="Abenteuer" pos="NN" lemma="Abenteuer" synsets="nGeschehen.676"/>
<AnnotatedToken value="bestehen" pos="VVINF" lemma="bestehen"
synsets="vAllgemein.11,vAllgemein.310,vGesellschaft.970,vGesellschaft.998,..."/>
</HeuristicPair>
<HeuristicPair score="0,714">
<AnnotatedToken value="erfahren" pos="VVFIN" lemma="erfahren"
synsets="aGeist.88,vBesitz.290,vKognition.257,vKognition.264"/>
<AnnotatedToken value="Grund" pos="NN" lemma="Grund" synsets="nArtefakt.6315,nMotiv.2,nnatGegenstand.3"/>
</HeuristicPair>
<HeuristicPair score="0,650">
<AnnotatedToken value="spurlos" pos="ADJD" lemma="spurlos" synsets="aprivativ.91"/>
<AnnotatedToken value="Onkel" pos="NN" lemma="Onkel" synsets="nMensch.89"/>
</HeuristicPair>
<HeuristicPair score="0,588">
<AnnotatedToken value="Onkels" pos="NN" lemma="Onkel" synsets="nMensch.89"/>
<AnnotatedToken value="beginnt" pos="VVFIN" lemma="beginnen" synsets="vVeraenderung.6,vVeraenderung.95"/>
</HeuristicPair>
<HeuristicPair score="0,500">
<AnnotatedToken value="Summe" pos="NN" lemma="Summe" synsets="nBesitz.127,nMenge.678"/>
<AnnotatedToken value="spurlos" pos="ADJD" lemma="spurlos" synsets="aprivativ.91"/>
</HeuristicPair>
<HeuristicPair score="0,500">
<AnnotatedToken value="Onkel" pos="NN" lemma="Onkel" synsets="nMensch.89"/>
<AnnotatedToken value="erbt" pos="VVFIN" lemma="erben" synsets="vBesitz.374"/>
</HeuristicPair>
<HeuristicPair score="0,500">
<AnnotatedToken value="zieht" pos="VVFIN" lemma="ziehen"
synsets="vKoerperfunktion.65,vKoerperfunktion.298,vKontakt.18,vLokation.273,..."/>
<AnnotatedToken value="Villa" pos="NN" lemma="Villa" synsets="nArtefakt.3687"/>
</HeuristicPair>
<HeuristicPair score="0,500">
<AnnotatedToken value="beginnt" pos="VVFIN" lemma="beginnen" synsets="vVeraenderung.6,vVeraenderung.95"/>
<AnnotatedToken value="Reise" pos="NN" lemma="Reise" synsets="nGeschehen.690"/>
</HeuristicPair>
</HeuristicPairs>
</AnalysisResult>