
Definitions in Natural Language Processing [Work in Progress]


A document listing several definitions related to the field of Natural Language Processing.


Natural Language Processing

Willem Van Onsem

June 2013

Contents

1 Introduction

2 Finite state automata

3 Morphology

4 Language modeling and Sequential tagging

5 Part-of-Speech Tagging and Shallow Parsing

6 Formal Grammars and Parsers

7 Statistical parsing

8 Computational Lexical Semantics

9 Named Entity Recognition and Semantic Role Labeling

10 Discourse Analysis

11 Temporal Analysis

12 Generic tables
    12.1 Miniature grammar

13 Symbols

List of Figures

1 Nominal inflection
2 Verbal inflection
3 Adjectival derivations
4 Orthographic transducer

List of Tables

1 List of clitics in English
2 List of regular nouns in English
3 List of irregular plural nouns in English
4 List of irregular singular nouns in English
5 List of regular verb stems in English
6 List of irregular verb stems in English
7 List of irregular past verb stems in English
8 List of English Part-of-Speech tags used in the Penn Treebank
9 List of symbols and their meaning


1 Introduction

Definition 1 (Natural language). A natural language is a language spoken and/or written by people.

Definition 2 (Artificial language). An artificial language is a constructed language, such as a programming language or a constructed human language like Esperanto or Klingon.

Definition 3 (Natural language processing). Natural language processing is the ability of a machine to process natural language and thus understand, generate or communicate in such a language.

Definition 4 (Human language technology). Human language technology is a research branch interested in the technological aspects of natural language processing.

Definition 5 (Computational linguistics). Computational linguistics is a research domain where one focuses on the theory behind language (linguistics).

Definition 6 (Ambiguity). Ambiguity is the ability of a string to be interpreted in several ways. In natural language processing, ambiguity comes in two flavors: part-of-speech and word sense.

Definition 7 (Part-of-Speech tagging). Part-of-Speech tagging is a form of disambiguation where, given a word, one classifies this word as a noun, verb, ...

Definition 8 (Word sense disambiguation). Word sense disambiguation is a form of disambiguation between different meanings of the same word. There are three dimensions in word sense disambiguation: pronunciation, spelling and meaning.

Definition 9 (Heterographs). Two words are heterographs if they have a different spelling and meaning.

Definition 10 (Heteronyms). Two words are heteronyms if they have a different pronunciation and meaning.

Definition 11 (Homonyms). Two words are homonyms if they have a different meaning but share the spelling and pronunciation.

Definition 12 (Synonyms). Two words are synonyms if they have a different spelling but share the meaning.

Definition 13 (Phonetics). Phonetics is an analysis level in natural language processing handling the articulatory and acoustic properties of sounds.

Definition 14 (Phonology). Phonology is an analysis level in natural language processing handling the systematic use of sound to encode meaning.

Definition 15 (International Phonetic Alphabet (IPA)). The International Phonetic Alphabet (IPA) is an alphabet containing the various symbols to express pronunciation.

Definition 16 (Morphology). Morphology is an analysis level in natural language processing dealing with the structure of a word. Words are split up into their basic components and one can analyze the appropriate part-of-speech.

Definition 17 (Compounding). Compounding is a morphological rule where a lexeme consists of more than one stem. Compounds are very common in Dutch and German. Compounding can be applied recursively.

Definition 18 (Derivation). A Derivation is a morphological rule where an affix forms a new word from an existing word. For instance "happi-ness". Derivation can be applied recursively.

Definition 19 (Inflection). An inflection is a modification of a word to express different grammatical properties. For instance "work-s".

Definition 20 (Syntax). Syntax is an analysis level in natural language processing where structural relations between words are analyzed. Syntax is also known as grammar.

Definition 21 (Semantics). Semantics is an analysis level in natural language processing where the meanings of the different words are analyzed.

Definition 22 (Pragmatics). Pragmatics is an analysis level in natural language processing where the relationship of the meaning to the goals of the speaker is analyzed. For instance verbatim or ironic, question or statement, ...

Definition 23 (Discourse). Discourse is an analysis level in natural language processing where the relations between sentences are analyzed. For instance, to whom does "her" refer in the sentence "I made her duck."?


Definition 24 (Information retrieval). Information retrieval is an application of natural language processing where one tries to retrieve information by searching for documents, information within documents, metadata, relational databases, ...

Definition 25 (Machine translation (MT)). Machine translation (MT) is an application of natural language processing where a machine tries to translate a fragment from one language into another.

Definition 26 (Dialogue System). A Dialogue System is a system where the user performs a dialogue with a machine through natural language. One of the earliest systems was ELIZA.

Definition 27 (Question answering). Question answering is an application of natural language processing where a user asks a question which is resolved by the machine. An example of such a system is Watson (IBM).

Definition 28 (Automatic summarization). Automatic summarization is an application of natural language processing where a machine summarizes a fragment into a smaller fragment with approximately the same meaning. Automatic summarization comes in different flavors: single text summarization, multi-document summarization and multilingual multi-document summarization.

Definition 29 (Automatic paraphrasing). Automatic paraphrasing is an application of natural language processing where a machine restates a fragment in different words while approximately preserving its meaning.

Definition 30 (Topic detection). Topic detection is an application of natural language processing where a machine tries to determine the topic(s) discussed in a fragment.

Definition 31 (Authorship attribution). Authorship attribution is an application of natural language processing where a machine tries to determine the author of a text, which can for instance be used to detect plagiarism.

Definition 32 (Sentiment analysis). Sentiment analysis is an application of natural language processing where a machine tries to guess what people think and how they feel about certain concepts or products.

Definition 34 (Protocol). A Protocol is a set of rules describing the behavior of a system under several circumstances.

Definition 35 (Machine learning technique). A Machine learning technique is a technique where a machine tries to learn a certain task empirically. Machine learning techniques come in two flavors: unsupervised techniques and supervised techniques.

Definition 36 (Unsupervised techniques). Unsupervised techniques are a set of machine learning techniques where a machine learns from unlabeled data. Since the output is not specified, this method requires lots of data. A huge advantage is that the data can be gathered easily. Unsupervised techniques, however, are only effective for a limited set of problems.

Definition 37 (Supervised techniques). Supervised techniques are a set of machine learning techniques where a machine learns from labeled data. Most machine learning techniques succeed in learning by using these techniques. Generating an annotated dataset, however, can be quite cumbersome.

Definition 38 (Rule-based techniques). Rule-based techniques are a set of techniques where a programmer specifies how the data will be processed based on a set of rules. This approach requires no data at all. Rule-based techniques can be quite cumbersome since most problems require a vast amount of rules. These rules are not always easy to derive and require a lot of work.

Definition 39 (Hybrid approach). A Hybrid approach is a technique to tackle problems where both machine learning and rule-based techniques are combined. For instance, one can use machine learning techniques to learn human-readable rules.

Definition 40 (Annotated data). Annotated data is a form of data where both the input and the expected output are combined as tuples in a single dataset. There exist different flavors of annotated data: Monolingual annotated data, Multilingual annotated data, Parallel annotated data and Comparable annotated data. Furthermore, annotated data can introduce a diverse set of annotation layers: Phonetic transcription annotation layer, Prosodic annotation layer, Morphological annotation layer, Part-of-speech annotation layer, Syntactic parsing annotation layer, Semantic annotation layer and Alignment annotation layer.

Definition 41 (Monolingual annotated data). Monolingual annotated data is a form of annotated data where all data is in one and the same language.

Definition 42 (Multilingual annotated data). Multilingual annotated data is a form of annotated data where the data can be in any of a determined set of languages.


Definition 43 (Parallel annotated data). Parallel annotated data is a form of annotated data that is at least bilingual. Parts of the document in one language can be considered translations of parts of documents in another language.

Definition 44 (Comparable annotated data). Comparable annotated data is a form of annotated data where all data is about the same event (monolingual or multilingual).

Definition 45 (Phonetic transcription annotation layer). The Phonetic transcription annotation layer is an annotation layer describing the text in phonetic characters.

Definition 46 (Prosodic annotation layer). The Prosodic annotation layer is an annotation layer describing the stress patterns in the fragment.

Definition 47 (Morphological annotation layer). The Morphological annotation layer is an annotation layer describing the lemmatization, stemming and morphological aspects of the fragments.

Definition 48 (Part-of-speech annotation layer). The Part-of-speech annotation layer is an annotation layer describing the role of the different words in the fragments.

Definition 49 (Syntactic annotation layer). The Syntactic annotation layer is an annotation layer describing how the text should syntactically be parsed.

Definition 50 (Semantic annotation layer). The Semantic annotation layer is an annotation layer describing the semantic meaning of each data element.

Definition 51 (Alignment annotation layer). The Alignment annotation layer is an annotation layer describing where different parts of a certain fragment begin. For instance: paragraph, sentence, word, subtree, ...

Definition 52 (Bootstrapping). Bootstrapping is a method to speed up the development of an annotated database. In a first stage, data is manually annotated. Based on this set of data one can build a rule-based system, annotate future fragments automatically and manually correct the errors.

Definition 53 (Training set (Development set)). A Training set is a set of annotated data used to train a system with machine learning techniques. The Training set should never be used to test the system; otherwise the system is tested with seen data.

Definition 54 (Test set (Evaluation set)). A Test set is a set of annotated data used to test a system. The Test set should never be used to train the system.

Definition 55 (Gold Standard (Ground truth)). A Gold Standard is a set of perfect solutions (or oracle answers). One can test a system by comparing the output with the Gold Standard. This set is never used to develop a system. Only the final evaluation is conducted with this dataset.

Definition 56 (Cross-validation). Cross-validation is a method to test and train a system with the same (limited) set of data. Each time, n − 1 parts out of n of the dataset are used to train the system and 1 part is used to test the system. This procedure is repeated n times and the final score is the average over the n runs.
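A minimal sketch of this procedure in Python (my own illustration; the train and evaluate callables are hypothetical placeholders for an actual learner and evaluation metric):

    def cross_validate(data, train, evaluate, n=10):
        """n-fold cross-validation: train on n-1 parts, evaluate on the held-out part, average the scores."""
        folds = [data[i::n] for i in range(n)]          # split the dataset into n parts
        scores = []
        for i in range(n):
            held_out = folds[i]                         # 1 part is used to test
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]  # n - 1 parts to train
            model = train(training)
            scores.append(evaluate(model, held_out))
        return sum(scores) / n                          # final score: average over the n runs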

Definition 57 (In vitro). In vitro is a term used to describe testing a system in a lab situation. In such situations we are interested in the accuracy, F-score, BLEU, ...

Definition 58 (In vivo). In vivo is a term used to describe testing a system in a real-life setting. In such settings we are interested in the return on investment, process speed-up, quality improvement, ...

Definition 59 (Accuracy). The Accuracy is a metric to measure the performance of a system. The Accuracy is the percentage of agreement between the output and the gold standard. The Accuracy is sensitive to the granularity of the evaluation together with the human ceiling. In part-of-speech tagging, the Accuracy is the number of correctly tagged words divided by the total number of words.

Definition 60 (Token accuracy). Token accuracy is a way to measure the performance of a system by measuring the accuracy of each lemmatized token. Therefore the word frequency is taken into account.

Definition 61 (Type accuracy). Type accuracy is a way to measure the performance of a system by measuring the accuracy of each word type. Therefore the word frequency is not taken into account.

Definition 62 (Precision). The Precision is an evaluation metric that measures the ratio between the true positives and the number of items that were classified as positive. Therefore it is the proportion of the selected items the system got right. In chunking, the precision is the number of correct chunks given by the system divided by the total number of chunks given by the system.


Definition 63 (Recall). The Recall is an evaluation metric that measures the ratio between the true positives and the number of positive items. Therefore it is the proportion of target items the system selected. In chunking, the recall is defined as the number of correct chunks given by the system divided by the total number of actual chunks in the text.

Definition 64 (F-score). The F-score is an evaluation metric that combines the precision and recall. Therefore the harmonic mean between these metrics is used.
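For reference, these three metrics can be written as follows (a standard formulation, not spelled out in the original notes), with TP, FP and FN the number of true positives, false positives and false negatives:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}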

Definition 65 (Error rate). The Error rate is an evaluation metric equal to 1− accuracy.

Definition 66 (Baseline). A Baseline is a simple reference system against which a more advanced system is compared. In part-of-speech tagging, a common baseline assigns each word its most frequent class.

Definition 67 (Correct chunk). A chunk in a fragment is correct if the chunk label is correct and both boundaries are correct as well.


2 Finite state automata

Definition 68 (Finite State Automaton (FSA)). A Finite State Automaton is a 5-tuple ⟨Q, Σ, q0, F, δ⟩ where Q = {q0, q1, ..., q_{N−1}} is a finite set of N states, Σ a finite input alphabet of symbols, q0 ∈ Q the start state, F ⊆ Q the set of final states and δ : Q × Σ → Q the transition function or transition matrix between states. Given a state q ∈ Q and the next input symbol γ ∈ Σ, δ(q, γ) returns the new state q′ ∈ Q. A Finite State Automaton can be described by a Regular Expression and recognizes a Regular Language. Finite State Automata are used to break words into syllables, for morphology, dictionary building, and parts of machine translation.
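As an illustrative sketch (my own, not from the notes), such a 5-tuple can be implemented directly as a table lookup; the language recognized below ("b", then at least two "a"s, then "!") is an assumption chosen purely for illustration:

    # Transition function delta as a dictionary: (state, symbol) -> next state.
    DELTA = {
        (0, "b"): 1,
        (1, "a"): 2,
        (2, "a"): 3,
        (3, "a"): 3,   # loop: arbitrarily many extra a's
        (3, "!"): 4,
    }
    FINAL = {4}        # F, the set of final states

    def accepts(string, delta=DELTA, final=FINAL, start=0):
        """Return True if the automaton recognizes the string, False otherwise."""
        state = start
        for symbol in string:
            if (state, symbol) not in delta:   # no transition defined: reject
                return False
            state = delta[(state, symbol)]
        return state in final

    # accepts("baaa!") -> True; accepts("ba!") -> False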

Definition 69 (Recognizer). A Recognizer is an abstract machine accepting or rejecting strings.

Definition 70 (Non-deterministic Finite State Automaton (NFSA)). A Non-deterministic Finite State Automaton is a variant of a Finite State Automaton where the signature of the transition function is modified to δ : Q × Σ* → P(Q). An NFSA accepts a string if there exists at least one path that ends in a final state of F. Since we might follow the wrong arc, there are a couple of methods to avoid this:

• Backup: a method where one puts a marker at choice points. Therefore one can return to that point when the first try turned out to be wrong.

• Look-ahead: a method where the program looks to future characters to decide which state to choose.

• Parallelism: a method where a program evaluates every alternative path in parallel.

Definition 71 (Regular Language). A Regular Language over an alphabet Σ is a set of strings. We define regular languages inductively:

• ∅ is a Regular Language

• ∀a ∈ Σ ∪ {ε}, {a} is a regular language.

• For two regular languages L1 and L2:

– L1 · L2 = {xy|x ∈ L1, y ∈ L2}, the concatenation of L1 and L2.

– L1 ∪ L2, the union or disjunction of L1 and L2.

– L1*, the Kleene closure of L1.

Definition 72 (Regular Grammar). A Regular Grammar over an alphabet Σ is a form of grammar where only the following production rules are allowed:

• B → a: where B is a non-terminal and a ∈ Σ

• B → aC: where B and C are non-terminals and a ∈ Σ

• B → ε: where B is a non-terminal and ε denotes the empty string.

Definition 73 (Regular Expression (Regexp)). A regular expression is a means of identifying strings or text of interest, such as particular characters, words or patterns of characters.
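A small illustration (my own example, using Python's standard re module): the pattern below finds word forms ending in the letters "ing", which also shows that regular expressions match surface form rather than morphology:

    import re

    # \b marks a word boundary, \w+ one or more word characters.
    pattern = re.compile(r"\b\w+ing\b")

    print(pattern.findall("She was eating while walking to the spring"))
    # ['eating', 'walking', 'spring']  -- "spring" is a false positive: it only looks like an -ing form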

Definition 74 (Finite State Transducer (FST)). A Finite State Transducer is a 6-tuple ⟨Q, Σ, q0, F, δ, σ⟩ where Q = {q0, q1, ..., q_{N−1}} is a finite set of N states, Σ a finite input alphabet of symbols, q0 ∈ Q the start state, F ⊆ Q the set of final states, δ : Q × Σ* → P(Q) the transition function or transition matrix between states and σ : Q × Σ* → P(Σ*) the output function. Given a state q ∈ Q and the next input symbol γ ∈ Σ, δ(q, γ) returns a set of new states and σ(q, γ) returns a set of possible output strings. Examples are the two transducers for the two-level morphology of Koskenniemi.


3 Morphology

Definition 75 (Morphology). Morphology is the way words are built up from smaller meaning-bearing units called Morphemes.

Definition 76 (Morpheme). A Morpheme is a part of a word with a specific meaning. Morphemes are categorized in the following way:

• Stems: the main part of a word

• Affixes:

– Prefixes

– Suffixes

– Infixes

– Circumfixes

Definition 77 (Stem). A Stem is the main part of a word, supplying its central meaning.

Definition 78 (Affix). An Affix is a part of a word that is not part of the Stem. Affixes are used to give a Stem a different meaning or for the declension of the Stem. Sometimes a word has more than one affix. Especially agglutinative languages tend to string Affixes together.

Definition 79 (Prefix). A Prefix is a part of a word that precedes the Stem. In English, prefixes play an important role in the meaning of a word. For instance "un-buckle", "un-necessary", "ir-regular", "il-legal".

Definition 80 (Suffix). A Suffix is a part of a word that follows the Stem. In English, the suffix plays an important role in the declension of nouns and verbs. For instance "cat-s", "bush-es", "eat-s", "try-ing".

Definition 81 (Infix). An Infix is a part of a word in the middle of the Stem. These are rare in English. For instance "abso-bloody-lutely".

Definition 82 (Circumfix). A Circumfix is a part of a word surrounding the Stem. Circumfixes don't exist in English but are used for past participles in Dutch and German. For instance "ge-zeg-d" and "ge-sag-t".

Definition 83 (Morphotactics). Morphotactics are a set of rules that describe how to combine different affixes.

Definition 84 (Inflection). Inflection is a method where one combines a Stem and a grammatical morpheme. The resulting word has the same part-of-speech class as the Stem and fills some syntactical function.

Definition 85 (English plural noun declension). English plural noun declension is a form of inflection. A plural noun uses the suffix "s". Sometimes Orthographic rules should be applied, for instance "thrush-es". Furthermore some irregular nouns exist, like "mouse/mice" and "ox/oxen".

Definition 86 (English verb inflection). English verb inflection is a form of inflection. Regular verbs can be processed easily. For the third person one uses the suffix "s", and for the past tense the suffix "ed" is used. Finally, the gerund of a verb is the Stem together with the "ing" suffix. Irregular verbs can be processed by using a set of idiosyncratic rules of inflection.

Definition 87 (Derivation). Derivation is a method where one combines a Stem and a grammatical morpheme. This morpheme has a different part-of-speech class than the Stem. Therefore the meaning is hard to predict. Examples are the suffixes "-ation", "-ee", "-er" and "-ness" for nouns and "-al", "-able" and "-less" for adjectives.

Definition 88 (Compounding). Compounding is a method where one combines multiple word Stems to form a new word. Usually the result has the same class as that of the head. For instance "doghouse = dog (modifier) + house (head)". Compounding can be implemented recursively.

Definition 89 (Clitic). A clitic is a morpheme that acts syntactically as a word but is reduced in form. It is attached phonologically or orthographically (for instance with an apostrophe). A table of Clitics and their corresponding full form is listed in Table 1.

Definition 90 (Cliticization). Cliticization is a method where one combines a Stem with a clitic. If the clitic is placed before the word, we call this a proclitic, for instance "l'opera". If the clitic is placed behind the word, we call this an enclitic, for instance "I've".

Definition 91 (Agreement). Agreement is a concept where, for instance, a subject and verb must agree in number (person). In French, nouns and adjectives must also agree in gender. One can parse such things by considering this in the design of the Formal grammar or by introducing a Subcategory.


Definition 92 (Sentence Detection). Sentence Detection is a task where a system must detect the beginning and end of a sentence in a text. Most systems use punctuation to process this. However, headlines for instance often don't have punctuation. Furthermore, abbreviations use punctuation. Therefore it's not that easy to disambiguate.

Definition 93 (Lexicon). A Lexicon is a repository of words.

Definition 94 (Morphotactics). Morphotactics are a set of rules that tell us how Stems and affixes fit together.

Definition 95 (Computational lexicon (Finite-State Lexicon)). A Computational lexicon is a finite-state model that describes the different rules of morphotactics using a Finite State Automaton. Examples of such systems are shown in Figures 1, 2 and 3.

Figure 1: Nominal inflection.

Figure 2: Verbal inflection.

Figure 3: Adjectival derivations.

Definition 96 (Porter Stemmer). The Porter Stemmer is a transducer that transforms a word to its Stem. The transducer is lexicon-free and is based on rewrite rules such as ational→ate, ing→ε and sses→ss. The Stemmer can improve performance but is error-prone.
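A toy sketch (my own, and deliberately much simpler than the real Porter algorithm): applying only the three rewrite rules named above already shows both the idea and why such stemming is error-prone:

    # Toy suffix-rewriting stemmer using only the three rules quoted in the definition above.
    RULES = [
        ("ational", "ate"),   # relational -> relate
        ("ing", ""),          # motoring  -> motor
        ("sses", "ss"),       # grasses   -> grass
    ]

    def toy_stem(word):
        for suffix, replacement in RULES:
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word

    # toy_stem("relational") -> "relate"; toy_stem("king") -> "k"  (an error: "king" is not an -ing form)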


Figure 4: Orthographic transducer.


4 Language modeling and Sequential tagging

Definition 97 (Probabilistic model). A Probabilistic model is a model which predicts the next item based on probability theory.

Definition 98 (Markov chain). A Markov chain is a model where the next variable depends only on a limited number of previous variables. For a k-order Markov model this means that:

\Pr[X_1, X_2, \ldots, X_t] = \prod_{i=1}^{t} \Pr[X_i \mid X_{i-1}, X_{i-2}, \ldots, X_{\max(i-k,1)}]    (1)

Definition 99 (N-gram model). An N-gram model is a model which predicts the next item based on the previous N − 1 items. Such models come in two flavors: Smoothed N-grams and Unsmoothed N-grams.

Definition 100 (Unsmoothed N-gram model). In an Unsmoothed N-gram model, the probability is estimated by:

\Pr[X_t \mid X_{t-1}, X_{t-2}, \ldots, X_{t-N+1}] \approx \frac{\#[X_{t-N+1}, X_{t-N+2}, \ldots, X_t]}{\sum_{X_t} \#[X_{t-N+1}, X_{t-N+2}, \ldots, X_t]}    (2)

Definition 101 (Smoothed N-gram model). In order to eliminate zero probabilities, a Smoothed N-gram model will modify the probabilities using some smoothing model. Popular smoothing models are for instance Laplace smoothing, Kneser-Ney smoothing, ...

Definition 102 (Laplace smoothing). Laplace smoothing adds a small value to all counts before normalizing. Therefore the probabilities of an N-gram model look like:

\Pr[X_t \mid X_{t-1}, X_{t-2}, \ldots, X_{t-N+1}] \approx \frac{1 + \#[X_{t-N+1}, X_{t-N+2}, \ldots, X_t]}{V + \sum_{X_t} \#[X_{t-N+1}, X_{t-N+2}, \ldots, X_t]}    (3)

With V the number of words in the vocabulary.
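A small sketch of equation (3) for a bigram model (N = 2), written as plain Python (my own illustration):

    from collections import Counter

    def laplace_bigram_prob(bigrams, unigrams, vocab_size, w_prev, w):
        # Add-one smoothed estimate of Pr[w | w_prev], following equation (3) with N = 2.
        return (1 + bigrams[(w_prev, w)]) / (vocab_size + unigrams[w_prev])

    tokens = "the cat sat on the mat".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    print(laplace_bigram_prob(bigrams, unigrams, len(unigrams), "the", "cat"))  # seen bigram
    print(laplace_bigram_prob(bigrams, unigrams, len(unigrams), "the", "sat"))  # unseen bigram, still non-zero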

Definition 103 (Good discounting). Good discounting is based on the idea that the count of things you have seen helps estimating the count of things you have never seen.

Definition 104 (Hapax legomenon (Singleton)). A Hapax legomenon is an N-gram that occurs only once.

Definition 105 (Good-Turing discounting). Good-Turing discounting is a method to estimate the counts of unseen things by using the frequency of Hapax legomena as a re-estimate of the frequency of zero-count N-grams. Therefore it uses the following formula:

c^* = \frac{(c+1) \cdot N_{c+1}}{N_c}    (4)

With N_c the number of N-grams that occur c times and c^* the revised counts. Zero counts are re-estimated from N_1, relative to N, the total number of events. If N_c is zero (or c is smaller than a threshold k), one has to smooth this value a priori. In such a case we use:

c^* = \frac{\dfrac{(c+1) \cdot N_{c+1}}{N_c} - c \cdot \dfrac{(k+1) \cdot N_{k+1}}{N_1}}{1 - \dfrac{(k+1) \cdot N_{k+1}}{N_1}}    (5)
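A minimal sketch of the basic re-estimation in equation (4), leaving out the threshold correction of equation (5) (my own illustration):

    from collections import Counter

    def good_turing(ngram_counts):
        # Revised counts c* = (c+1) * N_{c+1} / N_c; keeps the raw count when N_{c+1} is zero.
        freq_of_freqs = Counter(ngram_counts.values())   # N_c: how many N-grams occur exactly c times
        revised = {}
        for ngram, c in ngram_counts.items():
            if freq_of_freqs[c + 1] > 0:
                revised[ngram] = (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]
            else:
                revised[ngram] = c
        return revised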

Definition 106 (Interpolation with lower-order N-grams). Interpolation with lower-order N-grams is a method where the probability of an N-gram is estimated based on the probabilities of N-grams with a lower N. If one uses linear interpolation, the general formula is:

\Pr[X_T \mid X_{T-N+1}, X_{T-N+2}, \ldots, X_{T-1}] = \sum_{i=1}^{N} \mu_i \Pr[X_T \mid X_{T-N+i}, X_{T-N+i+1}, \ldots, X_{T-1}] \quad \text{where} \quad \sum_{i=1}^{N} \mu_i = 1    (6)

The weights μ_i can be learned based on a corpus (with an Expectation-Maximization algorithm), empirically or with prior knowledge.

Definition 107 (Backoff). Backoff is a method where the probability of an N-gram is only estimated from lower-order N-grams when we have zero evidence for the higher-order N-gram. If data is available, we use that data.

Definition 108 (Katz backoff). Katz backoff is a Backoff method where one uses the original value if data is available. If the required data is missing, one backs off to lower-order N-grams, with discounted probabilities (for instance Good-Turing discounting).


Definition 109 (Log probabilities). Log probabilities are a system where probabilities are represented by their logarithm. This is done when the probabilities are exceptionally low and could result in numerical underflow. By calculating the logarithm, the probability fits in a floating-point representation.

Definition 110 (Perplexity). Perplexity is the inverse probability of the test corpus W, normalized by the number of words L in the test corpus:

\mathrm{Perplexity}(W) = \sqrt[L]{\frac{1}{\Pr[w_1 w_2 \ldots w_L]}}    (7)
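A short sketch combining Definitions 109 and 110: in log space the L-th root becomes an average, which avoids numerical underflow (the logprob callable is a hypothetical language-model interface, not something defined in the notes):

    def perplexity(words, logprob):
        # logprob(w, history) should return log2 Pr[w | history] under some language model.
        total = sum(logprob(w, words[:i]) for i, w in enumerate(words))
        return 2 ** (-total / len(words))   # the L-th root of 1 / Pr[w_1 ... w_L], computed in log space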

Definition 111 (Chain rule). The Chain rule is a law in probability theory stating that one can calculate the probability of a sequence as the product of a set of conditional probabilities:

\Pr[X_1, X_2, \ldots, X_n] = \prod_{i=1}^{n} \Pr[X_i \mid X_1, X_2, \ldots, X_{i-1}]    (8)

Definition 112 (Hidden Markov Model). A Hidden Markov Model is a 5-tuple ⟨Q, A, B, q0, qF⟩ where Q = {q_1, q_2, ..., q_N} is a set of N states. A is a Transition probability matrix where each a_{i,j} represents the probability of moving from state q_i to q_j, such that for all i, \sum_{j=1}^{N} a_{i,j} = 1. O = (o_1, o_2, ..., o_T) is a sequence of T Observations, each one drawn from a vocabulary V = {v_1, v_2, ..., v_V}. B_{i,t} = b_i(o_t) is a sequence of Observation likelihoods (also called Emission probabilities), each expressing the probability of an observation o_t being generated from state q_i. q_0 is a special Start state, not associated with any observations but with transition probabilities a_{0,1}, a_{0,2}, ..., a_{0,N}. Finally, q_F is called the End state or Final state. This state is not associated with observations either; the probabilities a_{1,F}, a_{2,F}, ..., a_{N,F} are associated with q_F. Popular problems for Hidden Markov Models are Hidden Markov Model Likelihood, Hidden Markov Model Training and Hidden Markov Model Decoding. In a Hidden Markov Model one does not know the state sequence that the model passed through when generating the training examples. A Hidden Markov Model is basically a 2-gram model over hidden states. One can extend this model by using an N-dimensional matrix for A and B.

Definition 113 (Class-Based N-gram (Cluster N-gram)). A Class-Based N-gram uses information about the Word classes or Clusters to predict the next word. This is useful when, for instance, the bigram "to Shanghai" never occurs in the training set, but the bigrams "to London" and "to Beijing" do occur. One can approximate the probability with the following formula:

\Pr[w_i \mid w_{i-1}] \approx \Pr[c_i \mid c_{i-1}] \cdot \Pr[w_i \mid c_i] \quad \text{where} \quad \Pr[w \mid c] = \frac{\#[w]}{\#[c]} \quad \text{and} \quad \Pr[c_i \mid c_{i-1}] = \frac{\#[c_{i-1}, c_i]}{\sum_k \#[c_{i-1}, c_k]}    (9)

Class-Based N-grams are often mixed with regular Word-based N-grams.

Definition 114 (Hidden Markov Model Likelihood). Hidden Markov Model Likelihood is a problem where, given a Hidden Markov Model M = ⟨Q, A, B, q0, qF⟩ and an observation sequence O = (o_1, o_2, ..., o_T), one tries to find the probability of this observation sequence, \Pr[O \mid M]. One can calculate the likelihood by using the Transition probability matrix and the Emission probabilities with the following formula:

\Pr[O \mid M] = \sum_{\kappa \in Q^T} \Pr[O, \kappa \mid M] = \sum_{\kappa \in Q^T} \left( \prod_{i=1}^{T} \Pr[o_i \mid M, \kappa_i] \right) \cdot \left( \prod_{i=2}^{T} \Pr[\kappa_i \mid M, \kappa_{i-1}] \right)    (10)

One can efficiently calculate this using the Forward algorithm.

Definition 115 (Forward algorithm). The Forward algorithm is an algorithm that calculates the Hidden Markov Model Likelihood using dynamic programming. Therefore a table α is calculated where α_{t,j} is the probability of being in state q_j after seeing the first t observations of O = (o_1, o_2, ..., o_T), or more formally:

\alpha_{t,j} = \Pr[o_1, o_2, \ldots, o_t, \kappa_t = q_j \mid M]    (11)

Where κ_t is the t-th state in a sequence. Once this table is computed, we only have to sum over the probabilities of every path that could lead up to this observation. One can calculate the table by exploiting the following relation:

\alpha_{t,j} = \sum_{i=0}^{N} \alpha_{t-1,i} \cdot a_{i,j} \cdot b_j(o_t)    (12)

The entire algorithm is listed in Algorithm 1.

Definition 116 (Visible Markov Model). A Visible Markov Model is a variant of a Hidden Markov Model. In a Visible Markov Model we can identify the path that was taken inside the model to produce the training sequence.


Algorithm 1: The Forward algorithm.

    begin
        α ← Matrix(N+2, T)
        foreach q_i ∈ Q do
            α_{1,i} ← a_{0,i} · b_i(o_1)
        for t from 2 to T do
            foreach q_i ∈ Q do
                α_{t,i} ← Σ_{q_j ∈ Q} α_{t−1,j} · a_{j,i} · b_i(o_t)
        α_F ← Σ_{q_j ∈ Q} α_{T,j} · a_{j,F}
        return α_F
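The same computation as a compact Python sketch (my own rendering of Algorithm 1, using dictionaries keyed by state names; a, b, start and final follow the notation of Definition 112):

    def forward(observations, states, a, b, start="q0", final="qF"):
        """Likelihood Pr[O | M] of an observation sequence under an HMM.

        a[(i, j)] is the transition probability from state i to state j,
        b[(i, o)] the probability of emitting observation o in state i.
        """
        alpha = {i: a[(start, i)] * b[(i, observations[0])] for i in states}   # initialization
        for o in observations[1:]:                                             # recursion over time
            alpha = {i: sum(alpha[j] * a[(j, i)] for j in states) * b[(i, o)] for i in states}
        return sum(alpha[j] * a[(j, final)] for j in states)                   # termination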

Definition 117 (Hidden Markov Model Training). Hidden Markov Model Training is a problem where, given a set of observation sequences O = {O_1, O_2, ..., O_T} and a set of states Q = {q_1, q_2, ..., q_N}, one tries to learn the parameters of the Transition probability matrix A and the Emission probabilities B. The Forward-Backward algorithm listed in Algorithm 3 can learn these parameters.

Definition 118 (Backward algorithm). The Backward algorithm is an algorithm that calculates the probability β_{t,i} of seeing the observations from time t + 1 to the end, given that we are in state i at time t in a Hidden Markov Model M = ⟨Q, A, B⟩. It does this by applying the reverse of the Forward algorithm. The entire algorithm is listed in Algorithm 2.

Algorithm 2: The Backward algorithm.

    begin
        β ← Matrix(T, N+2)
        foreach q_i ∈ Q do
            β_{T,i} ← a_{i,F}
        for t from T−1 downto 1 do
            foreach q_i ∈ Q do
                β_{t,i} ← Σ_{q_j ∈ Q} β_{t+1,j} · a_{i,j} · b_j(o_{t+1})
        β_{1,0} ← Σ_{q_j ∈ Q} β_{1,j} · a_{0,j} · b_j(o_1)
        return β

Definition 119 (Forward-Backward algorithm). The Forward-Backward algorithm is an algorithm that calculates the weights in a Hidden Markov Model. This task is sometimes called Hidden Markov Model Training. It is an Expectation-Maximization algorithm where in the Expectation phase two tables ξ and γ are calculated: ξ_{t,i,j} is the probability of being in state i at time t and state j at time t + 1, given the observation sequence and the model; γ_{t,j} is the probability of being in state j at time t. Or more formally:

\xi_{t,i,j} = \Pr[\kappa_t = q_i, \kappa_{t+1} = q_j \mid O, M] = \frac{\alpha_{t,i} \cdot a_{i,j} \cdot b_j(o_{t+1}) \cdot \beta_{t+1,j}}{\alpha_{T,N}}    (13)

\gamma_{t,j} = \frac{\alpha_{t,j} \cdot \beta_{t,j}}{\Pr[O \mid A, B]}    (14)

In the Maximization phase one uses these parameters to update the model by calculating the most probable parameters for A and B. Algorithm 3 presents a full listing of this method.

Definition 120 (Hidden Markov Model Decoding). Hidden Markov Model Decoding is a problem where, given a Hidden Markov Model M and a sequence of Observations, one wants to find the most likely sequence of states the Hidden Markov Model used to generate this sequence. This problem can be solved with dynamic programming, using the Viterbi algorithm.


Algorithm 3: The Forward-Backward algorithm.

    Input: O: observations of length T, V: output vocabulary, Q: hidden states with |Q| = N
    begin
        A ← Matrix(N+2, N+2)
        B ← Matrix(N+2, |V|)
        repeat
            /* Expectation phase */
            for t from 1 to T do
                foreach q_j ∈ Q do
                    γ_{t,j} ← (α_{t,j} · β_{t,j}) / Pr[O | A, B]
                    foreach q_i ∈ Q do
                        ξ_{t,i,j} ← (α_{t,i} · a_{i,j} · b_j(o_{t+1}) · β_{t+1,j}) / α_{T,N}
            /* Maximization phase */
            foreach q_j ∈ Q do
                foreach q_i ∈ Q do
                    a_{i,j} ← ( Σ_{t=1}^{T−1} ξ_{t,i,j} ) / ( Σ_{t=1}^{T−1} Σ_{j=1}^{N} ξ_{t,i,j} )
                foreach v_k ∈ V do
                    b_j(v_k) ← ( Σ_{t=1, o_t=v_k}^{T} γ_{t,j} ) / ( Σ_{t=1}^{T} γ_{t,j} )
        until A and B have converged
        return ⟨A, B⟩

Definition 121 (Viterbi algorithm). The Viterbi algorithm is an algorithm to calculate the most likely sequence of states given a sequence of observations. For every time step t, the algorithm computes the probability of the best state sequence ending in each state based on the probabilities at time t − 1, and stores a backpointer to the best predecessor; the most likely sequence is recovered afterwards by backtracking. Therefore the algorithm is a variant of the Max-Product algorithm.


Algorithm 4: The Viterbi algorithm.

    begin
        V ← Matrix(N+2, T)
        P ← Matrix(N+2, T)
        foreach q_i ∈ Q do
            v_{i,1} ← a_{0,i} · b_i(o_1)
            p_{i,1} ← 0
        for t from 2 to T do
            foreach q_i ∈ Q do
                v_{i,t} ← max_{q_j ∈ Q} v_{j,t−1} · a_{j,i} · b_i(o_t)
                p_{i,t} ← argmax_{q_j ∈ Q} v_{j,t−1} · a_{j,i}
        v_F ← max_{q_j ∈ Q} v_{j,T} · a_{j,F}
        p_F ← argmax_{q_j ∈ Q} v_{j,T} · a_{j,F}
        r ← Vector(T+2)
        r_{T+1} ← F
        for t from T downto 1 do
            r_t ← p_{r_{t+1}, t}
        return r
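And the corresponding Python sketch (my own rendering, same dictionary-based notation as the forward sketch above); it returns the most likely state sequence by backtracking through the stored pointers:

    def viterbi(observations, states, a, b, start="q0", final="qF"):
        """Most likely state sequence for an observation sequence under an HMM."""
        v = [{i: a[(start, i)] * b[(i, observations[0])] for i in states}]   # initialization
        back = [{i: None for i in states}]
        for o in observations[1:]:                                           # keep the best predecessor
            best = {i: max((v[-1][j] * a[(j, i)], j) for j in states) for i in states}
            v.append({i: s * b[(i, o)] for i, (s, j) in best.items()})
            back.append({i: j for i, (s, j) in best.items()})
        last = max(states, key=lambda j: v[-1][j] * a[(j, final)])           # termination
        path = [last]
        for pointers in reversed(back[1:]):                                  # backtracking
            path.append(pointers[path[-1]])
        return list(reversed(path))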


5 Part-of-Speech Tagging and Shallow Parsing

Definition 122 (Part-of-Speech tagging). Part-of-Speech tagging is a process where one assigns a part-of-speech or other syntactic class marker to each word in a corpus. This tagging requires tokenization. Punctuation is tagged as well.

Definition 123 (Part-of-Speech tag). A Part-of-Speech tag specifies the meaning or function of a word in the corpus. The tag is based on both the syntactical and morphological function. The tag can be derived from the words that occur nearby or the affixes the word uses. Part-of-Speech tags are grouped in Classes.

Definition 124 (Part-of-Speech class). A Part-of-Speech class is a group of different Part-of-Speech tags. Part-of-Speech classes come in two flavors: Closed classes and Open classes. Closed classes have fixed membership (no new words); they mainly consist of articles, conjunctions, prepositions and function words ("of", "it", "and", "you", ...). Open classes can contain newly introduced words. They contain Nouns, Verbs, Adjectives and Adverbs.

Definition 125 (Noun). A Noun is a Syntactic class in which the word occurs. Semantically it points to people, places and things. A Noun can have a plural declension, can occur with Determiners and takes possessives.

Definition 126 (Proper noun). Proper nouns are names of specific persons or entities. For instance "Regina", "Leuven" and "IBM". Usually Proper nouns take no article and are capitalized.

Definition 127 (Common noun (Common name)). A Common noun is a noun that does not name a specific person or entity. Common nouns are subdivided into Count nouns and Mass nouns.

Definition 128 (Count noun). A Count noun is a Noun which allows enumeration. Count nouns require an article and have a singular and plural form. Examples are "goat/goats" and "relationship".

Definition 129 (Mass noun). A Mass noun is a Noun which is conceptualized as a homogeneous group. Examples are "snow" and "communism".

Definition 130 (Verb). A Verb is a Syntactic class. Verbs describe Actions and Processes. They have different Morphological forms. Subclasses of Verbs are Auxiliaries and Modal verbs.

Definition 131 (Auxiliaries (Auxiliary verbs)). Auxiliaries are a closed subclass of Verbs. They mark semantic features of the main verb (tense, aspect, polarity, mood). In general an auxiliary connects a Subject with Predicates. Auxiliaries include "to have", "to be", "to do" and the Modal verbs. One further divides Auxiliaries into Perfect auxiliaries (like "have"), Progressive auxiliaries (like "be") and Passive auxiliaries (like "be"). Particular syntactic constraints determine the subtype.

Definition 132 (Modal verb (Modal, Modal auxiliary verb, Modal auxiliary)). A Modal verb is a type of Verb that is used to indicate modality: likelihood, ability, permission and obligation. Examples are "shall/should" and "will/would". Table ?? lists a set of Modal verbs.

Definition 133 (Adjective). An Adjective is a Part-of-Speech class which describes properties or qualities of an object, for instance its color, age or value. Other adjectives describe some degree of comparison, for instance "good", "better", "nicest". Not all languages have Adjectives.

Definition 134 (Adverb). An Adverb is a diverse Part-of-Speech class which modifies something, usually a verb. Adverbs come in different flavors: Directional adverbs, Degree adverbs, Manner adverbs and Temporal adverbs.

Definition 135 (Directional adverb (Locative adverb)). A Directional adverb is a type of Adverb describing the location where something takes place. For instance "home", "here" and "downhill".

Definition 136 (Degree adverb). A Degree adverb is a type of Adverb describing the severity of the action. For instance "extremely", "very" and "somewhat".

Definition 137 (Manner adverb). A Manner adverb is a type of Adverb describing in which manner the action takes place. For instance "slowly" and "delicately".

Definition 138 (Temporal adverb). A Temporal adverb is a type of Adverb describing when an action takes place. For instance "yesterday" and "Monday". Some of these adverbs are Nouns as well.

Definition 139 (Preposition). A Preposition is a Part-of-Speech class that comes before Noun phrases. Prepositions express Spatial relations or Temporal relations. Since Prepositions are a closed class, Table ?? lists all Prepositions.

Definition 140 (Particle). A Particle is a Part-of-Speech class. Particles resemble Prepositions and are combined with a verb. Often the combination has a different meaning than the verb alone. Examples of such combinations are "turn down" and "rule out". Since Particles are a closed group, Table ?? lists all the single-word Particles.


Definition 141 (Determiner (Det)). A Determiner is a Part-of-Speech class that occurs with nouns. It is mostly annotated with DT. Determiners mark the beginning of a Noun phrase. A subtype of the Determiner is the Article. Other Determiners are "this" and "that". Some Determiners are complex and are used in Possessive expressions by adding a "'s" at the end of a Noun phrase. In some cases they are optional: for instance with Mass nouns and Plural indefinites.

Theorem 1 (Marker hypothesis). There is a set of words in every language that marks the boundaries of phrases in a sentence.

Definition 142 (Article). An Article is a subclass of the Determiner. Articles come in two flavors: Definite articles like "the" and Indefinite articles like "a" and "an".

Definition 143 (Conjunction). A Conjunction is a Part-of-Speech class that joins two Phrases, Clauses or Sentences. Conjunctions come in different flavors: Coordinating conjunctions and Subordinating conjunctions. Table ?? lists a set of Conjunctions.

Definition 144 (Coordinating conjunction). A Coordinating conjunction is a subclass of the Conjunctions. They join elements of equal status. Examples are "and", "or" and "but".

Definition 145 (Subordinating conjunction). A Subordinating conjunction is a subclass of the Conjunctions. They join elements where one element has embedded status. An example is "I thought that you might like that". Subordinating conjunctions that link a Verb to its Arguments are called Complementizers.

Definition 146 (Pronoun). A Pronoun is a Part-of-Speech class which is a shorthand for referring to a Noun phrase, Entity or Event. Pronouns come in different flavors: Personal pronouns, Possessive pronouns and Wh-pronouns. Table ?? lists a set of Pronouns.

Definition 147 (Personal pronoun). Personal pronouns are Pronouns that refer to Persons or Entities.

Definition 148 (Possessive pronoun). Possessive pronouns are Pronouns that express actual Possession or an abstract relation between a Person and an Object.

Definition 149 (Wh-pronoun). Wh-pronouns are Pronouns used in Questions. They can act as Complementizers.

Definition 150 (Interjection). An Interjection is a Part-of-Speech class which contains for instance "oh", "ah", "yes" and "uh".

Definition 151 (Negative). A Negative is a Part-of-Speech class which contains for instance "no" and "not".

Definition 152 (Politeness marker). A Politeness marker is a Part-of-Speech class which contains for instance "thank you" and "please".

Definition 153 (Greeting). A Greeting is a Part-of-Speech class which contains for instance "hello" and "goodbye".

Definition 154 (Existential there). The Existential there is a Part-of-Speech class which contains markers for the fact that some object exists. For instance "there are two apples on the table".

Definition 155 (Rule-based Part-of-Speech tagging). Rule-based Part-of-Speech tagging is a method where Part-of-Speech tagging is handled by using a Dictionary to assign a list of potential tags to each word, and a large list of hand-written disambiguation rules. The result is a fragment with one Part-of-Speech tag per word. For different tag sets or languages, one has to use different rules.

Definition 156 (EngCGTagger (Voutilainen)). EngCGTagger is a Rule-based Part-of-Speech tagger that uses a set of 56,000 word stems and performs a Morphological analysis. The analysis is done by running the text through a two-level Lexicon transducer. Next the tagger rules out incorrect Part-of-Speech tags using Constraints.

Theorem 2 (Bayes rule). Bayes rule is a rule in Probability theory stating the following:

\Pr[x \mid y] = \frac{\Pr[y \mid x] \cdot \Pr[x]}{\Pr[y]}    (15)

If one is only interested in the maximum likelihood of x, one may drop the denominator since it is the same for each item:

x^* = \operatorname{argmax}_{x \in X} \Pr[y \mid x] \cdot \Pr[x]    (16)

Theorem 3 (Part-of-Speech simplification assumption). In order to perform Part-of-Speech tagging, most systems use the following two assumptions:


• The probability of a word depends only on its own Part-of-Speech tag:

\Pr[\vec{w} \mid \vec{t}] \approx \prod_{i=1}^{n} \Pr[w_i \mid t_i]    (17)

• The probability of a tag depends only on the previous tag (this is sometimes called the Bigram assumption):

\Pr[\vec{t}] \approx \prod_{i=1}^{n} \Pr[t_i \mid t_{i-1}]    (18)

Definition 157 (Hidden Markov Model Part-of-Speech tagging). Hidden Markov Model Part-of-Speech tagging is a form of Part-of-Speech tagging where one uses a Hidden Markov Model. This system is based on the Part-of-Speech simplification assumptions (otherwise one could not model this with a Hidden Markov Model). One can simply train the model by counting tuples in a Training set:

\Pr[t_i \mid t_{i-1}] = \frac{\#[t_{i-1}, t_i]}{\#[t_{i-1}]}    (19)

\Pr[w_i \mid t_i] = \frac{\#[t_i, w_i]}{\#[t_i]}    (20)

In this context, the emission probabilities are sometimes called the Lexical likelihood. Unknown words are in most cases handled by using an "unknown word" tag. In other cases the probabilities are equally distributed over the different Part-of-Speech tags that represent Open classes; in this case, the result mainly depends on the previous tag. A better system uses Morphology: analyzing the word in order to determine its type. Finally, one may also consider all final letter sequences of all words and for each suffix compute the probability of a tag t given the suffix.¹
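A small sketch of the counting in equations (19) and (20), assuming the training set is given as sentences of (word, tag) pairs (the "<s>" start tag is my own assumption, not part of the notes):

    from collections import Counter

    def train_hmm_tagger(tagged_sentences):
        """Estimate Pr[t_i | t_{i-1}] and Pr[w_i | t_i] by counting, cf. equations (19) and (20)."""
        tag_bigrams, tag_counts, emissions = Counter(), Counter(), Counter()
        for sentence in tagged_sentences:
            tags = ["<s>"] + [t for _, t in sentence]          # "<s>" marks the start of a sentence
            tag_bigrams.update(zip(tags, tags[1:]))
            tag_counts.update(tags)
            emissions.update((t, w) for w, t in sentence)
        transition = {(t1, t2): c / tag_counts[t1] for (t1, t2), c in tag_bigrams.items()}
        emission = {(t, w): c / tag_counts[t] for (t, w), c in emissions.items()}
        return transition, emission

    # Example: train_hmm_tagger([[("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]])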

Definition 158 (Hybrid Part-of-Speech tagging (Brill tagging)). Hybrid Part-of-Speech tagging is a method to perform Part-of-Speech tagging using Transformation-based Part-of-Speech tagging. By mixing Rule-based Part-of-Speech tagging and Stochastic Part-of-Speech tagging one hopes to generate better results. Rules specify which tags should be assigned to which words, and these rules are automatically induced from data (by Supervised learning). When the system tags a fragment, the rules are ordered so that the broadest rule is applied first. The algorithm then applies more specific rules, modifying a smaller number of tags. The rules are learned automatically but limited by a template.

Definition 159 (Combination Part-of-Speech tagging). Combination Part-of-Speech tagging is a Part-of-Speech tagging mechanism where a couple of taggers are used concurrently. The results are compared and an algorithm decides on the final Part-of-Speech tag for each word.

Definition 160 (Shallow parsing (Chunking)). Shallow parsing is a process where one groups Words into linguistically meaningful Chunks. By applying this process recursively we get a Shallow parse tree. This process of course requires Part-of-Speech tags. Shallow parsing is based on the Marker hypothesis.

Definition 161 (Machine learning chunker). A Machine learning chunker is a system that performs Shallow parsing. Each word is either the Beginning of a chunk, the End of a chunk, or neither the beginning nor the end of a chunk. No Part-of-Speech tagging is required, but it can help. The program learns the rules by using a training set. For a chunk X, the beginning is annotated as BX and the end as EX. Machine learning chunkers don't consider Crossing chunks: if one marks the end of a chunk X, all embedded chunks end as well.

¹This method works fine for most European languages, but is not considered to be universally applicable.


6 Formal Grammars and Parsers

Definition 162 (Formal grammar). A Formal grammar is a system where a group of items can behave as an item themselves. Therefore a sequence of items can be seen as a tree of items with the original items in the leaves. In the context of natural language processing, the introduced items are called Phrases, for instance a Noun phrase. Formal grammars are used to define grammatical relations and thus the formalization of traditional grammar. They introduce relations and dependencies between words and phrases. A popular category of Formal grammars are the Context-free grammars.

Definition 163 (Context-free grammar (CFG, Phrase-Structure grammar)). Context-free grammars are a type of Formal grammars. They consist of a Lexicon of Words and Symbols and a set of Production rules expressing how these symbols can be grouped and ordered. Context-free grammars can be used both for generating sentences and for assigning a structure to a given sentence. More formally, a Context-free grammar is a 4-tuple G = ⟨N, Σ, R, n0⟩ where N is a set of Non-terminal symbols (sometimes called Variables), Σ a set of Terminal symbols (disjoint from N), R a set of Production rules, each of the form n → β where n ∈ N and β is a string from the infinite set of strings (Σ ∪ N)*, and n0 ∈ N is the Start symbol. Sentences that can be generated from n0 are called Grammatical sentences. Sentences that fail this condition are called Ungrammatical sentences.

Definition 164 (Noun phrase (NP)). A Noun phrase (tagged as NP) is a sequence of words surrounding at least one Noun. What holds for a Noun phrase, however, is not necessarily true for the individual words making up the Noun phrase. For instance "Harry the Horse" or "the Broadway coppers". A Noun phrase can be divided into components with a specific meaning: Head, Determiner, Nominal and Predeterminer. Formally a Noun phrase is defined by the following grammatical rules:

⟨Nominal⟩ → ⟨Noun⟩ | ⟨Nominal⟩ ⟨Noun⟩
⟨NP⟩ → ⟨Pronoun⟩ | ⟨DT⟩ ⟨Nominal⟩ | ⟨ProperNoun⟩
⟨NP⟩ → ⟨DT⟩ ⟨Card⟩ ⟨Ord⟩ ⟨Quant⟩ ⟨AP⟩ ⟨Nominal⟩
⟨NP⟩ → ⟨NP⟩ ⟨CC⟩ ⟨NP⟩

Definition 165 (Prepositional phrase (PP)). A Prepositional phrase (tagged as PP) is a Phrase where one puts a Preposition before a Noun phrase. Prepositional phrases are used to express relations with time, dates and other nouns. They can be rather complex. Examples include "to Seattle", "in Minneapolis" and "on the ninth of July". Formally a Prepositional phrase is defined by the following grammatical rule:

⟨PP⟩ → ⟨Preposition⟩ ⟨NP⟩

Definition 166 (Verb phrase (VP)). A Verb phrase (tagged as VP) is a Phrase combining a Verb with its complements, such as a following object. Formally a Verb phrase is defined by the following grammatical rules:

⟨VP⟩ → ⟨Verb⟩ | ⟨Verb⟩ ⟨NP⟩ | ⟨Verb⟩ ⟨NP⟩ ⟨PP⟩ | ⟨Verb⟩ ⟨PP⟩ | ⟨Verb⟩ ⟨S⟩
⟨VP⟩ → ⟨VP⟩ ⟨CC⟩ ⟨VP⟩

Definition 167 (Sentence (S)). A Sentence (tagged as S) is a Phrase where one combines a Noun phrase and a Verb phrase. In English, one distinguishes between different forms of sentences: Declarative sentences, Imperative sentences, Yes-no questions and Wh-phrases. Formally a Sentence is defined by the following grammatical rules:

⟨S⟩ → ⟨NP⟩ ⟨VP⟩ | ⟨VP⟩ | ⟨Aux⟩ ⟨NP⟩ ⟨VP⟩ | ⟨Wh-NP⟩ ⟨VP⟩ | ⟨Wh-NP⟩ ⟨Aux⟩ ⟨NP⟩ ⟨VP⟩
⟨S⟩ → ⟨S⟩ ⟨CC⟩ ⟨S⟩

Definition 168 (Clause). A Clause is a part of a Sentence that stands on its own as a fundamental unit in discourse. A Clause is thus a Sentence embedded within a larger Sentence. It describes a complete thought since the verb has all its arguments. An example is "I told him that he should see a doctor.".

Definition 169 (Nominal (Nom)). A Nominal is a part of a Noun phrase that follows a Determiner or Pre-head Modifiers.

Definition 170 (Pre-head modifier). A Pre-head modifier is a part of a Noun phrase. It expresses quantity by Cardinal numbers ("one", "two"), Ordinal numbers ("first" and "second") or quantifiers ("some", "many"). Another subclass of Pre-head modifiers are the Adjectival phrases.

Definition 171 (Post-head modifier). A Post-head modifier is a part of a Noun phrase. Post-head modifiers fall into three categories: Prepositional phrases, Non-finite clauses and Relative clauses.


Definition 172 (Relative clause). A Relative clause is a special type of Clause which begins with a Relative pronoun ("that", "who"), where the Relative pronoun is the subject of the embedded Verb. For instance "a flight that serves breakfast".

Definition 173 (Subject-verb agreement). The Subject-verb agreement is a form of Agreement where the Subject and the Verb agree in number (person). This means that for verbs in the third person singular (3sg), one adds an "-s". Such agreements can be resolved by considering them in the grammar, at the cost of a penalty in performance (the grammar size doubles).

Definition 174 (Determiner-noun agreement). The Determiner-noun agreement is a form of Agreement where the Determiner and the Noun agree on the number. For instance "this flight" and "these flights".

Definition 175 (Subcategorization). Subcategorization is a concept where one adds additional information to a Chunk. For instance, in a Prepositional phrase one can add the Preposition as data.

Definition 176 (Coordination). Coordination is a concept where one conjoins two or more phrase types by using a Conjunction (CC) ("and", "or", "but"). For instance:

⟨VP⟩ → ⟨VP⟩ ⟨CC⟩ ⟨VP⟩
⟨NP⟩ → ⟨NP⟩ ⟨CC⟩ ⟨NP⟩
⟨S⟩ → ⟨S⟩ ⟨CC⟩ ⟨S⟩

Definition 177 (Treebank). A Treebank is a collection of syntactically annotated texts. This means the database contains a tree for each sentence in the text. This is done automatically; however, Manually corrected treebanks are quite popular. A Treebank is important for empirical parsers and empirical investigations of syntactic phenomena. Treebanks are used to extract Context-free grammar rules from the sentences. After the treebank is analyzed and turned into a Context-free grammar, the Context-free grammar accepts the strings in the Treebank and much more. Treebanks are stored using brackets to notate the groups, or in XML. One can use for instance XPath queries to search for data.

Definition 178 (Manually corrected treebank). A Manually corrected treebank is a Treebank which is tagged semi-automatically and corrected by a human.

Definition 179 (Fully automated treebank). A Fully automated treebank is a Treebank which is generated based on some text and a robust parser. The parser analyzes the text and the result is added to the Treebank.

Definition 180 (Parallel treebank). A Parallel treebank is a Treebank used for Syntax-based machine translation. For a text, the Treebank contains the source text (in the source language) and the target text (in the target language), together with Parse trees and Word alignments.

Definition 181 (Dependency grammar). A Dependency grammar describes syntactic structure in terms of words and binary syntactic relations between these words. The labels on the links depend on the types of the words. An advantage of Dependency grammars is the strong predictive parsing power of words for their dependents, since knowing the identity of a verb can help to decide what the subject or object is. Furthermore, they can handle free word order languages (like Russian).

Definition 182 (Syntactic Parsing). Syntactic Parsing is a process where given a string a system recognizes thesentence and assigns a syntactic structure to it. Algorithms take as input the Formal grammar and the string andproduces a Syntax tree consistent with the grammar.

Definition 183 (Top-down parsing). Top-down parsing is a strategy where ones builds a Syntax tree from the roodnote (and Start symbol). Algorithms who use this strategy don’t spent processing power to trees who can’t be derivedfrom the Start symbol. On the other hand will this strategy invest processing power in trees who are inconsistent withthe input.

Definition 184 (Bottom-up parsing). Bottom-up parsing is a strategy where ones builds a Syntax tree where asequence of symbols are folded into new symbols until the root is reached. Algorithms who use such strategy use alexicon lookup in order to find the proper Part-of-Speech tags. Algorithms who use this strategy spent processing powerto trees who can’t be derived from the Start symbol. On the other hand will this strategy never invest processing powerin trees who are inconsistent with the input.

Definition 185 (Structural ambiguity). Structural ambiguity is a condition of a Formal Grammar where a given string can result in multiple consistent syntax trees. Structural ambiguity comes in two flavors: Attachment ambiguity and Coordinate ambiguity. Most real-life sentences contain many syntactic ambiguities; these are resolved because most readings are semantically unreasonable. The process of finding the correct syntactic tree is called Syntactic disambiguation.


Definition 186 (Attachment ambiguity). Attachment ambiguity is a form of Structural ambiguity where the ambiguity is caused by a subtree that can be attached to different parents.

Definition 187 (Coordinate ambiguity (Scope)). Coordinate ambiguity is a form of Structural ambiguity where the ambiguity arises because different sets of phrases can be conjoined by the same conjunction. For instance “[old men] and [women]” versus “old [men and women]”.

Definition 188 (Syntactic disambiguation). Syntactic disambiguation is a process where one chooses the correct syntactic tree out of a set of valid trees. This process requires Statistical information, Semantical knowledge and Pragmatic knowledge. Since this information is not available while the system performs Syntactic parsing, the parser has to return all possible syntactic trees, and this set can be exponentially large. Therefore the problem is solved dynamically: the algorithm generates tables containing subtrees. The result is still a set of syntactic trees, but the common parts are stored only once (one can compare this approach with the flyweight pattern). Popular implementations are the Cocke-Kasami-Younger algorithm, the Earley algorithm and Chart parsing.

Definition 189 (Chomsky Normal Form (CNF)). The Chomsky Normal Form is a normal form for Context-free grammars. Context-free grammars in Chomsky Normal Form contain only two types of rules:

a → b c (21)

a → γ (22)

With a, b, c ∈ N (the non-terminals) and γ ∈ Σ. All Context-free grammars can be converted into Chomsky Normal Form with the following process (a small sketch follows the list):

1. Convert terminals within rules to dummy non-terminals;

2. Convert unit-productions (single non-terminals);

3. Make all rules binary.
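
For instance, the binarization and dummy non-terminal steps can be sketched in a few lines of Python. This is a hedged illustration only: the rule representation and the helper names to_cnf and is_terminal are assumptions, and step 2 (unit-production elimination) is omitted.

def to_cnf(rules, is_terminal):
    """Rewrite rules so every right-hand side is one terminal or at most two non-terminals."""
    out = []
    fresh = 0
    for lhs, rhs in rules:
        # Step 1: replace terminals inside long rules with dummy non-terminals.
        if len(rhs) > 1:
            new_rhs = []
            for sym in rhs:
                if is_terminal(sym):
                    dummy = "X_" + sym
                    out.append((dummy, [sym]))
                    new_rhs.append(dummy)
                else:
                    new_rhs.append(sym)
            rhs = new_rhs
        # Step 3: binarize right-hand sides longer than two symbols.
        while len(rhs) > 2:
            fresh += 1
            new_nt = f"_BIN{fresh}"
            out.append((lhs, [rhs[0], new_nt]))
            lhs, rhs = new_nt, rhs[1:]
        out.append((lhs, rhs))
    return out

# VP -> Verb NP PP becomes VP -> Verb _BIN1 and _BIN1 -> NP PP.
rules = [("VP", ["Verb", "NP", "PP"]), ("Verb", ["book"])]
print(to_cnf(rules, is_terminal=lambda s: s.islower()))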

Definition 190 (Cocke-Kasami-Younger algorithm (CKY)). The Cocke-Kasami-Younger algorithm is an algorithm that uses Bottom-up parsing and dynamic programming. It requires that the grammar is entirely in Chomsky Normal Form. Since each non-terminal above the Part-of-Speech tag level then has exactly two daughters, one can use a 2D matrix A to encode the tree structure. The algorithm uses the upper triangle of this matrix, where each cell A_{i,j} contains a set of Non-terminal symbols representing all constituents spanning positions i through j. Each of these entries is paired with pointers to the entries from which it was derived. The algorithm has to permit multiple versions of the same non-terminal to be entered in the table. The entire algorithm is listed in Algorithm 5.

Algorithm 5: The Cocke-Kasami-Younger algorithm.

begin
    t ← Matrix(n+1, n+1);
    for j from 1 to n do
        t_{j−1, j} ← {a | a → w_j ∈ G};
        for i from j − 2 downto 0 do
            for k from i + 1 to j − 1 do
                t_{i, j} ← t_{i, j} ∪ {a | a → b c ∈ G, b ∈ t_{i, k}, c ∈ t_{k, j}};
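
As an illustration, the recognizer variant of Algorithm 5 can be sketched in Python as follows. This is a hedged sketch assuming the grammar is already in Chomsky Normal Form; the dictionary-based grammar representation is an assumption.

def cky_recognize(words, lexical, binary, start="S"):
    """lexical maps a word to the set of non-terminals producing it;
    binary maps a pair (b, c) to the set of non-terminals a with a -> b c."""
    n = len(words)
    t = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for j in range(1, n + 1):
        t[j - 1][j] = set(lexical.get(words[j - 1], set()))
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for b in t[i][k]:
                    for c in t[k][j]:
                        t[i][j] |= binary.get((b, c), set())
    return start in t[0][n]

# Toy grammar: S -> NP VP, NP -> "I", VP -> "sleep".
lexical = {"I": {"NP"}, "sleep": {"VP"}}
binary = {("NP", "VP"): {"S"}}
print(cky_recognize(["I", "sleep"], lexical, binary))  # True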

Definition 191 (Earley algorithm). The Earley algorithm is an algorithm that uses Top-down parsing and dynamic programming. It processes a string with a single left-to-right pass that fills a Chart (array). For each word position, the chart contains a list of states representing the partial parse trees so far. At the end of the sentence, the chart encodes all possible syntactic trees. Individual states in each chart entry contain: a subtree corresponding to a single grammar rule, information about the progress made in completing this subtree, and the position of the subtree with respect to the input. For instance one such entry can be represented by the following expression:

VP → Verb • NP PP    [0, 1]    (23)

where VP → Verb NP PP is the rule currently being expanded, the dot indicates that the Verb has already been parsed, and [0, 1] records that the constituent starts at position 0 of the input and that the dot is at position 1. Each entry can be in one of three states: Predictor, Scanner and Completer. We are in a Predictor state when we create new states representing top-down expectations; this is applied to any state that has a non-terminal immediately to the right of the dot, and the newly generated states are added to the same chart entry. An entry is in a Scanner state when the state has a Part-of-Speech tag to the right of the dot: the algorithm examines the input and adds a state to the chart when the input corresponds to the expected tag; otherwise the entry is discarded. Entries are in the Completer state when the dot has reached the end of the rule: this represents the fact that the parser has discovered a particular grammatical category over some span of the input. The algorithm then advances the dot in the older states that were waiting for this category and installs the resulting states in the current chart entry.

Definition 192 (Chart parsing). Chart parsing is a variant of the Cocke-Kasami-Younger algorithm and the Earley algorithm where the order of parsing (Bottom-up parsing versus Top-down parsing) is determined dynamically instead of statically. This is done by using an explicit Agenda: new states are added to the agenda, and the ordering of the agenda is separated from the parsing algorithm.


7 Statistical parsing

Definition 193 (Probabilistic context-free grammar (PCFG, Stochastic context-free grammar)). A Probabilistic context-free grammar is a 4-tuple 〈N, Σ, R, n0〉 where N is a set of Non-terminal symbols (or Variables), Σ a set of Terminal symbols (disjoint from N), R a set of Production rules, each of the form a → β [p] where a is a Non-terminal symbol, β a string of symbols from the infinite set of strings (Σ ∪ N)*, and p ∈ [0, 1] expresses Pr[β|a] = Pr[a → β]. n0 ∈ N is the Start symbol. Since each non-terminal eventually derives some string of terminals, the following condition must hold:

∀a : ∑_{β ∈ (Σ∪N)*} Pr[a → β] = 1    (24)

The joint probability of a Syntax tree T and a sentence ~w is defined as the product of the probabilities of the k rules used to expand the non-terminal nodes in the parse tree, where rule i is written a_i → β_i:

Pr[T, ~w] = ∏_{i=1}^{k} Pr[a_i → β_i] = Pr[T] · Pr[~w|T]    (25)

Since the parse tree includes all the words of the sentence, one can state that:

Pr[~w|T] = 1    (26)

We can use a Probabilistic context-free grammar for Syntactic disambiguation by accepting the most probable tree:

T̂(~w) = argmax_T Pr[T|~w] = argmax_T Pr[T, ~w] / Pr[~w] = argmax_T Pr[T, ~w] = argmax_T Pr[T]    (27)

One can learn the probabilities by counting the rules in a Treebank:

Pr[a → β] = #[a → β] / ∑_γ #[a → γ] = #[a → β] / #[a]    (28)

One can also use the Inside-Outside algorithm. Problems with Probabilistic context-free grammars arise from two assumptions: the Independence assumption and the Lack of lexical conditioning. Augmented versions of Probabilistic context-free grammars exist to resolve these problems. One can, for instance, split the Non-terminal symbols in order to make them less ambiguous; this can be done by the so-called Parent annotation. By enlarging the number of Non-terminal symbols, however, the amount of training data available per rule probability is reduced. Therefore one needs to learn an optimal splitting.
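
As a small illustration of Equation 28, the following Python sketch estimates rule probabilities from a list of rule occurrences read off a treebank; the input format is an assumption.

from collections import Counter

def estimate_rule_probs(rule_occurrences):
    rule_counts = Counter(rule_occurrences)                    # #[a -> beta]
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)   # #[a]
    return {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

occurrences = [("NP", ("Det", "Noun")), ("NP", ("Det", "Noun")), ("NP", ("Pronoun",))]
print(estimate_rule_probs(occurrences))
# {('NP', ('Det', 'Noun')): 0.666..., ('NP', ('Pronoun',)): 0.333...}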

Definition 194 (Probabilistic Cocke-Kasami-Younger algorithm (PCKY)). The Probabilistic Cocke-Kasami-Younger algorithm is a Bottom-up parsing algorithm that produces the most probable tree. Unlike the Cocke-Kasami-Younger algorithm, it uses an (n+1) × (n+1) × V matrix, with V the number of Non-terminal symbols. The entries contain the probability of the non-terminal constituent that spans positions i through j of the input. The entire algorithm is listed in Algorithm 6.

Algorithm 6: The Probabilistic Cocke-Kasami-Younger algorithm.

begin
    t ← Matrix(n+1, n+1, V);
    b ← Matrix(n+1, n+1, V);
    for j from 1 to n do
        foreach x ∈ {a | a → w_j ∈ G} do
            t_{j−1, j, x} ← Pr[x → w_j];
        for i from j − 2 downto 0 do
            for k from i + 1 to j − 1 do
                foreach x ∈ {a | a → b c ∈ G ∧ t_{i, k, b} > 0 ∧ t_{k, j, c} > 0} do
                    if t_{i, j, x} < Pr[x → b c] · t_{i, k, b} · t_{k, j, c} then
                        t_{i, j, x} ← Pr[x → b c] · t_{i, k, b} · t_{k, j, c};
                        b_{i, j, x} ← 〈k, b, c〉;
    return b, t;


Definition 195 (Inside-Outside algorithm). The Inside-Outside algorithm is a generalization of the Forward-Backward algorithm and thus an Expectation-Maximization algorithm. It uses the following technique:

1. Start with equal probabilities

2. Parse the sentences with this parser

3. Produce a probability for each syntax tree

4. Use these probabilities to weight counts

5. Re-estimate the rule probabilities

6. Repeat until the probabilities converge

One can also learn the grammar itself (known as Grammar induction) with this algorithm, but this is quite difficult.

Theorem 4 (Independence assumption). The Independence assumption states that the structure of one part of the tree has nothing to do with the structure of another part of the tree, since the probability of a tree is estimated only by the product of the probabilities of the individual rule applications.

Theorem 5 (Lack of lexical conditioning). The Lack of lexical conditioning theorem states that when the number of lexical items in the model is limited, this leads to Subcategorization ambiguity, Attachment ambiguity and Coordinate ambiguity. Furthermore, the exact word plays an important role in the probability of a certain tree: for instance “dumped into” is more likely than “sacks into”. One can use Lexical dependency statistics to enrich the grammar.

Definition 196 (Parent annotation). Parent annotation is a concept in Probabilistic context-free grammars where each Non-terminal symbol is annotated with the symbol of its parent. For instance, a Noun phrase that is embedded in a Sentence is annotated as NP∧S.

Definition 197 (Lexicalized grammar). A Lexicalized grammar is a Formal grammar where each Non-terminal symbol in the tree is annotated with its lexical head, possibly augmented with the Part-of-Speech tag of the head word. For instance:

VP(dumped, VBD) → VBD(dumped, VBD) NP(sacks, NNS) PP(into, IN)    (29)

This method offers a solution for Coordinate ambiguity.

Definition 198 (Probabilistic lexicalized context-free grammar (PLCFG)). A Probabilistic lexicalized context-free grammar is a probabilistic model where the parser is modified to allow for lexicalized rules; it is therefore a Lexicalized grammar. For instance the Collins parser and the Charniak parser can handle such rules. A Probabilistic lexicalized context-free grammar contains two kinds of rules: Internal rules and Lexical rules. Internal rules handle transitions from one Non-terminal symbol to another; Lexical rules deterministically expand a Pre-terminal into a word. A problem with Probabilistic lexicalized context-free grammars is that the data is often too sparse to estimate the probabilities.

Definition 199 (Collins parser). The Collins parser is an algorithm that can parse Probabilistic lexicalized context-free grammars. The idea is to divide the right side of each rule into a Head non-terminal, Non-terminals to the left of the head and Non-terminals to the right of the head. Therefore each rule becomes:

a→ Ln, Ln−1, . . . , L1, H,R1, R2, . . . , Rm (30)

Each Non-terminal symbol represents the category, the head word and the head tag. The Collins parser considers three probabilities: P_H for generating heads, P_L for generating dependents on the left and P_R for generating dependents on the right. Therefore the probability of the following rule is estimated by:

Pr[VP(dumped, VBD) → VBD(dumped, VBD) NP(sacks, NNS) PP(into, IN)] ≈
    P_H(VBD | VP, dumped)
    · P_L(STOP | VP, VBD, dumped)
    · P_R(NP(sacks, NNS) | VP, VBD, dumped)
    · P_R(PP(into, P) | VP, VBD, dumped)
    · P_R(STOP | VP, VBD, dumped)    (31)

More formally, the following method is used:

1. Generate the head of the phrase H (hw, ht) with probability

PH (H (hw, ht) |Pa, hw, ht) (32)

23

2. Generate modifiers to the left of the head with total probability:

P_L(STOP | Pa, hw, ht) · ∏_{i=1}^{n} P_L(L_i(lw_i, lt_i) | Pa, hw, ht)    (33)

3. Generate modifiers to the right of the head with total probability:

P_R(STOP | Pa, hw, ht) · ∏_{i=1}^{m} P_R(R_i(rw_i, rt_i) | Pa, hw, ht)    (34)

Since zero probabilities are no exception, one needs to perform smoothing, for instance using Backoff models. For unseen words the UNKNOWN word token is used. Each of these probabilities can be estimated from a much smaller data set than the full rule probability. Some parsers use more complex models, for instance including distance functions.
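
The factorization of Equations 31–34 can be illustrated with the following hedged Python sketch; the probability tables P_H, P_L and P_R below are invented toy values (a real parser estimates and smooths them from a treebank), and each dependent side is closed by a STOP symbol.

P_H = {("VBD", "VP", "dumped", "VBD"): 0.6}
P_L = {("STOP", "VP", "VBD", "dumped"): 0.9}
P_R = {
    ("NP(sacks,NNS)", "VP", "VBD", "dumped"): 0.3,
    ("PP(into,P)", "VP", "VBD", "dumped"): 0.2,
    ("STOP", "VP", "VBD", "dumped"): 0.8,
}

def rule_probability(parent, head, hw, ht, left, right):
    # One head factor, then one factor per left/right dependent plus the STOP symbols.
    p = P_H.get((head, parent, hw, ht), 0.0)
    for dep in left + ["STOP"]:
        p *= P_L.get((dep, parent, ht, hw), 0.0)
    for dep in right + ["STOP"]:
        p *= P_R.get((dep, parent, ht, hw), 0.0)
    return p

print(rule_probability("VP", "VBD", "dumped", "VBD",
                       left=[], right=["NP(sacks,NNS)", "PP(into,P)"]))
# 0.6 * 0.9 * 0.3 * 0.2 * 0.8 = 0.02592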

Definition 200 (Discriminative reranking). Discriminative reranking is a concept where a generative parser (like the Collins parser) does not generate the most probable parse tree, but a set of N-best Syntax trees. By using a Discriminative classifier one reranks the trees and picks the best one. This reranking can use more sophisticated metrics.


8 Computational Lexical Semantics

Definition 201 (Word sense). The Word sense of a word is the Lexical meaning: the meaning defined by the origin and the usage of the word. This might vary according to the context. Such knowledge is stored in dictionaries.

Definition 202 (Homonymy). Homonymy means a word has two rather different meanings. For instance “ bark”.

Definition 203 (Polysemy). Polysemy means a word has two somehow related meanings. For instance “opening”.

Definition 204 (Metonymy). Metonymy is the use of one aspect of a concept or entity to refer to other aspects of the entity or to the entity itself. For instance “White House” refers to the administration whose office is in the White House.

Definition 205 (Synonymy). Synonymy means two words have the same propositional meaning: they are substitutable by one another without changing the truth conditions of the sentence. For instance “vermin” and “pests”. Many words are near synonyms: they slightly differ in meaning.

Definition 206 (Antonymy). Antonymy means two words have the opposite meaning. For instance “ cold” and “ hot”.

Definition 207 (Hypernymy (Superordinate)). Hypernymy means a word is more general than another word. For instance “fruit” is a Hypernym of “apple”. This property can be defined in terms of entailment: sense A is a hypernym of sense B if ∀x : A(x) ⇐ B(x). These relations are usually transitive.

Definition 208 (Hyponymy (Subordinate)). Hyponymy means a word is more specific than another word. For instance “apple” is a Hyponym of “fruit”. This property can be defined in terms of entailment: sense A is a hyponym of sense B if ∀x : A(x) ⇒ B(x). These relations are usually transitive. Hyponymy can be acquired based on handcrafted lexico-semantic patterns (like “such as”) or based on frequency. This task is quite difficult since we expect a lot of false positives and false negatives.

Definition 209 (Meronymy). Meronymy means one word denotes a part of what another word denotes. For instance “wheel” is a Meronym of “car”.

Definition 210 (Holonymy). Holonymy means one word denotes the whole of which another word denotes a part. For instance “car” is a Holonym of “wheel”.

Definition 211 (WordNet). WordNet is a lexical database containing Nouns, Verbs, Adjectives and Adverbs in English together with their senses. Each word has different Senses, and each sense contains a Gloss (a dictionary-style definition) and a Synset (a set of (near) synonyms for the sense). Sometimes usage examples are provided. Besides meanings, WordNet also defines relations between words.

Definition 212 (Word sense disambiguation). Word sense disambiguation is a process where one tries to select the correct sense of each word in a discourse. In order to do this, the correct context is needed. Formally one has a word w with senses s1, s2, . . . , sn. Furthermore a corpus has a number of contexts c1, c2, . . . , cm and one can use words v1, v2, . . . , vp as contextual features for disambiguation. The process usually relies on machine learning techniques. Usually these systems rely on two assumptions: One sense per discourse and One sense per collocation. The context is encoded using Bag-of-word features and Collocational features. The process can be carried out using resources like WordNet, or by using a large corpus. The baseline for each Word sense disambiguation algorithm is the same: for each word, the given dictionary contains Semantic categories (also known as Subject codes) for each word sense; the score for a certain sense sk is the number of words in context ci that are compatible with the subject code of sense sk; the algorithm then selects the sense with the maximum score. Formally the score is calculated by the following formula:

score(s_k) = ∑_{v_j ∈ c_i} δ(t(s_k), v_j)    (35)

A typical algorithm to do this is the Simplified Lesk algorithm. In other cases supervised learning is used, for instance the Yarowsky algorithm.
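
As an illustration of the baseline of Equation 35, the sketch below scores every sense by the number of context words compatible with its subject code and takes the argmax; the subject-code dictionaries are invented toy data.

subject_codes = {                       # word -> {sense: subject code}
    "bank": {"bank_1": "FINANCE", "bank_2": "GEOGRAPHY"},
}
word_codes = {                          # context word -> compatible subject codes
    "money": {"FINANCE"}, "deposit": {"FINANCE"}, "river": {"GEOGRAPHY"},
}

def disambiguate(word, context):
    scores = {}
    for sense, code in subject_codes[word].items():
        scores[sense] = sum(1 for v in context if code in word_codes.get(v, set()))
    return max(scores, key=scores.get), scores

print(disambiguate("bank", ["money", "deposit", "nearby"]))
# ('bank_1', {'bank_1': 2, 'bank_2': 0})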

Theorem 6 (One sense per discourse). The One sense per discourse-assumption assumes that the sense of the target word is highly consistent within a document.

Theorem 7 (One sense per collocation). The One sense per collocation-assumption assumes that nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactical relationship.

Definition 213 (Bag-of-word feature). A Bag-of-word feature means one assigns a binary value to each word of the vocabulary encoding whether or not a word is present in the context.


Definition 214 (Collocational feature). A Collocational feature means one encodes information about specific positions within a certain scope around the target word; this information for instance consists of the word and its Part-of-Speech tag.

Definition 215 (Simplified Lesk algorithm). The Simplified Lesk algorithm is an algorithm used for Word sense disambiguation. For each sense it counts the number of context words that also occur in the gloss (and examples) of that sense, and it returns the sense with the most votes. A full listing is presented in Algorithm 7.

Algorithm 7: The Simplified Lesk algorithm.
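
A minimal Python sketch of the usual Simplified Lesk procedure is given below (gloss overlap with a most-frequent-sense fallback); the sense and gloss data structures are assumptions for illustration.

def simplified_lesk(context_words, senses):
    """senses: list of (sense_id, gloss_words), most frequent sense first."""
    best_sense, best_overlap = senses[0][0], 0
    context = set(context_words)
    for sense_id, gloss_words in senses:
        overlap = len(context & set(gloss_words))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense

senses = [("bank_1", ["financial", "institution", "money"]),
          ("bank_2", ["sloping", "land", "river"])]
print(simplified_lesk(["the", "river", "bank", "was", "muddy"], senses))  # bank_2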

Definition 216 (Yarowsky algorithm). The Yarowsky algorithm is an algorithm used for Word sense disambiguation. For each ambiguous word w it learns a dictionary of collocations. It uses two sets for each sense s_k of the ambiguous word w:

F_k  The Set of collocations: this set contains rules that describe a context. Each collocation has a score that corresponds to the likelihood that the collocation is specific to the sense s_k.

Ek The Set of contexts: The different contexts of the ambiguous word w with sense sk.

The algorithm is initialized with the corpus and collects all the contexts of the ambiguous word in the corpus. Next it iteratively filters out the collocations whose score falls below a certain threshold α. A full listing is presented in Algorithm 8.

Algorithm 8: The Yarowsky algorithm.

begin
    C ← contexts(corpus);
    foreach s_k ∈ senses(w) do
        F′_k ← {f_m | f_m ∈ initial set of collocations};
        E′_k ← {c_i ∈ C | ∃f_m ∈ F′_k : applies(f_m, c_i)};
        repeat
            F_k ← F′_k;
            E_k ← E′_k;
            F′_k ← {f_m ∈ F_k | ∀n ≠ k : log(Pr[s_k|f_m] / Pr[s_n|f_m]) > α};
            E′_k ← {c_i ∈ C | ∃f_m ∈ F′_k : applies(f_m, c_i)};
        until E_k = E′_k;
        return F_k

Definition 217 (Unsupervised word sense disambiguation). Unsupervised word sense disambiguation is an approach to Word sense disambiguation. This method is quite popular since labeling is expensive. One does not label based on human-readable senses but clusters contexts according to some “meaning” with a Cluster algorithm. A word occurrence is represented by a Feature vector ~f, and such a system is trained by computing a context vector ~c for every context; each cluster of context vectors then defines a sense for the word. In order to cluster new occurrences, Vector centroids are computed for each cluster. The context vector is calculated using the following formula:

~c = (1/n) · ∑_{i=1}^{n} ~x_i    (36)

with ~x_i the feature vector of the context word w_i. Words in a new text are assigned to the sense whose Vector centroid is the closest.
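
A small Python sketch of Equation 36 and of the nearest-centroid assignment described above; the feature vectors and centroids are toy values.

def context_vector(feature_vectors):
    # Mean of the feature vectors of the context words (Equation 36).
    n = len(feature_vectors)
    dim = len(feature_vectors[0])
    return [sum(v[d] for v in feature_vectors) / n for d in range(dim)]

def nearest_centroid(vector, centroids):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda sense: dist(vector, centroids[sense]))

centroids = {"sense_A": [1.0, 0.0], "sense_B": [0.0, 1.0]}
new_context = context_vector([[0.9, 0.1], [0.7, 0.3]])     # -> [0.8, 0.2]
print(nearest_centroid(new_context, centroids))             # sense_A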

Definition 218 (Selectional preferences). Selectional preferences are a set of rules that determine which words are combinable for a given semantic role. For instance the sentence “I eat gold for lunch.” does not make any sense. These restrictions are not rigid but express preferences (and are thus handled in a probabilistic framework). A popular implementation is Resnik's model of selectional association, based on the Selectional preference strength: the system computes the difference between the information a verb v expresses about a class and the information the class itself expresses. The difference is calculated using the Kullback-Leibler metric:

S_R(v) = ∑_c Pr[c|v] · log( Pr[c|v] / Pr[c] )    (37)


The specific preference of a verb for a class is calculated by the Selectional association. The probabilities can be estimated by counting them in the corpus.

Definition 219 (Kullback-Leibler (Relative entropy)). The Kullback-Leibler divergence is a measure of the difference between two distributions P and Q. It is defined as:

∆(P, Q) = ∑_x P(x) · log( P(x) / Q(x) )    (38)
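
A minimal Python sketch of Equation 38 for discrete distributions given as dictionaries over the same outcomes (zero probabilities in Q would require smoothing in practice):

import math

def kl_divergence(p, q):
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

p = {"a": 0.7, "b": 0.3}
q = {"a": 0.5, "b": 0.5}
print(kl_divergence(p, q))   # ~0.0823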

Definition 220 (Selectional association). The Selectional association is a metric that measures the relative contribution of a class c to the general selectional preference of the verb:

A_R(v, c) = (Pr[c|v] / S_R(v)) · log( Pr[c|v] / Pr[c] )    (39)

Definition 221 (Word similarity). Word similarity is a metric defined as the inverse of the Semantic distance. One calculates Word similarity by looking at the number of features two words share. Algorithms that calculate this metric are the Thesaurus based word similarity and the Corpus based word similarity systems.

Definition 222 (Thesaurus based word similarity). Thesaurus based word similarity is a way to calculate word similarity in which the algorithm stores a semantic network, for instance a Hypernym hierarchy. The similarity can be calculated with the Path-length based similarity:

similarity_path(c1, c2) = − log(pathlength(c1, c2))    (40)

where pathlength(c1, c2) is the number of edges in the shortest path in the Thesaurus graph between the sense nodes c1 and c2. The Word similarity is then defined as:

wordsim(w1, w2) = max_{c1 ∈ senses(w1), c2 ∈ senses(w2)} sim(c1, c2)    (41)

Another metric is the Information-content word similarity. This method relies both on the Thesaurus graph and on probabilistic information. One calculates this similarity from the probability of the Lowest common subsumer:

similarity_resnik(c1, c2) = − log(Pr[LCS(c1, c2)])    (42)

The similarity measures of Lin and of Jiang-Conrath use the Lowest common subsumer as well:

similarity_lin(c1, c2) = 2 · log(Pr[LCS(c1, c2)]) / (log(Pr[c1]) + log(Pr[c2]))    (43)

similarity_JC(c1, c2) = 1 / (2 · log(Pr[LCS(c1, c2)]) − log(Pr[c1]) − log(Pr[c2]))    (44)

Finally the Extended gloss overlap or Extended Lesk algorithm uses a set of relations R (like Hypernym, ...):

similarity_eLesk(c1, c2) = ∑_{r,q ∈ R} overlap(gloss(r(c1)), gloss(q(c2)))    (46)

The probability of a class can be calculated by counting it in a corpus:

Pr[c_i] = #[c_i] / ∑_{l=1}^{n} #[c_l]    (47)

Note that since a word can have multiple classes, the sum in the denominator can be larger than the number of words in the corpus.
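
The following hedged Python sketch evaluates the path-based, Resnik and Lin measures (Equations 40, 42 and 43) on a toy hypernym tree; the tree, the class probabilities and the helper names are all invented for illustration, whereas a real system would read them from WordNet and corpus counts.

import math

parent = {"apple": "fruit", "pear": "fruit", "fruit": "entity", "car": "entity"}
prob = {"apple": 0.1, "pear": 0.1, "fruit": 0.3, "car": 0.2, "entity": 1.0}

def ancestors(c):
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def lcs(c1, c2):
    # Lowest common subsumer: first ancestor of c1 that also subsumes c2.
    a2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in a2)

def path_length(c1, c2):
    common = lcs(c1, c2)
    return ancestors(c1).index(common) + ancestors(c2).index(common)

def sim_path(c1, c2):
    return -math.log(path_length(c1, c2))

def sim_resnik(c1, c2):
    return -math.log(prob[lcs(c1, c2)])

def sim_lin(c1, c2):
    return 2 * math.log(prob[lcs(c1, c2)]) / (math.log(prob[c1]) + math.log(prob[c2]))

print(lcs("apple", "pear"), round(sim_path("apple", "pear"), 3),
      round(sim_resnik("apple", "pear"), 3), round(sim_lin("apple", "pear"), 3))
# fruit -0.693 1.204 0.523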

Definition 223 (Lowest common subsumer (LCS)). The Lowest common subsumer of two elements in a graph is the lowest node in the hierarchy that subsumes both elements.

Definition 224 (Overlap). The overlap between two sets is measured by the number of elements both sets contain:

overlap (A,B) = # [{x|x ∈ A ∧ x ∈ B}] (48)


Definition 225 (Corpus based word similarity). Corpus based word similarity is a group of Word similarity methods that perform this task by using a corpus. They are based on word co-occurrences: one generates a Co-occurrence vector ~f for each word w, where element i of the vector expresses a frequency or a binary value for the occurrence of word i in the context of w. By doing this for each word, one can generate a Term co-occurrence matrix. One then calculates the similarity by using statistical association tests (Pointwise mutual information, Lin-association measure, T-test, ...) or a similarity between the two vectors (Inner product, Cosine, ...).

Definition 226 (Pointwise mutual information (MI)). Pointwise mutual information is a metric that measures the similarity between two elements w_i, w_j with the following formula:

MI(w_i, w_j) = log2( Pr[w_i, w_j] / (Pr[w_i] · Pr[w_j]) )    (49)

Where Pr [wi] is the probability of the occurrence in the corpus and Pr [wi, wj ] is the probability of co-occurrence.
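
A small Python sketch of Equation 49 estimated from raw counts (the toy counts below are invented):

import math

def pmi(count_xy, count_x, count_y, total_pairs, total_words):
    p_xy = count_xy / total_pairs
    p_x = count_x / total_words
    p_y = count_y / total_words
    return math.log2(p_xy / (p_x * p_y))

# A word pair that co-occurs 8 times in 1000 pairs drawn from a 10000-word corpus.
print(round(pmi(count_xy=8, count_x=20, count_y=10, total_pairs=1000, total_words=10000), 3))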

Definition 227 (Lin-association measure). The Lin-association measure calculates the correlation between a word w and a feature f. A feature is defined as a tuple containing another word w′ and a relation r. The Lin-association measure extends Pointwise mutual information by including this relation:

assoc_Lin(w, f) = log2( Pr[w, f] / (Pr[w] · Pr[r|w] · Pr[w′|w]) )    (50)

Definition 228 (T-test). The T-test computes the difference between the observed and expected means, normalized by the variance:

t = (x̄ − µ) · √N / s    (51)

One tests the Null hypothesis that the words occur independently: Pr[w_i, w_j] = Pr[w_i] · Pr[w_j]. The association measure is based on the difference between these two:

assoc_T-test(w_i, w_j) = ( Pr[w_i, w_j] − Pr[w_i] · Pr[w_j] ) / √( Pr[w_i] · Pr[w_j] )    (52)

Definition 229 (Manhattan distance). The Manhattan distance for two vectors ~v and ~w is defined as:

∆(~v, ~w) = ∑_{i=1}^{n} |v_i − w_i|    (53)

Definition 230 (Euclidean distance). The Euclidean distance for two vectors ~v and ~w is defined as:

∆(~v, ~w) = √( ∑_{i=1}^{n} (v_i − w_i)² )    (54)

Definition 231 (Inner product similarity). The Inner product similarity for two vectors ~v and ~w is defined as:

similarity(~v, ~w) = ~v⊤ · ~w = ∑_{i=1}^{n} v_i · w_i    (55)

Definition 232 (Cosine similarity). The Cosine similarity for two vectors ~v and ~w is defined as:

similarity(~v, ~w) = (~v⊤ · ~w) / (‖~v‖ · ‖~w‖)    (56)

Definition 233 (Dice similarity). The Dice similarity for two vectors ~v and ~w is defined as:

similarity(~v, ~w) = (2 · ~v⊤ · ~w) / (‖~v‖ + ‖~w‖)    (57)

Definition 234 (Jensen-Shannon divergence). The Jensen-Shannon divergence for two vectors ~v and ~w is defined as:

div_JS(~v, ~w) = ∆(~v, (~v + ~w)/2) + ∆(~w, (~v + ~w)/2)    (58)

With ∆ the Kullback-Leibler distance.
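
Minimal Python implementations of the vector measures of Equations 53–57, written directly from the formulas above:

import math

def manhattan(v, w):
    return sum(abs(a - b) for a, b in zip(v, w))

def euclidean(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def inner(v, w):
    return sum(a * b for a, b in zip(v, w))

def cosine(v, w):
    return inner(v, w) / (math.sqrt(inner(v, v)) * math.sqrt(inner(w, w)))

def dice(v, w):
    return 2 * inner(v, w) / (math.sqrt(inner(v, v)) + math.sqrt(inner(w, w)))

v, w = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
print(manhattan(v, w), round(euclidean(v, w), 3), inner(v, w),
      round(cosine(v, w), 3), round(dice(v, w), 3))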

Definition 235 (Collocation). A Collocation is an expression consisting of two or more words (usually a phrase) that corresponds to some conventional way of saying things. Usually the collocation carries an element of meaning that cannot be predicted from the meanings of the composing parts. Corpus based association techniques can be used for the detection of Collocations.


9 Named Entity Recognition and Semantic Role Labeling

Definition 236 (Named entity recognition). Named entity recognition is a process where one recognizes and classifies named expressions in text (such as persons, companies, locations, protein names, ...). For instance “John Smith works for IBM.”. Two problems in this context are Segmentation and Classification.

Definition 237 (Segmentation). Segmentation is a process where one tries to find segments in a sequence which satisfy some constraint. In other words one determines which constituents in a sentence are semantic arguments. Since these constraints can be rather complex and fuzzy, Segmentation is sometimes a problem.

Definition 238 (Classification (Recognition)). Classification is a process where one tries to determine the role of each of the segments found by the Segmentation in the sentence.

Definition 239 (Semantic role labeling (Thematic role labeling, Case role assignment, Shallow semantic parsing)). Semantic role labeling is a process where one recognizes the basic event structure of a sentence (like “who?”, “does what?”, “to whom/what?”, “when?”, “where?”). Each algorithm uses the same high-level structure as Algorithm 9: for each node a feature vector is built, and a classification model generated from a training set then returns semantic content for these features. Most algorithms prune nodes according to some set of rules before classification. Furthermore, some classification models use Chunking instead of full parsing. Usually one evaluates the performance of such algorithms by Precision, Recall and F-measure. The classification algorithm will always generate a Frame. Therefore Semantic role labeling is a form of Relational learning and Context dependent classification.

Algorithm 9: A high-level description of a Semantic role labeling algorithm.
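
As an illustration of this high-level structure, the following hedged Python sketch builds a feature vector for every (pruned) parse-tree node and lets a trained classifier assign a role; the node format, the pruning rule and the classifier are stand-ins.

def label_semantic_roles(parse_nodes, extract_features, classifier, keep=lambda n: True):
    """parse_nodes: iterable of constituent nodes of a parsed sentence."""
    labels = {}
    for node in parse_nodes:
        if not keep(node):           # many systems first prune unlikely argument nodes
            continue
        labels[node] = classifier(extract_features(node))   # e.g. "AGENT", "PATIENT", "NONE"
    return labels

nodes = ["NP(John)", "VP(opened)", "NP(the door)"]
print(label_semantic_roles(nodes,
                           extract_features=lambda n: {"text": n},
                           classifier=lambda f: "AGENT" if f["text"] == "NP(John)" else "NONE"))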

Definition 240 (Semantic frame labeling). Semantic frame labeling is an extension of Semantic role labeling where one does not only label the individual roles but the entire frame structure. This might involve the analysis of several sentences and disambiguation of the meaning of the predicate.

Definition 241 (Proposition bank (PropBank)). A Proposition bank is a resource of sentences annotated with semantic roles. These roles differ per language. The semantic roles are encoded in the form of numbered arguments specific to the verb sense. For instance the “agree” relation: agree(agreer, object of the agreement, other entity agreeing).

Definition 242 (Frame). A Frame is a script-like structure that instantiates a set of frame-specific semantic roles called Frame elements. It assigns the Core roles and Non-core roles across the different verbs. Relations like causality are also expressed.

Definition 243 (FrameNet). FrameNet is a resource containing a large number of Frames with their Frame elements as roles.

Definition 244 (Classifier). A Classifier is an algorithm that, based on a vector ~x, returns a label y. Classifiers come in two flavors: Generative classifiers and Discriminative classifiers.

Definition 245 (Generative classifier). A Generative classifier is an algorithm that learns a model of the joint probability Pr[~x, y] and makes its predictions by using Bayes rule to calculate Pr[y|~x], and then selects the most likely label. Examples of such classifiers are Naive Bayes and Hidden Markov Models.

Definition 246 (Discriminative classifier). A Discriminative classifier is an algorithm trained to model the conditional probability Pr[y|~x] directly and selects the most likely label y, or learns a direct map from inputs ~x to the class labels y. Examples include Maximum entropy models, Support vector machines, ...

Definition 247 (Linear regression). Linear regression is a model where we assume that the classes that we want to find are linearly separable, so the result is a linear combination of the elements in the input vector. Thus:

y ≈ w_0 + ∑_{i=1}^{n} w_i · x_i    (59)

In most cases w_0 is assumed to be zero. The weights are learned by minimizing the cost over a Training set of m examples ~x_1, . . . , ~x_m with targets y_1, . . . , y_m:

cost(~w) = ∑_{i=1}^{m} ( w_0 + ∑_{j=1}^{n} w_j · x_{i j} − y_i )²    (60)

If one performs classification, the expected answer is calculated and the system answers with right or wrong. One could train this model to output Boolean values by using 0 and 1 as training targets, but the calculated results are not guaranteed to lie in the interval [0, 1]. Therefore one uses Logistic regression.


Definition 248 (Logistic regression). Logistic regression is a model where one assumes that the Log odds function of the probability of a certain vector ~x being classified as true can be modeled with Linear regression. This guarantees that the probabilities are elements of the [0, 1] interval. More formally one states:

ln( Pr[y = true|~x] / (1 − Pr[y = true|~x]) ) ≈ w_0 + ∑_{i=1}^{n} w_i · x_i    (61)

One then classifies a vector ~x as true if:

w_0 + ∑_{i=1}^{n} w_i · x_i > 0    (62)

Learning the weights is done through convex optimization, for which several methods exist, such as Gradient ascent, Conjugate gradient, ...
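
A minimal Python sketch of the decision rule of Equations 61–62: the linear score is the log odds, so the class is true whenever the score is positive (equivalently, whenever the sigmoid of the score exceeds 0.5); the weights are toy values.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(weights, bias, x):
    score = bias + sum(w * xi for w, xi in zip(weights, x))   # w_0 + sum_i w_i x_i
    return score > 0, sigmoid(score)                          # decision, Pr[y = true | x]

print(classify(weights=[1.5, -2.0], bias=0.1, x=[1.0, 0.3]))  # (True, ~0.73)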

Definition 249 (Log odds function (Logit)). The Log odds function is defined as follows:

logit(P(x)) = ln( P(x) / (1 − P(x)) )    (63)

Definition 250 (Maximum entropy model (Maxent, Multinomial logistic regression)). The Maximum entropy model is a generalization of Logistic regression: instead of two classes, multiple classes y1, y2, . . . , yC are considered.

Definition 251 (Maximum entropy Markov model (MEMM)).

Definition 252 (Conditional random field (CRF)).


10 Discourse Analysis


11 Temporal Analysis


12 Generic tables

Full form   Clitic      Full form   Clitic

am          ’m          have        ’ve
are         ’re         has         ’s
is          ’s          had         ’d
will        ’ll         would       ’d

Table 1: List of clitics in English.

Regular noun

fox
cat
aardvark

Table 2: List of regular nouns in English.

Irregular plural noun

geese
sheep
mice

Table 3: List of irregular plural nouns in English.

12.1 Miniature grammar

〈S〉 → 〈NP〉 〈VP〉
〈S〉 → 〈Aux〉 〈NP〉 〈VP〉
〈S〉 → 〈VP〉
〈NP〉 → 〈Pronoun〉
〈NP〉 → 〈Proper-Noun〉
〈NP〉 → 〈Det〉 〈Nominal〉
〈Nominal〉 → 〈Noun〉
〈Nominal〉 → 〈Nominal〉 〈Noun〉
〈Nominal〉 → 〈Nominal〉 〈PP〉
〈VP〉 → 〈Verb〉
〈VP〉 → 〈Verb〉 〈NP〉
〈VP〉 → 〈Verb〉 〈NP〉 〈PP〉
〈VP〉 → 〈Verb〉 〈PP〉
〈VP〉 → 〈VP〉 〈NP〉
〈PP〉 → 〈Preposition〉 〈NP〉
〈Det〉 → that | this | a
〈Noun〉 → book | flight | meal | money
〈Verb〉 → book | include | prefer
〈Pronoun〉 → I | she | me
〈Proper-Noun〉 → Houston | NWA
〈Aux〉 → does
〈Preposition〉 → from | on | near | through


Irregular singular noun

goose
sheep
mouse

Table 4: List of irregular singular nouns in English.

Regular verb stem

walk
fry
talk
impeach

Table 5: List of regular verb stems in English.

Irregular verb stem

cut
speak
sing

Table 6: List of irregular verb stems in English.

Irregular past verb stem

caught
ate
eaten
sang

Table 7: List of irregular past verb stems in English.


Tag Description

CC    Coordinated conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition
JJ    Adjective
JJR   Comparative adjective
JJS   Superlative adjective
LS    List item marker
MD    Modal
NN    Singular noun or Mass noun
NNS   Plural noun
NNP   Proper singular noun
NNPS  Proper plural noun
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRPS  Possessive pronoun
RB    Adverb
RBR   Comparative adverb
RBS   Superlative adverb
RP    Particle
SYM   Symbol
TO    To
UH    Interjection
VB    Base form verb
VBD   Past tense verb
VBG   Gerund verb
VBN   Past participle verb
VBP   Non 3sg present verb
VBZ   3sg present verb
WDT   Wh-determiner
WP    Wh-pronoun
WPS   Possessive wh
WRB   Wh-adverb

Table 8: List of English Part-of-Speech tags used in the Penn Treebank.


13 Symbols

Symbol Meaning

Pr[e]      The probability of an event e.
Pr[e|c]    The probability of an event e given conditions c.
#[e]       The number of times a certain event e is counted in a reference set.
P(S)       The power set of a set S.
~w         A vector containing the Words in a certain fragment.
~t         A vector containing the Part-of-Speech tags in a certain fragment.
δ(x, y)    The Kronecker delta of x and y.

Table 9: List of symbols and their meaning.

