Unsupervised Knowledge-Free Morpheme Boundary Detection

Stefan Bordag

University of Leipzig

Example
Related work
Part One: Generating training data
Part Two: Training and applying a classifier
Preliminary results
Further research


Example: clearly, early

The examples used throughout this presentation are clearly and early.

In one case the stem is clear, in the other early.

Other word forms of the same lemmas:
clear-ly: clear-est, clear-er, clear-ing
early: earl-ier, earl-iest

Semantically related words:
clearly: logically, really, totally, weakly, …
early: morning, noon, day, month, time, …

Correct morpheme boundary analysis:
clearly → clear-ly, but not *clearl-y or *clea-rly
early → early or earl-y, but not *ear-ly


1. Three approaches to morpheme boundary detection

Three kinds of approaches:

1. Genetic algorithms and the Minimum Description Length model (Kazakov 97 & 01), (Goldsmith 01), (Creutz 03 & 05)
This approach uses only a word list, not the context information for each word from the corpus.
This possibly imposes an upper limit on achievable performance (especially with regard to irregularities).
One advantage is that smaller corpora are sufficient.

2. Semantics-based (Schone & Jurafsky 01), (Baroni 03)
A general problem of this approach arises with examples like deeply and deepness, where semantic similarity is unlikely.

3. Letter Successor Variety (LSV) based (Harris 55); first application (Hafer & Weiss 74), but with low performance.
Also applied only to a word list.
Further hampered by noise in the data.


2. New solution in two parts

[Diagram: pipeline in two parts. Part one: sentences ("The talk was very informative") → neighbor cooccurrences ("the talk 1", "talk was 1", …) → similar words ("talk speech 20", "was is 15", …) → compute LSV with the score s = LSV * freq * multiletter * bigram, yielding training data such as clear-ly, lately, early. Part two: train a trie-based classifier on this data, then apply the classifier to the whole word list, yielding clear-ly, late-ly, early, …]


2.1. First part: Generating training data with LSV and distributional semantics

Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.

Frequency of word A is n_A, of word B is n_B.
Frequency of the cooccurrence of A with B is n_AB.
Corpus size is n.
The significance computation is the Poisson approximation of the log-likelihood (Dunning 93), (Quasthoff & Wolff 02), with λ = n_A · n_B / n:

sig_poiss(A, B) = (1 / ln n) · n_AB · (ln n_AB − ln λ − 1)
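This significance measure can be sketched in code as follows. A minimal illustration, not the author's implementation: the function name and the example counts are invented, and the formula is the Poisson approximation of the log-likelihood with λ = n_A · n_B / n as given above.

```python
import math

def sig_poisson(n_a, n_b, n_ab, n):
    """Poisson approximation of the log-likelihood significance of the
    cooccurrence of words A and B in a corpus of size n."""
    if n_ab == 0:
        return 0.0
    lam = n_a * n_b / n  # expected cooccurrence count under independence
    return n_ab * (math.log(n_ab) - math.log(lam) - 1) / math.log(n)

# A pair that cooccurs far more often than chance scores high;
# a pair near its expected count scores near (or below) zero.
print(sig_poisson(1000, 2000, 150, 10**6))
print(sig_poisson(1000, 2000, 2, 10**6))
```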


Neighbors of "clearly"

Most significant left neighbors: very, quite, so, It's, most, it's, shows, results, that's, stated, Quite

Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood

Example contexts: "It's clearly labeled", "very clearly shows"


2.2. New solution as a combination of two existing approaches

Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → words that are surrounded by the same cooccurrences mostly bear the same grammatical marker.


Similar words to "clearly"

Most significant left neighbors: very, quite, so, It's, most, it's, shows, results, that's, stated, Quite

Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood

Words with similar cooccurrence profiles: …, weakly, legally, closely, clearly, greatly, linearly, really, …


2.3. New solution as a combination of two existing approaches

Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → words that are surrounded by the same cooccurrences mostly bear the same grammatical marker.
Sort those words by edit distance and keep the 150 most similar → further words only add random noise.
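The edit-distance filtering step can be sketched as follows. A minimal illustration under the assumption that plain Levenshtein distance is meant; the function names and the tiny candidate list are invented stand-ins for the real list of distributionally similar words.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def keep_most_similar(word, candidates, k=150):
    """Keep the k candidates closest to `word` by edit distance."""
    return sorted(candidates, key=lambda c: levenshtein(word, c))[:k]

print(keep_most_similar("clearly", ["closely", "greatly", "legally", "morning"], k=3))
```

Semantically related but string-wise distant words such as "morning" are dropped here, which is exactly the point of the filter: they would only add noise to the letter-based statistics.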


Similar words to "clearly", sorted by edit distance

Most significant left neighbors: very, quite, so, It's, most, it's, shows, results, that's, stated, Quite

Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood

Sorted list: clearly, closely, greatly, legally, linearly, really, weakly, …


2.4. New solution as a combination of two existing approaches

Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → words that are surrounded by the same cooccurrences mostly bear the same grammatical marker.
Sort those words by edit distance and keep the 150 most similar → further words only add random noise.
Compute the letter successor variety for each transition between two characters of the input word.
Report boundaries where the LSV is above a threshold.


2.5. Letter successor variety

Letter successor variety: Harris (55), where a word split occurs if the number of distinct letters that can follow a given sequence of characters surpasses a threshold.

Input are the 150 most similar words.
Observe how many different letters occur after a part of the string:
#c-: in the given list, 5 letters follow #c-
#cl-: only 3 letters
#cle-: only 1 letter
…
-ly#: reversed, before -ly# there are 16 different letters (16 different stems precede the suffix -ly#)

  # c l e a r l y #
  28 5 3 1 1 1 1 1     from left (thus after #c-, 5 various letters)
  1 1 2 1 3 16 10 14   from right (thus before -y#, 10 various letters)
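The left-to-right variety count can be sketched as follows. A minimal illustration, not the author's implementation: the six-word list stands in for the 150 most similar words, and '#' marks the word boundary as on the slide. The right-to-left variety is the same computation on the reversed strings.

```python
def lsv(word, similar_words):
    """Left-to-right letter successor variety: for each prefix of `word`,
    count the distinct letters that follow that prefix in the word set."""
    words = ["#" + w + "#" for w in similar_words]
    target = "#" + word + "#"
    scores = []
    for i in range(1, len(target)):
        prefix = target[:i]
        successors = {w[i] for w in words if w.startswith(prefix) and len(w) > i}
        scores.append(len(successors))
    return scores

words = ["clearly", "closely", "greatly", "legally", "really", "weakly"]
print(lsv("clearly", words))  # [5, 1, 2, 1, 1, 1, 1, 1]
```

With only six words the first value is 5 (five distinct first letters); with the full 150-word list it would be the 28 shown on the slide.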


2.5.1. Balancing factors

The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:
freq: frequency differences between the beginning and the middle of the word
multiletter: representation of single phonemes by several letters
bigram: certain fixed combinations of letters

The final score s for each possible boundary is then:

s = LSV * freq * multiletter * bigram


2.5.2. Balancing factors: Frequency

LSV is not normalized against frequency:
28 different first letters within 150 words
5 different second letters within 11 words beginning with c
3 different third letters within 4 words beginning with cl

Computing the frequency weight freq: 4 out of 11 begin with #cl-, so the weight is 4/11.

  #   c  l  e a r l y #
  150 11 4  1 1 1 1 1    (of the 11 words starting with c, 4 begin with cl)
  0.1 0.4 0.3 1 1 1 1 1  weights from left
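The frequency weight can be sketched as the ratio of successive prefix counts. A minimal illustration with an invented five-word list; in the real setting the list would be the 150 most similar words, giving the 11/150 ≈ 0.1 and 4/11 ≈ 0.4 shown above.

```python
def freq_weights(word, similar_words):
    """Left-to-right frequency weights: at each transition, the number of
    words sharing the prefix up to that point divided by the number sharing
    the one-letter-shorter prefix (e.g. 4 of 11 c-words start with cl- -> 4/11)."""
    counts = [sum(1 for w in similar_words if w.startswith(word[:i]))
              for i in range(len(word) + 1)]
    return [counts[i + 1] / counts[i] for i in range(len(word))]

print(freq_weights("clear", ["clear", "clearly", "closely", "cat", "dog"]))
# [0.8, 0.75, 0.666..., 1.0, 1.0]: 4 of 5 words start with c, 3 of those 4 with cl, ...
```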


2.5.3. Balancing factors: Multiletter phonemes

Problem: two or more letters which together represent one phoneme "carry away" the numerator of the overlap factor quotient.

Letter split variety:
  # s c h l i m m e
  7 1 7 2 1 1 2      (from left)
  2 1 1 1 2 4 15     (from right)

Computing the overlap factor:
  150 27 18 18 6 5 5 5
  2 2 2 2 3 7 105 150

At this point the LSV of 7 is weighted 1 (18/18), but since 'sch' is one phoneme, it should have been 18/150.

Solution: rank the bi- and trigrams; the highest-ranked receives a weight of 1.0.
The overlap factor is recomputed as a weighted average: in this case that means 1.0 * 27/150, since 'sch' is the highest-ranked trigram and has a weight of 1.0.


2.5.4. Balancing factors: Bigrams

It is obvious that -th- in English should almost never be divided.

A bigram ranking is computed over all words in the word list; the highest-ranked bigram receives a weight of 0.1 and the lowest-ranked a weight of 1.0.
The LSV score is then multiplied with the resulting weight.
Thus the German -ch-, the highest-ranked bigram, receives a penalty of 0.1, making it nearly impossible for it to become a morpheme boundary.


2.5.5. Sample computation

Compute letter successor variety:
  # c l e a r - l y #            # e a r l y #
  28 5 3 1 1 1 1 1    (left)     40 5 1 1 2 1    (left)
  1 1 2 1 3 16 10 10  (right)    1 2 1 4 6 19    (right)

Balancing: frequencies:
  150 11 4 1 1 1 1 1   (left)    150 9 2 2 2 1   (left)
  1 1 2 2 5 76 90 150  (right)   1 2 2 6 19 150  (right)

Balancing: multiletter weights:
  Bi l:  0.4 0.1 0.5 0.2 0.5 0.0 0.2 0.2 0.5 0.0
  Tri r: 0.1 0.1 0.1 0.1 0.0 0.0 0.1 0.0
  Bi l:  0.5 0.2 0.5 0.0 0.1 0.3 0.5 0.0 0.1 0.3
  Tri r: 0.1 0.0 0.1 0.0 0.2 0.0 0.0 0.2

Balancing: bigram weight:
  0.1 0.5 0.2 0.5 0.0            0.1 0.2 0.5 0.0 0.1

Left and right LSV scores:
  0.1 0.3 0.0 0.4 1.0 0.9        0.0 0.0 0.5 1.7
  0.3 0.9 0.1 0.0 12.4 3.7       1.0 0.0 0.7 0.2

Computing the right score for clear-ly:
  16 * (76/90 + 0.1 * 76/150) / (1.0 + 0.1) * (1 − 0.0) = 12.4

Summed left and right scores:
  0.4 1.2 0.1 0.4 13.4 4.6       1.0 0.1 1.2 2.0

threshold: 5  →  clear-ly, early


3. Second part: Training and applying the classifier

Any word list can be stored in a trie (Fredkin 60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison 68).

Example words: clearly, early, lately, clear, late

[Diagram: the trie over these words, with ¤ marking the end or beginning of a word; in the PCT, single-child chains such as "cl", "ear", and "late" are collapsed into single edges.]
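The word storage can be sketched with a plain (uncompacted) trie; a PATRICIA compact tree would additionally merge single-child chains into one edge. The class and function names are illustrative, not the author's implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # the '¤' marker: a word ends at this node

def insert(root, word):
    """Add one word to the trie, letter by letter."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

root = TrieNode()
for w in ["clearly", "early", "lately", "clear", "late"]:
    insert(root, w)

# 'clear' and 'clearly' share the path c-l-e-a-r:
node = root
for ch in "clear":
    node = node.children[ch]
print(node.is_word)  # True: 'clear' itself is stored as a word
```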


3.1. PCT as a classifier

[Diagram: the PCT over clear-ly, late-ly, early, clear, late. Training adds the known analyses to the tree by counting the observed splits at each node (e.g. ly=2 at the node shared by clear-ly and late-ly, ly=1 and ¤=1 at deeper nodes). To classify a new word such as amazing?ly or dear?ly, the known information is retrieved and the deepest found node is applied, yielding amazing-ly and dearly.]
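The train/apply cycle can be sketched roughly as follows. This is a deliberate simplification of the slide's PCT mechanism: a dict-based trie over reversed words stores how often each split decision was seen, and a new word receives the most frequent decision at the deepest matching node. All names and the tiny training set are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.counts = {}  # split decision -> count, e.g. {"ly": 2}

def train(root, word, suffix):
    """Record a known analysis: walking the reversed word, count the
    suffix decision at every node ('' means the word was left unsplit)."""
    node = root
    for ch in reversed(word):
        node = node.children.setdefault(ch, Node())
        node.counts[suffix] = node.counts.get(suffix, 0) + 1

def classify(root, word):
    """Walk the reversed word as deep as possible and apply the most
    frequent decision stored at the deepest reachable node."""
    node, best = root, None
    for ch in reversed(word):
        if ch not in node.children:
            break
        node = node.children[ch]
        if node.counts:
            best = max(node.counts, key=node.counts.get)
    return f"{word[:len(word) - len(best)]}-{best}" if best else word

root = Node()
for word, suffix in [("clearly", "ly"), ("lately", "ly"), ("early", "")]:
    train(root, word, suffix)

print(classify(root, "amazingly"))  # amazing-ly
```

An unseen word whose ending matches no trained path (e.g. "dog") is returned unchanged, mirroring the fallback behavior of the deepest-node rule.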


4. Evaluation

Boundary measuring: each detected boundary can be correct or wrong (precision), and boundaries can also remain undetected (recall).

The first evaluation is global LSV with the proposed improvements.

[Chart: F-measure curves F(lsv), F(lsv*fw), and F(lsv*fw*ib); y-axis from 0 to 1]
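The boundary-based evaluation can be sketched as follows: a minimal illustration in which each word's boundaries are given as a set of split positions, and the gold/predicted analyses are invented examples in the spirit of the clearly/early running example.

```python
def boundary_prf(gold, predicted):
    """Boundary precision, recall, and F-measure over parallel lists of
    per-word boundary-position sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # correctly detected boundaries
        fp += len(p - g)  # wrongly proposed boundaries
        fn += len(g - p)  # missed boundaries
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

# gold: clear-ly (split at 5), early (no split); predicted: clear-ly, earl-y
print(boundary_prf([{5}, set()], [{5}, {4}]))  # (0.5, 1.0, 0.666...)
```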


Evaluating LSV: precision vs. recall

[Chart: curves P(lsv), R(lsv), P(lsv*fw), R(lsv*fw), P(lsv*fw*ib), R(lsv*fw*ib); y-axis from 0 to 1]


Evaluating LSV: F-measure

[Chart: F(lsv), F(lsv*fw), F(lsv*fw*ib) as a function of the threshold (0 to 24); y-axis from 0 to 1]


Evaluating the combination: precision vs. recall

[Chart: P(lsv+trie), R(lsv+trie), P(lsv*fw+trie), R(lsv*fw+trie), P(lsv*fw*ib+trie), R(lsv*fw*ib+trie) as a function of the threshold (0 to 24); y-axis from 0 to 1]


Evaluating the combination: F-measure

[Chart: F(lsv+trie), F(lsv*fw+trie), F(lsv*fw*ib+trie) as a function of the threshold (0 to 24); y-axis from 0 to 0.8]


Comparing the combination with global LSV

[Chart: comparison of the combined approach and global LSV; both axes from 0 to 1]


4.1. Results

German newspaper corpus with 35 million sentences.
English newspaper corpus with 13 million sentences.

t=5                  German   English
lsv precision        80.20    70.35
lsv recall           34.52    10.86
lsv F-measure        48.27    18.82
combined precision   68.77    52.87
combined recall      72.11    52.56
combined F-measure   70.40    55.09


4.2. Statistics

                            en lsv    en comb    tr lsv    tr comb    fi lsv    fi comb
corpus size                 13 million           1 million            4 million
number of word forms        167,377              582,923              1,636,336
analysed words              49,159    94,237     26,307    460,791    68,840    1,380,841
boundaries                  70,106    131,465    31,569    812,454    84,193    3,138,039
morpheme length             2.60      2.56       2.29      3.03       2.32      3.73
length of analysed words    8.97      8.91       9.75      10.62      11.94     13.34
length of unanalysed words  7.56      6.77       10.12     8.15       12.91     10.47
morphemes per word          2.43      2.40       2.20      2.76       2.22      3.27


4.3. Assessing the true error rate

Typical sample list of words counted as wrong against CELEX (system analysis vs. CELEX analysis):

Tau-sende           Tausend-e
senegales-isch-e    senegalesisch-e
sensibelst-en       sens-ibel-sten
separat-ist-isch-e  separ-at-istisch-e
tris-t              trist
triump-hal          triumph-al
trock-en            trocken
unueber-troff-en    un-uebertroffen
trop-f-en           tropf-en
trotz-t-en          trotz-ten
ver-traeum-t-e      vertraeumt-e

Reasons:
Gender -e (counted as correct in (Creutz & Lagus 05), for example)
Compounds (sometimes separated, sometimes not)
-t-en error
With proper names, -isch is often not analyzed
Connecting elements


4.4. Real example

Orien-tal
Orien-tal-ische
Orien-tal-ist
Orien-tal-ist-en
Orien-tal-ist-ik
Orien-tal-ist-in
Orient-ier-ung
Orient-ier-ungen
Orient-ier-ungs-hilf-e
Orient-ier-ungs-hilf-en
Orient-ier-ungs-los-igkeit
Orient-ier-ungs-punkt
Orient-ier-ungs-punkt-e
Orient-ier-ungs-stuf-e

Ver-trau-enskrise
Ver-trau-ensleute
Ver-trau-ens-mann
Ver-trau-ens-sache
Ver-trau-ensvorschuß
Ver-trau-ensvo-tum
Ver-trau-ens-würd-igkeit
Ver-traut-es
Ver-trieb-en
Ver-trieb-spartn-er
Ver-triebene
Ver-triebenenverbände
Ver-triebs-beleg-e


5. Further research

Examine quality on various language types
Improve the trie-based classifier
Possibly combine with other existing algorithms
Find out how to acquire the morphology of non-concatenative languages
Deeper analysis: find deletions, alternations, insertions, morpheme classes, etc.


6. References

(Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of Coling 2004, Geneva, Switzerland, 2004.

(Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003.

(Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Publications in Computer and Information Science, Report A81. Helsinki University of Technology, March 2005.

(Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998.

(Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.


6. References II

(Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001.

(Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.

(Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955.

(Kazakov 97) Dimitar Kazakov. Unsupervised learning of naïve morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997.

(Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The Poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations, 2002.

(Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language-independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001, Seattle, WA, August 2001. Machine Learning: Beyond Supervision.


E. Gender-e vs. frequency-e

vs. gender-e:
Schule 8.4, Devise 7.8, Sonne 4.5, Abendsonne 5.3, Abende 5.5, Liste 6.5

vs. other-e:
andere 8.4, keine 6.8, rote 11.6, stolze 8.0, drehte 10.8, winzige 9.7, lustige 13.2, rufe 4.4, Dumme 12.6

Frequency-e:
Affe 2.7, Junge 5.3, Knabe 4.6, Bursche 2.4, Backstage 3.0