1

Unsupervised Knowledge-Free Morpheme Boundary Detection

Stefan Bordag
University of Leipzig

Example
Related work
Part One: Generating training data
Part Two: Training and applying a classifier
Preliminary results
Further research
2

Example: clearly, early

The examples used throughout this presentation are clearly and early.
In one case the stem is clear, and in the other early.
Other word forms of the same lemmas:
  clear-ly: clear-est, clear-er, clear-ing; early: earl-ier, earl-iest
Semantically related words:
  clearly: logically, really, totally, weakly, …
  early: morning, noon, day, month, time, …
Correct morpheme boundary analysis:
  clearly → clear-ly, but not *clearl-y or *clea-rly
  early → early or earl-y, but not *ear-ly
3

1. Three approaches to morpheme boundary detection

Three kinds of approaches:
1. Genetic algorithms and the Minimum Description Length model (Kazakov 97 & 01), (Goldsmith 01), (Creutz 03 & 05)
   This approach uses only a word list, not the context information for each word from the corpus.
   This possibly results in an upper limit on achievable performance (especially with regard to irregularities).
   One advantage is that smaller corpora suffice.
2. Semantics-based (Schone & Jurafsky 01), (Baroni 03)
   A general problem of this approach is posed by examples like deeply and deepness, where semantic similarity is unlikely.
3. Letter Successor Variety (LSV) based (Harris 55)
   (Hafer & Weiss 74): first application, but low performance
   Also applied only to a word list
   Further hampered by noise in the data
4

2. New solution in two parts
[Diagram: the two-part pipeline. Sentences (“The talk was very informative”) yield neighbor cooccurrences (“The talk” 1, “Talk was” 1, …), from which similar words are computed (“Talk speech” 20, “Was is” 15, …). Part one computes the LSV score s = LSV * freq * multiletter * bigram to generate training data (clear-ly, lately, early, …); part two trains a trie classifier on that data and applies it, yielding clear-ly, late-ly, early, …]
5

2.1. First part: Generating training data with LSV and distributional semantics

Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
The frequencies of words A and B are n_A and n_B.
The frequency of the cooccurrence of A with B is n_AB.
The corpus size is n.
The significance computation is the Poisson approximation of the
log-likelihood (Dunning 93), (Quasthoff & Wolff 02):

sig_poiss(A, B) = (1/ln n) · ( (n_A · n_B)/n − n_AB · ln((n_A · n_B)/n) + ln(n_AB!) )
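A minimal sketch of this significance computation in Python (the function name and signature are mine; ln k! is obtained via lgamma):

```python
from math import lgamma, log

def sig_poisson(n_a: int, n_b: int, n_ab: int, n: int) -> float:
    """Poisson approximation of the log-likelihood significance of
    word A cooccurring with word B n_ab times in a corpus of size n
    (Dunning 93; Quasthoff & Wolff 02)."""
    lam = n_a * n_b / n  # cooccurrence count expected by chance
    # lgamma(k + 1) == ln(k!), which avoids overflow for large counts
    return (lam - n_ab * log(lam) + lgamma(n_ab + 1)) / log(n)
```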
6

Neighbors of “clearly”

Most significant left neighbors: very, quite, so, It‘s, most, it‘s, shows, results, that‘s, stated, Quite

Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood

Example contexts: “It’s clearly labeled”, “very clearly shows”
7

2.2. New solution as combination of two existing approaches

Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker.
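The slides do not spell out the similarity measure on these profiles; as a stand-in, a minimal sketch using cosine similarity over significance-weighted neighbor profiles (all names are mine):

```python
from math import sqrt

def profile_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two neighbor-cooccurrence profiles, each
    mapping a neighboring word to its significance score."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Words whose left and right profiles both score high against those of the input word are taken as bearing the same grammatical marker.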
8

Similar words to “clearly”

Most significant left neighbors: very, quite, so, It‘s, most, it‘s, shows, results, that‘s, stated, Quite
Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood
Words with similar profiles: … weakly, legally, closely, clearly, greatly, linearly, really …
9

2.3. New solution as combination of two existing approaches

Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker.
Sort those words by edit distance and keep the 150 most similar → further words only add random noise.
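A minimal sketch of this filtering step, assuming plain Levenshtein distance (the helper names are mine):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_words(word: str, candidates: list[str], k: int = 150) -> list[str]:
    """Keep the k candidates closest to `word` by edit distance."""
    return sorted(candidates, key=lambda c: edit_distance(word, c))[:k]
```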
10

Similar words to “clearly” sorted by edit distance

Most significant left neighbors: very, quite, so, It‘s, most, it‘s, shows, results, that‘s, stated, Quite
Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood
Sorted list: clearly, closely, greatly, legally, linearly, really, weakly, …
11

2.4. New solution as combination of two existing approaches

Overview:
Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
Use these neighbor cooccurrences to find words that have similar cooccurrence profiles → those that are surrounded by the same cooccurrences bear mostly the same grammatical marker.
Sort those words by edit distance and keep the 150 most similar → further words only add random noise.
Compute the letter successor variety for each transition between two characters of the input word. Report boundaries where the LSV is above a threshold.
12

2.5. Letter successor variety

Letter successor variety: Harris (55), where a word split occurs if the number of distinct letters that follow a given sequence of characters surpasses a threshold.
The input is the 150 most similar words. Observe how many different letters occur after a part of the string:
  #c-: 5 letters in the given list
  #cl-: only 3 letters
  #cle-: only 1 letter
  …
  -ly#: the same reversed; before -ly# there are 16 different letters (16 different stems preceding the suffix -ly#)

# c l e a r l y #
28 5 3 1 1 1 1 1   from the left (e.g. 5 different letters after #c-)
1 1 2 1 3 16 10 14 from the right (e.g. 10 different letters before -y#)
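A minimal sketch of the LSV computation over the similar-word list (names are mine; the predecessor variety is the same computation on reversed strings):

```python
def lsv_left(word: str, similar: list[str]) -> list[int]:
    """For each prefix of `word`, count the distinct letters that can
    follow it in the similar-word list; on the slide's data this
    should reproduce 28, 5, 3, 1, ... for "clearly"."""
    counts = []
    for i in range(len(word)):
        prefix = word[:i]
        followers = {w[i] for w in similar
                     if w.startswith(prefix) and len(w) > i}
        counts.append(len(followers))
    return counts

def lsv_right(word: str, similar: list[str]) -> list[int]:
    """Letter predecessor variety: the same, read from the right."""
    rev = lsv_left(word[::-1], [w[::-1] for w in similar])
    return rev[::-1]
```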
13

2.5.1. Balancing factors

The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:
  freq: frequency differences between the beginning and the middle of a word
  multiletter: representation of single phonemes with several letters
  bigram: certain fixed combinations of letters

The final score s for each possible boundary is then:
s = LSV * freq * multiletter * bigram
14

2.5.2. Balancing factors: Frequency

LSV is not normalized against frequency:
  28 different first letters within 150 words
  5 different second letters within the 11 words beginning with c
  3 different third letters within the 4 words beginning with cl
Computing the frequency weight freq: 4 out of 11 words begin with #cl-, so the weight is 4/11.

# c l e a r l y #
150 11 4 1 1 1 1 1    (of the 11 words beginning with c, 4 begin with cl)
0.1 0.4 0.3 1 1 1 1 1 (weights from the left)
15

2.5.3. Balancing factors: Multiletter phonemes

Problem: two or more letters which together represent one phoneme “carry away” the numerator of the overlap factor quotient.

Letter split variety:
# s c h l i m m e
7 1 7 2 1 1 2   from the left
2 1 1 1 2 4 15  from the right

Computing the overlap factor:
150 27 18 18 6 5 5 5
2 2 2 2 3 7 105 150
(after sch, the LSV of 7 is weighted 1 (18/18), but since sch is one phoneme, it should have been 18/150!)

Solution: rank bi- and trigrams; the highest-ranked receives a weight of 1.0.
The overlap factor is recomputed as a weighted average: in this case 1.0 * 27/150, since ‘sch’ is the highest-ranked trigram and has a weight of 1.0.
16

2.5.4. Balancing factors: Bigrams

It is obvious that -th- in English is almost never to be divided.
Compute a bigram ranking over all words in the word list and give a weight of 0.1 to the highest-ranked and 1.0 to the lowest-ranked bigram.
The LSV score is then multiplied with the resulting weight.
Thus the German -ch-, which is the highest-ranked bigram, receives a penalty of 0.1, making it nearly impossible for it to become a morpheme boundary.
17

2.5.5. Sample computation

Compute letter successor variety:
  # c l e a r - l y #: 28 5 3 1 1 1 1 1 from the left; 1 1 2 1 3 16 10 10 from the right
  # e a r l y #: 40 5 1 1 2 1 from the left; 1 2 1 4 6 19 from the right

Balancing, frequencies:
  clearly: 150 11 4 1 1 1 1 1 from the left; 1 1 2 2 5 76 90 150 from the right
  early: 150 9 2 2 2 1 from the left; 1 2 2 6 19 150 from the right

Balancing, multiletter weights:
  clearly: bigram left 0.4 0.1 0.5 0.2 0.5 0.0 0.2 0.2 0.5 0.0; trigram right 0.1 0.1 0.1 0.1 0.0 0.0 0.1 0.0
  early: bigram left 0.5 0.2 0.5 0.0 0.1 0.3 0.5 0.0 0.1 0.3; trigram right 0.1 0.0 0.1 0.0 0.2 0.0 0.2

Balancing, bigram weights:
  clearly: 0.1 0.5 0.2 0.5 0.0
  early: 0.1 0.2 0.5 0.0 0.1

Left and right LSV scores:
  clearly: left 0.1 0.3 0.0 0.4 1.0; right 0.3 0.9 0.1 0.0 12.4
  early: left 0.9 0.0 0.0 0.5 1.7; right 3.7 1.0 0.0 0.7 0.2

Computing the right score for clear-ly: 16 * (76/90 + 0.1 * 76/150) / (1.0 + 0.1) * (1 - 0.0) = 12.4

Summed left and right scores:
  clearly: 0.4 1.2 0.1 0.4 13.4
  early: 4.6 1.0 0.1 1.2 2.0

threshold: 5 → clear-ly, early
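A minimal sketch of how the factors could be combined and thresholded per candidate boundary (the worked example above folds the multiletter weight into the frequency quotient, so this flat product is a simplification; names are mine):

```python
def boundary_scores(lsv, freq, multiletter, bigram):
    """s = LSV * freq * multiletter * bigram, per candidate boundary."""
    return [l * f * m * b
            for l, f, m, b in zip(lsv, freq, multiletter, bigram)]

def segment(word: str, scores: list[float], threshold: float = 5.0) -> str:
    """Insert a boundary after letter i wherever scores[i] exceeds
    the threshold."""
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        if i < len(scores) and scores[i] > threshold:
            out.append("-")
    return "".join(out)

# With the summed scores above:
# segment("clearly", [0.4, 1.2, 0.1, 0.4, 13.4]) -> "clear-ly"
# segment("early",   [4.6, 1.0, 0.1, 1.2, 2.0])  -> "early"
```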
18

3. Second part: Training and applying the classifier

Any word list can be stored in a trie (Fredkin 60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison 68).
Example: clearly, early, lately, clear, late

[Diagram: these five words stored in a trie, drawn right to left from the root; ¤ = end or beginning of word.]
19

3.1. PCT as a Classifier
[Diagram: the PCT before and after training. The analyses clear-ly, late-ly, early, clear, late are added as known information: each node stores counts of the analyses seen at or below it (e.g. ly=2 at the node shared by clearly and lately, ¤=1 at the ends of unsplit words). To analyse a new word such as amazing?ly or dear?ly, the known information is retrieved by walking the tree and the analysis of the deepest found node is applied: amazing-ly, dearly.]
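A minimal sketch of this classifier as a plain (non-compact) trie over reversed words, storing analysis counts at each node and applying the deepest found node (class and method names are mine):

```python
from collections import defaultdict

class Node:
    def __init__(self):
        self.children: dict[str, "Node"] = {}
        self.analyses: dict[str, int] = defaultdict(int)  # e.g. {"ly": 2}

class SuffixTrie:
    """Words are inserted back to front so that common suffixes share
    a path, mirroring the right-to-left tree on the slide."""
    def __init__(self):
        self.root = Node()

    def add(self, word: str, analysis: str) -> None:
        """Store known information, e.g. add("clearly", "ly") for the
        training analysis clear-ly, or add("early", "¤") for no split."""
        node = self.root
        for ch in reversed(word):
            node = node.children.setdefault(ch, Node())
            node.analyses[analysis] += 1

    def classify(self, word: str) -> str | None:
        """Walk as deep as possible and return the majority analysis
        at the deepest node reached that has stored information."""
        node, best = self.root, None
        for ch in reversed(word):
            if ch not in node.children:
                break
            node = node.children[ch]
            if node.analyses:
                best = max(node.analyses, key=node.analyses.get)
        return best
```

With the five training words above, classify("amazingly") stops at the node for -ly, where ly=2 outweighs the unsplit early, and returns "ly" (amazing-ly).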
20

4. Evaluation

Boundary measuring: each detected boundary can be correct or wrong (precision), and boundaries can be missed (recall).
The first evaluation is global LSV with the proposed improvements.
[Chart: F-measure of the global LSV variants F(lsv), F(lsv*fw), and F(lsv*fw*ib).]
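A minimal sketch of the boundary-level evaluation described above (the function name is mine):

```python
def boundary_prf(gold: set[int], found: set[int]) -> tuple[float, float, float]:
    """Precision, recall and F-measure over boundary positions,
    e.g. gold = {5} for 'clear-ly' (a boundary after the 5th letter)."""
    tp = len(gold & found)
    p = tp / len(found) if found else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```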
21

Evaluating LSV: Precision vs. Recall
[Chart: precision and recall for P(lsv), R(lsv), P(lsv*fw), R(lsv*fw), P(lsv*fw*ib), R(lsv*fw*ib).]
22

Evaluating LSV: F-measure
[Chart: F-measure over thresholds 0-24 for F(lsv), F(lsv*fw), F(lsv*fw*ib).]
23

Evaluating the combination: Precision vs. Recall
[Chart: precision and recall over thresholds 0-24 for P(lsv+trie), R(lsv+trie), P(lsv*fw+trie), R(lsv*fw+trie), P(lsv*fw*ib+trie), R(lsv*fw*ib+trie).]
24

Evaluating the combination: F-measure
[Chart: F-measure over thresholds 0-24 for F(lsv+trie), F(lsv*fw+trie), F(lsv*fw*ib+trie).]
25

Comparing the combination with global LSV
26

4.1. Results

German newspaper corpus with 35 million sentences
English newspaper corpus with 13 million sentences

t=5                   German    English
lsv precision         80.20     70.35
lsv recall            34.52     10.86
lsv F-measure         48.27     18.82
combined precision    68.77     52.87
combined recall       72.11     52.56
combined F-measure    70.40     55.09
27

4.2. Statistics

                             en lsv    en comb    tr lsv    tr comb    fi lsv    fi comb
corpus size                  13 million           1 million            4 million
number of word forms         167,377              582,923              1,636,336
analysed words               49,159    94,237     26,307    460,791    68,840    1,380,841
boundaries                   70,106    131,465    31,569    812,454    84,193    3,138,039
morpheme length              2.60      2.56       2.29      3.03       2.32      3.73
length of analysed words     8.97      8.91       9.75      10.62      11.94     13.34
length of unanalysed words   7.56      6.77       10.12     8.15       12.91     10.47
morphemes per word           2.43      2.40       2.20      2.76       2.22      3.27
28

4.3. Assessing the true error rate

Typical sample list of words counted as wrong according to CELEX:
  Tau-sende vs. Tausend-e
  senegales-isch-e vs. senegalesisch-e
  sensibelst-en vs. sens-ibel-sten
  separat-ist-isch-e vs. separ-at-istisch-e
  tris-t vs. trist
  triump-hal vs. triumph-al
  trock-en vs. trocken
  unueber-troff-en vs. un-uebertroffen
  trop-f-en vs. tropf-en
  trotz-t-en vs. trotz-ten
  ver-traeum-t-e vs. vertraeumt-e

Reasons:
  gender -e (in (Creutz & Lagus 05), for example, counted as correct)
  compounds (sometimes separated, sometimes not)
  the -t-en error
  with proper names, -isch is often not analysed
  connecting elements
29

4.4. Real example

Orien-tal, Orien-tal-ische, Orien-tal-ist, Orien-tal-ist-en, Orien-tal-ist-ik, Orien-tal-ist-in, Orient-ier-ung, Orient-ier-ungen, Orient-ier-ungs-hilf-e, Orient-ier-ungs-hilf-en, Orient-ier-ungs-los-igkeit, Orient-ier-ungs-punkt, Orient-ier-ungs-punkt-e, Orient-ier-ungs-stuf-e

Ver-trau-enskrise, Ver-trau-ensleute, Ver-trau-ens-mann, Ver-trau-ens-sache, Ver-trau-ensvorschuß, Ver-trau-ensvo-tum, Ver-trau-ens-würd-igkeit, Ver-traut-es, Ver-trieb-en, Ver-trieb-spartn-er, Ver-triebene, Ver-triebenenverbände, Ver-triebs-beleg-e
30

5. Further research

Examine quality on various language types
Improve the trie-based classifier
Possibly combine with other existing algorithms
Find out how to acquire the morphology of non-concatenative languages
Deeper analysis: find deletions, alternations, insertions, morpheme classes, etc.
31

6. References

(Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of Coling 2004, Geneva, Switzerland, 2004.
(Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003.
(Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Publications in Computer and Information Science, Report A81, Helsinki University of Technology, March 2005.
(Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998.
(Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
32

6. References II

(Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001.
(Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.
(Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955.
(Kazakov 97) Dimitar Kazakov. Unsupervised learning of naïve morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997.
(Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The Poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations, 2002.
(Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language-independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001: Machine Learning - Beyond Supervision, Seattle, WA, August 2001.
33

E. Gender -e vs. frequency -e

Gender -e: Schule 8.4, Devise 7.8, Sonne 4.5, Abendsonne 5.3, Abende 5.5, Liste 6.5

Other -e: andere 8.4, keine 6.8, rote 11.6, stolze 8.0, drehte 10.8, winzige 9.7, lustige 13.2, rufe 4.4, Dumme 12.6

Frequency -e: Affe 2.7, Junge 5.3, Knabe 4.6, Bursche 2.4, Backstage 3.0