Corpus-Based Computational Linguistics
Alon Itai, Department of Computer Science
Technion - Israel Institute of Technology, Haifa, Israel
Language is the communication medium between humans. If we want to allow computers to communicate with us, we have to teach them human language (a.k.a. "natural language").
Investigating natural language is a complex and interesting challenge.
It also has many immediate technological benefits, such as Internet search engines and voice interfaces with computers - for children and handicapped persons.
Introduction
Search Google for "dog":
- Dogpile Web Search Home Page
- dog.com - a dog's best friend
- Sausage Software - HotDog Web Editors - HotDog Professional Special: Make certain you check out our Dog Packs before making your purchase. (Similar pages)
- explodingdog
- Dog-Play: Great activities you can do with your dog / Activities for Dogs: Having Fun with Your Dog

Search Google for "dogs":
- I-Love-Dogs.com - Free Resources For Dog Lovers, including Free Dog Lovers…
- I-Love-Dogs.com - Free Dog Resources For Dog Lovers, including…
- Dogs in Canada: Canada's Top Obedience Dogs 2002
- Guide Dogs for the Blind: A nonprofit charitable organization, (800) 295-4050
- Dogs of the Dow - High dividend yield Dow stocks. In addition to the high dividend yield stocks of the Dogs of the Dow, you will also find a helpful set of long-term stock market charts, investment research…
Ambiguity: the Main Problem
- Morphological: בחור
- Lexical: bank
- Syntactic: I saw the man with the telescope
- Semantic: hot dogs for the blind
- Pragmatic: Can you pass the salt?
Methods for Understanding Language
- The traditional method: develop a theory, build tools…
- Corpus methods
Hebrew Computational Linguistics
- Most research is in English.
- Many other languages have developed tools.
- Why not in Hebrew?

Special Problems of Hebrew
- The writing system: technical problems (right to left), fonts, different writing systems, no standard.
- Complex morphology: "You have left me" = עזבתני - 4 morphemes in one word.
- Ambiguity: הקפה = ה + קפה / הקפה / הקפה; בחור = ב + ה + חור / בחור. Some words can have up to 13 different readings (שמונה).
Achievements
- Constructed an analyzer that reads Hebrew text with 96% accuracy.
- Constructing a database for automatically building a syntactic analyzer of Hebrew.
- Search engines.
Summary and Vision
- Bring research on Hebrew to the state of other languages.
- Learn from Hebrew about similar languages.
- Develop methods for handling any language.
- Develop mechanisms for understanding texts.
- Better understand how the human brain works - and pass the Turing test.
The Problem
Written Hebrew texts are ambiguous. The reasons:
- The vowels and gemination are omitted (e.g., קופה, הקופה).
- Small words are prepended: וכשתלך = ו + כש + תלך = "and when you will go".
- Hebrew morphology is complex.
The Structure of a Hebrew Word
Morphemes:
- the lexical lemma
- short words, such as determiners, prepositions, and conjunctions, prepended to the word
- suffixes for possessives and object clitics
Linguistic features:
- mark part-of-speech (POS), tense, person, etc.
Example: $QMTI שקמתי
- $iqmati - "my sycamore": noun sg + possessive-1sg
- $e-qamti - "that I got up": connective + verb 1sg past
- $e-qammati - "that my hay": connective + noun sg + possessive-1sg
Previous Work
POS and Morphological Disambiguation
Three Stages
1. Word stage - find the most probable reading of a word, regardless of its context.
2. Pair stage - correct the analysis of a word based on the analyses of its immediate neighbors.
3. Sentence stage - use a syntactic parser to rule out improbable analyses.
Combining all three stages yielded the best results.
The Word Stage
Give each word its most probable analysis.
How do we estimate the probability of each analysis?
- Estimate the probability of each analysis from a large analyzed corpus.
- A large enough corpus does not exist: since each word has many forms, the number of distinct word forms is so large that many of them won't appear even in a 10M-word corpus.
The Word Stage: the Similar Words Method
Following the "Similar Words" method (Levinger, Ornan, and Itai, 1995), estimate the probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in a large corpus.
Example: HQPH הקפה
- "the coffee": changing definite to indefinite gives QPH
- "encirclement": changing indefinite to definite gives HHQPH
- "her perimeter": changing feminine possessive to masculine possessive gives HQPW
Distribution: QPH = 180, HHQPH = 18, HQPW = 2.
Our Variation of the SW Method
To overcome sparseness, we assumed that the lemma and the other morphemes/linguistic features are statistically independent. Namely, P(the coffee) = P(the) × P(coffee).
Even though the assumption is not valid, the resultant ranking is correct.
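The independence assumption can be sketched as follows. This is a minimal illustration, not the actual system: the counts, event names, and 20,000-token corpus size are all hypothetical.

```python
# Sketch of the independence assumption: score an analysis as the product
# of the relative frequencies of its parts. All counts are hypothetical.
counts = {
    "lemma:coffee": 180,          # hypothetical corpus counts
    "lemma:encirclement": 18,
    "feat:definite": 5000,
    "feat:indefinite": 9000,
    "total": 20000,
}

def p(event):
    """Relative frequency of a single lemma or linguistic feature."""
    return counts[event] / counts["total"]

def analysis_score(*events):
    """P(e1, e2, ...) = P(e1) * P(e2) * ... under independence."""
    score = 1.0
    for e in events:
        score *= p(e)
    return score

# "the coffee" = definite + lemma "coffee"; "encirclement" = indefinite + lemma.
s1 = analysis_score("feat:definite", "lemma:coffee")
s2 = analysis_score("feat:indefinite", "lemma:encirclement")

# The products are not valid probabilities of the full word,
# but their ranking is what the disambiguator uses.
best = max([("the coffee", s1), ("encirclement", s2)], key=lambda x: x[1])
```

The point is that each factor is estimated from many occurrences of the individual morpheme or feature, so the product is available even when the full word form never appears in the corpus.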
Evaluation and Complexity
- Errors: 36% → 14.5%.
- Complexity of the algorithm: O(n), where n is the size of the corpus.
- Keeping a copy of the corpus as an inverted file reduces the complexity to linear in the number of distinct similar words.
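The inverted-file idea can be sketched as below: once each word form is mapped to its occurrence positions, counting a similar word is a single lookup instead of a scan of the whole corpus. The toy corpus is hypothetical.

```python
# Sketch of an inverted file over a tokenized corpus: each word form is
# mapped to the list of positions where it occurs.
from collections import defaultdict

def build_inverted_file(corpus_tokens):
    """Map each word form to its occurrence positions in the corpus."""
    index = defaultdict(list)
    for pos, token in enumerate(corpus_tokens):
        index[token].append(pos)
    return index

# Hypothetical toy corpus of transliterated word forms.
corpus = ["QPH", "HQPH", "QPH", "HQPW", "QPH"]
index = build_inverted_file(corpus)

# Counting occurrences of a similar word is now one dictionary lookup,
# independent of the corpus size.
count_qph = len(index["QPH"])
```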
The Pair Stage
- Following Brill, we learned correction rules from a corpus.
- The initial morphological score of an analysis is its probability as obtained at the word stage.
- Correction rules consider pairs of adjacent words, check whether the rule applies, and if so modify the scores.
Example of a correction rule:
If the POS of the current tag of w1 is a proper noun,
and the POS of the current tag of w2 is a noun,
and w2 has an analysis as a verb that matches w1 by gender and number,
then add 0.5 to the morphological score of w2 as a verb and normalize the scores.
Example: YWSP &DR יוסף עדר
- YWSP = proper noun, masc (Joseph)
- &DR = noun masc sg abs indef (herd), score = 0.7
- &DR = verb past 3sg masc (hoed), score = 0.3
The rule adds 0.5 to the verb reading: 0.3 + 0.5 = 0.8. After normalization, herd = 0.7/1.5 ≈ 0.467 and hoed = 0.8/1.5 ≈ 0.533, so the verb reading now wins.
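The apply-and-normalize step above can be sketched in a few lines. The analysis labels and data structure are illustrative, not the system's actual representation; the scores and the 0.5 correction factor follow the slide.

```python
# Sketch of applying a correction rule: boost one analysis by the rule's
# correction factor, then renormalize so the scores sum to 1 again.
analyses = {
    "noun (herd)": 0.7,
    "verb (hoed)": 0.3,
}

def apply_rule(analyses, target, factor):
    """Add `factor` to the score of `target`, then renormalize."""
    scores = dict(analyses)
    scores[target] += factor
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

corrected = apply_rule(analyses, "verb (hoed)", 0.5)
# herd: 0.7 / 1.5, verb: 0.8 / 1.5 -> the verb reading now ranks first.
```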
Learning the Rules from a Training Corpus
Input: a training corpus where each word is correctly analyzed.
- Run the word stage on the training corpus.
- Generate all possible rules.
- For each rule, set the correction factor to the minimum value that does more good than damage.
- Choose the rule that yields the maximum benefit.
- Repeat until no rule improves the overall analysis of the training corpus.
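The greedy loop above can be sketched as follows. This is a simplified stand-in, not the system's implementation: `evaluate` (which returns the error count of the corpus under a rule list), the candidate rules, and the toy error counts are all hypothetical, and the correction-factor search is omitted.

```python
# Greedy, Brill-style rule learning: repeatedly add the candidate rule
# that most reduces errors on the training corpus; stop when none helps.
def learn_rules(corpus, candidate_rules, evaluate):
    rules = []
    best_errors = evaluate(corpus, rules)
    while True:
        best_rule, best_gain = None, 0
        for rule in candidate_rules:
            if rule in rules:
                continue
            gain = best_errors - evaluate(corpus, rules + [rule])
            if gain > best_gain:          # rule does more good than damage
                best_rule, best_gain = rule, gain
        if best_rule is None:             # no rule improves the corpus
            return rules
        rules.append(best_rule)
        best_errors -= best_gain

# Toy illustration: rules "A" and "B" fix 2 and 1 errors; "C" fixes none.
def toy_evaluate(corpus, rules):
    fixes = {"A": 2, "B": 1, "C": 0}
    return 5 - sum(fixes[r] for r in rules)

learned = learn_rules(corpus=None,
                      candidate_rules=["A", "B", "C"],
                      evaluate=toy_evaluate)
```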
Evaluation and Complexity
- Training corpus: 4892 word tokens; learned 93 rules; errors: 14.5% → 6.2%.
- Complexity of the learning algorithm: O(c³), where c = size of the training corpus.
- Complexity of the correction: O(r·n), where r = number of rules and n = size of the text.
The Sentence Stage
- Use a syntactic parser to rule out improbable analyses.
- The pair stage handles adjacent words; the sentence stage handles long-range dependencies.

Example: מורה הכיתה הנמוכה נכנס לכיתה (MWRH HKITH HNMWKH NKNS LKITH) - "the teacher of the low class entered the classroom"
- more/mora ha-kitta ha-nmuka niknas… - the first word is ambiguous between masc and fem, but the verb is masc.
- The masculine verb selects the reading: more ha-kitta ha-nmuka niknas…
Score of a Syntax Tree
(parse tree of the example: S → NP VP, with leaves more, ha-kitta, ha-nmuka, niknas, la-kitta)
score(s) = score(more) · score(ha-kitta) · … · score(la-kitta)
The challenge: calculate the score of all syntax trees without enumerating all trees.
Dynamic Programming
Table[i, j, A] = the maximum score of all parses of w_i … w_j as category A.
Fill the table by increasing values of j − i:
- j = i: Table[i, i, A] = max { s(A → t) : t is an analysis of w_i and A → t ∈ G }
- j > i: Table[i, j, A] = max { Table[i, k, B] · Table[k+1, j, C] : A → B C ∈ G, i ≤ k < j }
Time complexity: O(|G| · n³)
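This dynamic program can be sketched as a weighted CKY parser. The grammar and lexicon formats below are illustrative assumptions (binary rules in Chomsky normal form, scores as dictionaries); the toy two-word grammar at the end is hypothetical.

```python
# Weighted-CKY sketch of the dynamic program: table[(i, j, A)] holds the
# maximum score of any parse of words[i..j] as category A.
def cky_best_score(words, lexicon, grammar, start="S"):
    n = len(words)
    table = {}
    # Base case (j = i): single words, scored by their lexical analyses.
    for i, w in enumerate(words):
        for cat, score in lexicon.get(w, {}).items():
            table[(i, i, cat)] = max(table.get((i, i, cat), 0.0), score)
    # Fill by increasing span length j - i.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            for (A, B, C), rule_score in grammar.items():
                for k in range(i, j):      # split point: i <= k < j
                    score = (rule_score
                             * table.get((i, k, B), 0.0)
                             * table.get((k + 1, j, C), 0.0))
                    if score > table.get((i, j, A), 0.0):
                        table[(i, j, A)] = score
    return table.get((0, n - 1, start), 0.0)

# Hypothetical toy grammar: S -> N V, over two words of the example.
lexicon = {"more": {"N": 1.0}, "niknas": {"V": 0.8}}
grammar = {("S", "N", "V"): 1.0}
best = cky_best_score(["more", "niknas"], lexicon, grammar)
```

Each cell combines two smaller cells, giving the O(|G|·n³) bound from the slide without ever enumerating whole trees.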
Evaluation
(bar chart: error rate after the Word Stage, Pair Stage, and Sentence Stage)
Language is the communication media between humans If we want to allow computers to communicate with us we have to teach them human language (aka ldquoNatural Languagerdquo)
Investigating natural language is a complex and interesting challange
It also has many immediate technological benefits such asInternet search enginesvoice interface with computersndash for children handicapped persons
IntroductionIntroduction
Search Google for ldquodogrdquoSearch Google for ldquodogrdquo DogpileDogpile Web Search Home Page Web Search Home Page
dogdogcomcom - a - a dogsdogs best friend best friend
Sausage Software - HotDog Web Editors - HotDog Sausage Software - HotDog Web Editors - HotDog Professional Professional Special Make certain you check out our Special Make certain you check out our DogDog Packs before making yourPacks before making yourpurchase purchase דפים דומיםדפים דומים
explodingdog explodingdog DogDog-Play Great activities you can do with your -Play Great activities you can do with your
dogdogActivities for Dogs Having Fun with Your Activities for Dogs Having Fun with Your DogDog
Search Google for ldquodogsrdquoSearch Google for ldquodogsrdquo I-Love-I-Love-DogsDogscomcom - Free Resources For Dog Lovers - Free Resources For Dog Lovers
including Free Dog Loversincluding Free Dog LoversI-Love-I-Love-DogsDogscom - Free Dog Resources For Dog com - Free Dog Resources For Dog Lovers includinghellipLovers includinghellip
DogsDogs in Canada in CanadaCanadas Top Obedience Canadas Top Obedience DogsDogs 2002 2002
Guide Guide DogsDogs for the Blind for the BlindGuide Guide DogsDogs for the Blind A nonprofit charitable for the Blind A nonprofit charitable organization (800) 295-4050organization (800) 295-4050
DogsDogs of the Dow - High dividend yield Dow stocks - www of the Dow - High dividend yield Dow stocks - www In addition to the high dividend yield stocks of the In addition to the high dividend yield stocks of the DogsDogs of the Dow you will also find a helpful set of of the Dow you will also find a helpful set of long-term stock market charts investment research long-term stock market charts investment research
Ambiguitymdashthe Main ProblemAmbiguitymdashthe Main Problem
Morphological Morphological בחורבחור LexicalLexical bankbank SyntacticalSyntactical
I saw the man with the telescopeI saw the man with the telescope SemanticSemantic hot dogs for the blindhot dogs for the blind Pragmatic Pragmatic Can you pass the saltCan you pass the salt
Methods for understanding Methods for understanding LanguageLanguage
The traditional methodThe traditional method
Develop a theory build toolshellipDevelop a theory build toolshellip
Corpus methodsCorpus methods
Hebrew Computational LinguisticsHebrew Computational Linguistics
Most research is in EnglishMost research is in English
Many other languages have developed toolsMany other languages have developed tools
Why not in HebrewWhy not in Hebrew
Special problems of HebrewSpecial problems of Hebrew
The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard
Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word
AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה
בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different
readings (readings (שמונהשמונה))
AchievementsAchievements
Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct
Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew
Search enginesSearch engines
סיכום וחזוןסיכום וחזון
נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות
נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Search Google for ldquodogrdquoSearch Google for ldquodogrdquo DogpileDogpile Web Search Home Page Web Search Home Page
dogdogcomcom - a - a dogsdogs best friend best friend
Sausage Software - HotDog Web Editors - HotDog Sausage Software - HotDog Web Editors - HotDog Professional Professional Special Make certain you check out our Special Make certain you check out our DogDog Packs before making yourPacks before making yourpurchase purchase דפים דומיםדפים דומים
explodingdog explodingdog DogDog-Play Great activities you can do with your -Play Great activities you can do with your
dogdogActivities for Dogs Having Fun with Your Activities for Dogs Having Fun with Your DogDog
Search Google for ldquodogsrdquoSearch Google for ldquodogsrdquo I-Love-I-Love-DogsDogscomcom - Free Resources For Dog Lovers - Free Resources For Dog Lovers
including Free Dog Loversincluding Free Dog LoversI-Love-I-Love-DogsDogscom - Free Dog Resources For Dog com - Free Dog Resources For Dog Lovers includinghellipLovers includinghellip
DogsDogs in Canada in CanadaCanadas Top Obedience Canadas Top Obedience DogsDogs 2002 2002
Guide Guide DogsDogs for the Blind for the BlindGuide Guide DogsDogs for the Blind A nonprofit charitable for the Blind A nonprofit charitable organization (800) 295-4050organization (800) 295-4050
DogsDogs of the Dow - High dividend yield Dow stocks - www of the Dow - High dividend yield Dow stocks - www In addition to the high dividend yield stocks of the In addition to the high dividend yield stocks of the DogsDogs of the Dow you will also find a helpful set of of the Dow you will also find a helpful set of long-term stock market charts investment research long-term stock market charts investment research
Ambiguitymdashthe Main ProblemAmbiguitymdashthe Main Problem
Morphological Morphological בחורבחור LexicalLexical bankbank SyntacticalSyntactical
I saw the man with the telescopeI saw the man with the telescope SemanticSemantic hot dogs for the blindhot dogs for the blind Pragmatic Pragmatic Can you pass the saltCan you pass the salt
Methods for understanding Methods for understanding LanguageLanguage
The traditional methodThe traditional method
Develop a theory build toolshellipDevelop a theory build toolshellip
Corpus methodsCorpus methods
Hebrew Computational LinguisticsHebrew Computational Linguistics
Most research is in EnglishMost research is in English
Many other languages have developed toolsMany other languages have developed tools
Why not in HebrewWhy not in Hebrew
Special problems of HebrewSpecial problems of Hebrew
The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard
Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word
AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה
בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different
readings (readings (שמונהשמונה))
AchievementsAchievements
Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct
Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew
Search enginesSearch engines
סיכום וחזוןסיכום וחזון
נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות
נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous work
POS and morphological disambiguation
Three stages:
1. Word stage – find the most probable reading of a word, regardless of its context.
2. Pair stage – correct the analysis of a word based on the analyses of its immediate neighbors.
3. Sentence stage – use a syntactic parser to rule out improbable analyses.
Combining all three stages yielded the best results
The Word Stage
Give each word its most probable analysis.
How can we estimate the probability of each analysis?
- Estimate the probability of each analysis from a large analyzed corpus.
- But a large enough corpus does not exist: since each word has many forms, the number of word forms is so large that many of them won't appear even in a 10M-word corpus.
The Word Stage
Following the "Similar Words Method" (Levinger, Ornan and Itai, 1995): estimate the probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in a large corpus.
Example: HQPH הקפה
- "the coffee": definite to indefinite gives QPH
- "encirclement": indefinite to definite gives HHQPH
- "her perimeter": feminine possessive to masculine possessive gives HQPW
Distribution: QPH = 180, HHQPH = 18, HQPW = 2
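The counting step above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the names (`similar_word_counts`, `estimate_probabilities`) are invented, while the counts are the slide's HQPH example.

```python
# Similar Words sketch: each analysis of an ambiguous word is mapped to an
# unambiguous "similar word" by flipping one feature; the corpus counts of
# those words estimate the analysis probabilities.

similar_word_counts = {
    "the coffee (QPH)": 180,     # definite flipped to indefinite
    "encirclement (HHQPH)": 18,  # indefinite flipped to definite
    "her perimeter (HQPW)": 2,   # fem. possessive flipped to masc.
}

def estimate_probabilities(counts):
    """Normalize the similar-word counts into one probability per analysis."""
    total = sum(counts.values())
    return {analysis: n / total for analysis, n in counts.items()}

probs = estimate_probabilities(similar_word_counts)
print(probs["the coffee (QPH)"])  # 180 / 200 = 0.9
```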
Our variation of the SW method
To overcome sparseness, we assumed that the lemma and the other morphemes/linguistic features are statistically independent; namely, P(the coffee) = P(the) × P(coffee).
Even though the assumption is not valid, the resultant ranking is correct.
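The independence assumption can be sketched as a one-liner: score an analysis as the product of independently estimated component probabilities. All probabilities below are made-up numbers, purely for illustration.

```python
def analysis_score(component_probs):
    """P(analysis) approximated as the product of component marginals."""
    score = 1.0
    for p in component_probs.values():
        score *= p
    return score

# P(the coffee) = P(the) * P(coffee), per the slide's example.
p_the_coffee = analysis_score({"the": 0.4, "coffee": 0.001})
p_her_perimeter = analysis_score({"her": 0.05, "perimeter": 0.0002})
# The absolute values may be wrong, but the ranking tends to survive.
print(p_the_coffee > p_her_perimeter)  # True for these numbers
```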
Evaluation and Complexity
Errors: reduced from 36% to 14.5%.
Complexity of the algorithm: O(n), where n is the size of the corpus.
Keeping a copy of the corpus as an inverse file reduces the complexity to linear in the number of distinct similar words.
The pair stage
Following Brill, we learned correction rules from a corpus.
The initial morphological score of an analysis is its probability as obtained at the word stage.
Correction rules modify the scores: for each pair of adjacent words, check whether the rule applies and, if so, modify the scores.
Example of a correction rule
If the POS of the current tag of w1 is a proper noun,
and the POS of the current tag of w2 is a noun,
and w2 has an analysis as a verb that matches w1 in gender and number,
then add 0.5 to the morphological score of w2 as a verb and normalize the scores.
Example
YWSP &DR יוסף עדר
YWSP = proper noun, masc (Joseph)
&DR = noun, masc, sg, abs, indef (herd), score = 0.7
&DR = verb, past, 3sg, masc (hoed), score = 0.3
The rule adds 0.5 to the verb reading (0.3 + 0.5 = 0.8); after normalization, the noun scores 0.7/1.5 = 0.467 and the verb 0.8/1.5 = 0.533.
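The add-and-normalize update can be written as a small helper. The dictionary keys are invented labels; the numbers are the slide's: the verb reading gets a 0.5 bonus and the scores are renormalized.

```python
def apply_correction(scores, analysis, bonus):
    """Add `bonus` to one analysis's score, then renormalize to sum to 1."""
    scores = dict(scores)  # don't mutate the caller's dict
    scores[analysis] += bonus
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

scores = {"noun (herd)": 0.7, "verb (hoed)": 0.3}
updated = apply_correction(scores, "verb (hoed)", 0.5)
# 0.3 + 0.5 = 0.8; after normalizing by 1.5: noun ~0.467, verb ~0.533.
print(round(updated["verb (hoed)"], 3))  # 0.533
```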
Learning the Rules from a training corpus
Input: a training corpus where each word is correctly analyzed.
- Run the word stage on the training corpus.
- Generate all possible rules.
- For each rule, set the correction factor to the minimum value that does more good than damage.
- Choose the rule that yields the maximum benefit.
- Repeat until no rule improves the overall analysis of the training corpus.
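The loop above is the classic Brill-style greedy procedure, sketched here abstractly. The corpus representation, the candidate rules, `benefit` (net good minus damage on the training corpus) and `apply_rule` are placeholders invented for this sketch, not the paper's actual data structures.

```python
def learn_rules(corpus, candidate_rules, benefit, apply_rule, max_rules=100):
    """Repeatedly pick the rule with the largest net benefit and apply it,
    stopping when no rule improves the training corpus any further."""
    learned = []
    while len(learned) < max_rules:
        best = max(candidate_rules, key=lambda rule: benefit(rule, corpus))
        if benefit(best, corpus) <= 0:
            break  # no remaining rule does more good than damage
        learned.append(best)
        corpus = apply_rule(best, corpus)  # rules apply cumulatively
    return learned

# Toy demonstration: the "corpus" is just an error count, a rule is the
# number of errors it can fix, and benefit = errors it would fix now.
picked = learn_rules(3, [1, 2],
                     lambda r, c: min(r, c),
                     lambda r, c: c - min(r, c))
print(picked)  # greedily picks the bigger rule first: [2, 1]
```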
Evaluation and Complexity
Training corpus: 4892 word tokens; learned 93 rules; errors reduced from 14.5% to 6.2%.
Complexity of the learning algorithm: O(c³), where c = size of the training corpus.
Complexity of the correction: O(r·n), where r = number of rules and n = size of the trial text.
The sentence stage
Use a syntactic parser to rule out improbable analyses.
The pair stage captures adjacent words; the sentence stage captures long-range dependencies.
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax tree
[Parse tree for the sentence: S → NP VP, with terminals more ha-kitta ha-nmuka niknas la-kitta]
score(s) = score(more) × score(ha-kitta) × … × score(la-kitta)
The challenge: calculate the score of all syntax trees without enumerating all trees.
Dynamic Programming
Table[i, j, A] = the maximum score of all parses of w_i … w_j as A.
Fill the table by increasing values of j − i ≥ 0:
- Base case: Table[i, i, A] = max { score(t) : A → t ∈ G, t an analysis of w_i }
- Recurrence: Table[i, j, A] = max { Table[i, k, B] · Table[k+1, j, C] : A → B C ∈ G, i ≤ k < j }
- An entry is 0 when no parse exists.
Time complexity: O(|G| · n³).
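The table above can be realized as a CKY-style dynamic program. This is a sketch for a toy grammar in Chomsky normal form: `lexical` maps (A, word) to that analysis's word-stage score, `binary` is the set of A → B C rules, and both are invented for illustration; per the slide, the score of a tree is the product of its word scores.

```python
from collections import defaultdict

def best_parse_score(words, lexical, binary, start="S"):
    n = len(words)
    table = defaultdict(float)  # missing entries are 0: no parse
    # Base case: Table[i, i, A] = best score of an analysis of w_i under A.
    for i, w in enumerate(words):
        for (A, word), score in lexical.items():
            if word == w:
                table[(i, i, A)] = max(table[(i, i, A)], score)
    # Fill by increasing span length j - i.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            for (A, B, C) in binary:
                for k in range(i, j):  # Table[i,k,B] * Table[k+1,j,C]
                    cand = table[(i, k, B)] * table[(k + 1, j, C)]
                    table[(i, j, A)] = max(table[(i, j, A)], cand)
    return table[(0, n - 1, start)]

# Toy run: "more niknas" parsed as S -> N V, word scores 0.6 and 0.5.
lex = {("N", "more"): 0.6, ("V", "niknas"): 0.5}
print(best_parse_score(["more", "niknas"], lex, {("S", "N", "V")}))  # 0.3
```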
Evaluation
[Chart: error rate after the Word, Pair and Sentence stages]
Dogs of the Dow - High dividend yield Dow stocks - www In addition to the high dividend yield stocks of the Dogs of the Dow, you will also find a helpful set of long-term stock market charts, investment research
Ambiguity – the Main Problem
- Morphological: בחור
- Lexical: bank
- Syntactic: I saw the man with the telescope
- Semantic: hot dogs for the blind
- Pragmatic: Can you pass the salt?
Methods for understanding Language
- The traditional method: develop a theory, build tools…
- Corpus methods
Hebrew Computational Linguistics
- Most research is in English.
- Many other languages have developed tools.
- Why not in Hebrew?
Special problems of Hebrew
- The writing system: technical problems (right to left), fonts, different writing systems, no standard.
- Complex morphology: "You have left me" = עזבתני (4 morphemes in one word).
- Ambiguity: הקפה = ה+קפה, הקפה, הקפה; בחור = ב+ה+חור, בחור. Some words can have up to 13 different readings (שמונה).
Achievements
- Constructed an analyzer that reads Hebrew text with 96% accuracy.
- Constructing a database for automatically building a syntactic analyzer of Hebrew.
- Search engines.
Summary and Vision
- Bring research on Hebrew to the level achieved for other languages.
- Learn from Hebrew about similar languages.
- Develop methods for handling any language.
- Develop mechanisms for understanding texts.
- Better understand how the human mind works.
- Pass the Turing test.
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Ambiguitymdashthe Main ProblemAmbiguitymdashthe Main Problem
Morphological Morphological בחורבחור LexicalLexical bankbank SyntacticalSyntactical
I saw the man with the telescopeI saw the man with the telescope SemanticSemantic hot dogs for the blindhot dogs for the blind Pragmatic Pragmatic Can you pass the saltCan you pass the salt
Methods for understanding Methods for understanding LanguageLanguage
The traditional methodThe traditional method
Develop a theory build toolshellipDevelop a theory build toolshellip
Corpus methodsCorpus methods
Hebrew Computational LinguisticsHebrew Computational Linguistics
Most research is in EnglishMost research is in English
Many other languages have developed toolsMany other languages have developed tools
Why not in HebrewWhy not in Hebrew
Special problems of HebrewSpecial problems of Hebrew
The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard
Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word
AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה
בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different
readings (readings (שמונהשמונה))
AchievementsAchievements
Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct
Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew
Search enginesSearch engines
סיכום וחזוןסיכום וחזון
נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות
נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Methods for understanding Methods for understanding LanguageLanguage
The traditional methodThe traditional method
Develop a theory build toolshellipDevelop a theory build toolshellip
Corpus methodsCorpus methods
Hebrew Computational LinguisticsHebrew Computational Linguistics
Most research is in EnglishMost research is in English
Many other languages have developed toolsMany other languages have developed tools
Why not in HebrewWhy not in Hebrew
Special problems of HebrewSpecial problems of Hebrew
The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard
Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word
AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה
בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different
readings (readings (שמונהשמונה))
AchievementsAchievements
Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct
Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew
Search enginesSearch engines
סיכום וחזוןסיכום וחזון
נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות
נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew word
Morphemes:
the lexical lemma
short words, such as determiners, prepositions, and conjunctions, prepended to the word
suffixes for possessives and object clitics
Linguistic features mark part-of-speech (POS), tense, person, etc.
Example
$QMTI שקמתי
$iqmati – "my sycamore": noun sg + possessive-1sg
$e-qamti – "that I got up": connective + verb 1sg past
$e-qammati – "that my hay": connective + noun sg + possessive-1sg
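The competing analyses above can be represented as structured records. A minimal sketch, assuming a record layout of my own devising (the field names and the transliterated lemmas are illustrative, not the authors' actual data structures):

```python
from dataclasses import dataclass, field

@dataclass
class Analysis:
    lemma: str                  # lexical lemma (transliterated)
    prefixes: tuple = ()        # prepended short words (connective, preposition, ...)
    suffix: str = ""            # possessive / object clitic, if any
    features: dict = field(default_factory=dict)  # POS, tense, person, number, ...

# The three readings of the surface form $QMTI:
analyses = [
    Analysis("$QMH", (), "poss-1sg", {"pos": "noun", "num": "sg"}),                   # my sycamore
    Analysis("QWM", ("$e-",), "", {"pos": "verb", "person": "1sg", "tense": "past"}), # that I got up
    Analysis("QMH", ("$e-",), "poss-1sg", {"pos": "noun", "num": "sg"}),              # that my hay
]
```

A disambiguator then only has to assign a score to each record and pick the maximum.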
Previous work
POS and morphological disambiguation
Three stages:
1. Word stage – find the most probable reading of a word regardless of its context
2. Pair stage – correct the analysis of a word based on the analysis of its immediate neighbors
3. Sentence stage – use a syntactic parser to rule out improbable analyses
Combining all three stages yielded the best results
The Word Stage: give each word its most probable analysis
How can we estimate the probability of each analysis?
Estimate the probability of each analysis from a large analyzed corpus.
A large enough corpus does not exist:
since each word has many forms, the number of distinct word forms is so large that many of them won't appear even in a 10M-word corpus.
The Word Stage: following the "Similar Words Method" (Levinger, Ornan, and Itai 1995), estimate the probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in a large corpus.
Example: HQPH הקפה
"the coffee": definite to indefinite gives QPH
"encirclement": indefinite to definite gives HHQPH
"her perimeter": feminine possessive to masculine possessive gives HQPW
Distribution: QPH = 180, HHQPH = 18, HQPW = 2
Our variation of the SW method
To overcome sparseness, we assumed that the lemma and the other morphemes/linguistic features are statistically independent.
Namely: P(the coffee) = P(the) · P(coffee)
Even though the assumption is not valid, the resultant ranking is correct.
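Under this independence assumption, ranking the analyses of an ambiguous form reduces to multiplying per-morpheme relative frequencies. A minimal sketch; the corpus counts below are made up for illustration, not taken from the paper:

```python
from collections import Counter

# Illustrative frequency counts gathered from a hypothetical corpus.
lemma_counts = Counter({"coffee": 180, "encirclement": 18, "perimeter": 2})
det_counts = Counter({"definite": 120, "indefinite": 80})

def p_analysis(lemma, definiteness):
    """P(analysis) ~ P(lemma) * P(definiteness), assuming independence."""
    p_lemma = lemma_counts[lemma] / sum(lemma_counts.values())
    p_det = det_counts[definiteness] / sum(det_counts.values())
    return p_lemma * p_det

# Rank the three readings of HQPH under the independence assumption.
readings = [("coffee", "definite"), ("encirclement", "indefinite"),
            ("perimeter", "definite")]
ranked = sorted(readings, key=lambda r: p_analysis(*r), reverse=True)
# "the coffee" comes out first, consistent with the dominant count QPH = 180.
```

Even when the estimated probabilities themselves are off, the relative ordering of analyses is what the word stage actually uses.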
Evaluation and Complexity
Errors: 36% → 14.5%
Complexity of the algorithm: O(n), where n is the size of the corpus.
Keeping a copy of the corpus as an inverse file reduces the complexity to linear in the number of distinct similar words.
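The inverse-file idea can be sketched as a one-pass frequency index over the corpus: building it costs O(n) once, after which each similar-word lookup is O(1), so a query touches only the handful of generated variants. A hypothetical illustration; the actual file format is not specified here:

```python
from collections import Counter

def build_index(corpus_tokens):
    """One linear pass over the corpus: surface form -> frequency."""
    return Counter(corpus_tokens)

def similar_word_counts(variants, index):
    """O(1) lookup per generated variant instead of a full corpus scan."""
    return {w: index[w] for w in variants}

# Toy corpus containing the HQPH variants from the example above.
index = build_index(["QPH", "QPH", "HHQPH", "QPH", "HQPW"])
counts = similar_word_counts(["QPH", "HHQPH", "HQPW"], index)
```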
The pair stage
Following Brill, we learned correction rules from a corpus.
The initial morphological score of an analysis is its probability as obtained at the word stage.
Correction rules modify the scores by considering pairs of adjacent words, checking whether a rule applies and, if so, modifying the scores.
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
Example
YWSP &DR יוסף עדר
YWSP = proper noun, masc (Joseph)
&DR = noun, masc sg abs indef (herd), score = 0.7
&DR = verb, past 3sg masc (hoed), score = 0.3
The rule adds 0.5 to the verb analysis: 0.3 + 0.5 = 0.8.
Normalization: noun 0.7/1.5 = 0.467, verb 0.8/1.5 = 0.533.
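The bonus-and-normalize step of the worked example can be sketched directly; the function name is mine, but the 0.5 bonus and the initial 0.7/0.3 scores follow the example above:

```python
def apply_bonus(scores, analysis, bonus=0.5):
    """Add a correction bonus to one analysis, then renormalize to sum to 1."""
    scores = dict(scores)            # don't mutate the caller's dict
    scores[analysis] += bonus
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

scores = {"noun": 0.7, "verb": 0.3}
scores = apply_bonus(scores, "verb")
# verb: (0.3 + 0.5) / 1.5 = 0.533..., noun: 0.7 / 1.5 = 0.466...
```

Renormalizing keeps the scores interpretable as a probability distribution over the word's analyses, so later rules can apply the same update uniformly.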
Learning the rules from a training corpus
Input: a training corpus where each word is correctly analyzed.
Run the word stage on the training corpus.
Generate all possible rules.
For each rule, set the correction factor to the minimum value that does more good than damage.
Choose the rule that yields the maximum benefit.
Repeat until no rule improves the overall analyses of the training corpus.
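The loop above is a greedy, Brill-style selection; a sketch under the assumption that a helper `errors_after` (hypothetical, not from the paper) counts the remaining errors on the training corpus after applying a given rule sequence:

```python
def learn_rules(candidate_rules, errors_after):
    """Greedily add the rule that reduces the error count the most;
    stop when no candidate rule helps any further."""
    learned = []
    best_errors = errors_after(learned)
    while True:
        best_rule, best_gain = None, 0
        for rule in candidate_rules:
            gain = best_errors - errors_after(learned + [rule])
            if gain > best_gain:         # the rule must do more good than damage
                best_rule, best_gain = rule, gain
        if best_rule is None:
            return learned
        learned.append(best_rule)
        best_errors -= best_gain
```

Each iteration re-scores every candidate against the current rule sequence, which is what makes the learning cost cubic in the corpus size.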
Evaluation and Complexity
Training corpus: 4,892 word tokens; learned 93 rules; errors: 14.5% → 6.2%
Complexity of the learning algorithm: O(c³), where c is the size of the training corpus.
Complexity of the correction: O(r·n), where r is the number of rules and n is the size of the trial text.
The sentence stage
Use a syntactic parser to rule out improbable analyses.
The pair stage handles adjacent words; the sentence stage handles long-range dependencies.
Example
מורה הכיתה הנמוכה נכנס לכיתה
MWRH HKITH HNMWKH NKNS LKITH
more/mora ha-kitta ha-nmuka niknas …
MWRH is ambiguous between masculine (more) and feminine (mora), but the verb niknas is masculine, so the parser selects:
more ha-kitta ha-nmuka niknas …
Score of a syntax tree
[parse tree for "more ha-kitta ha-nmuka niknas la-kitta": S expands to NP and VP, with leaves labeled N, COMPN, COMP, V, PREP, NN]
score(s) = score(more) · score(ha-kitta) · … · score(la-kitta)
The challenge: calculate the score of all syntax trees without enumerating all trees.
Dynamic Programming
Table[i, j, A] = the maximum score of all parses of the words w_i … w_j rooted at the nonterminal A.
Base case (single words): Table[i, i, A] = max { score(A → t) : (A → t) ∈ G, t an analysis of w_i }, and 0 if no such rule exists.
Fill the table by increasing values of j − i:
Table[i, j, A] = max { Table[i, k, B] · Table[k+1, j, C] : (A → B C) ∈ G, i ≤ k < j }
Time complexity: O(|G| · n³)
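The recurrence above is a max-product variant of CKY parsing. A minimal sketch, using half-open spans [i, j) and an illustrative grammar encoding (dicts of scored lexical and binary rules, which is my representation, not the paper's):

```python
def cky_max(words, lexical, binary):
    """Max-product CKY: best score of an S covering the whole sentence.
    lexical: {(A, word): score}; binary: {(A, B, C): score}."""
    n = len(words)
    table = {}                                    # (i, j, A) -> best score
    for i, w in enumerate(words):                 # base case: spans of length 1
        for (A, word), s in lexical.items():
            if word == w:
                key = (i, i + 1, A)
                table[key] = max(table.get(key, 0.0), s)
    for span in range(2, n + 1):                  # fill by increasing span length
        for i in range(n - span + 1):
            j = i + span
            for (A, B, C), s in binary.items():
                for k in range(i + 1, j):         # all split points
                    sc = s * table.get((i, k, B), 0.0) * table.get((k, j, C), 0.0)
                    if sc > table.get((i, j, A), 0.0):
                        table[(i, j, A)] = sc
    return table.get((0, n, "S"), 0.0)

# Toy grammar: S -> NP VP over a two-word sentence.
score = cky_max(["more", "niknas"],
                {("NP", "more"): 0.6, ("VP", "niknas"): 0.9},
                {("S", "NP", "VP"): 1.0})
```

The triple loop over spans, split points, and rules gives exactly the O(|G| · n³) bound stated above.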
Evaluation
[bar chart: error rates after the Word, Pair, and Sentence stages]
Hebrew Computational Linguistics
Most research is in English.
Many other languages have developed tools.
Why not in Hebrew?
Special problems of HebrewSpecial problems of Hebrew
The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard
Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word
AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה
בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different
readings (readings (שמונהשמונה))
AchievementsAchievements
Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct
Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew
Search enginesSearch engines
סיכום וחזוןסיכום וחזון
נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות
נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Special problems of HebrewSpecial problems of Hebrew
The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard
Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word
AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה
בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different
readings (readings (שמונהשמונה))
AchievementsAchievements
Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct
Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew
Search enginesSearch engines
סיכום וחזוןסיכום וחזון
נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות
נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
AchievementsAchievements
Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct
Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew
Search enginesSearch engines
סיכום וחזוןסיכום וחזון
נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות
נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word Stage: the "Similar Words" method

Following the Similar Words Method (Levinger, Ornan and Itai, 1995), estimate the probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in a large corpus.

Example: HQPH הקפה
"the coffee": definite → indefinite, QPH
"encirclement": indefinite → definite, HHQPH
"her perimeter": feminine possessive → masculine possessive, HQPW
Distribution: QPH = 180, HHQPH = 18, HQPW = 2
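Turning the similar-word counts into probabilities is a matter of normalization; a sketch in Python (the analysis labels and the count table come from the HQPH example, the function name is ours):

```python
def sw_probabilities(analyses, corpus_counts):
    """Similar Words method (sketch): estimate the probability of each
    analysis of an ambiguous word from the corpus frequencies of the
    unambiguous "similar words" obtained by toggling one feature.

    analyses      : {analysis: similar-word surface form}
    corpus_counts : {surface form: corpus frequency}
    """
    counts = {a: corpus_counts.get(w, 0) for a, w in analyses.items()}
    total = sum(counts.values())
    if total == 0:  # nothing observed: fall back to a uniform guess
        return {a: 1.0 / len(counts) for a in counts}
    return {a: c / total for a, c in counts.items()}

# The HQPH example: toggle one feature to obtain an unambiguous word.
analyses = {
    "the coffee (def noun)": "QPH",        # definite -> indefinite
    "encirclement (indef noun)": "HHQPH",  # indefinite -> definite
    "her perimeter (fem poss)": "HQPW",    # fem -> masc possessive
}
corpus_counts = {"QPH": 180, "HHQPH": 18, "HQPW": 2}
probs = sw_probabilities(analyses, corpus_counts)
```

With these counts the "coffee" reading gets probability 180/200 = 0.9.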
Our variation of the SW method

To overcome sparseness, we assumed that the lemma and the other morphemes/linguistic features are statistically independent; namely, P(the coffee) = P(the) · P(coffee).
Even though the assumption is not valid, the resultant ranking is correct.
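A sketch of the factored estimate (the counts below are invented marginal frequencies, purely illustrative):

```python
def factored_prob(lemma_counts, feature_counts, lemma, feature):
    """P(analysis) ~ P(lemma) * P(feature), under the assumption that
    the lemma and the linguistic feature are statistically independent
    (counts are marginal corpus frequencies)."""
    p_lemma = lemma_counts[lemma] / sum(lemma_counts.values())
    p_feature = feature_counts[feature] / sum(feature_counts.values())
    return p_lemma * p_feature

# Invented marginal counts, purely illustrative:
lemma_counts = {"coffee": 60, "encirclement": 40}
feature_counts = {"definite": 30, "indefinite": 70}
p = factored_prob(lemma_counts, feature_counts, "coffee", "definite")
```

The marginals are far less sparse than the joint counts, which is the point of the assumption.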
Evaluation and Complexity

Errors: 36% → 14.5%.
Complexity of the algorithm: O(n), where n is the size of the corpus.
Keeping a copy of the corpus as an inverse file reduces the complexity to linear in the number of distinct similar words.
The pair stage

Following Brill, we learned correction rules from a corpus.
The initial morphological score of an analysis is its probability as obtained at the word stage.
Correction rules modify the scores: for each pair of adjacent words, check whether the rule applies, and if so modify the scores.
Example of a correction rule

If the POS of the current tag of w1 is a proper noun,
and the POS of the current tag of w2 is a noun,
and w2 has an analysis as a verb that matches w1 by gender and number,
then add 0.5 to the morphological score of w2 as a verb, and normalize the scores.
Example

YWSP &DR יוסף עדר
YWSP = proper noun, masc (Joseph)
&DR = noun masc sg abs indef (herd), score = 0.7
&DR = verb past 3sg masc (hoed), score = 0.3
The rule adds 0.5 to the verb score: 0.3 + 0.5 = 0.8; after normalization, noun = 0.7/1.5 = 0.467 and verb = 0.8/1.5 = 0.533.
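The rule and its normalization arithmetic can be sketched in code, reproducing the YWSP &DR numbers (the gender/number agreement check is omitted for brevity, and the function name is ours):

```python
def apply_proper_noun_verb_rule(w1_pos, w2_scores, bonus=0.5):
    """If w1 is a proper noun and w2 has both a noun and a verb
    analysis, add `bonus` to w2's verb score and renormalize so the
    scores again sum to 1 (sketch; agreement checks omitted)."""
    if w1_pos == "proper-noun" and "verb" in w2_scores and "noun" in w2_scores:
        w2_scores = dict(w2_scores)        # don't mutate the caller's dict
        w2_scores["verb"] += bonus
        total = sum(w2_scores.values())
        w2_scores = {a: s / total for a, s in w2_scores.items()}
    return w2_scores

# YWSP (proper noun) followed by &DR: noun 0.7, verb 0.3
scores = apply_proper_noun_verb_rule("proper-noun", {"noun": 0.7, "verb": 0.3})
# noun: 0.7/1.5 ≈ 0.467, verb: 0.8/1.5 ≈ 0.533
```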
Learning the Rules from a training corpus

Input: a training corpus where each word is correctly analyzed.
Run the word stage on the training corpus.
Generate all possible rules.
For each rule, set the correction factor to the minimum value that does more good than damage.
Choose the rule that yields the maximum benefit.
Repeat until no rule improves the overall analysis of the training corpus.
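The loop above is greedy, error-driven selection in the style of Brill; a toy sketch (the rule representation and the `evaluate` oracle are placeholders, not the actual system):

```python
def learn_rules(candidate_rules, evaluate, max_rules=100):
    """Greedy error-driven rule selection (Brill-style sketch).

    evaluate(rules) -> number of errors on the training corpus when
    the given rule list is applied after the word stage.
    """
    chosen = []
    best_err = evaluate(chosen)
    while len(chosen) < max_rules:
        # Score every remaining candidate by how much it helps now.
        gains = [(best_err - evaluate(chosen + [r]), r)
                 for r in candidate_rules if r not in chosen]
        if not gains:
            break
        gain, best_rule = max(gains, key=lambda g: g[0])
        if gain <= 0:  # no rule improves the corpus any further
            break
        chosen.append(best_rule)
        best_err -= gain
    return chosen
```

Re-evaluating every candidate after each pick is what makes the learning cubic in the corpus size.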
Evaluation and Complexity

Training corpus: 4892 word tokens; learned 93 rules; errors: 14.5% → 6.2%.
Complexity of the learning algorithm: O(c³), where c = size of the training corpus.
Complexity of the correction: O(r·n), where r = number of rules and n = size of the trial text.
The sentence stage

Use a syntactic parser to rule out improbable analyses.
The pair stage handles adjacent words; the sentence stage handles long-distance dependencies.

Example

מורה הכיתה הנמוכה נכנס לכיתה
MWRH HKITH HNMWKH NKNS LKITH
more/mora ha-kitta ha-nmuka niknas … (more = masc, mora = fem; niknas = verb, masc)
more ha-kitta ha-nmuka niknas … (agreement with the masculine verb selects more)
Score of a syntax tree

[Parse tree of "more ha-kitta ha-nmuka niknas la-kitta": S → NP VP, with node labels N, COMPN, COMP, V, PREP.]

score(s) = score(more) · score(ha-kitta) · … · score(la-kitta)

The challenge: calculate the score of all syntax trees without enumerating all trees.
Dynamic Programming

Table[i, j, A] = the maximum score of all parses of w_i … w_j as the nonterminal A.
Fill the table by increasing values of j − i.

Base case (j = i): Table[i, i, A] = max { score(t_i) : t_i ∈ T_i, A → t_i ∈ G }, where T_i is the set of analyses of w_i.

Recurrence (j > i): Table[i, j, A] = max over rules A → B C ∈ G and split points i ≤ k < j of Table[i, k, B] · Table[k+1, j, C].

Time complexity: O(|G| · n³).
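This dynamic program is a max-product CKY over the scored analyses; a minimal sketch under an illustrative grammar encoding:

```python
from collections import defaultdict

def cky_max_score(analyses, lexical_rules, binary_rules, start="S"):
    """Max-product CKY over scored morphological analyses (sketch).

    analyses[i]   : {tag: score} -- the scored analyses of word i
    lexical_rules : set of (A, tag) pairs, i.e. A -> tag
    binary_rules  : set of (A, B, C) triples, i.e. A -> B C
    Returns the best score of a parse of the whole sentence as `start`.
    """
    n = len(analyses)
    table = defaultdict(float)  # (i, j, A) -> best score; 0 if unparsable
    # Base case j = i: single words.
    for i in range(n):
        for tag, s in analyses[i].items():
            for A, t in lexical_rules:
                if t == tag and s > table[i, i, A]:
                    table[i, i, A] = s
    # Fill by increasing span length j - i.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            for A, B, C in binary_rules:
                for k in range(i, j):
                    s = table[i, k, B] * table[k + 1, j, C]
                    if s > table[i, j, A]:
                        table[i, j, A] = s
    return table[0, n - 1, start]

# Toy sentence of two words; the first word is tag-ambiguous.
best = cky_max_score(
    [{"N": 0.6, "V": 0.4}, {"V": 0.5}],
    lexical_rules={("NP", "N"), ("VP", "V")},
    binary_rules={("S", "NP", "VP")},
)
```

The three nested loops over spans, rules, and split points give the O(|G|·n³) bound.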
Evaluation

[Bar chart: error rate after the Word Stage, the Pair Stage, and the Sentence Stage.]
Summary and Vision

Bring the research on Hebrew up to the state of other languages.
Learn from Hebrew about similar languages.
Develop methods for handling any language.
Develop mechanisms for understanding texts.
Better understand how the human mind works.
Pass the Turing test.
The Problem: written Hebrew texts are ambiguous. The reasons:

The vowels and gemination are omitted: the unpointed קופה corresponds to several pointed forms.
Small words are prepended: וכשתלך = ו + כש + תלך ("and when you will go").
Hebrew morphology is complex.
The structure of a Hebrew word

the lexical lemma;
short words, such as determiners, prepositions and conjunctions, prepended to the word;
suffixes for possessives and object clitics.
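This prefix structure is what multiplies the readings of a written word; a toy segmenter over the transliterated alphabet (the prefix list is a small illustrative subset, not the full inventory):

```python
# A small illustrative subset of Hebrew prefix "small words"
# (transliterated): W=and, K$=when, H=the, B=in, L=to, $=that.
PREFIXES = ["W", "K$", "H", "B", "L", "$"]

def segmentations(word, prefixes=PREFIXES):
    """Enumerate all ways of peeling prefixed small words off a
    written Hebrew word (toy sketch over the transliteration)."""
    results = [[word]]  # the reading with no further prefix
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            for rest in segmentations(word[len(p):], prefixes):
                results.append([p] + rest)
    return results

# WK$TLK ("and when you will go") = W + K$ + TLK, among other splits
splits = segmentations("WK$TLK")
```

Every extra prefix that matches doubles the candidate readings a disambiguator must score.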
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons
The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ
small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =
Hebrew morphology is complexHebrew morphology is complex
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
The structure of a Hebrew wordThe structure of a Hebrew word
the lexical lemma the lexical lemma short words such as determiners short words such as determiners
prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word
suffixes for possessives and object suffixes for possessives and object cliticsclitics
The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc
morphemesmorphemes
linguistic featureslinguistic features
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic Programming
Table[i, j, A] = the maximum score of all parses A ⇒* w_i … w_j
Fill the table by increasing values of j − i:
  j − i = 0:  Table[i, i, A] = max { score(t) : A → t ∈ G and t ∈ T_i }   (T_i = the set of analyses of w_i)
  j − i > 0:  Table[i, j, A] = max { Table[i, k, B] · Table[k+1, j, C] : A → B C ∈ G, i ≤ k < j }
Time complexity: O(|G| · n³)
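The recurrence above is a CKY-style dynamic program over a binary grammar. A minimal sketch, assuming a grammar already in Chomsky normal form and per-word analysis scores (the data format here is an illustrative assumption, not the paper's):

```python
from collections import defaultdict

def best_parse_score(words_scores, lexical, binary):
    """CKY-style DP: table[i, j, A] = max score of a parse of words i..j as A.

    words_scores: list of dicts {terminal_tag: score}, one per word
    lexical:      set of (A, tag) rules A -> tag
    binary:       set of (A, B, C) rules A -> B C
    """
    n = len(words_scores)
    table = defaultdict(float)                   # absent entries score 0
    for i, scores in enumerate(words_scores):    # spans of length 1 (j - i = 0)
        for (A, t) in lexical:
            if t in scores:
                table[i, i, A] = max(table[i, i, A], scores[t])
    for span in range(2, n + 1):                 # fill by increasing span
        for i in range(n - span + 1):
            j = i + span - 1
            for (A, B, C) in binary:
                for k in range(i, j):            # split point i <= k < j
                    s = table[i, k, B] * table[k + 1, j, C]
                    if s > table[i, j, A]:
                        table[i, j, A] = s
    return table[0, n - 1, "S"]
```

The nested loops over spans, split points, and grammar rules give the O(|G| · n³) running time.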
Evaluation
[Chart: error rate after each stage — Word Stage, Pair Stage, Sentence Stage]
ExampleExample
$QMTI $QMTI שקמתישקמתי
$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore
$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up
$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey
noun sg possessive-1sgnoun sg possessive-1sg
connective+verb 1sg pastconnective+verb 1sg past
connective + noun sg possessive-1sgconnective + noun sg possessive-1sg
Previous work
POS and Morphological disambiguation
Three stages
1. Word stage – find the most probable reading of a word, regardless of its context.
2. Pair stage – correct the analysis of a word based on the analyses of its immediate neighbors.
3. Sentence stage – use a syntactic parser to rule out improbable analyses.
Combining all three stages yielded the best results.
The Word Stage
Give each word its most probable analysis.
How to estimate the probability of each analysis?
Estimate the probability of each analysis from a large analyzed corpus?
A large enough corpus does not exist: since each word has many forms, the number of distinct word forms is so large that many of them won't appear even in a 10M-word corpus.
The Word Stage
Following the "Similar Words Method" (Levinger, Ornan and Itai, 1995): estimate the probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in a large corpus.
Example: HQPH הקפה
  "the coffee": definite → indefinite: QPH
  "encirclement": indefinite → definite: HHQPH
  "her perimeter": feminine possessive → masculine possessive: HQPW
Distribution: QPH = 180, HHQPH = 18, HQPW = 2
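A minimal sketch of turning the similar-word counts above into analysis probabilities by normalization (the real method may weight or smooth the counts; the dictionary format is an assumption for illustration):

```python
def sw_probabilities(variant_counts):
    """Similar-Words estimate: each analysis's probability is taken to be
    proportional to the corpus frequency of its unambiguous variant word."""
    total = sum(variant_counts.values())
    return {analysis: count / total for analysis, count in variant_counts.items()}

# Counts of the variant words for HQPH from the slide:
probs = sw_probabilities({"the coffee": 180, "encirclement": 18, "her perimeter": 2})
```

With these counts, "the coffee" gets probability 0.90, "encirclement" 0.09, and "her perimeter" 0.01.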
Our variation of the SW method
To overcome sparseness, we assumed that the lemma and the other morphemes/linguistic features are statistically independent; namely, P(the coffee) = P(the) · P(coffee).
Even though the assumption is not valid, the resultant ranking is correct.
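Under this independence assumption, an analysis can be scored as the product of the marginal probabilities of its lemma and features, and analyses ranked by that product. A hypothetical sketch (the data layout is illustrative, not the paper's):

```python
from math import prod

def rank_analyses(analyses):
    """analyses: {analysis_name: {feature: marginal probability}}.
    Under the independence assumption, score = product of the marginals;
    return analysis names ordered from most to least probable."""
    scored = {name: prod(features.values()) for name, features in analyses.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

Only the ranking matters here: even if the product misestimates the true joint probability, the most probable analysis can still come out on top.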
Evaluation and Complexity
Errors: 36% → 14.5%.
Complexity of the algorithm: O(n), where n is the size of the corpus.
Keeping a copy of the corpus as an inverted file reduces the complexity to linear in the number of distinct similar words.
The pair stage
Following Brill, we learned correction rules from a corpus.
The initial morphological score of an analysis is its probability, as obtained at the word stage.
Correction rules consider pairs of adjacent words: for each pair, check whether the rule applies and, if so, modify the scores.
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
Example
יוסף עדר  YWSP &DR
YWSP = proper noun, masc (Joseph)
&DR = noun masc sg abs indef (herd), score = 0.7
&DR = verb past 3sg masc (hoed), score = 0.3
The rule adds 0.5 to the verb score: 0.3 + 0.5 = 0.8; after normalization, score(noun) = 0.7/1.5 = 0.467 and score(verb) = 0.8/1.5 = 0.533.
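The arithmetic of this example can be reproduced directly (the function name and score representation are illustrative, not from the paper):

```python
def apply_correction(scores, tag, factor=0.5):
    """Add `factor` to one analysis's morphological score, then renormalize
    so that the scores of all analyses again sum to 1."""
    boosted = dict(scores)
    boosted[tag] += factor
    total = sum(boosted.values())
    return {t: s / total for t, s in boosted.items()}

# &DR: noun score 0.7, verb score 0.3; the rule boosts the verb reading.
new = apply_correction({"noun": 0.7, "verb": 0.3}, "verb")
```

Normalization keeps the boosted values comparable as probabilities: the verb reading overtakes the noun reading, 0.533 to 0.467, exactly as on the slide.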
Learning the Rules from a training corpus
Input: a training corpus where each word is correctly analyzed.
Run the word stage on the training corpus.
Generate all possible rules.
For each rule, set the correction factor to the minimum value that does more good than damage.
Choose the rule that yields the maximum benefit.
Repeat until no rule improves the overall analyses of the training corpus.
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Previous workPrevious work
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
POS and Morphological POS and Morphological disambiguationdisambiguation
jhjh
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Three stagesThree stages
11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context
22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors
33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses
Combining all three stages yielded the best results
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis
How to estimate the probability of each How to estimate the probability of each analysisanalysis
Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus
A large enough corpus does not existA large enough corpus does not exist
Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and Complexity
Errors: 36% → 14.5%.
Complexity of the algorithm: O(n), where n is the size of the corpus.
Keeping a copy of the corpus as an inverted file reduces the complexity to linear in the number of distinct similar words.
The pair stage
Following Brill, we learned correction rules from a corpus.
The initial morphological score of an analysis is its probability as obtained at the word stage.
Correction rules modify the scores by considering pairs of adjacent words, checking whether the rule applies and, if so, modifying the scores.
Example of a correction rule
If the POS of the current tag of w1 is a proper noun,
and the POS of the current tag of w2 is a noun,
and w2 has an analysis as a verb that matches w1 in gender and number,
then add 0.5 to the morphological score of w2 as a verb and normalize the scores.
Example
YWSP &DR יוסף עדר
YWSP = proper noun, masc (Joseph)
&DR = noun, masc, sg, abs, indef (herd): score = 0.7
&DR = verb, past, 3sg, masc (hoed): score = 0.3
The rule raises the verb score to 0.3 + 0.5 = 0.8; normalization then gives noun = 0.7/1.5 ≈ 0.467 and verb = 0.8/1.5 ≈ 0.533.
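The rule application and renormalization just worked through can be sketched as follows; the dict-based score representation is an assumption for illustration, not the paper's data structure:

```python
# Pair-stage correction, sketched: if w1 is a proper noun and w2 has both a
# noun and an (agreeing) verb analysis, boost the verb score by 0.5 and
# renormalize so the analysis scores of w2 again sum to 1. The gender/number
# agreement check is elided here for brevity.
def apply_correction(w1_pos, w2_scores, bonus=0.5):
    scores = dict(w2_scores)
    if w1_pos == "proper-noun" and "verb" in scores and "noun" in scores:
        scores["verb"] += bonus
        total = sum(scores.values())
        scores = {tag: s / total for tag, s in scores.items()}
    return scores

# YWSP &DR: noun 0.7 vs. verb 0.3 becomes noun ~0.467 vs. verb ~0.533.
corrected = apply_correction("proper-noun", {"noun": 0.7, "verb": 0.3})
print(corrected)
```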
Learning the Rules from a training corpus
Input: a training corpus in which each word is correctly analyzed.
- Run the word stage on the training corpus.
- Generate all possible rules.
- For each rule, set the correction factor to be the minimum value that does more good than damage.
- Choose the rule that yields the maximum benefit.
- Repeat until no rule improves the overall analyses of the training corpus.
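The greedy loop above can be sketched as follows; `candidate_rules` and `errors_after` are hypothetical stand-ins for rule generation and for scoring the corrected training corpus:

```python
# Brill-style greedy rule learning, sketched: repeatedly add the candidate
# rule that most reduces the error count, stopping when no rule helps.
def learn_rules(corpus, candidate_rules, errors_after):
    """errors_after(corpus, rules) -> number of wrong analyses after applying rules."""
    rules = []
    best = errors_after(corpus, rules)
    while True:
        scored = [(errors_after(corpus, rules + [r]), r) for r in candidate_rules]
        err, rule = min(scored, key=lambda pair: pair[0])
        if err >= best:  # no remaining rule improves the overall analysis
            return rules
        rules.append(rule)
        best = err

# Toy demonstration with two hypothetical rules and a fake error counter.
def fake_errors(corpus, rules):
    return 10 - (3 if "agree-verb" in rules else 0) - (1 if "after-prep" in rules else 0)

learned = learn_rules(None, ["agree-verb", "after-prep"], fake_errors)
print(learned)  # ['agree-verb', 'after-prep']
```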
Evaluation and Complexity
Training corpus: 4,892 word tokens; learned 93 rules; errors: 14.5% → 6.2%.
Complexity of the learning algorithm: O(c³), where c = size of the training corpus.
Complexity of the correction: O(r·n), where r = number of rules and n = size of the trial text.
The sentence stage
Use a syntactic parser to rule out improbable analyses.
The pair stage handles adjacent words; the sentence stage handles long-range dependencies.
Example
מורה הכיתה הנמוכה נכנס לכיתה
MWRH HKITH HNMWKH NKNS LKITH
more/mora ha-kitta ha-nmuka niknas … ("the teacher of the low class entered the class")
MWRH is ambiguous between masculine (more) and feminine (mora), but the verb niknas is masculine, so the correct reading is:
more ha-kitta ha-nmuka niknas …
Score of a syntax tree
[Parse tree: S → NP (more ha-kitta ha-nmuka) VP (niknas la-kitta), with preterminals COMPN, NN, J, V, COMP, PREP]
score(s) = score(more) · score(ha-kitta) · … · score(la-kitta)
The challenge: calculate the score of all syntax trees without enumerating all trees.
Dynamic Programming
Table[i,j,A] = the maximum score of all parses A ⇒* w_i … w_j.
Fill the table by increasing values of j − i.
Base case: Table[i,i,A] = max{ s_im : A → t_im ∈ G and t_im ∈ T_i }, where T_i is the set of analyses of w_i and s_im is the morphological score of analysis t_im.
For j − i > 0:
Table[i,j,A] = max over A → BC ∈ G and i ≤ k < j of Table[i,k,B] · Table[k+1,j,C].
Time complexity: O(|G|·n³).
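A compact sketch of this dynamic program (weighted CKY); the toy grammar and scores below are illustrative, not the paper's:

```python
# Weighted CKY: table[(i, j, A)] holds the best score of any parse of words
# i..j rooted at nonterminal A; cells are filled by increasing span length
# j - i, exactly as in the recurrence above.
from collections import defaultdict

def cky_best_score(words, lexical, binary, start="S"):
    """lexical: {(A, word): score}; binary: {A: [(B, C), ...]}."""
    n = len(words)
    table = defaultdict(float)  # a missing cell means no parse, score 0.0
    for i, w in enumerate(words):
        for (A, word), s in lexical.items():
            if word == w and s > table[(i, i, A)]:
                table[(i, i, A)] = s
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            for A, rhs_list in binary.items():
                for B, C in rhs_list:
                    for k in range(i, j):
                        score = table[(i, k, B)] * table[(k + 1, j, C)]
                        if score > table[(i, j, A)]:
                            table[(i, j, A)] = score
    return table[(0, n - 1, start)]

# Toy two-word sentence: S -> NP V, morphological scores on the leaves.
best = cky_best_score(
    ["more", "niknas"],
    {("NP", "more"): 0.5, ("V", "niknas"): 0.6},
    {"S": [("NP", "V")]},
)
print(best)  # 0.5 * 0.6 ≈ 0.3
```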
Evaluation
[Bar chart: error rate after each stage — Word Stage, Pair Stage, Sentence Stage]
The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo
(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus
ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine
possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Our variation of the SW methodOur variation of the SW method
To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)
Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Evaluation and ComplexityEvaluation and Complexity
Errors 36 Errors 36 145 145
Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus
Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
The pair stage The pair stage
Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus
The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage
Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Example of a correction ruleExample of a correction rule
If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun
andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun
and and ww22 has an analysis as a verb that has an analysis as a verb that
matches wmatches w11 by gender and number by gender and number
thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
ExampleExample
YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר
YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)
ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07
ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303
08
0467
0533
normalization
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
ExampleExample
מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH
moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip
mascfem verb-mascmascfem verb-masc
more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip
Score of a syntax treeScore of a syntax tree
PREP NN J
S
NP VP
V COMPN COMP
more ha-kitta ha-nmuka niknas la-kitta
score(s) = score(more)score(ha-kitta) hellip score(la-kitta)
The challenge calculate the score of all syntax trees without enumerating all trees
Dynamic ProgrammingDynamic Programming
TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses
Fill table by incrasing values of Fill table by incrasing values of
i jA w w
max and im im im iTable i i A s A t G t T
[ ] max [ ] [ 1 ]A BC Gi k j
Table i j A Table i k B Table k j C
0
Time complexity 3O G n
0j i
EvaluationEvaluation
53
147
38
362120
14
Word Stage
Pair Stage
Sentence Stage
error rate
Learning the Rules from a training Learning the Rules from a training corpuscorpus
Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed
Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the
minimum value that does more good than minimum value that does more good than damagedamage
Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall
analyses of the training corpusanalyses of the training corpus
Evaluation and ComplexityEvaluation and Complexity
Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62
Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus
Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text
The sentence stageThe sentence stage
Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses
The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies
Example

מורה הכיתה הנמוכה נכנס לכיתה
MWRH HKITH HNMWKH NKNS LKITH
("The teacher of the low class entered the classroom.")

more/mora  ha-kitta ha-nmuka  niknas …
masc/fem                      verb-masc

The masculine verb niknas selects the masculine reading:
more ha-kitta ha-nmuka niknas …
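The disambiguation step in this example (picking the noun analysis whose gender agrees with the verb) can be sketched as a toy agreement check. The function name and the tuple representation are illustrative, not part of the system described:

```python
def pick_by_agreement(analyses, verb_gender):
    """Choose the candidate analysis whose gender agrees with the verb.

    analyses:    list of (surface_form, gender) candidates for the
                 ambiguous word, e.g. MWRH -> more (masc) / mora (fem)
    verb_gender: gender feature of the governing verb
    """
    matching = [a for a in analyses if a[1] == verb_gender]
    # Fall back to the first candidate if nothing agrees.
    return matching[0] if matching else analyses[0]

# The verb niknas is masculine, so the masculine reading "more" wins.
chosen = pick_by_agreement([("more", "masc"), ("mora", "fem")], "masc")
```

In the full system this check is not a special case: it falls out of the parser, since only the agreeing analysis yields a well-formed syntax tree.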
Score of a Syntax Tree

[Parse tree: S → NP VP; NP → N COMP; VP → V COMP; leaves: more ha-kitta ha-nmuka niknas la-kitta]

score(s) = score(more) · score(ha-kitta) · … · score(la-kitta)

The challenge: compute the score of the best syntax tree without enumerating all trees.