Automatic Identification of Cognates, False Friends, and Partial Cognates
University of Ottawa, Canada

Outline
- Overview of the Thesis
- Research Contribution
- Cognate and False Friend Identification
- Partial Cognate Disambiguation
- CLPA - Cognate and False Friend Annotator
- Conclusions and Future Work

Overview of the Thesis
- Tasks
  - Automatic Identification of Cognates and False Friends
  - Automatic Disambiguation of Partial Cognates
- Areas of Application
  - CALL, MT, Word Alignment, Cross-Language Information Retrieval
- CALL Tool - CLPA

Definitions
- Cognates, or True Friends (Vrais Amis), are pairs of words that are perceived as similar and are mutual translations.
  Examples: nature - nature, reconnaissance - recognition
- False Friends (Faux Amis) are pairs of words in two languages that are perceived as similar but have different meanings.
  Examples: main (= hand) - main (principal, essential), blesser (= to injure) - bless (bénir in French)
- Partial Cognates are words that share the same meaning in two languages in some but not all contexts.
  Examples: note - note, facteur - factor or mailman, maker

Research Contribution
- Novel method based on ML algorithms to identify Cognates and False Friends
- A method to create complete lists of Cognates and False Friends
- Define a novel task, Partial Cognate Disambiguation, and solve it using a supervised and a semi-supervised method
  - Combine and use corpora from different domains
- Implement a CALL tool, CLPA, to annotate Cognates and False Friends

Cognates and False Friends Identification
Our method
- Machine Learning techniques with different algorithms
- Instances: French-English pairs of words
- Feature Space: 13 orthographic similarity measures
- Classes: Cog_FF and Unrelated
Experiments done for:
- Each measure separately
- Average of all measures
- All 13 measures
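
The slides do not define the 13 measures. As a rough illustration of how a word pair can be turned into a feature vector, the sketch below implements three of the measures named in the results (PREFIX, DICE, LCSR) using their commonly cited definitions; these definitions, the function names, and the three-feature vector are assumptions, not the thesis implementation.

```python
from collections import Counter


def prefix(a: str, b: str) -> float:
    """Length of the shared prefix divided by the length of the longer word."""
    longest = max(len(a), len(b))
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / longest if longest else 0.0


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer word."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(word: str) -> list:
    """All adjacent character pairs in the word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice(a: str, b: str) -> float:
    """Twice the number of shared bigrams over the total number of bigrams."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    shared = sum((ca & cb).values())
    total = sum(ca.values()) + sum(cb.values())
    return 2 * shared / total if total else 0.0


def feature_vector(french: str, english: str) -> list:
    """One instance for the classifier: similarity scores for a word pair."""
    return [prefix(french, english), dice(french, english), lcsr(french, english)]


print(feature_vector("reconnaissance", "recognition"))
```

Each score lies between 0 and 1, so a single measure can be thresholded on its own or all scores can be passed together to a learning algorithm.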

Cognates and False Friends Identification
Data

                 Training set   Test set
Cognates         613 (73)       603 (178)
False Friends    314 (135)      94 (46)
Unrelated        527 (0)        343 (0)
Total            1454           1040

Results for classification (COG_FF/UNREL)

Orthographic similarity measure   Threshold   Accuracy on training set   Accuracy on test set
IDENT                             1           43.90 %                    55.00 %
PREFIX                            0.03845     92.70 %                    90.97 %
DICE                              0.29669     89.40 %                    93.37 %
LCSR                              0.45800     92.91 %                    94.24 %
NED                               0.34845     93.39 %                    93.57 %
SOUNDEX                           0.62500     85.28 %                    84.54 %
TRI                               0.0476      88.30 %                    92.13 %
XDICE                             0.21825     92.84 %                    94.52 %
XXDICE                            0.12915     91.74 %                    95.39 %
TRI-SIM                           0.34845     95.66 %                    93.28 %
TRI-DIST                          0.34845     95.11 %                    93.85 %
Average measure                   0.14770     93.83 %                    94.14 %
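
The slides do not say how the per-measure thresholds were obtained. One plausible reading is that a single cut-off on the similarity score is chosen to maximize training accuracy, as in the sketch below; the function name, the "score >= threshold means Cog_FF" rule, and the toy data are all assumptions.

```python
def best_threshold(scores, labels):
    """Pick the cut-off on one similarity measure that maximizes training accuracy.

    scores: similarity values, one per word pair.
    labels: True for Cog_FF pairs, False for Unrelated pairs.
    A pair is predicted Cog_FF when its score is >= the threshold.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy data, not taken from the thesis.
threshold, accuracy = best_threshold([0.1, 0.2, 0.6, 0.9], [False, False, True, True])
print(threshold, accuracy)
```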

Results for classification (COG_FF/UNREL)

Classifier               Cross-validation accuracy on training set   Accuracy on test set
Baseline                 63.75 %                                     66.98 %
OneRule                  95.66 %                                     92.89 %
Naive Bayes              94.84 %                                     94.62 %
Decision Tree            95.66 %                                     92.08 %
Decision Tree (pruned)   95.66 %                                     93.18 %
IBK                      93.81 %                                     92.80 %
Ada Boost                95.66 %                                     93.47 %
Perceptron               95.11 %                                     91.55 %
SVM (SMO)                95.46 %                                     93.76 %

Complete Lists of Cognates and False Friends
Method
- Use the XXDICE orthographic similarity measure
- Use a list of pairs of words in two languages (the words may or may not be translations of each other, or may come from monolingual lists of words)
- Use a bilingual dictionary to determine whether the words in a pair are translations of each other
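
The three steps of the method can be sketched as a small labeling routine. This is only an illustration: `difflib.SequenceMatcher` stands in for XXDICE (it is not the same measure), the 0.6 cut-off is arbitrary, and the dictionary is a hypothetical French-to-English mapping.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a: str, b: str) -> float:
    """Placeholder for XXDICE: a generic string-similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def label_pairs(pairs, dictionary, similarity=stand_in_similarity, threshold=0.6):
    """Label each (French, English) pair.

    Orthographically dissimilar pairs are UNRELATED. Similar pairs are
    COGNATE if the dictionary lists the English word as a translation of
    the French word, and FALSE_FRIEND otherwise.
    """
    labels = {}
    for french, english in pairs:
        if similarity(french, english) < threshold:
            labels[(french, english)] = "UNRELATED"
        elif english in dictionary.get(french, set()):
            labels[(french, english)] = "COGNATE"
        else:
            labels[(french, english)] = "FALSE_FRIEND"
    return labels


# Hypothetical dictionary entries for illustration only.
toy_dictionary = {"nature": {"nature"}, "blesser": {"injure"}, "main": {"hand"}}
toy_pairs = [("nature", "nature"), ("blesser", "bless"), ("main", "recognition")]
result = label_pairs(toy_pairs, toy_dictionary)
print(result)
```

Passing the similarity function as a parameter keeps the sketch independent of the exact measure, so an XXDICE implementation and its learned threshold could be swapped in.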

Complete Lists of Cognates and False Friends
Evaluation
- On the entry list of a French-English bilingual dictionary
  - 55% - Cognates
  - 2% - False Friends (5,619,270 pairs)
- We created pairs of words from two large monolingual lists of words in French and English
  - 11,469,662 - Orthographically Similar (0.8%)
    - 3,496 Cognates (0.03%)
    - 3,767,435 False Friends (32%)

Cognates and False Friends Identification
Conclusion
- We tested a number of orthographic similarity measures individually, and also combined them using different Machine Learning algorithms
- We evaluated the methods on a training set using 10-fold cross-validation, and on a test set
- We proposed an extension of the method to create complete lists of Cognates and False Friends
- The results show that, for French and English, it is possible to achieve very good accuracy based on the orthographic measures of word similarity

Partial Cognate Disambiguation
Task
- To determine the sense/meaning (Cognate or False Friend with the equivalent English word) of a Partial Cognate in a French context
Example: note
- COG: Le comité prend note de cette information. / The Committee takes note of this reply.
- FF: Mais qui a dû payer la note? / So who got left holding the bill?

Data
- Use a set of 10 Partial Cognates
  - Parallel sentences that have the French Partial Cognate on the French side and the English Cognate (English False Friend) on the English side, labeled as COG (FF)
- Collected from Europarl, Hansard
  - ~115 sentences per class for training
  - ~60 sentences per class for testing

Supervised Method
- Traditional ML algorithms
- Features
  - Used the bag-of-words (BOW) approach to modeling context, with binary feature values
  - Context words from the training corpus that appeared at least 3 times in the training sentences
- Classes: COG and FF
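
The feature construction described above can be sketched as follows. Whitespace tokenization and lowercasing are assumptions (the slides do not say how sentences were tokenized), and the function names are illustrative.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=3):
    """Keep context words that occur at least min_count times in the training sentences."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return sorted(word for word, c in counts.items() if c >= min_count)


def binary_bow(sentence, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary word is in the sentence, else 0."""
    words = set(sentence.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Toy sentences; min_count is lowered to 2 only because the sample is tiny.
training = ["payer la note", "la note de frais", "prend note de la", "payer note"]
vocab = build_vocabulary(training, min_count=2)
print(vocab)
print(binary_bow("payer la facture", vocab))
```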

Monolingual Bootstrapping
For each pair of partial cognates (PC)
1. Train a classifier on the training seeds, using the BOW approach and a NB-K classifier with attribute selection on the features
2. Apply the classifier to unlabeled data: sentences that contain the PC word, extracted from LeMonde (MB-F) or from the BNC (MB-E)
3. Take the first k newly classified sentences from both the COG and the FF class and add them to the training seeds (the most confident ones, predicted with a confidence greater than or equal to a threshold of 0.85)
4. Rerun the experiments, training on the new training set
5. Repeat steps 2 and 3 t times
endFor
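
The loop above is a form of self-training and can be sketched as below. The classifier here is a toy keyword-voting stand-in, not the NB-K classifier from the slides, and the function names, the `(label, confidence)` model interface, and the early stop when nothing is confident enough are all assumptions.

```python
def train_keyword_model(labeled):
    """Toy stand-in classifier: votes using words seen with only one label."""
    seen = {"COG": set(), "FF": set()}
    for sentence, label in labeled:
        seen[label].update(sentence.lower().split())

    def model(sentence):
        words = set(sentence.lower().split())
        cog = len(words & (seen["COG"] - seen["FF"]))
        ff = len(words & (seen["FF"] - seen["COG"]))
        if cog == ff:
            return "COG", 0.5  # no evidence either way
        label = "COG" if cog > ff else "FF"
        return label, max(cog, ff) / (cog + ff)

    return model


def bootstrap(seeds, unlabeled, train, k=2, threshold=0.85, iterations=3):
    """Self-training: repeatedly add the k most confident sentences per class."""
    labeled = list(seeds)
    pool = list(unlabeled)
    for _ in range(iterations):
        model = train(labeled)                       # steps 1 and 4
        scored = [(s, *model(s)) for s in pool]      # step 2
        added = []
        for cls in ("COG", "FF"):                    # step 3
            confident = sorted(
                (x for x in scored if x[1] == cls and x[2] >= threshold),
                key=lambda x: x[2],
                reverse=True,
            )
            added.extend(confident[:k])
        if not added:
            break  # nothing confident enough; stop early
        labeled.extend((s, label) for s, label, _ in added)
        chosen = {s for s, _, _ in added}
        pool = [s for s in pool if s not in chosen]
    return labeled


seeds = [("prend note information", "COG"), ("payer la note", "FF")]
unlabeled = ["prend note du rapport", "payer note restaurant", "note"]
grown = bootstrap(seeds, unlabeled, train_keyword_model)
print(grown)
```

A final classifier would then be trained on the returned set, for example with `train_keyword_model(grown)` in this toy setting.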

Bilingual Bootstrapping
1. Translate the English sentences that were collected in the MB-E step into French using an online MT tool and add them to the French seed training data.
2. Repeat the MB-F and MB-E steps T times.

Additional Data
- LeMonde
  - An average of 250 sentences for each class
- BNC
  - An average of 200 sentences for each class
- Multi-Domain corpus
  - An average of 80 sentences for each class

Partial Cognate Disambiguation
Conclusions
- Simple methods and available tools are used with success for a task that is hard to solve even for humans
- Additional use of unlabeled data improves the learning process for the Partial Cognate Disambiguation task
- Semi-Supervised Learning proves to be "as good as" Supervised Learning

Future Work
- Apply the Cognate and False Friend Identification method, and create complete lists for other pairs of languages
- Increase the accuracy results for the Partial Cognate Disambiguation task
- Use lemmatization for French texts and human evaluation for CLPA