Automatic Identification of Cognates, False Friends, and Partial Cognates

University of Ottawa, Canada

Outline

- Overview of the Thesis
- Research Contribution
- Cognate and False Friend Identification
- Partial Cognate Disambiguation
- CLPA - Cognate and False Friend Annotator
- Conclusions and Future Work

Overview of the Thesis

- Tasks
    - Automatic Identification of Cognates and False Friends
    - Automatic Disambiguation of Partial Cognates
- Areas of Application
    - CALL, MT, Word Alignment, Cross-Language Information Retrieval
- CALL Tool - CLPA

Definitions

Cognates, or True Friends (Vrais Amis), are pairs of words that are perceived as similar and are mutual translations.
    nature - nature, reconnaissance - recognition

False Friends (Faux Amis) are pairs of words in two languages that are perceived as similar but have different meanings.
    main (= hand) - main (principal, essential), blesser (= to injure) - bless (bénir in French)

Partial Cognates are words that share the same meaning in two languages in some but not all contexts.
    note - note, facteur - factor or mailman, maker

Research Contribution

- Novel method based on ML algorithms to identify Cognates and False Friends
- A method to create complete lists of Cognates and False Friends
- Define a novel task, Partial Cognate Disambiguation, and solve it using a supervised and a semi-supervised method
    - Combine and use corpora from different domains
- Implement a CALL Tool, CLPA, to annotate Cognates and False Friends

Cognates and False Friends Identification

Our method
- Machine Learning techniques with different algorithms
- Instances: French-English pairs of words
- Feature space: 13 orthographic similarity measures
- Classes: Cog_FF and Unrelated

Experiments done for:
- Each measure separately
- Average of all measures
- All 13 measures
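To make the feature space concrete, here is a minimal sketch of four of the measures named in the results (IDENT, PREFIX, DICE, LCSR). The slides do not define them, so the formulas below are the commonly used definitions and should be read as assumptions; the function names and the `features` helper are illustrative only.

```python
from collections import Counter


def ident(a: str, b: str) -> float:
    """IDENT: 1 if the two words are spelled identically, else 0."""
    return 1.0 if a == b else 0.0


def prefix(a: str, b: str) -> float:
    """PREFIX: length of the common prefix over the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / longest


def bigrams(word: str) -> list:
    """All adjacent letter pairs of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice(a: str, b: str) -> float:
    """DICE: twice the number of shared bigrams over the total number of bigrams."""
    ba, bb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    return 2 * sum((ba & bb).values()) / total


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (dynamic programming, one row at a time)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCSR: longest common subsequence length over the length of the longer word."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def features(fr: str, en: str) -> dict:
    """Feature vector for one French-English pair (only 4 of the 13 measures)."""
    return {"IDENT": ident(fr, en), "PREFIX": prefix(fr, en),
            "DICE": dice(fr, en), "LCSR": lcsr(fr, en)}
```

A pair such as `features("nature", "nature")` scores 1.0 on all four measures, while unrelated pairs tend to score near 0, which is what lets a classifier separate Cog_FF from Unrelated.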

Cognates and False Friends Identification

Data

                 Training set   Test set
Cognates         613 (73)       603 (178)
False Friends    314 (135)      94 (46)
Unrelated        527 (0)        343 (0)
Total            1454           1040

Results for classification (COG_FF/UNREL)

Orthographic         Threshold   Accuracy on     Accuracy on
similarity measure               training set    test set
IDENT                1           43.90 %         55.00 %
PREFIX               0.03845     92.70 %         90.97 %
DICE                 0.29669     89.40 %         93.37 %
LCSR                 0.45800     92.91 %         94.24 %
NED                  0.34845     93.39 %         93.57 %
SOUNDEX              0.62500     85.28 %         84.54 %
TRI                  0.0476      88.30 %         92.13 %
XDICE                0.21825     92.84 %         94.52 %
XXDICE               0.12915     91.74 %         95.39 %
TRI-SIM              0.34845     95.66 %         93.28 %
TRI-DIST             0.34845     95.11 %         93.85 %
Average measure      0.14770     93.83 %         94.14 %
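The slides do not say how each threshold was obtained. One plausible reading, sketched here as an assumption, is a one-feature cut-off chosen to maximise accuracy on the training pairs. The function names and the toy scores are made up for illustration and are not the thesis data.

```python
def best_threshold(scores, labels):
    """Return (cut, accuracy) for the cut-off that best separates the classes.

    scores: similarity values for the training pairs.
    labels: True for COG_FF pairs, False for UNREL pairs.
    A pair is predicted COG_FF when its score is >= cut.
    """
    distinct = sorted(set(scores))
    # Candidate cuts: the lowest score, midpoints between neighbours,
    # and a value just above the highest score (predict nothing as COG_FF).
    cuts = [distinct[0]]
    cuts += [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    cuts.append(distinct[-1] + 1e-9)

    best_cut, best_acc = cuts[0], -1.0
    for cut in cuts:
        correct = sum((s >= cut) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # keep the first cut reaching the best accuracy
            best_cut, best_acc = cut, acc
    return best_cut, best_acc


def classify(score, cut):
    """Apply a learned cut-off to one similarity score."""
    return "COG_FF" if score >= cut else "UNREL"


# Toy, invented scores: three similar pairs and three dissimilar ones.
toy_scores = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]
toy_labels = [True, True, True, False, False, False]
cut, acc = best_threshold(toy_scores, toy_labels)
```

With a cut-off in hand, classification is a single comparison; for example, using the LCSR threshold from the table, `classify(0.5, 0.458)` gives "COG_FF".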

Results for classification (COG_FF/UNREL)

Classifier               Accuracy, cross-val.   Accuracy on
                         on training set        test set
Baseline                 63.75 %                66.98 %
OneRule                  95.66 %                92.89 %
Naive Bayes              94.84 %                94.62 %
Decision Tree            95.66 %                92.08 %
Decision Tree (pruned)   95.66 %                93.18 %
IBK                      93.81 %                92.80 %
Ada Boost                95.66 %                93.47 %
Perceptron               95.11 %                91.55 %
SVM (SMO)                95.46 %                93.76 %

Complete Lists of Cognates and False Friends

Method
- Use the XXDICE orthographic similarity measure
- Use a list of word pairs in two languages (the words may or may not be translations of each other), or monolingual lists of words
- Use a bilingual dictionary to determine whether the words in a pair are translations of each other
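The three steps above can be sketched as a small filter-and-lookup routine. This is only an illustration under stated assumptions: XXDICE is not defined in the slides, so plain bigram DICE stands in for it; the cut-off 0.29669 is the DICE threshold from the results table, reused here as a guess at what counts as orthographically similar; and the dictionary is a hypothetical three-entry toy.

```python
from collections import Counter


def dice(a: str, b: str) -> float:
    """Bigram DICE, used here as a stand-in for XXDICE (an assumption)."""
    ba = Counter(a[i:i + 2] for i in range(len(a) - 1))
    bb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(ba.values()) + sum(bb.values())
    return 2 * sum((ba & bb).values()) / total if total else 0.0


def label_pairs(pairs, dictionary, threshold=0.29669):
    """Label orthographically similar pairs as cognates or false friends.

    pairs: iterable of (french_word, english_word).
    dictionary: maps a French word to the set of its English translations.
    Pairs below the similarity threshold are dropped as unrelated.
    """
    labeled = {}
    for fr, en in pairs:
        if dice(fr, en) < threshold:
            continue  # not similar enough in spelling
        is_translation = en in dictionary.get(fr, set())
        labeled[(fr, en)] = "COG" if is_translation else "FF"
    return labeled


# Hypothetical toy dictionary and candidate pairs.
toy_dictionary = {
    "nature": {"nature"},
    "blesser": {"injure", "hurt"},
    "main": {"hand"},
}
toy_pairs = [("nature", "nature"), ("blesser", "bless"),
             ("main", "main"), ("main", "hand")]
result = label_pairs(toy_pairs, toy_dictionary)
```

Here `("main", "hand")` is a correct translation but shares no bigrams, so it is filtered out before the dictionary is consulted; only similar-looking pairs are ever labeled.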

Complete Lists of Cognates and False Friends

Evaluation
- On the entry list of a French-English bilingual dictionary
    55% - Cognates
    2% - False Friends (5,619,270 pairs)
- We created pairs of words from two large monolingual lists of French and English words
    11,469,662 - Orthographically Similar (0.8%)
    - 3,496 Cognates (0.03%)
    - 3,767,435 False Friends (32%)

Cognates and False Friends Identification

Conclusion

- We tested a number of orthographic similarity measures individually, and also combined them using different Machine Learning algorithms
- We evaluated the methods on a training set using 10-fold cross-validation, and on a test set
- We proposed an extension of the method to create complete lists of Cognates and False Friends
- The results show that, for French and English, it is possible to achieve very good accuracy based on orthographic measures of word similarity

Partial Cognate Disambiguation

Task
- To determine the sense/meaning (Cognate or False Friend with the equivalent English word) of a Partial Cognate in a French context

Example: note
    COG: Le comité prend note de cette information.
         The Committee takes note of this reply.
    FF:  Mais qui a dû payer la note?
         So who got left holding the bill?

Data

- Use a set of 10 Partial Cognates
    - Parallel sentences that have the French Partial Cognate on the French side and the English Cognate (English False Friend) on the English side, labeled as COG (FF)
- Collected from Europarl and Hansard
    - ~115 sentences per class for training
    - ~60 sentences per class for testing

Supervised Method

Traditional ML algorithms

Features
- used the bag-of-words (BOW) approach to modeling context, with binary feature values
- context words from the training corpus that appeared at least 3 times in the training sentences

Classes
- COG and FF
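The feature construction described above can be sketched in a few lines. The tokenizer, the function names, and the choice to leave the partial cognate itself out of the vocabulary are assumptions made for illustration; the slides only specify binary values and the minimum count of 3.

```python
import re
from collections import Counter


def tokenize(sentence: str) -> list:
    """Lowercase word tokens; a deliberately simple tokenizer (an assumption)."""
    return re.findall(r"\w+", sentence.lower())


def build_vocab(sentences, target, min_count=3):
    """Context words occurring at least min_count times in the training sentences.

    The target partial cognate is skipped (an assumption): it occurs in every
    sentence and so carries no information about the class.
    """
    counts = Counter(tok for s in sentences for tok in tokenize(s) if tok != target)
    return sorted(word for word, c in counts.items() if c >= min_count)


def vectorize(sentence, vocab):
    """Binary bag-of-words vector: 1 if the vocabulary word is present, else 0."""
    present = set(tokenize(sentence))
    return [1 if word in present else 0 for word in vocab]


# Invented toy sentences around the partial cognate "note".
toy_sentences = [
    "payer la note",
    "la note du restaurant",
    "prend note de la situation",
    "payer la facture",
]
vocab_min3 = build_vocab(toy_sentences, target="note")
vocab_min2 = build_vocab(toy_sentences, target="note", min_count=2)
vector = vectorize("Qui va payer ?", vocab_min2)
```

Binary values record only whether a context word is present, so a word repeated within one sentence counts once in the vector even though every occurrence counts toward the frequency cut-off.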

Monolingual Bootstrapping

For each pair of partial cognates (PC):
1. Train a classifier on the training seeds, using the BOW approach and an NB-K classifier with attribute selection on the features
2. Apply the classifier to unlabeled data: sentences that contain the PC word, extracted from LeMonde (MB-F) or from the BNC (MB-E)
3. Take the first k newly classified sentences from both the COG and the FF class and add them to the training seeds (the most confident ones, with a prediction probability greater than or equal to a threshold of 0.85)
4. Rerun the experiments, training on the new training set
5. Repeat steps 2 and 3 t times
endFor
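The loop above can be sketched as follows. This is a minimal illustration, not the thesis implementation: a small hand-written Naive Bayes with add-one smoothing stands in for the NB-K classifier, attribute selection is omitted, and the names `train_nb`, `predict`, `bootstrap`, and the toy token lists are all hypothetical.

```python
import math
from collections import Counter

LABELS = ("COG", "FF")


def train_nb(examples):
    """Fit word counts per class from (tokens, label) examples."""
    word_counts = {label: Counter() for label in LABELS}
    class_counts = Counter()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = set(word_counts["COG"]) | set(word_counts["FF"])
    return word_counts, class_counts, vocab


def predict(model, tokens):
    """Return (label, probability) for the most likely class."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in LABELS:
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log((class_counts[label] + 1) / (total + len(LABELS)))
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    probs = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best] / sum(probs.values())


def bootstrap(seeds, unlabeled, k=2, threshold=0.85, rounds=3):
    """Grow the training set with the k most confident predictions per class."""
    train, pool = list(seeds), list(unlabeled)
    for _ in range(rounds):
        model = train_nb(train)
        scored = [predict(model, tokens) for tokens in pool]
        added = []
        for label in LABELS:
            confident = sorted(
                ((conf, i) for i, (lab, conf) in enumerate(scored)
                 if lab == label and conf >= threshold),
                reverse=True)[:k]
            added.extend((i, label) for _, i in confident)
        if not added:
            break  # nothing passed the confidence threshold
        train.extend((pool[i], label) for i, label in added)
        taken = {i for i, _ in added}
        pool = [tokens for i, tokens in enumerate(pool) if i not in taken]
    return train_nb(train), train


# Invented toy data: one seed per class and three unlabeled contexts.
toy_seeds = [(["comité", "prend", "information"], "COG"),
             (["payer", "restaurant", "addition"], "FF")]
toy_unlabeled = [["comité", "prend", "information", "nouvelle"],
                 ["payer", "restaurant", "addition", "salée"],
                 ["ville"]]
final_model, final_train = bootstrap(toy_seeds, toy_unlabeled)
```

In the toy run the two contexts that overlap strongly with a seed pass the 0.85 threshold and join the training set, while the uninformative one stays in the pool, so the loop stops early in the second round.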

Bilingual Bootstrapping

1. Translate the English sentences that were collected in the MB-E step into French using an online MT tool and add them to the French seed training data.
2. Repeat the MB-F and MB-E steps T times.

Additional Data

- LeMonde
    - An average of 250 sentences for each class
- BNC
    - An average of 200 sentences for each class
- Multi-Domain corpus
    - An average of 80 sentences for each class

Results

[Figure: results chart; axis ticks from 50 to 84 in steps of 2; labels S_NC, S+NC_TS, S_TS+NC, S_TS; legend NO_BST, MB, BB, MB+BB]

Partial Cognate Disambiguation

Conclusions

- Simple methods and available tools are used successfully on a task that is hard to solve even for humans
- Additional use of unlabeled data improves the learning process for the Partial Cognate Disambiguation task
- Semi-Supervised Learning proves to be "as good as" Supervised Learning

CLPA - Cross Language Pair Annotator

Future Work

- Apply the Cognate and False Friend Identification method, and create complete lists for other pairs of languages
- Increase the accuracy results for the Partial Cognate Disambiguation task
- Use lemmatization for French texts and human evaluation for CLPA

Thank you!
