Using character n-grams to classify native language in a non-native English corpus of transcribed speech
Charlotte Vaughn, Janet Pierrehumbert, Hannah Rohde
Northwestern University
AACL 2009 | University of Alberta | October 10

Authorship attribution
▸ Use various components of writing (e.g. syntactic, stylistic, discourse-level) to determine aspects of author’s identity
  – e.g. gender, emotional state, native language, actual identity
(Mosteller and Wallace, 1964; Koppel, Schler, and Zigdon, 2005)

Native language classification
▸ Examined English writing from the International Corpus of Learner English (ICLE)
  – Used subcorpora from 5 different native language backgrounds: Bulgarian, Czech, French, Russian, Spanish
▸ Divided each document into character n-grams
  – e.g. ‘bigrams’ = ‘_b’, ‘bi’, ‘ig’, ‘gr’, ‘ra’, ‘am’, ‘ms’, and ‘s_’
▸ Used multi-class support vector machine (SVM) to classify each document by native language of writer
(Tsur and Rappoport, 2007)

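The n-gram split described above can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' code: the function name `char_ngrams` is hypothetical, and it assumes that each word is padded with `_` on both sides and that n-grams do not cross word boundaries, which the slide does not specify.

```python
def char_ngrams(text, n):
    """Return the character n-grams of every word in `text`.

    Assumption: each word is lower-cased and padded with '_' on both
    sides to mark the neighbouring spaces; n-grams stay within a word.
    """
    grams = []
    for word in text.lower().split():
        padded = "_" + word + "_"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


# Reproduces the example on the slide.
print(char_ngrams("bigrams", 2))
# ['_b', 'bi', 'ig', 'gr', 'ra', 'am', 'ms', 's_']
```
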
Findings
▸ Obtained 65.6% accuracy in identifying native language of the author based on character bigrams alone
  – Compared with 20% random baseline accuracy, 46.78% accuracy for character unigrams, and 59.67% for character trigrams
(Tsur and Rappoport, 2007)

Interpretation
▸ Speculated that “use of L2 words is strongly influenced by L1 sounds and sound patterns” (p. 16)
  – bigrams ≈ diphones
▸ Language transfer evident on many levels
  – Effect of L1 on L2 pronunciation is widely attested (Flege, 1987, 1995; Mack, 2003)
▸ But, what if your L1 background doesn’t just affect how you say words in your L2, but what words you use in the first place?
(Tsur and Rappoport, 2007)

Drawbacks and open questions from Tsur and Rappoport (2007)
▸ How generalizable are these results to speech?
  – Writing is a more conscious, deliberate process than speech
  – If this really is a phonological process, we might expect stronger effects in speech
▸ Used corpus uncontrolled for topic content
  – Did use tf-idf measure to address possible content bias, but nonetheless a highly variable corpus
▸ What is driving this effect?
  – Little evidence offered for the L1-driven phonological hypothesis

Goals of present study
▸ Extend methodology to naturalistic speech data
▸ Use semantically controlled corpus to minimize variability in topic or register
▸ Explore classifier input in order to pinpoint the source(s) of the effect

The corpus
▸ The Wildcat Corpus of Native- and Foreign-Accented English (from Northwestern University)
  – Both scripted and spontaneous speech recordings
  – Orthographically transcribed
  – 24 native English speakers & 52 non-native English speakers
    English (n=24), Korean (n=20), Mandarin Chinese (n=20), Indian (n=2), Spanish (n=2), Turkish (n=2), Italian (n=1), Iranian (n=1), Japanese (n=1), Macedonian (n=1), Russian (n=1), Thai (n=1)
  – Designed in part to examine communication between talkers of different language backgrounds
(Van Engen, Baese-Berk, Baker, Choi, Kim, and Bradlow, in press)

Diapix task (Van Engen, Baese-Berk, Baker, Choi, Kim, and Bradlow, in press)

Subcorpus details

                             English (n=24)   Korean (n=20)   Mandarin (n=20)   Total
Word tokens                  15,617           17,253          19,168            52,038
Word types                   981              927             915               1,461
Word type/token ratio        0.063            0.054           0.048
Unique character bigrams     402              382             378
Unique character trigrams    2,141            2,006           1,982

Space = _   Apostrophe = ‘

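As a quick check on the table, the type/token ratios can be recomputed from the token and type counts. The sketch below uses only the numbers reported on the slide; the variable and function names are illustrative. The word types in the Total column are not the sum of the per-language types because the same word can occur in more than one subcorpus.

```python
# Token and type counts copied from the subcorpus table.
subcorpora = {
    "English": {"tokens": 15617, "types": 981},
    "Korean": {"tokens": 17253, "types": 927},
    "Mandarin": {"tokens": 19168, "types": 915},
}


def type_token_ratio(types, tokens):
    """Number of distinct words divided by number of running words."""
    return types / tokens


ratios = {
    name: round(type_token_ratio(c["types"], c["tokens"]), 3)
    for name, c in subcorpora.items()
}
total_tokens = sum(c["tokens"] for c in subcorpora.values())

print(ratios)        # {'English': 0.063, 'Korean': 0.054, 'Mandarin': 0.048}
print(total_tokens)  # 52038
```
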
Test
▸ k Nearest Neighbors (kNN)
  – k = number of neighbors
  – 1 speaker = 1 document = 1 vector
    • Multidimensional vectors of frequencies represent either: all words, all bigrams, or all trigrams
  – Random 80% documents training, 20% testing

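Classifier
[Figure: documents plotted as frequency vectors, e.g. (5, 3, 0), over n-gram axes /ab/, /bc/, /cd/, with angle θ; classes: Native English, Native Korean, Native Mandarin]
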
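The setup on the two preceding slides (one frequency vector per speaker, a random 80/20 split, a majority vote among the k nearest training vectors) can be sketched as follows. This is an illustration under stated assumptions, not the classifier used in the study: the slides do not name the distance measure, so Euclidean distance is assumed here, the function names are hypothetical, and the toy vectors and labels are invented for the example.

```python
import math
import random
from collections import Counter


def split_documents(docs, train_share=0.8, seed=0):
    """Shuffle the documents and split them into training and test sets."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def euclidean(a, b):
    """Assumed distance measure; the slides do not specify one."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Return the majority label among the k training vectors nearest to query.

    `train` is a list of (vector, label) pairs. Ties go to the label
    whose neighbour is encountered first, i.e. the nearest one.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented toy data: counts of the n-grams /ab/, /bc/, /cd/ per speaker.
toy_train = [
    ((5, 3, 0), "English"), ((4, 3, 1), "English"),
    ((0, 2, 6), "Korean"), ((1, 1, 5), "Korean"),
    ((2, 6, 2), "Mandarin"), ((3, 7, 1), "Mandarin"),
]

print(knn_predict(toy_train, (5, 2, 0), k=3))  # English
```

With real data, `split_documents` would be applied to the list of speaker vectors first, and accuracy would be the share of held-out speakers whose predicted label matches their native language.
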
Results (in percent correct)

k    Words    Bigrams    Trigrams
1    69.2     69.5       69.2
4    53.8     61.5       76.9
8    69.2     61.5       69.2

Little decrease in accuracy after removing most frequent words

What is doing the classifying?
▸ Pick out n-grams that are:
  – maximally variant in frequency between language backgrounds
  – fairly frequent

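One way to operationalize the two criteria above is to rank n-grams by the variance of their relative frequency across language groups, after discarding n-grams below a minimum total count. The slides do not say how "maximally variant" or "fairly frequent" were measured, so the variance score, the `min_total` threshold, and the function name `rank_ngrams` are all assumptions, and the counts are invented toy data.

```python
from collections import Counter
from statistics import pvariance


def rank_ngrams(group_counts, min_total=5):
    """Rank n-grams by how much their relative frequency varies across groups.

    `group_counts` maps a group name to a Counter of n-gram counts.
    N-grams whose summed count is below `min_total` are dropped
    (the "fairly frequent" filter, with an assumed threshold).
    """
    totals = {g: sum(c.values()) for g, c in group_counts.items()}
    all_grams = set().union(*group_counts.values())
    scored = []
    for gram in all_grams:
        total = sum(c[gram] for c in group_counts.values())
        if total < min_total:
            continue
        rel = [group_counts[g][gram] / totals[g] for g in group_counts]
        scored.append((pvariance(rel), gram))
    # Highest variance first; alphabetical order breaks exact ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gram for _, gram in scored]


# Invented counts, for illustration only.
toy_counts = {
    "English": Counter({"n't": 30, "to_": 20, "the": 50}),
    "Korean": Counter({"n't": 5, "to_": 40, "the": 55}),
    "Mandarin": Counter({"n't": 10, "to_": 35, "the": 54, "xq_": 1}),
}

print(rank_ngrams(toy_counts))  # ["n't", 'to_', 'the']
```
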
What is doing the classifying?
▸ Look for possible phonological effects
  – Maybe English speakers use words with difficult consonant clusters that non-native speakers avoid?
st_
[Figure: words shown: just, first]

So what is doing the classifying?
▸ A number of things…

Case 1: Single function word
to_
[Figure: word shown: to]
N-gram significant because of one single function word
Other examples: ut_ = ‘but’ and ‘about’; _wi and ll_ = ‘will’

Case 2: Single interjection
oh_
[Figure: word shown: oh]
N-gram significant because of one single interjection or discourse marker
Other examples: hm_ = ‘mhm’; yes = ‘yes’; no_ = ‘no’

Case 3: Single morpheme
n’t
[Figure: words shown: don’t, doesn’t, didn’t, can’t]
N-gram significant because of one single morpheme

Combination of cases
_ho
Function and content words
Vocabulary items
[Figure: words shown: to, how, holding, house, honey]

Combination of cases
_ca
Content and function words
[Figure: words shown: to, cat, can, case, carrying]

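The case-by-case inspection on the preceding slides amounts to tracing a given n-gram back to the words that contain it. A minimal sketch of that lookup follows; the function name `words_behind_ngram` is hypothetical, the padding convention (one `_` on each side of every word) is an assumption, and the example sentence is invented rather than taken from the corpus.

```python
from collections import Counter


def words_behind_ngram(tokens, gram):
    """Count how often each word contributes the character n-gram `gram`.

    Assumption: every word is lower-cased and padded with '_' on both
    sides, so 'to_' matches the end of a word and '_ca' the start.
    """
    sources = Counter()
    size = len(gram)
    for word in tokens:
        padded = "_" + word.lower() + "_"
        hits = sum(
            1 for i in range(len(padded) - size + 1)
            if padded[i:i + size] == gram
        )
        if hits:
            sources[word.lower()] += hits
    return sources


# Invented example sentence.
tokens = "but how about the cat that can go to the house".split()

print(dict(words_behind_ngram(tokens, "ut_")))  # {'but': 1, 'about': 1}
print(dict(words_behind_ngram(tokens, "_ca")))  # {'cat': 1, 'can': 1}
```

An n-gram whose counter holds a single function word corresponds to Case 1, while one that mixes function and content words, as `_ca` does here, corresponds to the combination cases.
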
Back to Tsur and Rappoport
▸ How generalizable are their results to speech?
  – Classifier performs well on orthographically transcribed speech
▸ Have we determined what is driving this effect?
  – Appears to be more lexical than phonological

Conclusions
▸ Can obtain successful classification using simple orthographic transcription
  – No phonetically or morphologically tagged corpus appears to be necessary
▸ Main action areas are morphosyntax and lexical semantics
▸ Classifier’s statistical power derived from collapsing across related cases
  – Trigrams do this best

Thank you:
Tyler Kendall
Bei Yu
Ann Bradlow
Language Dynamics Lab at Northwestern University
Speech Communication Research Group at Northwestern University

References
Flege, J. E. 1987. The production of ‘new’ and ‘similar’ phones in a foreign language: evidence for the effect of equivalence classification. J. Phonetics 15, 47–65.
Flege, J. E. 1995. Second-language speech learning: theory, findings, and problems. In: Strange, W. (Ed.), Speech Perception and Linguistic Experience, Issues in Crosslinguistic research. York Press, Timonium, MD, 233–277.
Koppel, M., J. Schler, and K. Zigdon. 2005. Automatically Determining an Anonymous Author’s Native Language. In Intelligence and Security Informatics, 209–217. Berlin/Heidelberg: Springer.
Mack, M. 2003. The phonetic systems of bilinguals. In: Banich, M. T., Mack, M. (Eds.), Mind, Brain, and Language: Multidisciplinary Perspectives. Lawrence Erlbaum Press, Mahwah, NJ.
Mosteller, F. and Wallace, D. 1964. Inference and Disputed Authorship, Addison-Wesley, Reading.
Tsur, O. and A. Rappoport. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 6-16, Prague, Czech Republic, June 2007.
Van Engen, K., M. Baese-Berk, R. Baker, A. Choi, M. Kim, and A. Bradlow. In press. The Wildcat Corpus of Native- and Foreign-Accented English: Communicative efficiency across conversational dyads with varying language alignment profiles. Language and Speech.