Using character n‐grams to classify native language in a non ...aacl2009/PDFs/Vaughn... · The Wildcat Corpus of Native‐ and Foreign‐Accented English (from Northwestern University)

Using character n‐grams to classify native language in a non‐native English corpus of transcribed speech

Charlotte Vaughn, Janet Pierrehumbert, Hannah Rohde

Northwestern University

AACL 2009 | University of Alberta | October 10

Authorship attribution

▸  Use various components of writing (e.g. syntactic, stylistic, discourse‐level) to determine aspects of author’s identity
–  e.g. gender, emotional state, native language, actual identity

(Mosteller and Wallace, 1964; Koppel, Schler, and Zigdon, 2005)

Native language classification

▸  Examined English writing from the International Corpus of Learner English (ICLE)
–  Used subcorpora from 5 different native language backgrounds: Bulgarian, Czech, French, Russian, Spanish

▸  Divided each document into character n‐grams
–  e.g. ‘bigrams’ = ‘_b’, ‘bi’, ‘ig’, ‘gr’, ‘ra’, ‘am’, ‘ms’, and ‘s_’

▸  Used multi‐class support vector machine (SVM) to classify each document by native language of writer

(Tsur and Rappoport, 2007)
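The n‐gram division above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Tsur and Rappoport or in the present study: the function name `char_ngrams` is hypothetical, and it assumes each word is padded separately with ‘_’, so n‐grams spanning two words are not produced.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return the character n-grams of each word, with '_' marking word edges.

    Assumption: every word is padded on its own, so no n-gram crosses a
    word boundary. The slides do not say how boundaries were handled.
    """
    grams = []
    for word in text.lower().split():
        padded = "_" + word + "_"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# Reproduces the example on the slide.
print(char_ngrams("bigrams", 2))
# ['_b', 'bi', 'ig', 'gr', 'ra', 'am', 'ms', 's_']

# A document becomes a bag of n-gram frequencies.
trigram_counts = Counter(char_ngrams("just the first", 3))
```

Counting the n‐grams with `Counter` gives the kind of frequency profile a classifier can take as input.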

Findings

▸  Obtained 65.6% accuracy in identifying native language of the author based on character bigrams alone
–  Compared with 20% random baseline accuracy, 46.78% accuracy for character unigrams, and 59.67% for character trigrams

Interpretation

▸  Speculated that “use of L2 words is strongly influenced by L1 sounds and sound patterns” (p. 16)
bigrams ≈ diphones

▸  Language transfer evident on many levels
–  Effect of L1 on L2 pronunciation is widely attested (Flege, 1987, 1995; Mack, 2003)

▸  But, what if your L1 background doesn’t just affect how you say words in your L2, but what words you use in the first place?


Drawbacks and open questions from Tsur and Rappoport (2007)

▸  How generalizable are these results to speech?
–  Writing is a more conscious, deliberate process than speech
–  If this really is a phonological process, we might expect stronger effects in speech

▸  Used corpus uncontrolled for topic content
–  Did use tf‐idf measure to address possible content bias, but nonetheless a highly variable corpus

▸  What is driving this effect?
–  Little evidence offered for the L1‐driven phonological hypothesis

Goals of present study

▸  Extend methodology to naturalistic speech data

▸  Use semantically controlled corpus to minimize variability in topic or register

▸  Explore classifier input in order to pinpoint the source(s) of the effect

The corpus

▸  The Wildcat Corpus of Native‐ and Foreign‐Accented English (from Northwestern University)
–  Both scripted and spontaneous speech recordings
–  Orthographically transcribed
–  24 native English speakers & 52 non‐native English speakers:
   English (n=24), Korean (n=20), Mandarin Chinese (n=20), Indian (n=2), Spanish (n=2), Turkish (n=2), Italian (n=1), Iranian (n=1), Japanese (n=1), Macedonian (n=1), Russian (n=1), Thai (n=1)
–  Designed in part to examine communication between talkers of different language backgrounds

(Van Engen, Baese‐Berk, Baker, Choi, Kim, and Bradlow, in press)

Diapix task (Van Engen, Baese‐Berk, Baker, Choi, Kim, and Bradlow, in press)

Subcorpus details

                            English (n=24)   Korean (n=20)   Mandarin (n=20)   Total
Word tokens                 15,617           17,253          19,168            52,038
Word types                  981              927             915               1,461
Word type/token ratio       0.063            0.054           0.048
Unique character bigrams    402              382             378
Unique character trigrams   2,141            2,006           1,982

Space = _    Apostrophe = ’
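As a quick check on the table, the type/token ratios can be recomputed from the reported word counts. The sketch below uses only the figures on the slide; the helper names `type_token_ratio` and `count_types_and_tokens` are hypothetical, and whitespace tokenization is an assumption about how word counts would be obtained from a transcript.

```python
# (word types, word tokens) copied from the subcorpus table
subcorpora = {
    "English": (981, 15617),
    "Korean": (927, 17253),
    "Mandarin": (915, 19168),
}

def type_token_ratio(types, tokens):
    """Distinct word forms divided by running words."""
    return types / tokens

ratios = {
    name: round(type_token_ratio(types, tokens), 3)
    for name, (types, tokens) in subcorpora.items()
}
total_tokens = sum(tokens for _, tokens in subcorpora.values())

print(ratios)        # {'English': 0.063, 'Korean': 0.054, 'Mandarin': 0.048}
print(total_tokens)  # 52038

def count_types_and_tokens(transcript):
    """Assumed whitespace tokenization; returns (types, tokens)."""
    words = transcript.lower().split()
    return len(set(words)), len(words)
```

The token total is the sum of the three subcorpora, while the total of 1,461 types is smaller than the sum of the per‐language type counts, presumably because a type shared across subcorpora is counted once.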

Classifier

▸  k Nearest Neighbors (kNN)
–  k = number of neighbors
–  1 speaker = 1 document = 1 vector
•  Multidimensional vectors of frequencies represent either: all words, all bigrams, or all trigrams
–  Random 80% documents training, 20% testing

[Figure: a test document among Native English, Native Korean, and Native Mandarin documents in a space with axes /ab/, /bc/, /cd/; example vector (5,3,0); angle θ]
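The kNN setup can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the slides do not name the distance measure, so cosine similarity is assumed here, and the helper names (`cosine`, `knn_predict`, `split_80_20`) and the toy vectors over /ab/, /bc/, /cd/ are hypothetical.

```python
import math
import random
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two frequency vectors stored as Counters."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, test_vec, k):
    """train is a list of (Counter, label) pairs, one per speaker.

    Returns the majority label among the k most similar training documents.
    Ties go to the label met first among the nearest neighbours.
    """
    ranked = sorted(train, key=lambda pair: cosine(pair[0], test_vec), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def split_80_20(docs, seed=0):
    """Random 80% of documents for training, the remaining 20% for testing."""
    docs = docs[:]
    random.Random(seed).shuffle(docs)
    cut = int(0.8 * len(docs))
    return docs[:cut], docs[cut:]

# Toy vectors (invented counts, not corpus data).
train = [
    (Counter(ab=5, bc=3), "English"),
    (Counter(ab=4, bc=3), "English"),
    (Counter(cd=6, bc=1), "Korean"),
    (Counter(cd=5), "Korean"),
]
prediction = knn_predict(train, Counter(ab=5, bc=2), k=3)
```

With 1 speaker = 1 document, each `Counter` would hold that speaker's word, bigram, or trigram frequencies.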

Results (in percent correct)

k    Words    Bigrams    Trigrams
1    69.2     69.5       69.2
4    53.8     61.5       76.9
8    69.2     61.5       69.2

Little decrease in accuracy after removing most frequent words

What is doing the classifying?

▸  Pick out n‐grams that are:
–  maximally variant in frequency between language backgrounds
–  fairly frequent
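One way to operationalize these two criteria is sketched below. The slides do not give the actual scoring, so both choices here are assumptions: variation is measured as the range of an n‐gram's relative frequency across language backgrounds, and ‘fairly frequent’ is a minimum overall count (`min_total`). The function name and the toy counts are hypothetical.

```python
from collections import Counter

def discriminative_ngrams(group_counts, min_total=5, top=3):
    """group_counts maps a language background to a Counter of n-gram counts.

    Keeps n-grams whose overall count reaches min_total, then ranks them by
    the spread (max minus min) of their relative frequency across groups.
    """
    totals = {g: sum(c.values()) for g, c in group_counts.items()}
    vocab = set().union(*group_counts.values())
    scored = []
    for gram in vocab:
        overall = sum(c[gram] for c in group_counts.values())
        if overall < min_total:
            continue  # not "fairly frequent"
        rel = [group_counts[g][gram] / totals[g] for g in group_counts]
        scored.append((max(rel) - min(rel), gram))
    scored.sort(reverse=True)
    return [gram for _, gram in scored[:top]]

# Invented counts for illustration only.
groups = {
    "English": Counter({"n't": 10, "the": 10, "to_": 10}),
    "Korean": Counter({"n't": 1, "the": 12, "to_": 17, "xq_": 1}),
}
print(discriminative_ngrams(groups, min_total=5, top=2))
# ["n't", 'to_']
```

The rare n‐gram is filtered out before ranking, so a single stray occurrence cannot dominate the list.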

What is doing the classifying?

▸  Look for possible phonological effects
–  Maybe English speakers use words with difficult consonant clusters that non‐native speakers avoid?

st_

[Figure: words containing st_: just, first]

So what is doing the classifying?

▸  A number of things…

Case 1: Single function word

to_

N‐gram significant because of one single function word

Other examples: ut_ = ‘but’ and ‘about’; _wi and ll_ = ‘will’

[Figure: word containing to_: to]

Case 2: Single interjection

oh_

N‐gram significant because of one single interjection or discourse marker

Other examples: hm_ = ‘mhm’; yes = ‘yes’; no_ = ‘no’

[Figure: word containing oh_: oh]

Case 3: Single morpheme

n’t

N‐gram significant because of one single morpheme

[Figure: words containing n’t: don’t, doesn’t, didn’t, can’t]
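Tracing an n‐gram back to the words that produce it, as in the cases above, can be sketched like this. The function name `words_behind` and the sample sentence are hypothetical (the sentence is invented, not taken from the corpus), and words are assumed to be padded with ‘_’ at both edges.

```python
from collections import Counter

def words_behind(ngram, transcript):
    """Count the words whose '_'-padded form contains the given n-gram."""
    hits = Counter()
    for word in transcript.lower().split():
        if ngram in "_" + word + "_":
            hits[word] += 1
    return hits

# Invented sample sentence; plain ASCII apostrophes for simplicity.
sample = "i don't know it doesn't matter i just can't do it"
contributors = words_behind("n't", sample)
print(sorted(contributors))
# ["can't", "doesn't", "don't"]
```

A single trigram such as n’t pools counts from several different words, which is how one morpheme can end up carrying much of the classifier's signal.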

Combination of cases

_ho

Function and content words

Vocabulary items

[Figure: words containing _ho: how, holding, house, honey]

Combination of cases

_ca

Content and function words

[Figure: words containing _ca: cat, can, case, carrying]

Back to Tsur and Rappoport

▸  How generalizable are their results to speech?
–  Classifier performs well on orthographically transcribed speech

▸  Have we determined what is driving this effect?
–  Appears to be more lexical than phonological

Conclusions

▸  Can obtain successful classification using simple orthographic transcription
–  No phonetically or morphologically tagged corpus appears to be necessary

▸  Main action areas are morphosyntax and lexical semantics

▸  Classifier’s statistical power derived from collapsing across related cases
–  Trigrams do this best

Thank you:

Tyler Kendall

Bei Yu

Ann Bradlow

Language Dynamics Lab at Northwestern University

Speech Communication Research Group at Northwestern University

References

Flege, J. E., 1987. The production of ‘new’ and ‘similar’ phones in a foreign language: evidence for the effect of equivalence classification. J. Phonetics 15, 47–65.

Flege, J. E., 1995. Second‐language speech learning: theory, findings, and problems. In: Strange, W. (Ed.), Speech Perception and Linguistic Experience, Issues in Crosslinguistic research. York Press, Timonium, MD, 233–277.

Koppel, M., J. Schler, and K. Zigdon. 2005. Automatically Determining an Anonymous Author’s Native Language. In Intelligence and Security Informatics, 209–217. Berlin/Heidelberg: Springer.

Mack, M., 2003. The phonetic systems of bilinguals. In: Banich, M. T., Mack, M. (Eds.), Mind, Brain, and Language: Multidisciplinary Perspectives. Lawrence Erlbaum Press, Mahwah, NJ.

Mosteller, F. and Wallace, D. 1964. Inference and Disputed Authorship, Addison–Wesley, Reading.

Tsur, O. and A. Rappoport. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 6‐16, Prague, Czech Republic, June 2007.

Van Engen, K., M. Baese‐Berk, R. Baker, A. Choi, M. Kim, and A. Bradlow. In press. The Wildcat Corpus of Native‐ and Foreign‐Accented English: Communicative efficiency across conversational dyads with varying language alignment profiles. Language and Speech.
