Using character n‐grams to classify native language in a non‐native English corpus of transcribed speech
Charlotte Vaughn, Janet Pierrehumbert, Hannah Rohde
Northwestern University
AACL 2009 | University of Alberta | October 10


Authorship attribution

▸  Use various components of writing (e.g. syntactic, stylistic, discourse‐level) to determine aspects of author’s identity
  –  e.g. gender, emotional state, native language, actual identity

(Mosteller and Wallace, 1964; Koppel, Schler, and Zigdon, 2005)

Native language classification

▸  Examined English writing from the International Corpus of Learner English (ICLE)
  –  Used subcorpora from 5 different native language backgrounds: Bulgarian, Czech, French, Russian, Spanish

▸  Divided each document into character n‐grams
  –  e.g. ‘bigrams’ = ‘_b’, ‘bi’, ‘ig’, ‘gr’, ‘ra’, ‘am’, ‘ms’, and ‘s_’

▸  Used multi‐class support vector machine (SVM) to classify each document by native language of writer

(Tsur and Rappoport, 2007)
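
The n‐gram split above can be sketched in a few lines of Python. This is only an illustration: the function name `char_ngrams` and the choice to pad each text with a single `_` on both sides are assumptions, chosen so that the output matches the ‘bigrams’ example on the slide.

```python
def char_ngrams(text, n):
    """Return the character n-grams of text, writing spaces as '_'.

    Hypothetical helper: one '_' boundary marker is added on each side,
    and runs of whitespace inside the text collapse to a single '_'.
    """
    padded = "_" + "_".join(text.split()) + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("bigrams", 2))
# ['_b', 'bi', 'ig', 'gr', 'ra', 'am', 'ms', 's_']
```

The same function gives trigrams with `n=3`; n‐grams that contain an internal `_` span a word boundary.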

Findings

▸  Obtained 65.6% accuracy in identifying native language of the author based on character bigrams alone
  –  Compared with 20% random baseline accuracy, 46.78% accuracy for character unigrams, and 59.67% for character trigrams

(Tsur and Rappoport, 2007)

Interpretation

▸  Speculated that “use of L2 words is strongly influenced by L1 sounds and sound patterns” (p. 16)
  –  bigrams ≈ diphones

▸  Language transfer evident on many levels
  –  Effect of L1 on L2 pronunciation is widely attested (Flege, 1987, 1995; Mack, 2003)

▸  But, what if your L1 background doesn’t just affect how you say words in your L2, but what words you use in the first place?

(Tsur and Rappoport, 2007)

Drawbacks and open questions from Tsur and Rappoport (2007)

▸  How generalizable are these results to speech?
  –  Writing is a more conscious, deliberate process than speech
  –  If this really is a phonological process, we might expect stronger effects in speech

▸  Used corpus uncontrolled for topic content
  –  Did use tf‐idf measure to address possible content bias, but nonetheless a highly variable corpus

▸  What is driving this effect?
  –  Little evidence offered for the L1‐driven phonological hypothesis
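
For readers unfamiliar with the tf‐idf measure mentioned above, here is a minimal sketch of the textbook weighting (term frequency times log inverse document frequency). The slides do not say how Tsur and Rappoport applied it, so this should not be read as their procedure; the function name and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents (invented): 'the' occurs everywhere, 'sat' in one document only
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
w = tf_idf(docs)
```

A term that occurs in every document gets weight 0, while topic‐specific terms score highest, which is why the measure is a common way to flag content words.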

Goals of present study

▸  Extend methodology to naturalistic speech data

▸  Use semantically controlled corpus to minimize variability in topic or register

▸  Explore classifier input in order to pinpoint the source(s) of the effect

The corpus

▸  The Wildcat Corpus of Native‐ and Foreign‐Accented English (from Northwestern University)
  –  Both scripted and spontaneous speech recordings
  –  Orthographically transcribed
  –  24 native English speakers & 52 non‐native English speakers: English (n=24), Korean (n=20), Mandarin Chinese (n=20), Indian (n=2), Spanish (n=2), Turkish (n=2), Italian (n=1), Iranian (n=1), Japanese (n=1), Macedonian (n=1), Russian (n=1), Thai (n=1)
  –  Designed in part to examine communication between talkers of different language backgrounds

(Van Engen, Baese‐Berk, Baker, Choi, Kim, and Bradlow, in press)

Diapix task (Van Engen, Baese‐Berk, Baker, Choi, Kim, and Bradlow, in press)

Subcorpus details

                            English (n=24)   Korean (n=20)   Mandarin (n=20)   Total
Word tokens                 15,617           17,253          19,168            52,038
Word types                  981              927             915               1,461
Word type/token ratio       0.063            0.054           0.048
Unique character bigrams    402              382             378
Unique character trigrams   2,141            2,006           1,982

Space = _    Apostrophe = ‘
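
The type/token ratios in the table follow directly from the token and type counts. The short check below reproduces them, using only the figures given on the slide; the variable names are illustrative.

```python
# Counts copied from the subcorpus table
tokens = {"English": 15617, "Korean": 17253, "Mandarin": 19168}
types = {"English": 981, "Korean": 927, "Mandarin": 915}

# Type/token ratio = distinct word forms / running words, rounded as on the slide
ratios = {lang: round(types[lang] / tokens[lang], 3) for lang in tokens}
total_tokens = sum(tokens.values())

print(ratios)        # {'English': 0.063, 'Korean': 0.054, 'Mandarin': 0.048}
print(total_tokens)  # 52038
```

The total type count (1,461) is smaller than the sum of the per‐language type counts because many word types are shared across the three groups.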

Classifier

▸  k Nearest Neighbors (kNN)
  –  k = number of neighbors
  –  1 speaker = 1 document = 1 vector
    •  Multidimensional vectors of frequencies represent either: all words, all bigrams, or all trigrams
  –  Random 80% documents training, 20% testing

[Figure: a “Test” document shown as the frequency vector (5,3,0) on axes /ab/, /bc/, /cd/, together with Native English, Native Korean, and Native Mandarin vectors and an angle θ]
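
A minimal sketch of the setup described on this slide: each speaker is one frequency vector, documents are split 80/20 at random, and a test vector receives the majority label of its k nearest training vectors. The slide does not state the distance measure; cosine similarity is an assumption here (the θ in the figure hints at an angle, but that is a guess). All function names and the toy vectors are invented for illustration.

```python
import math
import random
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(u[key] * v.get(key, 0) for key in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(train, query, k):
    """train: list of (vector, label). Majority label among the k most similar."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def split_80_20(docs, seed=0):
    """Shuffle the documents and return (training 80%, testing 20%)."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    cut = int(round(0.8 * len(docs)))
    return docs[:cut], docs[cut:]


# Toy vectors over the axes /ab/, /bc/, /cd/ (invented counts, not corpus data)
train = [
    ({"ab": 5, "bc": 3}, "English"),
    ({"ab": 4, "bc": 4}, "English"),
    ({"bc": 1, "cd": 6}, "Korean"),
    ({"ab": 1, "cd": 5}, "Korean"),
]
query = {"ab": 5, "bc": 3, "cd": 0}
print(knn_predict(train, query, k=1))  # English
```

With 64 documents (24 + 20 + 20 speakers), this particular rounding yields 51 training and 13 test documents; the slides do not state the exact counts.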

Results (in percent correct)

k    Words    Bigrams    Trigrams
1    69.2     69.5       69.2
4    53.8     61.5       76.9
8    69.2     61.5       69.2

Little decrease in accuracy after removing most frequent words

What is doing the classifying?

▸  Pick out n‐grams that are:
  –  maximally variant in frequency between language backgrounds
  –  fairly frequent
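
One way to operationalize the two criteria above is sketched below: score each n‐gram by the variance of its relative frequency across language groups, and keep only n‐grams above a minimum total count. The slides do not say how the selection was actually done, so the variance score, the threshold `min_total`, and the function name are all assumptions; the counts are invented.

```python
from collections import Counter
from statistics import pvariance


def discriminative_ngrams(group_counts, min_total=5, top=10):
    """group_counts: {language: Counter of n-gram counts}.

    Returns up to `top` n-grams, most variable across groups first.
    """
    # Relative frequency of each n-gram within each language group
    rel = {}
    for lang, counts in group_counts.items():
        size = sum(counts.values())
        rel[lang] = {gram: c / size for gram, c in counts.items()}
    # Total count over all groups, used for the "fairly frequent" filter
    totals = Counter()
    for counts in group_counts.values():
        totals.update(counts)
    scored = [
        (pvariance([rel[lang].get(gram, 0.0) for lang in rel]), gram)
        for gram, total in totals.items()
        if total >= min_total
    ]
    return [gram for _, gram in sorted(scored, reverse=True)[:top]]


# Invented toy counts (not corpus data)
group_counts = {
    "English": Counter({"ab_": 10, "cd_": 40, "ef_": 50, "xq_": 1}),
    "Korean": Counter({"ab_": 30, "cd_": 10, "ef_": 60}),
    "Mandarin": Counter({"ab_": 35, "cd_": 12, "ef_": 53}),
}
print(discriminative_ngrams(group_counts))  # ['cd_', 'ab_', 'ef_']
```

The rare n‐gram `xq_` is dropped by the frequency filter even though it occurs in only one group.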

What is doing the classifying?

▸  Look for possible phonological effects
  –  Maybe English speakers use words with difficult consonant clusters that non‐native speakers avoid?

st_

[Figure: the trigram ‘st_’ with the words that contain it: ‘just’, ‘first’]

So what is doing the classifying?

▸  A number of things…

Case 1: Single function word

to_

N‐gram significant because of one single function word

Other examples: ut_ = ‘but’ and ‘about’; _wi and ll_ = ‘will’

[Figure: the trigram ‘to_’ with the word that contains it: ‘to’]

Case 2: Single interjection

oh_

N‐gram significant because of one single interjection or discourse marker

Other examples: hm_ = ‘mhm’; yes = ‘yes’; no_ = ‘no’

[Figure: the trigram ‘oh_’ with the word that contains it: ‘oh’]

Case 3: Single morpheme

n’t

N‐gram significant because of one single morpheme

[Figure: the trigram ‘n’t’ with the words that contain it: ‘don’t’, ‘doesn’t’, ‘didn’t’, ‘can’t’]

Combination of cases

_ho

Function and content words

Vocabulary items

[Figure: the trigram ‘_ho’ with the words that contain it: ‘how’, ‘holding’, ‘house’, ‘honey’]

Combination of cases

_ca

Content and function words

[Figure: the trigram ‘_ca’ with the words that contain it: ‘cat’, ‘can’, ‘case’, ‘carrying’]
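
The case studies above all trace an n‐gram back to the words that produce it. A minimal sketch of that lookup follows; the helper name `words_behind` and the example sentence are invented, and the helper only handles n‐grams that fall within a single word plus its `_` boundaries.

```python
from collections import Counter


def words_behind(ngram, tokens):
    """Count the words whose padded form ('_word_') contains the n-gram.

    N-grams with an internal '_' span two words and are not covered here.
    """
    return Counter(word for word in tokens if ngram in "_" + word + "_")


# Invented example sentence (not from the corpus)
tokens = "how can the cat get to the house first".split()

print(words_behind("_ho", tokens))  # how and house, once each
print(words_behind("_ca", tokens))  # can and cat, once each
```

Run per language group, a count like this shows whether an n‐gram reflects one function word, one morpheme, or several unrelated vocabulary items.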

Back to Tsur and Rappoport

▸  How generalizable are their results to speech?
  –  Classifier performs well on orthographically transcribed speech

▸  Have we determined what is driving this effect?
  –  Appears to be more lexical than phonological

Conclusions

▸  Can obtain successful classification using simple orthographic transcription
  –  No phonetically or morphologically tagged corpus appears to be necessary

▸  Main action areas are morphosyntax and lexical semantics

▸  Classifier’s statistical power derived from collapsing across related cases
  –  Trigrams do this best

Thank you:

Tyler Kendall

Bei Yu

Ann Bradlow

Language Dynamics Lab at Northwestern University

Speech Communication Research Group at Northwestern University

References

Flege, J. E., 1987. The production of ‘new’ and ‘similar’ phones in a foreign language: evidence for the effect of equivalence classification. J. Phonetics 15, 47–65.

Flege, J. E., 1995. Second‐language speech learning: theory, findings, and problems. In: Strange, W. (Ed.), Speech Perception and Linguistic Experience, Issues in Crosslinguistic research. York Press, Timonium, MD, 233–277.

Koppel, M., J. Schler, and K. Zigdon. 2005. Automatically Determining an Anonymous Author’s Native Language. In Intelligence and Security Informatics, 209–217. Berlin/Heidelberg: Springer.

Mack, M., 2003. The phonetic systems of bilinguals. In: Banich, M. T., Mack, M. (Eds.), Mind, Brain, and Language: Multidisciplinary Perspectives. Lawrence Erlbaum Press, Mahwah, NJ.

Mosteller, F. and Wallace, D. 1964. Inference and Disputed Authorship, Addison–Wesley, Reading.

Tsur, O. and A. Rappoport. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 6‐16, Prague, Czech Republic, June 2007.

Van Engen, K., M. Baese‐Berk, R. Baker, A. Choi, M. Kim, and A. Bradlow. In press. The Wildcat Corpus of Native‐ and Foreign‐Accented English: Communicative efficiency across conversational dyads with varying language alignment profiles. Language and Speech.