Algorithms for NLP - Carnegie Mellon Universitytbergkir/11711fa17/FA17 11-711 lecture 6... · 2017. 9. 14. · Figure thanks to Jennifer Venditti Dental: th/dh Alveolar: t/d/s/z/l/n

AcousticModelsTaylorBerg-Kirkpatrick– CMU

Slides:DanKlein– UCBerkeley

AlgorithmsforNLP

SpeechSignals

n Frequencygivespitch;amplitudegivesvolume

n Frequenciesateachtimesliceprocessedintoobservationvectors

s p ee ch l a b

ampl

itude

SpeechinaSlide

……………………………………………..x12x13x12x14x14………..

Articulation

TextfromOhala,Sept2001,fromSharonRoseslide

Sagittal sectionofthevocaltract(Techmer 1880)

Nasalcavity

Pharynx

Vocalfolds(inthelarynx)

Trachea

Lungs

ArticulatorySystem

Oralcavity

SpaceofPhonemes

§ Standardinternationalphoneticalphabet(IPA)chartofconsonants

Place

PlacesofArticulation

labial

dentalalveolar post-alveolar/palatal

velaruvular

pharyngeal

laryngeal/glottal

FigurethankstoJenniferVenditti

Labialplace

bilabial

labiodental


Bilabial:p,b,m

Labiodental:f,v

Coronalplace

dentalalveolar post-alveolar/palatal


Dental:th/dh

Alveolar:t/d/s/z/l/n

Post:sh/zh/y

DorsalPlace

velaruvular

pharyngeal


Velar:k/g/ng

SpaceofPhonemes


Manner

MannerofArticulation§ Inadditiontovaryingbyplace,soundsvaryby

manner

§ Stop:completeclosureofarticulators,noairescapesviamouth§ Oralstop:palateisraised(p,t,k,b,d,g)§ Nasalstop:oralclosure,butpalateislowered(m,

n,ng)

§ Fricatives:substantialclosure,turbulent:(f,v,s,z)

§ Approximants:slightclosure,sonorant:(l,r,w)

§ Vowels:noclosure,sonorant:(i,e,a)

SpaceofPhonemes


Vowels

VowelSpace

Acoustics

“Shejusthadababy”

§ Whatcanwelearnfromawavefile?§ Nogapsbetweenwords(!)§ Vowelsarevoiced,long,loud§ Lengthintime=lengthinspaceinwaveformpicture§ Voicing:regularpeaksinamplitude§ Whenstopsclosed:nopeaks,silence§ Peaks=voicing:.46to.58(vowel[iy],fromsecond.65to.74(vowel[ax])andsoon

§ Silenceofstopclosure(1.06to1.08forfirst[b],or1.26to1.28forsecond[b])

§ Fricativeslike[sh]:intenseirregularpattern;see.33to.46

Time-DomainInformation

bad

pad

spat

pat

ExamplefromLadefoged

SimplePeriodicWavesofSound

Time (s)0 0.02

œ0.99

0.99

0

• Y axis: Amplitude = amount of air pressure at that point in time• Zero is normal air pressure, negative is rarefaction

• X axis: Time.• Frequency = number of cycles per second.• 20 cycles in .02 seconds = 1000 cycles/second = 1000 Hz

ComplexWaves:100Hz+1000Hz

Time (s)0 0.05

œ0.9654

0.99

0

Ampl

itude

Spectrum

100 1000Frequency in Hz

Coe

ffici

ent

Frequency components (100 and 1000 Hz) on x-axis

Partof[ae]waveformfrom“had”

§ Notecomplexwaverepeatingninetimesinfigure§ Plussmallerwaveswhichrepeats4timesforeverylarge

pattern§ Largewavehasfrequencyof250Hz(9timesin.036seconds)§ Smallwaveroughly4timesthis,orroughly1000Hz§ Twolittletinywavesontopofpeakof1000Hzwaves

Ampl

itude

Time

SpectrumofanActualSpeech

Frequency (Hz)0 5000

0

20

40

Coe

ffici

ent

Spectrogramsam

pl

time

slice


0

20

40

freq

coeff

FFT

time

ampl

Spectrograms

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

Fre

qu

en

cy (H

z)

05

00

0

0

20

40

time

ampl

Spectrogramsfre

q

time

time

ampl

TypesofGraphsfre

q

time

time

ampl

ampl

time


0

20

40

freq

coeff

BacktoSpectra§ Spectrumrepresentsthesefreqcomponents§ ComputedbyFouriertransform,algorithmwhichseparates

outeachfrequencycomponentofwave.

§ x-axisshowsfrequency,y-axisshowsmagnitude(indecibels,alogmeasureofamplitude)

§ Peaksat930Hz,1860Hz,and3020Hz.

Source/Filter

WhythesePeaks?

§ Articulationprocess:§ Thevocalcordvibrations

createharmonics§ Themouthisanamplifier§ Dependingonshapeof

mouth,someharmonicsareamplifiedmorethanothers

Figures from Ratree Wayland

A3

A4

A2

C4 (middle C)

C3

F#3

F#2

Vowel[i]atincreasingpitches

ResonancesoftheVocalTract

§ Thehumanvocaltractasanopentube:

§ Airinatubeofagivenlengthwilltendtovibrateatresonancefrequencyoftube.

§ Constraint:Pressuredifferentialshouldbemaximalat(closed)glottalendandminimalat(open)lipend.

Closedend Openend

Length17.5cm.

Figure from W. Barry

FromSundberg

Computingthe3FormantsofSchwa

§ LetthelengthofthetubebeL§ F1 =c/l1 =c/(4L)=35,000/4*17.5=500Hz§ F2 =c/l2 =c/(4/3L)=3c/4L=3*35,000/4*17.5=1500Hz§ F3 =c/l3 =c/(4/5L)=5c/4L=5*35,000/4*17.5=2500Hz

§ Soweexpectaneutralvoweltohave3resonancesat500,1500,and2500Hz

§ Thesevowelresonancesarecalledformants

FromMarkLiberman

SeeingFormants:theSpectrogram

VowelSpace

SeeingFormants:theSpectrogram

AmericanEnglishVowelSpace

FRONT BACK

HIGH

LOW

iy

ih

eh

ae aa

ao

uw

uh

ahax

ix ux

Figures from Jennifer Venditti, H. T. Bunnell

Spectrograms

HowtoReadSpectrograms

§ [bab]:closureoflipslowersallformants:sorapidincreaseinallformantsatbeginningof"bab”

§ [dad]:firstformantincreases,butF2andF3slightfall§ [gag]:F2andF3cometogether:thisisacharacteristicof

velars.Formanttransitionstakelongerinvelarsthaninalveolars orlabials

From Ladefoged “A Course in Phonetics”

“Shecamebackandstartedagain”

1.lotsofhigh-freqenergy3.closurefork4.burstofaspirationfork5.ey vowel;faint1100Hzformantisnasalization6.bilabialnasal7.shortbclosure,voicingbarelyvisible.8.ae;noteupwardtransitionsafterbilabialstopatbeginning9.noteF2andF3comingtogetherfor"k"

FromLadefoged “ACourseinPhonetics”

DialectIssues

§ Speechvariesfromdialecttodialect(examplesareAmericanvs.BritishEnglish)§ Syntactic(“Icould”vs.“Icould

do”)§ Lexical(“elevator”vs.“lift”)§ Phonological§ Phonetic

§ Mismatchbetweentrainingandtestingdialectscancausealargeincreaseinerrorrate

American British

all

old

SpeechRecognition

TheNoisyChannelModel

Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions

Language model: Distributions over sequences

of words (sentences)

SpeechModel

w1 w2Words

s1 s2 s3 s4 s5 s6 s7Soundtypes

a1 a2 a3 a4 a5 a6 a7Acousticobservations

Languagemodel

Acousticmodel

AcousticModel

s1 s2 s3 s4 s5 s6 s7Soundtypes

a1 a2 a3 a4 a5 a6 a7Acousticobservations

Acousticmodel

Frame Extraction

§ A frame (25 ms wide) extracted every 10 ms

25 ms

10ms

a1 a2 a3

Figure:SimonArnfield

Previewoffeatureextractionforeachframe:1) DFT(Spectrum)2) Log(Calibrate?)3) anotherDFT(!!??)

FeatureExtraction

DigitizingSpeech

Figure:BryanPellom

Source/Filter

§ Articulationprocess:§ Thevocalcordvibrations

createharmonics§ Themouthisanamplifier§ Dependingonshapeof

mouth,someharmonicsareamplifiedmorethanothers

Figures from Ratree Wayland

ProblemwithRawSpectrum

Deconvolution /Liftering

Deconvolution /Lifterings

e f

�

Deconvolution /Lifterings

e f

log

⇣

log

⇣log

⇣ ⌘

⌘

⌘+

Deconvolution /Liftering

GraphsfromDanEllis

s = e � f

log(s) = log(e) + log(f)

IDFT(log(s))

MelFreq.Cepstral Coefficients

§ DoFFTtogetspectralinformation§ Likethespectrogramwesawearlier

§ ApplyMelscaling(New)§ Modelshumanear;moresensitivity

inlowerfreqs§ Approx linearbelow1kHz,logabove,

equalsamplesaboveandbelow1kHz

§ TakeLog§ Dodiscretecosinetransform

[Graph:Wikipedia]

FinalFeatureVector

§ 39(real)featuresper10msframe:§ 12MFCCfeatures§ 12deltaMFCCfeatures§ 12delta-deltaMFCCfeatures§ 1(log)frameenergy§ 1delta(log)frameenergy§ 1delta-delta(logframeenergy)

§ Soeachframeisrepresentedbya39Dvector

EmissionModel

HMMsforContinuousObservations

§ Before:discretesetofobservations

§ Now:featurevectorsarereal-valued

§ Solution1:discretization§ Solution2:continuousemissions

§ Gaussians§ MultivariateGaussians§ MixturesofmultivariateGaussians

§ Astateisprogressively§ Contextindependentsubphone (~3per

phone)§ Contextdependentphone(triphones)§ StatetyingofCDphone

VectorQuantization

§ Idea:discretization§ MapMFCCvectorsonto

discretesymbols§ Computeprobabilities

justbycounting

§ ThisiscalledvectorquantizationorVQ

§ NotusedforASRanymore

§ But:usefultoconsiderasastartingpoint

GaussianEmissions§ VQisinsufficientfortop-

qualityASR§ Hardtocoverhigh-

dimensionalspacewithcodebook

§ Movesambiguityfromthemodeltothepreprocessing

§ Instead:assumethepossiblevaluesoftheobservationvectorsarenormallydistributed.§ Representtheobservation

likelihoodfunctionasaGaussian?

From bartus.org/akustyk

GaussiansforAcousticModeling

§ P(x):

P(x)

x

P(x) is highest here at mean

P(x) is low here, far from mean

A Gaussian is parameterized by a mean and a variance:

MultivariateGaussians§ Insteadofasinglemeanµ andvariances2:

§ Vectorofmeansµ andcovariancematrixS

§ Usuallyassumediagonalcovariance(!)§ Thisisn’tverytrueforFFTfeatures,butislessbadforMFCCfeatures

Gaussians:SizeofS

§ µ =[00] µ =[00] µ =[00]§ S =I S =0.6I S =2I§ AsS becomeslarger,Gaussianbecomesmorespreadout;asS becomessmaller,Gaussianmorecompressed

TextandfiguresfromAndrewNg

Gaussians:ShapeofS

§ Asweincreasetheoffdiagonalentries,morecorrelationbetweenvalueofxandvalueofy

TextandfiguresfromAndrewNg

Butwe’renotthereyet

§ SingleGaussiansmaydoabadjobofmodelingacomplexdistributioninanydimension

§ Evenworsefordiagonalcovariances

§ Solution:mixturesofGaussians

From openlearn.open.ac.uk

MixturesofGaussians§ MixturesofGaussians:

Fromrobots.ox.ac.uk http://www.itee.uq.edu.au/~comp4702

GMMs§ Summary:eachstatehasanemission

distributionP(x|s)(likelihoodfunction)parameterizedby:§ Mmixtureweights§ MmeanvectorsofdimensionalityD§ EitherM covariancematricesofDxD orM

Dx1diagonalvariancevectors

§ Likesoftvectorquantizationafterall§ Thinkofthemixturemeansasbeing

learnedcodebookentries§ ThinkoftheGaussiandensitiesasa

learnedcodebookdistancefunction§ ThinkofthemixtureofGaussianslikea

multinomialovercodes§ (EvenmoretruegivensharedGaussian

inventories,cf nextweek)

StateModel

StateTransitionDiagrams§ BayesNet:HMMasaGraphicalModel

§ StateTransitionDiagram:MarkovModelasaWeightedFSA

w w w

x x x

the cat chased

doghas

ASRLexicon

Figure:J&M

LexicalStateStructure

Figure:J&M

AddinganLM

FigurefromHuangetalpage618

StateSpace§ Statespacemustinclude

§ Currentword(|V|onorderof20K+)§ Indexwithincurrentword(|L|onorderof5)§ E.g.(lec[t]ure)(thoughnotinorthography!)

§ Acousticprobabilitiesonlydependonphonetype§ E.g.P(x|lec[t]ure)=P(x|t)

§ Fromastatesequence,canreadawordsequence

StateRefinement

PhonesAren’tHomogeneous

NeedtoUseSubphones

Figure:J&M

AWordwithSubphones

Figure:J&M

Modelingphoneticcontext

wiyriymiyniy

“Need”withtriphonemodels

Figure:J&M

LotsofTriphones

§ Possibletriphones:50x50x50=125,000

§ Howmanytriphonetypesactuallyoccur?

§ 20KwordWSJTask(fromBryanPellom)§ Wordinternalmodels:need14,300triphones§ Crosswordmodels:need54,400triphones

§ Needtogeneralizemodels,tietriphones

StateTying/Clustering

§ [Young,Odell,Woodland1994]

§ Howdowedecidewhichtriphonestoclustertogether?

§ Usephoneticfeatures (or‘broadphoneticclasses’)§ Stop§ Nasal§ Fricative§ Sibilant§ Vowel§ lateral

Figure:J&M

StateSpace§ Statespacenowincludes

§ Currentword:|W|isorder20K§ Indexincurrentword:|L|isorder5§ Subphone position:3§ E.g.(lec[t-mid]ure)

§ Acousticmodeldependsonclusteredphonecontext§ Butthisdoesn’tgrowthestatespace

§ But,addingtheLMcontextfortrigram+does§ (afterthe,lec[t-mid]ure)§ Thisisarealproblemfordecoding

Decoding

InferenceTasks

Mostlikelywordsequence:d- ae- d

Mostlikelystatesequence:d1-d6-d6-d4-ae5-ae2-ae3-ae0-d2-d2-d3-d7-d5

ViterbiDecoding

Figure:EnriqueBenimeli

ViterbiDecoding

Figure:EnriqueBenimeli

EmissionCaching§ Problem:scoringalltheP(x|s)valuesistooslow§ Idea:manystatessharetiedemissionmodels,socachethem

PrefixTrie Encodings§ Problem:manypartial-wordstatesareindistinguishable§ Solution:encodewordproductionasaprefixtrie (with

pushedweights)

§ AspecificinstanceofminimizingweightedFSAs[Mohri,94]Figure:Aubert,02

n i d

n i t

n o t

d

ni

t

o t

0.04

0.02

0.01

0.04

0.25

0.5

11

1

BeamSearch§ Problem:trellisistoobigtocomputev(s)vectors§ Idea:moststatesareterrible,keepv(s)onlyfortopstatesat

eachtime

§ Important:stilldynamicprogramming;collapseequiv states

theb.

them.

andthen.

atthen.

theba.thebe.thebi.

thema.theme.themi.

thena.thene.theni.

theba.

thebe.

thema.

thena.

LMFactoring§ Problem:Higher-ordern-gramsexplodethestatespace§ (One)Solution:

§ Factorstatespaceinto(wordindex,lmhistory)§ Scoreunigramprefixcostswhileinsideaword§ Subtractunigramcostandaddtrigramcostoncewordiscomplete

d

ni

t

o t

0.04

0.25

0.5

11

1

the

LMReweighting§ Noisychannelsuggests

§ Inpractice,wanttoboostLM

§ Also,goodtohavea“wordbonus”tooffsetLMcosts

§ Thesearebothconsequencesofbrokenindependenceassumptionsinthemodel

Training

TrainingMixtureModels§ Input:wavfileswithunalignedtranscriptions

§ Forcedalignment§ Computingthe“Viterbipath”overthetrainingdata(wherethe

transcriptionisknown)iscalled“forcedalignment”§ Weknowwhichwordstringtoassigntoeachobservationsequence.§ Wejustdon’tknowthestatesequence.§ Soweconstrainthepathtogothroughthecorrectwords(byusinga

specialexample-specificlanguagemodel)§ AndotherwiseruntheViterbialgorithm

§ Result:alignedstatesequence

StateTying

§ CreatingCDphones:§ Startwithmonophone,doEM

training§ CloneGaussiansintotriphones§ Builddecisiontreeandcluster

Gaussians§ Cloneandtrainmixtures

(GMMs)

§ Generalidea:§ Introducecomplexitygradually§ Interleaveconstraintwith

flexibility

Standardsubphone/mixtureHMM

Temporal Structure

GaussianMixtures

Model Error rateHMM Baseline 25.1%

AnInducedModel

Standard Model

Single Gaussians

Fully Connected

[Petrov, Pauls, and Klein, 07]

HierarchicalSplitTrainingwithEM

32.1%

28.7%

25.6%

HMM Baseline 25.1%5 Split rounds 21.4%

23.9%

Refinementofthe/ih/-phone



0

5

10

15

20

25

30

35

ae

ao

ay

eh

er

ey

ih f r s sil

aa

ah

ix

iy z cl k sh n

vcl

ow l m t v

uw

aw

ax

ch

w

th

el

dh

uh p en

oy

hh

jh

ng y b d dx g zh

epi

HMMstatesperphone

Documents

Algorithms for NLP - Carnegie Mellon Universitytbergkir/11711fa17/FA17 11-711 lecture 6... · 2017. 9. 14. · Figure thanks to Jennifer Venditti Dental: th/dh Alveolar: t/d/s/z/l/n