Acoustic Models
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
Algorithms for NLP
Speech Signals
§ Frequency gives pitch; amplitude gives volume
§ Frequencies at each time slice processed into observation vectors
[Figure: waveform (amplitude over time) of "speech lab", labeled s p ee ch l a b]
Speech in a Slide
… x12 x13 x12 x14 x14 …
Articulation
Text from Ohala, Sept 2001, from Sharon Rose slide
Sagittal section of the vocal tract (Techmer 1880)
Articulatory System
[Figure labels: nasal cavity, oral cavity, pharynx, vocal folds (in the larynx), trachea, lungs]
Space of Phonemes
§ Standard international phonetic alphabet (IPA) chart of consonants
Place
Places of Articulation
[Figure labels: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]
Figure thanks to Jennifer Venditti
Labial place
[Figure labels: bilabial, labiodental]
Figure thanks to Jennifer Venditti
Bilabial: p, b, m
Labiodental: f, v
Coronal place
[Figure labels: dental, alveolar, post-alveolar/palatal]
Figure thanks to Jennifer Venditti
Dental: th/dh
Alveolar: t/d/s/z/l/n
Post: sh/zh/y
Dorsal Place
[Figure labels: velar, uvular, pharyngeal]
Figure thanks to Jennifer Venditti
Velar: k/g/ng
Space of Phonemes
§ Standard international phonetic alphabet (IPA) chart of consonants
Manner
Manner of Articulation
§ In addition to varying by place, sounds vary by manner
§ Stop: complete closure of articulators, no air escapes via mouth
  § Oral stop: palate is raised (p, t, k, b, d, g)
  § Nasal stop: oral closure, but palate is lowered (m, n, ng)
§ Fricatives: substantial closure, turbulent: (f, v, s, z)
§ Approximants: slight closure, sonorant: (l, r, w)
§ Vowels: no closure, sonorant: (i, e, a)
Space of Phonemes
§ Standard international phonetic alphabet (IPA) chart of consonants
Vowels
Vowel Space
Acoustics
“She just had a baby”
§ What can we learn from a wave file?
  § No gaps between words (!)
  § Vowels are voiced, long, loud
  § Length in time = length in space in waveform picture
  § Voicing: regular peaks in amplitude
  § When stops closed: no peaks, silence
  § Peaks = voicing: .46 to .58 (vowel [iy]), from second .65 to .74 (vowel [ax]), and so on
  § Silence of stop closure (1.06 to 1.08 for first [b], or 1.26 to 1.28 for second [b])
  § Fricatives like [sh]: intense irregular pattern; see .33 to .46
Time-Domain Information
[Figure: waveforms of "bad", "pad", "spat", "pat"]
Example from Ladefoged
Simple Periodic Waves of Sound
[Figure: sine wave; x-axis Time (s) from 0 to 0.02, y-axis amplitude from -0.99 to 0.99]
• Y axis: Amplitude = amount of air pressure at that point in time
  • Zero is normal air pressure, negative is rarefaction
• X axis: Time
• Frequency = number of cycles per second
  • 20 cycles in .02 seconds = 1000 cycles/second = 1000 Hz
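Complex Waves: 100 Hz + 1000 Hz
[Figure: waveform; x-axis Time (s) from 0 to 0.05, y-axis amplitude from -0.9654 to 0.99]
Spectrum
[Figure: spectrum; x-axis Frequency in Hz with marks at 100 and 1000, y-axis coefficient]
Frequency components (100 and 1000 Hz) on x-axis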
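As a rough illustration of the spectrum on this slide, the sketch below builds a 100 Hz + 1000 Hz wave and finds its two strongest frequency components with a naive discrete Fourier transform. The 8000 Hz sample rate, the equal amplitudes of the two components, and the helper name `dft_magnitudes` are assumptions made for the example, not values from the slide.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive O(N^2) discrete Fourier transform; returns |X[k]| for k < N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags

SAMPLE_RATE = 8000    # Hz; assumed for illustration
DURATION = 0.05       # seconds, as on the x-axis of the figure
n_samples = round(SAMPLE_RATE * DURATION)   # 400 samples, so 20 Hz per bin

# Sum of two sine waves with equal (assumed) amplitude.
wave = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
        + math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        for t in range(n_samples)]

mags = dft_magnitudes(wave)
bin_hz = SAMPLE_RATE / n_samples
# Indices of the two largest coefficients, reported in ascending frequency.
top_two = sorted(sorted(range(len(mags)), key=lambda k: mags[k])[-2:])
peak_freqs = [k * bin_hz for k in top_two]
print(peak_freqs)   # [100.0, 1000.0]
```

A real front end would use an FFT rather than this quadratic loop; the naive version is only meant to make the definition visible.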
Part of [ae] waveform from “had”
§ Note complex wave repeating nine times in figure
§ Plus a smaller wave which repeats 4 times for every large pattern
§ Large wave has frequency of 250 Hz (9 times in .036 seconds)
§ Small wave roughly 4 times this, or roughly 1000 Hz
§ Two little tiny waves on top of peak of 1000 Hz waves
[Figure: waveform; amplitude vs. time]
Spectrum of Actual Speech
[Figure: spectrum; x-axis Frequency (Hz) from 0 to 5000, y-axis coefficient with ticks at 0, 20, 40]
Spectrograms
[Figure: a time slice of the waveform (amplitude vs. time) is passed through an FFT to give a spectrum (coefficient vs. frequency, 0 to 5000 Hz); the spectra of successive slices are laid side by side over time to form the spectrogram (frequency vs. time)]
Types of Graphs
[Figure: waveform (amplitude vs. time), spectrum (coefficient vs. frequency, 0 to 5000 Hz), and spectrogram (frequency vs. time)]
Back to Spectra
§ Spectrum represents these freq components
§ Computed by Fourier transform, algorithm which separates out each frequency component of wave.
§ x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
§ Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
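To make "a log measure of amplitude" concrete, here is a minimal sketch of the usual amplitude-to-decibel conversion, 20 · log10(magnitude / reference). The reference value of 1.0 and the function name `to_decibels` are assumptions for illustration; the slides do not say which reference the plotted spectrum uses.

```python
import math

def to_decibels(magnitude, reference=1.0):
    """Convert a linear amplitude magnitude to decibels relative to `reference`."""
    return 20.0 * math.log10(magnitude / reference)

# Each tenfold increase in amplitude adds 20 dB.
for magnitude in (1.0, 10.0, 100.0):
    print(magnitude, "->", to_decibels(magnitude), "dB")
```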
Source/Filter
Why these Peaks?
§ Articulation process:
  § The vocal cord vibrations create harmonics
  § The mouth is an amplifier
  § Depending on shape of mouth, some harmonics are amplified more than others
Figures from Ratree Wayland
Vowel [i] at increasing pitches
[Figure: spectra of [i] at pitches F#2, A2, C3, F#3, A3, C4 (middle C), A4]
Resonances of the Vocal Tract
§ The human vocal tract as an open tube:
[Figure: tube with a closed end and an open end, length 17.5 cm]
§ Air in a tube of a given length will tend to vibrate at resonance frequency of tube.
§ Constraint: Pressure differential should be maximal at (closed) glottal end and minimal at (open) lip end.
Figure from W. Barry
From Sundberg
Computing the 3 Formants of Schwa
§ Let the length of the tube be L
  § F1 = c/λ1 = c/(4L) = 35,000/(4 × 17.5) = 500 Hz
  § F2 = c/λ2 = c/((4/3)L) = 3c/(4L) = 3 × 35,000/(4 × 17.5) = 1500 Hz
  § F3 = c/λ3 = c/((4/5)L) = 5c/(4L) = 5 × 35,000/(4 × 17.5) = 2500 Hz
§ So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz
§ These vowel resonances are called formants
From Mark Liberman
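The three resonances follow one pattern, F_k = (2k - 1) · c / (4L). A small sketch of that arithmetic, using the slide's values of c = 35,000 cm/s and L = 17.5 cm; the function name `tube_resonances` and the second tube length are illustrative assumptions.

```python
SPEED_OF_SOUND_CM_PER_S = 35_000   # value used on the slide

def tube_resonances(length_cm, count=3, speed=SPEED_OF_SOUND_CM_PER_S):
    """Resonances of a tube closed at one end: F_k = (2k - 1) * c / (4L)."""
    return [(2 * k - 1) * speed / (4 * length_cm) for k in range(1, count + 1)]

print(tube_resonances(17.5))   # [500.0, 1500.0, 2500.0]
print(tube_resonances(15.0))   # a shorter (hypothetical) tube resonates higher
```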
Seeing Formants: the Spectrogram
Vowel Space
Seeing Formants: the Spectrogram
American English Vowel Space
[Figure: vowel chart running FRONT to BACK and HIGH to LOW; vowels shown: iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux]
Figures from Jennifer Venditti, H. T. Bunnell
Spectrograms
How to Read Spectrograms
§ [bab]: closure of lips lowers all formants: so rapid increase in all formants at beginning of “bab”
§ [dad]: first formant increases, but F2 and F3 slight fall
§ [gag]: F2 and F3 come together: this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
From Ladefoged “A Course in Phonetics”
“She came back and started again”
1. lots of high-freq energy
3. closure for k
4. burst of aspiration for k
5. ey vowel; faint 1100 Hz formant is nasalization
6. bilabial nasal
7. short b closure, voicing barely visible
8. ae; note upward transitions after bilabial stop at beginning
9. note F2 and F3 coming together for “k”
From Ladefoged “A Course in Phonetics”
Dialect Issues
§ Speech varies from dialect to dialect (examples are American vs. British English)
  § Syntactic (“I could” vs. “I could do”)
  § Lexical (“elevator” vs. “lift”)
  § Phonological
  § Phonetic
§ Mismatch between training and testing dialects can cause a large increase in error rate
[Figure: American vs. British examples for “all” and “old”]
Speech Recognition
The Noisy Channel Model
Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions
Language model: Distributions over sequences of words (sentences)
Speech Model
Words: w1 w2
Sound types: s1 s2 s3 s4 s5 s6 s7
Acoustic observations: a1 a2 a3 a4 a5 a6 a7
[Figure labels: Language model, Acoustic model]
Acoustic Model
Sound types: s1 s2 s3 s4 s5 s6 s7
Acoustic observations: a1 a2 a3 a4 a5 a6 a7
[Figure label: Acoustic model]
Frame Extraction
§ A frame (25 ms wide) extracted every 10 ms
[Figure: overlapping 25 ms windows, shifted by 10 ms, giving a1 a2 a3]
Figure: Simon Arnfield
Preview of feature extraction for each frame:
1) DFT (Spectrum)
2) Log (Calibrate?)
3) another DFT (!!??)
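A minimal sketch of the framing step described above: 25 ms windows taken every 10 ms. The 16 kHz sample rate and the function name `frame_bounds` are assumptions for illustration (the slide gives only the window width and shift), and trailing samples that do not fill a whole frame are simply dropped here.

```python
def frame_bounds(num_samples, sample_rate, width_ms=25, step_ms=10):
    """Return (start, end) sample indices of each full frame."""
    width = sample_rate * width_ms // 1000
    step = sample_rate * step_ms // 1000
    return [(start, start + width)
            for start in range(0, num_samples - width + 1, step)]

SAMPLE_RATE = 16_000   # assumed; not stated on the slide
frames = frame_bounds(num_samples=SAMPLE_RATE, sample_rate=SAMPLE_RATE)
print(frames[:3])    # [(0, 400), (160, 560), (320, 720)]
print(len(frames))   # 98 full frames in one second of audio
```

Because the shift is shorter than the window, consecutive frames overlap by 15 ms, so each sample contributes to two or three frames.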
Feature Extraction
Digitizing Speech
Figure: Bryan Pellom
Source/Filter
§ Articulation process:
  § The vocal cord vibrations create harmonics
  § The mouth is an amplifier
  § Depending on shape of mouth, some harmonics are amplified more than others
Figures from Ratree Wayland
Problem with Raw Spectrum
Deconvolution / Liftering
[Figure: the spectrum s is built from the excitation e and the filter f; after taking logs, log(s) is the sum of log(e) and log(f)]
Graphs from Dan Ellis
s = e ∗ f
log(s) = log(e) + log(f)
IDFT(log(s))
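The three lines above can be sketched numerically. Below, `e` and `f` are made-up magnitude spectra (a rapidly varying excitation and a smooth filter envelope), and `s` is their product. Taking the log and then an inverse DFT gives a cepstrum in which the two parts add and land at different indices, which is what liftering exploits. All values and names here are illustrative assumptions.

```python
import cmath
import math

def idft_real(values):
    """Real part of the inverse DFT of a sequence (naive O(N^2) version)."""
    n = len(values)
    return [sum(v * cmath.exp(2j * math.pi * k * t / n)
                for k, v in enumerate(values)).real / n
            for t in range(n)]

N = 64
# Hypothetical magnitude spectra: a smooth filter envelope f (1 cycle across
# the spectrum) and a rapidly varying excitation e (16 cycles, like harmonics).
f = [2.0 + math.cos(2 * math.pi * k / N) for k in range(N)]
e = [1.5 + math.cos(2 * math.pi * 16 * k / N) for k in range(N)]
s = [ek * fk for ek, fk in zip(e, f)]   # observed spectrum: product of the two

cep_s = idft_real([math.log(v) for v in s])
cep_e = idft_real([math.log(v) for v in e])
cep_f = idft_real([math.log(v) for v in f])

# The log turns the product into a sum, so the cepstra add.
additive = all(abs(cs - (ce + cf)) < 1e-9
               for cs, ce, cf in zip(cep_s, cep_e, cep_f))
print(additive)

# Low indices are dominated by the filter, index 16 by the excitation.
print(round(cep_f[1], 3), round(cep_e[1], 3))
print(round(cep_f[16], 3), round(cep_e[16], 3))
```

Keeping only the low-index coefficients of `cep_s` would therefore retain mostly the filter (vocal tract) contribution.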
Mel Freq. Cepstral Coefficients
§ Do FFT to get spectral information
  § Like the spectrogram we saw earlier
§ Apply Mel scaling (New)
  § Models human ear; more sensitivity in lower freqs
  § Approx linear below 1 kHz, log above, equal samples above and below 1 kHz
§ Take Log
§ Do discrete cosine transform
[Graph: Wikipedia]
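A sketch of the Mel scaling step. The slides do not give a formula; the one below, mel = 2595 · log10(1 + f/700), is one widely used approximation and is an assumption here, as are the function names, the filter count, and the 0 to 8000 Hz range. It shows the behavior described above: roughly linear at low frequencies, logarithmic higher up, so filters spaced evenly in mels grow wider in Hz.

```python
import math

def hz_to_mel(hz):
    """Map frequency in Hz to mels (one common formula; an assumption here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_centers(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) equally spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (num_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(num_filters)]

print(round(hz_to_mel(1000.0)))   # close to 1000: the scale is anchored near 1 kHz
centers = mel_spaced_centers(num_filters=10, low_hz=0.0, high_hz=8000.0)
print([round(c) for c in centers])   # gaps between centers widen with frequency
```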
Final Feature Vector
§ 39 (real) features per 10 ms frame:
  § 12 MFCC features
  § 12 delta MFCC features
  § 12 delta-delta MFCC features
  § 1 (log) frame energy
  § 1 delta (log) frame energy
  § 1 delta-delta (log) frame energy
§ So each frame is represented by a 39D vector
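A sketch of how 13 static features (12 MFCCs plus log energy) become a 39-dimensional vector. The delta here is a simple central difference with clamped edges, which is an assumption for illustration; practical systems often use a regression over a wider window. The function names and toy data are hypothetical.

```python
def deltas(frames):
    """Central differences (x[t+1] - x[t-1]) / 2 per dimension, edges clamped."""
    last = len(frames) - 1
    return [[(frames[min(t + 1, last)][d] - frames[max(t - 1, 0)][d]) / 2.0
             for d in range(len(frames[t]))]
            for t in range(len(frames))]

def add_dynamic_features(static):
    """Concatenate static, delta, and delta-delta features for every frame."""
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]

# Toy input: 5 frames of 13 static features that rise by 1 per frame.
static = [[float(t)] * 13 for t in range(5)]
full = add_dynamic_features(static)
print(len(full[0]))                 # 39
print(full[2][13], full[2][26])     # delta 1.0, delta-delta 0.0 mid-sequence
```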
Emission Model
HMMs for Continuous Observations
§ Before: discrete set of observations
§ Now: feature vectors are real-valued
§ Solution 1: discretization
§ Solution 2: continuous emissions
  § Gaussians
  § Multivariate Gaussians
  § Mixtures of multivariate Gaussians
§ A state is progressively
  § Context independent subphone (~3 per phone)
  § Context dependent phone (triphones)
  § State tying of CD phone
Vector Quantization
§ Idea: discretization
  § Map MFCC vectors onto discrete symbols
  § Compute probabilities just by counting
§ This is called vector quantization or VQ
§ Not used for ASR any more
§ But: useful to consider as a starting point
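The two VQ steps (map each vector to a discrete symbol, then estimate probabilities by counting) can be sketched as follows. The 2-D codebook, the state labels, and the function names are toy assumptions; a real codebook would be learned, for example by clustering, and would live in MFCC space.

```python
from collections import Counter, defaultdict

def quantize(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])))

def estimate_emissions(labeled_frames, codebook):
    """Relative-frequency estimate of P(code | state) from (state, vector) pairs."""
    counts = defaultdict(Counter)
    for state, vector in labeled_frames:
        counts[state][quantize(vector, codebook)] += 1
    return {state: {code: n / sum(c.values()) for code, n in c.items()}
            for state, c in counts.items()}

codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]   # toy 2-D codebook
frames = [("ae", (0.1, -0.2)), ("ae", (0.9, 1.2)),
          ("ae", (1.1, 0.8)), ("d", (3.7, 4.2))]
probs = estimate_emissions(frames, codebook)
print(probs)
```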
Gaussian Emissions
§ VQ is insufficient for top-quality ASR
  § Hard to cover high-dimensional space with codebook
  § Moves ambiguity from the model to the preprocessing
§ Instead: assume the possible values of the observation vectors are normally distributed.
  § Represent the observation likelihood function as a Gaussian?
From bartus.org/akustyk
Gaussians for Acoustic Modeling
§ P(x):
[Figure: bell curve of P(x) against x; P(x) is highest at the mean and low far from the mean]
A Gaussian is parameterized by a mean and a variance:
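For reference, the standard univariate Gaussian density is N(x; µ, σ²) = exp(-(x - µ)² / (2σ²)) / sqrt(2πσ²). A minimal sketch evaluating it at and away from the mean; the function name and the test points are illustrative.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return (math.exp(-(x - mean) ** 2 / (2.0 * variance))
            / math.sqrt(2.0 * math.pi * variance))

print(gaussian_pdf(0.0, mean=0.0, variance=1.0))   # highest at the mean
print(gaussian_pdf(3.0, mean=0.0, variance=1.0))   # low far from the mean
```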
Multivariate Gaussians
§ Instead of a single mean µ and variance σ²:
§ Vector of means µ and covariance matrix Σ
§ Usually assume diagonal covariance (!)
  § This isn’t very true for FFT features, but is less bad for MFCC features
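For reference, the standard D-dimensional Gaussian density, followed by the form it takes under the diagonal-covariance assumption, where it factors into a product of univariate Gaussians (one per feature dimension). The symbols D, x_d, µ_d, and σ_d² are notation introduced here, not taken from the slides.

```latex
\mathcal{N}(x;\mu,\Sigma) =
  \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}
  \exp\!\Big(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big)

% With \Sigma = \mathrm{diag}(\sigma_1^2,\dots,\sigma_D^2):
\mathcal{N}(x;\mu,\Sigma) =
  \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_d^2}}
  \exp\!\Big(-\frac{(x_d-\mu_d)^2}{2\sigma_d^2}\Big)
```

The diagonal form needs only D variances per Gaussian instead of a full D×D matrix, which is why it is attractive when the features are roughly decorrelated.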
Gaussians: Size of Σ
§ µ = [0 0], Σ = I; µ = [0 0], Σ = 0.6I; µ = [0 0], Σ = 2I
§ As Σ becomes larger, Gaussian becomes more spread out; as Σ becomes smaller, Gaussian more compressed
Text and figures from Andrew Ng
Gaussians: Shape of Σ
§ As we increase the off diagonal entries, more correlation between value of x and value of y
Text and figures from Andrew Ng
But we’re not there yet
§ Single Gaussians may do a bad job of modeling a complex distribution in any dimension
§ Even worse for diagonal covariances
§ Solution: mixtures of Gaussians
From openlearn.open.ac.uk
Mixtures of Gaussians
§ Mixtures of Gaussians:
From robots.ox.ac.uk http://www.itee.uq.edu.au/~comp4702
GMMs
§ Summary: each state has an emission distribution P(x|s) (likelihood function) parameterized by:
  § M mixture weights
  § M mean vectors of dimensionality D
  § Either M covariance matrices of D×D or M D×1 diagonal variance vectors
§ Like soft vector quantization after all
  § Think of the mixture means as being learned codebook entries
  § Think of the Gaussian densities as a learned codebook distance function
  § Think of the mixture of Gaussians like a multinomial over codes
  § (Even more true given shared Gaussian inventories, cf. next week)
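A sketch of the emission score P(x|s) for one state under the diagonal-variance parameterization in the summary above, computed in log space with log-sum-exp for numerical stability. The `responsibilities` function shows the "soft vector quantization" reading: a posterior over mixture components instead of a single nearest code. The parameter values and function names are toy assumptions.

```python
import math

def log_diag_gaussian(x, mean, variances):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
               for xi, mu, var in zip(x, mean, variances))

def gmm_log_likelihood(x, weights, means, variances):
    """log P(x | s) for an M-component mixture, via log-sum-exp."""
    terms = [math.log(w) + log_diag_gaussian(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def responsibilities(x, weights, means, variances):
    """Posterior over mixture components: a 'soft' codebook assignment."""
    total = gmm_log_likelihood(x, weights, means, variances)
    return [math.exp(math.log(w) + log_diag_gaussian(x, m, v) - total)
            for w, m, v in zip(weights, means, variances)]

# Toy state with M = 2 components in D = 2 dimensions.
weights = [0.7, 0.3]
means = [(0.0, 0.0), (3.0, 3.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]
x = (0.2, -0.1)
print(gmm_log_likelihood(x, weights, means, variances))
print(responsibilities(x, weights, means, variances))
```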
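State Model
State Transition Diagrams
§ Bayes Net: HMM as a Graphical Model
[Figure: chain of states w, w, w with observations x, x, x]
§ State Transition Diagram: Markov Model as a Weighted FSA
[Figure: weighted FSA over the words the, cat, chased, dog, has]
ASR Lexicon
Figure: J&M
Lexical State Structure
Figure: J&M
Adding an LM
Figure from Huang et al page 618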
State Space
§ State space must include
  § Current word (|V| on order of 20K+)
  § Index within current word (|L| on order of 5)
  § E.g. (lec[t]ure) (though not in orthography!)
§ Acoustic probabilities only depend on phone type
  § E.g. P(x|lec[t]ure) = P(x|t)
§ From a state sequence, can read a word sequence
State Refinement
Phones Aren’t Homogeneous
Need to Use Subphones
Figure: J&M
A Word with Subphones
Figure: J&M
Modeling phonetic context
[Figure: w iy, r iy, m iy, n iy]
“Need” with triphone models
Figure: J&M
Lots of Triphones
§ Possible triphones: 50 x 50 x 50 = 125,000
§ How many triphone types actually occur?
§ 20K word WSJ Task (from Bryan Pellom)
  § Word internal models: need 14,300 triphones
  § Cross word models: need 54,400 triphones
§ Need to generalize models, tie triphones
State Tying / Clustering
§ [Young, Odell, Woodland 1994]
§ How do we decide which triphones to cluster together?
§ Use phonetic features (or ‘broad phonetic classes’)
  § Stop
  § Nasal
  § Fricative
  § Sibilant
  § Vowel
  § Lateral
Figure: J&M
State Space
§ State space now includes
  § Current word: |W| is order 20K
  § Index in current word: |L| is order 5
  § Subphone position: 3
  § E.g. (lec[t-mid]ure)
§ Acoustic model depends on clustered phone context
  § But this doesn’t grow the state space
§ But, adding the LM context for trigram+ does
  § (after the, lec[t-mid]ure)
  § This is a real problem for decoding
Decoding
Inference Tasks
Most likely word sequence: d-ae-d
Most likely state sequence: d1-d6-d6-d4-ae5-ae2-ae3-ae0-d2-d2-d3-d7-d5
Viterbi Decoding
Figure: Enrique Benimeli
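A minimal sketch of Viterbi decoding for the most likely state sequence, in log space. The three-state left-to-right model for "dad", its transition and emission numbers, and the discretized observation symbols "D" and "A" are all hypothetical; they stand in for the real acoustic scores P(x|s).

```python
import math

NEG_INF = float("-inf")

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely state sequence; all scores are log probabilities."""
    best = {s: log_start.get(s, NEG_INF) + log_emit[s][observations[0]]
            for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Best predecessor of s at the previous time step.
            prev = max(states, key=lambda p: best[p] + log_trans.get((p, s), NEG_INF))
            new_best[s] = best[prev] + log_trans.get((prev, s), NEG_INF) + log_emit[s][obs]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):   # follow backpointers to the start
        path.append(pointers[path[-1]])
    return path[::-1], best[last]

# Toy left-to-right model (hypothetical numbers).
states = ["d1", "ae", "d2"]
log_start = {"d1": 0.0}
log_trans = {arc: math.log(p) for arc, p in {
    ("d1", "d1"): 0.5, ("d1", "ae"): 0.5,
    ("ae", "ae"): 0.5, ("ae", "d2"): 0.5, ("d2", "d2"): 1.0}.items()}
log_emit = {s: {o: math.log(p) for o, p in dist.items()} for s, dist in {
    "d1": {"D": 0.9, "A": 0.1},
    "ae": {"D": 0.1, "A": 0.9},
    "d2": {"D": 0.9, "A": 0.1}}.items()}

path, score = viterbi(list("DDAAD"), states, log_start, log_trans, log_emit)
print(path)              # ['d1', 'd1', 'ae', 'ae', 'd2']
print(math.exp(score))   # probability of that path together with the observations
```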
Emission Caching
§ Problem: scoring all the P(x|s) values is too slow
§ Idea: many states share tied emission models, so cache them
Prefix Trie Encodings
§ Problem: many partial-word states are indistinguishable
§ Solution: encode word production as a prefix trie (with pushed weights)
§ A specific instance of minimizing weighted FSAs [Mohri, 94]
[Figure: the words n i d, n i t, n o t with probabilities 0.04, 0.02, 0.01, and the corresponding prefix trie with pushed weights 0.04, 0.25, 0.5, 1, 1, 1]
Figure: Aubert, 02
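A sketch of building a prefix trie with pushed weights from a small lexicon. The weights in the figure are consistent with pushing the maximum word probability below each node toward the root, so that is what this example does; that reading, the pairing of the three probabilities with the three words, and the function name are assumptions.

```python
def pushed_trie_weights(lexicon):
    """Arc weights of a prefix trie with probabilities pushed toward the root.

    `lexicon` maps a tuple of phones to a word probability. The arc into each
    prefix gets (best probability below it) / (best probability below its
    parent), so the weights along a full path multiply back to the word's
    probability.
    """
    best = {(): 1.0}   # the root carries weight 1
    for phones, prob in lexicon.items():
        for i in range(1, len(phones) + 1):
            prefix = phones[:i]
            best[prefix] = max(best.get(prefix, 0.0), prob)
    return {prefix: best[prefix] / best[prefix[:-1]]
            for prefix in best if prefix}

lexicon = {("n", "i", "d"): 0.04, ("n", "i", "t"): 0.02, ("n", "o", "t"): 0.01}
weights = pushed_trie_weights(lexicon)
for prefix, weight in sorted(weights.items()):
    print("-".join(prefix), weight)
```

With pushed weights, a search sees the cost of the best word still reachable as soon as it enters the shared prefix, rather than only when the word ends.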
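Beam Search
§ Problem: trellis is too big to compute v(s) vectors
§ Idea: most states are terrible, keep v(s) only for top states at each time
§ Important: still dynamic programming; collapse equiv states
[Figure: partial hypotheses "the b.", "the m.", "and then.", "at then." are expanded to "the ba.", "the be.", "the bi.", "the ma.", "the me.", "the mi.", "then a.", "then e.", "then i.", and the beam keeps "the ba.", "the be.", "the ma.", "then a."]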
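A sketch of beam search as a pruned Viterbi pass: at each frame only the `beam_size` best states survive, and hypotheses that reach the same state are collapsed to the better one, so it is still dynamic programming. The toy model, its numbers, and the function name are hypothetical.

```python
import math

def beam_search(observations, log_start, successors, log_emit, beam_size):
    """Viterbi-style search keeping only the `beam_size` best states per frame."""
    def prune(hyps):
        kept = sorted(hyps.items(), key=lambda item: item[1][0], reverse=True)
        return dict(kept[:beam_size])

    beam = prune({s: (score + log_emit[s][observations[0]], [s])
                  for s, score in log_start.items()})
    for obs in observations[1:]:
        candidates = {}
        for state, (score, path) in beam.items():
            for nxt, log_p in successors[state]:
                new_score = score + log_p + log_emit[nxt][obs]
                # Dynamic programming: equivalent states collapse to the best one.
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        beam = prune(candidates)
    return max(beam.values(), key=lambda hyp: hyp[0])

# Toy left-to-right model (hypothetical numbers).
successors = {
    "d1": [("d1", math.log(0.5)), ("ae", math.log(0.5))],
    "ae": [("ae", math.log(0.5)), ("d2", math.log(0.5))],
    "d2": [("d2", 0.0)],
}
log_emit = {s: {o: math.log(p) for o, p in dist.items()} for s, dist in {
    "d1": {"D": 0.9, "A": 0.1},
    "ae": {"D": 0.1, "A": 0.9},
    "d2": {"D": 0.9, "A": 0.1}}.items()}

score, path = beam_search(list("DDAAD"), {"d1": 0.0}, successors, log_emit,
                          beam_size=2)
print(path)
print(math.exp(score))
```

On this small example a beam of 2 finds the same path an exhaustive search would; in general a narrow beam can prune the true best path, which is the price paid for speed.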
LM Factoring
§ Problem: Higher-order n-grams explode the state space
§ (One) Solution:
  § Factor state space into (word index, lm history)
  § Score unigram prefix costs while inside a word
  § Subtract unigram cost and add trigram cost once word is complete
[Figure: the prefix trie (d, n i, t, o t) with pushed weights 0.04, 0.25, 0.5, 1, 1, 1, following the LM history "the"]
LM Reweighting
§ Noisy channel suggests
§ In practice, want to boost LM
§ Also, good to have a “word bonus” to offset LM costs
§ These are both consequences of broken independence assumptions in the model
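One common way to realize these adjustments is to score a hypothesis in log space as log P(a|w) + lm_weight · log P(w) + word_bonus · |w|, where the plain noisy-channel score corresponds to lm_weight = 1 and word_bonus = 0. The sketch below assumes that form; the parameter names, the weights, and the two hypothetical hypotheses are illustrative and not taken from the slides.

```python
def hypothesis_score(acoustic_log_prob, lm_log_prob, num_words,
                     lm_weight=1.0, word_bonus=0.0):
    """Log-domain score: log P(a|w) + lm_weight * log P(w) + word_bonus * |w|."""
    return acoustic_log_prob + lm_weight * lm_log_prob + word_bonus * num_words

# Hypothetical scores for two competing transcriptions of the same audio.
short_hyp = dict(acoustic_log_prob=-120.0, lm_log_prob=-9.0, num_words=3)
long_hyp = dict(acoustic_log_prob=-118.0, lm_log_prob=-14.0, num_words=5)

for lm_weight, word_bonus in [(1.0, 0.0), (10.0, 0.0), (10.0, 30.0)]:
    short_score = hypothesis_score(**short_hyp, lm_weight=lm_weight,
                                   word_bonus=word_bonus)
    long_score = hypothesis_score(**long_hyp, lm_weight=lm_weight,
                                  word_bonus=word_bonus)
    winner = "short" if short_score > long_score else "long"
    print(lm_weight, word_bonus, short_score, long_score, winner)
```

Boosting the LM alone penalizes every extra word more heavily, which favors short outputs; the word bonus is there to counteract that bias.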
Training
Training Mixture Models
§ Input: wav files with unaligned transcriptions
§ Forced alignment
  § Computing the “Viterbi path” over the training data (where the transcription is known) is called “forced alignment”
  § We know which word string to assign to each observation sequence.
  § We just don’t know the state sequence.
  § So we constrain the path to go through the correct words (by using a special example-specific language model)
  § And otherwise run the Viterbi algorithm
§ Result: aligned state sequence
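A simplified sketch of forced alignment: the phone sequence is known, so the search only decides where each phone starts and ends. Each frame either stays in the current phone or advances to the next. Transition probabilities are omitted, and the per-frame scores are made-up numbers; both are simplifying assumptions, as is the function name.

```python
import math

NEG_INF = float("-inf")

def forced_align(frame_log_likes, phones):
    """Best monotonic assignment of frames to a known phone sequence.

    frame_log_likes[t][p] is log P(x_t | p). The path starts in the first
    phone and must end in the last one.
    """
    num_frames, num_phones = len(frame_log_likes), len(phones)
    score = [[NEG_INF] * num_phones for _ in range(num_frames)]
    back = [[0] * num_phones for _ in range(num_frames)]
    score[0][0] = frame_log_likes[0][phones[0]]
    for t in range(1, num_frames):
        for i in range(num_phones):
            stay = score[t - 1][i]
            advance = score[t - 1][i - 1] if i > 0 else NEG_INF
            back[t][i] = i - 1 if advance > stay else i
            score[t][i] = max(stay, advance) + frame_log_likes[t][phones[i]]
    i = num_phones - 1
    indices = [i]
    for t in range(num_frames - 1, 0, -1):   # follow backpointers to frame 0
        i = back[t][i]
        indices.append(i)
    return [phones[i] for i in reversed(indices)]

def toy_scores(symbol):
    """Hypothetical per-frame log likelihoods for a 'D'-like or 'A'-like frame."""
    return {"d": math.log(0.9 if symbol == "D" else 0.1),
            "ae": math.log(0.9 if symbol == "A" else 0.1)}

frames = [toy_scores(c) for c in "DDAAAD"]
alignment = forced_align(frames, ["d", "ae", "d"])
print(alignment)   # ['d', 'd', 'ae', 'ae', 'ae', 'd']
```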
State Tying
§ Creating CD phones:
  § Start with monophone, do EM training
  § Clone Gaussians into triphones
  § Build decision tree and cluster Gaussians
  § Clone and train mixtures (GMMs)
§ General idea:
  § Introduce complexity gradually
  § Interleave constraint with flexibility
Standard subphone/mixture HMM
[Figure labels: Temporal Structure, Gaussian Mixtures]
Model            Error rate
HMM Baseline     25.1%
An Induced Model
[Figure labels: Standard Model, Single Gaussians, Fully Connected]
[Petrov, Pauls, and Klein, 07]
Hierarchical Split Training with EM
[Figure: error rates shown in the split hierarchy: 32.1%, 28.7%, 25.6%, 23.9%]
Model            Error rate
HMM Baseline     25.1%
5 Split rounds   21.4%
Refinement of the /ih/-phone
HMM states per phone
[Figure: bar chart of the number of HMM states per phone (y-axis 0 to 35); phones on the x-axis: ae, ao, ay, eh, er, ey, ih, f, r, s, sil, aa, ah, ix, iy, z, cl, k, sh, n, vcl, ow, l, m, t, v, uw, aw, ax, ch, w, th, el, dh, uh, p, en, oy, hh, jh, ng, y, b, d, dx, g, zh, epi]