EPSRC Perceptual Constancy Steering Group Meeting | 19th May 2010
Automatic speech recognition on the articulation index corpus
Guy J. Brown and Amy Beeston
Department of Computer Science, University of Sheffield
[email protected]
Aims
• Eventual aim is to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).
• Should be compatible with the Watkins et al. findings, but also validated on a ‘real world’ ASR task:
– wider vocabulary
– range of reverberation conditions
– variety of speech contexts
– naturalistic speech, rather than interpolated stimuli
– consider phonetic confusions in reverberation in general
• Initial ASR studies using the articulation index corpus
• Aim to compare human performance (Amy’s experiment) and machine performance on the same task
Articulation index (AI) corpus
• Recorded by Jonathan Wright (University of Pennsylvania)
• Intended for speech recognition in noise experiments similar to those of Fletcher.
• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins:
– American English
– Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
– Target syllables are embedded in a context sentence drawn from a limited vocabulary
Grammar for Amy’s subset of AI corpus
$cw1 = YOU | I | THEY | NO-ONE | WE | ANYONE | EVERYONE | SOMEONE | PEOPLE;
$cw2 = SPEAK | SAY | USE | THINK | SENSE | ELICIT | WITNESS | DESCRIBE | SPELL | READ | STUDY | REPEAT | RECALL | REPORT | PROPOSE | EVOKE | UTTER | HEAR | PONDER | WATCH | SAW | REMEMBER | DETECT | SAID | REVIEW | PRONOUNCE | RECORD | WRITE | ATTEMPT | ECHO | CHECK | NOTICE | PROMPT | DETERMINE | UNDERSTAND | EXAMINE | DISTINGUISH | PERCEIVE | TRY | VIEW | SEE | UTILIZE | IMAGINE | NOTE | SUGGEST | RECOGNIZE | OBSERVE | SHOW | MONITOR | PRODUCE;
$cw3 = ONLY | STEADILY | EVENLY | ALWAYS | NINTH | FLUENTLY | PROPERLY | EASILY | ANYWAY | NIGHTLY | NOW | SOMETIME | DAILY | CLEARLY | WISELY | SURELY | FIFTH | PRECISELY | USUALLY | TODAY | MONTHLY | WEEKLY | MORE | TYPICALLY | NEATLY | TENTH | EIGHTH | FIRST | AGAIN | SIXTH | THIRD | SEVENTH | OFTEN | SECOND | HAPPILY | TWICE | WELL | GLADLY | YEARLY | NICELY | FOURTH | ENTIRELY | HOURLY;
$test = SIR | STIR | SPUR | SKUR;
( !ENTER $cw1 $cw2 $test $cw3 !EXIT )
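To make the sentence structure concrete, the following is a minimal Python sketch (not part of the original corpus tools; word lists abbreviated) that samples utterances from the grammar above:

import random

# Abbreviated word lists from the grammar above ($cw1, $cw2, $test, $cw3)
cw1 = ["YOU", "I", "THEY", "WE", "PEOPLE"]
cw2 = ["SPEAK", "SAY", "REPEAT", "OBSERVE"]
test = ["SIR", "STIR", "SPUR", "SKUR"]
cw3 = ["ONLY", "ALWAYS", "NOW", "TODAY"]

def sample_sentence():
    """Draw one sentence matching ( $cw1 $cw2 $test $cw3 )."""
    return " ".join(random.choice(slot) for slot in (cw1, cw2, test, cw3))

print(sample_sentence())  # e.g. "THEY REPEAT STIR TODAY"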
Audio demos
ASR system
• HMM-based phone recogniser:
– implemented in HTK
– monophone models
– 20 Gaussian mixtures per state
– adapted from scripts by Tony Robinson/Dan Ellis
• Bootstrapped by training on TIMIT, then a further 10-12 iterations of embedded training on the AI corpus (see the sketch below)
• Word-level transcripts in the AI corpus expanded to phones using the CMU pronouncing dictionary
• All of the AI corpus used for training, except the 80 utterances in Amy’s experimental stimuli
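A hedged sketch of this training recipe (file and directory names are hypothetical; HERest is HTK’s standard embedded re-estimation tool):

import subprocess

# Run 12 passes of embedded re-estimation over the AI corpus, starting
# from the TIMIT-bootstrapped monophone models assumed to be in hmm0/.
for i in range(12):
    subprocess.run([
        "HERest",
        "-C", "config",            # front-end/feature configuration
        "-I", "ai_phones.mlf",     # phone transcriptions (expanded via CMU dict)
        "-S", "ai_train.scp",      # training utterance list (Amy’s 80 held out)
        "-H", f"hmm{i}/hmmdefs",   # input model set
        "-M", f"hmm{i + 1}",       # output directory for re-estimated models
        "monophones",              # list of monophone names
    ], check=True)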
MFCC features
• Baseline system trained using mel-frequency cepstral coefficients (MFCCs); see the sketch below:
– 12 MFCCs + energy + deltas + accelerations (39 features per frame in total)
– cepstral mean normalization
• Baseline system performance on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
– 98.75% context words correct
– 96.25% test words correct
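A minimal sketch of an equivalent 39-dimensional front end, assuming librosa rather than the HTK feature pipeline actually used (and using c0 in place of HTK’s log-energy term):

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file name

# 13 cepstra: c0 stands in for the energy term, c1-c12 are the 12 MFCCs
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)               # first-order derivatives
accel = librosa.feature.delta(mfcc, order=2)      # second-order derivatives

feats = np.vstack([mfcc, delta, accel])           # 39 features per frame
feats -= feats.mean(axis=1, keepdims=True)        # cepstral mean normalization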
Amy’s experiment
• Amy’s first experiment used 80 utterances: 20 instances each of the test words “sir”, “skur”, “spur” and “stir”
• Overall confusion rate was controlled by lowpass filtering at 1, 1.5, 2, 3 and 4 kHz (see the sketch after the table below)
• Same reverberation conditions as in the Watkins et al. experiments
• Stimuli presented to the ASR system as in Amy’s human studies
                 Test 0.32 m   Test 10 m
Context 0.32 m   near-near     near-far
Context 10 m     far-near      far-far
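A hedged sketch of the lowpass filtering step (the filter order and zero-phase filtering are assumptions; the cutoff values are those listed above):

import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass(x, sr, cutoff_hz):
    """Zero-phase Butterworth lowpass filter."""
    sos = butter(8, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfiltfilt(sos, x)

sr = 16000                                   # assumed sample rate
x = np.random.randn(sr)                      # stand-in for a stimulus waveform
filtered = {fc: lowpass(x, sr, fc) for fc in (1000, 1500, 2000, 3000, 4000)}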
Baseline ASR: context words
• Performance falls as the cutoff frequency decreases
• Performance falls as level of reverberation increases
• Near context substantially better than far context at most cutoffs
Baseline ASR: test words
No particular pattern of confusions in the 2 kHz near-near case, but more frequent skur/spur/stir errors
Baseline ASR: human comparison
• Data for 4 kHz cutoff
• Even mild reverberation (near-near) causes substantial errors in the baseline ASR system
• Human listeners exhibit compensation in the AIC task; the baseline ASR system doesn’t (as expected)
[Figure: percentage error for far and near test words; human data (20 subjects) compared with the baseline ASR system]
Training on auditory features
• 80 channels between 100 Hz and 8 kHz
• 15 DCT coefficients + delta + acceleration (45 features per frame)
• Efferent attenuation set to zero for initial tests
• Performance of auditory features on Amy’s clean subset of AI corpus (80 utterances, no reverberation):
– 95% context words correct
– 97.5% test words correct
[Figure: auditory model front end. Stimulus → outer/middle ear (OME) → dual-resonance nonlinear filterbank (DRNL) → hair cell → frame & DCT → recogniser; the efferent system applies an attenuation (ATT) within the auditory periphery]
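A minimal sketch of the ‘Frame & DCT’ stage, assuming the 80-channel auditory rate representation has already been computed (the random array and log compression here are stand-ins for the real OME/DRNL/hair-cell output):

import numpy as np
from scipy.fftpack import dct
import librosa

rate = np.abs(np.random.randn(80, 200))      # stand-in: 80 channels x 200 frames

# Decorrelate across channels and keep 15 DCT coefficients per frame
cc = dct(np.log(rate + 1e-8), axis=0, norm="ortho")[:15]
delta = librosa.feature.delta(cc)            # first-order derivatives
accel = librosa.feature.delta(cc, order=2)   # second-order derivatives
feats = np.vstack([cc, delta, accel])        # 45 features per frame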
Auditory features: context words
• Take a big hit in performance using auditory features:
– saturation in the auditory nerve (AN) is likely to be an issue
– mean normalization may also be a factor
• Performance falls sharply with decreasing cutoff
• As expected, best performance in the least reverberated conditions
Auditory features: test words
Effect of efferent suppression
• Not yet used the full closed-loop model in ASR experiments
• Indication of likely performance obtained by increasing the efferent attenuation in ‘far’ context conditions
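A minimal sketch of the simplification used here: a fixed efferent attenuation in dB, applied uniformly and open loop rather than by the full closed-loop model:

import numpy as np

def apply_efferent_attenuation(x, att_db):
    """Scale the input to the auditory periphery by -att_db decibels."""
    return x * 10.0 ** (-att_db / 20.0)

signal = np.random.randn(16000)                      # stand-in stimulus
suppressed = apply_efferent_attenuation(signal, 10)  # e.g. 10 dB in ‘far’ contexts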
Auditory features: human comparison
• 4 kHz cutoff
• Efferent suppression effective for mild reverberation
• Detrimental to far test word
• Currently unable to model human data, but:
– not closed loop
– same efferent attenuation in all bands
[Figure: percentage error for far and near test words; human data (20 subjects) compared with the auditory-feature system without efferent suppression and with 10 dB efferent suppression]
Confusion analysis: far-near condition
• Without efferent attenuation, “skur”, “spur” and “stir” are frequently misrecognised as “sir”
• These confusions are reduced by more than half when 10 dB of efferent attenuation is applied
far-near, 0 dB attenuation (rows: presented word; columns: recognised word)

       SIR  SKUR  SPUR  STIR
SIR     12     3     3     2
SKUR    11     5     2     2
SPUR    10     2     4     4
STIR     7     1     7     5

far-near, 10 dB attenuation

       SIR  SKUR  SPUR  STIR
SIR      9     2     2     7
SKUR     3    10     4     3
SPUR     5     4     4     7
STIR     3     2     3    12
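A hedged sketch of how counts like those above can be tallied (the paired lists of presented and recognised test words are hypothetical):

import numpy as np

WORDS = ["SIR", "SKUR", "SPUR", "STIR"]
IDX = {w: i for i, w in enumerate(WORDS)}

def confusion_matrix(presented, recognised):
    """Rows index the presented word, columns the recognised word."""
    m = np.zeros((len(WORDS), len(WORDS)), dtype=int)
    for p, r in zip(presented, recognised):
        m[IDX[p], IDX[r]] += 1
    return m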
Confusion analysis: far-far condition
• Again “skur”, “spur” and “stir” are commonly reported as “sir”
• These confusions are somewhat reduced by 10 dB efferent attenuation, but:
– the gain is outweighed by more frequent “skur”, “spur”, “stir” confusions
• Efferent attenuation recovers the dip in the temporal envelope, but not the cues to /k/, /p/ and /t/
far-far, 0 dB attenuation (rows: presented word; columns: recognised word)

       SIR  SKUR  SPUR  STIR
SIR     18     0     1     1
SKUR    14     1     3     2
SPUR    12     1     5     2
STIR    12     0     3     5

far-far, 10 dB attenuation

       SIR  SKUR  SPUR  STIR
SIR     13     2     1     4
SKUR    11     3     1     5
SPUR     6     5     2     7
STIR    10     2     1     7
Summary
• ASR framework in place for the AI corpus experiments
• We can compare human and machine performance on the AIC task
• Reasonable performance from baseline MFCC system
• Need to address shortfall in performance when using auditory features
• Haven’t yet tried the full within-channel model as a front end