Flexible, Robust, and Efficient Human Speech Processing Versus Present-day Speech Technology
Louis C.W. Pols
Institute of Phonetic Sciences / IFOTT
University of Amsterdam
The Netherlands
IFA, Herengracht 338, Amsterdam
My pre-predecessor: Louise Kaiser
Secretary of the First International Congress of Phonetic Sciences
Amsterdam, 3-8 July 1932
welcome
Amsterdam ICPhS’32
Jac. van Ginneken, president; L. Kaiser, secretary; A. Roozendaal, treasurer
Subjects:
- physiology of speech and voice (experimental phonetics in its strict meaning)
- study of the development of speech and voice in the individual; their evolution in the history of mankind; the influence of heredity
- anthropology of speech and voice
- phonology
- linguistic psychology
- pathology of speech and voice
- comparative physiology of the sounds of animals
- musicology
136 participants from 16 countries; 43 plenary papers; 24 demonstrations
Amsterdam ICPhS’32
Some of the participants:
prof. Daniel Jones, London: The theory of phonemes, and its importance in Practical Linguistics
Sir Richard Paget, London: The Evolution of Speech in Men
prof. R.H. Stetson, Oberlin: Breathing Movements in Speech
prof. Prince N. Trubetzkoy, Wien: Charakter und Methode der systematischen phonologischen Darstellung einer gegebenen Sprache
dr. E. Zwirner, Berlin-Buch:
- Phonetische Untersuchungen an Aphasischen und Amusischen
- Quantität, Lautdauerschätzung und Lautkurvenmessung (Theorie und Material)
-----------------------------------------------------------------
2nd, London ’35; 3rd, Ghent ’38; 4th, Helsinki ’61; 5th, Münster ’64;
Overview
Phonetics and speech technology
Do recognizers need ‘intelligent ears’?
What is knowledge?
How good is human/machine speech recognition?
How good is synthetic speech?
Pre-processor characteristics
Useful (phonetic) knowledge
Computational phonetics
Discussion/conclusions
Phonetics and Speech Technology
AFFINITY (from / to):
- from phonetics to phonetics: source/filter, individuality, context, prosody
- from phonetics to speech technology: human performance, specific knowledge, regularities, multiple features
- from speech technology to phonetics: more data, new models, probabilities, speech vs. NLP
- from speech technology to speech technology: EU FPV, DARPA, applications, user orientation, evaluation
Do recognizers need intelligent ears?
intelligent ears: front-end pre-processor
only if it improves performance
humans are generally better speech processors than machines; perhaps system developers can learn from human behavior
robustness at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)
What is knowledge?
phonetic knowledge
probabilistic knowledge from databases
fixed set of features vs. adaptable set
trading relations, selectivity
knowledge of the world, expectation
global vs. detailed
see video
(with permission from Interbrew Nederland NV)
Video is a metaphor for:
from global to detail (world → Europe → Holland → North Sea coast → Scheveningen beach → young lady drinking Dommelsch beer)
sound → speech → speaker → English → utterance
‘recognize speech’ or ‘wreck a nice beach’
zoom in on whatever information is available
make intelligent interpretation, given context
beware of distracters!
Human auditory sensitivity
stationary vs. dynamic signals
simple vs. spectrally complex
detection threshold
just noticeable differences
see Table 3 in paper
phenomenon | threshold/jnd | remarks
threshold of hearing | 0 dB at 1000 Hz | frequency dependent
threshold of duration | constant energy at 10 - 300 ms | Energy = Power x Duration
frequency discrimination | 1.5 Hz at 1000 Hz | more when < 200 ms
intensity discrimination | 0.5 - 1 dB | up to 80 dB SL
temporal discrimination | 5 ms at 50 ms | duration dependent
masking | psychophysical tuning curve |
pitch of complex tones | low pitch | many peculiarities
gap detection | 3 ms for wide-band noise | more at low freq. for narrow-band noise
formant frequency | 3 - 5 % | one formant only; < 3 % with more experienced subjects
formant amplitude | 3 dB | F2 in synthetic vowel
overall intensity | 1.5 dB | synthetic vowel, mainly F1
formant bandwidth | 20 - 40 % | one-formant vowel
F0 (pitch) | 0.3 - 0.5 % | synthetic vowel
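The "threshold of duration" entry (constant energy at 10 - 300 ms, with Energy = Power x Duration) implies a simple trade between duration and level. The following is a minimal sketch of that arithmetic, assuming ideal energy integration over the whole 10 - 300 ms range; the function name and the 300 ms reference are illustrative choices, not taken from the slides.

```python
import math

def threshold_shift_db(duration_ms, reference_ms=300.0):
    """Level increase (dB) needed to keep Energy = Power x Duration constant
    when a sound is shortened from reference_ms to duration_ms.

    Assumes ideal energy integration, which the table places at roughly
    10-300 ms; outside that range the relation is not claimed to hold.
    """
    if not (10.0 <= duration_ms <= 300.0 and 10.0 <= reference_ms <= 300.0):
        raise ValueError("constant-energy trading assumed only for 10-300 ms")
    # Constant energy: P1 * D1 = P2 * D2, so P1 / P2 = D2 / D1.
    return 10.0 * math.log10(reference_ms / duration_ms)

for d in (300, 150, 30, 10):
    print(f"{d:>3} ms: +{threshold_shift_db(d):.1f} dB")
```

Under this assumption, halving the duration costs about 3 dB and a tenfold shortening costs 10 dB.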
Detection thresholds and jnd
[Figure: jnd for simple, stationary signals (frequency: 1.5 Hz) and for multi-harmonic, single-formant-like periodic signals (formant frequency, labeled F2: 3 - 5 %; bandwidth BW: 20 - 40 %)]
DL for short speech-like transitions
[Figure: difference limen (DL, y-axis 0 to 240) as a function of transition duration (20, 30, 40, 50 ms) for conditions labeled Tone glide, Single-isolated, Single and Complex; annotations: simple vs. complex signals, short vs. longer transitions]
Adopted from van Wieringen & Pols (Acta Acustica ’98)
How good is human / machine speech recognition?
corpus | description | vocabulary size | recognition perplexity | % word error, machine | % word error, human
TI digits | read digits | 10 | 10 | 0.72 | 0.009
alphabet | read letters | 26 | 26 | 5 | 1.6
Resource Management | read sentences | 1,000 | 60-1,000 | 17 | 2
NAB | read sentences | 5,000-unlimited | 45-160 | 6.6 | 0.4
Switchboard CSR | spontaneous telephone conversations | 2,000-unlimited | 80-150 | 43 | 4
Switchboard wordspotting | idem | 20 keywords | - | 31.1 | 7.4
Adapted from Lippmann (SpeCom, 1997)
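One way to read these numbers is as the ratio of machine to human word error per corpus. The short sketch below does that arithmetic on the error rates listed above; the variable and function names are illustrative only.

```python
# (corpus, machine % word error, human % word error), copied from the table above
RESULTS = [
    ("TI digits", 0.72, 0.009),
    ("alphabet", 5.0, 1.6),
    ("Resource Management", 17.0, 2.0),
    ("NAB", 6.6, 0.4),
    ("Switchboard CSR", 43.0, 4.0),
    ("Switchboard wordspotting", 31.1, 7.4),
]

def error_ratio(machine, human):
    """How many times larger the machine error rate is than the human one."""
    return machine / human

for corpus, machine, human in RESULTS:
    print(f"{corpus:<26} machine/human = {error_ratio(machine, human):5.1f}")
```

The ratio varies widely across corpora, so a single "machines are N times worse" figure would hide most of the picture.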
How good is human / machine speech recognition?
machine SR surprisingly good for certain tasks
machine SR could be better for many others
- robustness, outliers
what are the limits of human performance?
- in noise
- for degraded speech
- missing information (trading)
Human word intelligibility vs. noise
Adopted from Steeneken (1992)
recognizers have trouble!
humans start to have some trouble
Robustness to degraded speech
speech = time-modulated signal in frequency bands
relatively insensitive to (spectral) distortions
- prerequisite for digital hearing aid
- modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
temporal smearing of envelope modulation
- ca. 4 Hz max. in modulation spectrum → syllable
- LP>4 Hz and HP<8 Hz little effect on intelligibility
spectral envelope smearing
- for BW>1/3 oct masked SRT starts to degrade
(for references, see paper in Proc. ICPhS’99)
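The "ca. 4 Hz max. in modulation spectrum" point can be illustrated by taking the spectrum of a temporal envelope. This is a minimal sketch on a toy envelope, not on measured speech: the 4 Hz and 12 Hz components, the 100 Hz envelope sampling rate and the function name are all assumptions made for illustration.

```python
import cmath
import math

def modulation_spectrum(envelope, sample_rate_hz):
    """Return (frequencies_hz, magnitudes) of the envelope's DFT, DC removed."""
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [x - mean for x in envelope]
    freqs, mags = [], []
    for k in range(1, n // 2 + 1):
        # Naive DFT bin k; fine for a few hundred envelope samples.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        freqs.append(k * sample_rate_hz / n)
        mags.append(abs(coeff) / n)
    return freqs, mags

# Toy envelope (an assumption, not measured speech): a 4 Hz "syllable rate"
# component plus a weaker 12 Hz component, sampled at 100 Hz for 2 s.
FS = 100
envelope = [1.0 + 0.8 * math.sin(2 * math.pi * 4 * t / FS)
            + 0.2 * math.sin(2 * math.pi * 12 * t / FS)
            for t in range(2 * FS)]

freqs, mags = modulation_spectrum(envelope, FS)
peak_hz = freqs[mags.index(max(mags))]
print(f"modulation spectrum peaks at {peak_hz:.1f} Hz")
```

For real speech the envelope would first have to be extracted per frequency band (for example by rectification and low-pass filtering), which this sketch leaves out.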
Robustness to degraded speech and missing information
partly reversed speech (Saberi & Perrott, Nature, 4/99)
- fixed duration segments time reversed or shifted in time
- perfect sentence intelligibility up to 50 ms (demo: every 50 ms reversed; original)
- low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
- syllable as information unit? (S. Greenberg)
gap and click restoration (Warren)
gating experiments
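The "fixed duration segments time reversed" manipulation is easy to state in code. The following is a minimal sketch, assuming the signal is a plain list of samples and that a final shorter segment is reversed as well; the function name and the toy signal are illustrative and not taken from the cited study.

```python
def reverse_segments(samples, sample_rate_hz, segment_ms):
    """Time-reverse consecutive fixed-duration segments of a signal."""
    seg_len = max(1, round(sample_rate_hz * segment_ms / 1000))
    out = []
    for start in range(0, len(samples), seg_len):
        # Reverse each segment in place; the segment order is unchanged.
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Toy signal: 10 "samples" at 1000 Hz, reversed in 3 ms segments.
signal = list(range(10))
print(reverse_segments(signal, sample_rate_hz=1000, segment_ms=3))
# -> [2, 1, 0, 5, 4, 3, 8, 7, 6, 9]
```

For the 50 ms condition mentioned above, a 16 kHz recording would use `segment_ms=50`, that is, 800 samples per segment. Applying the function twice with the same settings restores the original signal.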
How good is synthetic speech?
good enough for certain applications
could be better in most others
evaluation: application-specific or multi-tier required
interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998
Workshop evaluation procedure
participants as native listeners
DARPA-type procedures in data preparations
balanced listening design
no detailed results made public
3 text types
- newspaper sentences
- semantically unpredictable sentences
- telephone directory entries
42 systems in 8 languages tested
Screen for newspaper sentences
Some global results
it worked!, but many practical problems (for demo see http://www.fon.hum.uva.nl)
this seems the way to proceed and to expand
global rating (poor to excellent)
- text analysis, prosody & signal processing
and/or more detailed scores
transcriptions subjectively judged
- major/minor/no problems per entry
web site access of several systems (http://www.ldc.upenn.edu/ltts/)
Phonetic knowledge to improve speech synthesis
(suppose concatenative synthesis)
control emotion, style, voice characteristics
perceptual implications of
- parameterization (LPC, PSOLA)
- discontinuities (spectral, temporal, prosody)
improve naturalness (prosody!)
active adaptation to other conditions
- hyper/hypo, noise, comm. channel, listener impairment
systematic evaluation
Desired pre-processor characteristics in Automatic Speech Recognition
basic sensitivity for stationary and dynamic sounds
robustness to degraded speech
- rather insensitive to spectral and temporal smearing
robustness to noise and reverberation
filter characteristics
- is BP, PLP, MFCC, RASTA, TRAPS good enough?
- lateral inhibition (spectral sharpening); dynamics
what can be neglected?
- non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.
Caricature of present-day speech recognizer
trained with a variety of speech input
- much global information, no interrelations
monaural, uni-modal input
pitch extractor generally not operational
performs well on average behavior
- does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)
neglects lots of useful (phonetic) information
heavily relies on language model
Useful (phonetic) knowledge neglected so far
pitch information
(systematic) durational variability
spectral reduction/coarticulation (other than multiphone)
intelligent selection from multiple features
quick adaptation to speaker, style & channel
communicative expectations
multi-modality
binaural hearing
Useful information: durational variability
[Figure: tree of /iy/ vowel durations, split successively by the factors R, S, Lw and Lu; each node gives count, mean and s.d. for one factor level. Durations in ms.]
Root /iy/: count 4626, mean 95, s.d. 39
R = 0: count 1544, mean 83, s.d. 31
- S = 0: 796, 78, 25
- S = 1: 711, 89, 36
- S = 2: 37, 91, 25
R = 1: count 1588, mean 95, s.d. 36
- S = 0: 816, 87, 29
- S = 1: 735, 104, 40
- S = 2: 37, 98, 34
R = 2: count 1494, mean 109, s.d. 46
- S = 0: 719, 98, 33
- S = 1: 729, 119, 54
- S = 2: 46, 104, 42
Lower levels split each S node further by Lw and then by Lu.
Adopted from Wang (1998)
normal rate=95
primary stress=104
word final=136
utterance final=186
overall average=95 ms
Useful information: V and C reduction, coarticulation
spectral variability is not random but, at least partly, speaker-, style-, and context-specific
read - spontaneous; stressed - unstressed
not just for vowels, but also for consonants
- duration
- spectral balance
- intervocalic sound energy difference
- F2 slope difference
- locus equation
[Figure: two bar charts comparing read and spontaneous speech for stressed, unstressed and all (total) consonants.
Mean consonant duration: y-axis duration in ms (45 to 65); p: 0.001 (stressed), 0.006 (unstressed), 0.001 (total).
Mean error rate for C identification: y-axis error rate in % (0 to 35); p: 0.001 (stressed), 0.001 (unstressed), 0.001 (total).]
Adopted from van Son & Pols (Eurospeech’97)
791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker)
C-identification by 22 Dutch subjects
Other useful information:
pronunciation variation (ESCA workshop)
acoustic attributes of prominence (B. Streefkerk)
speech efficiency (post-doc project R. v. Son)
confidence measure
units in speech recognition
- rather than PLU, perhaps syllables (S. Greenberg)
quick adaptation
prosody-driven recognition / understanding
multiple features
Speech efficiency
speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ’96)
less information needed for more predictable things:
- shorter duration and more spectral reduction for high-frequency syllables and words
- C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.): I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
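The information measure I(x) = -log2(Prob(x)) can be computed directly from frequency counts. Below is a minimal sketch, assuming Prob(x) is estimated as a relative frequency in a corpus; the syllable counts are hypothetical and the function name is illustrative, not from the cited paper.

```python
import math

def information_bits(count, total):
    """I(x) = -log2(Prob(x)), with Prob(x) estimated as count / total."""
    if count <= 0 or total <= 0 or count > total:
        raise ValueError("need 0 < count <= total")
    return -math.log2(count / total)

# Hypothetical syllable counts in a 12,000-syllable corpus
for syllable, count in [("frequent", 750), ("rare", 3)]:
    print(f"{syllable}: {information_bits(count, 12_000):.2f} bits")
```

A syllable occurring 750 times in 12,000 (1 in 16) carries 4 bits; the rare one carries about 12 bits, which is the sense in which frequent, predictable items need less acoustic support.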
Correlation between consonant confusion and 4 measures indicated
[Figure: correlation coefficients (y-axis 0 to -0.40) between consonant confusion and four measures: Duration, CoG, I(syllable) and I(word); x-axis labels: Read - Read + Spont - Spont + All; significance marked with + and *]
Adopted from van Son et al. (Proc. ICSLP’98)
Dutch male speaker; 20 min. read/spontaneous speech; 12 k syllables; 8 k words
791 VCV pairs, read/spontaneous
- 308 lex. str.
- 483 unstr.
C identification by 22 subjects
Computational Phonetics (R. Moore, ICPhS’95 Stockholm)
duration modeling
optimal unit selection (like in concatenative synthesis)
pronunciation variation modeling
vowel reduction models
computational prosody
information measures for confusion
speech efficiency models
modulation transfer function for speech
Discussion / Conclusions
speech technology needs further improvement for certain tasks (flexibility, robustness)
phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that
phonetics and speech/language technology should work together more closely, for their mutual benefit
this conference is the ideal platform for that