Flexible, Robust, and Efficient Human Speech Processing Versus Present-day Speech Technology Louis C.W. Pols Institute of Phonetic Sciences / IFOTT University

Flexible, Robust, and Efficient Human Speech Processing Versus Present-day Speech Technology

Louis C.W. Pols

Institute of Phonetic Sciences / IFOTT

University of Amsterdam

The Netherlands

IFA, Herengracht 338

Amsterdam

My pre-predecessor: Louise Kaiser
Secretary of the First International Congress of Phonetic Sciences
Amsterdam, 3-8 July 1932

welcome

Amsterdam ICPhS’32
Jac. van Ginneken, president
L. Kaiser, secretary
A. Roozendaal, treasurer

Subjects:
- physiology of speech and voice (experimental phonetics in its strict meaning)
- study of the development of speech and voice in the individual; their evolution in the history of mankind; the influence of heredity
- anthropology of speech and voice
- phonology
- linguistic psychology
- pathology of speech and voice
- comparative physiology of the sounds of animals
- musicology

136 participants from 16 countries
43 plenary papers
24 demonstrations

Amsterdam ICPhS’32
Some of the participants:

prof. Daniel Jones, London: The theory of phonemes, and its importance in Practical Linguistics

Sir Richard Paget, London: The Evolution of Speech in Men

prof. R.H. Stetson, Oberlin: Breathing Movements in Speech

prof. Prince N. Trubetzkoy, Wien: Charakter und Methode der systematischen phonologischen Darstellung einer gegebenen Sprache

dr. E. Zwirner, Berlin-Buch:
- Phonetische Untersuchungen an Aphasischen und Amusischen
- Quantität, Lautdauerschätzung und Lautkurvenmessung (Theorie und Material)

-----------------------------------------------------------------

2nd, London ’35; 3rd, Ghent ’38; 4th, Helsinki ’61; 5th, Münster ’64;

Overview

Phonetics and speech technology
Do recognizers need ‘intelligent ears’?
What is knowledge?
How good is human/machine speech recognition?
How good is synthetic speech?
Pre-processor characteristics
Useful (phonetic) knowledge
Computational phonetics
Discussion/conclusions

Phonetics <-> Speech Technology

AFFINITY (to / from):

phonetics: source / filter, individuality, context, prosody; human performance, specific knowledge, regularities, multiple features

speech technology: more data, new models, probabilities, speech vs. NLP; EU FPV, DARPA, applications, user orientation, evaluation

Do recognizers need intelligent ears?

intelligent ears: front-end pre-processor
only if it improves performance
humans are generally better speech processors than machines; perhaps system developers can learn from human behavior
robustness at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)

What is knowledge?

phonetic knowledge
probabilistic knowledge from databases
fixed set of features vs. adaptable set
trading relations, selectivity
knowledge of the world, expectation
global vs. detailed

see video (with permission from Interbrew Nederland NV)

Video is a metaphor for:

from global to detail (world -> Europe -> Holland -> North Sea coast -> Scheveningen beach -> young lady drinking Dommelsch beer)
sound -> speech -> speaker -> English -> utterance
‘recognize speech’ or ‘wreck a nice beach’
zoom in on whatever information is available
make intelligent interpretation, given context
beware of distracters!

Human auditory sensitivity

stationary vs. dynamic signals
simple vs. spectrally complex
detection threshold
just noticeable differences
see Table 3 in paper

phenomenon | threshold/jnd | remarks
threshold of hearing | 0 dB at 1000 Hz | frequency dependent
threshold of duration | constant energy at 10 – 300 ms | Energy = Power x Duration
frequency discrimination | 1.5 Hz at 1000 Hz | more when < 200 ms
intensity discrimination | 0.5 – 1 dB | up to 80 dB SL
temporal discrimination | 5 ms at 50 ms | duration dependent
masking | psychophysical tuning curve |
pitch of complex tones | low pitch | many peculiarities
gap detection | 3 ms for wide-band noise | more at low freq. for narrow-band noise
formant frequency | 3 - 5 % | one formant only; < 3 % with more experienced subjects
formant amplitude | 3 dB | F2 in synthetic vowel
overall intensity | 1.5 dB | synthetic vowel, mainly F1
formant bandwidth | 20 - 40 % | one-formant vowel
F0 (pitch) | 0.3 - 0.5 % | synthetic vowel

Detection thresholds and jnd

[Figure: simple, stationary signals (frequency jnd 1.5 Hz) compared with multi-harmonic, single-formant-like periodic signals (F2 jnd 3 - 5 %, BW jnd 20 - 40 %)]

DL for short speech-like transitions

[Figure: DL (axis 0 to 240) against transition duration (20 to 50 ms) for tone glides and for single and complex transitions, in isolation and otherwise; side labels contrast complex vs. simple signals and short vs. longer transitions]

Adapted from van Wieringen & Pols (Acta Acustica ’98)

How good is human / machine speech recognition?

corpus | description | vocabulary size | recognition perplexity | % word error, machine | % word error, human
TI digits | read digits | 10 | 10 | 0.72 | 0.009
alphabet | read letters | 26 | 26 | 5 | 1.6
Resource Management | read sentences | 1,000 | 60-1,000 | 17 | 2
NAB | read sentences | 5,000-unlimited | 45-160 | 6.6 | 0.4
Switchboard CSR | spontaneous telephone conversations | 2,000-unlimited | 80-150 | 43 | 4
Switchboard wordspotting | idem | 20 keywords | - | 31.1 | 7.4

Adapted from Lippmann (SpeCom, 1997)
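To make the size of the gap in the table concrete, the following minimal sketch divides each machine word error rate by the corresponding human one. The numbers are copied from the table; the dictionary layout and the name `error_ratio` are illustrative choices, not part of the original material.

```python
# Word error rates (%) copied from the table adapted from Lippmann (1997).
# Tuple layout (an illustrative choice): (machine, human).
WORD_ERROR = {
    "TI digits": (0.72, 0.009),
    "alphabet": (5.0, 1.6),
    "Resource Management": (17.0, 2.0),
    "NAB": (6.6, 0.4),
    "Switchboard CSR": (43.0, 4.0),
    "Switchboard wordspotting": (31.1, 7.4),
}


def error_ratio(machine, human):
    """Return how many times larger the machine error rate is than the human one."""
    return machine / human


ratios = {corpus: error_ratio(m, h) for corpus, (m, h) in WORD_ERROR.items()}

for corpus, ratio in ratios.items():
    print(f"{corpus:26s} machine/human error ratio: {ratio:6.1f}")
```

On these figures the machine error rate is several times the human one for every corpus; the ratio is smallest for the alphabet task and largest for TI digits.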

How good is human / machine speech recognition?

machine SR surprisingly good for certain tasks
machine SR could be better for many others
- robustness, outliers
what are the limits of human performance?
- in noise
- for degraded speech
- missing information (trading)

Human word intelligibility vs. noise

[Figure: human word intelligibility as a function of noise, with two annotated regions: "recognizers have trouble!" and "humans start to have some trouble"]

Adapted from Steeneken (1992)

Robustness to degraded speech

speech = time-modulated signal in frequency bands
relatively insensitive to (spectral) distortions
- prerequisite for digital hearing aid
- modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
temporal smearing of envelope modulation
- ca. 4 Hz max. in modulation spectrum (syllable)
- LP>4 Hz and HP<8 Hz: little effect on intelligibility
spectral envelope smearing
- for BW>1/3 oct masked SRT starts to degrade

(for references, see paper in Proc. ICPhS’99)
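As a rough illustration of temporal smearing of the envelope, the sketch below smooths two synthetic amplitude envelopes, one modulated at 4 Hz and one at 16 Hz, with a moving average. The moving average is only a crude stand-in for the low-pass filtering used in the studies referred to above; the envelope sampling rate, the modulation depth of 0.8, the 70 ms window, and all function names are assumptions made for this example.

```python
import math

ENV_RATE_HZ = 100  # assumed sampling rate of the amplitude envelope


def make_envelope(mod_hz, seconds=2.0, depth=0.8):
    """Synthetic envelope: a constant level with sinusoidal modulation."""
    n = int(ENV_RATE_HZ * seconds)
    return [1.0 + depth * math.sin(2 * math.pi * mod_hz * i / ENV_RATE_HZ)
            for i in range(n)]


def moving_average(values, window):
    """Smooth with a centred window; only positions with a full window are kept."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / (2 * half + 1)
            for i in range(half, len(values) - half)]


def modulation_depth(envelope):
    """Modulation index (max - min) / (max + min) of an envelope."""
    hi, lo = max(envelope), min(envelope)
    return (hi - lo) / (hi + lo)


WINDOW = 7  # 7 samples at 100 Hz, i.e. about 70 ms of smearing

depth_slow = modulation_depth(moving_average(make_envelope(4.0), WINDOW))
depth_fast = modulation_depth(moving_average(make_envelope(16.0), WINDOW))

print(f"4 Hz modulation depth after smearing:  {depth_slow:.2f}")
print(f"16 Hz modulation depth after smearing: {depth_fast:.2f}")
```

With these assumed settings the 4 Hz modulation, close to the peak of the modulation spectrum, survives largely intact, while the 16 Hz modulation is strongly attenuated.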

Robustness to degraded speech and missing information

partly reversed speech (Saberi & Perrott, Nature, 4/99)
- fixed duration segments time reversed or shifted in time
- perfect sentence intelligibility up to 50 ms (demo: every 50 ms reversed; original)
- low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
- syllable as information unit? (S. Greenberg)
gap and click restoration (Warren)
gating experiments
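The local time reversal mentioned above can be sketched as follows: the signal is cut into fixed-duration segments and each segment is reversed in place. This is a minimal illustration of the manipulation as described on the slide, not the procedure of the cited study; the function name, the toy signal, and the sampling rate are assumptions.

```python
def reverse_segments(samples, sample_rate, segment_ms):
    """Reverse each consecutive fixed-duration segment of a sampled signal.

    samples      sequence of sample values
    sample_rate  samples per second
    segment_ms   segment duration in milliseconds (e.g. 50)
    """
    seg_len = max(1, int(round(sample_rate * segment_ms / 1000.0)))
    out = []
    for start in range(0, len(samples), seg_len):
        # The last segment may be shorter than seg_len; it is reversed as is.
        out.extend(reversed(samples[start:start + seg_len]))
    return out


# Toy signal: 10 samples at an assumed 1000 Hz, reversed in 4 ms segments.
signal = list(range(10))
distorted = reverse_segments(signal, sample_rate=1000, segment_ms=4)
print(distorted)
```

Applying the same function a second time with the same segment duration restores the original order, which is a convenient sanity check when preparing such stimuli.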

How good is synthetic speech?

good enough for certain applications
could be better in most others
evaluation: application-specific or multi-tier required
interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998

Workshop evaluation procedure

participants as native listeners
DARPA-type procedures in data preparations
balanced listening design
no detailed results made public
3 text types
- newspaper sentences
- semantically unpredictable sentences
- telephone directory entries
42 systems in 8 languages tested

Screen for newspaper sentences

Some global results

it worked!, but many practical problems (for demo see http://www.fon.hum.uva.nl)
this seems the way to proceed and to expand
global rating (poor to excellent)
- text analysis, prosody & signal processing
and/or more detailed scores
transcriptions subjectively judged
- major/minor/no problems per entry
web site access of several systems (http://www.ldc.upenn.edu/ltts/)

Phonetic knowledge to improve speech synthesis

(suppose concatenative synthesis)

control emotion, style, voice characteristics
perceptual implications of
- parameterization (LPC, PSOLA)
- discontinuities (spectral, temporal, prosody)
improve naturalness (prosody!)
active adaptation to other conditions
- hyper/hypo, noise, comm. channel, listener impairment
systematic evaluation

Desired pre-processor characteristics in Automatic Speech Recognition

basic sensitivity for stationary and dynamic sounds
robustness to degraded speech
- rather insensitive to spectral and temporal smearing
robustness to noise and reverberation
filter characteristics
- are BP, PLP, MFCC, RASTA, TRAPS good enough?
- lateral inhibition (spectral sharpening); dynamics
what can be neglected?
- non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.

Caricature of present-day speech recognizer

trained with a variety of speech input
- much global information, no interrelations
monaural, uni-modal input
pitch extractor generally not operational
performs well on average behavior
- does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)
neglects lots of useful (phonetic) information
heavily relies on language model

Useful (phonetic) knowledge neglected so far

pitch information
(systematic) durational variability
spectral reduction/coarticulation (other than multiphone)
intelligent selection from multiple features
quick adaptation to speaker, style & channel
communicative expectations
multi-modality
binaural hearing

Useful information: durational variability

[Figure: hierarchical breakdown of the duration of the vowel /iy/ by the factors R, S, Lw and Lu, giving count, mean and s.d. per factor level. Root /iy/: count 4626, mean 95, s.d. 39. Factor R, levels 0 / 1 / 2: counts 1544 / 1588 / 1494, means 83 / 95 / 109, s.d. 31 / 36 / 46. The values for the deeper levels (S, Lw, Lu) are not recoverable from this extraction.]

Adapted from Wang (1998)

Useful information: durational variability

[Figure: the same hierarchical breakdown of /iy/ duration by the factors R, S, Lw and Lu, now with annotated mean values]

overall average = 95 ms
normal rate = 95
primary stress = 104
word final = 136
utterance final = 186

Adapted from Wang (1998)
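The annotated means can be turned into lengthening factors relative to the overall average. The sketch below assumes that the four annotated values are mean /iy/ durations in ms for increasingly specific conditions; the variable names are illustrative.

```python
# Mean /iy/ durations (ms) as annotated on the slide (adapted from Wang, 1998).
OVERALL_MEAN_MS = 95
ANNOTATED_MEANS_MS = [
    ("normal rate", 95),
    ("primary stress", 104),
    ("word final", 136),
    ("utterance final", 186),
]

# Lengthening factor of each annotated condition relative to the overall mean.
lengthening = {label: mean / OVERALL_MEAN_MS for label, mean in ANNOTATED_MEANS_MS}

for label, factor in lengthening.items():
    print(f"{label:16s} {factor:.2f} x overall average")
```

On these numbers the utterance-final mean is roughly twice the overall average, a systematic effect of the kind that a recognizer treating duration as noise would miss.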

Useful information: V and C reduction, coarticulation

spectral variability is not random but, at least partly, speaker-, style-, and context-specific
read - spontaneous; stressed - unstressed
not just for vowels, but also for consonants
- duration
- spectral balance
- intervocalic sound energy difference
- F2 slope difference
- locus equation

[Figure, two panels, each comparing Read and Spontaneous speech for Stressed, Unstressed and Total segments:
- Mean consonant duration (C-duration), axis 45 to 65 ms; p = 0.001, 0.006, 0.001
- Mean error rate for C identification (C error rate), axis 0 to 35 %; p = 0.001, 0.001, 0.001]

Adapted from van Son & Pols (Eurospeech’97)

791 VCV pairs (read & spontan.; stressed & unstr. segments; one male)

C-identification by 22 Dutch subjects

Other useful information:

pronunciation variation (ESCA workshop)
acoustic attributes of prominence (B. Streefkerk)
speech efficiency (post-doc project R. v. Son)
confidence measure
units in speech recognition
- rather than PLU, perhaps syllables (S. Greenberg)
quick adaptation
prosody-driven recognition / understanding
multiple features

Speech efficiency

speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ’96)
less information needed for more predictable things:
- shorter duration and more spectral reduction for high-frequency syllables and words
- C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.)

I(x) = -log2(Prob(x)) in bits

(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
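The information measure I(x) = -log2(Prob(x)) can be estimated from relative frequencies in a corpus. The sketch below is a minimal illustration under that assumption; the toy token list is hypothetical and is not the corpus used in the cited study.

```python
import math
from collections import Counter


def information_content(counts):
    """Return I(x) = -log2(Prob(x)) in bits, with Prob(x) the relative frequency."""
    total = sum(counts.values())
    return {x: -math.log2(n / total) for x, n in counts.items()}


# Hypothetical toy corpus of word tokens (an assumption for illustration only).
tokens = ["de", "de", "de", "de", "van", "van", "spraak", "herkenning"]
info = information_content(Counter(tokens))

for word, bits in sorted(info.items(), key=lambda item: item[1]):
    print(f"{word:12s} {bits:.1f} bits")
```

Frequent items carry few bits and rare items carry many, which is the sense in which more predictable syllables and words need less acoustic information.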

Correlation between consonant confusion and 4 measures indicated

[Figure: correlation coefficient (axis 0 to -0.40) between consonant confusion and four measures (Duration, CoG, I(syllable), I(word)) for the conditions Read -, Read +, Spont -, Spont + and All; bars carry + and * significance markers]

Adapted from van Son et al. (Proc. ICSLP’98)

Dutch male sp.
20 min. R/S
12 k syll.
8 k words
791 VCV R/S
- 308 lex. str.
- 483 unstr.
C ident. 22 Ss

Computational Phonetics (R. Moore, ICPhS’95 Stockholm)

duration modeling
optimal unit selection (like in concatenative synthesis)
pronunciation variation modeling
vowel reduction models
computational prosody
information measures for confusion
speech efficiency models
modulation transfer function for speech

Discussion / Conclusions

speech technology needs further improvement for certain tasks (flexibility, robustness)

phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that

phonetics and speech/language technology should work together more closely, for their mutual benefit

this conference is the ideal platform for that
