SAI User-System Interaction u1, Speech in the Interface: 3. Speech input and output technology

Page 1:

Module u1: Speech in the Interface

3: Speech input and output technology

Jacques Terken

Page 2:

contents

Speech input technology
– Speech recognition
– Language understanding
– Consequences for design

Speech output technology
– Language generation
– Speech synthesis
– Consequences for design

Project

Page 3:

Components of conversational interfaces

(Diagram) Speech input → Noise suppression → Speech recognition → Natural Language Analysis → Dialogue Manager → Language Generation → Speech Synthesis, with the Dialogue Manager connected to the Application

Page 4:

Speech recognition

Advances have come both from progress in speech and language engineering and from advances in computer technology (increases in CPU power)

Page 5:

Developments

(Chart: progress from 1980 through 1990 to 2000 along two axes)

Vocabulary size (number of words): 2 – 20 – 200 – 2000 – 20000 – unrestricted
Speaking style: isolated words – connected speech – read speech – fluent speech – spontaneous speech

Applications, roughly by increasing difficulty: voice commands, digit strings, word spotting, name dialing, form fill by voice, directory assistance, office dictation, system-driven dialogue, transcription, 2-way dialogue, network agents & intelligent messaging, natural conversation

Page 6:

State of the art

Page 7:

Why is generic speech recognition so difficult?
– Variability of the input, due to many different sources
– Understanding requires vast amounts of world knowledge and common-sense reasoning for the generation and pruning of hypotheses

Dealing with this variability, and storing and accessing that world knowledge, outreaches the possibilities of current technology

Page 8:

Sources of variation

– Speaker: voice quality, pitch, gender, dialect
– Speaking style: stress/emotion, speaking rate, Lombard effect
– Task/Context: man-machine dialogue, dictation, free conversation, interview
– Phonetic/prosodic context
– Noise: other speakers, background noise, reverberation
– Microphone: distortion, electrical noise, directional characteristics
– Channel: distortion, noise, echoes, dropouts

All of these feed into the speech recognition system

Page 9:

No generic speech recognizer
The idea of a generic speech recognizer has been given up (for the time being); automatic speech recognition is possible only by virtue of self-imposed limitations:
– vocabulary size
– multiple vs single speaker
– real-time vs offline
– recognition vs understanding

Page 10:

Speech recognition systems
Relevant dimensions:
– speaker-dependent vs speaker-independent
– vocabulary size
– grammar: fixed grammar vs probabilistic language model

There is a trade-off between the dimensions in terms of performance: the choice of technology is determined by the application requirements

Page 11:

Command and control
Examples: controlling functionality of a PC or PDA; controlling consumer appliances (stereo, TV, etc.)
– Individual words and multi-word expressions: “File”, “Edit”, “Save as webpage”, “Columns to the left”
– Speaker-independent: no training needed before use
– Limited vocabulary gives high recognition performance
– Fixed-format expressions (defined by a grammar)
– Real-time
– The user needs to know which items are in the vocabulary and which expressions can be used
– (Usually) not customizable

Page 12:

Information services
Examples: train travel information, integrated trip planning
– Continuous speech
– Speaker-independent: multiple users
– Mid-size vocabulary, typically fewer than 5000 words
– Flexibility of input: an extensive grammar that can handle expected user inputs
– Requires interpretation
– Real-time

Page 13:

Dictation systems
– Continuous speech
– Speaker-dependent: requires training by the user
– (Almost) unrestricted input: large vocabulary, > 200,000 words
– Probabilistic language model instead of a fixed grammar
– No understanding, just recognition
– Off-line (but near-online performance possible, depending on system properties)

Page 14:

State of the art ASR: statistical approach
Two phases:
– Training: creating an inventory of acoustic models and computing transition probabilities
– Testing (classification): mapping the input onto the inventory

Page 15:

Writing vs speech
Writing groups by spelling: {see} {eat, break} {lake}
Speaking groups by sound: {si:, i:t} {brek, lek}
– Alphabetic languages: approx. 25 signs
– Average language: approximately 40 sounds
– Phonetic alphabet: a 1:1 mapping between character and sound

Page 16:

Speech and sounds
(Figure: waveform and spectrogram of “How are you”)
Speech is made up of non-discrete events

Page 17:

Representation of the speech signal
– Sounds are coded as successions of states (one state every 10-30 ms)
– States are represented by acoustic vectors
(Figure: two spectrogram-like panels, frequency against time)

Page 18:

Acoustic models
– An inventory of elementary probabilistic models of basic linguistic units, e.g. phonemes
– Words are stored as networks of elementary models
(Figure: a left-to-right network of states, each carrying its own probability density function (pdf))

Page 19:

Training of acoustic models
– Compute acoustic vectors and transition probabilities from large corpora
– Each state holds statistics concerning parameter values and parameter variation
– The larger the amount of training data, the better the estimates of parameter values and variation
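The “statistics per state” idea can be sketched as follows; this is a toy illustration, not from the slides: one scalar acoustic parameter per state, with made-up values, whereas real systems use multi-dimensional vectors and mixture densities.

```python
import statistics

# Hypothetical training data: acoustic parameter values observed per state
training_frames = {
    "state1": [1.0, 1.2, 0.9, 1.1],
    "state2": [3.0, 2.8, 3.3],
}

# Each state keeps statistics of the observed values (here: mean and spread);
# more training data gives better estimates of both
models = {state: (statistics.mean(values), statistics.stdev(values))
          for state, values in training_frames.items()}
```

With more training frames per state, the mean and spread estimates stabilize, which is the point made on the slide.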

Page 20:

Language model
1. Defined by grammar
– Grammar: rules for combining words into sentences (defining the admissible strings of the language)
– The basic unit of analysis is the utterance/sentence
– A sentence is composed of words representing word classes, e.g.
• determiner: the
• noun: boy
• verb: eat

Page 21:

noun: boy; verb: eat; determiner: the

rule 1: noun_phrase → det n
rule 2: sentence → noun_phrase verb

Morphology: base forms vs derived forms
– eat: stem, 1st person singular
– stem + s: 3rd person singular
– stem + en: past participle
– stem + er: substantive (noun)

the boy eats
*the eats
*boy eats
*eats the boy
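The two rules above can be illustrated with a minimal sketch (tag-sequence checking only; a real parser also handles recursion, morphology, and agreement):

```python
# Minimal illustration of rule 1 (noun_phrase -> det n) and
# rule 2 (sentence -> noun_phrase verb); lexicon entries from the slide
lexicon = {"the": "det", "boy": "n", "eats": "verb"}

def is_noun_phrase(tags):
    # rule 1: a noun phrase is a determiner followed by a noun
    return tags == ["det", "n"]

def is_sentence(words):
    tags = [lexicon.get(w) for w in words]
    # rule 2: a sentence is a noun phrase followed by a verb
    return is_noun_phrase(tags[:-1]) and tags[-1:] == ["verb"]

assert is_sentence("the boy eats".split())      # grammatical
assert not is_sentence("the eats".split())      # *the eats
assert not is_sentence("boy eats".split())      # *boy eats
assert not is_sentence("eats the boy".split())  # *eats the boy
```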

Page 22:

2. Statistical language model
– Probabilities for words and transition probabilities for word sequences in a corpus:
• unigram: probability of individual words
• bigram: probability of a word given the preceding word
• trigram: probability of a word given the two preceding words
– Training materials: language corpora (journal articles; application-specific material)
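A sketch of how unigram and bigram probabilities are estimated from a corpus by relative-frequency counting (the five-word corpus here is made up purely for illustration):

```python
from collections import Counter

corpus = "the boy eats the apple".split()  # toy corpus

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    # probability of an individual word
    return unigram_counts[w] / len(corpus)

def p_bigram(w, prev):
    # probability of a word given the preceding word:
    # count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

assert p_unigram("the") == 2 / 5
assert p_bigram("boy", "the") == 1 / 2
```

Trigrams follow the same pattern with triples instead of pairs; real systems also smooth these counts to handle unseen sequences.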

Page 23:

Recognition / classification

(Figure: decoder architecture)
Speech input → acoustic analysis → acoustic vectors x1 ... xT
Global search: maximize P(x1 ... xT | w1 ... wk) · P(w1 ... wk) over word sequences w1 ... wk
– P(x1 ... xT | w1 ... wk) comes from the phoneme inventory and pronunciation lexicon (acoustic model)
– P(w1 ... wk) comes from the language model
Output: recognized word sequence

Page 24:

Compute the probability of a sequence of states, given the probabilities of the states, the probabilities of transitions between states, and the language model
This gives the best path
Usually not just the best path but an n-best list is produced for further processing
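The search can be pictured with a toy example: exhaustive enumeration over a two-state model with made-up probabilities. Real recognizers use the Viterbi algorithm and beam search rather than brute force, but the quantity being maximized is the same.

```python
import itertools

states = ["A", "B"]
start = {"A": 0.5, "B": 0.5}                  # initial state probabilities
trans = {("A", "A"): 0.7, ("A", "B"): 0.3,    # transition probabilities
         ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = [{"A": 0.9, "B": 0.2},                 # per-frame state likelihoods
        {"A": 0.1, "B": 0.8}]

def path_prob(path):
    # probability of a state sequence: start * emission * transitions
    p = start[path[0]] * emit[0][path[0]]
    for t in range(1, len(path)):
        p *= trans[(path[t - 1], path[t])] * emit[t][path[t]]
    return p

# Sorting all paths best-first yields an n-best list; its head is the best path
nbest = sorted(itertools.product(states, repeat=len(emit)),
               key=path_prob, reverse=True)
assert nbest[0] == ("A", "B")
```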

Page 25:

Caveats
– Properties of the acoustic models are strongly determined by the recording conditions: recognition performance depends on the match between recording conditions and run-time conditions
– Use of a language model induces a word bias: for words outside the vocabulary the best-matching in-vocabulary word is selected
• Solution: use a garbage model

Page 26:

Advances
Confidence measures for recognition results
– based on acoustic similarity
– or based on actual confusions for a database
– or taking into consideration the acoustic properties of the input signal
Dynamic (state-dependent) loading of the language model
Parallel recognizers
– e.g. In-Vehicle Information Systems (IVIS): separate recognizers for the navigation system, entertainment systems, mobile phone, general purpose input
– choice on the basis of confidence scores
Further developments
– parallel recognizer for hyper-articulate speech

Page 27:

State-of-the-art performance
– 98 - 99.8 % correct for small-vocabulary speaker-independent recognition
– 92 - 98 % correct for speaker-dependent large-vocabulary recognition
– 50 - 70 % correct for speaker-independent mid-size vocabulary

Page 28:

Recognition of prosody
– Observable manifestations: pitch, temporal properties, silence
– Function: emphasis, phrasing (e.g. through pauses), sentence type (question/statement), emotion, etc.
– Relevant to understanding/interpretation, e.g.:
Mary knows many languages you know
Mary knows many languages, you know
– Influence on the realisation of phonemes: prosody used to be considered noise, but it contains relevant information

Page 29:

contents

Speech input technology
– Speech recognition
– Language understanding
– Consequences for design

Speech output technology
Consequences for design
Project

Page 30:

Natural language processing
Full parse or keyword spotting (concept spotting)
Keyword spotting:
<any> keyword <any>
e.g. <any> $DEPARTURE <any> $DESTINATION <any>
can handle:
– Boston New York
– I want to go from Boston to New York
– I want a flight leaving at Boston and arriving at New York
Semantics (the mapping onto functionality) can be specified in the grammar
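The <any> $DEPARTURE <any> $DESTINATION <any> pattern can be mimicked with a regular expression. This is only a sketch: the two-city vocabulary is made up, and a real system would also use phrasing cues to tell “from X to Y” apart from “to Y from X”.

```python
import re

CITY = r"(Boston|New York)"  # tiny stand-in for the $DEPARTURE/$DESTINATION slots
PATTERN = re.compile(rf".*?{CITY}.*?{CITY}.*")  # <any> CITY <any> CITY <any>

def spot(utterance):
    # Returns (departure, destination) if two city keywords are found in order
    m = PATTERN.match(utterance)
    return (m.group(1), m.group(2)) if m else None

assert spot("Boston New York") == ("Boston", "New York")
assert spot("I want to go from Boston to New York") == ("Boston", "New York")
assert spot("I want a flight leaving at Boston and arriving at New York") \
       == ("Boston", "New York")
```

All three utterance styles from the slide map onto the same two concepts, which is exactly what makes concept spotting robust to filler words.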

Page 31:

contents

Speech input technology
– Speech recognition
– Language understanding
– Consequences for design

Speech output technology
Consequences for design
Project

Page 32:

Coping with technological shortcomings of ASR
Shortcomings:
– reliability/robustness
– architectural complexity of an “always open” system
– lack of transparency in case of input limitations

Task for the design of speech interfaces: induce the user to modify their behaviour to fit the requirements (restrictions) of the technology

Page 33:

Solutions
“Always open” ideal:
– push-to-talk button: recognition window; “spoke-too-soon” problem
– barge-in (requires echo cancellation, which may be complicated depending on the reverberation properties of the environment)
Make training conditions (properties of the training corpus) similar to test conditions
– e.g. special corpora for the car environment

Page 34:

Good prompt design gives clues about the required input:

Response                      | “Will you accept the call” | “Say yes if you accept the call, otherwise say no”
Isolated yes or no            | 54.5 %                     | 80.8 %
Multiword yes or no           | 24.2 %                     | 5.7 %
Other affirmative or negative | 10.7 %                     | 3.4 %
Inappropriate                 | 10.4 %                     | 10.2 %

Page 35:

contents

Speech input technology: consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design
Project

Page 36:

Components of conversational interfaces

(Diagram) Speech recognition → Natural Language Analysis → Dialogue Manager → Language Generation → Speech Synthesis, with the Dialogue Manager connected to the Application

Page 37:

demos

http://www.ims.uni-stuttgart.de/~moehler/synthspeech/examples.html
http://www.research.att.com/~ttsweb/tts/demo.php
http://www.acapela-group.com/text-to-speech-interactive-demo.html
http://cslu.cse.ogi.edu/tts/

Audiovisual speech synthesis:

http://www.speech.kth.se/multimodal/

http://mambo.ucsc.edu/demos.html

Emotional synthesis (Janet Cahn):

http://xenia.media.mit.edu/%7Ecahn/emot-speech.html

Page 38:

Applications
Information access by phone
– news / weather, timetables (OVR), reverse directory, name dialling
– spoken e-mail, etc.
Customer ordering by phone (call centers)
– IVR: ASR replaces tedious touch-tone actions
Car driver information by voice
– navigation, car traffic info (RDS/TMC), command & control (VODIS)

Page 39:

Interfaces for the disabled
– MIT/DECtalk (Stephen Hawking)
In the office and at home (near future?):
– command & control, navigation for home entertainment

Page 40:

Output technology

(Diagram) Dialogue Manager → Language Generation → Speech Synthesis, with the Dialogue Manager connected to applications, e.g. e-mail or an information service

Page 41:

Language generation

Eindhoven → Amsterdam CS
Departure: 08:32  08:47  09:02  09:17  09:32
Arrival:   09:52  10:10  10:22  10:40  10:52
Transfers: 0      1      0      1      0

Page 42:

If $nr_of_records > 1:

I have found $n connections:

The first connection leaves at $time_dep from $departure and arrives at $time_arr at $destination

The second connection leaves at $time_dep from $departure and arrives at $time_arr at $destination
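Template filling of this kind is only a few lines of code; a sketch using the slide's variable names and the first two connections from the Eindhoven - Amsterdam CS timetable:

```python
from string import Template

TMPL = Template("The $ordinal connection leaves at $time_dep from $departure "
                "and arrives at $time_arr at $destination")

records = [  # first two connections from the timetable slide
    {"ordinal": "first", "time_dep": "08:32", "departure": "Eindhoven",
     "time_arr": "09:52", "destination": "Amsterdam CS"},
    {"ordinal": "second", "time_dep": "08:47", "departure": "Eindhoven",
     "time_arr": "10:10", "destination": "Amsterdam CS"},
]

lines = []
if len(records) > 1:  # the "If $nr_of_records > 1" condition
    lines.append(f"I have found {len(records)} connections:")
lines += [TMPL.substitute(r) for r in records]
```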

Page 43:

If the user also wants information about whether there are transfers, either other templates have to be used, or templates might be composed from template elements

Page 44:

Speech output technologies
Canned (pre-recorded) speech
– suited for call centers, IVR
– fixed messages/announcements

Page 45:

Concatenation of pre-recorded phrases
Suited for bank account information, database enquiry systems with structured data, and the like
Template-based, e.g.
– “your account is <$account>”
– “the flight from <$departure> to <$destination> leaves at <$date> at <$time> from <$gate>”
– “the number of $customer is $telephone_number”
Requirements: a database of phrases to be concatenated
Some knowledge of speech science is required:
– words are pronounced differently depending on emphasis, position in the utterance, and type of utterance
– the differences concern both pitch and temporal properties (prosody)

Page 46:

Compare different realisations of Amsterdam in:
– Do you want to go to Amsterdam? (emphasis, question, utterance-final)
– I want to go to Amsterdam (emphasis, statement, utterance-final)
– Are there two stations in Amsterdam? (no emphasis, question, utterance-final)
– There are two stations in Amsterdam (no emphasis, statement, utterance-final)
– Do you want to go to Amsterdam Central Station? (no emphasis, question, utterance-medial)
Solution:
– have the words pronounced in context to obtain different tokens
– apply clever splicing techniques for smooth concatenation

Page 47:

Text-to-speech conversion (TTS)
Suited for unrestricted text input: all kinds of text
– reading e-mail, fax (in combination with optical character recognition)
– information retrieval for unstructured data (preferably in combination with automatic summarisation)
Utterances are made up by concatenation of small units plus post-processing for prosody, or by concatenation of variable-size units

Page 48:

TtS technology
Distinction between
– linguistic pre-processing and
– synthesis
Linguistic pre-processing:
– grapheme-phoneme conversion: mapping written text onto a phonemic representation, including word stress
– prosodic structure (emphasis, boundaries including pauses)

Page 49:

TtS: linguistic pre-processing: grapheme-phoneme conversion
To determine how a word is pronounced:
– consult a lexicon, containing
• a phoneme transcription
• syllable boundaries
• word accent(s)
– and/or develop pronunciation rules
– Output:
Enschede → . 'En-sx@-de .
Kerkrade → . 'kErk-ra-d@ .
's-Hertogenbosch → . sEr-to-x@n-'bOs .

Page 50:

Pros and cons of a lexicon
– phoneme transcriptions are accurate
– (high) risk of out-of-vocabulary words, because the lexicon:
• often contains only stems, no inflections or compounds
• is never up to date / complete
– but usually the application includes a user lexicon

Page 51:

Pros and cons of pronunciation rules
– no out-of-vocabulary words
– transcription results are often wrong for
• (longer) combinations of words / morphemes
• exceptions and loan words from other languages
The best solution is a combination of the two methods:
– develop a list of words incorrectly transcribed by the rules and put these words in an exception lexicon
– words not occurring in the exception list are then transcribed by rule

Page 52:

Complications
– Words with the same written form but different pronunciations and meanings ('record vs re'cord): requires parsing or a statistical approach
– Proper names and other specialized vocabularies, acronyms/abbreviations (small announcements in journals!): need to be included in a (user) lexicon
– Different kinds of numbers (telephone numbers, amounts, credit card numbers, etc.): require number grammars

Page 53:

TtS: linguistic pre-processing: prosody
– Emphasis, boundaries (including pauses), sentence type
– Observable manifestations: pitch, temporal properties, silence
– Requires analysis of linguistic structure (parsing) and (ideally) discourse-level information (cf. the earlier “Amsterdam” example)

Page 54:

TtS: synthesis
Concatenation from words and phrases is practically impossible:
– the database would be too large (especially if several versions of each word are needed), and
– there is no full coverage (out-of-vocabulary words)
Approaches:
– sub-word units
– data-oriented approach

Page 55:

Synthesis by subword units
Common approach: diphone synthesis
Linking together pre-recorded diphones, i.e. short segments (transitions between two successive phonemes) extracted from natural speech:
's-Hertogenbosch
phonemes: . s E r t o x @ n b O s .
diphones: .s sE Er rt to ox x@ @n nb bO Os s.
In all, about 1600 transitions per language (40 * 40)
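Deriving the diphone sequence from a phoneme string is a simple pairing of neighbours, including the utterance boundary ".":

```python
def to_diphones(phonemes):
    # Pad with the utterance boundary "." and pair each unit with its successor
    units = ["."] + phonemes + ["."]
    return [a + b for a, b in zip(units, units[1:])]

# 's-Hertogenbosch, phonemes . s E r t o x @ n b O s .
diphones = to_diphones(list("sErtox@nbOs"))
assert diphones == [".s", "sE", "Er", "rt", "to", "ox",
                    "x@", "@n", "nb", "bO", "Os", "s."]
```

A string of n phonemes yields n + 1 diphones, which is why an inventory of roughly 40 * 40 transitions covers a whole language.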

Page 56:

Synthesis:
– concatenate the diphones in the correct order
– perform some (intensity) smoothing at the diphone borders
– adjust phoneme duration and pitch course according to prosody rules

Page 57:

Data-oriented approach
Generalization of the diphone approach
Store a large database of speech (running text)
At run time:
– generate a structure representing the phoneme sequence and the prosodic properties needed
– search algorithm: find the largest possible fragments in the database containing the required properties

Page 58:

Example: fragments from frei-burg and nürn-berg can be combined into frei-berg
/fr/ can also serve for Fr-iedrichshafen: items in the database are re-usable
Concatenate the fragments as they are, without post-processing for pitch and duration
In this way, not only phoneme parameters and transitions are taken from the data, but also pitch and temporal properties
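A greedy sketch of the "largest possible fragment" search; the database here is a made-up set of orthographic fragments standing in for annotated speech, and real unit selection additionally scores prosodic match and join costs rather than just taking the longest prefix.

```python
def cover(target, database):
    # Repeatedly take the longest database fragment that matches a prefix
    # of what remains of the target
    fragments, rest = [], target
    while rest:
        for length in range(len(rest), 0, -1):
            if rest[:length] in database:
                fragments.append(rest[:length])
                rest = rest[length:]
                break
        else:
            raise ValueError("no fragment covers: " + rest)
    return fragments

# Fragments harvested from "freiburg" and "nürnberg" cover "freiberg"
DB = {"frei", "burg", "nürn", "berg", "fr", "ei", "b", "e", "r", "g"}
assert cover("freiberg", DB) == ["frei", "berg"]
```

The fewer, longer fragments the search finds, the fewer concatenation points remain, which is what preserves the natural quality discussed on the next slide.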

Page 59:

Advantage: natural speech quality preserved (but may not always be desirable: maybe it should be made clear to people that they are talking to a system)

Disadvantage: no explicit control of voice characteristics and prosodic characteristics such as pitch and speaking rate (which you might want to manipulate for synthesis of emotional speech or conveying a certain personality)

Page 60

Difficult or impossible to modify speaker characteristics

Other speaker: new database required

Other speaking style: new database required

Research topic: post-processing of the result while preserving speech quality

Page 61

Hybrid synthesis
– combination of phrase concatenation and TTS
– suited for template-based synthesis with a fixed message structure and variable slots

“the flight from <$departure> to <$destination> leaves on <$date> at <$time> from <$gate>”

In dialogue systems the system has knowledge of the message structure and can select the proper tokens from the database on the basis of this knowledge
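The slot-filling idea can be sketched as follows. The template, slot names and fallback marker are hypothetical, and a real system would concatenate recorded audio tokens rather than strings, falling back to open-domain TTS only for slot values with no recording:

```python
def fill_template(template, slots, token_db):
    """Hybrid synthesis sketch: a fixed carrier phrase with variable
    slots; each slot value is looked up among pre-recorded tokens,
    and marked for TTS when no recording exists."""
    parts = []
    for piece in template:
        if piece.startswith("$"):
            value = slots[piece[1:]]
            parts.append(value if value in token_db else f"<tts:{value}>")
        else:
            parts.append(piece)
    return " ".join(parts)

template = ["the flight from", "$departure", "to", "$destination",
            "leaves on", "$date", "at", "$time", "from gate", "$gate"]
slots = {"departure": "Amsterdam", "destination": "London",
         "date": "May 3rd", "time": "9:15", "gate": "D7"}
token_db = {"Amsterdam", "London", "May 3rd", "9:15"}
print(fill_template(template, slots, token_db))
# → the flight from Amsterdam to London leaves on May 3rd at 9:15 from gate <tts:D7>
```

Because the dialogue system knows the message structure, it never has to guess which database token belongs in which slot.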

Page 62

Future: markup languages

Structured text: current TTS systems strip text annotations (plain ASCII is the standard input)

Draft proposals exist for XML-based markup for synthesis, e.g. SALT

Page 63

Contents

Speech input technology
Consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design
Project

Page 64

Issues in comprehension: speech quality

Reduced quality slows down feature extraction and the mapping of the input onto feature vectors: it increases the number of candidate matching vectors.

This requires compensation by top-down processing, which takes more time, effort and practice.

“Text-to-speech”, but written text is often difficult to understand when read aloud, due to complex structures, high information density, etc.

Page 65

Application to synthetic speech: the substandard quality of synthetic speech requires compensation by (resource-limited) top-down processing

Potential overload of the (human) processing system due to time constraints

Slowing down the speaking rate is a very effective way to give the listener more processing time

Page 66

Case study: picking up information from speech

Study on auditory exploration of lists (Pitt & Edwards, 1997): recall of a list of 48 file names
– presented in groups
– group size varied (2, 3 or 4)
– presentation of groups listener-paced
– recall immediately after each group

Page 67

Number of filenames per group    % correct recall
2                                76.7
3                                49.7
4                                20.5

Page 68

Adjustments: analysis of “list” speaking style

Grouping principles:
• always try to group
• grouping by filenames and extensions
• large groups first
• mnemonic links between groups

Prosodic structuring
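The first two grouping principles (group by shared name, large groups first) can be sketched like this; the file names and the choice of the base name as grouping key are made up for illustration:

```python
from itertools import groupby

def group_filenames(filenames):
    """Sketch of the grouping principles: group files by shared base
    name, then order the groups largest-first, so that related items
    are spoken together."""
    key = lambda f: f.split(".")[0]
    # groupby requires its input to be sorted on the grouping key
    groups = [list(g) for _, g in groupby(sorted(filenames, key=key), key)]
    return sorted(groups, key=len, reverse=True)

files = ["parse.c", "parse.h", "parse.o", "main.c", "main.o", "util.c"]
print(group_filenames(files))
# → [['parse.c', 'parse.h', 'parse.o'], ['main.c', 'main.o'], ['util.c']]
```

The prosodic structuring would then mark the group boundaries with pauses and pitch resets when the list is spoken.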

Page 69

Evaluation: a directory with four subdirectories, each containing files with four different names, corresponding to the modules of a programming project, and with three different extensions

Task: find the most recent version of the four files containing the source code for the modules and copy them into a new directory

Measures: objective (task completion time) and subjective

Results for task completion: new algorithm 10.39 min, old version 24.12 min

Page 70

Contents

Speech input technology
Consequences for design
Speech output technology
– Technology
– Human factors in speech understanding
– Consequences for design

Project

Page 71

Design implications

The choice of technology can be made dependent on the needs of the application.

For restricted domains, very high quality can be achieved through canned speech or phrase concatenation with multiple tokens.

For concatenation with unit selection there is a relation between quality and the size of the database; for good systems there is usually no problem with intelligibility, even with inexperienced listeners.

Page 72

High quality for diphone speech, needed for uncommon forms such as proper names or company names that are unlikely to be available from a corpus, still requires much effort.

Importance of learning effects.

A general finding is that acceptance of synthetic speech depends strongly on voice quality: if the trade-off between quality and added value is negative, the prospects for acceptance of the speech interface are poor.

Page 73

Speech as output modality: speech vs. text/graphics

Text/graphics:
– an image [may be] worth a thousand words
– image/written text is persistent
– image is (at least) two-dimensional: temporal and spatial organisation
– visual expression of hierarchical structure
– receiver-paced
– but non-adaptive (until recently); now: adaptive hypertext

Page 74

Speech:
– one-dimensional: extends only in time; spatial issues are better dealt with in another modality
– sender-paced
– poor medium, yet popular
• a large amount of speech-based communication serves primarily a social function
– no need for supporting aids such as paper and pen
– no special motoric abilities needed
– speaking is fast

Page 75

Heuristics: speech output is preferred when
• the message is simple
• the message is short
• the message need not be referred to later
• the message deals with events in time
• the message requires an immediate response
• the visual channels are overloaded
• the environment is brightly lit, poorly lit, subject to severe vibration, or otherwise adverse to the transmission of visual information
• the user must be free to move around
• pronunciation is the subject of interaction

(from Michaelis & Wiggins)

Page 76

But speech output is preferably not used when
• the message is complex or uses unfamiliar terms
• the message is long
• the message needs to be referred to later
• the message deals with spatial topics
• the message is not urgent
• the auditory channels are overloaded
• the environment is too noisy
• the user has easy access to a screen
• the system output consists of many different kinds of information which must be available simultaneously and be monitored and acted upon by the user

(from Michaelis & Wiggins)

Environmental variation and “mixed” interaction call for multimodal interfaces
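As a rough illustration, the two heuristic lists can be encoded as a simple vote count over message properties. The property names, the equal weighting and the threshold are choices made here for the sketch, not part of Michaelis & Wiggins' formulation:

```python
def prefer_speech_output(msg):
    """Toy decision sketch over the design heuristics: count the
    reasons for and against speech output for a message, given as a
    dict of boolean properties (illustrative names)."""
    for_speech = ["simple", "short", "transient", "temporal",
                  "urgent", "visual_channel_loaded", "user_mobile"]
    against = ["complex", "long", "needs_later_reference", "spatial",
               "noisy_environment", "screen_available"]
    score = sum(msg.get(k, False) for k in for_speech) \
          - sum(msg.get(k, False) for k in against)
    return score > 0

print(prefer_speech_output({"short": True, "urgent": True,
                            "screen_available": True}))
# → True
```

In a real multimodal interface these factors would be weighted and some (e.g. a noisy environment) would act as hard constraints rather than votes.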

Page 77

Contents

Speech input technology
Consequences for design
Speech output technology
Main points
Project

Page 78

Main points

The database approach, requiring large databases for individual languages and speaking styles, is dominant both for speech input and output:
– Input: databases for training the acoustic models and the language model
– Output: concatenation of segments and phrases taken from a database

Large differences in the performance of speech recognition and the quality of output across languages and target groups (e.g. recognition for children)

Speech input: three major classes of applications: command & control, information services, dictation systems

Major parameters: speaker-dependent vs. speaker-independent, vocabulary size (small, medium, large), rigid vs. free-format input

Page 79

Dialogue management

– Finite-state or frame-based approaches for task-oriented dialogue acts
– Verification strategies and repair mechanisms for dialogue control

Pragmatic approaches to language understanding and language generation:

– Input: directly mapped onto application functionality
– Output: template-based approaches

Not covered: speech monitoring, speech data mining applications and technology

Page 80

Exercises with the CSLU toolkit and other demonstrators
– Try out your name, telephone numbers, dates, e-mail addresses, abbreviations, etc.

Project
– Protocol development:
• Dialogue structure
• Strategies and prompts
– Tomorrow:
• Wizard of Oz test