
Instructions for use

Title Study on Optimal Spoken Dialogue System for Robust Information Search in the Real World

Author(s) 徐, 昕

Issue Date 2016-09-26

DOI 10.14943/doctoral.k12405

Doc URL http://hdl.handle.net/2115/63374

Type theses (doctoral)

File Information Xin_Xu.pdf

Hokkaido University Collection of Scholarly and Academic Papers : HUSCAP

Page 2: Study on Optimal Spoken Dialogue System for Robust Information

DOCTORAL THESIS

Study on Optimal Spoken Dialogue System for Robust Information Search in the Real World

Author: Xin XU

Supervisor: Dr. Yoshikazu MIYANAGA

Examiners: Dr. Kunimasa SAITOH, Dr. Takeo OHGANE, Dr. Hiroshi TSUTSUI

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Information and Communication Networks Laboratory, Graduate School of Information Science and Technology

August 2016


2016 Doctoral Thesis

Study on Optimal Spoken Dialogue System for Robust Information Search in the Real World

Division of Media and Network Technologies

Graduate School of Information Science and Technology

Hokkaido University

Xin Xu

August 22, 2016


Abstract

Recently, spoken dialogue systems that enable users to intuitively and directly operate services and smartphones with voice commands and to search for information have become popular. However, a challenge remains: not many users make habitual and continual use of spoken dialogue systems for information search in the real world, even though most of them have devices in which a spoken dialogue system is implemented. To solve this challenge, this thesis presents three pieces of research from different aspects, aiming to realize an optimal spoken dialogue system for robust information search in the real world.

The first research practices human-centered design (HCD), based on cognitive science and gamification theory, to design a dialogue agent and a dialogue scenario that promote daily use of the spoken dialogue interface. The author proposes a design concept of breeding a character, which is actually a dialogue agent, through taking care of it and having dialogues with it, in order to make users gradually feel that speaking to the dialogue agent is natural and fun. Real-world data demonstrate the value of the proposed design: over 23% of users keep speaking to the agent continually, and more than 95% of the conversations initiated by the dialogue agent are responded to by the users.

The second research improves the efficiency and robustness of dialogue management for information search based on information theory. The author proposes two strategies: one optimizes question selection for information search, and the other decreases failures in information search mainly caused by mistaken queries. The first strategy applies optimal question selection in a knowledge-based spontaneous dialogue system and has been verified to be effective in assisting users' operations for information search. The second strategy applies a robust and fast search method based on phoneme string matching, which decreases the failures caused by queries containing incorrect parts. Experimental results show that the proposed search method increases search accuracy by 4.4% and reduces processing time by at least 86.2%.

The third research practices signal processing technologies to enhance the usability of spoken dialogue systems. The author proposes a novel pitch detection method that applies an adaptive filtering algorithm to restore the amplitude spectra of speech corrupted by additive noise, so that the periodic structures in the amplitude spectra are preserved against noise distortion. Experimental results verify that the proposed pitch detection method achieves the highest robustness over a variety of noise types and noise levels. With this high-accuracy pitch information, emotion recognition is to be established in the next step of this research; understanding the speaker's emotion helps to generate appropriate dialogue actions, presenting superiority over and differentiation from other modalities.

Furthermore, based on the above research, this thesis proposes a dialogue structure for building a personalized dialogue system that applies emotion recognition and a multi-device interface for further real-world use in the future.



Contents

1 Introduction
1.1 The History of Spoken Dialogue System Research
1.2 The Issues Focused in This Thesis
1.3 Research Contributions
1.4 Thesis Organization

2 Key Technologies of Spoken Dialogue Systems and Related Works
2.1 The Structure of Spoken Dialogue Systems
2.2 The Technology Development of Spoken Dialogue Systems
2.2.1 Automatic Speech Recognition
2.2.2 Natural Language Understanding
2.2.3 Dialogue Management
2.2.4 Natural Language Generation
2.2.5 Text to Speech
2.2.6 Emotion Recognition with Prosody Information
2.3 Remaining Issue and Related Works

3 Spoken Dialogue System Design for a Habitual Use
3.1 In-House Trial of Spoken Dialogue System
3.1.1 Structure of the Trial System
3.1.2 Trial Evaluation
3.1.3 Discussion of the Spoken Dialogue Agent
3.2 Agent Characters' Impressions against the User's Tolerance
3.2.1 Preliminary Investigation for Selecting Characters
3.2.2 Experiment for Evaluating Users' Tolerance
3.3 Design of User Interface for the Spoken Dialogue System Applying a Breeding Game
3.3.1 Target Users
3.3.2 Scenario Design
3.3.3 Evaluation of Scenario-based Acceptability
3.3.4 Game Implementation
3.3.5 Analysis of the Real-World Dialogue Log Data
3.4 Summary

4 Dialogue Management for Information Search
4.1 Dialogue Strategy of Optimal Question Selection
4.1.1 Spoken Dialogue System for Information Search in a Cooperative Manner
4.1.2 Question Selection for Quick Narrowing Down of Hit Data
4.1.3 Experimental Evaluations
4.2 Information Search by Robust and Fast Phonetic String Matching Method
4.2.1 Analysis of the Real World Queries
4.2.2 Investigation of the Relationship between the Query Length and DP Matching Pre-selection
4.2.3 Acoustic Distance Derived from a Phoneme Confusion Matrix
4.2.4 Fast Two-pass Search Algorithm in Consideration of Acoustic Similarity
4.2.5 Experiments
4.3 Summary

5 Applying Prosody Information in Spoken Dialogue Systems
5.1 Prosody Information
5.2 Spectra Analysis and Pitch Detection
5.2.1 Fundamentals of Speech Production
5.2.2 Introduction of Running Spectrum and Modulation Spectrum
5.2.3 Pitch Detection
5.3 Robust Speech Spectra Restoration for Pitch Detection
5.3.1 Adaptive Running Spectra Filtering Design
5.3.2 Pitch Detection with ARSF
5.3.3 Experiments
5.4 Summary

6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work

Acknowledgement

References

List of Publications


List of Tables

3.1 Average score and standard deviation values
4.1 Results of subjective evaluation
4.2 Average operating time for one search task
4.3 The distribution of mistaken types within content-word-error cases
4.4 Number of hits by two web search engines
4.5 Search accuracies and p-values of EDPED and EDPAD
5.1 Comparison results at various noise conditions (GPE %)
5.2 Comparison results at various noise conditions (SDFPE)


List of Figures

2.1 Structure of a spoken dialogue system
3.1 Structure of the trial spoken dialogue system
3.2 Character agents used for the experiment
3.3 Demographics of Peratama users
3.4 Screenshots of Hey Peratama
3.5 Utterance distribution
3.6 Distribution of major user-initiated dialogue topics
4.1 Interface of the proposed application
4.2 The flowchart of the dialogue scenario
4.3 The distribution of mistaken queries of the different types and correct queries within the collected queries
4.4 The distribution of the length of a query in phonemes and the DP matching distances from the correct lyric for the real-world incorrect queries
4.5 Flowchart of the first-pass search
4.6 Flowchart of the second-pass search
4.7 Relationship between hit rates and Ic for various sizes of lyric database
4.8 Search accuracy and processing time with respect to Fth in the case of 1-best
4.9 Search accuracy and processing time with respect to Fth in the case of 20-best
4.10 Search accuracy of TDPAT and EDPAD
4.11 Average processing times and search accuracy of three search methods in the case of 1-best
4.12 Average processing times and search accuracy of three search methods in the case of 20-best
5.1 Schematic view of the human vocal mechanism (cited from [83])
5.2 Modulation spectra of clean speech
5.3 Modulation spectra of speech corrupted by 0 dB white noise
5.4 Modulation spectra at 130 Hz in different conditions
5.5 AUTOC method
5.6 Noise distortion in amplitude spectrum
5.7 Block diagram of the ARSF approach
5.8 The magnitude response of the adaptive high-pass filter with -20 dB attenuation
5.9 Flowchart of the proposed pitch detection method
5.10 Amplitude spectrum of clean speech
5.11 Amplitude spectrum of speech with 0 dB car noise
5.12 Amplitude spectrum of speech with 0 dB pink noise
5.13 Amplitude spectrum of speech with 0 dB talking babble noise
5.14 Autocorrelation contour in clean speech
5.15 Autocorrelation contour in speech with 0 dB car noise
5.16 Autocorrelation contour in speech with 0 dB pink noise
5.17 Autocorrelation contour in speech with 0 dB talking babble noise
6.1 Structure of the future system


Chapter 1

Introduction

1.1 The History of Spoken Dialogue System Research

A spoken dialogue system is described as "an interactive system which operates in a constrained domain" (Glass, 1999) [1]. It offers direct, simple, hands-free access to information, enabled by several contributing technical factors.

The research on spoken dialogue systems can be traced back to "ELIZA" [2]. ELIZA is a system (precisely, a computer program) for primitive natural language processing, created in the 1960s at the MIT Artificial Intelligence Laboratory. It only returned responses based on superficial word matching against the user's input, without a speech interface. ELIZA has since been regarded as a source of inspiration for programmers and developers of artificial intelligence (AI). With the development of speech recognition and synthesis technologies around 1990, research on spoken dialogue systems became popular. The forerunner was MIT's "VOYAGER" [3], which provided navigation and guidance for a city, close to the spoken dialogue services of current smartphones in terms of task domain.

After that, the "ATIS" project, supported by DARPA in the United States, was carried out in the early 1990s [4]. Researchers engaged in ATIS spun off to establish Nuance, Inc. and SpeechWorks, Inc., which brought a boom of spoken dialogue systems. These systems were employed in interactive voice response (IVR) services with a telephone interface, on the basis of manually described fixed grammars and dialogue flows. As spoken dialogue systems were introduced on a large scale in call centers, they became commercially successful.

As for recent trends, there have been great developments in both the theory and practical use of spoken dialogue systems, especially on mobile devices [5]. Recent spoken dialogue interfaces have moved beyond a mere keyboard replacement to provide integrated voice search, speech understanding, and basic device operation. The main players are Apple's Siri, Google's voice search, Nuance's Dragon Go!, Docomo's Shabette Concier (Japanese), and in-car systems from several manufacturers. In addition, spoken dialogue systems are implemented not only on smartphones but also in robots. The Japanese telecom company Softbank employs the robot "Pepper" for shop reception, and declares that future robots that can recognize human emotion will change the way we live and communicate [6]. The family robot "Jibo" also offers an idea of bringing spoken dialogue technologies to life with personal and emotional engagement [7].

As for new theories of spoken dialogue, deep learning is becoming a mainstream technology for speech recognition and has successfully replaced Gaussian mixtures for speech recognition and feature coding at an increasingly large scale [8]. Classification models such as support vector machines (SVMs) and conditional random fields (CRFs) [9] have been introduced into research on natural language understanding. In the field of dialogue management, though most systems have only limited cross-turn persistence in the dialogue state, some of them use machine learning techniques such as reinforcement learning in partially observable Markov decision processes (POMDPs) [10], which has been in development in the research community for about a decade.

1.2 The Issues Focused in This Thesis

Spoken dialogue systems cover a wide range of domains, from simple weather forecast systems (which ask the city name and give the weather information) to complex problem-solving and reasoning applications (e.g., specialized dialogue systems for the medical domain). Generally, spoken dialogue systems can be classified into two types: conversational (non-task) systems and goal-oriented (task) systems. It should be stressed that these are prototypical categories, and not all dialogue systems fit neatly into one of them. Apple's Siri, for example, assists users in searching for information and in confirming or editing schedules by voice commands, while carrying on casual conversations at the same time.

The study in this thesis puts particular emphasis on goal-oriented spoken dialogue systems, especially for information search tasks. The final goal of the study is to establish a spoken dialogue system as a daily and habitual means of access to information spaces through natural spoken language interaction and personal preferences.

Including Siri, many current commercial dialogue systems for information search are based on a dialogue strategy of the one-turn Q&A type that does not support complex dialogue scenario logic. On the other hand, some systems are based on a fixed scenario. Some conventional systems respond with a confirmation of the user's request to avoid misunderstandings caused by speech recognition errors. However, these traditional dialogue strategies and designs degrade the usability of the dialogue system, and the focus has not been on approaches for the long-term habitual use of a spoken dialogue interface. For example, as investigated by speech specialists [11], 85% of iPhone iOS 7 users have never used Siri. Not only Siri but other spoken interfaces also face the same issue. In [12], this phenomenon is discussed, and the necessary conditions for continual use of spoken dialogue systems, also called "requirements for survival", are summarized as follows:

• Requirement 1: should optimize agent presence and dialogue design to improve conversation frequency

• Requirement 2: should avoid failures of search and dialogue caused by speech recognition errors

• Requirement 3: should clearly show the superiority over and differentiation from other modalities

• Requirement 4: should enhance the killer apps that are routinely used

• Requirement 5: should find more use cases with immediate real-time requirements

As Requirements 4 and 5 are more closely related to specific applications and use cases, the study in this thesis concentrates on pursuing Requirements 1, 2 and 3 to realize an optimal spoken dialogue system for robust information search in the real world.

Firstly, the study takes advantage of gamification theory to design a dialogue agent and dialogue scenarios that foster the user's affection for the character, which promotes continual use of the spoken dialogue interface. This affection is also supposed to enhance the user's tolerance of the mistakes caused by automatic speech recognition (ASR) or natural language understanding (NLU), which are the weak points of spoken dialogue systems.

Secondly, the study proposes dialogue management strategies to improve the user experience of searching for information. To optimize the spoken dialogue system for information search, a novel strategy of question selection has been verified to be effective in assisting users' operations in a knowledge-based spontaneous dialogue system. Furthermore, to avoid information search failures in dialogue, especially since the traditional full-text search method is highly compromised when queries contain incorrect parts due to mishearing or misrecognition, a robust and fast matching method based on phoneme strings decreases the failures, and the most appropriate answer candidates are presented.

Thirdly, to deeply explore the superiorities of the spoken dialogue system over other modalities such as touch screens or keyboards for information search, the study in this thesis intends to apply emotion recognition. Compared with other modalities based only on text or command information, the responses of a spoken dialogue system are expected to be more appropriate when the speaker's emotion information is available. However, the extraction of the prosody information that plays an important role in emotion recognition is a difficult issue in real-world environments. Therefore, my study focuses on the detection of the speech pitch period for practical applications. The proposed method intelligently restores speech modulation spectra according to the estimated noise conditions, so that the noise distortion that influences the amplitude spectral structure is much diminished. Pitch detection therefore gains better accuracy than other conventional methods under different noise conditions. With this novel pitch extraction method, a method for recognizing the speaker's emotion is to be established in a future study, using a multimodal model mixing speech prosody and text positive/negative polarity recognition.

Finally, this thesis also discusses a future study on optimizing the dialogue according to the user's preferences for information search. One direction is the further challenge of personalization: the dialogue system will be applied to many types of devices, such as a set-top box (STB) and an in-vehicle machine as well as smartphones, to cover the user's life scenes and to learn more of the user's hobbies and preferences via analysis of individual dialogue log data.

Page 19: Study on Optimal Spoken Dialogue System for Robust Information

Chapter 1. Introduction 6

1.3 Research Contributions

The main contributions of this work, which will be discussed in details in Chapters

3, 4 and 5, are summarized as:

First, human-centered design (HCD) and gamification theory are practiced to design a novel spoken dialogue interface. The research proposes a new concept of breeding the dialogue agent through taking care of it and having dialogues with it. The stepwise growth of the agent creates the users' expectations and intimacy, which motivate them to use the spoken dialogue interface habitually. The designed scenario obtained high evaluation scores in the subjective evaluation experiments, and many positive comments were collected, e.g., "I feel no stress in talking to the character as it is a game." The game application was released on Google Play. Based on the analysis of real-world users' log data, the daily active rate is 23% or more, and more than 95% of the conversations initiated by the dialogue agent are responded to by the users. This activity is much higher than that of a conventional spoken dialogue system with a character agent tested in an in-house trial. Furthermore, my study investigated when and on what topics the real-world users talk to the agent. It was found that users talk to the agent most around the times of getting up and going to bed, to confirm the schedule and weather, and around the time of arriving home, to enjoy chit-chat. These findings help the design of the future dialogue system to provide proper topics automatically.

Second, a dialogue management strategy is proposed to optimally select the questions to ask users in order to help them refine a search. A decision algorithm is applied to find the best questions that minimize the number of search refinement steps. In addition, questions relating to the candidates in which users have shown interest are preferentially selected. Based on the evaluation results, the application with the proposed strategy performs better than conventional applications in terms of the users' satisfaction with search results and the effort spent reconsidering search keywords. Furthermore, a robust and fast search method based on an acoustic distance and a two-pass search strategy is introduced. The two-pass search uses an index-based approximate preselection in the first pass and dynamic programming (DP) based string matching in the second pass. It is verified on a lyric search task with a test set of incorrect queries that were misheard or misremembered. The experiments proved that applying the originally proposed acoustic distance improves search accuracy by 4.4% compared with a conventional search using edit distance. Though the search accuracy is expected to improve further if the acoustic distances are calculated from singing voice data, the proposed method offers a realistic and efficient solution with an easily obtainable database of more general ordinary speech. The proposed method achieves real-time operation by reducing processing time by more than 86.2%, with only a slight loss in search accuracy compared with a complete search by DP matching over all lyrics. It is proved to be the most practical solution for acoustically confusable queries, considering the trade-off between high search accuracy and low computational complexity.
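As an illustration of the information-theoretic intuition behind this question selection (the actual decision algorithm is presented in Chapter 4), the following sketch picks the attribute with maximum entropy over a toy candidate database; the attribute names and data are hypothetical:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of one attribute's value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def best_question(candidates, attributes):
    """Ask about the attribute with maximum entropy: its answer is
    expected to split the remaining candidates most evenly."""
    return max(attributes, key=lambda a: entropy([c[a] for c in candidates]))

# Toy restaurant-search database (hypothetical data).
candidates = [
    {"area": "north", "food": "ramen", "price": "low"},
    {"area": "south", "food": "sushi", "price": "high"},
    {"area": "east",  "food": "ramen", "price": "mid"},
    {"area": "west",  "food": "curry", "price": "low"},
]
print(best_question(candidates, ["area", "food", "price"]))  # -> "area"
```

Here "area" takes four distinct values (2.0 bits), so asking about it is expected to narrow down the candidates faster than "food" or "price" (1.5 bits each).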

Third, my research proposes a new algorithm named adaptive running spectrum filtering (ARSF) to restore the amplitude spectra of speech corrupted by additive noise. Based on a beforehand noise estimation, adaptive filtering is applied to the speech modulation spectra according to the noise conditions, so that the periodic structures in the amplitude spectra are preserved against noise distortion. Since the amplitude spectral structures contain the information of the fundamental frequency, which is the inverse of the pitch period, the ARSF algorithm is added to robust pitch detection to increase its accuracy. Compared with conventional methods, experimental results show that the proposed method significantly improves the robustness of pitch detection against noise conditions of several types and SNRs.
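The ARSF design itself is presented in Chapter 5. The sketch below shows only the underlying running-spectrum filtering idea with a fixed, non-adaptive high-pass filter along the frame axis; the filter order and cutoff are illustrative assumptions, whereas the proposed method adapts them to the estimated noise conditions (SciPy is assumed):

```python
import numpy as np
from scipy import signal

def running_spectrum_highpass(magnitudes, frame_rate_hz, cutoff_hz=1.0):
    """Filter each frequency bin's trajectory over time (the running
    spectrum) with a fixed high-pass filter; ARSF instead adapts the
    filter to the estimated noise.  magnitudes: (n_frames, n_bins)
    amplitude spectrogram; frame_rate_hz: e.g. 100 for a 10 ms hop."""
    b, a = signal.butter(4, cutoff_hz / (frame_rate_hz / 2), btype="highpass")
    return signal.filtfilt(b, a, magnitudes, axis=0)
```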

1.4 Thesis Organization

This thesis is organized as follows:


In Chapter 2, the structure of the dialogue system and the development of ele-

mental technologies are organized, in which the analysis of the issues and problems

of the current state of the dialogue system are listed. In addition, the focused issues

of this thesis and the related works are described.

Chapter 3 presents my first research, which practices human-centered design (HCD) to design a dialogue agent and a dialogue scenario promoting daily use of the spoken dialogue interface, based on cognitive science and gamification theory. It mainly expands the discussion of how to design an interactive system for continual use. Secondly, it presents the HCD process and gamification theory practiced in the design of the dialogue system. Then, the subjective evaluation of the dialogue scenario and its results are presented. The end of the chapter describes the analysis of the real-world users' log data.

Chapter 4 presents my second research, which improves the efficiency and robustness of dialogue management for information search based on information theory. Firstly, a dialogue strategy of question selection is introduced: a decision algorithm is proposed to find the best questions to ask in order to narrow down the candidates as quickly as possible according to the knowledge database. Then, a DP-based two-pass search strategy using the acoustic distance based on the phonetic confusion matrix is explained in detail.

Chapter 5 presents my third research, which practices signal processing technologies to enhance the usability of spoken dialogue systems. Firstly, it discusses the role played by prosody information in the spoken dialogue system. Then, a robust pitch detection method using the speech signal processing technology ARSF is introduced.

Chapter 6 gives the conclusion. It summarizes the results obtained in my research. Furthermore, the future dialogue system is described, combining the technologies proposed in this thesis with the concept of personalization.


Chapter 2

Key Technologies of Spoken Dialogue Systems and Related Works

In this chapter, the key technologies of spoken dialogue systems are described. First, the classic structure of spoken dialogue systems is presented in Section 2.1. Second, the technology development of each component is discussed in Section 2.2. Then, the issue in current representative commercial spoken dialogue systems is presented in Section 2.3. The related works are introduced as well.

2.1 The Structure of Spoken Dialogue Systems

The spoken dialogue system is composed of automatic speech recognition (ASR), natural language understanding (NLU), a dialogue manager (DM), natural language generation (NLG) and text-to-speech synthesis (TTS), as shown in Figure 2.1. The ASR component receives a user's spoken utterance and transforms it into a textual hypothesis. Then, NLU parses the hypothesis and generates a semantic representation of the utterance. This representation is then handled by the DM component, which looks at the discourse and dialogue context to, for example, resolve anaphora and interpret elliptical utterances, and generates a response on a semantic level. Then the NLG component generates a surface representation, often in some textual form, and passes it to TTS, which generates the audio output to the user.

[Figure 2.1: Structure of a spoken dialogue system, a block diagram connecting ASR, NLU, DM, NLG and TTS]
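As a schematic summary of this flow, the sketch below treats each component as a function; the interfaces are illustrative, not those of any concrete system:

```python
def dialogue_turn(audio, asr, nlu, dm, nlg, tts):
    """One turn through the pipeline of Figure 2.1 (schematic)."""
    text = asr(audio)          # speech -> textual hypothesis
    semantics = nlu(text)      # hypothesis -> semantic representation
    act = dm(semantics)        # context-dependent response on a semantic level
    surface = nlg(act)         # semantic act -> surface (textual) form
    return tts(surface)        # surface form -> audio output
```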


2.2 The Technology Development of Spoken Dialogue Systems

2.2.1 Automatic Speech Recognition

The task of ASR systems is to interpret acoustic speech signals as a sequence of words. ASR systems are separated into different classes according to what type of utterances they can recognize.

1. Isolated word: Isolated word recognizers usually require each utterance to have silence on both sides of the sample window. They accept single words or single utterances at a time.

2. Continuous speech: Continuous speech recognizers allow users to speak almost naturally, while the computer interprets the content. Recognizers for continuous speech must utilize novel methods to determine utterance boundaries.

3. Spontaneous speech: Spontaneous speech recognizers try to recognize speech that is natural sounding and not rehearsed. ASR systems with spontaneous speech ability should be able to handle a variety of natural speech features, such as words being run together.

ASR systems need a set of models to compute probabilities for parameterizing the audio signal into features, and an efficient search algorithm. Gaussian mixture models (GMMs) and hidden Markov models (HMMs) are utilized to represent the sequential structure of speech signals [13] [14].

Deep learning, also referred to as representation learning or unsupervised feature learning, is a recent area of machine learning (ML). It is becoming a mainstream technology for speech recognition [8] [15] and has successfully replaced Gaussian mixtures for speech recognition and feature coding at an increasingly large scale.

2.2.2 Natural Language Understanding

NLU is an important field of natural language processing that deals with machine reading comprehension, drawing on artificial intelligence and computational linguistics technologies.

A classic parsing approach for NLU is to use a context-free grammar (CFG) enhanced with semantic information. The same CFG may then be used for both ASR and NLU, which is famously implemented in the W3C standard VoiceXML [16].
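As a toy illustration of such a shared grammar (written here with NLTK's CFG class rather than the SRGS format that VoiceXML actually embeds; the rules and vocabulary are invented):

```python
import nltk  # assumes the nltk package is installed

# The same rules could constrain both what ASR may recognize and how
# NLU interprets it; semantic tags are omitted for brevity.
grammar = nltk.CFG.fromstring("""
    S -> REQUEST CITY
    REQUEST -> 'weather' 'in'
    CITY -> 'sapporo' | 'tokyo'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("weather in sapporo".split()):
    print(tree)  # (S (REQUEST weather in) (CITY sapporo))
```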

There have also been many efforts in data-driven approaches to NLU to gain robustness. The transfer of ideas between ML and NLP can be traced through the following algorithms:

• Conditional random fields for part-of-speech tagging [17]

• Latent Dirichlet Allocation for modeling text documents topics [18]

• Sequence-to-sequence models for machine translation [19]

2.2.3 Dialogue Management

As described in [20], the most common tasks of DM can be divided into three groups:

1. Contextual interpretation: Resolve, for example, ellipses and anaphora.

2. Domain knowledge management: Reason about the domain and access information sources.

3. Action selection: Decide what to do next.

The different knowledge sources needed by the DM can be separated into dialogue and discourse knowledge, task knowledge, user knowledge and domain knowledge. User knowledge can be used to adapt the system's behavior to the user's experience and preferences. Domain knowledge management includes models and mechanisms for reasoning about the domain and for accessing external information sources, such as an SQL database of music information or a graphical information system. On the other hand, interactive systems for complex and analytical tasks, such as HITIQA [21], are especially meant for answering explanatory questions like why, how, list, etc.

Furthermore, with the introduction of the framework of reinforcement learning based on dialogue examples and rewards, energetic research has been performed on models such as POMDPs [10]. Such a model assumes interaction flow conditions, or an information state, such as the slots to be filled for performing a task. These machine-learning-based models are interesting, and it has been reported that high performance can be obtained in evaluations by simulation. On the other hand, the constructed model is a complete black box: when trouble is found later, or when one tries to add vocabulary and concepts, there is the problem that the model cannot be maintained.
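By contrast, a hand-crafted information-state model remains inspectable and extendable. A minimal slot-based sketch of such a dialogue manager (the slot names and actions are illustrative, not taken from any system in this thesis):

```python
class SlotState:
    """Minimal information-state dialogue manager: track task slots
    and choose the next action (ask for a missing slot, or search)."""
    def __init__(self, slots):
        self.values = {s: None for s in slots}

    def update(self, semantics):
        # semantics: slot -> value pairs produced by NLU
        self.values.update({k: v for k, v in semantics.items()
                            if k in self.values})

    def next_action(self):
        missing = [s for s, v in self.values.items() if v is None]
        return ("ask", missing[0]) if missing else ("search", dict(self.values))

state = SlotState(["artist", "title"])
state.update({"artist": "The Beatles"})   # from NLU
print(state.next_action())                # ('ask', 'title')
```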

2.2.4 Natural Language Generation

NLG is the process of automatically generating natural language on the basis of a non-linguistic information representation. This may be, for instance, information from a database or an abstract message specification provided by the dialogue manager of a dialogue system. Most work on NLG is aimed at the production of written texts. In recent years, a number of new template-based systems have seen the light, including TG/2 [22], Theune et al. [23], and YAG [24].

Following the development of statistical natural language analysis, data-driven methods have led to improvements in the performance and capabilities of NLG systems. The work in [25] described content selection by means of data-driven methods. Duboue and McKeown [26] developed a two-stage approach to content selection based on statistical techniques; their method employs clustering to derive content selection rules for the purpose of automatically generating biographies. Ballesteros et al. proposed two approaches based on SVM classifiers to map semantic and syntactic structures for NLG [27].
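A template-based generator of the kind cited above fits in a few lines; the dialogue acts, templates and slot names below are invented for illustration:

```python
TEMPLATES = {
    "inform_weather": "The weather in {city} is {condition}.",
    "ask_city": "Which city would you like the weather for?",
}

def generate(act, **slots):
    """Fill the template chosen for the dialogue act with slot values."""
    return TEMPLATES[act].format(**slots)

print(generate("inform_weather", city="Sapporo", condition="sunny"))
```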

2.2.5 Text to Speech

TTS, also called speech synthesis, is the artificial production of human speech. The TTS procedure mainly consists of two phases, usually called natural language processing (NLP), or high-level synthesis, and digital signal processing (DSP), or low-level synthesis. If the output is to be generated by an embodied conversational agent (ECA), facial animation is also needed [28].

In recent years, research has been done to provide high-quality synthetic speech without compromising naturalness and intelligibility in TTS systems. To attain natural speech, different models are available, such as text language models, grapheme-to-phoneme models, full linguistic analysis models and complete prosody generation models.

In the complete prosody generation model, quantities like phrasing and stress are determined to generate a natural-sounding synthetic voice. To generate such speech, an explicit prosodic model is required. The HMM is one of the best models currently in use for most TTS systems [29] [30]. Though much research has been done in this stream, a better solution is still required.


2.2.6 Emotion Recognition with Prosody Information

Speech is usually analyzed considering both the prosodic information and the spoken content. Therefore, to deeply explore the superiorities of speech over other modalities such as touch screens or keyboards, some spoken dialogue systems intend to apply recognition of the speaker's emotion.

By analyzing the emotion in speech, a dialogue agent or robot can estimate the human's feelings and perform a response suitable to the situation. For instance, the Japanese telecom company Softbank employs the robot "Pepper" for shop reception, and declares that future robots that can recognize human emotion will change the way we live and communicate [6].

Commonly, speech emotion is classified into categories. The number of categories is mostly 4 to 7, such as anger, boredom, disgust, fear, joy, neutral and sadness [31].

In the design of a speech emotion recognition system, an important issue is the extraction of suitable features that efficiently discriminate different emotions. Many studies have shown that prosody information provides a reliable indication of the emotion [32] [33] [34]. The most commonly used prosody features in speech emotion recognition are as follows [32]:

• Fundamental frequency (calculated as the inverse of the pitch period): mean, median, standard deviation, maximum, minimum, range (max-min), jitter, and the ratio of the number of samples on the up-slope to that on the down-slope of the pitch contour.

• Energy: mean, median, standard deviation, maximum, minimum, range (max-min), linear regression coefficients, shimmer, and 4th-order Legendre parameters.

• Duration: speech rate, ratio of the durations of voiced and unvoiced regions, and duration of the longest voiced speech.

However, emotions that have nearly the same arousal level often cannot be classified correctly. Features are needed that make it possible to differentiate emotions like anger and joy better than with prosody features alone. Such features depend on the human voice and are called quality features. Therefore, a formant feature set, including the first and second formants and their bandwidths, is also used for emotion recognition.

With the extracted features, GMMs and support vector machines (SVMs) are mainly used to classify the emotion. Recently, HMMs and deep neural networks (DNNs) have also been applied [34] [35].
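A minimal sketch of this recipe, assuming scikit-learn is available; the data are random toy stand-ins for labeled emotional speech, and only a representative subset of the listed features is computed:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

def prosody_features(f0, energy):
    """Utterance-level statistics over the F0 contour and frame energy
    (a representative subset of the features listed above).
    f0: per-frame pitch in Hz, with 0 marking unvoiced frames."""
    voiced = f0[f0 > 0]
    return np.array([
        voiced.mean(), np.median(voiced), voiced.std(),
        voiced.max(), voiced.min(), voiced.max() - voiced.min(),
        energy.mean(), energy.std(), energy.max() - energy.min(),
    ])

# Toy training data: 20 random "utterances" with random labels.
rng = np.random.default_rng(0)
X = np.vstack([prosody_features(rng.uniform(80, 300, 100),
                                rng.uniform(0, 1, 100)) for _ in range(20)])
y = rng.choice(["anger", "joy", "neutral", "sadness"], size=20)
clf = SVC().fit(X, y)
```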

Among the prosody features, pitch information is essential. Pitch detection methods usually use short-term analysis techniques, which means that a score f(T | x_m) is calculated as a function of the candidate pitch period T for every frame x_m. In the speech processing literature, a wide variety of pitch detection methods has been proposed. However, accurate and robust pitch detection in noisy environments still remains a difficult and important issue in real applications.

2.3 Remaining Issue and Related Works

From the analysis of the use situations of commercial spoken dialogue systems, it is still difficult to say that the spoken dialogue interface has deeply penetrated users' daily lives. In order to realize the habitual use of spoken dialogue systems, especially for information search, the following approaches are proposed to solve this remaining issue:

• Designing attractive characters and a dialogue scenario based on gamification theory and human-centered design (HCD), which make users feel like speaking to the system.

• Improving dialogue management strategies to optimize question selection for easy information search and to decrease failures in information search mainly caused by mismemorization and speech recognition errors.

• Estimating the prosody information, including pitch, to improve the usability of spoken dialogue systems.

For the related works on designing spoken dialogue systems, Yankelovich et al. analyzed the issues of designing speech interfaces in [36]. Besides speech recognition errors, the nature of speech, such as the lack of visual feedback and the organization of information, also generates a negative impact on the user experience on smartphones. This leads to a lack of motivation for using a spoken dialogue interface. Several studies tried to use an agent to realize a natural spoken dialogue interface [37] [38], expecting to improve the motivation. In [37], a human-like dialogue agent was applied as a museum guide; the results indicated that the agent was useful for engaging people in dialogue interactions. The agent's emotion and personality were verified to be important for natural dialogues in [38]. Also, some works were carried out to incorporate spoken dialogue into games in a useful and entertaining way. Bell et al. introduced an interactive computer game with spoken and multimodal dialogue for dialogue data collection [39]. Minami et al. developed a quiz system using speech, sounds, dialogue, and vision technologies [40]. Both studies proved that game elements motivate users to speak to the systems, which helps to complete the dialogue tasks. However, the focus has not been on approaches for the long-term habitual use of a spoken dialogue interface.

For the related works on DM for information search, most spoken dialogue systems, such as Apple's Siri and Google's voice search, simply combine the speech interface with search systems. Conventional search systems are mostly based on a domain-specific web search or full-text search [41] with keywords. However, they cannot provide proper results unless those results include the keywords that the user has input. Furthermore, if the user's information literacy is not high and the target database is huge, the search effort for narrowing down the candidates remains a great issue. In addition, related works on search strategies that are robust against queries containing incorrect parts due to mishearing or misrecognition were investigated. Several studies attempted to use phonetic string matching methods to solve the search problem caused by misheard queries, and these were verified to be more robust than text retrieval methods. Ring and Uitenbogerd [42] tried to find the correct lyric by minimizing the edit distance between the phoneme strings of queries and lyrics. However, edit distance does not represent the degree of confusability between phonemes. To statistically model the similarity of misheard lyrics to their correct versions, Hussein [43] introduced a probabilistic model of mishearing that is trained using examples of actual misheard lyrics from the user-submitted misheard lyrics website "kissthisguy" [44], and developed a phoneme similarity scoring matrix based on the model. The performance of this method depends on the size of the training database: as described in [43], a total of 20788 pairs of misheard and correct lyrics were used. However, such a big database like "kissthisguy" is not available in other languages, and it is impractical to collect sufficient misheard lyrics to build a practical probabilistic model. On the other hand, another important requirement for the search strategies of DM is that they must provide a real-time response, considering the latency of spoken dialogue systems. In order to reduce the processing time, conventional high-speed DP matching processors use indexes or tree-structured data to pre-select the hypothetical candidates [45], [46]. As an example, [45] used a suffix array as the data structure and applied phoneme-based DP matching to detect keywords quickly in a very large speech database. In order to avoid an exponential increase in processing time with increasing keyword length, it divided the original keyword into short sub-keywords and searched for the sub-keywords on the suffix array by DP matching. If the DP distance between a sub-keyword and a path of the suffix array was not more than a predetermined threshold value, the path remained as a candidate search result. By repeating the DP matching process between the original keyword and the candidates, the final result was detected. As in other high-speed DP methods, the predetermined threshold for sub-keywords is proportional to the length of the queries. However, based on the author's investigation, for some domains, such as lyric search, the distance values between the queries and the correct lyrics show no statistical relationship with the length of the queries.
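For reference, the phoneme-level DP matching underlying these approaches can be sketched as follows; with sub_cost=None it is the plain edit distance used in [42], while a substitution cost derived from a phoneme confusion matrix (as proposed in Chapter 4) turns it into an acoustic distance. The example phoneme strings are toy stand-ins:

```python
def phoneme_distance(query, reference, sub_cost=None):
    """DP matching between two phoneme sequences.  With sub_cost=None
    this is plain edit distance; a function derived from a phoneme
    confusion matrix turns it into an acoustic distance."""
    m, n = len(query), len(reference)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if query[i - 1] == reference[j - 1]:
                cost = 0.0
            elif sub_cost is not None:
                cost = sub_cost(query[i - 1], reference[j - 1])
            else:
                cost = 1.0
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(phoneme_distance(list("letitbi"), list("lederbi")))  # -> 3.0
```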

For the related work on pitch extraction in noisy environments, the commonly used detection methods are the autocorrelation method (AUTOC) and the cepstrum method (CEP). The autocorrelation function in AUTOC is calculated by the inverse Fourier transform of the squared amplitude spectrum of speech, i.e., the power spectrum [47], [48]. CEP uses the logarithm of the amplitude speech spectrum [48]. Both are verified to be excellent detection methods on clean speech. As is well known, white noise distributes its energy uniformly along the frequency axis and does not form prominent energy peaks. After the exponential calculation on the amplitude spectrum, the high-energy parts that represent speech components are enhanced; accordingly, AUTOC is robust against white noise. The authors in [49] proposed a new method as an extension of AUTOC, which adjusts the exponential factor of the amplitude spectrum according to the SNR of each speech frame. As the SNR decreases, the exponential factor is increased to distinguish the speech components from the noise components. Over a wide range of white noise, i.e., from low SNR to high SNR, the pitch extraction accuracy is improved by this extension. On the other hand, the logarithm tends to extract the envelope of the amplitude spectrum, so CEP is relatively robust against noise whose energy is distributed in a narrow band, such as car noise. However, if the noise situation changes, the performance of these conventional methods is rapidly degraded. Since conventional methods such as AUTOC provide excellent detection on clean speech, a process which can keep the amplitude spectral structure close to the shape of clean speech under unspecified noise conditions is required.
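As a concrete reference point, the core of AUTOC can be sketched as below: the autocorrelation of a frame is computed as the inverse FFT of its power spectrum and peak-picked inside a plausible pitch-lag range (the window, frame length and search range are illustrative choices):

```python
import numpy as np

def autoc_pitch(frame, fs, f0_min=60.0, f0_max=400.0):
    """AUTOC: autocorrelation via the inverse FFT of the power
    spectrum, then peak picking within the plausible lag range."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    autocorr = np.fft.irfft(np.abs(spectrum) ** 2)
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + int(np.argmax(autocorr[lo:hi]))
    return fs / lag  # fundamental frequency estimate in Hz

fs = 16000
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 130.0 * t)    # synthetic 130 Hz voiced frame
print(round(autoc_pitch(frame, fs)))     # ~130
```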


Chapter 3

Spoken Dialogue System Design for a Habitual Use

This chapter expands the discussion focusing on how to design a spoken dialogue system that will be habitually used. Firstly, an in-house trial of a spoken dialogue system is introduced; the discussion of the trial results evokes the utilization of a character agent that is well designed for building intimacy. Secondly, the hypothesis that the impressions of the agent character influence the user's tolerance of dialogue system mistakes is verified by experimental results. Finally, a breeding game application with a dialogue agent is designed to realize habitual use of the spoken interface for information search.

3.1 In-House Trial of Spoken Dialogue System

In order to investigate the use situation and issues of spoken dialogue systems, an in-house trial of a personal assistant application was carried out. The application is based on a spoken dialogue system on a smartphone. It uses a dialogue agent to assist users in operating regular smartphone functions, such as contacts, calendar, alarm, and a short message service, in daily life. It also supports web search and weather forecasts.

3.1.1 Structure of the Trial System

The spoken dialogue system used in the trial is based on a multimodal dialogue platform that runs stand-alone on Android devices [50]. The platform acts as middleware that can easily communicate with other native functions of Android devices. Figure 3.1 illustrates the architecture of the applied dialogue system, which comprises four components. The "Dialogue Management" component analyzes what users want and responds with appropriate agent actions. The "Understanding" component includes a lightweight natural language understanding engine based on a large database of vocabulary and sentences concerning mobile operations and information search. The "Character Rendering" component implements the agent with a fully articulated 3D graphical body, which supports the PMD motion format [51]. The "Speech Synthesis" component uses a stand-alone HMM-based Japanese TTS engine, "N2" [52]. Another feature of this dialogue system is that the user's profile information is collected through daily dialogues; based on this information, the proper responses are selected. To realize speech input for a wide range of tasks, the Android standard speech input module is used [53].

3.1.2 Trial Evaluation

Eight participants, four female and four male, all of whom had experience of using a smartphone, participated in the trial. Though five participants gave good evaluations of operability, all participants used the application for less than three days, and none of them continued using it after one week. Furthermore, the most voted reason for not using the application was feeling shy about speaking, which received more votes than speech recognition errors. Most users reported that their motivation for speaking would increase if they became very familiar with the dialogue agent.

[Figure 3.1: Structure of the trial spoken dialogue system; a dialogue platform on the smartphone comprising Character Rendering, Speech Synthesis, Dialogue Management and Understanding components with user profile data (e.g., favorite food and music), connected to the calendar, SMS and other apps, and to an ASR server via the Internet]

3.1.3 Discussion of the Spoken Dialogue Agent

The results of the trial pose a great challenge: how to motivate the user to use the dialogue interface habitually. The author proposes to introduce gamification into the design of the spoken dialogue interface. The dialogue application is designed in the form of a breeding game using an animated character as the dialogue agent. The breeding process is expected to make users become attached to the agent, which increases their motivation to speak to it. This idea was inspired by the hit phenomenon of the breeding toy Tamagotchi, which has kept people absorbed for a long period. The reason was explained in [54]: people have a high tendency to create and transfer their affections to artificial companions, regardless of whether those companions are devices or animals. Therefore, the affection for the character is expected to enhance the user's tolerance of dialogue system mistakes.

3.2 Agent Characters' Impressions against the User's Tolerance

So far, applications using agent characters for interactive communication, based on speech recognition technology, have been increasing [37] [38]. The reason for using a character is to engage people in dialogue interactions. Based on the in-house trial described in Section 3.1, the author sets up the hypothesis that the use of a character agent enhances the user's tolerance of the dialogue system mistakes caused by ASR, NLU or other reasons. This section introduces an investigation of how the impressions of the characters influence the user's permissiveness toward the mistakes of spoken dialogue systems.

3.2.1 Preliminary Investigation for Selecting Characters

In order to determine the agent characters to be used in the experiment, a preliminary investigation was carried out with 25 participants, whose ages ranged from 23 to 53 years old. As for nationality, 20 participants were from Japan, 2 from the Republic of Korea and 3 from the People's Republic of China. Three types of human-type characters and five types of non-human-type characters were nominated as targets, and the participants were asked to rank the cuteness of the eight characters. As verified in another previous work of the author, there is a trend that a cute character makes the user more tolerant of the dialogue agent's mistakes, so both a highly "cute" character and a less cute one were selected: the character that received the highest evaluation is character b in Figure 3.2, and the character with the lowest evaluation is character a. Since many spoken dialogue systems tend to use human-type characters, a human-type character was additionally included in the experiment, namely character c in Figure 3.2. Its evaluation is in the middle between characters a and b, and it does not have age- or gender-related changes.

3.2.2 Experiment for Evaluating Users’ Tolerance

With the selected characters, an experiment was carried out to evaluate their influence on users' permissiveness. Twelve Japanese participants (9 males and 3 females) aged 22 to 31 attended the experiment. All of them had experience using spoken interfaces on mobile devices; however, they did not know the purpose of the experiment until it started. An operational support task and a quiz task were prepared for the evaluation experiment. These tasks were executed through applications on a tablet device. The operational support task is strongly expected to provide proper answers according to the user's purpose, whereas the expectation of proper answers in the quiz task is relatively low. The evaluation can therefore be studied from different aspects, in which the users demand the correct answer to different degrees. A survey asked the participants “whether it could be tolerated when each character made a mistake in the answer”. Based on the participants' answers, the degree of tolerance depends on whether the character is the non-human type or the human type.

Specifically, for the human-type character, 42% of the participants in the operational support task and 50% of the participants in the quiz task answered “It couldn't be tolerated.” On the other hand, more than 80% of the participants answered


Figure 3.2 Character agents used for the experiment

“It could be tolerated.” in all tasks for the non-human-type character. The reasons are summarized as follows:

• For the human-type character: compared with the non-human character, it is potentially regarded as a human, so the expectations for the precision of search and speech recognition are greater.

• For the non-human-type character: as its appearance differs from a human's, the participants tend to tolerate the character's mistakes under the impressions of “cute” and “innocent”.


3.3 Design of User Interface for the Spoken Dialogue System Applying a Breeding Game

Based on the discussions and experimental results in the last two sections, my work proposes a design concept of breeding a dialogue agent, an animated non-human-type character, through caretaking and dialogue, in order to establish intelligent and useful conversations. The stepwise growth of the agent creates expectation and intimacy in users, which motivates them to use the spoken dialogue interface habitually.

In order to design the game efficiently according to the user experience on a smartphone, the human-centered design (HCD) process is applied [55]. The design process and the analysis of the real-world dialogue log data are described in this section.

3.3.1 Target Users

Instead of all smartphone users, the author narrowed the target users down to a group with two specific features at the first stage of the game design. The first feature is that the users like breeding characters, and the second is that they accept speech interfaces. As investigated in [56], more than twice as many female users as male ones play breeding games. Furthermore, the author analyzed the users of a smartphone game Peratama [57], in which users enjoy a character speaking words in a distinctive voice during play. The character was designed in the shape of an egg named peratama, which means “a talkative egg” in Japanese. Over 100 thousand users of Peratama, who are regarded as accepting speech interfaces, were grouped by age and sex, as shown in Figure 3.3. The proportion of female users in their 20s to 40s was 57.4%, the largest of any user group. Based on this analysis, the target users are defined as female smartphone holders in their 20s to 40s with experience playing mobile games.


Figure 3.3 Demographics of Peratama users

Furthermore, the author interviewed five females in their 20s who like playing mobile games about their behaviors in using smartphones and playing mobile games. By capturing the interactions and activities involved, the author summarized the following understandings of what the users want in daily life.

Understanding 1: They like cute characters.

Understanding 2: If there is an element for collecting items in the game, they

want to complete the collection to feel a sense of achievement.

Understanding 3: They want to play games during a short break or on a train without much effort.

In addition, in order to observe their behaviors related to speech interfaces, a total of 186 comments on Peratama were collected from Google Play, the App Store, and visitor books from exhibitions. The most representative comments were selected: “I don't like the unique progress process that a player cannot choose. The play style is so


humdrum.”; “It is not meaningful to see the character's growth except for the appearance change”; “The voice and the words of peratama are funny and sometimes heal my heart”. New understandings were discovered from these comments.

Understanding 4: They desire various types of interactions (play types).

Understanding 5: They expect significant changes in the breeding with an

image of taking care of babies or pets.

Understanding 6: They tend to pursue a feeling of healing from the breeding process and the characters.

3.3.2 Scenario Design

Based on all of the understandings of the target users, the original design concept is expanded into the following subconcepts:

• Subconcept 1: Users take care of the dialogue agent to make it grow up. According to Understandings 1, 2, and 3, the character changes into various appearances with unique voices. The caretaking actions are easy and earn game incentives, which are used to collect various items.

• Subconcept 2: According to Understanding 4, more than one type of play (caretaking) is prepared besides spoken dialogue. Users can freely choose the play type according to the situation. Meanwhile, the spoken interaction earns the greatest incentives.

• Subconcept 3: According to Understanding 5, users can clearly see the agent grow through the dialogue contents. The agent starts with baby talk and gradually becomes able to hold intelligent conversations. Finally, it is able to help users through dialogue, such as by checking schedules or waking users up.

• Subconcept 4: According to Understanding 6, the dialogue should be fun and sometimes emotional. After the breeding process, users can still interact with


the agent, which makes them feel involved in the agent's world.

Then, the scenario of a new breeding game with a spoken dialogue interface was designed. The game is called Hey Peratama, using the character peratama as the dialogue agent. The game scenario is mainly composed of five play scenes that are developed from the above subconcepts. Each scene has three scenario points, which are described below:

• Scene 1 developed from Subconcept 1: Users earn coins from breeding operations and collect the game items.

– 1-1 Users are asked to breed peratama in a room. Users can take care of peratama by touching it.

– 1-2 Each touch earns coins. Users can buy different costumes for peratama with the accumulated coins. Once users dress peratama in a new costume, peratama is endowed with a distinctive voice and appearance.

– 1-3 The costumes are displayed as silhouettes, so users can enjoy imagining them before the costumes are disclosed.

• Scene 2 developed from Subconcept 2: Users can take care of peratama by accompanying it in executing events following an assigned schedule.

– 2-1 If an event, such as taking a walk or cooking, is registered in peratama's calendar, users can accompany peratama in executing it. This is operated by playing a video of an event story.

– 2-2 Coins can be earned by this operation.

– 2-3 By collecting several new costumes, the intimacy level with peratama increases, which means peratama grows up. Consequently, peratama's talk becomes more sophisticated.

• Scene 3 developed from Subconcepts 2 and 3: Users can enjoy a conversation with peratama, which helps it grow and mature.


– 3-1 Users can talk with peratama.

– 3-2 Each conversation with peratama earns a lot of coins.

– 3-3 As the intimacy level increases, peratama's conversation ability gets better.

• Scene 4 developed from Subconcept 3: Via conversation, peratama can assist users in registering and confirming schedules.

– 4-1 Users can also register their own schedules in peratama's calendar.

– 4-2 Through conversations with peratama, users can register and confirm their own schedules.

– 4-3 Once users register schedules, they can earn coins.

• Scene 5 developed from Subconcepts 3 and 4: Users can use peratama in daily-life situations.

– 5-1 Users' favorite peratama can be placed on the home screen as a widget, which lets users check the game status easily.

– 5-2 After setting an alarm via conversation with peratama, users can be awakened by peratama's voice.

– 5-3 Images of the favorite peratama can be sent to friends over SNS or chat apps.

3.3.3 Evaluation of Scenario-based Acceptability

In order to take the importance of context into account, scenario-based acceptability was evaluated. A total of 11 females aged 20 to 30, who are regarded as the target users, participated in the evaluation. Four were college students and the others were office workers. All of them had smartphones and had experience playing mobile games. The evaluation was a one-on-one interview. First, the interviewer explained the scenario using a presentation sheet, which contained a detailed scenario description with users' actions, screen sketches, and the scenario points mentioned in


Section 3.3.2. Then the interviewer asked questions with a rating scale of acceptability for the scenario points of the five play scenes. The interview was limited to one hour per interviewee in order to minimize the burden on the interviewee.

All of the points were rated on a four-level rating scale: 1 - Strongly disagree; 2 - Disagree; 3 - Agree; 4 - Strongly agree. The average scores and standard deviations for each scenario point are presented in Table 3.1. Since most of the scenario points obtained high scores, the design was verified as appropriate for the target users.

Furthermore, comments were collected from the interviews, such as “I feel no stress in talking to the peratama because it is a game,”; “I will talk to the peratama a lot to earn coins to buy things,”; “I feel a sense of achievement when peratama's speech level is raised.”; “I will try to use the calendar function when I forget to bring my schedule book”. The positive results and comments were encouraging and gave hints to improve the scenario. Four revisions of the game scenario were carried out on the basis of the evaluation:

1. Scene 5-3 received the lowest acceptability score, below 2.5, because most users found it troublesome to search for the images while chatting. It was deleted from the scenario to improve usability.

2. As shown in Table 3.1, users are positive toward the incentives (coins) in Scenes 1-2, 2-2, 3-3, and 4-3. The author adjusted the number of coins earned through the game plays to encourage users to use speech input. Conversations with the peratama and execution of the alarm and calendar assistants earn 50 coins, while a touch operation results in only 1 coin.

3. Many users claimed that they want more game items besides costumes, which would motivate them to collect coins and play more. Items including interiors for the breeding room were added.

4. Since some comments said that users want peratama to call them by name and say warm words on special days, the author added a function that memorizes users' personal information and reflects it in the dialogues.


Table 3.1 Average score and standard deviation values

Scene No.    Average    Standard deviation
Scene 1-1    3.55       0.50
Scene 1-2    3.91       0.29
Scene 1-3    3.00       0.95
Scene 2-1    3.27       0.75
Scene 2-2    3.73       0.75
Scene 2-3    4.00       0.00
Scene 3-1    3.82       0.39
Scene 3-2    3.82       0.39
Scene 3-3    3.91       0.29
Scene 4-1    2.86       1.17
Scene 4-2    2.82       1.11
Scene 4-3    3.64       0.64
Scene 5-1    3.55       0.66
Scene 5-2    3.09       1.00
Scene 5-3    2.45       1.08

3.3.4 Game Implementation

The game is implemented as an Android application. Screenshots of the game are presented in Figure 3.4. In order to maintain high motivation and prevent users from becoming bored with the game, the game interface and progression follow the gamification theory and usability standards in [58].

1. The game provides clear goals for users: the goals are divided into a short-term goal, a midterm goal, and a long-term goal. The short-term goal is to collect coins by touching the peratama in the standby scene (Figure 3.4 (a)), accompanying events (Figure 3.4 (b)), and talk training (Figure 3.4 (c)),


which are easy and cost little in terms of time. Once the goal is reached, the

midterm goal is to collect the necessary costumes to increase the intimacy level

with the peratama (Figure 3.4 (d)). Then, the long-term or final goal is to

keep increasing the intimacy level to make the peratama grow and increase

the dialogue maturity and the assistant abilities (Figures 3.4 (e) and (f)).

2. The first experience is encouraging, and the play cycle makes users experience constant stress until the goal is achieved: the Hey Peratama game progresses rapidly in the early stages to get users interested. However, the numbers of events and of talking and touching operations per day are restricted, which gives users moderate stress in daily play.

3. It is important to consistently visualize the user's situation in the game and give feedback: every time the intimacy level goes up, a short tutorial explains what is new and what to do next. Sound effects are also provided after every operation.


Figure 3.4 Screenshots of Hey Peratama: (a) standby, (b) event video, (c) talk training, (d) costume list, (e) weather, (f) schedule


3.3.5 Analysis of the Real-World Dialogue Log Data

The game application was released on Google Play [59]. There were 415 downloads in the first week after the release. Over 23% of the total users played the game every day, and all of them actively talked to peratama in the game. In contrast to the low motivation in the in-house trial mentioned in Section 3.1, 26% of users played the game for more than three days. Many users commented that they wished to talk more than the restriction allowed, indicating that the design was effective. Based on the analysis of the log data, over 95% of the questions from peratama were answered. Furthermore, user-initiated dialogues, including chatting, weather checking, and schedule confirmation, were also activated. This shows that continual system-initiated dialogues built intimacy, which in turn activated user-initiated dialogues.

Furthermore, the author investigated the distribution of the number of utterances and the details of the utterance contents to understand how users were using speech input. The analysis is based on the utterances of 4,910 real-world users. Figure 3.5 shows three peaks in the number of utterances: 1) from 7:00 to 8:00, 2) from 20:00 to 21:00, and 3) from 0:00 to 1:00.

This distribution stayed invariant over a period of one year. These time zones are around the times of getting up, coming home, and going to bed; therefore, these dialogues mainly took place at the users' homes. The tendency of the speech contents between the user and peratama was also analyzed. The classification of speech contents uses a bag-of-words-based speech intention estimation for each utterance. Figure 3.6 shows the percentage of each utterance content in the time zones of 7:00, 20:00, and 0:00. Around 7:00 the topics were about weather and time, while confirmation of the schedule also occupied a relatively high proportion. From 20:00 to 21:00, chit-chat and questions related to peratama occurred most in the dialogues. Around 0:00, a large number of utterances contained contents similar to those around 7:00, with the addition of fortune-telling and asking the date.


Figure 3.5 Utterance distribution: the proportion of utterances in each hour of the day (0:00 to 23:00)

Based on the above findings, it is expected that a function that pushes an appropriate dialogue topic according to the time of day will be effective in prompting users to use the dialogue system more habitually.


Figure 3.6 Distribution of major user-initiated dialogue topics at 7:00, 20:00, and 0:00

3.4 Summary

This chapter presents studies on how to design a spoken dialogue system for habitual use in information search. Based on the investigations of an in-house trial of a spoken dialogue system and the analysis of agent characters' impressions against the user's tolerance, human-centered design and gamification theory were practiced to design a breeding game. A novel design concept of breeding a character through caretaking and dialogue was proposed in order to make users gradually feel that speaking to the dialogue agent is natural and fun. The designed scenario obtained high scores in the subjective evaluation experiments. After the release of the game on Google Play, the analysis of the real-world users also


positively supported the design: over 23% of users keep engaging in spoken dialogue interactions in the game continually.


Chapter 4

Dialogue Management for Information Search

This chapter introduces the second point of my research: the work on optimizing the dialogue management for robust information search. As is well known, an important use of spoken dialogue systems is to obtain information through natural utterances. In particular, they are regarded as a suitable interface for the information illiterate. The general goal of this work is to realize effective interaction and robust search between people and information resources such as databases or the Internet.

First, a knowledge-based spontaneous dialogue system with a strategy of optimal question selection is proposed to assist users' operations. The task targets recipe search, as cooking-related applications and research have become more popular recently [60] [61] [62]. The proposed dialogue strategy actively asks the users questions to help them easily select satisfactory recipes from among thousands of potential options. It aims to help the users, especially novices who do not have abundant knowledge about recipes. Based on the results of the evaluation experiments, it is concluded that the application with the proposed strat-


egy performs better than the conventional applications in terms of satisfaction with the search results and the effort spent reconsidering search keywords.

Another issue of spoken dialogue systems for information search is mistaken queries. A great number of queries contain incorrect parts due to speech recognition errors as well as users' mishearing or mis-memorization. In order to improve the system performance for information search with these incorrect queries, a robust and fast phonetic string matching method is proposed in my work. Experimental results show that the proposed search method reduces processing time by more than 86.2% compared with the conventional methods at the same search accuracy, in the task of searching lyrics with real-world misheard queries. Furthermore, some studies show that the recognition errors in unlimited-vocabulary speech recognition include a great number of misrecognized words with similar morpheme strings, such as foreign names [63] [64]. This suggests that the proposed search method is effective for recovering from the search failures caused by speech recognition errors in spoken dialogue systems.

4.1 Dialogue Strategy of Optimal Question Selection

This section proposes a novel dialogue strategy for easily selecting satisfactory and proper information from among thousands of potential options. It is applied in a recipe search application on mobile devices and helps the users, especially novices who do not have abundant knowledge about recipes.

The application with the proposed dialogue strategy asks the users a series of questions related to various cooking categories, including recipe genres and cooking needs, in order to narrow down the potential recipes to those that meet the users' wants. A decision algorithm is applied to find the best question to ask to narrow down the


candidates as quickly as possible according to the recipe database. Furthermore, the questions corresponding to the recipes that the users viewed during the search process are preferentially selected.

4.1.1 Spoken Dialogue System for Information Search in a Cooperative Manner

The interface of the proposed recipe search application is shown in Figure 4.1. Area

A is for user operation, and supports typing, speech input and button-press input.

A brief operation guidance is also shown here. Area B is for showing the search

keywords by which the results are refined. Area C is for displaying the search results:

recipes with introductions in text and pictures. The application can “speak” over the device's speaker using a speech synthesis engine. The spoken words are also displayed

in a balloon.

To simplify the search process for novice users, the proposed recipe search application employs a spoken dialogue system. The dialogue management is designed as a mixture of a user-initiative model and a system-initiative model. The flowchart of the dialogue scenario is shown in Figure 4.2. First, the application prompts an open question to the user: “What kind of recipes are you looking for?” The user can answer, for example, “A recipe using chicken and cabbage”. The recipe results matching those keywords are then presented. At the same time, the application prompts, “Please press here if confused by the number of recipe results” and displays a button. Once the user presses the button, the application starts to ask the user a series of questions related to the recipe categories, such as “Do you use an oven?” The user then simply presses a button or speaks into the microphone to choose yes or no, and the recipe results are consequently narrowed down until the user finds a satisfactory recipe. This design is intended to assist the user who has not prepared many keywords but expects an easy search. Even while being asked questions, the


Figure 4.1 Interface of the proposed application (Area A: user operation; Area B: search keywords; Area C: recipe results with pictures and content)

user is allowed to view the details of recipes and freely input keywords and sentences

to narrow down the number of recipe results.


Figure 4.2 Flowchart of the dialogue scenario

4.1.2 Question Selection for Quick Narrowing Down of Hit Data

In order to minimize the number of search refinement steps, the question selection in the spoken dialogue system is based on the information gain over the remaining recipes. The information gain is calculated by Equations 4.1 and 4.2, using the decision algorithm of iterative dichotomiser 3 (ID3) [65].

H(S) = -\sum_{x \in X} p(x) \log_2 p(x) \quad (4.1)

Page 57: Study on Optimal Spoken Dialogue System for Robust Information

Chapter 4. Dialogue Management for Information Search 44

IG(A) = H(S) - \sum_{t \in T} p(t) H(t) \quad (4.2)

Entropy H(S) represents the average amount of uncertainty in the recipe set S. p(x) is the proportion of the number of elements in attribute (or question) x to the number of elements in S. T represents the subsets created by splitting set S on attribute (or question) A. p(t) represents the proportion of the number of elements in t to the number of elements in set S. Information gain IG(A) is the difference in entropy from before to after the set of recipes S is split on an attribute (or question) A. After each refinement, the information gain is calculated for each remaining question, and the question related to the category with the largest information gain is used in that iteration.

Furthermore, since users usually check a number of recipes that they are interested in before making the final decision, preferentially selecting the questions corresponding to the recipes that users are viewing is expected to help reflect the user's latent search purpose. Therefore, P(A) is introduced as a weight on IG(A) in Equation 4.3; it is the proportion of the number of recipes categorized by A that have been viewed by the user during the search process to the number of all recipes that have been viewed, and r is a smoothing parameter. The question related to the category with the highest PIG(A) value is regarded as the optimal candidate that balances search efficiency and the user's personalization. After each refinement, PIG(A) is calculated for each remaining question, and the question with the largest PIG(A) is used in that iteration.

PIG(A) = (P(A) + r) \times IG(A) \quad (4.3)
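To make Equations 4.1 to 4.3 concrete, the following is a minimal Python sketch of the question selection, under the assumption that each remaining recipe is an equally likely target (so H(S) reduces to log2 of the number of remaining candidates) and that each question splits the candidates into yes/no subsets by one category; all function and variable names are illustrative, not the thesis's implementation.

```python
# A minimal sketch of optimal question selection (Equations 4.1-4.3),
# assuming a uniform distribution over the remaining recipe candidates.
import math

def subset_entropy(size):
    # Entropy of a uniform distribution over `size` remaining candidates.
    return math.log2(size) if size > 0 else 0.0

def information_gain(recipes, category):
    yes = sum(1 for r in recipes if category in r["categories"])
    no = len(recipes) - yes
    h_before = subset_entropy(len(recipes))
    h_after = (yes / len(recipes)) * subset_entropy(yes) \
            + (no / len(recipes)) * subset_entropy(no)
    return h_before - h_after          # IG(A), Equation 4.2

def select_question(recipes, categories, viewed, r=0.1):
    # viewed: recipes the user opened during this search session.
    def weight(category):              # P(A): share of viewed recipes in A
        if not viewed:
            return 0.0
        return sum(1 for v in viewed
                   if category in v["categories"]) / len(viewed)
    # PIG(A) = (P(A) + r) * IG(A), Equation 4.3; ask the highest-scoring one.
    return max(categories,
               key=lambda a: (weight(a) + r) * information_gain(recipes, a))

recipes = [{"categories": {"oven", "western"}},
           {"categories": {"western"}},
           {"categories": {"japanese"}}]
print(select_question(recipes, ["oven", "western", "japanese"], viewed=[]))
```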

Based on a questionnaire given to users and an investigation of recipe websites, 81 recipe categories of cooking purposes are defined. They correspond to cooking situations, cooking utensils, cooking methods, ingredient genres, cooking needs, cooking difficulty, dish origins, and dish genres. Besides the normal categories, the


categories adapted to spoken queries, such as outdoor, diet, and saving electricity, are also included.

The learning algorithm of confidence-weighted linear classification (CW) [66] is used for classification and estimation. It allows recipes to belong to multiple categories. The TF-IDF [67] values of the nouns and adjectives in the recipe texts are selected to build a bag-of-words model. In an evaluation involving 5-fold cross-validation on the 81 categories (on average, each category uses 777 recipes as a test set), the average classification accuracy is 94%.
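As a rough illustration of this classification pipeline, the sketch below builds TF-IDF bag-of-words features and trains one binary linear classifier per category so that a recipe can belong to multiple categories. Note that scikit-learn's SGDClassifier stands in for the confidence-weighted (CW) learner of [66], which is not available in scikit-learn, and that the recipe texts and category names are hypothetical.

```python
# A minimal sketch of multi-label recipe categorization with TF-IDF features.
# SGDClassifier is a stand-in for CW linear classification (an assumption).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["grilled chicken in the oven for a party dinner",
         "quick microwave diet salad for one",
         "classic japanese miso soup, easy for beginners"]
labels = [{"oven", "party"}, {"diet", "quick"}, {"japanese", "easy"}]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)          # multi-label target matrix

model = make_pipeline(
    TfidfVectorizer(),                       # thesis: nouns/adjectives only;
    OneVsRestClassifier(SGDClassifier()))    # here: all tokens, for brevity
model.fit(texts, y)

pred = model.predict(["easy miso soup recipe"])
print(binarizer.inverse_transform(pred))     # e.g. [('easy', 'japanese')]
```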

4.1.3 Experimental Evaluations

To evaluate the usability of the proposed application, it was compared with a conventional keyword-based search application (Application1). Furthermore, a major commercial recipe search application (Application2) was also compared as a reference. The mobile device in the experiment was a Google Nexus 7 tablet (OS: Android 4.1). Comparisons of both subjective evaluations and operating time were carried out.

Experiments were conducted with 7 participants (3 females, 4 males) aged 20 to 40, including engineers and clerks. All of them were experienced smartphone users and novice users of recipe search services. Application1 is almost the same as the proposed application except for the dialogue interface; it only allows keyword input with a keyword suggestion function. Both the proposed application and Application1 store 54,277 recipes in Japanese and are implemented on an open-source search engine [68]. Application2 is a widely used recipe search application with a commercial database of over 1 million recipes. All of the applications support speech input through the same standard Android speech input module [53]. The experiments were carried out in a quiet room. The participants were asked to imagine that they were in their own living room,


trying to search for a satisfactory recipe with the mobile device for cooking dinner one hour later. No time limit was set. For each application, a list of two designated tasks and one free task was given, in which the participants were asked to finally decide on one recipe for each requirement. For example, one designated task was searching for a Western-food recipe with spinach as an ingredient. The participants were allowed to add or change keywords related to the given cases. Before they started the experiment, a brief guidance video was shown, and a few minutes of preliminary practice were allowed. The order in which each application was used was randomly determined for each participant.

Once the participants finished operating an application, they were asked to rate six aspects of their experience using a 5-point Likert scale, on which 5 points were awarded for the most satisfactory experience and 1 point for an unsatisfactory experience. The aspects are as follows:

1. Did you find the application easy to operate?

2. Did the interface become clear after you read the application's guide?

3. Did the next operation feel smooth to execute when you were trying to find a recipe?

4. Did you find it easy to come up with the keywords?

5. Were you satisfied with the number of operations needed to input a keyword?

6. Are you satisfied with the recipe that you decided on?

The average scores are shown in Table 4.1. The higher the score, the more positively the participant evaluated the application. Besides the scores, the participants' free comments on each aspect were also recorded. On aspect 1, the proposed


application achieved worse scores than the others. This is because operating a spoken dialogue interface was quite new to the participants, and getting used to when and how to answer was not easy. On aspect 2, Application1 achieved the best evaluation because of its simplest interface. However, the participants evaluated the proposed application most positively on aspect 3. The reason is supposed to be the speech guidance and the simple yes/no operation for refining the search results. Moreover, the proposed application received higher scores than the other conventional applications on aspects 4 and 5. This verifies that the strategy of using a dialogue interface to simplify and refine the search is effective, since it saved the participants the effort of reconsidering keywords. Finally, the proposed application achieved the best score on aspect 6. This is ascribed to the proper question selection, which provides useful keywords and information for the participants to search for recipes. In addition, for the proposed application only, the aspect “Did you find the questions asked by the application helpful?” was also rated. The average score is 4, which is relatively positive. The operating times, including both real times (the real-world elapsed time) and sensory times (the elapsed time estimated by the participants), were also recorded. As shown in Table 4.2, the proposed application takes more time than the others, as it speaks to the participants to ask the questions. However, unlike Application2, the sensory time of the proposed application is shorter than the real time, which reflects that the participants felt less stress when using the proposed application.

Based on the results of the evaluation experiments, it is concluded that the proposed application performs better than the conventional applications in terms of satisfaction with the search results and the effort spent reconsidering search keywords.


Table 4.1 Results of subjective evaluation

Score      Proposed Application   Application1   Application2
Aspect 1   4.29                   4.43           4.43
Aspect 2   4.29                   4.71           3.57
Aspect 3   4.29                   3.86           4.14
Aspect 4   3.29                   3.00           3.00
Aspect 5   3.71                   3.57           3.29
Aspect 6   4.14                   3.86           3.86

Table 4.2 Average operating time for one search task

Time (min)     Proposed Application   Application1   Application2
Real Time      3.80                   3.74           3.47
Sensory Time   3.14                   2.95           3.55

4.2 Information Search by Robust and Fast Phonetic String Matching Method

As the work above has proved that applying a spoken dialogue system makes information search easier, this research focuses on improving the robustness of the spoken dialogue system for information search. Information search by voice is the technology underlying many spoken dialogue systems that enable users to access information through spoken queries. The information normally resides in a large database, and the query has to be compared with a field in the database to obtain the relevant information. Unlike the spoken dialogue technology for ATIS [4]-style systems, which emphasizes detailed semantic analysis, a robust search method is more essential for the dialogue management of information search. It needs to be robust to the following conditions unique to the real world:


• Incorrect queries due to users' mishearing or mis-memorization

• Incorrect queries with ASR errors

My work proposes a robust and fast search method to improve the performance of the spoken dialogue system with these incorrect queries. To achieve the robustness, an “acoustic distance”, computed based on a confusion matrix from an automatic speech recognition experiment, is originally introduced. The acoustic distance is then applied to dynamic programming (DP)-based phonetic string matching to identify the target contents that the incorrect queries refer to. Furthermore, as the latency of the spoken dialogue system is important, a two-pass search algorithm is proposed to realize real-time execution. The algorithm pre-selects the probable candidates using a rapid index-based search in the first pass and executes a DP-based search process with an adaptive termination strategy in the second pass.

The proposed search method is implemented for music information retrieval (MIR). Many commercial systems accept diverse queries by text, humming, singing, and acoustic music signals. Among these types of queries, text queries of lyric phrases are commonly used (lyric search) [69]. The effectiveness of lyric search systems based on full-text retrieval engines or web search engines is highly compromised when the queries of lyric phrases contain incorrect parts due to mishearing. Experimental results prove the proposed search method to be the most practical solution for these incorrect queries, considering the trade-off between high search accuracy and low computational complexity.

4.2.1 Analysis of the Real World Queries

Several investigations were carried out on the collected real world queries. The sta-

tistical features of the queries and some issues peculiar to lyric search are presented

in this section.


To analyze the queries of lyric phrases for MIR in the real world, a preliminary analysis investigated the major Japanese question & answer community websites “okwave” [70] and “oshiete goo” [71]. It was found that many questions used lyric phrases to request the names of songs and singers. After 1,140 queries of lyric phrases asked by various questioners were collected, each query was compared with its corresponding lyric to categorize whether the lyric phrases in the query were correct (correct query or incorrect query) and how they were mistaken. The lyrics and queries are written in Japanese or English, or a mixture of both.

Figure 4.3 shows the distribution of the different types of incorrect queries and the correct queries within the collected data. The incorrect queries, which make up around 79%, are classified into the following types:

• Confusion of notations: Chinese characters in the queries are substituted for

reading symbols (kana), and vice versa.

• Function-word-error: Only the function words (such as prepositions, pronouns,

auxiliary verbs), which have little lexical meaning, are mistaken in the queries.

• Content-word-error: The content words (such as nouns, verbs, or adjectives),

which have stable lexical meanings, are mistaken in the queries.


Figure 4.3 The distribution of the different types of mistaken queries and the correct queries within the collected queries

In current full-text search methods, function-word-error and confusion of notations can be handled using a stop word list for filtering out the function words


[72], and a hybrid index of words and syllables [73].

On the other hand, as the content words play more important roles in determining the search intention [72], the content-word-error queries were further categorized into three subtypes, namely “acoustic confusion”, “meaning confusion”, and “others”. The percentages and examples are listed in Table 4.3, with the mistaken parts marked in bold.

Acoustic confusion is defined as the replacement of a word with one of similar pronunciation, or the replacement of words of unknown spelling with reading-symbol strings of similar pronunciation. In the first example of acoustic confusion queries in Table 4.3, “/kotoganai/” and “/kotobawanani/” have similar pronunciations although the character strings have no common parts. In the second example, a Japanese syllable (kana) string whose pronunciation is similar to the English phrase “You've been out riding fences for so long now” in the target lyric is used as the query. This is assumed to happen when users are not able to spell the foreign words that they hear in a song.

Meaning confusion is defined as the replacement of a word with its synonym or near-synonym. As shown in Table 4.3, in the first example of meaning confusion queries, “/anata/” is mistaken for “/kimi/”; both terms mean “you” in Japanese. In the second example, “/tsuki/” and “/hoshi/”, which mean “moon” and “star”, are confused.

The “others” type contains word insertions, word deletions, and other errors in the queries. From the analysis of the collected examples, it is known that mistakes of the “others” type are caused by a variety of factors, including individual experiences and memories. The analysis did not find a relationship between these mistakes and the lyrics.

As the acoustic confusion queries occupy about 19.3% of the collected queries

(45.0% of content-word-error queries), it remains an important issue for lyric search.


Table 4.3 The distribution of mistaken types within content-word-error cases

Acoustic confusion — 19.3% (220 queries)
  Ex.1 (Japanese)
    Correct lyric: 好きな事がない /sukinakotoganai/ (There is nothing I like.)
    Mistaken query: 好きな言葉は何 /sukinakotobawanani/ (What are your favorite words?)
  Ex.2 (English)
    Correct lyric: You've been out riding fences for so long now /yuubiibiiNNautoraidiNNgufeNNshizufoosooroNNgunau/
    Mistaken query: ユーベーナウプラウドゥンシーンクスソーセングナウ /yuubeenaupurauduNNshiiNNkususooseNNgunau/ (no actual meaning)

Meaning confusion — 7.3% (83 queries)
  Ex.1 (Japanese)
    Correct lyric: 君には何でも話せるよと /kiminiwanaNNdemohanaseruyoto/ (I can say anything to you)
    Mistaken query: あなたには何でも話せるよと /anataniwanaNNdemohanaseruyoto/ (I can say anything to you)
  Ex.2 (Japanese)
    Correct lyric: 月に願いを /tsukininegaio/ (pray to the moon)
    Mistaken query: 星に願いを /hoshininegaio/ (pray to the star)

Others — 16.3% (186 queries)
  Ex.1 (Japanese)
    Correct lyric: 星から来た子の見る夢は /hoshikarakitakonomiruyumewa/ (The dream that the child who came from the star has)
    Mistaken query: 星の子チョビンの見る夢は /hoshinokochobiNNnomiruyumewa/ (The dream that child Chobin of the star has)

My research focuses on the solution to the acoustic confusion issue for lyric search.

The average length of the 220 collected acoustic confusion queries is about 6 words. The word error rate of the incorrect words is about 53.1% (insertion errors not included). In the text retrieval field, fuzzy matching algorithms such as Latent Semantic Indexing (LSI) and partial matching are used by major commercial web


search engines [74] to improve the robustness against incorrect queries. Thereby, a search test was carried out to evaluate how robust web search engines are against the 220 collected acoustic confusion queries. The test results are shown in Table 4.4. The number of “hits” is the number of webpages mentioning the target lyric that are included in the top 20 results returned by a search engine. Correct queries are the correct versions of the incorrect queries. Compared with the number of hits for the correct queries, the performance of both web search engines is severely degraded for the incorrect queries.

According to this result, identifying the lyric containing the part that is acoustically most similar to the query is expected to be a better solution for acoustic confusion than focusing on the textual or semantic aspects.

Table 4.4 Number of hits by two web search engines

                        Web Search Engine 1   Web Search Engine 2
220 correct queries     175                   157
220 incorrect queries   27                    16

4.2.2 Investigation of the Relationship between the Query Length and DP Matching Pre-selection

Furthermore, an investigation of the relationship between the lyric query length and DP matching pre-selection was carried out. As introduced in Section 2.3, conventional high-speed DP-based search methods usually contain a pre-selection approach that prunes out improbable search paths by comparing the DP matching distance with a predetermined threshold proportional to the length of the query. To find out whether this is a practicable approach to the lyric search problem, the


DP matching distances between the queries and the correct lyrics were also analyzed. Figure 4.4 shows the distribution of the analysis data. The horizontal axis is the number of phonemes in each query, representing the query length, and the vertical axis is the DP matching distance between the query and the correct lyric. Figure 4.4 shows that the phoneme counts of the queries are distributed over a broad range, from 5 to 57. In addition, the distance values between the queries and the correct lyrics show no statistical relationship with the query length. Thereby, it is practically difficult for a conventional method, such as the one in [45], to find an appropriate threshold based solely on the length of the query.


Figure 4.4 The distribution of query length in phonemes and the DP matching distances from the correct lyric for the real-world incorrect queries

4.2.3 Acoustic Distance Derived from a Phoneme Confusion Matrix

In this section, the proposed acoustic distance is presented. Acoustic distance be-

tween two strings is calculated by DP matching with cost values derived from pho-

netic confusion probabilities instead of a constant cost value used for edit distance.


First, a phonetic confusion matrix is obtained by running a phoneme speech recog-

nizer over a set of speech data and aligning the phoneme strings of recognition results

with reference phoneme strings, which uses the same speech recognition experiment

as in [75].

For the elements of the confusion matrix, g(p, q) means the number of instances

of phoneme q obtained as recognition results by the actual utterances of phoneme

p. As “ϕ” represents a null, g(ϕ, p) means the number of instances of the wrongly

inserted phoneme p (insertion) and g(p, ϕ) means the number of instances of the

deleted phoneme p (deletion). U represents the set of 37 phonemes including null.

For each phoneme p, the phonetic confusion probabilities of an insertion Pins(p),

deletion Pdel(p) and substitution for phoneme q Psub(p, q) are calculated on the basis

of the confusion matrix elements, by Equations 4.4 to 4.6.

P_{ins}(p) = \frac{g(\phi, p)}{\sum_{k \in U} g(k, p)} \quad (4.4)

P_{del}(p) = \frac{g(p, \phi)}{\sum_{k \in U} g(p, k)} \quad (4.5)

P_{sub}(p, q) = \frac{g(p, q)}{\sum_{k \in U} g(p, k)} \quad (4.6)

As a large value of Pins(p) represents high confusability for an insertion of p, it

corresponds to the low cost of an insertion operation for p in string matching based

on DP. Therefore the value of insertion cost Cins(p) is calculated by Equation 4.7.

In the same way, the value of deletion cost Cdel(p) and substitution cost Csub(p, q)

are calculated from the corresponding phonetic confusion probabilities by Equations

4.8 and 4.9.

C_{ins}(p) = 1 - P_{ins}(p) \quad (4.7)


C_{del}(p) = 1 - P_{del}(p) \quad (4.8)

C_{sub}(p, q) = 1 - P_{sub}(p, q) \quad (4.9)
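A minimal Python sketch of Equations 4.4 to 4.9 follows, assuming the confusion matrix g is stored as a nested dictionary with a "phi" entry for the null phoneme; the toy phoneme set and counts are illustrative, not the matrix obtained from the experiment in [75].

```python
# A minimal sketch of deriving insertion/deletion/substitution costs from a
# phoneme confusion matrix g (Equations 4.4-4.9). Counts are toy values.
PHI = "phi"
U = ["a", "i", "u", PHI]                 # toy phoneme set including null

g = {"a": {"a": 90, "i": 5, "u": 2, PHI: 3},   # row p: uttered phoneme
     "i": {"a": 4, "i": 88, "u": 6, PHI: 2},   # column q: recognized phoneme
     "u": {"a": 1, "i": 7, "u": 85, PHI: 7},
     PHI: {"a": 2, "i": 3, "u": 1, PHI: 0}}

def p_ins(p):   # Equation 4.4: g(phi, p) over the column sum for p
    return g[PHI][p] / sum(g[k][p] for k in U)

def p_del(p):   # Equation 4.5: g(p, phi) over the row sum for p
    return g[p][PHI] / sum(g[p][k] for k in U)

def p_sub(p, q):  # Equation 4.6: g(p, q) over the row sum for p
    return g[p][q] / sum(g[p][k] for k in U)

def c_ins(p):     return 1.0 - p_ins(p)       # Equation 4.7
def c_del(p):     return 1.0 - p_del(p)       # Equation 4.8
def c_sub(p, q):  return 1.0 - p_sub(p, q)    # Equation 4.9

print(round(c_sub("a", "i"), 3))  # cheaper substitution for confusable pairs
```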

Second, with the calculated cost values, edge-free DP matching between the phoneme strings S_1 and S_2 is carried out by Equations 4.10 to 4.12. Here, S[x] is the xth phoneme of phoneme string S, and len(S) is the length of S (S_1, S_2 ∈ S). D(i, j) designates the minimum distance from the starting point to the lattice point (i, j). D_{S_1,S_2} is the accumulated cost of DP matching between S_1 and S_2, which is defined as the acoustic distance. It reflects the acoustic confusion probability of each phoneme.

1. Initialization:

D(0, j) = 0 \quad (0 \le j \le len(S_2)) \quad (4.10)

2. Transition:

D(i, j) = \min \begin{cases} D(i, j-1) + C_{ins}(S_2[j]) \\ D(i-1, j-1) + C_{sub}(S_1[i], S_2[j]) \\ D(i-1, j-1) & (\text{if } S_1[i] = S_2[j]) \\ D(i-1, j) + C_{del}(S_1[i]) \end{cases} \quad (4.11)

3. Determination:

D_{S_1, S_2} = \min_{0 < j \le len(S_2)} D(len(S_1), j) \quad (4.12)
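The edge-free DP matching of Equations 4.10 to 4.12 can be sketched in Python as follows; the free starting column (D(0, j) = 0) and the minimum over the final row allow the query to match anywhere inside a lyric. The cost functions are passed in as parameters, so the same routine covers both plain edit distance (constant costs) and the proposed acoustic distance.

```python
# A minimal sketch of edge-free DP matching (Equations 4.10-4.12). s1 is the
# query phoneme string, s2 the lyric phoneme string; costs come from the
# c_ins/c_del/c_sub functions derived above (or constants for edit distance).
def acoustic_distance(s1, s2, c_ins, c_del, c_sub):
    n1, n2 = len(s1), len(s2)
    # D[i][j]: minimum accumulated cost to lattice point (i, j).
    # Row 0 stays 0.0, implementing the free start of Equation 4.10.
    D = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        D[i][0] = D[i - 1][0] + c_del(s1[i - 1])   # only deletions possible
        for j in range(1, n2 + 1):
            match = 0.0 if s1[i - 1] == s2[j - 1] \
                    else c_sub(s1[i - 1], s2[j - 1])
            D[i][j] = min(D[i][j - 1] + c_ins(s2[j - 1]),     # insertion
                          D[i - 1][j - 1] + match,            # sub / match
                          D[i - 1][j] + c_del(s1[i - 1]))     # deletion
    return min(D[n1][1:])   # Equation 4.12: free end point inside the lyric

# With uniform unit costs this reduces to substring edit distance:
unit = lambda *args: 1.0
print(acoustic_distance("sukinakoto", "xxsukinakotoxx", unit, unit, unit))
```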


4.2.4 Fast Two-pass Search Algorithm in Consideration of Acoustic Similarity

Another important requirement for the spoken dialogue system and information search is a real-time response. As the search algorithm of phonetic string matching methods is based on exhaustive DP matching, the computational complexity is on the order of m × n × It per query, where m is the length of the query, n is the average length of a lyric, and It is the number of lyrics to search. Since commercial MIR systems usually provide hundreds of thousands of lyrics, the computational complexity is too high to realize a real-time search.

Therefore, a two-pass search strategy based on the acoustic distance is proposed for the DP-based phonetic string matching in order to realize a real-time search. It is realized through the following steps: off-line index construction, a rapid index-based search in the first pass, and a DP-based search process with an adaptive termination strategy in the second pass.

Preliminary indexing is done as an off-line process. Ideally, the DP matching computation of the acoustic confusion distance between queries and lyric texts would be done beforehand; however, this is impossible in reality because the number of query patterns is too large to be predicted.

An inverted index is constructed beforehand for the first-pass search. The whole lyric set L_{It} is converted into syllable strings using a morphological analysis tool such as Mecab [76]. Here, It represents the number of lyrics in the whole set. The syllable strings are converted into phoneme strings by referring to a syllable-to-phoneme translation table. Consequently, a phoneme string S_{L(k)} represents a lyric L(k) (L(k) ∈ L_{It}), where k is the lyric number. On the other hand, a list of linguistically existing units of N successive syllables (syllable N-grams) A_1 · · · A_n is collected from the lyric corpus. The units are organized as index units for fast access, as shown in Figure 4.5. The acoustic distance D_{S_{A_n}, S_{L(k)}} between the


phoneme strings of A_n and L(k) is pre-computed by Equations 4.10 to 4.12 and stored in the index matrix. This matrix can be regarded as an index of acoustic confusion.

In the search process, the fast first pass is realized by accessing the index described above, using the following steps. The flowchart is shown in Figure 4.5:

1. The input query Q is converted into a syllable string v by Mecab.

2. By Equation 4.13, the syllable string is converted into syllable N-gram sets V_1, . . . , V_m, . . . , V_M. Here, v[m] is the mth syllable of v.

V_m = \{v[m], v[m+1], \cdots, v[m+N-1]\} \quad (4.13)

3. V_1, . . . , V_m, . . . , V_M are matched with the index units A_1, . . . , A_n, . . .. By accumulating the pre-computed and indexed distance values D_{S_{A_n}, S_{L(k)}}, the approximate acoustic distance R(k) is calculated by Equation 4.14.

R(k) = \sum_{m=1}^{M} D_{S_{A_n = V_m},\, S_{L(k)}} \quad (4.14)

4. To narrow the search space of lyrics, lyrics L(k) with higher R(k) are pruned off, and a lyric set L_{Ic} containing the Ic (Ic < It) best lyric candidates is preserved for the second pass.

As seen in the four steps, the order of the syllable N -grams is not considered in

the first pass.
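The first pass can be sketched in Python as below, assuming the index is held in memory as a mapping from each syllable N-gram to its row of pre-computed distances; Mecab's character-to-syllable conversion is outside the sketch, and all names and toy values are illustrative.

```python
# A minimal sketch of the first-pass search: decompose the query into
# syllable N-grams (Equation 4.13), accumulate the indexed distances into
# R(k) (Equation 4.14), and keep the Ic candidates with the smallest R(k).
def first_pass(query_syllables, index, num_lyrics, n=3, ic=500):
    # Equation 4.13: overlapping N-grams of the query syllable string.
    ngrams = [tuple(query_syllables[m:m + n])
              for m in range(len(query_syllables) - n + 1)]
    # Equation 4.14: accumulate indexed distances into R(k) per lyric.
    R = [0.0] * num_lyrics
    for ngram in ngrams:
        row = index.get(ngram)
        if row is None:
            continue            # N-gram not linguistically existing in corpus
        for k in range(num_lyrics):
            R[k] += row[k]
    # Pre-select the Ic best candidates (lowest approximate distance).
    ranked = sorted(range(num_lyrics), key=lambda k: R[k])
    return ranked[:ic], R

# Toy index over 3 lyrics: lyric 1 is closest to both query N-grams.
index = {("su", "ki", "na"): [0.9, 0.1, 0.8],
         ("ki", "na", "ko"): [0.7, 0.2, 0.9]}
candidates, R = first_pass(["su", "ki", "na", "ko"], index,
                           num_lyrics=3, ic=2)
print(candidates)   # -> [1, 0]
```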


Figure 4.5 Flowchart of the first pass search: the query in characters (e.g., 「好きな事がない」) is converted into syllables (su-ki-na-ko-to-ga-na-i) and composed into syllable 3-grams (V_1 = su-ki-na, . . . , V_6 = ga-na-i); the pre-computed index of distances D_{S_{A_n}, S_{L(k)}} is accessed and accumulated into the approximate acoustic distances R(k); and a lyric set L_{Ic} of Ic candidates is pre-selected

Secondly, a DP-based search with an adaptive termination strategy in the second

pass is done.

By means of the pre-selection in the first pass, the range of target lyrics is narrowed down to L_{Ic}. DP matching with the lyrics in L_{Ic} is then carried out to calculate the precise distances, and the candidates with the minimum acoustic distance D(k) are returned as the search results. Since R(k) is calculated as an approximation of the DP matching distance D(k), after L_{Ic} is sorted by R(k), the correct lyric with


the minimum D(k) rises to the top ranks in most cases. Thus, instead of exhaustive DP matching over the entire set of pre-selected lyrics L_{Ic}, a DP-based search with an adaptive termination criterion is proposed. The termination is adaptive to a cut-off function F. The second pass search is designed as shown in the flowchart in Figure 4.6.

The lyrics in L_{Ic} are first sorted by R(k) and then divided into Z groups, so each group has Ic/Z lyrics. DP matching is executed in one group after another as long as the cut-off function F does not fulfill the termination condition. Once the value of F reaches a threshold F_th, the DP matching process is aborted at that group. Within the lyrics of the calculated groups, the lyrics are ranked in order of D(k), and the lyrics with lower distance values are provided as the search results.
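A sketch of the second pass follows. Since the chapter leaves the cut-off function F abstract at this point, the sketch substitutes a simple illustrative criterion (terminate once the best exact distance found so far undercuts the smallest approximate distance remaining); the actual F and threshold F_th of the thesis are not reproduced here.

```python
# A minimal sketch of the second pass: group-wise DP matching over the
# pre-selected candidates with an adaptive termination (illustrative F).
def second_pass(candidates, R, exact_distance, z=10, top=20):
    # `candidates` must already be sorted by ascending R(k).
    size = max(1, len(candidates) // z)
    results = {}                            # k -> exact DP distance D(k)
    for start in range(0, len(candidates), size):
        group = candidates[start:start + size]
        for k in group:
            results[k] = exact_distance(k)  # Equations 4.10-4.12 per lyric
        rest = candidates[start + size:]
        # Illustrative cut-off: stop when the best exact distance so far
        # undercuts the best approximate distance R(k) still uncomputed.
        if not rest or min(results.values()) <= min(R[k] for k in rest):
            break
    ranked = sorted(results, key=results.get)
    return ranked[:top]                     # lyrics with the lowest D(k)

# Usage with the toy first-pass output above (exact_distance would wrap
# acoustic_distance over the phoneme strings of lyric k):
print(second_pass([1, 0, 2], [1.6, 0.3, 1.7],
                  exact_distance=lambda k: [0.5, 0.05, 0.7][k],
                  z=3, top=2))              # -> [1]
```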


Figure 4.6 Flowchart of the second pass search: the Ic lyrics from the first pass are divided into Z groups; DP calculation is executed in one group after another, with the cut-off function F calculated after each group; once F < F_th (threshold), the process finishes, and the calculated lyrics with lower distance values are provided as search results

4.2.5 Experiments

Two sets of experiments were carried out in this research. First, the improvement in search accuracy obtained by applying the acoustic distance was evaluated. Second, the proposed method applying the acoustic distance and the two-pass search algorithm was compared with three conventional methods to evaluate its performance on both


search accuracy and processing time.

The results of the experiments were all obtained using a personal computer with an Intel Core2Duo 3.0 GHz CPU and 4 GB of RAM.

4.2.5.1 Verification of Improvements in Search Accuracy by Applying the Acoustic Distance

Two exhaustive DP-based search methods using different distances were compared

to evaluate the advantage of the acoustic distance. One method is Exhaustive DP

applying Edit Distance (EDPED) of phoneme strings, and the other method is

Exhaustive DP applying Acoustic Distance (EDPAD).

The test set consisted of 220 incorrect queries corrupted by acoustic confusion, the same as the queries used in Section 4.2.1. Also, a database of 10,000

lyric texts was collected. It contained both Japanese and English lyrics. The lyrics

corresponding to the queries were included in the database.

As shown in Table 4.5, T-best (T = 1, 20) represents the top T candidates of the ranked lyrics. The hit rate of T-best is defined as the ratio of the total number of hits within the top T candidates to the total number of search accesses (this is calculated as the search accuracy). EDPAD improves the hit rates by 2.8% and 4.4%, respectively, when the value of T in T-best is 1 and 20. A t-test was also conducted. When T in T-best is 20, the p-value is very low, which indicates that the proposed acoustic distance achieved a statistically significantly better performance than the edit distance. Furthermore, an analysis of the queries that failed to identify the target lyric text using the EDPAD method reveals that most of them are shorter than six syllables, indicating that the distances between the query and the lyric texts were too close for the target lyric to be distinguishable.
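As a trivial illustration of this metric (with hypothetical inputs), the T-best hit rate amounts to the following computation:

```python
def t_best_hit_rate(ranked_lists, targets, T):
    """Percentage of queries whose target lyric appears in the top T candidates."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:T])
    return 100.0 * hits / len(targets)
```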


Table4.5 Search accuracies and p-values of EDPED and EDPAD

  Search method           EDPED   EDPAD   p-value
  1-best hit rate (%)     49.5    50.9    0.203
  20-best hit rate (%)    70.5    73.6    0.017

4.2.5.2 Evaluation of Search Accuracy and Search Time

The total performance of the proposed method applying the acoustic distance and the two-pass search algorithm, described in Section 4.2.4, is evaluated by lyric search experiments in this section.

Before the comparison experiment, preliminary experiments were carried out to optimize the parameters of the proposed method for the first and the second passes. As described in Section 4.2.4, the following parameters need to be optimized:

• Ic: the number of candidates in the first pass

• F : the cut-off function in the second pass

To find corpus-independent parameters, 842 misheard lyric queries in English

were collected from the website “kissthisguy”. Also, a database of 50,000 lyric

texts was collected. The lyrics corresponding to the queries were all included in the

database. Note that the queries and lyric texts are entirely different from those used in Section 4.2.5.1.


First, an experiment was carried out to decide Ic. The first pass search using the index described in Section 4.2.4 was executed to investigate the relationship between search accuracy and Ic and to choose the best value. The results are shown in Figure 4.7. The horizontal axis shows the tested values of Ic from 100 to 2,000, and the vertical axis is the hit rate within the Ic candidates. Each line represents a different number of lyrics in the search space. The hit rates are almost saturated when Ic is larger than 1,500, in spite of the variation in the search space. Therefore, Ic is set at 1,500 in this research.

Figure4.7 Relationship between hit rates and Ic for various sizes of lyric database (10,000 to 50,000 lyrics)


Second, an investigation was undertaken to decide F. For most of the 842 queries, it was found that, by sorting the lyrics according to the approximate distance R(k) and dividing them into groups, the target lyric has a significantly lower DP distance D(k) than the other lyrics in the same group. Based on this investigation, F is defined by Equation 4.15, where Dmin is the minimum value and Dmean is the mean value of D(k) in the group. The experimental results that reveal the relationship between processing time and search accuracy with respect to Fth are shown in Figures 4.8 and 4.9, where the horizontal axis represents Fth, the right vertical axis represents the processing time, and the left vertical axis represents the hit rate. Figure 4.8 shows the results for the 1-best case, while Figure 4.9 shows the results for the 20-best case. Both figures show that a value of Fth between 0.4 and 0.6 is the optimal threshold to reduce processing time without deteriorating search accuracy.

\[
F = \frac{D_{min}}{D_{mean}} \tag{4.15}
\]


Figure4.8 Search accuracy and processing time with respect to Fth in the case of 1-best


Figure4.9 Search accuracy and processing time with respect to Fth in the case of 20-best

Then, to evaluate the overall performance of the proposed method, both hit rate

and processing time were compared with three conventional DP-based methods. All

methods applied the proposed acoustic distance. The details are described below.

• “Two-pass DP search with Adaptive Termination (TDPAT)” is the proposed method described in Section 4.2.4. Fth is tuned from 0 to 1. Considering the balance between index size and search accuracy, N of the syllable N-gram index is set to 3. A total of 50,000 entries of syllable 3-grams, which cover 92% of all syllable 3-grams in the collected lyric corpus, are prepared in the index. As all the syllable 3-grams that exist in the queries are prepared, no search errors come from out-of-vocabulary syllable 3-grams in the experiment. The acoustic distance is normalized by the length of the corresponding DP path. The group size for the second pass is set at 100 lyrics (i.e., Z = 15), as optimized by a preliminary experiment.

• “EDPAD” is an exhaustive DP-based search over the entire search space of

lyrics, and this method is mentioned in Section 4.2.5.1.

• “High-speed DP search with Suffix Array (HDPSA)” is based on the method in [45]. In the experiment, since the input query and the database are both text, the texts were converted into syllable strings (instead of the phonemes originally used) and divided into syllable N-grams. Also, a suffix array recorded the boundary information of the lyrics in order to avoid matching queries across two lyrics. Here N is set to 3 because this value resulted in better performance than N = 2 or N = 4 in a preliminary experiment. The total threshold was tuned from 0 to 1.3 to find the optimal value balancing search accuracy and processing time.

• “Two-pass DP search with Distance-based Termination (TDPDT)” is a method that has almost the same processes as the proposed method, with the exception that the DP is terminated when the acoustic distance D(k) exceeds a predetermined threshold value, which is tuned from 0 to 1.

The test set of 220 incorrect queries and the database of 10,000 lyric texts were the same as those used in Section 4.2.5.1.

First, to evaluate the robustness of TDPAT, a comparison between TDPAT and EDPAD is presented in Figure 4.10. Here, Fth for TDPAT is set at 0.4, as optimized in Section 4.2.5.2. TDPAT maintains almost the same hit rate as T of T-best is varied from 1 to 40. When T is over 40, the deterioration in search accuracy is still less than 1.7%. The processing time of TDPAT is 0.23 seconds per query, a reduction of 89.3% compared with EDPAD. This improvement is due to the well-designed two-pass search algorithm, which avoids losses in the pre-selection and adaptive termination processes.

The search accuracy and time complexity of the three high-speed DP methods TDPAT, TDPDT and HDPSA are shown in Figures 4.11 and 4.12, where the horizontal

axis represents processing time and the vertical axis represents hit rate. Each point

in these figures indicates the processing time cost and the hit rate achieved when

a particular threshold is set. Figures 4.11 and 4.12 show the results in the cases of

1-best and 20-best, respectively.

As shown in both figures, the performance of TDPAT is superior to that of HDPSA

in terms of both processing time and search accuracy. In the case of 1-best, to achieve

the same hit rate of 50.0%, TDPAT reduces processing time by a maximum of 96.5%

compared with HDPSA. In the case of 20-best, to achieve the same hit rate of 70.0%,

TDPAT reduces processing time by a maximum of 86.2%. These results indicate

that the proposed search algorithm is more efficient than the conventional algorithms

that determine the pruning threshold according to the length of the queries.

Also, TDPAT obtains higher search accuracy than TDPDT at the same processing times, especially for short processing times. This proves that the hypothesis behind the definition of F is correct and effective in the search process.


Figure4.10 Search accuracy of TDPAT and EDPAD


Figure4.11 Average processing times and search accuracy of three search methods in the case of 1-best


Figure4.12 Average processing times and search accuracy of three search methods in the case of 20-best

4.3 Summary

This chapter presents two strategies to optimize dialogue management for information search and to decrease information search failures mainly caused by mistaken queries. First, the author proposed selecting questions by optimizing information gain and user preference for the spoken dialogue system. It is expected to assist users in


easily selecting satisfactory results by minimizing the number of search refinement steps. The evaluation experiments prove that the application with the proposed strategy performs better than conventional applications in terms of satisfaction with the search results and the effort spent reconsidering search keywords.

Second, the author also proposed a robust and fast search strategy with a two-pass search algorithm to decrease information search failures caused by incorrect queries that are misheard or mismemorized. It uses an index-based approximate pre-selection in the first pass and a DP-based search process with an adaptive termination strategy in the second pass. For incorrect queries that are misheard or mismemorized, the experiments proved that applying the acoustic distance improved search accuracy by 4.4% over the edit distance. The proposed method achieved real-time operation by reducing processing time by more than 86.2%, with only a slight loss in search accuracy compared with a complete search by DP matching over all lyrics. It is proved to be the most practical solution for acoustically confused queries, considering the trade-off between high search accuracy and low computational complexity. In addition, this proposed search strategy is expected to be effective for recovering from search failures caused by speech recognition errors in spoken dialogue systems as well, as can be inferred from other research on morph-based spoken document retrieval and unlimited-vocabulary speech recognition [63] [64].


Chapter 5

Applying Prosody Information in Spoken Dialogue Systems

5.1 Prosody Information

To improve the usability of spoken dialogue systems, several studies [77] [78] have been interested in using speech prosody information, which represents the tune and rhythm of speech. Regarded as a part of the grammar of a language, prosody information is used to convey lexical meaning in stress (e.g. English), accentual (e.g. Japanese) and tone (e.g. Mandarin Chinese) languages. Taking Mandarin Chinese as a further example, lexical tone varieties define different words even though the other acoustic characteristics are almost the same. For example, in Chinese, the dissyllabic sound “ji shu” with the lexical tones (1,4) means “cardinal number”, while the meaning changes to “technology” as the tones shift to (4,4). Therefore, in order to fully recognize the Chinese language, not only the phonetic content but also the prosody features are required [79]. Meanwhile, prosody also conveys non-lexical information such as intonation, which helps differentiate


declarative sentences from questions. Furthermore, prosody is verified to convey more complicated non-lexical information: emotion. Many studies have shown that prosody information provides a reliable indication of speech emotion [32] [33] [34] [80].

For a further understanding of prosody, it is characterized by the following items at the phonetic level [81]:

• vocal pitch (fundamental frequency)

• loudness (acoustic intensity)

• rhythm (phoneme and syllable duration)

Many experiments studying emotional speech are based on stylized emotions, portrayed by actors and actresses. The results indicate that a few categories of emotions can be reliably identified by listeners, and consistent acoustic correlates of these categories have been analyzed.

For example:

• Excitement is expressed by high pitch and fast speed.

• Sadness is expressed by low pitch and slow speed.

• Hot anger is characterized by over-articulation, a fast speaking rate, downward pitch movement, and overall elevated pitch.

• Cold anger shares many attributes with hot anger, but the pitch range is set

lower.

In particular, the pitch features carry information about intention, attitude or emotional expression in the user’s speech. However, conventional pitch detection methods are not robust enough in real noisy environments.


This work proposes a new algorithm named adaptive running spectrum filtering (ARSF) to restore the amplitude spectra of speech corrupted by additive noises. It realizes robust pitch detection in real-world environments against various noise situations.

5.2 Spectra Analysis and Pitch Detection

To give a better understanding of the proposed pitch detection method, some fun-

damentals of related speech signal processing are introduced in this section. Firstly,

the mechanics of producing speech in human beings are described. Then the theories

of running spectra and modulation spectra are presented.

5.2.1 Fundamentals of Speech Production

Reference [82] describes a schematic diagram of the human vocal mechanism. As

shown in Figure 5.1 [83], air enters the lungs via the normal breathing mechanism.

After air is expelled from the lungs through the trachea, the tensed vocal cords

within the larynx are caused to vibrate by the air flow. The air flow is chopped

into quasi-periodic pulses. Then the pulses are modulated in frequency in passing

through the pharynx, the mouth cavity, and possibly the nasal cavity. Different

sounds are produced depending on the positions of various articulators, such as jaw,

tongue, velum, lips and mouth. Speech signals are slowly time-varying signals. When analyzed within a sufficiently short period of time, such as 5 to 100 ms, their characteristics are fairly stationary. However, over longer periods of time, such as over 1/5 second, the signal characteristics change to reflect the different speech sounds being spoken.


Figure5.1 Schematic view of the human vocal mechanism (cited from [83])

In order to represent the time-varying signal characteristics of speech, a parameterization of the spectral activity based on the model of speech production is used. The human vocal tract is essentially a tube of varying cross-sectional area that is excited either at one end or at a point along the tube. Therefore, based on acoustic theory, the transfer function of energy from the excitation source to the output can be described in terms of the natural frequencies, or resonances, of the tube.


Such resonances are called formants for speech. They represent the frequencies that

pass the most acoustic energy from the source to the output.

5.2.2 Introduction of Running Spectrum and Modulation Spectrum

Human phonetic judgments are remarkably robust in the presence of variability from non-linguistic sources of information such as speaker variability, channel variability, the environment in which the speech was produced, and the recording equipment used for speech acquisition. For example, people seem to be capable of focusing attention on the linguistic message during conversational speech, and this works very well even in severely noisy situations. It can be considered that the peripheral properties of human hearing must also have something to do with the way speech evolved and is used for human communication. Linear distortions and additive noise in the speech signal appear as biases in the short-term spectral parameters. The rate of such extra-linguistic changes is often outside the typical rate of change of the linguistic components.

To investigate the details of the noise distortion, speech modulation spectra are analyzed [84] [85]. To introduce the modulation spectra, speech signals y(k) are segmented into frames by a window function ω(k, t), where t is the frame number. The short-time Fourier transform of the windowed speech signals, X(t, f), is calculated by Equations 5.1 to 5.4, where FT[∗] denotes the Fourier transform. Then, the running spectra |X(t, f)| are calculated by taking the absolute values of the spectra X(t, f), which is the same process used to calculate speech amplitude spectra. The running spectrum is so called because the spectrum looks like it is running along the time axis; it represents the temporal properties of the time-varying amplitude spectra. As shown in Equation 5.5, the modulation spectra Xm(f, g) are obtained by applying the Fourier transform to the running spectrum |X(t, f)| at each frequency [84], where T is the total number of frames and g is the modulation frequency.


\[
y_t(k) = \omega(k,t)\,y(k) =
\begin{cases}
y(k) & (N-1)t \le k < Nt \\
0 & \text{otherwise}
\end{cases}
\tag{5.1}
\]

\[
Y(f) = \mathrm{FT}\left[ y(k) \right]\big|_{k=0,\dots,\infty} \tag{5.2}
\]

\[
W(i,t) = \mathrm{FT}\left[ \omega(k,t) \right]\big|_{k=0,\dots,\infty} \tag{5.3}
\]

\[
X(t,f) = \sum_{i=-\infty}^{\infty} Y(f-i)\,W(i,t) \tag{5.4}
\]

\[
X_m(f,g) = \mathrm{FT}\left[\, |X(t,f)| \,\right]\big|_{t=1,\dots,T} \tag{5.5}
\]
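A minimal NumPy sketch of Equations 5.1 to 5.5 is shown below. The frame length, shift and Hanning window are taken from the analysis conditions in Section 5.3.2 and are illustrative; the exact settings behind Figures 5.2 to 5.4 may differ.

```python
import numpy as np

def running_spectra(y, frame_len=512, shift=256):
    """Running (amplitude) spectra |X(t, f)|, Equations 5.1-5.4:
    magnitude of the short-time Fourier transform of windowed frames."""
    window = np.hanning(frame_len)
    frames = [y[s:s + frame_len] * window
              for s in range(0, len(y) - frame_len + 1, shift)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))  # shape (T, F)

def modulation_spectra(run_spec):
    """Modulation spectra X_m(f, g), Equation 5.5: Fourier transform of each
    running spectrum along the time (frame) axis."""
    return np.fft.rfft(run_spec, axis=0)                  # shape (G, F)
```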

Figures 5.2 and 5.3 show the modulation spectra of clean speech and of speech corrupted by 0 dB white noise. The modulation spectra show some important characteristics related to noisy speech. First, the energy of the additive noise is distributed from 0 Hz to about 1 Hz modulation frequency, and most of the noise energy is concentrated near 0 Hz. On the other hand, the energy of speech signals is distributed over a wide range. The crucial information of speech is concentrated in the area from 0 Hz to about 13 Hz. In addition, the 2 to 4 Hz part is quite important since it is related to the variation of phonemes. The difference between noise and speech in the modulation spectra can be explained as follows: the non-linguistic spectral components contained in the noise change more slowly than the typical range of speech. The characteristics of the modulation spectra are shown in Figure 5.4. Here the spectra at 130 Hz (normal frequency) are used as examples because 130 Hz is located in the region of human fundamental frequency.

Figure5.2 Modulation spectra of clean speech


Figure5.3 Modulation spectra of speech corrupted by 0 dB white noise


Figure5.4 Modulation spectra at 130 Hz in different conditions (clean speech vs. speech with 0 dB car noise)

5.2.3 Pitch Detection

Pitch detection methods also use short-term analysis techniques, which means that a score f(T|xm) is calculated as a function of the candidate pitch periods T for every frame xm. In the speech processing literature, a wide variety of pitch detection methods have been proposed. However, accurate and robust pitch detection in noisy environments still remains a difficult and important issue in real-world applications.


As described in Section 2.3, the AUTOC method is one of the most commonly used algorithms. The autocorrelation function of a voiced frame is searched for its maximum value, and the location of the maximum contains the pitch period information. As shown in Figure 5.5, Nf is found as the pitch position, and the pitch period equals Nf/Fs, where Fs is the sampling frequency of the signals. Autocorrelation is usually regarded as a time-domain function. However, based on the Wiener-Khintchine theorem (Equation 5.6), the autocorrelation function R(n) in AUTOC is obtained by applying the IFFT to the squared amplitude spectrum E(k) of the speech signal. Similarly, in the CEP method, the peak cepstral value is determined and its location indicates the pitch period; the cepstrum is the inverse Fourier transform of the logarithm of the amplitude speech spectrum.

\[
R(n) = \frac{1}{N} \sum_{k=0}^{N-1} |E(k)|^2 \, e^{j\frac{2\pi nk}{N}} \tag{5.6}
\]
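A minimal sketch of the AUTOC principle built on Equation 5.6 follows; the 50 to 500 Hz search range matches the fundamental-frequency band used later in Section 5.3.2 and is an illustrative choice here.

```python
import numpy as np

def autoc_pitch_period(frame, fs, f_min=50.0, f_max=500.0):
    """AUTOC pitch detection: by the Wiener-Khintchine theorem, the
    autocorrelation R(n) is the IFFT of the squared amplitude spectrum."""
    E = np.abs(np.fft.fft(frame))             # amplitude spectrum |E(k)|
    R = np.real(np.fft.ifft(E ** 2))          # Equation 5.6
    # Search for the maximum within the plausible pitch-lag range.
    n_lo, n_hi = int(fs / f_max), int(fs / f_min)
    Nf = n_lo + int(np.argmax(R[n_lo:n_hi]))  # position of the pitch peak
    return Nf / fs                            # pitch period in seconds (Nf/Fs)
```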

Pitch detection methods typically fail in several cases, such as sub-harmonic errors, harmonic errors and noisy conditions. Noise in particular has a great influence: when the signal-to-noise ratio (SNR) is low, most pitch detection methods are quite unreliable. This can be explained by the fact that the various noise distortions heavily damage the speech amplitude spectra, which leads to pitch misdetection.

In the case of white noise, the energy is uniformly distributed along the frequency axis and does not form prominent energy peaks. After an exponentiation calculation is applied to |E|, the high-energy parts, which represent speech components, are enhanced, so AUTOC performs robustly against white noise [86]. Recently, considering the effects of noise and formants, Reference [49] proposed a new method. As shown in Equation 5.6, the part |E(k)|2 is replaced by |E(k)|p, which adjusts the exponent p of the amplitude spectrum according to the SNR of each speech frame. As the SNR decreases, which means the noise becomes more serious, the value of p is increased. Consequently, the high-energy voiced parts are further enhanced and stand out clearly compared with the low-energy noise parts, and the periodical structure of the amplitude spectra is kept. In the case of white noise and other wide-band noises, the accuracy of pitch extraction is improved by the new method in Reference [49]. However, the car noise distortion originates from its periodicity: the energy is concentrated in a narrow band in the amplitude spectra of car noise. As presented in Figure 5.6, the prominent peak around 20 to 400 Hz is generated by car noise, and it changes the original periodical structure of the amplitude spectrum. Only increasing the value of p does not ameliorate the detection accuracy well in a car noise background. On the other hand, the logarithm calculation tends to extract the envelope of the amplitude power. Therefore, the CEP method is relatively robust against noise whose energy is distributed in a fairly narrow band, such as car noise. However, under most wide-band noises, CEP gives unsatisfactory performance; it is on average weaker than the other conventional detection methods under noisy conditions. As discussed above, these conventional methods cannot keep working well against changing noise situations.

This class of detectors can be considered as amplitude-spectra-based pitch detectors. They use the property that if the signal is periodic in the time domain, then the frequency spectrum of the signal will consist of a series of impulses at the fundamental frequency and its harmonics. Therefore, to keep high robustness against unspecified noise conditions, a spectra restoration function should be applied in pitch detection. Through this restoration, the energy distribution of the amplitude spectra is expected to approach that of the spectra under clean conditions, and the periodicity in the spectra is clearly retained, which helps to improve detection accuracy.


Figure5.5 AUTOC method

5.3 Robust Speech Spectra Restoration for Pitch Detection

5.3.1 Adaptive Running Spectrum Filtering Design

According to the characteristics of speech modulation spectra, an adaptive running spectrum filtering (ARSF) process is proposed to restore the periodic structure of the amplitude spectra against unspecified noises, improving the accuracy of pitch detection. Figure 5.7 shows a block diagram of the ARSF approach. |X(t, f)| denotes the running spectra of speech (i.e., the amplitude spectra). A low-pass filter with fixed parameters and an adaptive high-pass filter are implemented separately on each running spectrum. The low-pass filter is an FIR-type low-pass filter with a cutoff frequency of 13 Hz. It eliminates the noise distortion in the parts above 13 Hz modulation frequency. Some speech components also exist in this area; however, they are few and do not contain important information about pitch, so the low-pass filtering does not degrade the accuracy of pitch detection.

The adaptive high-pass filter works by varying the filter parameters according to the noise level on the running spectrum. The noise level is pre-estimated.

Figure5.6 Noise distortion in amplitude spectrum

As mentioned above, the highest noise energy is concentrated in Xm(f, g)|g=0, the component at 0 Hz in the modulation spectrum. Equation 5.7 shows that, as Xm(f |f=fn , g)|g=0 is adapted close to the level of clean speech by high-pass filtering, the corresponding running spectrum |X(t, f |f=fn)| is restored. As |X(t, f |f=fn)| is restored at every normal frequency fn, the whole amplitude spectra |X(t, f)| are consequently restored close to those of clean speech. This helps to keep the correct pitch information against noise.

Figure5.7 Block diagram of the ARSF approach

\[
|X(t,f)|\big|_{f=f_n}
= \frac{1}{N} \sum_{g=0}^{N-1} X_m(f_n,g)\, e^{\frac{2i\pi}{N}tg}
= \frac{1}{N} \Big( X_m(f_n,0) + \sum_{g=1}^{N-1} X_m(f_n,g)\, e^{\frac{2i\pi}{N}tg} \Big)
\tag{5.7}
\]

The distortions from additive noises are variable and complicated in noisy speech, and Xm(f, g)|g=0 of each running spectrum varies with the noise conditions. Therefore, the adaptive high-pass filter is designed for each running spectrum in order to adapt Xm(f, g)|g=0. To design the filter well, the first step is to estimate the increase in the DC part (0 Hz) of the modulation spectrum.

|EM(f |f=fn)| is defined as the absolute value of the 0 Hz (modulation frequency) part of the modulation spectrum corresponding to fn Hz (normal frequency). |EM(f)clean| is |EM(f)| in clean speech, and |EM(f)nspeech| is |EM(f)| in noisy speech (clean speech plus additive noise). The increase in the 0 Hz part of the modulation spectrum can be expressed by Equation 5.8:


\[
\mathrm{IIDCN}(f) = 10 \log_{10} \frac{|E_M(f)_{clean}|^2}{|E_M(f)_{nspeech}|^2} \tag{5.8}
\]

(IIDCN: Inverse of the Increase in the DC part by Noise)

The adaptive high-pass filter is designed by assigning the IIDCN to the attenuation ratio of the filter at each frequency, with the stopband edge frequency set to 0 Hz. Theoretically, the value at 0 Hz modulation frequency of noisy speech is then decreased to the level of clean speech after filtering. The calculation of the IIDCN is important in this processing. As the absolute value of the 0 Hz component of the modulation spectrum, |EM(f |f=fn)| is equal to the absolute sum of the running spectrum amplitudes corresponding to fn Hz, as given by Equation 5.9:

\[
|E_M(f_n)| = \big| X_m(f_n, g)\big|_{g=0} \big|
= \Big| \sum_{t=1}^{T} |X(t,f_n)|\, e^{\frac{2i\pi}{N}tg} \Big|_{g=0}
= \Big| \sum_{t=1}^{T} |X(t,f_n)| \Big|
\tag{5.9}
\]

where T is the total number of frames, which depends on the length of the speech section and the frame size; the frame size is 46.3 ms, and t is the frame number. Based on Equations 5.8 and 5.9, the IIDCN is given by Equation 5.10:

\[
\mathrm{IIDCN}(f) = 10 \log_{10} \frac{|E_M(f)_{clean}|^2}{|E_M(f)_{nspeech}|^2}
= 10 \log_{10} \frac{\big|\sum_{t=1}^{T} |X_{clean}(t,f)|\big|^2}{\big|\sum_{t=1}^{T} |X_{nspeech}(t,f)|\big|^2}
\tag{5.10}
\]

where |Xclean(t, f)| is |X(t, f)| of clean speech, and |Xnspeech(t, f)| is |X(t, f)| of noisy speech. Equation 5.10 shows that, with noise estimation, the IIDCN can be calculated in real noisy environments.


Assuming that the first few frames of the speech section contain only noise and no speech, the noise amplitude spectrum |Xnoise(f)| is estimated from these frames. The estimate of |Xclean(t, f)|, denoted |X ′clean(t, f)|, is calculated by Equation 5.11.

\[
|X'_{clean}(t,f)| = \sqrt{|X_{nspeech}(t,f)|^2 - |X_{noise}(f)|^2} \tag{5.11}
\]

Here, if |Xnspeech(t, f)|2 − |Xnoise(f)|2 < 0, |X ′clean(t, f)| is set to 0.

IIDCN′ represents the estimated value of IIDCN and is given by Equation 5.12.

\[
\mathrm{IIDCN}'(f) = 10 \log_{10} \frac{\big|\sum_{t=1}^{T} |X'_{clean}(t,f)|\big|^2}{\big|\sum_{t=1}^{T} |X_{nspeech}(t,f)|\big|^2} \tag{5.12}
\]

The estimated IIDCN′ is used in ARSF in order to realize adaptive filtering.
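Equations 5.8 to 5.12 reduce to the following sketch. The assumption that the first few frames contain only noise is the one stated above; the number of such frames (`n_noise`) and the small constant guarding the logarithm are illustrative choices, not values from the thesis.

```python
import numpy as np

def estimate_iidcn(run_spec_noisy, n_noise=5, eps=1e-12):
    """Estimate IIDCN'(f) (Equation 5.12) from noisy running spectra.

    run_spec_noisy: |X_nspeech(t, f)|, shape (T, F)
    n_noise: leading frames assumed to contain noise only
    """
    # Noise amplitude spectrum |X_noise(f)| from the noise-only frames.
    noise = run_spec_noisy[:n_noise].mean(axis=0)

    # Spectral-subtraction estimate |X'_clean(t, f)| (Equation 5.11);
    # negative radicands are clipped to 0 as specified.
    clean_est = np.sqrt(np.maximum(run_spec_noisy ** 2 - noise ** 2, 0.0))

    # The 0 Hz modulation components are the sums over frames (Equation 5.9).
    num = np.abs(clean_est.sum(axis=0)) ** 2
    den = np.abs(run_spec_noisy.sum(axis=0)) ** 2
    return 10.0 * np.log10((num + eps) / (den + eps))  # dB; <= 0 under noise
```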

The second step is to set the cutoff frequency of the high-pass filter. As described in [84] and [85], the noise is concentrated in the 0 to 1 Hz range of the modulation spectra. However, some speech information that contributes to pitch is also contained in this range. To search for the cutoff frequency that provides the best pitch detection performance, a high-pass filter was tested in a pitch detection experiment using 15 Chinese words and 1 Chinese sentence, corrupted by car interior noise and white noise at levels of 0 dB and 10 dB. The stopband edge frequency of the high-pass filter was set to 0 Hz, and the attenuation was fixed to a certain value (-5 dB, -10 dB and -20 dB were tried separately); only the cutoff frequency was changed in the experiment. The pitch detection results for each cutoff frequency were compared. Though 0.2 Hz and 0.3 Hz give better results for some words or in some specific noise situations, 0.1 Hz is verified to provide the best overall performance in the 0 to 1 Hz range. 0.1 Hz also provides the best detection result for clean speech.

Figure5.8 The magnitude response of the adaptive high-pass filter with -20 dB attenuation

Finally, the characteristics of the adaptive high-pass filter are decided as follows: an FIR-type high-pass filter with a 0.1 Hz cutoff frequency; a stopband edge frequency of 0 Hz; and an attenuation ratio equal to the estimated IIDCN′ of each corresponding running spectrum. The magnitude response of the adaptive high-pass filter with -20 dB attenuation is shown in Figure 5.8; it corresponds to a -20 dB IIDCN′. In addition, if IIDCN′(f) ≥ 0, which means that there is no noise distortion, the filtering is not applied.
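The thesis realizes ARSF with the FIR low-pass and adaptive high-pass filters characterized above. For brevity, the sketch below applies the equivalent gains directly in the modulation-frequency domain: components above 13 Hz are removed, and the DC component is attenuated by IIDCN′ wherever IIDCN′ < 0. It ignores the 0.1 Hz transition band of the actual FIR design and should be read as an approximation, not the thesis implementation.

```python
import numpy as np

def apply_arsf(run_spec, iidcn_db, frame_rate=1.0 / 0.023):
    """Frequency-domain approximation of the ARSF filtering.

    run_spec: running spectra |X(t, f)|, shape (T, F)
    iidcn_db: estimated IIDCN'(f) in dB, one value per normal frequency
    frame_rate: frames per second (23 ms shift -> about 43.5 Hz)
    """
    T = run_spec.shape[0]
    mod_freqs = np.fft.rfftfreq(T, d=1.0 / frame_rate)  # modulation axis (Hz)
    Xm = np.fft.rfft(run_spec, axis=0)                  # modulation spectra

    # Fixed low-pass: eliminate modulation components above 13 Hz.
    Xm[mod_freqs > 13.0, :] = 0.0

    # Adaptive high-pass: attenuate the DC (0 Hz) component by IIDCN',
    # only where IIDCN' < 0 (i.e. where noise raised the DC level).
    gain = np.where(iidcn_db < 0.0, 10.0 ** (iidcn_db / 20.0), 1.0)
    Xm[0, :] *= gain

    return np.fft.irfft(Xm, n=T, axis=0)                # restored |X(t, f)|
```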

5.3.2 Pitch Detection with ARSF

As an algorithm for spectra restoration, ARSF is added to the pitch detection process to keep the original (clean-condition) periodic spectral structure against noises. The process is as follows. First, the speech section is segmented into frames by a Hanning window. In the experiment, the frame length is 46 ms (512 sampling points at an 11025 Hz sampling frequency) and the frame shift is 23 ms (256 sampling points). Then the fast Fourier transform (FFT) is applied to each speech frame, and its absolute value gives the amplitude spectrum. Since the formant characteristics of the vocal tract adversely affect pitch detection, a band-limitation operation is implemented to diminish the frequency-multiplication contents of the speech signals. This is done by keeping the values in the 50 to 500 Hz part of the amplitude spectra while setting the values of the other parts to 0; this band corresponds to the region of human fundamental frequency. Then, ARSF is applied to the running spectrum of each frequency (band-limited from 50 to 500 Hz): in the first step, the low-pass filter with a cutoff frequency of 13 Hz is applied; in the second step, the adaptive high-pass filter designed from the estimated IIDCN′ is applied. After the ARSF process, the autocorrelation contour is obtained by applying the inverse fast Fourier transform (IFFT) to the squared restored amplitude spectrum. Finally, the positions of the pitch peaks are detected in each speech frame. The flowchart of pitch detection with ARSF is shown in Figure 5.9.
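Putting these steps together, the detector can be sketched end-to-end as follows, reusing the `estimate_iidcn` and `apply_arsf` helpers sketched above. All of this is illustrative scaffolding around the flowchart, not the exact experimental code.

```python
import numpy as np

def detect_pitch_arsf(y, fs=11025, frame_len=512, shift=256):
    """Pitch detection with ARSF, following the flowchart of Figure 5.9."""
    # Framing with a Hanning window, FFT, absolute value -> (T, F) spectra.
    window = np.hanning(frame_len)
    frames = [y[s:s + frame_len] * window
              for s in range(0, len(y) - frame_len + 1, shift)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))

    # Band limitation: keep only 50-500 Hz (human fundamental frequency).
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    spec[:, (freqs < 50.0) | (freqs > 500.0)] = 0.0

    # ARSF restoration of the running spectra.
    restored = apply_arsf(spec, estimate_iidcn(spec), frame_rate=fs / shift)

    # Autocorrelation contour: IFFT of the squared restored spectrum,
    # then pick the pitch peak within the plausible lag range.
    n_lo, n_hi = int(fs / 500.0), int(fs / 50.0)
    periods = []
    for t in range(restored.shape[0]):
        R = np.fft.irfft(restored[t] ** 2, n=frame_len)
        Nf = n_lo + int(np.argmax(R[n_lo:n_hi]))
        periods.append(Nf / fs)               # pitch period per frame (s)
    return np.array(periods)
```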

To demonstrate the improvement in noise robustness achieved by the ARSF process, an example of one voiced frame is presented. The amplitude spectrum and the autocorrelation contour of the frame under clean conditions are shown in Figures 5.10 and 5.14. Under various noise influences, the periodicity of the amplitude spectrum is changed, as shown by the dashed lines in Figures 5.11, 5.12 and 5.13. As a result, the pitch periods are misdetected in the autocorrelation contours, as shown by the dashed lines in Figures 5.15, 5.16 and 5.17. After ARSF is applied, the noise components are mostly filtered out while the speech components are kept, and the information of the periodic structure remains clear in the amplitude spectra, as shown by the solid lines in Figures 5.11, 5.12 and 5.13. As the pitch peaks remain significant with ARSF, the pitch periods are correctly extracted in the autocorrelation contours, as shown by the solid lines in Figures 5.15, 5.16 and 5.17.


Figure5.9 Flowchart of the proposed pitch detection method


Figure5.10 Amplitude spectrum of clean speech


Figure5.11 Amplitude spectrum of speech with 0 dB car noise (with and without the ARSF process)


Figure5.12 Amplitude spectrum of speech with 0 dB pink noise (with and without the ARSF process)


Figure5.13 Amplitude spectrum of speech with 0 dB talking babble noise (with and without the ARSF process)


Figure5.14 Autocorrelation contour in clean speech (pitch peak marked)


Figure5.15 Autocorrelation contour in speech with 0 dB car noise (wrong pitch peak without the ARSF process, correct pitch peak with the ARSF process)


Figure5.16 Autocorrelation contour in speech with 0 dB pink noise (wrong pitch peak without the ARSF process, correct pitch peak with the ARSF process)


Figure5.17 Autocorrelation contour in speech with 0 dB talking babble noise (wrong pitch peak without the ARSF process, correct pitch peak with the ARSF process)

5.3.3 Experiments

Before the experimental results are presented, the judgment criteria are introduced. The voiced pitch error e(n) is defined by Equation 5.13:

\[
e(n) = P_d(n) - P_r(n) \tag{5.13}
\]


where n is the frame number in the speech section, Pd is the pitch period detected by each method, and the reference pitch period Pr is calculated by the SAPD method [87]. If |e(n)| > 10 samples (10/Fs = 0.9 ms), the error is called a gross pitch error. The percentage of frames in which gross pitch errors occur, among all frames of the speech section, is defined as the GPE. The second type of pitch error is the fine pitch period error, for which |e(n)| < 10 samples. As a measure of the accuracy of the pitch detector, the Standard Deviation of the Fine Pitch Errors (SDFPE) is defined by Equation 5.14:

\[
\sigma_e = \sqrt{\frac{1}{N_i}\sum_{j=1}^{N_i} e^2(m_j) - \bar{e}^2} \tag{5.14}
\]

where mj is the j-th frame in the utterance for which |e(n)| < 10 samples, Ni is the number of such frames in the utterance, and ē is the mean of the fine pitch errors, defined by Equation 5.15:

\[
\bar{e} = \frac{1}{N_i}\sum_{j=1}^{N_i} e(m_j) \tag{5.15}
\]

Both GPE and SDFPE are used as the criteria to estimate the robustness and

accuracy of the detection methods.
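The two criteria amount to the computation sketched below (a hypothetical helper; pitch periods are assumed to be given in samples, so the 10-sample gross-error limit of the definitions applies directly):

```python
import numpy as np

def gpe_and_sdfpe(P_d, P_r, limit=10):
    """GPE (%) and SDFPE from detected (P_d) and reference (P_r) pitch
    periods in samples, following Equations 5.13-5.15."""
    e = np.asarray(P_d, dtype=float) - np.asarray(P_r, dtype=float)   # e(n)
    gross = np.abs(e) > limit              # gross pitch errors (> 10 samples)
    gpe = 100.0 * np.mean(gross)           # percentage of gross-error frames

    fine = e[~gross]                       # fine pitch errors (|e| < limit)
    # Equations 5.14-5.15: standard deviation of fine errors around their mean.
    sdfpe = float(np.sqrt(np.mean(fine ** 2) - np.mean(fine) ** 2)) if fine.size else 0.0
    return gpe, sdfpe
```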

The experiment uses 10 isolated Mandarin Chinese words spoken by six females and two males. Four methods are compared on speech data corrupted by noise. Five types of noise from the NOISEX-92 noise database are used, at noise levels of 0 dB, 5 dB and 10 dB. The experimental results for GPE are given in Table 5.1. ARSF (ideal) means that the ideal |Xclean(t, f)| is used in ARSF; the ideal |Xclean(t, f)| is taken from the originally recorded clean speech before noise is added. ARSF (estimated) is the proposed method based on noise estimation, in which |X ′clean(t, f)| is used. Table 5.1 shows that, except for the case of 10 dB white noise in which the Reference [49] method shows slightly better performance, the proposed method with ARSF realizes the best robustness among the various noise conditions. ARSF (ideal) and ARSF (estimated) are also compared; their results are almost at the same level, which means that there are no large errors in the noise estimation and that the idea of ARSF is correct and practical. However, the 4 to 5% difference that occurs in the cases of 0 dB car noise and 0 dB engine noise indicates that a more accurate and smarter noise estimator remains an important issue for future research. Table 5.2 shows the SDFPE of the four methods. The proposed method with ARSF achieves lower or equal SDFPEs under low-SNR car, babble and engine noises. However, AUTOC and the Reference [49] method show better performance in the cases of white noise, pink noise and high SNRs of the other noises. This is because the noise level of each running spectrum under white noise, pink noise and high SNRs of the other noises is not serious, and the side effects of the filtering in ARSF slightly degrade the pitch information. As a next step, smoothing after ARSF is to be considered. Note, however, that the proposed method keeps the SDFPE within a small range, from 1.55 to 2.10, in all of the noise conditions, which is more stable than AUTOC and the Reference [49] method.


Table5.1 Comparison results at various noise conditions (GPE %)

  Noise    SNR (dB)   ARSF (ideal)   ARSF (estimated)   Ref. [49]   AUTOC
  car      0          11.8           16.0               33.3        33.1
           5           5.7            7.2               20.8        21.6
           10          2.7            3.5               10.0        10.1
  white    0           1.9            2.1                2.4         3.3
           5           1.3            1.4                1.4         1.8
           10          0.9            1.1                1.0         1.1
  pink     0           4.9            7.0               12.0        15.1
           5           2.3            3.3                4.8         5.7
           10          1.6            1.8                2.5         2.5
  engine   0          10.2           14.4               18.6        21.5
           5           3.7            4.7                7.3         8.7
           10          2.0            2.3                3.9         4.1
  babble   0           4.1            5.1                9.8        10.6
           5           2.0            2.1                4.3         4.3
           10          1.5            1.7                2.2         2.2
  Average              3.8            4.9                8.9         9.7


Table5.2 Comparison results at various noise conditions (SDFPE)

  Noise    SNR (dB)   ARSF (ideal)   ARSF (estimated)   Ref. [49]   AUTOC
  car      0          1.72           1.54               1.92        2.36
           5          1.73           1.56               1.47        1.74
           10         1.70           1.66               1.23        1.43
  white    0          1.65           1.66               1.41        1.43
           5          1.61           1.71               1.23        1.21
           10         1.69           1.66               1.04        1.05
  pink     0          1.84           1.66               1.56        1.54
           5          1.77           1.72               1.28        1.34
           10         1.55           1.59               1.19        1.09
  engine   0          2.10           2.07               2.14        2.07
           5          1.83           1.79               1.73        1.67
           10         1.84           1.80               1.40        1.39
  babble   0          1.91           1.81               2.12        1.91
           5          1.67           1.72               1.72        1.70
           10         1.71           1.68               1.50        1.34

5.4 Summary

This chapter presents a new algorithm, ARSF, to restore the amplitude spectra, applied to the robust detection of the speech pitch period under noise. The ARSF process intelligently restores the speech modulation spectra according to the noise conditions, so that the noise distortion that influences the amplitude spectral structure is much diminished. Therefore, the proposed pitch detection is more accurate than other conventional methods under different noise conditions. Future study will try to obtain better models to estimate the noise amplitude spectrum in order to further improve the accuracy of pitch detection. Furthermore, the author will consider applying the extracted pitch information to recognize the speaker's emotion; understanding the speaker's emotion helps to generate more appropriate dialogue actions, offering superiority over and differentiation from other modalities.


Chapter 6

Conclusion and Future Work

6.1 Conclusion

This thesis proposes three studies to build an optimal spoken dialogue system for robust information search in the real world, which is expected to realize users' habitual and continual use of dialogue systems in daily life.

First, the dialogue design research takes advantage of gamification theory to design a dialogue agent and a dialogue scenario that promote habitual use. The real-world data also prove the novelty of this research: over 23% of users kept speaking to the dialogue agent continually.

Second, the dialogue management research proposes two strategies to improve the user experience of information search. First, a strategy of optimal question selection has been verified to be effective in assisting users' operation of a knowledge-based spontaneous dialogue system. Furthermore, a robust and fast matching strategy based on phoneme strings decreases the failures caused by queries containing incorrect parts. Experimental results prove that this proposed search strategy increases search accuracy by 4.4% and reduces processing time by at least 86.2%.


Third, the author proposed research on high-accuracy voice pitch detection against noise. The proposed method intelligently restores the speech modulation spectra according to the noise conditions, so that the noise distortion that influences the amplitude spectral structure is much diminished. Across a variety of noise types and levels, the proposed pitch detection method demonstrates high robustness compared with the existing methods. Furthermore, the extracted pitch information is planned to be applied to recognizing the speaker's emotion in future research, which will help generate appropriate dialogue actions.

6.2 Future Work

The future direction of this research is to make the spoken dialogue system a cross-device interactive agent in the user's daily life that is able to provide personalized information based on the user's preferences.

The expected future system mechanism is shown in Figure 6.1. The dialogue agent will be applied to other devices, such as set-top boxes (STB) and in-vehicle machines such as GPS navigators, as well as smartphones, to cover more life scenes and support the user's information search and device operation.

To further optimize the spoken dialogue system according to the user's personal information, technologies that understand the user's emotion and build a user model reflecting the user's hobbies and preferences are required.

The following research topics will be the focus of future work:

1. Robust emotion recognition in the real world: besides the pitch extraction proposed in Chapter 5, other prosody features such as energy and duration information, as well as quality features including formant information, are going to be used. Statistical models such as hidden Markov models or deep neural networks are going to be applied to correctly classify the emotion. Furthermore, not only speech information but also text and visual information are considered for combination to achieve better recognition performance.

Figure6.1 Structure of the future system

2. Automatic and active learning to build the user's profile: by analyzing the user's emotion and extracting keywords from dialogue log data, a user model (or profile) is expected to be established, with which the dialogue management executes dialogue tasks more efficiently. Furthermore, technologies for automatic question generation and situation estimation for active learning are required.


Acknowledgement

The author is deeply grateful to his supervisor, Prof. Yoshikazu Miyanaga of the Division of Media and Network Technologies, Graduate School of Information Science and Technology, Hokkaido University. He has taught the author innumerable lessons and insights on the workings of academic research in general, and his technical and editorial advice led to the accomplishment of the present thesis.

The author also thanks all of the members of the Information and Communication Networks Laboratory for their kindness and generous support of his research. The author also wishes to thank Prof. Kunimasa Saitoh, Prof. Takeo Ohgane and Prof. Hiroshi Tsutsui for taking the time to examine the contents of this thesis.

The author is also grateful to his colleagues at KDDI CORPORATION and KDDI R&D Laboratories, Inc., for their kind understanding and support.

Last, the author thanks his family for all the love and support. Especially, the

author would like to thank his wife Naoko for her understanding during the past

few years. Her constant support and encouragement were in the end what made this dissertation possible.


References

[1] Glass, J. "Challenges for spoken dialogue systems", in Proceedings of the IEEE ASRU Workshop, 1999.

[2] Weizenbaum, J. "ELIZA – A computer program for the study of natural language communication between man and machine", Communications of the ACM, Vol.9, No.1, pp.36-45, 1966.

[3] Zue, V., Glass, J., Goodine, D., Leung, H., Phillips, M., Polifroni, J. and Seneff, S. "Integration of speech recognition and natural language processing in the MIT VOYAGER system", in Proceedings of IEEE-ICASSP, pp.713-716, 1991.

[4] Price, P.J. "Evaluation of spoken language systems: the ATIS domain", in Proceedings of the DARPA Speech & Natural Language Workshop, 1990.

[5] Jason D. Williams, et al. "Introduction to the Issue on Advances in Spoken Dialogue Systems and Mobile Interface", IEEE Journal of Selected Topics in Signal Processing, Vol.6, No.8, pp.889-890, 2012.

[6] Lu Yang, Hong Cheng, Jiasheng Hao, Yanli Ji, Yiqun Kuang. "A Survey on Media Interaction in Social Robotics", in Proceedings of the 16th Pacific-Rim Conference on Multimedia (PCM), pp.181-190, 2015.


[7] Guizzo, E. "Cynthia Breazeal Unveils Jibo, a Social Robot for the Home", IEEE Spectrum, 16 July 2014. http://spectrum.ieee.org/automaton/robotics/home-robots/cynthia-breazealunveils-jibo-a-social-robot-for-the-home

[8] G. Dahl, D. Yu, L. Deng, and A. Acero. "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE Trans. Audio, Speech, Lang. Proc., Vol.20, pp.30-42, 2012.

[9] Raymond, C. and Riccardi, G. "Generative and discriminative algorithms for spoken language understanding", in Proceedings of InterSpeech, pp.1605-1608, 2007.

[10] Young, S. "Using POMDPs for Dialog Management", in Proceedings of IEEE/ACL SLT, 2006.

[11] Intelligent Voice. http://www.intelligentvoice.com/blog/new-poll-apple-oversold-siri-says-46-americans/

[12] Tatsuya Kawahara. "A Brief History of Spoken Dialogue Systems", Journal of the Japanese Society for Artificial Intelligence, Vol.28, No.1, pp.45-51, 2013.

[13] S. Young. "Large Vocabulary Continuous Speech Recognition: A Review", IEEE Signal Processing Magazine, Vol.13, No.5, pp.45-57, 1996.

[14] J. Baker, L. Deng, J. Glass, S. Khudanpur, Chin-hui Lee, N. Morgan, and D. O'Shaughnessy. "Developments and directions in speech recognition and understanding, part 1", IEEE Signal Processing Magazine, Vol.26, No.3, pp.75-80, May 2009.

[15] G. Dahl, D. Yu, L. Deng, and A. Acero. "Large vocabulary continuous speech recognition with context-dependent DBN-HMMs", in Proceedings of ICASSP, 2011.


[16] McGlashan et al. "Voice Extensible Markup Language (VoiceXML) Version 2.0", W3C Recommendation, 16 March 2004. http://www.w3.org/TR/2004/REC-voicexml20-20040316/

[17] P. Blunsom and T. Cohn. "Discriminative word alignment with conditional random fields", in Proceedings of the International Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pp.65-72, 2006.

[18] D. M. Blei, A. Y. Ng, and M. I. Jordan. "Latent Dirichlet allocation", Journal of Machine Learning Research, Vol.3, pp.993-1022, 2003.

[19] Ilya Sutskever, Oriol Vinyals, Quoc V. Le. "Sequence to Sequence Learning with Neural Networks", in Proceedings of NIPS, 2014.

[20] Gabriel Skantze. "Error Handling in Spoken Dialogue Systems - Managing Uncertainty, Grounding and Miscommunication", Doctoral dissertation, KTH, Department of Speech, Music and Hearing, Stockholm, Sweden, 2007.

[21] Small, S., Liu, T., Shimizu, N. and Strzalkowski, T. "HITIQA: An Interactive Question Answering System", in Proceedings of the ACL Workshop on Multilingual Summarization and Question Answering, 2003.

[22] Busemann, Stephan and Helmut Horacek. "A flexible shallow approach to text generation", in Proceedings of the Ninth International Workshop on Natural Language Generation, pp.238-247, 1998.

[23] Theune, Mariet, Esther Klabbers, Jan-Roelof de Pijper, Emiel Krahmer, and Jan Odijk. "From data to speech: A general approach", Natural Language Engineering, 7(1), pp.47-86, 2001.


[24] McRoy, Susan W., Songsak Channarukul, and Syed S. Ali. "An augmented template-based approach to text realization", Natural Language Engineering, 9(4), pp.381-420, 2003.

[25] Ioannis Konstas and Mirella Lapata. "Concept-to-text generation via discriminative reranking", in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pp.369-378, Stroudsburg, PA, USA, 2012.

[26] Pablo A. Duboue and Kathleen R. McKeown. "Statistical acquisition of content selection rules for natural language generation", in Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, pp.121-128, Stroudsburg, PA, USA, 2003.

[27] Miguel Ballesteros, Simon Mille, and Leo Wanner. "Classifiers for data-driven deep sentence generation", in Proceedings of the 8th International Natural Language Generation Conference (INLG), pp.108-112, Philadelphia, Pennsylvania, U.S.A., June 2014.

[28] Beskow, J. "Talking heads - Models and applications for multimodal speech synthesis", Doctoral dissertation, KTH, Department of Speech, Music and Hearing, Stockholm, 2003.

[29] Chung-Hsien Wu, Chi-Chun Hsia, Chung-Han Lee, and Mai-Chun Lin. "Hierarchical Prosody Conversion using Regression-Based Clustering for Emotional Speech Synthesis", IEEE Transactions on Audio, Speech and Language Processing, Vol.18, No.6, August 2010.

[30] Dan-ning Jiang, Wei Zhang, Li-qin Shen and Lian-Hong Cai. "Prosody Analysis and Modeling for Emotional Speech Synthesis", in Proceedings of ICASSP, pp.281-284, 2005.

[31] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B. “A database of German emotional speech”, in Proceedings of Interspeech, pp.1517-1520, 2005.

[32] M. E. Ayadi, M. S. Kamel, and F. Karray. “Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases”, Pattern Recognition, vol.44, no.3, pp.572-587, Mar. 2011.

[33] M. Borchert, A. Dusterhoft. “Emotions in speech - experiments with prosody and quality features in speech for use in categorical and dimensional emotion recognition environments”, in Proceedings of the 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'05), Piscataway, IEEE, pp.147-151, 2005.

[34] B. Schuller, G. Rigoll, and M. Lang. “Hidden Markov Model-based Speech Emotion Recognition”, in Proceedings of IEEE-ICASSP, pp.401-405, 2003.

[35] Kun Han, Dong Yu, Ivan Tashev. “Speech emotion recognition using deep neural network and extreme learning machine”, in Proceedings of INTERSPEECH 2014, pp.223-227, 2014.

[36] Yankelovich, N., Levow, G., and Marx, M. “Designing Speech Acts: Issues in Speech User Interfaces”, in Proceedings of CHI, pp.369-376, 1995.

[37] Stefan Kopp, et al. “A Conversational Agent as Museum Guide - Design and Evaluation of a Real-World Application”, in Proceedings of IVA05, pp.329-343, 2005.

[38] Ball, G., Breese, J. “Emotion and personality in a conversational character”, in Proceedings of the Workshop on Embodied Conversational Characters, pp.83-84 and 119-121, 1998.

[39] Bell, L., et al. “The Swedish NICE Corpus - Spoken dialogues between children and embodied characters in a computer game scenario”, in Proceedings of Interspeech, pp.2765-2768, 2005.

[40] Y. Minami, et al. “The World of Mushrooms: Human-Computer Interaction Prototype Systems for Ambient Intelligence”, in Proceedings of ICMI, pp.366-373, 2007.

[41] Satoshi Oyama, Takashi Kokubo, Toru Ishida. “Domain-Specific Web Search with Keyword Spices”, IEEE Transactions on Knowledge and Data Engineering, Vol.16, No.1, pp.17-27, 2004.

[42] N. Ring and A. Uitenbogerd. “Finding 'Lucy in Disguise': The Misheard Lyric Matching Problem”, in Proceedings of AIRS 2009, pp.157-167, 2009.

[43] H. Hirjee and D. G. Brown. “Solving Misheard Lyric Search Queries using a Probabilistic Model of Speech Sounds”, in Proceedings of ISMIR, pp.137-148, 2010.

[44] KISS THIS GUY, “Search by Song, Lyric or Artist”. http://www.kissthisguy.com/

[45] K. Katsurada, S. Teshima and T. Nitta. “Fast Keyword Detection Using Suffix Array”, in Proceedings of INTERSPEECH, pp.2147-2150, 2009.

[46] Yamasita, T. and Matsumoto, Y. “Full Text Approximate String Search using Suffix Arrays”, IPSJ SIG Technical Reports, 1997-NL-121, pp.23-30, 1997. (Japanese)

[47] M. M. Sondhi. “New Methods of Pitch Extraction”, IEEE Trans. Audio and Electroacoustics, Vol.AU-16, No.2, pp.262-266, June 1968.

[48] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal. “A Comparative Performance Study of Several Pitch Detection Algorithms”, IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-24(5), pp.399-418, October 1976.

[49] Tetsuya Shimamura, Hiroshi Takagi. “Noise-Robust Fundamental Frequency Extraction Method Based on Exponentiated Band-Limited Amplitude Spectrum”, in Proceedings of the 47th IEEE International Midwest Symposium on Circuits and Systems, 2004.

[50] “Development of a Dialogue Platform for Smartphones Supporting Multi-Device Cooperation: Understanding Users' Preferences and Habits to Provide Optimal Information” (Japanese), 2013.
http://www.kddilabs.jp/newsrelease/2013/101001.html

[51] Akinobu Lee, Keiichiro Oura, and Keiichi Tokuda. “MMDAgent - a fully open-source toolkit for voice interaction systems”, in Proceedings of ICASSP, pp.8382-8385, 2013.

[52] Nobuyuki Nishizawa and Tsuneo Kato. “Accurate parameter generation using fixed-point arithmetic for embedded HMM-based speech synthesizers”, in Proceedings of ICASSP, pp.4696-4699, 2011.

[53] Brandon Ballinger, Cyril Allauzen, Alexander Gruenstein, Johan Schalkwyk. “On-Demand Language Model Interpolation for Mobile Speech Input”, in Proceedings of Interspeech, pp.1812-1815, 2010.

[54] Wilks, Y. “Artificial Companions as a New Kind of Interface to the Future Internet”, Oxford Internet Institute Research Report No.13, 2006.

[55] Vanden Abeele, et al. “Introducing human-centered research to game design: designing game concepts for and with senior citizens”, CHI'06 Extended Abstracts on Human Factors in Computing Systems, ACM, pp.1469-1474, 2006.

[56] “User Behavior Analysis Research on Smartphone Games of Mobage and GREE”, Seed Planning, Inc., 2012. (Japanese)

[57] Peratama (Android Application in Google Play).
https://play.google.com/store/apps/details?id=jp.kddilabs.shaberobo&hl=ja

[58] Katherine Isbister, Noah Schaffer. “Game Usability: Advancing the Player Experience”, CRC Press, 2008.

[59] Hey Peratama (Android Application in Google Play).
https://play.google.com/store/apps/details?id=jp.kddilabs.peratama2&hl=ja

[60] COOKPAD.
http://cookpad.com/

[61] Recipe Search.
http://www.medc.jp/recipe-search/

[62] Fadi Badra et al. “Taaable: Text Mining, Ontology Engineering, and Hierarchical Classification for Textual Case-Based Cooking”, in Proceedings of the 9th European Conference on Case-Based Reasoning (ECCBR), pp.219-228, 2008.

[63] Teemu Hirsimaki and Mikko Kurimo. “Analysing recognition errors in unlimited-vocabulary speech recognition”, in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Boulder, Colorado, May 31 - June 5, 2009.

[64] Ville T. Turunen, Mikko Kurimo. “Indexing confusion networks for morph-based spoken document retrieval”, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.631-638, 2007.

[65] J. R. Quinlan. “Induction of Decision Trees”, Machine Learning, Vol.1, No.1, pp.81-106, 1986.

[66] Mark Dredze, Koby Crammer, Fernando Pereira. “Confidence-weighted linear classification”, in Proceedings of the 25th International Conference on Machine Learning, pp.264-271, 2008.

[67] Akiko Aizawa. “An information-theoretic perspective of tf-idf measures”, Information Processing and Management: an International Journal, vol.39, no.1, pp.45-65, 2003.

[68] Michael McCandless, Erik Hatcher, Otis Gospodnetic. “Lucene in Action (2nd ed.)”, Manning Publications, 2010.

[69] Downie and Cunningham. “Toward a theory of music information retrieval queries: System design implications”, in Proceedings of ISMIR, pp.299-300, 2002.

[70] OKWAVE (Japanese).
http://okwave.jp/

[71] 教えて!goo (Japanese).
http://oshiete.goo.ne.jp/

[72] Christopher Fox. “A stop list for general text”, ACM SIGIR Forum, Vol.24, No.1-2, pp.19-21, Fall 1989/Winter 1990.

[73] Nina Kummer, Christa Womser-Hacker and Noriko Kando. “MIMOR@NTCIR5: A Fusion-based Approach to Japanese Information Retrieval”, in Proceedings of the NTCIR-5 Workshop Meeting, Tokyo, Japan, 2005.

[74] Poshyvanyk, D., Gueheneuc, Y.-G., Marcus, A., Antoniol, G., Rajlich, V. “Combining Probabilistic Ranking and Latent Semantic Indexing for Feature Identification”, in Proceedings of the 14th IEEE International Conference on Program Comprehension, pp.137-148, 2006.

[75] Makoto Yamada, Tsuneo Kato, Masaki Naito and Hisashi Kawai. “Improvement of Rejection Performance of Keyword Spotting Using Anti-Keywords Derived from Large Vocabulary Considering Acoustical Similarity to Keywords”, in Proceedings of INTERSPEECH, pp.1445-1448, 2005.

[76] MeCab: Yet Another Part-of-Speech and Morphological Analyzer (Japanese).
http://taku910.github.io/mecab/

[77] Chazan, D., Zibulski, M., Hoory, R. and Cohen, G. “Efficient periodicity extraction based on sine-wave representation and its application to pitch determination of speech signals”, in Proceedings of EUROSPEECH, pp.2427-2430, 2001.

[78] Schuller, B., Muller, R., Lang, M., Rigoll, G. “Speaker Independent Emotion Recognition by Early Fusion of Acoustic and Linguistic Features within Ensembles”, in Proceedings of Interspeech, pp.805-809, 2005.

[79] Zhu Xiaoyan, Wang Yu and Liu Jun. “An Approach of Fundamental Frequencies Smoothing for Chinese Tone Recognition”, Chinese Journal of Computers, Vol.24, No.2, pp.213-218, 2001.

[80] Vogt, T., Andre, E. “Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition”, in Proceedings of ICME, pp.474-477, 2005.

[81] “What is Prosody?”.
http://kochanski.org/gpk/prosodies/section1/

[82] Lawrence Rabiner, Biing-Hwang Juang. “Fundamentals of Speech Recognition”, Prentice Hall PTR, 1993.

[83] J. L. Flanagan. “Speech Analysis, Synthesis, and Perception, 2nd ed.”, Springer-Verlag, 1972.

[84] Noboru Hayasaka, Naoya Wada, Yoshikazu Miyanaga and Nobuo Hataoka. “Running Spectrum Filter for Robust Speech Recognition”, IEICE Technical Report, pp.31-36, 2003.

[85] Qi Zhu, Noriyuki Ohtsuki, Yoshikazu Miyanaga, and Norinobu Yoshida. “Noise-Robust Speech Analysis Using Running Spectrum Filtering”, IEICE Trans. Fundamentals, 2005.

[86] Hess, W. J. “Pitch and Voicing Determination”, in Advances in Speech Signal Processing, edited by S. Furui and M. M. Sondhi, Marcel Dekker, New York, 1992.

[87] McGonegal, C. A., Rabiner, L. R. and Rosenberg, A. E. “A Semiautomatic Pitch Detector (SAPD)”, IEEE Trans. Acoust., Speech, and Sig. Processing, vol.ASSP-23, no.6, pp.570-574, Dec. 1975.

List of Publications

[1] Peer-reviewed Journals

1. Xin Xu, Noboru Hayasaka, Yoshikazu Miyanaga. “Robust Speech Spectra Restoration against Unspecific Noise Conditions for Pitch Detection”, IEICE Transactions, Vol.E91-A, No.3, pp.775-781, 2008.

2. Xin Xu and Tsuneo Kato. “Robust and Fast Phonetic String Matching Method for Lyric Searching Based on Acoustic Distance”, IEICE Transactions on Information and Systems, Vol.E97-D, No.9, pp.2501-2509, 2014.

[2] International Conferences

1. Xin Xu, Noboru Hayasaka, Qi Zhu and Yoshikazu Miyanaga. “Noise Robust Chinese Speech Recognition System for Isolate Words”, in Proceedings of the International Workshop on Nonlinear Signal and Image Processing, pp.420-425, 2005.

2. Xin Xu and Yoshikazu Miyanaga. “A Robust Pitch Detection in Noisy Speech with Band-Pass Filtering on Modulation Spectra”, in Proceedings of the International Symposium on Communications and Information Technologies, pp.266-269, 2005.

3. Xin Xu, Masaki Naito, Tsuneo Kato, Hisashi Kawai. “Robust and Fast Lyric Search Based on Phonetic Confusion Matrix”, in Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR), pp.417-422, 2009.

4. Xin Xu, Tsuneo Kato. “Robust and Fast Two-Pass Search Method for Lyric Search Covering Erroneous Queries Due to Mishearing”, in Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing - Volume Part II, pp.306-317, 2012.

5. Xin Xu, Jianming Wu, Kengo Fujita, Tsuneo Kato, Fumiaki Sugaya. “Hey Peratama: a breeding game with spoken dialogue interface”, in Proceedings of the 13th International Conference on Mobile and Ubiquitous Multimedia, pp.266-267, 2014.

6. Xin Xu, Yoshikazu Miyanaga. “Prosody Information for Emotion Recognition in the Real World”, in Proceedings of the International Symposium on Multimedia and Communication Technology, 2016. (accepted for publication)

[3] Domestic Conferences

1. Xin Xu, Noboru Hayasaka, Qi Zhu, and Yoshikazu Miyanaga. “Noise Robust Fundamental Frequency Detection for Isolate Words”, Joint Convention Record, the Hokkaido Chapters of the Institutes of Electrical and Information Engineers, Japan, pp.337-338, 2004.

2. Xin Xu, Masaki Naito, Tsuneo Kato, Hisashi Kawai. “An Introduction of a Fuzzy Text Retrieval System For Music Information Retrieval”, IPSJ SIG Technical Reports (Music Information Science), 2008(127), pp.41-46, 2008. (Japanese)

3. Xin Xu, Tsuneo Kato. “Interactive Recipe Search Interface using Spoken Dialogue Agent for Tablet Devices”, IEICE Technical Report, vol.113, no.73, HIP2013-25, pp.191-193, May 2013.

4. 大谷智子, 徐シン, 加藤恒夫, 相澤清晴. “The Influence of a Character's Impression on User Tolerance” (Japanese), the 363rd Meeting on Acoustical Engineering / the 74th Meeting on Ultrasonic Electronics, subcommittee meetings of the Tohoku University Engineering Research Society, 363-1, 74-1, 2013.

5. 大谷智子, 徐シン, 加藤恒夫, 相澤清晴. “The Influence of a Character's Impression on User Tolerance: Verification Using a Spoken Dialogue Application” (Japanese), the 18th Annual Conference of the Virtual Reality Society of Japan, 34C1, 2013.

6. 大谷智子, 徐シン, 加藤恒夫, 相澤清晴. “A Preliminary Study on Agent Impressions and User Tolerance of Errors” (Japanese), Joint Conference of the Tohoku Branches of Electrical Engineering Societies, 2H15, 2013.

7. 呉剣明, 住友亮翼, 萩谷俊幸, 徐シン, 矢崎智基. “A Cross-Device Dialogue Agent Providing Personalized Information” (Japanese), p.161, IEICE General Conference, 2016.

8. 住友亮翼, 呉剣明, 徐シン, 矢崎智基. “An Analysis of Utterance Tendencies by Time of Day in a Spoken Dialogue System” (Japanese), p.160, IEICE General Conference, 2016.