2012 IEEE 36th Annual Computer Software and Applications Conference (COMPSAC 2012), Izmir, Turkey, 16-20 July 2012




Implementation and Evaluation of Commodity Hardware and Software in an Open World Spoken Dialog Framework

Hareendra Manuru, Rajagopal Vasudevan, Ashok Sasidharan

Electrical and Computer Engineering The Ohio State University

Columbus, United States of America {manuru.1, vasudevan.19, sasidharan.2}@osu.edu

Thomas Lynch, Seth Darbyshire, Sayajeet Raje, Rajiv Ramnath, Jayashree Ramanathan

Computer Science and Engineering The Ohio State University

Columbus, United States of America {lynch.268, darbyshire.11, raje.3, ramnath.6,

ramanathan.2}@osu.edu

Abstract— Several published papers describe frameworks for implementing an Open World Dialog system. This research conducts a critical review of one such system, built from commodity hardware and software, independent of the vendor and the framework's authors. The results delineate the parts of the system that are implemented by the SDKs and identify the components that require development. Furthermore, we estimate the difficulty of implementing each component of the dialog system using the SDKs.

Keywords-HCI; dialogue system; multiparty; multimodal

I. INTRODUCTION

Spoken dialog systems are commonplace today, serving as interfaces to routine business services. The next horizon is spoken dialog interfaces to systems in public settings such as malls, museums, and office buildings, providing a more natural and efficient interface. These systems bring new challenges, including managing dialogs with multiple people and coping with background noise. Bohus and Horvitz [1] define the Open World Spoken Dialog System and specify the challenges such systems face.

Speech recognition and text-to-speech Software Development Kits (SDKs) are readily available for free or at low cost from several vendors and research groups. With the release of the Microsoft Kinect, a low-cost commodity hardware device can provide the microphone array and cameras needed to analyze the dynamic scene and segregate the people and noises within it. Combined, these components provide the potential for implementing an Open World Dialog system.
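The Kinect SDK exposes sound source localization directly; the sketch below is not Kinect SDK code but a minimal standalone illustration of the underlying principle, assuming a simple two-microphone geometry: estimate the inter-microphone arrival-time difference by cross-correlation, then convert the delay to a bearing angle. All function names and parameters here are illustrative assumptions.

```python
import numpy as np

def estimate_delay(sig_a, sig_b, fs):
    """Estimate the arrival-time difference (seconds) between two
    microphone signals from the lag of the cross-correlation peak.
    Positive result: sig_a lags sig_b."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)
    return lag / fs

def delay_to_angle(delay, mic_distance, speed_of_sound=343.0):
    """Convert an inter-microphone delay to a bearing angle (radians).
    Clip the sine to [-1, 1] to guard against noisy estimates."""
    s = np.clip(delay * speed_of_sound / mic_distance, -1.0, 1.0)
    return float(np.arcsin(s))

# Synthetic example: a noise burst reaching mic B five samples late.
fs = 16_000
rng = np.random.default_rng(0)
src = rng.standard_normal(1024)
delay_samples = 5
mic_a = src
mic_b = np.concatenate([np.zeros(delay_samples), src[:-delay_samples]])

tau = estimate_delay(mic_b, mic_a, fs)        # B lags A, so tau > 0
angle = delay_to_angle(tau, mic_distance=0.2)  # bearing off the array axis
```

A real deployment would use all microphones in the Kinect array and a noise-robust correlation weighting, but the delay-to-angle geometry is the same.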

Several published papers describe frameworks for implementing an Open World Dialog system. This research conducts a critical review of the hardware and software, independent of the vendors and authors. The system evaluated uses the commodity hardware and software provided by Microsoft within the framework proposed by Microsoft Research. The results delineate which parts of the system are implemented in the Kinect and Speech Server SDKs and identify which components require development. Furthermore, we estimate the difficulty of implementing each component of the Open World Spoken Dialog system using the SDKs.

II. RESULTS AND CONCLUSIONS

Most of the requirements of the framework are not met by the commodity components; however, implementing the missing components takes only a moderate level of effort, except for the Multiparty Management System (MMS). The MMS is the core component of the system: it controls the dialog states and goals of each agent in the scene. The MMS described in the reference framework is a complex piece of software that will require a significant effort to develop and test. The completion of this component would allow rapid development and deployment of these systems. Subsequent work includes development of the MMS.
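The kind of per-agent bookkeeping an MMS performs can be sketched as a small state machine. The four engagement states and the transition rules below are illustrative assumptions for this sketch, not the model published in the reference framework.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional

class Engagement(Enum):
    NOT_ENGAGED = auto()
    ENGAGING = auto()
    ENGAGED = auto()
    DISENGAGING = auto()

@dataclass
class Agent:
    """One person tracked in the scene, with a dialog goal slot."""
    agent_id: int
    state: Engagement = Engagement.NOT_ENGAGED
    goal: Optional[str] = None

class MultipartyManager:
    """Track an engagement state and a dialog goal per agent."""

    def __init__(self) -> None:
        self.agents: Dict[int, Agent] = {}

    def observe(self, agent_id: int, facing_system: bool,
                speaking: bool) -> Engagement:
        """Advance one agent's engagement state from sensor cues."""
        agent = self.agents.setdefault(agent_id, Agent(agent_id))
        if agent.state is Engagement.NOT_ENGAGED and facing_system:
            agent.state = Engagement.ENGAGING
        elif agent.state is Engagement.ENGAGING and speaking:
            agent.state = Engagement.ENGAGED
        elif agent.state is Engagement.ENGAGED and not facing_system:
            agent.state = Engagement.DISENGAGING
        elif agent.state is Engagement.DISENGAGING:
            agent.state = (Engagement.ENGAGED if facing_system
                           else Engagement.NOT_ENGAGED)
        return agent.state

# A person approaches, faces the system, then starts speaking.
mms = MultipartyManager()
mms.observe(1, facing_system=True, speaking=False)        # -> ENGAGING
state = mms.observe(1, facing_system=True, speaking=True)  # -> ENGAGED
```

The complexity of a real MMS lies in what this sketch omits: probabilistic inference over the cues, turn-taking across several simultaneously engaged agents, and coordinating each agent's goal with the system's own.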

The other missing component that is difficult to implement is the Animated Avatar. The required skill set is normally not found in software development groups; a developer who has it, however, could readily build the required avatar and its animation actions.

TABLE I. FRAMEWORK COMPARISON.

Component                    | Reference Framework | Commodity Framework | Implementation (1-easy, 10-difficult)
Person tracking              | No                  | Yes                 | Optional
Face detection and tracking  | Yes                 | No                  | 6
Pose tracking                | Yes                 | No                  | 6
Focus of attention           | Yes                 | No                  | 4
Sound source localization    | No                  | Yes                 | 3
Agent characterization       | Yes                 | No                  | Optional
Group inferences             | Yes                 | No                  | Optional
Speech recognition           | Yes                 | Yes                 | 2
Text-to-speech               | Yes                 | Yes                 | 2
Animated Avatar              | Yes                 | No                  | 8
Multiparty Management System | Yes                 | No                  | 9

ACKNOWLEDGMENT

This project is a collaboration between the Institute for Sensing Systems (ISS) and CERCS for Enterprise Transformation and Innovation (CETI).

REFERENCES

[1] D. Bohus and E. Horvitz, "Dialog in the Open World: Platform and Applications," in International Conference on Multimodal Interfaces, New York, 2009.

2012 IEEE 36th International Conference on Computer Software and Applications. 0730-3157/12 $26.00 © 2012 IEEE. DOI 10.1109/COMPSAC.2012.101
