Implementation and Evaluation of Commodity Hardware and Software in an Open World Spoken Dialog Framework
Hareendra Manuru, Rajagopal Vasudevan, Ashok Sasidharan
Electrical and Computer Engineering The Ohio State University
Columbus, United States of America {manuru.1, vasudevan.19, sasidharan.2}@osu.edu
Thomas Lynch, Seth Darbyshire, Sayajeet Raje, Rajiv Ramnath, Jayashree Ramanathan
Computer Science and Engineering The Ohio State University
Columbus, United States of America {lynch.268, darbyshire.11, raje.3, ramnath.6,
ramanathan.2}@osu.edu
Abstract— Several published papers describe various frameworks to implement an Open World Dialog system. This research conducts a critical review of one system using commodity hardware and software, independent of the vendor and authors. The results delineate the parts of the system that are implemented via the SDK and identify the components that require development. Furthermore, we estimate the difficulty of implementing each component of the dialog system using the SDKs.
Keywords-HCI; dialogue system; multiparty; multimodal
I. INTRODUCTION

Spoken dialog systems are commonplace today, serving as interfaces to routine business services. The next horizon is spoken dialog interfaces to systems in public settings such as malls, museums, and office buildings, where they can provide a more natural and efficient interface. These settings bring new challenges, including managing dialogs with multiple people and coping with background noise. Bohus and Horvitz [1] define the Open World Spoken Dialog System and specify the challenges such dialog systems must address.
Speech recognition and text-to-speech Software Development Kits (SDKs) are readily available for free or at low cost from several vendors and research groups. With the release of the Microsoft Kinect, a low-cost commodity hardware device can provide the microphone array and cameras needed to analyze and segregate the people and noises in a dynamic scene. Combined, these components provide the potential for implementing a public Open World Dialog system.
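To make concrete what the microphone array contributes, the sketch below illustrates the basic idea behind sound source localization: estimate the time difference of arrival (TDOA) between two microphones by cross-correlation, then convert it to a bearing angle. This is a hypothetical, simplified illustration of the technique, not the Kinect SDK's actual API; the function names, the two-microphone geometry, and all parameters are assumptions for the example.

```python
import math

def estimate_delay(sig_a, sig_b, max_lag):
    """Find the integer lag (in samples) that maximizes the
    cross-correlation between two microphone signals."""
    best_lag, best_score = 0, float("-inf")
    n = len(sig_a)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def arrival_angle(delay_samples, sample_rate, mic_spacing,
                  speed_of_sound=343.0):
    """Convert a time difference of arrival into a bearing angle
    (radians) for a two-microphone array with the given spacing (m)."""
    tdoa = delay_samples / sample_rate
    # Clamp to [-1, 1] to guard against rounding past the physical limit.
    ratio = max(-1.0, min(1.0, tdoa * speed_of_sound / mic_spacing))
    return math.asin(ratio)
```

A real system would use a frequency-domain estimator (e.g. GCC-PHAT) over a multi-element array, but the same TDOA-to-angle geometry applies.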
Several published papers describe frameworks for implementing an Open World Dialog system. This research conducts a critical review of the hardware and software, independent of the vendors and authors. The system evaluated uses the commodity hardware and software provided by Microsoft within the framework proposed by Microsoft Research. The results delineate which parts of the system are implemented in the Kinect and Speech Server SDKs and identify which components require development. Furthermore, we estimate the difficulty of implementing each component of the Open World Spoken Dialog system using the SDKs.
II. RESULTS AND CONCLUSIONS

Most of the requirements of the framework are not met by the commodity components; however, implementing the missing components takes only a moderate level of effort, except for the Multiparty Management System (MMS). The MMS is the core component of the system: it controls the dialog states and goals of each agent in the scene. As described in the reference framework, it is a complex piece of software that will require significant effort to develop and test. Completing this component would allow rapid development and deployment of these systems; subsequent work includes development of the MMS.
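As a minimal sketch of the bookkeeping at the heart of an MMS, the hypothetical class below tracks an independent dialog state and goal per agent in an open-world scene, where agents may appear and leave at any time. The state names, class name, and transitions are illustrative assumptions; the reference framework's MMS is far more sophisticated (engagement inference, turn-taking, interruption handling).

```python
# Hypothetical sketch: per-agent dialog state tracking, loosely inspired
# by the multiparty management described in the reference framework.
STATES = ("idle", "engaged", "speaking", "waiting")

class MultipartyManager:
    """Track an independent dialog state and goal for each agent
    (person) detected in the scene."""

    def __init__(self):
        self.agents = {}  # agent_id -> {"state": ..., "goal": ...}

    def add_agent(self, agent_id, goal=None):
        # New agents start idle until engagement is inferred.
        self.agents[agent_id] = {"state": "idle", "goal": goal}

    def remove_agent(self, agent_id):
        # Agents leave the open-world scene at any time.
        self.agents.pop(agent_id, None)

    def transition(self, agent_id, new_state):
        if new_state not in STATES:
            raise ValueError("unknown state: " + new_state)
        self.agents[agent_id]["state"] = new_state

    def engaged_agents(self):
        # Everyone currently participating in a dialog.
        return [a for a, info in self.agents.items()
                if info["state"] != "idle"]
```

Even this toy version suggests why the full component is hard: each agent's state machine must be coordinated with every other agent's, with the speech pipeline, and with scene-analysis events.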
The other missing component that is difficult to implement is the Animated Avatar. The skill set it requires is not normally found in software development groups, but a developer who has it could readily build the required avatar and animation actions.
TABLE I. FRAMEWORK COMPARISON

Component                      Reference   Commodity   Implementation
                               Framework   Framework   (1-easy, 10-difficult)
Person tracking                No          Yes         Optional
Face detection and tracking    Yes         No          6
Pose tracking                  Yes         No          6
Focus of attention             Yes         No          4
Sound source localization      No          Yes         3
Agent characterization         Yes         No          Optional
Group inferences               Yes         No          Optional
Speech recognition             Yes         Yes         2
Text-to-speech                 Yes         Yes         2
Animated Avatar                Yes         No          8
Multiparty Management System   Yes         No          9
ACKNOWLEDGMENT

This project is a collaboration between the Institute for Sensing Systems (ISS) and CERCS for Enterprise Transformation and Innovation (CETI).
REFERENCES

[1] D. Bohus and E. Horvitz, "Dialog in the Open World: Platform and Applications," in International Conference on Multimodal Interfaces, New York, 2009.
2012 IEEE 36th International Conference on Computer Software and Applications
0730-3157/12 $26.00 © 2012 IEEE
DOI 10.1109/COMPSAC.2012.101
364