

ATZ elektronik 04I2006

Authors: Eric Lehmann, Norman Rohr and Martin Reber

Erfolgreiche Planung und Durchführung von Sprachausgabesystemen

You will find the figures mentioned in this article in the German issue of ATZ elektronik 04I2006 beginning on page 60.

Successful Language Dialogue Project Implementation

Text-to-Speech (TTS) based multilingual voice output systems have become a standard in the automotive industry. The increasingly complex challenges within the automated voice output market can be masterfully managed through careful adherence to a few important ground rules. The bulk of this article from Svox AG is devoted to explaining how to optimize project implementation, based on long-standing collaborative success with OEM and Tier-1 partners.

1 Introduction

Over the past several years, drivers' expectations of their cars and of everything those cars can do have steadily increased. When the first automated navigation devices appeared, few had a critical word to say about the static voices spitting out general instructions. Nowadays the voice must sound tonally and linguistically natural.

Until recently, the integration of such adaptable extras has meant high costs, almost by default. Previously, only a static statement like “you have reached your destination” could be recorded. A dynamic combination like “you have reached your destination at 43 Robert-Koch Street” appeared only seldom and, in most cases, was a patchwork splice whose result sounded far from natural and expressive. Svox offers an intelligent alternative that uses previously recorded prompts as inputs that are synthetically combined and regenerated in a natural-sounding output. TTS-based voice output systems have, in fact, long existed in the automotive domain. But owing to the growing complexity of market demands, the challenges facing voice output system developers have likewise become more numerous and increasingly complicated.

2 Application Possibilities

The use of TTS systems is no longer confined to navigation. Indeed, more and more users feel that the benefits of speech technology are a must-have in several domains. TTS-based speech solutions also open up a whole new world of possibilities in fields as diverse as communications, travel, infotainment, and auto maintenance.

The many advantages of TTS-based speech solutions are a great boon for the automotive manufacturer faced with satisfying ever greater consumer demands. More and more drivers want navigation systems with special add-ons, something that was previously reserved for the ultra-luxury automotive segment but is now appearing ever more often in the mid-range. Business professionals are no longer the only ones who want stress-free directions through an unfamiliar city. Vacationing families also seek less taxing journeys.

Automobile manufacturers offering high-quality vehicle navigation systems increase the value and attractiveness of their product. And yet many shy away from what they expect to be high costs and labor-intensive development.

3 System Design

Svox offers OEM and Tier-1 suppliers a cost-efficient product that distinguishes itself through its modular design and its platform-independent components.

3.1 Software Components

The main components of the Svox voice-output speech solution are the software engine and the lingware packages, Figure 1.

3.1.1 Engine and Lingware

The engine itself does not contain specific speech- or voice-based data. Rather, it works with datasets provided by the lingware. The engine contains only the functions and algorithms required for general text analysis and voice synthesis. This design has the advantage that the software is not language specific and can therefore work with data in any language.
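The separation described above can be illustrated with a minimal Python sketch. It is not the Svox API: the `Lingware` and `Engine` classes, the `normalize` method, and the abbreviation tables are hypothetical names and data chosen for illustration only. The sketch merely shows the idea that one language-neutral engine can serve several languages when all language-specific data arrives as a dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Lingware:
    """Hypothetical language dataset; the engine holds no such data itself."""
    language: str
    abbreviations: dict = field(default_factory=dict)


class Engine:
    """Language-independent text normalizer (illustrative only)."""

    def normalize(self, text: str, lingware: Lingware) -> str:
        # Expand each token using only the data supplied by the lingware.
        words = text.split()
        return " ".join(lingware.abbreviations.get(w, w) for w in words)


# Assumed example datasets, not real lingware content.
english = Lingware("en-GB", {"St.": "Street", "Rd.": "Road"})
german = Lingware("de-DE", {"Str.": "Straße", "Pl.": "Platz"})

engine = Engine()  # one engine instance serves every language
print(engine.normalize("43 Robert-Koch St.", english))
print(engine.normalize("Robert-Koch Str. 43", german))
```

Adding a further language in this sketch means supplying another `Lingware` object; the `Engine` code stays unchanged.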

This division into the engine and the lingware component offers several other advantages when compared with more conventional systems. Traditionally, for instance, each language would require its own software system, which in the worst case may also be scaled differently. For manufacturers in the automotive and navigational device industry, this means higher costs, greater memory requirements, and more installation time.

Yet another advantage of the two-component design is that the system can be set up to run with any existing version. Those components with which the manufacturer is already satisfied need not be updated. They can easily be used in combination with new product models and features.

At runtime, the engine requires only the text to be spoken as input. Application context can be added as a parameter to trigger the most appropriate speaking style. Based on the text and context input, the Svox speech solution (Svox ExpertSpeech) automatically creates the best possible speech output by mixing fixed prompts and variable TTS parts and speaking them with a single voice.
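As a rough illustration of mixing fixed prompts with variable TTS parts, the following Python sketch splits an input text into segments that are covered by a recorded prompt and segments that would have to be synthesized. Everything in it is an assumption made for the example: the prompt inventory, the context-to-style table, and the function name `plan_output` are hypothetical and do not describe how Svox ExpertSpeech works internally.

```python
from typing import List, Tuple

# Hypothetical inventory of phrases that exist as studio recordings.
RECORDED_PROMPTS = ["you have reached your destination", "turn left"]

# Hypothetical mapping from application context to speaking style.
STYLES = {"navigation": "clear", "news": "neutral"}


def plan_output(text: str, context: str = "navigation") -> List[Tuple[str, str, str]]:
    """Split text into ('prompt' | 'tts', segment, style) tuples."""
    style = STYLES.get(context, "neutral")
    segments = []
    rest = text.strip()
    while rest:
        lowered = rest.lower()
        # Earliest recorded prompt in the remaining text; on ties, the longest.
        hits = [(lowered.find(p), -len(p), p) for p in RECORDED_PROMPTS if p in lowered]
        if not hits:
            # No recording covers the remainder, so it is left to synthesis.
            segments.append(("tts", rest, style))
            break
        start, _, prompt = min(hits)
        end = start + len(prompt)
        if start > 0:
            segments.append(("tts", rest[:start].strip(), style))
        segments.append(("prompt", rest[start:end], style))
        rest = rest[end:].strip()
    return segments


plan = plan_output("You have reached your destination at 43 Robert-Koch Street")
for kind, segment, style in plan:
    print(kind, "|", segment, "|", style)
```

The sketch only plans the segmentation. In a real system both kinds of segment would then be rendered with the same voice, which is what avoids the audible splice described in the introduction.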

The applied language and voice data are found in the lingware, which has two subdivisions. First, there is the language module, which contains the data necessary to analyze text in any given language. This module contains everything linguistic, for instance common abbreviations, word structure, and grammatical ground rules. The language of the datasets the system will ultimately be fed, whether French, Mandarin, or British English, is unrelated to the interfaces themselves.

The voice module, on the other hand, contains all of the data necessary to synthesize the language data with a particular voice and a particular vocal quality. This way, rather than delivering generic merchandise, a product is tailor-made to customer specifications. And because many manufacturers produce for a number of different language markets, it is important that the quality be equally high for every language. Svox has set the standard as an industry leader, offering more than 20 different voices for its languages. This makes for the consistently high quality of all Svox products.

In addition to addressing general issues of language and speech quality, it is important to accommodate the individual stipulations of OEMs and Tier-1 suppliers regarding special word pronunciation and sentence inflection. If desired, the inflection of single words can be altered to reflect the situational variations inherent in naturally expressive speech. Svox voices pronounce words from foreign languages with the same attention and care. If the English language system instructs: “Drive straight ahead towards Marseille”, Marseille would be pronounced with a slight English accent in order to avoid confusing the English-speaking driver. This special feature is available in nearly 20 languages.

3.1.2 Cost-efficiency and the SpeechCreate Tool

Svox’s SpeechCreate tool presents the option of optimizing the output to match the specifics of the application context and of generating several different versions of a text. Possible inputs could be changes to the dataset, for instance. SpeechCreate permits several variations of a sequence to be recorded and played back, and the appropriate ones to be selected. In a recording studio, it is theoretically possible to record a single sequence in several different ways and then to choose a variation. That method, however, is quite time-consuming, as it requires booking a recording studio and selecting and scheduling a voice talent. The time needed to finish an acceptable prompt could be as long as a few weeks.

SpeechCreate can not only facilitate a pre-selection, it can also act as an update for datasets that are no longer current. Rather than exchanging an entire lingware dataset, SpeechCreate allows the changes to be effectively overwritten by installing a new file configuration. These cost-effective adaptations are possible even at the last minute before the production phase. This means that necessary changes can be made at almost any stage without much overhead.

3.2 Implementation and Support

As mentioned above, the lingware data are platform independent and can be employed anywhere, whether on a PC or the head unit. The performance is the same on either platform.

The advantage of such a system for OEM engineers is that they can test and develop on their desktop computers precisely the same version of a language application that must be assembled by the manufacturers in the head unit. For the automotive manufacturer, this is an excellent quality-control system that runs before and all throughout production. Should a detail not meet the standards of the customer, or should the requested specifications change, even late-stage alterations can be made elegantly, efficiently, and cost-effectively with the SpeechCreate tool.

4 Conclusion

At a time when increasingly extravagant electronic vehicular systems are competing for storage space while the demands and standards of consumers are steadily rising, more quality in less memory has become a matter of singular importance. Similarly, the improved expressivity of the voice-output systems of the future will go hand in hand with reduced memory requirements that maintain a high-quality product through more efficient signal compression. In these days of ever shorter product life cycles and short-lived trends, it is becoming increasingly clear that lead times and costs to market must be minimized.

If you are driving a top-of-the-range luxury car, you expect the voice of the vehicle to reflect the corporate image, Figure 2. Understanding how to manage the in-car voice output successfully is a fundamental necessity for opening up a whole new world of customer usability, convenience, comfort, and safety benefits.

Supporting OEMs and Tier-1 partners in the planning, design, and execution of speech solutions makes it possible to go above and beyond challenging business goals, to not only satisfy but exceed application requirements and, most importantly, to optimize the user experience by providing users with a compelling voice interface.