
508 IEEE Transactions on Consumer Electronics, Vol. 60, No. 3, August 2014

Contributed Paper Manuscript received 06/30/14 Current version published 09/23/14 Electronic version published 09/23/14. 0098 3063/14/$20.00 © 2014 IEEE

Multilingual Speech-to-Speech Translation System for Mobile Consumer Devices

Seung Yun, Young-Jik Lee, and Sang-Hun Kim

Abstract — With the advancement of speech recognition and machine translation technology, together with the rapid spread of mobile devices, speech-to-speech translation is no longer merely a research subject and has become popular among many users. To develop a speech-to-speech translation system that can be widely used, however, the system must reflect the various characteristics of utterances by the users who will actually use it, beyond improving its basic performance under experimental conditions. Based on a survey of users' demands, this study built a massive language and speech database that closely matches the environment in which speech-to-speech translation devices are actually used, by mobilizing a large number of participants. This made it possible to secure excellent baseline performance under conditions similar to real speech-to-speech translation use, rather than only under the experimental environment. Moreover, a user-friendly speech-to-speech translation UI was designed, and many measures were employed to reduce errors during translation and to enhance user satisfaction. After the actual service was launched, the massive database collected through the service was filtered and additionally applied to the system in order to achieve the best possible robustness to both the content and the environment of users' utterances. By applying these measures, this paper presents the procedures through which a multilingual speech-to-speech translation system was successfully developed for mobile devices1.

Index Terms — Speech-to-speech translation system, speech recognition, machine translation, human-computer interface

I. INTRODUCTION

Speech recognition technology may be the most notable user-friendly interface, as it has been widely depicted in numerous movies and novels. Because of this user-friendliness, speech recognition technology has been studied since the 1960s; however, it only began to see practical use in the 1990s. Since the 2000s, speech recognition technology has become popularized as the collection of corpora became possible through the internet and computing power advanced remarkably [1], [2]. Lately, starting with automobile navigation systems [3], speech recognition technology has been applied to various devices including digital cameras [4], smart robots [5], refrigerators, and smart TVs [6], [7]. In particular, as mobile terminals with built-in microphones and wireless data networks, namely smartphones, became rapidly popularized, speech recognition technology is being applied to mobile terminals in a wide variety of ways, including voice search services [8], [9] and personal assistant services. One of the most notable examples is speech-to-speech translation technology, in which speech recognition, machine translation, and speech synthesis technologies all come together.

1 This work was supported by the Ministry of Science, ICT and Future Planning, Korea, under the ICT R&D Program (Strengthening competitiveness of automatic translation industry for realizing language barrier-free Korea).

Seung Yun is with the Department of Computer Software, University of Science and Technology, Daejeon, Korea. He is also with the Automatic Speech Translation Section, ETRI, Daejeon, 305-700, Korea (e-mail: [email protected]).

Young-Jik Lee and Sang-Hun Kim are with the Automatic Speech Translation Section, ETRI, Daejeon, 305-700, Korea (e-mail: [email protected], [email protected]).

Speech-to-speech translation technology automatically translates speech in one language into speech in another language in order to enable communication between two parties with different native tongues. To do this, it is composed of the three core technologies mentioned above: speech recognition, which recognizes a human utterance and converts it into text; machine translation, which translates text in one language into text in another language; and speech synthesis, which converts the translated text into speech. Additionally, natural language understanding and user interface-related technology integrated with the UI also play an important role in a speech-to-speech translation system.

Speech-to-speech translation technology has been intensively studied since the 1990s. The technology has been developed mostly through international joint studies such as C-STAR [10], and a number of studies have been conducted through various international collaborations including NESPOLE! [11], TC-STAR, and GALE [12]. However, the technology was not widely used by the general public at that time, and it is only now slowly spreading as it has become more mature in recent years.

Speech-to-speech translation technology has been used by the U.S. Armed Forces in Iraq for military purposes [13], and the LA Police Department has also employed speech-to-speech translation devices.


The technology was also used in hospitals to help communication between doctors and patients [14]. In addition, speech-to-speech translation is currently employed in various fields such as broadcast news translation [15] and lecture translation [16]. Recently, as mobile devices and wireless data communication have become popular among the general public, the foundation has been laid for the general public to easily use speech-to-speech translation technology.

This study introduces "Genietalk," a network-based multi-language speech-to-speech translation system that has recorded over 1.8 million downloads. Based on the idea that speech-to-speech translation should be treated as an actual conversation between people, Genietalk distinguishes itself with a massive corpus collected from over 6,000 people over several years in order to reflect the diversity of the general public. Moreover, its UI has been specially developed to provide the utmost convenience to its users. Most of all, its performance was verified while actually providing large-scale services; based on the log data collected from these services, Genietalk has gradually improved its speech-to-speech translation performance.

This paper describes these research experiences in the following order. Chapter II details the results of the survey on users' demands, and Chapter III explains how the system was developed based on the survey. Chapter IV reports the performance of speech-to-speech translation. Chapter V introduces the current status of the speech-to-speech translation service, and Chapter VI discusses how system performance was improved through the actual operation of the service.

II. USERS’ DEMANDS

A. Survey on Users’ Demands

In order to design an effective speech-to-speech translation system, an interview-oriented survey and an FGI (Focus Group Interview) were conducted. The interview-oriented survey was administered to 302 people between the ages of 20 and 59, with the distribution of residential region, gender, age group, and household income balanced according to demographic characteristics. The FGI was conducted with groups of 7 to 8 people, divided into participants in their 20s, 30s, and 40s. All FGI participants had experience travelling overseas but had limited foreign language skills; in other words, they are the main target users of speech-to-speech translation devices. Additionally, an in-depth interview was conducted with 3 professional overseas tour guides to draw on their experience in identifying what kind of speech-to-speech translation system is required. First, the survey investigated the situations that required verbal conversation during overseas travel. Table I shows the top 3 situations, by site, in which a speech-to-speech translation device was needed.

TABLE I
KEY SITUATIONS WHEN A SPEECH-TO-SPEECH TRANSLATION DEVICE IS NEEDED DURING OVERSEAS TRAVEL

Site            Rank  Situation
Airport         1     Report of belongings
                2     Going through immigration
                3     Delay of airplane
Hotel           1     Request for equipment & report of incidents
                2     Asking how to operate facilities
                3     Changing rooms
Tour            1     Visiting hours & entry fees for tour attractions
                2     Recommending tour attractions
                3     Report of theft & loss
Shops           1     Refund, exchange, cancellation
                2     Inquiry on products
                3     Errors in price calculation
Restaurant      1     Menu recommendation
                2     Omitted order, change & cancellation of menu
                3     Description of dishes
Transportation  1     How to use public transportation
                2     Location of destination
                3     Reservation of transportation

According to the analysis of the survey, participants mostly responded that they needed a speech-to-speech translation device when facing unexpected situations or when they had to provide specific explanations, rather than in generally predictable situations. During the FGI, an in-depth interview was conducted to find out whether a speech-to-speech translation system is necessary and, if so, what demands users would make. Consequently, 18 out of 26 FGI participants responded that a speech-to-speech translation device is absolutely necessary and 4 said they found the device somewhat necessary, whereas only 4 responded that they did not think the device was necessary, indicating that the majority rated the necessity of the device highly. Notably, participants from older age groups expressed the need for the device more strongly than participants in their 20s, presumably because the younger generation received more English education than their older counterparts. When travelling to non-English-speaking countries, the demand for a device supporting the local language was found to be very high.

Regarding users' demands on the input methods of the speech-to-speech translation device, respondents favored a text-input method through a keypad as well as speech recognition. On the other hand, respondents did not show much interest in methods that limit the scope of translation, such as searching simple example sentences or using menus designated for particular situations. The most requested convenience functions were speech-to-speech translation through a Bluetooth headset and an advanced example-sentence search that retrieves the intended expression within a scope defined by the user. Regarding the preferred type of speech-to-speech translation device, the smartphone turned out to be the most favored device,


followed by laptop computers and dedicated speech-to-speech translation devices. The rationale behind this finding is the portability of the mobile device, which is also easy to use once a user becomes accustomed to its operation. At the same time, the desire to lighten luggage when travelling also seemed to have an impact.

For the most anticipated category, the level of expectation toward speech-to-speech translation performance, 6 out of 25 FGI participants expected 100%, followed by 6 people at 95%, 6 at 90%, and 4 below 90%. These figures were higher than first anticipated; however, it should be understood that these levels of expectation applied not to the general translation of all sentences but to specific situations such as travel and tourism. Based on these findings, a speech-to-speech translation device can be considered commercializable when its performance exceeds 90% on the utterances people typically need in travel/tourism situations.

B. Definition of Speech-to-speech Translation System Configuration Based on the Survey of Users’ Demands

Reflecting the users' demands discussed above, the smartphone was selected, since it is not only easy to carry but also the most popular candidate among speech-to-speech translation devices. At the same time, as data communication has become commonplace and the burden of using the wireless communication network has lessened, the system was designed to perform speech-to-speech translation through communication with servers, with the speech-to-speech translation engines deployed on high-performance servers. Also, considering the users' high demand for multi-language support, it was decided to start with a Korean-English speech-to-speech translation service and to expand to Korean-Japanese and Korean-Chinese services. The available languages will continue to grow with French, Spanish, Russian, and German.

The next most important issue is the performance of the speech-to-speech translation engine. To satisfy every participant of the survey, the translation success rate would have to be 100%; however, this is impossible to achieve, because speech recognition cannot be perfect and performance inevitably deteriorates further through the machine translation step. Thus, after excluding users who expected a 100% success rate, the remaining participants were regarded as candidate users of the speech-to-speech translation device. If the translation success rate reaches 90%, it is expected to secure about 70% of the potential users who are interested in speech-to-speech translation. Therefore, the system was developed with a goal of achieving a 90% translation success rate. To reach this goal, development focused on the travel/daily-life domain, and the system was designed to satisfy users' expectations by building a massive corpus for the language model in this domain through various means. These procedures are detailed in the next chapter. The input method was designed to support both speech recognition and keypad input, and an advanced search function for example sentences, based on the words and sentences most demanded by users, was included in the scope of development.

III. STRUCTURE OF SPEECH-TO-SPEECH TRANSLATION SYSTEM

A. Genietalk System Overview

Reflecting the users' demands discussed in the previous chapter, we developed "Genietalk," a multi-language speech-to-speech translation system. Genietalk is a network-based application designed to run on mobile devices using Android and iPhone OS, and it is available for anyone to download and use. Currently, speech-to-speech translation is available for Korean-English, Korean-Japanese, and Korean-Chinese. The vocabulary recognized by each speech recognition engine is 230,000 words for Korean, 80,000 words for English, 130,000 words for Japanese, and 330,000 words for Chinese. For the translation engines, a pattern-based translation methodology was employed for Korean-English and Korean-Chinese, whereas an SMT-based methodology [17] was selected for the Korean-Japanese engine in consideration of the linguistic similarity between the two languages. For the speech synthesis engines, a commercial engine was chosen after considering the market climate.

B. Speech Recognition Engine

The speech recognition engine of Genietalk uses an HMM-based acoustic model with a tri-gram language model, and the decoder is structured as a wFST (weighted Finite State Transducer).

The Korean acoustic model was trained with a 1,820-hour speech database. For the speech signal, features were extracted from 20 ms frames with a 10 ms shift, and 39-dimensional MFCC features were used as the feature vector. The number of GMM (Gaussian Mixture Model) components after triphone tying was set to 32. One notable point is that the speech recognition engine for speech-to-speech translation was built to contain as much dialog-style speech collected through mobile device channels as possible, since the engine is aimed at dialog-style speech recognition. To be robust against noisy environments, the engine was trained on a database into which actual noise at an SNR (signal-to-noise ratio) of 5 to 15 dB was randomly inserted for most of the training data. The acoustic models for English, Japanese, and Chinese were configured in the same manner, although the amount of training data differed. For Korean, all speech logs collected after the actual service was launched were also used for acoustic model training after going through a transcription process. Table II shows the amount of speech data used to train each language.
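As an illustration of this front-end configuration, the sketch below computes 39-dimensional MFCC features (13 static coefficients plus deltas and delta-deltas) over 20 ms frames with a 10 ms shift and mixes in noise at a random SNR between 5 and 15 dB. It is a minimal sketch assuming the librosa library, not the exact ETRI front end.

```python
# Minimal sketch of the front end described above (assumes librosa; not the exact ETRI pipeline).
import numpy as np
import librosa

def add_noise_at_random_snr(speech, noise, snr_db_range=(5, 15)):
    """Mix a noise segment into the speech at a random SNR in dB."""
    snr_db = np.random.uniform(*snr_db_range)
    noise = np.resize(noise, speech.shape)          # repeat/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def extract_39d_mfcc(signal, sr=16000):
    """13 MFCCs + deltas + delta-deltas over 20 ms windows with a 10 ms shift."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=int(0.020 * sr), hop_length=int(0.010 * sr))
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2])         # shape: (39, num_frames)

# Usage: speech, _ = librosa.load("utt.wav", sr=16000)
#        noisy = add_noise_at_random_snr(speech, noise)
#        feats = extract_39d_mfcc(noisy)
```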


TABLE II
SPEECH DATABASE FOR ACOUSTIC MODEL

                                            KOR    ENG    JPN    CHI
Training DB (hours)                          768    817    509    443
Actual service log (hours)                   225    N/A    N/A    N/A
Ratio of dialog-style DB (mobile channel)    71%    11%2   61%    49%
Total training DB (hours, noise added)     1,820  1,386  1,019    862

KOR, Korean; ENG, English; JPN, Japanese; CHI, Chinese

The language models were built focusing on travel/daily-life conversational texts. Three methods were used to construct this text database so as to satisfy the various demands of users. First, to secure naturalness and diversity, a massive number of people were recruited to prepare sentences and utter them, and the outcomes were collected into the database. This database is a collection of data from the following practices: foreigners speaking to each other through an interpreter in an assumed travel/daily-life situation; two users of the same language talking to each other while pretending to be foreigners; and individuals preparing sentences they would expect to use when facing foreigners while travelling or in daily life. A portion of this data was translated and reused in the LM database of another language. Second, in addition to BTEC (Basic Travel Expression Corpus) [18], data was collected from phrasebooks in various fields including travel, business, daily life, airlines, hotels, transportation, medical care, aesthetics, restaurants, and sports. Finally, data was also collected from blogs, general publications, and drama/movie subtitles. The quantity of each database is shown in Table III.

TABLE III
CORPUS FOR LANGUAGE MODEL

                                             KOR           ENG           JPN          CHI
Sentences from people (Thousands)
  (# of participants)                     517 (2,219)   312 (1,366)   278 (970)    387 (1,516)
Sentences from phrasebooks (Thousands)      1,830         1,565         1,432        1,640
General sentences (Thousands)               5,399         3,562         3,615        4,231
Actual service log (Thousands)             10,330         4,810          N/A          N/A
Total sentences/words (Thousands)       18,076/94,712  10,249/83,244  5,325/39,937  6,258/53,193

Genietalk distinguishes itself from other speech-to-speech translation systems in that it not only collected various expressions from a massive number of people, but also focused on travel/daily-life conversational texts. Using this distinctive database, the language models were built by interpolating back-off tri-gram models trained on each type of data in order to provide optimal performance.

2 For the English speech database, the quantity of PC-microphone-channel data collected through long-standing previous research is large, so the proportion of dialog-style data for mobile devices is relatively small. Because the collection of mobile-channel data continues, including the logs collected from the actual service, dialog-style data is bound to take up a larger share of the English speech database.
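To make the interpolation step concrete, the sketch below linearly combines per-source back-off tri-gram models into a single probability estimate; the weights and model names are hypothetical placeholders, not the configuration actually used for Genietalk.

```python
# Minimal sketch of linear interpolation of per-source tri-gram LMs
# (hypothetical weights and model objects; not the exact Genietalk setup).
from typing import Dict, Tuple

class BackoffTrigramLM:
    """Placeholder for a back-off tri-gram model trained on one corpus type."""
    def __init__(self, trigram_probs: Dict[Tuple[str, str, str], float]):
        self.trigram_probs = trigram_probs

    def prob(self, w1: str, w2: str, w3: str) -> float:
        # A real model would back off to bigrams/unigrams; here we use a small floor.
        return self.trigram_probs.get((w1, w2, w3), 1e-7)

def interpolated_prob(models, weights, w1, w2, w3):
    """P(w3 | w1, w2) = sum_i lambda_i * P_i(w3 | w1, w2), with sum(lambda_i) = 1."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return sum(lam * m.prob(w1, w2, w3) for lam, m in zip(weights, models))

# Example: combine models trained on collected dialogs, phrasebooks, and general text.
dialog_lm = BackoffTrigramLM({("how", "much", "is"): 0.12})
phrase_lm = BackoffTrigramLM({("how", "much", "is"): 0.09})
general_lm = BackoffTrigramLM({("how", "much", "is"): 0.02})
p = interpolated_prob([dialog_lm, phrase_lm, general_lm], [0.5, 0.3, 0.2],
                      "how", "much", "is")
print(f"interpolated P(is | how, much) = {p:.4f}")
```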

C. Machine Translation Engine

The machine translation engine of Genietalk uses the statistics-based (SMT) method for Korean-Japanese and Japanese-Korean translation, whereas a pattern-based method built on rule-based machine translation was chosen for Korean-English, English-Korean, Korean-Chinese, and Chinese-Korean. These two approaches are the most widely used; the statistics-based methodology has become popular in recent years because it not only shortens the development period but is also relatively easy to extend to new language pairs, and it ensures reliable performance when the two languages are similar. Taking these factors into consideration, Genietalk uses a statistics-based translation system for Korean-Japanese and Japanese-Korean machine translation, built around the large bilingual corpus collected in the travel/daily-life domain. A total of 2.5 million bilingual sentences were used for the translation model, 1.89 million of which are dialog-style sentences in the travel/daily-life domain, in order to raise the performance of dialog-style translation, the main target of this speech-to-speech translation system. For the language model, training was carried out with 15.79 million sentences in Korean and 13.33 million sentences in Japanese. Various post-processing rules were adopted to resolve errors on negative and interrogative sentences, which are prone to occur in SMT. When the two languages are dissimilar, the pattern-based machine translation methodology was employed, because it is hard for the statistics-based methodology to obtain a bilingual corpus large enough to extract reliable statistics, and the translation ratio tends to drop compared with rule-based machine translation due to errors caused by the differing attributes of the languages. This applies to Korean-English, English-Korean, Korean-Chinese, and Chinese-Korean, where the disparity between the languages is significant. The pattern-based machine translation method was specialized for the domain by adding a massive quantity of pattern knowledge to an existing rule-based machine translation engine. It is equipped with bilingual dictionaries containing 3 million entries for Korean-English, 2 million for English-Korean, 800,000 for Korean-Chinese, and 800,000 for Chinese-Korean. It was further specialized for the travel/daily-life domain by adopting sentence translation patterns, phrase translation patterns, and noun phrase translation patterns.
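To illustrate what a sentence translation pattern with noun-phrase slots might look like, the toy sketch below matches an input against a small set of Korean-English patterns; the patterns and dictionary entries are invented for illustration and are not taken from the Genietalk knowledge base.

```python
# Toy illustration of pattern-based translation with noun-phrase slots
# (patterns and dictionary entries are hypothetical, not Genietalk's actual knowledge).
import re

# Bilingual noun-phrase dictionary (Korean -> English), illustrative only.
NP_DICT = {"화장실": "the restroom", "버스 정류장": "the bus stop", "공항": "the airport"}

# Sentence translation patterns: a source regex with an NP slot and a target template.
SENTENCE_PATTERNS = [
    (re.compile(r"^(?P<np>.+?)(이|가) 어디에 있어요\?$"), "Where is {np}?"),
    (re.compile(r"^(?P<np>.+?)(을|를) 추천해 주세요\.?$"), "Please recommend {np}."),
]

def translate(sentence: str) -> str:
    for pattern, template in SENTENCE_PATTERNS:
        match = pattern.match(sentence)
        if match:
            np_src = match.group("np").strip()
            np_tgt = NP_DICT.get(np_src, np_src)   # fall back to source form if unknown
            return template.format(np=np_tgt)
    return "[no matching pattern]"

print(translate("화장실이 어디에 있어요?"))        # -> Where is the restroom?
print(translate("버스 정류장이 어디에 있어요?"))    # -> Where is the bus stop?
```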


D. User Interface Composition

User Interface of Genietalk is composed as shown in Fig.1.

Fig. 1. Screenshot of Genietalk (Speech recognition screen & Main screen)

Pressing the microphone button at the bottom of the screen (Fig. 1(a)) initiates speech recognition. During speech recognition, the application helps users utter at the proper volume and at the right time by displaying the acoustic waveform and an input-level gauge, as shown in the left screenshot (Fig. 1(b)); this is designed to help users easily get used to the speech recognition engine. Translation results from speech recognition or text input are presented in a history-based conversation style similar to SMS or a messenger, as shown in the right screenshot of Fig. 1, taking UX (User eXperience) into consideration. To help first-time users of Genietalk, information on additional features is provided through the speech bubbles of characters so that users can access the various features offered by Genietalk (Fig. 1(c)).

Another distinctive feature of Genietalk is that it provides alternative translation results to compensate for the imperfection of speech-to-speech translation and raise the effective translation ratio. When a number icon such as Fig. 1(d) appears in the display window, it means that the corresponding number of alternative translation results is available. Touching the number icon displays results retrieved from a database of more than 2 million collected example sentences, as shown in Fig. 2(a). These alternative results are not merely TM (Translation Memory) matches, but similar translated example sentences retrieved after extracting key search words from the input sentence and applying weights. For proper nouns, match accuracy was enhanced by conducting keyword search after classifying them into 'individual name,' 'business name,' and 'place name' according to their attributes. Additionally, to help users with an inadequate understanding of the target language, the application offers a feature to listen to the synthesized speech of the translated sentence by touching the icon in Fig. 1(e), as shown in Fig. 2(b). The pronunciation of the translated sentence is also displayed in the user's native script, giving the user an opportunity to pronounce the target-language sentence. In particular, when the target language is Chinese, Japanese, or Korean, written in Chinese characters, Kana, or Hangul, this helps users pronounce the sentence even if they cannot read the script.
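As a rough illustration of this keyword-weighted example retrieval, the sketch below scores stored example sentences by IDF-weighted keyword overlap with the input; the scoring scheme and example data are assumptions for illustration, not the actual Genietalk search implementation.

```python
# Rough sketch of keyword-weighted retrieval of similar example sentences
# (scoring scheme and data are illustrative assumptions, not Genietalk's actual search).
import math
from collections import Counter

EXAMPLES = [
    ("Where is the nearest subway station?", "가장 가까운 지하철역이 어디에 있나요?"),
    ("How much is the entrance fee?", "입장료는 얼마인가요?"),
    ("Please recommend a good restaurant nearby.", "근처에 좋은 식당을 추천해 주세요."),
]

def tokenize(text):
    return [t.lower().strip("?.!,") for t in text.split()]

# Inverse document frequency over the stored example sentences.
DOC_TOKENS = [set(tokenize(src)) for src, _ in EXAMPLES]
def idf(term):
    df = sum(1 for doc in DOC_TOKENS if term in doc)
    return math.log((len(EXAMPLES) + 1) / (df + 1)) + 1.0

def score(query_tokens, example_tokens):
    """Sum IDF weights of query keywords that also appear in the example."""
    counts = Counter(example_tokens)
    return sum(idf(t) for t in query_tokens if counts[t] > 0)

def search_alternatives(query, top_k=2):
    q = tokenize(query)
    ranked = sorted(EXAMPLES,
                    key=lambda ex: score(q, tokenize(ex[0])), reverse=True)
    return ranked[:top_k]

print(search_alternatives("Where is the subway station"))
```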

Fig. 2. Screenshot of Genietalk: (a) other translation screen, (b) listen to sound & view screen

Genietalk also offers a bookmark feature to save frequently used sentences, an edit feature to modify sentences obtained from speech recognition, and a copy feature to use the result of speech-to-speech translation in other applications. Through the 'Send Errors' icon (Fig. 1(f)), directly accessible from the main translation screen, users can transmit speech-to-speech translation errors they find directly to the server, which helps improve the performance of the system.

IV. PERFORMANCE EVALUATION OF SPEECH-TO-SPEECH TRANSLATION ENGINE

A. Performance of Speech Recognition Engine

Considering the environments in which speech-to-speech translation devices are actually used, the performance of Genietalk's speech recognition engine was evaluated under four different conditions: a quiet office, an avenue with heavy traffic, a street dense with stores and pedestrians, and a restaurant with relatively loud noise. Under these four environments, a total of 8,000 utterances (40 speakers each producing 50 utterances in each environment) was used as the test set; the utterances were composed of expressions that various users might actually produce in travel/daily-life speech-to-speech translation situations. The results of the performance test are shown in Table IV.


TABLE IV
PERFORMANCE OF SPEECH RECOGNITION ENGINE

            Word Accuracy (%)
            Office  Avenue  Street  Restaurant  Average
English      95.07   94.47   92.55     93.36     93.86
Korean       93.83   92.99   91.34     90.70     92.22
Japanese     90.40   90.18   89.37     85.07     88.76
Chinese      88.88   86.06   77.97     82.53     83.86

Some deviation existed across languages and environments; however, high performance was achieved by the Korean speech recognition engine, which reflects logs from the actual service environment, and by the English engine, which has plenty of data, followed by the Japanese and Chinese engines. For Chinese, the performance deviation across environments was larger than for the other engines, apparently because the engine had relatively less development time and has not been fully optimized. Overall, the engines satisfied users' demands, with performance reaching 90% in the office environment. In particular, performance in noisy environments did not suffer significantly compared with the office environment, which means the engines can be fully utilized in actual service situations.
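For reference, word accuracy figures like those in Table IV are typically computed from the Levenshtein alignment between the recognized and reference word sequences; the sketch below is a minimal, generic implementation of that metric, not the evaluation tooling used in this study.

```python
# Generic word-accuracy computation via Levenshtein alignment
# (a standard metric sketch, not the evaluation tool used in this study).
def word_errors(reference: str, hypothesis: str) -> int:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)]

def word_accuracy(pairs) -> float:
    """100 * (1 - total_errors / total_reference_words) over (ref, hyp) pairs."""
    total_err = sum(word_errors(r, h) for r, h in pairs)
    total_ref = sum(len(r.split()) for r, _ in pairs)
    return 100.0 * (1.0 - total_err / total_ref)

print(word_accuracy([("how much is the entrance fee", "how much is entrance fee")]))
```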

B. Performance of Machine Translation Engine

In general, the performance of a machine translation engine is evaluated either with automated metrics such as NIST, BLEU (Bilingual Evaluation Understudy), METEOR (Metric for Evaluation of Translation with Explicit ORdering), and TER (Translation Error Rate), or by professional translators. To accommodate the special circumstances of speech-to-speech translation, this study set the evaluation standard as whether the core intention of the user was properly conveyed to the counterpart. For the evaluation, 300 dialog-style sentences in the travel/daily-life domain were randomly sampled for each language pair and machine-translated; 3 professional translators then evaluated the translation results of each language pair. Based on the standard mentioned above, a translation result was deemed successful when the intention the user absolutely needed to convey was properly translated, even if some awkward expressions remained.

TABLE V
PERFORMANCE OF MACHINE TRANSLATION ENGINE

                        ENG-KOR  KOR-ENG  CHI-KOR  KOR-CHI  JPN-KOR  KOR-JPN
Translation Ratio (%)    87.63    88.21    85.38    77.94     89.0    86.69

According to the evaluation results, the performance of each language pair reached around 78-89%, which indicates that users will not experience much difficulty conveying their intention with the actual speech-to-speech translation device.

Because a multi-modal repair tool is provided for speech recognition errors, the performance of the entire speech-to-speech translation system can be regarded as identical to the performance of machine translation. Moreover, with Genietalk, even when a translation error occurs, a search for 'other translation results' finds a similar previously translated sentence 20.1% of the time on average. This means users are given a chance to mitigate a translation failure in 20.1% of cases even after translation has failed. Therefore, when the 'other translation results' feature is used, the performance perceived by users will be even higher than the measured figures.

V. STATUS OF GENIETALK SERVICES

Genietalk started with English-Korean and Korean-English translation services in October 2012, added Japanese-Korean and Korean-Japanese services in May 2013, and expanded to Chinese-Korean and Korean-Chinese services in December 2013. It has currently recorded 1.8 million downloads, 1.4 million for Android OS and 400,000 for iOS. It serves users from 189 countries around the world, 18 of which have more than 1,000 users. The active ratio, that is, the share of installations in which the application has not been removed, remains at 50% even though more than a year has passed since the service began, which indicates that users find the application genuinely useful.

Currently, the Genietalk service is operated on 12 servers located in an IDC (Internet Data Center). When a user requests speech-to-speech translation, the user's speech passes through the load balancer and reaches the speech recognition engine for the corresponding language; the recognition result then goes through machine translation and is delivered to the speech synthesis engine. After being converted into synthetic speech, the result is finally delivered to the user together with the translated text. When a translation request is made through text input, the text is transmitted directly to the machine translation engine, goes through machine translation and speech synthesis, and the translation result and synthetic speech are delivered to the user. The current speech recognition engines can each accommodate 160 simultaneous users for English, Japanese, and Korean, while the Chinese speech recognition engine can accommodate 20 simultaneous users3. The machine translation engine is structured to serve 16 simultaneous users per language pair, because its processing speed is faster than that of the speech recognition engine. For speech synthesis, Korean-English translation attracts the most demand; thus, the English speech synthesis engine is provisioned with 10 channels, whereas the engines for the other languages have 5 channels each. Fig. 3 shows an overview of the Genietalk service.

3 The Chinese speech recognition engine offers its result in N-best form after reflecting 6 different LM images using 6 CPU cores on each server in order to improve its performance.
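The request flow described above can be sketched as a simple server-side pipeline; the class and function names below are hypothetical placeholders used only to illustrate the speech recognition, machine translation, and speech synthesis chain behind the load balancer, not Genietalk's actual API.

```python
# Hypothetical sketch of the server-side pipeline (placeholder engines, not Genietalk's API).
from dataclasses import dataclass

class StubASR:
    def recognize(self, audio: bytes) -> str:
        return "where is the subway station"          # placeholder recognition result

class StubMT:
    def translate(self, text: str) -> str:
        return "지하철역이 어디에 있나요?"              # placeholder translation result

class StubTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")                   # placeholder synthesized audio

@dataclass
class TranslationResponse:
    recognized_text: str
    translated_text: str
    synthesized_audio: bytes

def handle_request(payload, asr: StubASR, mt: StubMT, tts: StubTTS) -> TranslationResponse:
    """Route one request through ASR (speech input only), then MT, then TTS."""
    recognized = asr.recognize(payload) if isinstance(payload, bytes) else payload
    translated = mt.translate(recognized)
    return TranslationResponse(recognized, translated, tts.synthesize(translated))

resp = handle_request(b"\x00\x01", StubASR(), StubMT(), StubTTS())
print(resp.translated_text)
```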


Fig. 3. Service Configuration of Genietalk

Since the beginning of the Genietalk service, a monthly average of 2.9 million logs has been accumulated. Among these, speech-to-speech translation logs from speech recognition account for 35%, while the remaining 65% are machine translation logs from text input. The next chapter explains how the performance of speech-to-speech translation is being improved using the service logs collected in this way.

VI. PERFORMANCE IMPROVEMENT OF SPEECH-TO-SPEECH TRANSLATION LOG-BASED SPEECH RECOGNITION

A. Design of Speech-to-speech Translation Log

Speech-to-speech translation logs can be categorized into 3 types according to their input: speech files input by users for speech recognition, text sentences input through the keyboard, and error reports submitted through the 'Send Errors' feature explained previously. When speech-to-speech translation logs are saved, the speech recognition result and the machine translation result are stored together so that the entire translation flow can be referenced. All logs are also saved with the language locale of the device, the device model, and a unique speaker ID. The language locale is used as basic information on whether the input was made by a native or a foreign speaker of the input language. The device model information is used to improve speech recognition performance by accounting for the channel characteristics of each device, and the speaker information is used to improve speech recognition performance through speaker adaptation. In the future, a tailored translation function will be offered through automatic management of personal history using this information.
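As an illustration of such a log record, the sketch below defines a minimal data structure holding the input, intermediate results, and the metadata fields mentioned above; the field names and sample values are hypothetical, not the actual Genietalk log schema.

```python
# Illustrative log record for one translation request
# (field names and values are hypothetical; this is not the actual Genietalk log schema).
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional
import json

class InputType(Enum):
    SPEECH = "speech"
    TEXT = "text"
    ERROR_REPORT = "error_report"

@dataclass
class TranslationLog:
    input_type: InputType
    speech_file: Optional[str]      # path to saved audio, None for text input
    input_text: Optional[str]       # keyboard input or error-report text
    asr_result: Optional[str]       # saved together with the MT result
    mt_result: Optional[str]
    device_locale: str              # hints native vs. foreign speaker of the input language
    device_model: str               # used for channel-dependent acoustic modeling
    speaker_id: str                 # used for speaker adaptation / personal history

log = TranslationLog(InputType.SPEECH, "logs/20140601/utt_000123.wav", None,
                     "입장료는 얼마인가요?", "How much is the entrance fee?",
                     "ko_KR", "example-phone-model", "user-7f3a")
print(json.dumps({**asdict(log), "input_type": log.input_type.value},
                 ensure_ascii=False, indent=2))
```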

B. Performance Improvement of Speech Recognition Engine through Text Logs

Improvement of speech recognition performance using text logs was first carried out for the Korean speech recognition engine. By adding the massive text data collected from the speech-to-speech translation service to the language model training data, the system was made to reflect the usage patterns of various users under the actual service environment.

The improvement work using Korean logs consisted of two main parts. First, classification and refinement were performed manually on 700,000 sentences from the Korean text logs that had been entered through the keyboard. This manual refinement was performed to verify the overall characteristics of the log data and to define processing criteria for the future. The sentences were classified into a refined group, deemed to make a positive contribution when reflected in the LM, and a meaningless group, which might cause side effects; refinement was then carried out as follows.

Refined group: sentences in a 'subject-predicate' format were revised into complete sentences.
- For sentences in the refined group, spelling and word spacing were corrected (however, generally accepted expressions were not revised even if they did not follow the standard spelling system, nor were proper nouns).
- Spontaneous exclamations and filled pauses were removed. Among symbols appearing in a sentence, emoticons and internet-only symbols were removed, while generally accepted symbols were kept.

Meaningless group: incomplete sentences and meaningless expressions.
- Sentences lacking a basic sentence format, without final endings, a subject, or a predicate.
- Sentences with a proper format but from which no generally accepted meaning can be drawn.
- Korean sentences mixed with foreign words, numbers, and special characters: sentences containing foreign words, numbers, and special characters not generally used together with Korean.
- Abbreviations and internet terminology: if a sentence can be considered to belong to the refined group once abbreviations or chatting language are removed, it is reclassified into the refined group after revision.
- Lascivious expressions and abusive language.

After classification, the refined group accounted for 72.1% and the meaningless group for 22.48%, while the remainder was a mere 5.42%.

Second, automatic refinement was performed on 10.41 million sentences from the Korean logs collected over 6 months, using an automatic correction tool for Korean spelling and word spacing. During this process, sentences with emoticons, special characters, foreign words, numbers, symbols, and meaningless repetitions of Korean were excluded from refinement; the sentences excluded this way represented approximately 6% of the total. The 10.29 million sentences obtained through these two methods were added to the training corpus, and the language model was trained again.
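A minimal sketch of this kind of automatic filtering is shown below; the regular expressions and rules are assumptions chosen for illustration, not the actual criteria applied to the Korean logs.

```python
# Minimal sketch of automatic text-log filtering before LM training
# (regular expressions and rules are illustrative assumptions, not the actual criteria).
import re

EMOTICON_RE = re.compile(r"[\^;]{2,}|ㅋㅋ+|ㅎㅎ+|ㅠㅠ+|ㅜㅜ+")      # common Korean chat emoticons
NON_KOREAN_RE = re.compile(r"[A-Za-z0-9]")                         # foreign letters / digits
ALLOWED_SYMBOLS_RE = re.compile(r"[^\uAC00-\uD7A3\s\.\,\?\!]")      # beyond Hangul + basic punctuation

def keep_for_lm(sentence: str) -> bool:
    """Return True if the sentence should be kept for language model training."""
    s = sentence.strip()
    if not s:
        return False
    if EMOTICON_RE.search(s):          # emoticons / chat laughter
        return False
    if NON_KOREAN_RE.search(s):        # mixed foreign letters or numbers
        return False
    if ALLOWED_SYMBOLS_RE.search(s):   # internet-only symbols, special characters
        return False
    return True

logs = ["화장실이 어디에 있나요?", "ㅋㅋㅋ 대박", "check-in 몇 시에 해요?", "방을 바꾸고 싶어요."]
kept = [s for s in logs if keep_for_lm(s)]
print(kept)   # -> ['화장실이 어디에 있나요?', '방을 바꾸고 싶어요.']
```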


Based on a re-evaluation of the Korean speech recognition engine with the retrained LM, the average recognition rate under the 4 environments improved from 89.87% to 90.40%. This corresponds to an ERR (error rate reduction) of 5.23%, which indicates that reflecting the text logs made a notable contribution.
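The ERR figures quoted here and in Table VI are consistent with the usual relative-error-reduction definition, i.e., the relative decrease of the word error rate (100 minus accuracy); the short sketch below reproduces the 5.23% and 6.38% values from the reported accuracies, stated as an assumption about how ERR was computed.

```python
# Relative error rate reduction (ERR) as used in Sections VI.B and VI.C
# (standard definition; assumed to match the paper's computation).
def err(acc_before: float, acc_after: float) -> float:
    """ERR (%) = relative reduction of the word error rate (100 - accuracy)."""
    wer_before = 100.0 - acc_before
    wer_after = 100.0 - acc_after
    return 100.0 * (wer_before - wer_after) / wer_before

print(round(err(89.87, 90.40), 2))   # 5.23  (text-log LM update, Section VI.B)
print(round(err(91.69, 92.22), 2))   # 6.38  (acoustic-log update average, Table VI)
```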

C. Performance Improvement of Speech Recognition Engine through Acoustic Logs

Performance improvement using acoustic log files was likewise first carried out for the Korean speech recognition engine. The goal was to make the engine robust to various environments and speakers by reflecting acoustic data collected under the service environment in the acoustic model training. To this end, 400,000 utterances were extracted from the Korean acoustic logs after eliminating utterances that were too short, while keeping the distribution of users and durations statistically balanced during selection. Next, transcription was performed on the 400,000 utterances. Because the logs were collected through the actual service, the following cases were excluded during this process:

- When voicing is clipped;
- When the waveform is cut off at a specific level even though the speech did not exceed the peak value;
- When speech from 2 or more people overlaps;
- Laughter, singing, or the playful voice of children that is not speech;
- When there is no speech at all;
- When the speech is made while clearly separating each syllable throughout the utterance;
- Speech in a foreign language other than Korean;
- When the Korean speech is deemed to have been made by a foreigner;
- When the volume of the speech is too low (the speech can hardly be verified from the waveform);
- When the silent sections before and after the EPD (End Point Detection) region are not secured, or the speech is cut off in the middle;
- When unclear speech is detected in the utterance; or
- When excessively hesitant voicing is detected.

After filtering according to the principles described above, the utterances excluded from transcription amounted to 19.86%, and transcription was performed on the remaining 320,000 utterances. From these, 100,000 utterances with clearly noticeable background noise were removed, and the remaining 220,000 utterances were added to the existing speech database to retrain the acoustic model. Speech recognition experiments conducted after these procedures showed that performance improved consistently across all environments, as shown in Table VI.

TABLE VI
EVALUATION RESULTS ON SPEECH RECOGNITION PERFORMANCE WITH ACOUSTIC LOG REFLECTED

                                       Word Accuracy (%)
                                       Office  Avenue  Street  Restaurant  Average
Prior to the addition of acoustic log   93.39   92.66   90.89     89.80     91.69
After the addition of acoustic log      93.83   92.99   91.34     90.70     92.22
ERR (%)                                  6.66    4.50    4.94      8.82      6.38

According to the evaluation results, the average speech recognition accuracy improved with an ERR of 6.38%. In particular, under the restaurant environment, where performance was previously the weakest, the ERR reached 8.82%, a considerable improvement driven by the reflection of the actual-service log data.

VII. CONCLUSION

In this study, a speech-to-speech translation engine capable of providing an actual service was developed by building training data as close as possible to real speech-to-speech translation situations, mobilizing a massive number of people based on the results of a survey on users' demands. This study also proposed progressive measures to enhance user satisfaction through additional features such as the search for 'other translation results.' Moreover, after launching the actual service, it was possible to keep improving the performance of the speech-to-speech translation engine by continuously reflecting the text and acoustic logs collected from users' smart mobile devices in the system.

Moving forward, if the speech-to-speech translation logs collected in this way are also utilized to improve machine translation, remarkable performance gains can be anticipated not only in speech recognition but throughout the entire speech-to-speech translation pipeline.

REFERENCES
[1] G. Dahl, D. Yu, L. Deng and A. Acero, "Context dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Speech and Aud. Proc., vol. 20, no. 1, pp. 30-42, Jan. 2012.

[2] S. Lee, B. Kang, H. Jung, Y. Lee, “Intra-and inter-frame features for automatic speech recognition,” ETRI Journal, vol. 36, no. 3, pp.514-517. Jun. 2014.

[3] Y. Qian, J. Liu and M. Johnson, “Efficient embedded speech recognition for very large vocabulary mandarin car-navigation systems,” IEEE Trans. Consumer Electron., vol. 55, no. 3, pp.1496-1500. Aug. 2008.

[4] Y. Oh, J. Yoon, H. Kim, M. Kim and S. Kim, “A voice driven scene-mode recommendation service for portable digital imaging devices,” IEEE Trans. Consumer Electron., vol. 55, no. 4, pp. 1739-1747, Nov. 2009.

[5] F. Martin and M. Salichs, “Integration of a voice recognition system in a social robot,” Cybernetics and Systems, vol. 42, no. 4. pp. 215-245. Jun. 2011.

[6] J. Park, G. Jang, J. Kim and S. Kim, “Acoustic interference cancellation for a voice-driven interface in smart TVs,” IEEE Trans. Consumer Electron., vol. 59, no.1. pp. 244-249. Feb. 2013.


[7] J. Hong, S. Jeong and M. Hahn, “Wiener filter-based echo suppression and beamforming for intelligent TV interface,” IEEE Trans. Consumer Electron., vol. 59, no. 4, pp. 825-832. Nov. 2013.

[8] J. Park, G. Jang and J. Kim, “Multistage utterance verification for keyword recognition-based online spoken content retrieval,” IEEE Trans. Consumer Electron., vol.58, no. 3. pp. 1000-1005. Aug, 2012.

[9] D. Yang, Y. Pan and S. Furui, “Vocabulary expansion through automatic abbreviation generation for Chinese voice search,” Computer Speech and Language, vol. 26, no.5, pp. 321-335. Oct. 2012.

[10] L. Levin, D. Gates, A. Lavie and A. Waibel, “An interlingua based on domain actions for machine translation of task-oriented dialogues,” in Proc. International Conference on Spoken Language Processing, Sydney, Australia. pp. 1155-1158. Nov. 1998.

[11] A. Lavie, F. Pianesi and L. Levin, “The Nespole! System for multilingual speech communication over the internet,” IEEE Trans. Speech and Aud. Proc., vol. 14, No. 5, pp. 1664-1673. Sep. 2006.

[12] Y. Al-Onaizan and L. Mangu, “Arabic ASR and MT integration for GALE,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, USA, pp.1285-1288. Apr. 2007.

[13] M. Frandsen, S. Riehemann and K. Precoda, “IraqComm and FlexTrans: A speech translation system and flexible framework,” in Proc. International Speech Communication Association, Antwerp, Belgium, pp.2324-2327. Aug. 2007.

[14] J. Shin, P. Georgiou and S. Narayanan, “Enabling effective design of multimodal interfaces for speech-to-speech translation system: An empirical study of longitudinal user behaviors over time and user strategies for coping with errors,” Computer Speech and Language, vol. 27, no. 2, pp. 554-571. Feb. 2013.

[15] S. Raybaud, D. Langlois and K. Smaili, "Broadcast news speech-to-text translation experiments," in Proc. The Thirteenth Machine Translation Summit, Xiamen, China, pp. 378-381. Sep. 2011.

[16] M. Heck, S. Stuker and A. Waibel, “A hybrid phonotactic language identification system with an SVM back-end for simultaneous lecture translation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, pp.4857-4860. Mar. 2012.

[17] C. Callison-Burch, P. Koehn, C. Monz and O. Zaidan, “Findings of the 2011 workshop on statistical machine translation,” in Proc. the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, pp.22-64. Jul. 2011.

[18] K. Genichiro, S. Eiichiro, T. Toshiyuki and Y. Seiichi, “Creating corpora for speech-to-speech translation,” in Proc. EUROSPEECH, Geneva, Switzerland, pp.381-384. Sep. 2003.

BIOGRAPHIES

Seung Yun received the M.A. degree in Korean Informatics from Yonsei University, Seoul, Korea in 2001. He is currently a Ph.D. candidate in Computer Software at University of Science and Technology, Daejeon, Korea. Since 2001, he has been working for ETRI. His current research interests include speech-to-speech translation, speech database, speech recognition, and human-machine interface.

Young-Jik Lee received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Seoul, Korea, and the Ph.D. degree in electrical engineering from the Polytechnic University, Brooklyn, New York, USA. From 1981 to 1985 he was with Samsung Electronics Company, where he developed video display terminals. From 1985 to 1988 he worked on sensor array signal processing. Since 1989, he has been with the Electronics and Telecommunications Research Institute, pursuing research in speech recognition, speech synthesis, speech translation, machine translation, information retrieval, multimodal interfaces, digital contents, computer graphics, computer vision, pattern recognition, neural networks, and digital signal processing.

Sang-Hun Kim received the B.S. degree in Electrical Engineering from Yonsei University, Seoul, Korea in 1990, the M.S. degree in Electrical and Electronic Engineering from KAIST, Daejeon, Korea in 1992, and the Ph.D. degree from the Department of Electrical, Electronic and Information Communication Engineering, University of Tokyo, Japan in 2003. Since 1992, he has been working for ETRI. His interests are speech translation, spoken language understanding, and multi-modal information processing.