


Automated English-Korean Translation for Enhanced Coalition Communications

Clifford J. Weinstein, Young-Suk Lee, Stephanie Seneff, Dinesh R. Tummala, Beth Carlson, John T. Lynch, Jung-Taik Hwang, and Linda C. Kukolich

■ This article describes our progress on automated, two-way English-Korean translation of text and speech for enhanced military coalition communications. Our goal is to improve multilingual communications by producing accurate translations across a number of languages. Therefore, we have chosen an interlingua-based approach to machine translation that readily extends to multiple languages. In this approach, a natural-language-understanding system transforms the input into an intermediate meaning representation called a semantic frame, which serves as the basis for generating output in multiple languages. To produce useful, accurate, and effective translation systems in the short term, we have focused on limited military-task domains, and have configured our system as a translator's aid so that the human translator can confirm or edit the machine translation. We have obtained promising results in translation of telegraphic military messages in a naval domain, and have successfully extended the system to additional military domains. The system has been demonstrated in a coalition exercise and at Combined Forces Command in the Republic of Korea. From these demonstrations we learned that the system must be robust enough to handle new inputs, which is why we have developed a multistage robust translation strategy, including a part-of-speech tagging technique to handle new words and a fragmentation strategy for handling complex sentences. Our current work emphasizes ongoing development of these robust translation techniques and extension of the translation system to application domains of interest to users in the military coalition environment in the Republic of Korea.

The United States military operates worldwide in a variety of international environments that require language translation. Translators who can interpret military terminology are a scarce commodity in countries such as the Republic of Korea (R.O.K.), and U.S. military leaders there support the development of bilingual machine translation. Although U.S. and R.O.K. military personnel have been working together for more than forty years, the language barrier still significantly reduces the speed and effectiveness of coalition command and control. During hostilities, any time saved by computers that can quickly and accurately translate command-and-control information could provide an advantage over the enemy and reduce the possibility of miscommunication with allies.

Machine translation has been a challenging area of research for four decades, as described by W.J. Hutchins and H.L. Somers [1], and was one of the original problems addressed with the development of


computers. Although general, effective solutions remain elusive, we have made substantial advances in developing an automated machine-translation system to aid human translators in limited domains, specifically for military translation tasks in the Combined Forces Command (CFC) in Korea. Our strategy to enhance the probability of success in this effort has been threefold: first, to build upon the tremendous advances in the research and development community over the past decade in natural-language understanding and generation, machine translation, and speech recognition; second, to carefully choose limited but operationally important translation applications to make the task manageable; and third, to facilitate user interaction with the translation system, so that the primary goal is not a fully automated translator but an aid that helps the human translator be more effective.

Machine-Translation Background

The pyramid diagram of Figure 1 shows source-language analysis along the left side and target-language generation along the right side, and three machine-translation strategies: interlingua, transfer, and direct. Most machine-translation strategies cut off the source-language analysis at some point along the way, and perform a bilingual transfer. The interlingua approach is different. It eliminates a bilingual transfer phase by producing a language-independent meaning representation called the interlingua that is directly usable for target-language generation. In addition, it greatly facilitates the development of a multilingual system, because the same interlingua can be used to generate multiple target languages. Although achieving a language-independent interlingual representation is a difficult challenge for general domains, the interlingua approach offers significant advantages in limited domains.

Direct translation systems do little source-language analysis, proceeding immediately to a transfer. They produce a word-for-word translation, much like an automated bilingual-dictionary lookup. The resulting translation generally does not have proper word order, syntax, or meaning in the target language, although it may be of some help to a user.

Transfer systems perform some intermediate form of analysis, then proceed to a bilingual transfer. The SYSTRAN translation system, which has been used in our project for Korean-to-English translation, falls into this category. Transfer systems vary greatly in the quality of translation output and, for multilingual applications, require substantial additional effort in analysis and generation for each language pair. The advantage of a state-of-the-art transfer system like SYSTRAN is that it produces translations for a wide range of input texts and does not require a limited domain. When compared to an interlingual approach, however, the transfer system has a disadvantage: the translations produced, although better than word-for-word direct translations, often do not capture the correct syntax or meaning of the input text.
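The practical appeal of the interlingua design is combinatorial: a transfer architecture needs a separate rule set for every ordered language pair, while an interlingua needs only one understanding and one generation module per language. The sketch below, in Python with illustrative names rather than anything from CCLINC or SYSTRAN, makes the difference concrete.

```python
# Minimal sketch of why an interlingua scales better than pairwise transfer.
# The function and language names are illustrative, not CCLINC's actual API.

LANGUAGES = ["english", "korean", "japanese"]

def modules_for_transfer(langs):
    """Pairwise transfer: one bilingual rule set per ordered language pair."""
    return [(src, tgt) for src in langs for tgt in langs if src != tgt]

def modules_for_interlingua(langs):
    """Interlingua: one understanding and one generation module per language."""
    return [(lang, kind) for lang in langs for kind in ("understanding", "generation")]

print(len(modules_for_transfer(LANGUAGES)))                  # 6 for 3 languages
print(len(modules_for_interlingua(LANGUAGES)))               # 6 for 3 languages
print(len(modules_for_transfer(LANGUAGES + ["french"])))     # 12: grows quadratically
print(len(modules_for_interlingua(LANGUAGES + ["french"])))  # 8: grows linearly
```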

CCLINC Translation-System Structure

The architecture for our translation system, presented in Figure 2, consists of a modular, multilingual structure including language understanding and language generation in English and Korean. We refer to this translation system as the common coalition language system at Lincoln Laboratory, or CCLINC. The system input can be text or speech. The understanding

FIGURE 1. Pyramid illustrating the relationships among interlingua, transfer, and direct approaches to machine translation. The interlingua approach differs from the other two by producing a language-independent meaning representation called the interlingua that is directly usable for target-language generation.

[Figure 1: a pyramid with source-language analysis ascending the left side from the source text, target-language generation descending the right side to the target text, the interlingua at the apex, and transfer and direct paths crossing at successively lower levels.]


module of CCLINC converts each input into an interlingual representation. In CCLINC, this interlingual representation is called a semantic frame. In the case of speech input, the understanding module in Figure 2 performs speech recognition and understanding of the recognition output. Our current speech-recognition system and its performance on speech translation are described in a later section. Although our original work on this project involved speech-to-speech translation [2], we have recently emphasized text translation [3] in response to the priorities of U.S. military users in Korea. An ongoing effort by Korean researchers in English-to-Korean text translation is described in Reference 4.

The CCLINC translation system provides feedback to the originator on its understanding of each input sentence by forming a paraphrase in the originator's language. For example, when an English speaker enters a sentence into the system, the sentence is first transformed into a semantic frame by the English-understanding module. Then the English-generation module produces a paraphrase of what the system understood, which can be verified by the originator before the Korean-generation module provides the translation to the receiver. Figure 2 illustrates how the interlingual approach expedites the extension of the system to multiple languages. For example, adding Japanese to the English-Korean system requires Japanese-understanding and Japanese-generation modules, but the English and Korean modules do not change. Successful system operation depends on the ability to define a sufficiently constrained yet useful vocabulary and grammar, as well as the application of powerful understanding and generation technology, so that a high percentage of input sentences can be understood. Figure 2 also shows a two-way connection between the translation system and a Command, Control, Communications, Computing, and Intelligence (C4I) system. Because the translation system involves the understanding of each input, C4I data and displays based on this understanding can be periodically updated, and users can request information through the C4I system while communicating with other people via translation.
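As a rough illustration of this confirm-then-translate loop, the following Python sketch shows the control flow just described; the callables understand, generate, and confirm are placeholders standing in for the CCLINC modules, not their real interfaces.

```python
# Sketch of the paraphrase-and-confirm loop; all callables are placeholders.

def translate_with_confirmation(sentence, understand, generate, confirm):
    """Understand the input, paraphrase it back to the originator, and only
    generate target-language output once the originator approves."""
    frame = understand(sentence)             # source text -> semantic frame
    paraphrase = generate(frame, "english")  # frame -> paraphrase in source language
    if confirm(paraphrase):                  # originator verifies the understanding
        return generate(frame, "korean")     # frame -> target-language output
    return None                              # originator edits or re-enters the input
```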

FIGURE 2. Architecture of the common coalition language system at Lincoln Laboratory (CCLINC). The understanding modules convert Korean or English input into a language-independent interlingual meaning representation, known in this case as a semantic frame. The use of semantic frames allows the CCLINC system to extend to multiple languages. The meaning representation in the semantic frame could also be used to provide two-way communication between a user and a Command, Control, Communications, Computing, and Intelligence (C4I) system.

[Figure 2 block diagram: English text or speech passes through English understanding to the semantic frame, and Korean generation produces Korean text or speech; the reverse path runs through Korean understanding and English generation. Modules for other languages attach to the same semantic frame, and C4I information and displays connect to it in both directions.]


FIGURE 4. Parse-tree example based on an English input sentence. The parse tree represents the structure of the input sentence in terms of both general syntactic categories, such as the subject or participial phrase, and domain-specific semantic categories of the material being translated (highlighted in red), such as the ship name.

This article deals mainly with our work in English-to-Korean text translation. Although the CCLINC translation system is general and extendable, most of our work to date has focused on English-to-Korean text translation because it is the application of most interest to U.S. forces in Korea. Our work has also included two-way English-Korean translation of both speech and text. We have started developing an interlingua-based Korean-to-English translation subsystem in CCLINC. (Our previous Korean-to-English system was developed by SYSTRAN, Inc., under a subcontract.) Our initial work on this project included translation from English speech and text to French text [2].

FIGURE 3. Process flow for English-to-Korean text translation in CCLINC. The TINA language-understanding system utilizes the English grammar and analysis lexicon to analyze the English text input and produce a semantic frame representing the meaning of the input sentence. The GENESIS language-generation system utilizes the Korean grammar and generation lexicon to produce a Korean output sentence based on the semantic frame.

[Figure 3 flow diagram: English text input → language understanding with TINA (using the English grammar and analysis lexicon) → semantic frame → language generation with GENESIS (using the Korean grammar and generation lexicon) → Korean text output.]

[Figure 4 parse tree for the input sentence: 0819 z uss sterett taken under fire by a kirov with ssn-12s. The tree descends from sentence through full_parse and statement to a pre-adjunct (time_expression, gmt_time, numeric_time), a subject (a_ship, with ship_mod uss and ship_name sterett), and a participial_phrase (passive vp_taken_under_fire, with v_by_agent a_ship kirov and v_with_instrument a_missile ssn-12).]


System Process Flow and Example for English-to-Korean Text Translation

Figure 3 illustrates the process flow of English-to-Korean text translation in CCLINC. The core of CCLINC consists of two modules: the language-understanding system, TINA [5], and the language-generation system, GENESIS [6]. Both modules were originally developed by the Spoken Language Systems group at the MIT Laboratory for Computer Science, under Defense Advanced Research Projects Agency (DARPA) sponsorship, for applications in human-computer interaction with a variety of languages [7, 8]. Our project was the first to adapt TINA and GENESIS for language translation and to apply these systems to the Korean language.

The understanding and generation modules operate from a set of files that specify the source-language and target-language grammars. The modules are mediated by the semantic frame, which serves as the basis for generating output in multiple languages and can be integrated into the command-and-control information system for database query.

The first task domain that we used to develop our CCLINC translation system consists of a set of messages about simulated naval engagements in the Pacific. To illustrate system operation, we show the roles of the various system modules in translating the following sentence from that domain:

0819 z uss sterett taken under fire by a kirov with ssn-12s.

Given this input sentence, the language-understanding system produces a parse tree, as illustrated in Figure 4. The parse tree, which represents the input sentence structure, is produced automatically by CCLINC. The parse tree identifies grammatical, or syntactic, information such as the pre-adjunct 0819 z, the subject uss sterett, and the predicate taken under fire by a kirov with ssn-12s. The parse tree also provides domain-specific information: in this example, z stands for Greenwich Mean Time; sterett and kirov are ship names; and ssn-12 is a missile name. Categories such as ship name and missile name are not standard English grammatical categories, but are domain-specific semantic categories that represent the meaning of the words in this domain of naval messages. These domain-specific categories enable CCLINC to reduce the ambiguity of the input sentence.

The language-understanding system then derives a semantic frame from the parse tree, as shown in Figure 5. As a language-neutral meaning representation of the input sentence, the semantic frame captures the core meaning of the input sentence through three major categories: topic, predicate, and clause. The main action of the sentence, taken under fire, is represented by the predicate category, and the entities involved in the action (sterett, kirov, ssn-12) are represented by the topic category. The semantic frame also preserves the information that the sentence is a statement rather than a question or command. However, the semantic frame purposely does not retain structural information that tells us how to generate a

FIGURE 5. Semantic frame, paraphrase, and translation for the example sentence of Figure 4. The semantic frame represents the meaning of the input in terms of fundamental language-neutral categories such as topic and predicate, and is used as the basis for generation of both the English paraphrase and the Korean output sentence. Entries in red in the semantic frame are replaced by the corresponding vocabulary items in the Korean-generation lexicon.

[Figure 5 semantic frame:

:statement
   :time_expression  :topic "z"  :pred "819"
   :topic  :name "sterett"  :pred "uss"
   :pred taken_under_fire
      :pred v_by  :topic  :quantifier indef  :name "kirov"
      :pred v_with_instrument  :topic  :name "ssn-12"  :number "pl"

Paraphrase: 819 Z USS Sterett taken under fire by a kirov with SSN-12s.

Translation: (Korean output in Hangul; the characters are garbled in this extraction.)]


sentence in a particular language with the meaning represented in the frame. This language-generation information needs to be put in by a generation system specific to the target language, as discussed below.

In addition to the three major categories, other categories such as number (singular or plural) and tense (present, past, or future) can be added. Whether we add more categories to the semantic-frame representation depends on how detailed a representation is required. To refine the translation, we can increase the number of semantic-frame categories. In addition, some languages require more elaborate tense or honorific representations than others. The flexibility of the semantic-frame representation makes the TINA language-understanding system an ideal tool for machine translation.
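For readers who find a concrete data structure helpful, the fragment below shows one plausible way to hold the Figure 5 semantic frame as a nested Python structure. The layout is our reading of the figure, not the internal format that TINA or GENESIS actually uses.

```python
# One possible nested-structure rendering of the Figure 5 semantic frame.
# The key names follow the figure; the nesting is an assumption for clarity.

semantic_frame = {
    "clause": "statement",
    "time_expression": {"topic": "z", "pred": "819"},
    "topic": {"name": "sterett", "pred": "uss"},
    "pred": {
        "head": "taken_under_fire",
        "v_by": {"topic": {"quantifier": "indef", "name": "kirov"}},
        "v_with_instrument": {"topic": {"name": "ssn-12", "number": "pl"}},
    },
}

# The same frame drives both the English paraphrase and the Korean output;
# only the language-specific lexicon and message templates differ.
```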

The primary task of the language-generation system is to produce target-language output that captures the meaning represented in the semantic frame in a proper and grammatically correct sentence in the target language. In our translation system, we have both a Korean-generation and an English-generation module. For an English source language, the English-generation module must produce a paraphrase of the input in the source language. Both the English paraphrase and the Korean translation are shown beneath

Table 1. Sample English-Korean Language-Generation Lexicon

   V1                   V "ha"  PRESENT "han"  PAST "hayss"  PP "hayss"  PSV "toy"
   indef                D ""
   kirov                N "khirob"
   ssn-12               N "ssn-12 misail"
   sterett              N "stheret"
   take_under_fire      V1 "phokyek"
   uss                  N "mikwunham"
   z                    N "pyocwunsikan"
   v_by                 P "uyhay"
   v_with_instrument    P "lo"

the example semantic frame in Figure 5. The paraphrase in this case is essentially identical to the original (except that 0819 is replaced by 819). The Korean output is Hangul text composed from the 24 basic letters and 16 complex letters of the Korean alphabet.

To produce translation output, the language-generation system requires three data files: a lexicon, a set of message templates, and a set of rewrite rules. These files are language-specific and external to the core language-generation system. Consequently, extending the language-generation system to a new language requires creating only the data files for the new language. A pilot study of applying the GENESIS system to Korean language generation can be found in Reference 9. To generate a sentence, all the vocabulary items in the semantic frame, such as z, uss, and by, are replaced by the corresponding vocabulary items provided in the lexicon. All phrase-level constituents represented by topic and pred are combined recursively to derive the target-language word order, as specified in the message templates. We give examples below of the data files that are necessary to generate Korean translation output.

Table 1 shows a sample language-generation lexicon necessary to generate the Korean translation output of the input sentence from the semantic frame in Figure 5: 0819 z uss sterett taken under fire by a kirov with ssn-12s. Words and concepts in the semantic frame are given in the left column of the table, and the corresponding forms in Korean are given in the right column. The Korean forms are in Yale Romanized Hangul, a representation of Korean text in a phonetic form that uses the Roman alphabet [10]. Because the semantic frame uses English as its specification language, lexicon entries contain words and concepts found in the semantic frame with corresponding forms in Korean. (For a discussion about designing interlingua lexicons, see Reference 11.)

In the lexicon, P stands for the part of speech preposition; N for noun; D for determiner; and V for verb. Verbs are classified into several subgroups according to grammatical rules that govern which tense forms are used. The first row of the example in Table 1 says that the entry V1 is a verb category ha for which the present tense is han, the past tense hayss, the past participle hayss, and the passive voice toy.


Message templates are target-language grammar rules corresponding to the input-sentence expressions represented in the semantic frame. The word order of the target language is specified in the message templates. Table 2 gives the set of message templates required to produce the Korean translation output from the semantic frame in Figure 5.

Table 2. Sample Korean Language-Generation Message Templates

   (a) statement               :time_expression :topic i :predicate ta
   (b) topic                   :quantifier :noun_phrase
   (c) predicate               :topic :predicate
   (d) np-uss                  :predicate :noun_phrase
   (e) np-v_by                 :topic :predicate :noun_phrase
   (f) np-v_with_instrument    :topic :predicate :noun_phrase

Template a instructs that a statement consists of a time expression followed by the topic, which in turn is followed by the predicate (corresponding to the verb phrase). The morpheme i following :topic is the subject case marker, and the morpheme ta following :predicate is the marker indicating that the sentence is a statement. According to template b, a topic (typically equivalent to a noun phrase) consists of a quantifier and the head noun itself. Template c says that a verb phrase consists of an object followed by the verb. This template specifies that in Korean the object precedes the verb, as opposed to English, in which the object follows the verb. It also illustrates that the predicate category encompasses several syntactic subcategories, including a verb and a verb phrase. Template d says that uss is a predicate embedded under a higher-level predicate. Templates e and f say that the prepositional phrases headed by the equivalents of by and with are predicates, take an object to their left, and are embedded under a higher-level category.
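To make the interplay of lexicon and templates concrete, here is a toy generator in the spirit of GENESIS: entries from Table 1 supply the Yale-romanized Korean forms, and the word order follows templates a through f. This is a simplified illustration under our own assumptions about morpheme attachment, not the GENESIS data-file format.

```python
# Toy template-driven generation: Table 1 lexicon plus Table 2 word order.
# Simplified illustration only; not the actual GENESIS machinery.

LEXICON = {
    "z": "pyocwunsikan", "uss": "mikwunham", "sterett": "stheret",
    "kirov": "khirob", "ssn-12": "ssn-12 misail",
    "take_under_fire": "phokyek", "v_by": "uyhay", "v_with_instrument": "lo",
}

def gen_topic(frame):
    # Template (d): a modifier such as uss precedes the head noun.
    parts = [LEXICON.get(frame.get("pred", ""), ""), LEXICON[frame["name"]]]
    return " ".join(p for p in parts if p)

def gen_statement(frame):
    # Template (a): time expression, subject + marker i, predicate + marker ta.
    time = f'{frame["time"]["pred"]} {LEXICON[frame["time"]["topic"]]}'
    subject = gen_topic(frame["topic"]) + "-i"
    # Templates (e) and (f): postpositions follow their objects; template (c)
    # puts all of this before the verb, since Korean is verb-final.
    agent = gen_topic(frame["agent"]) + " " + LEXICON["v_by"]
    instrument = gen_topic(frame["instrument"]) + "-" + LEXICON["v_with_instrument"]
    verb = LEXICON[frame["verb"]] + "-toy-ess-ta"  # passive, past, statement marker
    return " ".join([time, subject, agent, instrument, verb])

frame = {
    "time": {"topic": "z", "pred": "819"},
    "topic": {"pred": "uss", "name": "sterett"},
    "agent": {"name": "kirov"},
    "instrument": {"name": "ssn-12"},
    "verb": "take_under_fire",
}
print(gen_statement(frame))
# 819 pyocwunsikan mikwunham stheret-i khirob uyhay ssn-12 misail-lo phokyek-toy-ess-ta
```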

Rewrite rules are intended to capture surface phonological constraints and contractions, in particular the conditions under which a single morpheme has different phonological realizations. In English, the rewrite rules are used to generate the proper form of the indefinite article, a or an. Choosing one indefinite article over the other depends on the phonology of the word that follows: if the word that follows starts with a vowel, the appropriate indefinite article is an; if it starts with a consonant, the appropriate indefinite article is a. The Korean language employs similar types of morphological variation. In Table 3, the so-called nominative case marker is realized as i when the preceding morpheme (John in this example) ends with a consonant, and as ka when the preceding morpheme (Maria in this example) ends with a vowel. Similarly, the so-called accusative case marker is realized as ul after a consonant, and as lul after a vowel. Because these types of alternation are regular, and it is not possible to list every word to which these markers are attached in the rewrite-rule templates, a separate subroutine written in C code has been implemented to improve efficiency. For details of other related phenomena in the Korean language, see Reference 12.

Table 3. Phonologically Conditioned Case Markers in Korean

                          Nominative Case    Accusative Case
   Following consonant    John-i             John-ul
   Following vowel        Maria-ka           Maria-lul
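The alternation in Table 3 is regular enough to express as a tiny function, shown below under the assumption (a simplification of Yale romanization) that a final letter in the set aeiouy counts as a vowel.

```python
# Sketch of the Table 3 rewrite rule: the case marker's form depends on
# whether the preceding morpheme ends in a vowel or a consonant.

VOWELS = set("aeiouy")  # simplified vowel-final test for Yale romanization

def add_case_marker(stem, case):
    """Attach the nominative (i/ka) or accusative (ul/lul) marker."""
    ends_in_vowel = stem[-1].lower() in VOWELS
    if case == "nominative":
        return stem + ("-ka" if ends_in_vowel else "-i")
    if case == "accusative":
        return stem + ("-lul" if ends_in_vowel else "-ul")
    raise ValueError(f"unknown case: {case}")

print(add_case_marker("John", "nominative"))   # John-i
print(add_case_marker("Maria", "nominative"))  # Maria-ka
print(add_case_marker("John", "accusative"))   # John-ul
print(add_case_marker("Maria", "accusative"))  # Maria-lul
```

A lookup table cannot enumerate every stem, which is why the article notes that this check is implemented as a separate C subroutine rather than as rewrite-rule templates.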

User View of System as Translator’s Aid

Before proceeding to an extended discussion of the technical operation and performance of our system, we describe its operation as a translator's aid. Figure 6 shows the graphical user interface of our system in the English-to-Korean translation mode. The interface features four windows and five icon buttons. English text is entered in the top window. Input is entered by voice, through the keyboard, or from a file or external message source. To enter a voice input, the user activates the speech recognizer by clicking on the microphone icon and speaks the sentence. The recognized speech appears in the English input window and is



then treated as text input. To translate a sentence in the input window, the user clicks on the English-to-Korean translation icon (indicated by flags), and the translation appears in the third window from the top. In this example of text translation, the user has activated translation of the sentence that begins At 0823 z Sterett. The English paraphrase is shown in the paraphrase window, and the Korean translation of that sentence (in Hangul characters) is shown in the window below the English paraphrase. The user then has an opportunity to edit the Korean translation by using a Hangul text editor. When the translation is acceptable, the user clicks on the check icon, and the translated sentence is moved to the output window at the bottom. Here, the translation of the prior sentence starting with 0819 z USS Sterett is shown in the output window. If the user wishes to view the translation process in more detail, the parse tree or semantic frame can be viewed by clicking on the tree or frame icons.

In configuring our system as a translator's aid, we provide the user with as much help as possible. If the system is unable to parse and understand the input sentence, a word-for-word translation is provided to the user, consisting of a sequence of word translations from the Korean-generation module. If some of the English words are not in the generation lexicon, the original English word is included in the translation output in the place where its Korean equivalent would have occurred. In both cases, the problem is noted on the output.
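The fall-back behavior just described amounts to a dictionary pass with explicit flagging of gaps. A minimal sketch, with an invented lexicon and output convention rather than CCLINC's actual formats:

```python
# Word-for-word fall-back when full parsing fails; illustrative only.

def word_for_word(sentence, lexicon):
    out, unknown = [], []
    for word in sentence.lower().split():
        if word in lexicon:
            out.append(lexicon[word])   # sequence of word translations
        else:
            out.append(word)            # keep the English word in place
            unknown.append(word)        # note the problem on the output
    note = f"  [untranslated: {', '.join(unknown)}]" if unknown else ""
    return " ".join(out) + note

print(word_for_word("uss sterett replied", {"uss": "mikwunham", "sterett": "stheret"}))
# mikwunham stheret replied  [untranslated: replied]
```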

The interlingua-based Korean-to-English translation system operates with the same graphical user interface, except that the U.S. and Korean flags are interchanged in the translation icon and the input language is Korean. The SYSTRAN transfer-based Korean-to-English translation system, however, does not provide the user a paraphrase, parse tree, or semantic frame.


FIGURE 6. Graphical user interface of the translator's aid in English-to-Korean translation mode. The input is entered by voice, through the keyboard, or from a file to the top window. The English paraphrase is shown below the input window, and the Korean translation of that sentence (in Hangul characters) is shown in the window below the English paraphrase. The user can edit the translation output by using a Hangul text editor. If the translation is acceptable, the translated sentence can be moved to the bottom window by clicking on the check icon. The parse tree and the semantic frame of the input sentence can be displayed by clicking on the tree and the frame buttons, respectively.



English-to-Korean System Development on Naval Message Domain: A Domain-Specific Grammar Approach

From June 1995 to April 1996 we trained our system on the MUC-II corpus, a collection of naval operational report messages from the Second Message Understanding Conference (MUC-II). These messages were collected and prepared by the Center for Naval Research and Development (NRaD) to support DARPA-sponsored research in message understanding. Lincoln Laboratory utilized these messages for DARPA-sponsored machine-translation research. We chose to use the MUC-II corpus for the following reasons: (1) the messages were typical of the actual military messages that our users would be interested in translating, including heavy use of telegraphic text and military jargon and acronyms; (2) the domain was limited but useful, so that we felt our interlingua approach could be applied with a reasonable probability of success; and (3) the corpus was available to us in usable form.

The MUC-II Naval Message Corpus

MUC-II data consist of a set of naval operational report messages that feature incidents involving different platforms such as aircraft, surface ships, submarines, and land targets. The MUC-II corpus consists of 145 messages that average 3 sentences per message and 12 words per sentence [13, 14]. The total vocabulary size of the MUC-II corpus is about 2000 words. The following example shows that MUC-II messages are highly telegraphic, with many instances of sentence fragments and missing articles:

At 1609 hostile forces launched massive recon effort from captured airfield against friendly units. Have positive confirmation that battle force is targeted (2035z). Considered hostile act.

The messages in this article are modeled after, but are not from, the MUC-II corpus. For each message, a corresponding modified message has been constructed in more natural English. For example, in the modified version below, words that are underlined have been added to the original message:

At 1609 z hostile forces launched a massive recon effort from a captured airfield against friendly units. Friendly units have positive confirmation that the battle force is targeted (2035z). This is considered a hostile act.

MUC-II data have other features typical of natural text. There are several instances of complex sentences having more than one clause, coordination problems involving conjunctions (and, or), and multiple noun and verb modifiers, as in the following examples:

Complex sentence: Two uss lion based strike escort f-14s were engaged by unknown number of hostile su-7 aircraft near land9 bay (island target facility) while conducting strike against guerrilla camp.

Coordination problem: Fox locked on with fire control radar and fired torpedo in tiger's direction.

Multiple noun and verb modifiers: The deliberate harassment of uscgc tiger by hostile fox endangers an already fragile political/military balance between hostile and friendly forces.

Translation-System Training

For our translation-system training and development, we have used both the original and the modified data, including 105 messages from the MUC-II corpus. These messages, in both original and modified versions, comprised a total of 641 sentences. For additional training material, we added a set of 154 MUC-II-like sentences that were created in an in-house experiment, so that the total number of sentences used in training was 795. This training corpus was divided into four data sets. We trained the translation system by using an iterative procedure in which grammar and vocabulary were developed for the first


set, and then we tested and modified the translation system on subsequent sets.

In our training procedure, we developed analysis rules by hand on the basis of observed patterns in the data. These rules are then converted into a network structure. Probability assignments in the network are obtained automatically by parsing each training sentence and updating the appropriate counts [5].
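In outline, this training loop is just counting: each parsed sentence increments a count for every rule expansion it uses, and relative frequencies become the network's probabilities. The sketch below shows that idea in Python; TINA's actual bookkeeping over its shared network structure is richer than this.

```python
# Sketch of probability training by counting rule uses in parsed sentences.

from collections import Counter, defaultdict

rule_counts = defaultdict(Counter)  # parent category -> counts of expansions

def observe_parse(parse):
    """parse: list of (parent, expansion) pairs from one parsed training sentence."""
    for parent, expansion in parse:
        rule_counts[parent][expansion] += 1

def rule_probability(parent, expansion):
    total = sum(rule_counts[parent].values())
    return rule_counts[parent][expansion] / total if total else 0.0

observe_parse([("statement", ("pre_adjunct", "subject", "predicate"))])
observe_parse([("statement", ("subject", "predicate"))])
print(rule_probability("statement", ("subject", "predicate")))  # 0.5
```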

When translation-system development on the MUC-II corpus was complete, the size of the lexicon was 1427 words for analysis and 1000 words for generation; the size of the grammar was 1297 categories for analysis and 850 categories for generation. The actual number of rules is much greater because TINA allows the sharing, or cross-pollination, of common elements [5]. When the training was complete, the translation system was able to translate 673 of the 795 sentences correctly, for a translation-accuracy rate of 84.7%.

Parsing Telegraphic Messages

In developing our system on the MUC-II corpus, we addressed two key problems. First, telegraphic messages induce a greater degree of ambiguity than texts written in natural English. Second, our initial system was unable to parse sentences containing words new to the grammar.

Our solution to the problem of resolving ambiguity in telegraphic messages was applied in initial system development, and is reflected in the accuracy results described above. Additional details of our work in resolving ambiguity are presented in Reference 15. When the rules are defined in terms of syntactic categories (i.e., parts of speech) [16], telegraphic messages with omissions introduce a greater degree of syntactic ambiguity than texts without any omitted elements. The following examples contain preposition omission:

1410 z (which means "at 1410 Greenwich Mean Time") hostile raid composition of 19 aircraft.

Haylor hit by a torpedo and put out of action 8 hours (which means "for 8 hours").

To accommodate sentences with a preposition omission, the grammar needs to allow every instance of a noun phrase NP to be ambiguous between an NP and a prepositional phrase PP. The following examples show how allowing an input in which the copula verb be is omitted causes the past tense form of a verb to be interpreted either as the main verb with the appropriate form of be omitted, as in phrase a, or as a reduced relative clause modifying the preceding noun, as in phrase b.

Aircraft launched at 1300 z.
(a) Aircraft were launched at 1300 z.
(b) Aircraft which were launched at 1300 z.

Syntactic ambiguity and the resultant misparse induced by such an omission often lead to a mistranslation. For example, the phrase TU-95 destroyed 220 nm could be misparsed as an active rather than a passive sentence because of the omission of the verb was, and the prepositional phrase 220 nm could be misparsed as the direct object of the verb destroy. The semantic frame reflects these misunderstandings because it is derived directly from the parse tree, as shown in Figure 7. The semantic frame then becomes the input to the generation system, which produces the following nonsensical Korean translation output:

TU-95-ka 220 hayli-lul pakoy-hayssta.
TU-95-NOM 220 nautical mile-OBJ destroyed.

The sensible translation is

TU-95-ka 220 hayli-eyse pakoy-toyessta.
TU-95-NOM 220 nautical mile-LOC was destroyed.

In these examples, NOM stands for the nominative case marker, OBJ for the object case marker, and LOC for the locative postposition. The problem with the nonsensical translation above is that the object particle lul necessarily misidentifies the preceding locative phrase 220 hayli as the object of the verb. This type of misunderstanding is not reflected in the English paraphrase because English does not have case particles that overtly mark the case role of an NP.

Many instances of syntactic ambiguity are resolved


on the basis of semantic information. However, relying on semantic information requires the parser to produce all possible parses of the input text and forward them to a separate module to resolve the ambiguity, a more complex understanding process.

One way of reducing ambiguity at an early stage of processing, without relying on another module, is to incorporate domain-specific semantic knowledge into the grammar. We therefore introduce domain-specific categories to restrict the types of phrases that allow omissions. For the example TU-95 destroyed 220 nm, we can introduce the following sequence of grammar rules to capture the domain-specific knowledge that a prepositional phrase denoting a location (a locative prepositional phrase) allows the preposition at to be omitted, and that the noun phrases that typically occur in a locative prepositional phrase with preposition omission are the ones that denote distance:

locative_PP   -> {in, near, off, on, ...} NP
               | at_locative

at_locative   -> [at] NP_distance

NP_distance   -> numeric nautical_mile

nautical_mile -> nm

In the preceding grammar, the first rule states that a locative prepositional phrase locative_PP consists either of a preposition (in, near, off, on) and a noun phrase NP, or simply of an at_locative. The second rule says that the prepositional phrase at_locative consists of the preposition at, which may be omitted as indicated by the brackets, and a noun phrase denoting distance, NP_distance. The third rule states that a distance-denoting noun phrase NP_distance consists of a numeric expression followed by the head noun nautical_mile, which is written as nm according to the fourth rule. With this grammar, the expression 220 nm can be correctly understood as a locative prepositional phrase rather than a noun phrase.

We rely on the training capability of the system to understand the verb destroyed as the main verb of a passive sentence in which the verb was is omitted, rather than as a verb in a reduced relative clause. Namely, a noun-verb sequence that is ambiguous between the past tense and the past participial form is more likely to be the subject and main verb of a passive sentence (i.e., TU-95 was destroyed) than a noun modified by a reduced relative clause (i.e., TU-95 which was destroyed).

The introduction of a domain-specific semantic grammar and the training capability of the system allow the input sentence TU-95 destroyed 220 nm to be correctly understood as equivalent to TU-95 was destroyed at 220 nm. Figure 8 shows the semantic frame that reflects the proper understanding. The whole locative prepositional phrase 220 nm is represented as the predicate at_locative, in which 220 nm is actually mapped onto the category topic. This semantic-frame representation contrasts with Figure 7, which illustrates how the understanding system can mistranslate when no domain-specific knowledge is incorporated into the grammar.
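The four rules above can be read directly as a small recognizer. The sketch below implements them with one function per category, using the convention that a function returns the unconsumed tokens on success and None on failure; the general NP branch of the first rule is elided. This is our illustration, not TINA's parsing mechanism.

```python
# The locative_PP rules as a tiny recursive matcher (illustrative only).
# Convention: return the remaining tokens on success, None on failure.

def parse_nautical_mile(tokens):
    # nautical_mile -> nm
    return tokens[1:] if tokens and tokens[0] == "nm" else None

def parse_np_distance(tokens):
    # NP_distance -> numeric nautical_mile
    if tokens and tokens[0].isdigit():
        return parse_nautical_mile(tokens[1:])
    return None

def parse_at_locative(tokens):
    # at_locative -> [at] NP_distance   (the preposition "at" may be omitted)
    if tokens and tokens[0] == "at":
        tokens = tokens[1:]
    return parse_np_distance(tokens)

def parse_locative_pp(tokens):
    # locative_PP -> {in, near, off, on, ...} NP | at_locative
    if tokens and tokens[0] in {"in", "near", "off", "on"}:
        return tokens[1:] or None   # the general NP case is elided in this sketch
    return parse_at_locative(tokens)

print(parse_locative_pp("220 nm".split()) is not None)     # True: "at" omitted
print(parse_locative_pp("at 220 nm".split()) is not None)  # True
```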

FIGURE 7. Semantic frame for the mistranslation of the input sentence TU-95 destroyed 220 nm (which means "TU-95 was destroyed at 220 nm"). The mistranslation occurs because the locative expression 220 nm is misunderstood as the object of the verb destroyed, and the sentence is misunderstood to be in active voice rather than passive voice.

[Figure 7 semantic frame:

:statement
   :topic nn_head  :name "tu-95"
   :pred destroy  :mode "past"
      :topic nn_head  :name "nm"
         :pred cardinal  :topic "220"

Wrong translation: (Korean output in Hangul; the characters are garbled in this extraction.)]


Text-Translation Evaluation with Domain-Specific Grammar

After using the data set of 641 training sentences to develop our translation system, we conducted system evaluations on two sets of test sentences that had not been used in training. First, the system was evaluated on a set of 111 sentences comprising 40 messages, called the TEST set. Second, the system was evaluated on another set of data, called TEST′, which was collected from an in-house experiment. For this experiment, the subjects were asked to study a number of MUC-II sentences and create about twenty new MUC-II-like sentences each to form data set TEST′. Because our domain-specific grammar at this stage of development could handle only words that had been entered in the grammar, we knew that performance on TEST, which was certain to contain words unknown to the grammar, would be limited. In creating TEST′, subjects were likely to use words shown to them in the example sentences. Consequently, the percentage of unknown words in TEST′ was lower and the percentage of sentences correctly parsed was greater, as reflected in the following results.

We present evaluation results for our understanding-based translation system on the simple basis of whether correct understanding and generation are achieved. Because our system tends to produce an accurate translation for about 85% of the sentences that are parsed, we have not found it necessary to use more complex evaluation methods like those described in Reference 17. Earlier work in evaluating English-Korean translation systems is described in Reference 18.

Of the 111 sentences in the TEST set, 45 had at least one unknown word, and hence could not be parsed with this domain-specific grammar. Of the remaining 66 sentences, 23 (35%) were parsed, and 20 (87%) of these parsed sentences were correctly translated. However, the system failed on 41% of the new MUC-II sentences in TEST because it could not handle new words at that time. We discuss our solution to the new-word problem in the next section.

The results on the 280 TEST′ sentences were somewhat better because of the much lower frequency of unknown words and the fact that the sentences in TEST′ generally followed the pattern of the training sentences. In TEST′, 41 sentences, or 15%, failed to parse because of the presence of at least one unknown word. Of the remaining 239 sentences, 103 (43%) were parsed, and of these, 88 (85%) were correctly translated.

System Enhancement for New Words: Two-Stage Parsing

Although language processing is efficient when the system relies on domain-specific grammar rules, this approach has drawbacks. Because vocabulary items are entered into the grammar as part of the grammar rules, parsing fails if an input sentence contains new words. For example, the following sentence is not parsed if the word incorrectly is not in the grammar:

0819 z unknown contact replied incorrectly.

This drawback was reflected in the initial performance evaluation of our machine-translation system, as discussed previously.

FIGURE 8. Semantic frame for the accurate translation of the input TU-95 destroyed 220 nm. Entries in red are replaced by the corresponding vocabulary items in the Korean-generation lexicon. Unlike the semantic frame in Figure 7, the locative expression 220 nm is understood correctly as a locative expression, and the sentence is translated in passive voice. The correct translation results from the domain-specific knowledge of the grammar and the grammar-training capability of the English-understanding system.

[Figure 8 semantic frame:

:statement
   :topic aircraft  :name "tu-95"
   :pred destroy  :mode "psv"
      :pred at_locative
         :topic distance  :name "nm"  :pred 220

Translation: (Korean output in Hangul; the characters are garbled in this extraction.)]


To handle the new-word problem, we developed a two-stage parsing strategy. We first use the domain-specific grammar rules to try parsing the input word sequence. If parsing fails because the input contains words or constructs not covered in the domain-specific grammar, we replace the input words with their parts of speech, and try to parse the part-of-speech sequence by using general grammar rules defined in terms of parts of speech rather than individual words.

At the first stage of parsing, the input sentence 0819 z unknown contact replied incorrectly fails on the domain-specific grammar rules because of the unknown word incorrectly. Part-of-speech tagging then takes place, replacing the input word sequence with the corresponding part-of-speech sequence cardinal z adjective noun replied adverb. At the second stage of parsing, the part-of-speech sequence is successfully parsed, resulting in the parse tree shown in Figure 9. A major difference between the parse tree in Figure 4 and that of Figure 9 is that the latter contains syntactic categories like adjective, noun, and adverb at the lower levels, whereas the former contains only domain-specific semantic categories at the lower levels. On closer examination, the input sequence at the second-stage parsing does not consist solely of parts of speech, but of a mix of parts of speech and words. Unless a word is a verb or preposition, we replace it with its part of speech. By not substituting parts of speech for verbs and prepositions, we avoid ambiguity [15, 19].
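Stripped to its control flow, the two-stage strategy looks like the following sketch, where parse and tag stand in for the real TINA parser and Brill tagger, and the tag set is an invented example. Verbs and prepositions are kept as literal words, as described above.

```python
# Sketch of two-stage parsing; parse() and tag() are placeholders for the
# real TINA parser and Brill tagger, and the tag names are illustrative.

KEEP_AS_WORDS = {"VB", "VBD", "VBN", "IN"}  # verbs and prepositions stay literal

def two_stage_parse(words, parse, tag, known_words):
    tree = parse(words)              # stage 1: domain-specific grammar on words
    if tree is not None:
        return tree
    tagged = tag(words)              # stage 2: back off to part-of-speech tags
    mixed = [w if w in known_words or t in KEEP_AS_WORDS else t.lower()
             for w, t in tagged]     # e.g. "incorrectly" becomes "adverb"
    return parse(mixed)              # general rules defined over parts of speech
```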

Integration of Rule-Based Part-of-Speech Tagger

To accommodate part-of-speech input to the parser, we integrated the rule-based part-of-speech tagger developed by E. Brill [20] as a preprocessor to the parser. The advantage of integrating a part-of-speech tagger, rather than a lexicon containing part-of-speech information, is that only the former can tag words that are new to the system, and it therefore provides a way of handling unknown words.

The rule-based part-of-speech tagger uses the transformation-based error-driven learning algorithm [20, 21]. While most stochastic taggers require a large

FIGURE 9. Parse tree derived from a mixed sequence of words and part-of-speech tags. The input sentence at the top is converted into the mixed sequence below it by using the part-of-speech tagger. This mixed sequence is the input to the parser. In the parse tree, part-of-speech units are shown in red. When parsing is complete, the part-of-speech units are replaced by the words in the original sentence. For example, adjective is replaced by unknown, and adverb is replaced by incorrectly.

[Figure 9 parse tree.
Input sentence: 0819 z unknown contact replied incorrectly.
Input to parser: cardinal z adjective noun replied adverb
The tree descends from sentence through full_parse and statement to a pre_adjunct (time_expression, gmt_time, numeric_time), a subject (np with adjective and noun), and a predicate (vp_reply with verb and adverb_phrase).]


amount of training data to achieve high rates of tagging accuracy, this rule-based tagger achieves performance comparable to or higher than that of stochastic taggers, even with a training corpus of modest size. Given that the size of our training corpus is small (7716 words), a rule-based tagger is well suited to our needs.

The rule-based part-of-speech tagger operates in two stages. First, each word in the tagged training corpus has a lexicon entry consisting of a partially ordered list of tags, indicating the most likely tag for that word and all other tags seen with that word (in no particular order). Every word is initially assigned its most likely tag in isolation. Unknown words are assumed to be nouns, and then cues based upon prefixes, suffixes, infixes, and adjacent word co-occurrences are used to update the most likely tag. Second, after the most likely tag for each word is assigned, contextual transformations are used to improve the accuracy.
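A compressed sketch of those two stages follows; the initial-tag lexicon and the single contextual rule are invented stand-ins for what Brill's tagger actually learns from data.

```python
# Two-stage rule-based tagging in miniature: initial most-likely tags with
# an unknown-word default and a suffix cue, then contextual transformations.
# The lexicon and the rule below are invented examples of the rule shapes.

MOST_LIKELY = {"0819": "cardinal", "z": "noun", "unknown": "adjective",
               "contact": "noun", "replied": "verb_past"}

def initial_tag(word):
    if word in MOST_LIKELY:
        return MOST_LIKELY[word]  # most likely tag seen in training
    if word.endswith("ly"):       # suffix cue updates the unknown-word guess
        return "adverb"
    return "noun"                 # unknown words are first assumed to be nouns

# Contextual transformations patch systematic errors: change tag A to tag B
# when the neighboring tags satisfy some condition.
CONTEXT_RULES = [("noun", "adverb", lambda prev, nxt: prev == "verb_past")]

def tag_sentence(words):
    tags = [initial_tag(w) for w in words]
    for i in range(len(tags)):
        prev = tags[i - 1] if i > 0 else None
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        for a, b, in_context in CONTEXT_RULES:
            if tags[i] == a and in_context(prev, nxt):
                tags[i] = b
    return list(zip(words, tags))

print(tag_sentence("0819 z unknown contact replied incorrectly".split()))
# Yields the mixed tag sequence cardinal/noun/adjective/noun/verb_past/adverb.
```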

We evaluated the tagger performance on the TEST data set both before and after training on the MUC-II corpus. Table 4 presents the results of our evaluations. Tagging statistics before training are based on the lexicon and rules acquired from the Brown corpus and the Wall Street Journal (WSJ) corpus. Tagging statistics after training are divided into two categories, both of which are based on the rules acquired from training data sets of the MUC-II corpus. The only difference between the two is that in one case (after training I) we use a lexicon acquired from the MUC-II corpus, and in the other case (after training II) we use a lexicon acquired by combining the Brown corpus, the WSJ corpus, and the MUC-II corpus.

Table 4. Rule-Based Part-of-Speech Tagger Evaluation on the TEST Data Set

   Training status      Tagging accuracy
   Before training      1125/1287 (87.4%)
   After training I     1249/1287 (97%)
   After training II    1263/1287 (98%)

Table 4 shows that the tagger achieves a tagging accuracy of up to 98% after training with the combined lexicon. The tagging accuracy for unknown words ranges from 82% to 87%. These high rates of tagging accuracy are largely due to two factors: the combination of domain-specific contextual rules obtained by training on the MUC-II corpus with general contextual rules obtained by training on the WSJ corpus, and the combination of the MUC-II lexicon with the WSJ corpus lexicon.

Adapting the Language-Understanding System

The language-understanding system derives the semantic-frame representation from the parse tree. The terminal symbols (i.e., words in general) in the parse tree are represented as vocabulary items in the semantic frame. Once we have allowed the parser to take parts of speech as input, the parts of speech (rather than actual words) appear as terminal symbols in the parse tree, and hence as the vocabulary items in the semantic-frame representation. We therefore adapted the system so that the part-of-speech tags are used for parsing but are replaced with the original words in the final semantic frame. Figure 10 illustrates the semantic frame produced by the adapted system for the input sentence 0819 z unknown contact replied incorrectly. Once the semantic frame has been produced, as above, generation proceeds as usual.

Summary of Results for English-to-Korean Translation on Naval Messages

After integrating the part-of-speech tagger into the system to implement the two-stage parsing technique, we reevaluated the system on the TEST and TEST′ data. The experimental results show that by adopting the two-stage parsing technique, we increased the parsing coverage from 35% to 77% on the TEST data, and from 43% to 82% on the TEST′ data.

Figure 11 summarizes the results on all training and test sentences (including TEST and TEST′). With the integration of the two-stage procedure that includes the part-of-speech tagger, we have been able to increase the translation accuracy on this domain to 80%. We believe that this level of accuracy, when combined with a fall-back position that provides word-for-word translations for sentences that cannot be parsed, would be of operational value to human translators and would significantly reduce their workload.



This hypothesis remains to be tested, and to be truly useful the translation system also needs to be extended beyond the MUC-II corpus to more operational domains. Work along these lines is described later in this article.

Speech Translation in the MUC-II Domain

Although our primary emphasis in working on the MUC-II domain was text translation, we also developed a speech-translation system for a subset of this domain. In our original speech-translation work we had used a hidden Markov model (HMM) speech recognizer that had been developed earlier at Lincoln Laboratory. For the MUC-II domain, we developed a new HMM speech recognizer by building upon the HMM Toolkit (HTK) software system originally developed at Cambridge University [22, 23]. Given a vocabulary, a grammar, training data, and a number of key parameters of the HMM system, the HTK system can be used to build a speech recognizer.

The speech training data used for the MUC-II speech recognizer was drawn from an independent data source, the TIMIT general English corpus [24]. The HTK system was used to train speaker-independent acoustic triphone models on the TIMIT corpus. Separate gender acoustic models were generated from a total of 233 minutes of data from 326 male and 136 female speakers. The core HMM models were three-state, state-tied models, with one Gaussian per state. Despite the large size of the TIMIT corpus, 22% of the triphone models that occurred in the MUC-II sentences did not occur in TIMIT. For those triphone models, back-off monophone models were used [23].

For speech recognition on the MUC-II corpus, a simple language model was generated in the form of a word-pair grammar (WPG) constructed on the basis of the text of 207 sentences drawn from the MUC-II corpus. A WPG is a special case of a bigram grammar [23]; the WPG specifies the set of words WF that are allowed to follow any given word WI in the vocabulary, and equalizes the probabilities of a transition from a given WI to any of the WF. Many vocabulary items in the MUC-II corpus, particularly naval terms, abbreviations, and acronyms, were not included in our available phonetic dictionaries. Phonetic expansions that were created by hand for about 200 such items were added to the dictionary. In summary, for the purposes of this MUC-II speech-translation experiment, the size of the vocabulary was about 500 words and the perplexity (geometric mean of the number of words that can follow a given word) of the WPG on the data was 6.4 [25].
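Both the grammar and the perplexity figure follow directly from these definitions. The sketch below, which is an illustration rather than the HTK implementation, builds a word-pair grammar from training sentences and computes its perplexity as the geometric mean of the successor-set sizes.

    import math
    from collections import defaultdict

    def build_wpg(sentences):
        """Word-pair grammar: for each word, the set of words allowed to
        follow it; all transitions out of a word are equally likely."""
        follows = defaultdict(set)
        for words in sentences:
            for w_i, w_f in zip(words, words[1:]):
                follows[w_i].add(w_f)
        return follows

    def wpg_perplexity(follows):
        """Geometric mean of the successor counts, computed in log space."""
        counts = [len(s) for s in follows.values()]
        return math.exp(sum(math.log(c) for c in counts) / len(counts))

    # Example:
    # wpg = build_wpg([["unknown", "contact", "replied", "incorrectly"],
    #                  ["unknown", "contact", "lost"]])
    # wpg_perplexity(wpg)  # geometric mean of 1, 2, 1 successors

Because every allowed transition out of a word is equally probable, the perplexity of a WPG reduces to the branching factor of the grammar; the low value of 6.4 reflects how constrained the 207-sentence message set is.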

FIGURE 10. Accurate semantic frame derived from the parse tree with the part-of-speech input sequence. Entries in red are replaced by the corresponding vocabulary items in the Korean-generation lexicon.

FIGURE 11. Summary of English-to-Korean translation results on the MUC-II training and test data, which includes both TEST and TEST′. Use of the part-of-speech tagger, primarily to solve the unknown word problem, substantially enhances translation performance on the test data.

[Figure 11 bar chart: correct translation rates of 32% with the domain-specific grammar alone (test sentences), versus 86% on training sentences and 80% on test sentences with the part-of-speech tagger plus the domain-specific grammar.]

[Figure 10 semantic frame: a :statement frame with a :time_expression ("819" "z"), a :topic "contact" with :pred unknown, a :pred reply_v with :mode "past", and an :adverb incorrectly. Paraphrase: 819 Z unknown contact replied incorrectly.]



To test this system, we arranged to have 207 sentences recorded by one male speaker and one female speaker. The speaker-independent acoustic models were completely independent of the test data, but the word-pair grammar was developed for this particular set of sentences. (These speech-translation results were obtained by using the domain-specific MUC-II parsing system, prior to the work on the part-of-speech tagger.) Figure 12 shows the performance results for speech-recognition and speech-translation experiments on the 207 sentences. With a word error rate of 7%, the sentence accuracy (percentage of sentences perfectly recognized with no word errors) was 54%. To separate the effects of speech-recognition performance and text-translation performance, we evaluated speech-translation performance only on those sentences which had been translated correctly by the text-translation system. For this group, the percentage of sentences correctly translated (85%) is higher than the percentage of sentences that were perfectly recognized (54%).

The reason for this higher translation rate is that many of the speech-recognition errors are caused by omissions or incorrect recognition of items such as articles or plurals. Our translation system, which had been developed to deal with telegraphic text and to handle number disagreement within sentences, was tolerant of the errors that are often produced by speech recognizers. For descriptions of other work in English-Korean speech translation, see References 26 through 29.

Korean-to-English Translation

In the early stages of our project, we learned that SYSTRAN, Inc., a company with a long and successful history of work in machine translation, had just embarked on a Department of Defense (DoD)-sponsored project in Korean-to-English translation [30, 31]. Rather than develop the Korean-to-English part of the system ourselves, we chose to gain leverage from that work, and initiated a subcontract with SYSTRAN to adapt their Korean-to-English system to the MUC-II domain. Although this made our two-way system asymmetric in that the SYSTRAN system uses a transfer approach instead of an interlingua approach, we decided that the advantage in expediting development was worthwhile.

To provide a Korean MUC-II corpus, we separately contracted with another organization to produce human translations of 338 MUC-II corpus sentences into Korean. We then supplied this Korean corpus to SYSTRAN for their training, developing, and testing. From this Korean corpus, 220 sentences were used for training and 118 sentences were used for testing. During training, significant changes were made to all modules because the system had never dealt with telegraphic messages of this type. The system dictionary, which had about 20,000 Korean entries but lacked many terms in naval operations reports, was augmented to include the new words in MUC-II. We found that performance on the MUC-II Korean-to-English task was good; 57% of the translations of the test sentences were at least close to being correct, and for 97% of the sentences, a human translator could extract the essential meaning from the translation output. After the work on the MUC-II corpus, SYSTRAN extended the Korean-to-English system to another domain, which we obtained from a bilingual English-Korean combat officer's briefing course, and which was similar in vocabulary size to the MUC-II corpus. Korean-to-English performance on this domain was similar to the performance on the MUC-II corpus.

FIGURE 12. Speech-recognition and translation performance on MUC-II naval message data. The sentences averaged 12 words in length, and 54% of the sentences were perfectly recognized. Speech-translation performance, shown only for those sentences which were translated correctly by the text-translation system, is 85%, which demonstrates the capability of the parser to handle errors in the text input that it receives.




For demonstrations, we also developed a small-scale Korean-to-English speech-translation subsystem for the MUC-II domain. We collected training data from two Korean speakers, and using HTK we developed a rudimentary Korean speech recognizer with about a 75-word vocabulary. With this system we were able to demonstrate translation of Korean speech when we fed the output of the Korean speech recognizer into the Korean-to-English translator. This demonstration was of high interest to many observers, but we cautioned them that a great deal of work was still required to make a truly effective Korean speech-recognition and translation system.

Recently, we developed a preliminary working subsystem for interlingua-based Korean-to-English translation that includes a Korean analysis grammar for TINA. This makes CCLINC, to our knowledge, the first machine-translation system that implements two-way, interlingua-based English-Korean translation. Other related ongoing work includes research in Korean language understanding to produce an interlingua representation [32] and transfer-based Korean-to-English translation [33].

System Development on C2W Domain and Treatment of Complex Sentences

While we were carrying out our translation-system development on the MUC-II corpus, we worked with personnel in CFC Korea to obtain data that would be more directly typical of translation applications in that environment. In November 1996, we obtained data for a new task domain in the form of an English and Korean Command-and-Control Warfare (C2W) handbook. The handbook provided us with over two hundred pages of new material in each language, used routinely by CFC, in an electronic format. It contained a vocabulary of 8500 words and 3400 sentences, with an average sentence length of 15 words.

The new material created challenges. In particular, the sentences were longer and more complex than those in the MUC-II corpus. We were motivated by the C2W corpus to confront some of the difficult challenges in machine translation, which in turn led us to develop a more complete and robust translation system, as described below.

The C2W Data

For the C2W data, we focused our effort on developing a technique to handle complex sentences that includes fragmentation of a sentence into meaningful subunits before parsing, and composition of the corresponding semantic-frame fragments into a single unified semantic frame. Compared to those of the MUC-II corpus, the sentences in the C2W data are much longer and are written in grammatical English:

A mastery of military art is a prerequisite to successful practice of military deception but the mastery of military deception takes military art to a higher level.

Although opportunities to use deception should not be overlooked, the commander must also recognize situations where deception is not appropriate.

Often, the skillful application of tenets of military operations-initiative, agility, depth, synchronization and versatility, combined with effective OPSEC, will suffice in dominating the actions of the opponent.

Such long, complex sentences are difficult to parse. Acquiring a set of grammar rules that incorporate all instances of complex sentences is not easy. Even if a complex sentence is covered by the grammar, a long sentence induces a higher degree of ambiguity than a short sentence, requiring a much longer processing time. To overcome the problems posed by understanding of complex sentences, we have been developing sentence-fragmentation and semantic-frame composition techniques. We briefly describe these techniques below.


FIGURE 13. Operation of the sentence-fragmentation algorithm. From top to bottom are shown the input sentence, the Apple Pie Parser output, and the four fragments into which the input sentence is broken via the operation of the fragmentation algorithm on the output of the Apple Pie Parser.

Sentence Fragmentation

For sentence fragmentation, the input sentence is first parsed with the Apple Pie Parser, a system developed at New York University. This system runs on a corpus-based probabilistic grammar and produces the parse tree with the highest score among the trees derived from the input [34]. Our sentence-fragmentation algorithm [35] is applied to the Apple Pie Parser output, producing sentence fragments that each form a meaningful unit. Figure 13 provides an example of the Apple Pie Parser output and fragmenter output.

As the Apple Pie Parser output and the fragmented output show, the fragmentation algorithm extracts elements with category labels such as TOINF and SBAR, each of which forms an independent meaning unit [36]. Once a fragment is extracted from the higher-level category, the label of the extracted element is left behind to compose the component semantic frames at a later stage. In Figure 13, two fragments have been extracted from the input sentence: an adverbial clause (although opportunities to use deception should not be overlooked) whose category label in the parsing output is SBAR, and a relative clause (where deception is not appropriate) whose category label is also SBAR. Labels of these two extracted elements are left in the first fragment as adverbc1 and relclause1, respectively. Likewise, an infinitival clause whose category label in the parsing output is TOINF has been extracted from the adverbial clause, leaving its label toinfc1 in the second fragment.
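A simplified rendering of this extraction step is sketched below over a nested-tuple parse tree of the kind shown in Figure 13. The label bookkeeping is schematic; the published algorithm [35] distinguishes adverbial from relative clauses, handles punctuation, and covers more categories.

    # Simplified fragmenter: pull SBAR and TOINF subtrees out of a parse
    # tree, leaving an indexed label behind as a placeholder so that the
    # semantic frames can be recombined later.

    LABELS = {"SBAR": "clause", "TOINF": "toinfc"}

    def fragment_sentence(tree):
        fragments, counters = [], {}

        def walk(node):
            rebuilt = [node[0]]                    # category label
            for child in node[1:]:
                if isinstance(child, tuple) and child[0] in LABELS:
                    base = LABELS[child[0]]
                    counters[base] = counters.get(base, 0) + 1
                    label = f"{base}{counters[base]}"
                    # Recurse first, so a TOINF nested inside an SBAR is
                    # itself extracted from the clause fragment.
                    fragments.append((label, walk(child)))
                    rebuilt.append(label)          # placeholder left behind
                elif isinstance(child, tuple):
                    rebuilt.append(walk(child))
                else:
                    rebuilt.append(child)          # terminal word
            return tuple(rebuilt)

        return walk(tree), fragments

Applied to a tree such as ("S", ("SBAR", ...), ...), the function returns the top-level fragment with labels standing in for the extracted clauses, plus the list of labeled fragments themselves.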

Understanding Fragments and Semantic-Frame Composition

Once sentence fragments are generated according to the fragmentation algorithm, each fragment is processed by the language-understanding system, TINA, to produce the parse tree and the semantic frame. The semantic frames of the fragments are then combined to capture the meaning of the original input sentence. Figure 14 illustrates this process. The language-understanding system derives the parse tree and the corresponding semantic frame for each fragment, and the semantic frames for the fragments are combined. The combined frame then becomes the input to the GENESIS language-generation system.
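The composition step can be sketched as a recursive substitution over the frames, modeled here as nested dictionaries with fragment labels as string placeholders; our actual semantic frames are richer structures.

    # Compose fragment semantic frames into the main frame: wherever a
    # value is a placeholder label such as "toinfc1", splice in the
    # semantic frame produced for that fragment.

    def compose(frame, fragment_frames):
        composed = {}
        for key, value in frame.items():
            if isinstance(value, dict):
                composed[key] = compose(value, fragment_frames)
            elif isinstance(value, str) and value in fragment_frames:
                composed[key] = compose(fragment_frames[value],
                                        fragment_frames)
            else:
                composed[key] = value
        return composed

    # Following Figure 14, where the main frame carries the placeholder
    # to_infinitive "toinfc1":
    # main  = {"topic": "art", "pred": "focus_v", "to_infinitive": "toinfc1"}
    # frags = {"toinfc1": {"pred": "impose", "topic": "intent"}}
    # compose(main, frags)
    # -> {"topic": "art", "pred": "focus_v",
    #     "to_infinitive": {"pred": "impose", "topic": "intent"}}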

Robust Translation System

The robust translation system copes with the technical challenges discussed in the prior section. Figure 15 illustrates the process flow of this system. Given an input sentence, the TINA parser tries to parse the input sentence either on the word sequence or on a mixed sequence of words and part-of-speech tags. If the parsing succeeds, then TINA produces the semantic frame. If not, the input sentence is fragmented into several subunits. The TINA parser is then applied to each fragment. If parsing succeeds, the semantic frame for the parsed fragment is produced. If not, a form of word-for-word understanding is applied to the fragment, which results in a semantic frame that serves as a place holder. After all the fragments have been understood, the semantic frames for the fragments are composed, and generation proceeds from the composed semantic frame.
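In pseudocode form, the control flow of Figure 15 looks roughly as follows. Each helper stands in for a real module (parse for TINA, generate for GENESIS, plus the tagger, fragmenter, word-for-word fallback, and composer sketched above), so this is an outline of the flow rather than the system's code.

    # Process flow of the robust translation system (cf. Figure 15).
    # parse() returns a semantic frame or None.

    def robust_translate(sentence, tag, parse, fragment_sentence,
                         word_for_word, compose, generate):
        # First attempt: parse the whole input, on the words or on a mix
        # of words and part-of-speech tags for unknown words.
        frame = parse(tag(sentence))
        if frame is not None:
            return generate(frame)

        # Fallback: fragment the sentence and understand each piece.
        main, fragments = fragment_sentence(sentence)
        fragment_frames = {}
        for label, piece in fragments:
            piece_frame = parse(tag(piece))
            if piece_frame is None:
                # Word-for-word understanding yields a placeholder frame
                # so that no fragment is dropped from the output.
                piece_frame = word_for_word(piece)
            fragment_frames[label] = piece_frame

        main_frame = parse(tag(main)) or word_for_word(main)
        return generate(compose(main_frame, fragment_frames))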

Input: Although opportunities to use deception should not be overlooked, the commander must also recognize situations where deception is not appropriate

Apple Pie Parser output:
(S (SBAR although (SS (NP opportunities (TOINF (VP to (VP use (NPL deception))))) (VP should not (VP be (VP overlooked))))) -COMMA- (NPL the commander) (VP must (ADVP also (VP recognize) (NP (NPL situations) (SBAR (WHADVP where) (SS (NPL deception) (VP is (ADJP not appropriate))))))))

Fragments:
adverbc1 comma the commander must also recognize situations relclause1
adverbc1 although opportunities toinfc1 should not be overlooked
relclause1 where deception is not appropriate
toinfc1 to use deception


FIGURE 14. Operation of the robust translation system for parsing and understanding sentence fragments, composing the results into a combined semantic frame, and producing the final translation and paraphrase. In this example, two fragments are processed. The parts of the figure are (a) parse tree 1, (b) semantic frame 1, (c) parse tree 2, (d) semantic frame 2, and (e) combined semantic frame with paraphrase and translation. The labels in red represent the categories that have been extracted by the fragmentation algorithm.

[Figure 14 example:
Input: Military art focuses on the direct use of military force to impose one's intent on an opponent.
Fragment 1: Military art focuses on the direct use of military force toinfc1
Fragment 2: toinfc1 to impose one's intent on an opponent
Each fragment is parsed and mapped to a semantic frame; the fragment frame tagged "toinfc1" is substituted for the to_infinitive "toinfc1" placeholder in the main frame, and the combined frame yields the paraphrase: Military art focuses on the direct use of military force to impose one's intent on an opponent.]



Our initial development of the robust translation system shown in Figure 15 was done on the C2W data, which, as mentioned earlier, included many complex sentences with an average sentence length of fifteen words. With an early version of the robust parser on 286 sentences of C2W data, 158 (55%) of these sentences were fully translated. Of these 158 fully translated sentences, 64 (22% of the input sentences) were both fully fragmented by the system and fully parsed. This approach has increased the parsing coverage and translation rate on complex sentences in our current translation material of Commander-in-Chief (CINC) daily briefings. We believe this approach provides aid to the user when full translation is not possible.

Software Implementation

Figure 16 illustrates the major modules of the current software implementation of CCLINC. The top-level module of the CCLINC system, the graphical user interface, interacts with both the English-to-Korean and Korean-to-English translation systems. The English-to-Korean translation system consists of three subsystems, namely, speech recognition, language understanding, and language generation. The language-understanding system interacts with two subsystems for robust processing: the rule-based part-of-speech tagger to handle unknown words, and the Apple Pie Parser and sentence fragmenter to handle complex sentences. The Korean-to-English translation system includes two subsystems that employ different approaches to machine translation: the transfer-based Korean-to-English system developed by SYSTRAN, and our interlingua-based Korean-to-English system under development.


FIGURE 15. Process flow of the robust translation system. Given an input sentence, the translation system assigns parts of speech to each word. Parsing takes place with the part-of-speech sequence as input. If parsing succeeds at this stage, the corresponding semantic frame is produced. If parsing does not succeed, the input sentence is fragmented, and parsing takes place on each fragment. Once parsing and semantic-frame generation of all of the fragments has been completed, the semantic frames for the fragments are composed. Generation proceeds with the composed semantic frame as input.



The translation system operates on UNIX platforms, and has been run on workstations under Solaris and on a Pentium laptop under PC Solaris and LINUX (Solaris and LINUX are versions of UNIX). The system with the part-of-speech tagger and the fragmenter uses about 50 MB of memory, and, depending on the size of the data files used by each module, the memory usage varies from 80 MB to 100 MB. The processing times for translation depend on the task domain, the grammar, the length and complexity of the input sentence, and the processor being used. For all the tasks we have run, translation is generally completed within a few seconds per sentence.

As an example, text-translation processing times for the MUC-II domain, with the system running on a 125-MHz Hypersparc workstation, ranged from 1.7 sec for an average sentence length of 12 words, to 2.3 sec for a 16-word sentence, to about 4 sec for a complex sentence containing 38 words. For the same processor, English speech recognition in the MUC-II domain runs in about two times real time. We caution the reader that the processing efficiency of a system is determined by various factors that include CPU speed, machine memory size, the size of data-grammar-lexicon files required by the system, and the complexity of the input data, which largely determines the parsing time. For a general introduction to the efficiency issue of different parsing techniques, see Reference 37.

FIGURE 16. Major software modules of the current implementation of the CCLINC automated translation system. The graphical user interface interacts with both the English-to-Korean and Korean-to-English translation systems. The English-to-Korean system consists of three subsystems: speech recognition, language understanding, and language generation. The language-understanding system interacts with two subsystems for robust processing: the rule-based part-of-speech tagger and the Apple Pie Parser and fragmenter. The Korean-to-English system consists of two systems that employ different approaches to machine translation: the interlingua-based system being developed at Lincoln Laboratory and the transfer-based system developed by SYSTRAN under a subcontract.

[Figure 16 block diagram: the graphical user interface, written in C code and a user-interface language, sits above the English-to-Korean modules (speech-recognition tool kit; TINA language understanding; GENESIS language generation; rule-based part-of-speech tagger; Apple Pie Parser and fragmenter, all in C code) and the Korean-to-English modules (the interlingua-based TINA and GENESIS system and the SYSTRAN transfer-based system).]


System Demonstrations and Task Definitions

From the outset, this project has focused on developing an automated translation system that would be useful for military coalition forces in Korea. Therefore, we have actively pursued user feedback in demonstrating and testing our technology in the user environment, and iteratively adjusted our efforts to respond to this feedback. These system demonstration and technology activities include our first visit to Korea in September 1994, a system demonstration on board the USS Coronado at the June 1996 Rim of the Pacific Coalition (RIMPAC 96) exercises, and a system demonstration at CFC Korea in April 1997 in conjunction with the Reception, Staging, Onward Movement, and Integration (RSO&I) exercises. During these exercises, we tested the system on new operational data comprising intelligence spot reports, intelligence summaries, and excerpts from CINC daily guidance letters and briefings.

The mission was successful in winning support and encouragement from high-ranking officers and military personnel who would be directly working with the system. We held discussions with CFC translators, operations personnel, and flag officers to help us define tractable translation tasks, with the CINC briefings becoming our highest priority. We also brought back samples of key operational material to be used in system development.

As a result of the RSO&I exercises, we are developing the system to translate CINC daily briefings, which consist of slides and speaker notes that are used by a presenter to explain each slide. The speaker notes that accompany a slide include longer, more complex sentences, and hence our robust translation approach of handling complex sentences is critical for translating this material. Our ultimate goal in training the translation system on CINC briefings is to allow CFC personnel to focus more on the content of the briefings than on translation. Figure 17 illustrates a briefing slide in which each of the English sentences has been translated into Korean by our system. Although the translation is accurate for this slide, a substantial amount of system development on similar material is needed before the translation accuracy on new CINC briefing material will be high enough for effective operational use. We plan to bring the system to Korea by spring of 1998 for tests on new CINC briefing material.

Summary and Plans

Substantial progress has been made in automated English-Korean translation. Major accomplishments in this project to date include (1) development and feasibility demonstrations of automated two-way English-Korean text and speech translation for military messages; (2) development of a modular, interlingua-based translation system that is extendable to multiple languages and to human interaction with C4I systems; (3) development of a multistage, robust translation system to handle complex text; (4) development of an integrated graphical user interface for a translator's aid; and (5) several successful demonstrations and technology transfer activities, including participation in the RIMPAC 96 coalition exercise on board the USS Coronado and the RSO&I coalition exercises at CFC Korea.

FIGURE 17. Sample slide of Commander-in-Chief (CINC) briefing material, in which each English sentence has been translated by CCLINC. The development of CCLINC to achieve high performance for a large variety of such material is the focus of our current work.

[Figure 17 sample slide, "CINC's Daily Guidance Letter," with each English bullet and its Korean translation displayed side by side. The English input reads:
• Purpose: disseminate CINC's guidance of the past 24 hours; planning guidance for future operations
• Objectives: summary of CINC's operational guidance for future ops; issue CINC's prioritized guidance
• Products: C3 Plans provide draft to C3 prior to Morning Update for CINC approval; C3 Plans provide approved letter to CFC Staff, Components, and Subordinates
• Guidance for Integrated Task Order (ITO)]



Our plans for the future involve extending the system capability to additional application domains, including translation of operations orders and operations plans. We will expand our recently begun effort in developing an interlingua-based Korean-to-English translation system by using the same understanding-based technology that we have applied to English-to-Korean translation. Ultimately, we hope to integrate the system's understanding capabilities with C4I systems to allow multilingual human-computer and human-human communication. One such application would involve a report translated by the system for communication among coalition partners. The report's meaning, captured in the semantic frame, would be conveyed to the C4I system to update databases with situation awareness information.

Acknowledgments

This project has benefited from the contributions of individuals inside and outside Lincoln Laboratory, and we particularly appreciate the contributions of and interactions with people in the DoD and research communities. We would like to cite the contributions of the following people: Ronald Larsen, Allen Sears, George Doddington, John Pennella, and Lt. Comdr. Robert Kocher, DARPA; Seok Hong, James Koh, Col. Joseph Jaremko, Lt. Col. Charles McMaster, Lt. David Yi, and Willis Kim, U.S. Forces Korea-Combined Forces Command; Beth Sundheim and Christine Dean, NRaD; Capt. Richard Williams and Neil Weinstein, USS Coronado, Command Ship of the Third Fleet; Victor Zue, James Glass, Ed Hurley, and Christine Pao, MIT Laboratory for Computer Science, Spoken Language Systems Group; Key-Sun Choi, Korean Advanced Institute for Science and Technology; Ralph Grishman, New York University; Martha Palmer, University of Pennsylvania; and Jerry O'Leary, Tom Parks, Marc Zissman, Don Chapman, Peter Jung, George Young, Greg Haber, and Dennis Yang, Lincoln Laboratory.


REFERENCES

1. W.J. Hutchins and H.L. Somers, An Introduction to Machine Translation (Academic, London, 1992).
2. D. Tummala, S. Seneff, D. Paul, C. Weinstein, and D. Yang, "CCLINC: System Architecture and Concept Demonstration of Speech-to-Speech Translation for Limited-Domain Multilingual Applications," Proc. 1995 ARPA Spoken Language Technology Workshop, Austin, Tex., 22–25 Jan. 1995, pp. 227–232.
3. C. Weinstein, D. Tummala, Y.-S. Lee, and S. Seneff, "Automatic English-to-Korean Text Translation of Telegraphic Messages in a Limited Domain," 16th Int. Conf. on Computational Linguistics '96, Copenhagen, 5–9 Aug. 1996, pp. 705–710; C-STAR II Proc. ATR International Workshop on Speech Translation, Kyoto, Japan, 10–11 Sept. 1996.
4. K.-S. Choi, S. Lee, H. Kim, D.-B. Kim, C. Kweon, and G. Kim, "An English-to-Korean Machine Translator: MATES/EK," Proc. 15th Int. Conf. on Computational Linguistics I, Kyoto, Japan, 5–9 Aug. 1994, pp. 129–131.
5. S. Seneff, "TINA: A Natural Language System for Spoken Language Applications," Computational Linguistics 18 (1), 1992, pp. 61–92.
6. J. Glass, J. Polifroni, and S. Seneff, "Multilingual Language Generation across Multiple Domains," 1994 Int. Conf. on Spoken Language Processing, Yokohama, Japan, 18–22 Sept. 1994, pp. 983–986.
7. J. Glass, D. Goodine, M. Phillips, M. Sakai, S. Seneff, and V. Zue, "A Bilingual VOYAGER System," Proc. Eurospeech, Berlin, 21–23 Sept. 1993, pp. 2063–2066.
8. V. Zue, S. Seneff, J. Polifroni, H. Meng, and J. Glass, "Multilingual Human-Computer Interactions: From Information Access to Language Learning," Proc. Int. Conf. on Spoken Language Processing, ICSLP-96 4, Philadelphia, 3–6 Oct. 1996, pp. 2207–2210.
9. D. Yang, "Korean Language Generation in an Interlingua-Based Speech Translation System," Technical Report 1026, MIT Lincoln Laboratory, Lexington, Mass., 21 Feb. 1996, DTIC #ADA-306658.
10. S. Martin, A Reference Grammar of Korean (Tuttle, Rutland, Vt., 1992).
11. C. Voss and B. Dorr, "Toward a Lexicalized Grammar for Interlinguas," J. Machine Translation 10 (1–2), 1995, pp. 143–184.
12. H.-M. Sohn, Korean (Routledge, London, 1994).
13. B.M. Sundheim, "Plans for a Task-Oriented Evaluation of Natural Language Understanding Systems," Proc. DARPA Speech and Natural Language Workshop, Philadelphia, 21–23 Feb. 1989, pp. 197–202.
14. B.M. Sundheim, "Navy Tactical Incident Reporting in a Highly Constrained Sublanguage: Examples and Analysis," Technical Document 1477, Naval Ocean Systems Center, San Diego, 1989.
15. Y.-S. Lee, C. Weinstein, S. Seneff, and D. Tummala, "Ambiguity Resolution for Machine Translation of Telegraphic Messages," Proc. Assoc. for Computational Linguistics, Madrid, 7–12 July 1997.
16. R. Grishman and J. Sterling, "Analyzing Telegraphic Messages," Proc. DARPA Speech and Natural Language Workshop, Philadelphia, 21–23 Feb. 1989, pp. 204–208.
17. J.S. White and T.A. O'Connell, "Evaluation in the ARPA Machine Translation Program: 1993 Methodology," Proc. Human Language Technology Workshop, Plainsboro, N.J., 8–11 Mar. 1994, pp. 135–140.
18. W.B. Kim and W.B. Rhee, "Machine Translation Evaluation," MITRE Working Note WN 94W0000198, Nov. 1994.
19. E. Brill and P. Resnik, "A Rule-Based Approach to Prepositional Phrase Attachment Disambiguation," Proc. COLING-1994, Kyoto, Japan, 5–9 Aug. 1994.
20. E. Brill, "A Simple Rule-Based Part of Speech Tagger," Proc. Third Conf. on Applied Natural Language Processing, ACL, Trento, Italy, 31 Mar.–3 Apr. 1992, pp. 152–155.
21. E. Brill, "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging," Computational Linguistics 21 (4), 1996, pp. 543–565.
22. S.J. Young, P.C. Woodland, and W.J. Byrne, HTK Version 1.5: User, Reference & Programmer Manual (Cambridge University Engineering Department and Entropic Research Laboratories, Inc., Sept. 1993).
23. P.C. Woodland, J.J. Odell, V. Valtchev, and S.J. Young, "Large Vocabulary Continuous Speech Recognition Using HTK," Proc. ICASSP '94 2, Adelaide, Australia, 19–22 Apr. 1994, pp. 125–128.
24. W.M. Fisher, G.R. Doddington, and K.M. Goudie-Marshall, "The DARPA Speech Recognition Research Database: Specifications and Status," Proc. DARPA Workshop on Speech Recognition, Palo Alto, Calif., Feb. 1986, pp. 93–99.
25. S. Young, "A Review of Large-Vocabulary Continuous-Speech Recognition," IEEE Signal Process. Mag. 13 (5), 1996, pp. 45–57.
26. A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavaldo, T. Zeppenfeld, and P. Zhan, "JANUS-III: Speech-to-Speech Translation in Multiple Languages," ICASSP-97 Proc. I, Munich, 21–24 Apr. 1997, pp. 99–102.
27. C-STAR II Proc. ATR Int. Workshop on Speech Translation, Kyoto, Japan, 10–11 Sept. 1996.
28. C-STAR Web Site, http://www.is.cs.cmu.edu/cstar/
29. J.-W. Yang and J. Park, "An Experiment on Korean-to-English and Korean-to-Japanese Spoken Language Translation," ICASSP-97 Proc. I, Munich, 21–24 Apr. 1997, pp. 87–90.
30. D.A. Bostad, "Aspects of Machine Translation in the United States Air Force," in AGARD: Benefits of Computer Assisted Translation to Information Managers and End Users, N91-13352, June 1990.
31. W.J. Hutchins and H.L. Somers, "Systran," in An Introduction to Machine Translation (Academic, London, 1992), pp. 175–206.
32. B. Dorr, "LCS-Based Korean Parsing and Translation," TCN No. 95008, Institute for Advanced Computer Studies and Department of Computer Science, University of Maryland, 1997.
33. H.-S. Park, "Korean Grammar Using TAGs," IRCS Report 94-28, Institute for Research in Cognitive Science, University of Pennsylvania, 1994.
34. S. Sekine and R. Grishman, "A Corpus-Based Probabilistic Grammar with Only Two Non-Terminals," Fourth Int. Workshop on Parsing Technology, Prague, 1995, pp. 216–223.
35. J.-T. Hwang, "A Fragmentation Technique for Parsing Complex Sentences for Machine Translation," M.Eng. Thesis, MIT Department of EECS, June 1997.
36. V. Fromkin and R. Rodman, An Introduction to Language, 4th ed. (Holt, Rinehart and Winston, Fort Worth, Tex., 1988).
37. J. Allen, Natural Language Understanding, 2nd ed. (Benjamin/Cummings, Redwood City, Calif., 1995).


CLIFFORD J. WEINSTEIN leads the Information Systems Technology group and is responsible for initiating and managing research programs in speech technology, machine translation, and information system survivability. He joined Lincoln Laboratory as an MIT graduate student in 1967, and became group leader of the Speech Systems Technology group (now the Information Systems Technology group) in 1979. He has made technical contributions and carried out leadership roles in research programs in speech recognition, speech coding, machine translation, speech enhancement, packet speech communications, information system survivability, integrated voice-data communication networks, digital signal processing, and radar signal processing. Since 1986, Cliff has been the U.S. technical specialist on the NATO RSG10 Speech Research Group, authoring a comprehensive NATO report and journal article on applying advanced speech technology in military systems. In 1993, he was elected an IEEE Fellow for technical leadership in speech recognition, packet speech, and integrated voice-data networks. He received S.B., S.M., and Ph.D. degrees in electrical engineering from MIT.

YOUNG-SUK LEE is a staff member in the Information Systems Technology group, and has been working on machine translation since joining Lincoln Laboratory in 1995. As a principal investigator of the Korean-English translation project, she helps develop and integrate several submodules of the CCLINC system, including English and Korean understanding and generation, part-of-speech tagging, robust parsing, grammar and lexicon acquisition and updating, and the graphical user interface. Her main research interest is in the development of interlingual representation with semantic frames for multilingual machine translation and other multilingual applications. Before coming to Lincoln Laboratory, she taught linguistics at Yale University. She received a B.A. degree in English linguistics and literature from Seoul National University, Korea, where she graduated summa cum laude in 1985. She also has an M.S.E. degree in computer and information science and a Ph.D. degree in linguistics from the University of Pennsylvania. She is a member of the Association for Computational Linguistics and the Linguistic Society of America.

STEPHANIE SENEFF is a principal research scientist in the Spoken Language Systems group at the MIT Laboratory for Computer Science. During the 1970s, she was a member of the research staff at Lincoln Laboratory, where her research encompassed a wide range of speech-processing topics, including speech synthesis, voice encoding, feature extraction (formants and fundamental frequency), speech transmission over networks, and speech recognition. Her doctoral thesis concerned a model for human auditory processing of speech, and some of her later work has focused on the application of auditory modeling to computer speech recognition. Over the past several years, she has become interested in natural language, and has participated in many aspects of the development of spoken language systems, including parsing, grammar development, discourse and dialogue modeling, probabilistic natural-language design, and integration between speech and natural language. She is a member of the Association for Computational Linguistics and the IEEE Society for Acoustics, Speech, and Signal Processing, serving on its Speech Technical Committee. She received a B.S. degree in biophysics, and M.S., E.E., and Ph.D. degrees in electrical engineering, all from MIT.

DINESH R. TUMMALA works to expand and adapt machine-translation systems to larger and new domains as a staff member in the Information Systems Technology group. He also develops semi-automated lexicon and grammar acquisition techniques. He joined Lincoln Laboratory in 1993, after researching pattern recognition systems and natural-language interfaces in information retrieval during internships at Digital Equipment Corporation. He received an S.B. degree in computer science and engineering and an S.M. degree in electrical engineering and computer science from MIT. He was awarded a National Science Foundation Graduate Fellowship.


BETH CARLSON is a former staff member of the Information Systems Technology group. She researched and developed algorithms for information retrieval, machine translation, and foreign language instruction before leaving Lincoln Laboratory in February 1997. Prior to this position, she worked for GTE Laboratories in Waltham, Mass., developing speech-recognition algorithms for telephone and cellular applications. She received B.E.E. and Ph.D. degrees in electrical engineering from Georgia Institute of Technology.

JOHN T. LYNCH worked with the Information Systems Technology group for twelve years before retiring in 1996 to study psychology. His research involved test and evaluation of speech technology systems and machine-translation systems. He also worked on applications for automated speech and text information retrieval and classification. During his last three years, he served as an appointed volunteer ombudsperson. He joined the Optical Communications group at Lincoln Laboratory in 1970 and worked for five years on various aspects of the Lincoln Experimental Satellites (LES) 8 and 9. He then spent three years at the MIT Center for Advanced Engineering Study as director of Tutored Video Instruction, a continuing education program for industrial engineers that videotapes MIT classes. This effort was followed by two years of developing superconducting signal-processing devices with the Analog Device Technology group at Lincoln Laboratory. He then joined the faculty of Boston University as associate professor of electrical engineering for three years before returning to Lincoln Laboratory. He received S.B. and S.M. degrees in electrical engineering from MIT and a Ph.D. degree in electrical engineering from Stanford University.

JUNG-TAIK HWANG works for JLM Technologies, Inc., in Boston, Mass., as a system architect and consultant, designing solutions to client problems. Prior to joining JLM Technologies, he was a research assistant in the Information Systems Technology group, working on techniques to improve the performance of machine translation of long sentences. He received B.S. and S.M. degrees in computer science from MIT.

LINDA C. KUKOLICH develops and maintains software systems for the Information Systems Technology group. Previously she developed software for the Optical Communications Systems Technology group. She received a B.S. degree in applied mathematics from MIT.