
THE PROFESSION

Languages and the Computing Profession

Neville Holmes, University of Tasmania

Translating natural language is too important and complex for computing professionals to tackle alone.


Around Christmas, feeling the need for some light technical reading and having long been interested in languages, I turned to a story in Computer’s Technology News department (Steven J. Vaughn-Nichols, “Statistical Language Approach Translates into Success,” Nov. 2003, pp. 14-16). Toward the end of the story, the following paragraph startled me:

Nonetheless, the grammatical systems of some languages are difficult to analyze statistically. For example, Chinese uses pictographs, and thus is harder to analyze than languages with grammatical signifiers such as spaces between words.

First, the Chinese writing system uses relatively few pictographs, and those few are highly abstracted. The system instead uses logographs—conventional representations of words or morphemes. Characters of the most common kind have two parts, one suggesting the general area of meaning, the other the pronunciation.

Second, most Chinese characters are words in themselves, so the space between two characters is a space between words. True, many words in modern Chinese need two and sometimes more characters, but these are compounds, much like English words such as password, output, and software.

Third, Chinese does have grammatical signifiers. Pointing a browser equipped to show Chinese characters at a URL such as www.ausdaily.net.au will immediately show a wealth of what are plainly punctuation marks.

Fourth, Chinese is an isolating language with invariant words. This should make it very easy to analyze statistically. English is full of prefixes and suffixes—the word prefixes itself has one of each—which leads to more difficult statistics.

I do not mean these observations to disparage the journalist who wrote the story, but they do suggest that some computing professionals may know less than they should about language.

LANGUAGE ANALYSIS

The news story contrasted two approaches to machine translation.

Knowledge-based systems rely on programmers to enter various languages’ vocabulary and syntax information into databases. The programmers then write lists of rules that describe the possible relationships between a language’s parts of speech.

Rather than using the knowledge-based system’s direct word-by-word translation techniques, statistical approaches translate documents by statistically analyzing entire phrases and, over time, ‘learning’ how various languages work.

The superficial difference seems to be that one technique translates word by word, the other phrase by phrase.

But what one language deems to be words another deems to be phrases—agglutinative languages mildly so, and synthetic languages drastically so, compared with relatively uninflected languages like English. Also, the components of a phrase can be contiguous in one language and dispersed in another—as in the case of German versus English, as Samuel Langhorne Clemens described (www.bdsnett.no/klaus/twain).

The underlying difference seems to be that the knowledge-based systems’ data for each language comes from grammarians, while the statistical systems’ data comes from a mechanical comparison of corresponding documents, the one a professional translation of the other.
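
The kind of mechanical comparison the statistical systems rely on can be pictured with a small sketch. The toy corpus, the two-word phrase limit, and the bare co-occurrence counts below are illustrative assumptions only, not a description of the systems the news story covered; real systems work from far larger corpora and far subtler statistics.

```python
from collections import Counter
from itertools import product

# Toy aligned corpus: each pair is a source sentence and a professional
# translation of it. The sentences are invented for illustration.
corpus = [
    ("the red house", "das rote haus"),
    ("the red car", "das rote auto"),
    ("a red car", "ein rotes auto"),
]

def phrases(tokens, max_len=2):
    """All contiguous phrases of one or two words."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)]

# Count how often each source phrase appears in the same sentence pair as
# each target phrase; frequently paired phrases become translation candidates.
counts = Counter()
for source, target in corpus:
    for s, t in product(phrases(source.split()), phrases(target.split())):
        counts[(s, t)] += 1

for (s, t), c in counts.most_common(5):
    print(f"{s!r} ~ {t!r}: {c}")
```

Even this toy version shows where the trouble starts: everything hinges on how the text is split into countable units, which is exactly the point the Chinese example raises.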

LANGUAGE TRANSLATION

Looking at translation generally, the problem with the statistical approach is that it requires two translation programs for every pair of languages: one going each way. Ab initio, the same is true of the grammatical approach.


The number of different languages is such that complete coverage requires numerous programs—101 languages would require 10,100 translation programs. Daunting when we consider the thousands of different languages still in popular use.

The knowledge-based or grammatical approach provides a way around this. If all translations use a single intermediate language, adding an extra language to the repertoire would require only two extra translation programs.
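
The arithmetic behind these two paragraphs is easy to check; the sketch below merely tabulates the two counts, and the function names are mine, not the article’s.

```python
def programs_pairwise(n: int) -> int:
    # Direct translation: one program for each ordered pair of distinct languages.
    return n * (n - 1)

def programs_via_intermediary(n: int) -> int:
    # Hub translation: two programs per language, into and out of the
    # intermediate language.
    return 2 * n

for n in (10, 101, 1000):
    print(f"{n:>5} languages: {programs_pairwise(n):>9,} direct, "
          f"{programs_via_intermediary(n):>5,} via an intermediary")
```

For 101 languages that is 10,100 direct programs against 202 via an intermediary, and each added language then costs two new programs rather than a couple of hundred.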

The news story does describe a similar approach, a transfer system, but this uses a lingua franca as the intermediate language, which in part is probably why it has been found unsatisfactory. The other unsatisfactory aspect is commercial: when an enterprise seeks merely to translate between two written languages, the extra stage adds complexity and execution time.

To cope with the variety of and within natural languages, a completely unnatural language must serve as the intermediary. Designing this intermediate language would be a huge and difficult task, but it would reap equally huge benefits.

Without this approach to machine translation, it would be difficult and expensive to cater for minor languages, to make incremental improvements as individual languages change or become better understood, and to add parameters that allow selection of styles, periods, regionalities, and other variations. When the translation adds conversion between speech and text at either end, adopting the intermediary approach will become more important, if not essential.

INTERMEDIATE LANGUAGE

The intermediate language must be like a semipermeable membrane that lets the meaning pass through freely while blocking idiosyncrasies. Although designing and managing the intermediary would be a nearly overwhelming task, certain necessary characteristics suggest a starting point.

• Specificity. Every primary meaning must have only one code, and every primary code must have only one meaning. The difficulty here is deciding which meanings are primary.

• Precision. A rich range of qualifying codes must derive secondary meanings from primary meanings and assign roles to meanings within their context.

• Regularity. The rules for combining and ordering codes, and for systematic codes such as those for colors, must be free from exceptions and variations.

• Literality. The intermediate language must exclude idioms, clichés, hackneyed phrases, puns, and the like, although punctuational codes could be used to mark their presence.

• Neutrality. Proper names, most technical terms, monocultural words, explicit words such as inkjet when used as shown here, and possibly many other classes of words must pass through the intermediate language without change other than, when needed, transliteration.

My use of the term “code” in these suggested characteristics, rather than morpheme or word, is deliberate. Designing the intermediate language to be spoken as words and thus to serve as an auxiliary language would be a mistake.

First, designing the intermediate language for general auxiliary use would unnecessarily and possibly severely impair its function as an intermediary. Second, a global auxiliary language’s desirable properties differ markedly from those needed for an intermediary in translation, as the auxiliary language Esperanto’s failure in the intermediary role demonstrates.

Indeed, given the possibility of general machine translation, an argument can be made against the very idea of a global auxiliary language. Natural languages—the essence of individual cultures—are disappearing much faster than they are appearing. Global acceptance of an auxiliary language would foster such disappearances. Versatile machine translation, particularly when speech-to-speech translation becomes practical, would lessen the threat to minor languages.

WORK TO BE DONE

Defining the intermediate language requires developing and verifying its vocabulary and grammar as suitable for mediating translation between all classes and kinds of natural language.

The vocabulary—the semantic structure, specifically the semes and their relationships—will in effect provide a universal semantic taxonomy. The semes would be of many different kinds, both abstract and concrete. A major challenge will be deciding which meanings are distinct and universal enough to warrant their own seme and where to place them in the seme hierarchy. The key professionals doing such work will be philosophers and semanticists.
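
To make the idea of a seme hierarchy concrete, here is a minimal sketch of one way such a taxonomy might be represented. The seme codes, glosses, and parent-child layout are invented for illustration; as the column says, the real taxonomy would be the work of philosophers and semanticists.

```python
from dataclasses import dataclass, field

@dataclass
class Seme:
    code: str                       # one code per primary meaning
    gloss: str                      # human-readable description
    children: list = field(default_factory=list)

    def add(self, code, gloss):
        child = Seme(code, gloss)
        self.children.append(child)
        return child

# Invented fragment of a hierarchy, purely for illustration.
root = Seme("ENTITY", "anything that can be referred to")
artifact = root.add("ARTIFACT", "thing made by people")
artifact.add("DWELLING", "structure people live in")
artifact.add("VEHICLE", "artifact used for transport")

def lookup(node, code):
    """Depth-first search for a seme by its code."""
    if node.code == code:
        return node
    for child in node.children:
        found = lookup(child, code)
        if found is not None:
            return found
    return None

print(lookup(root, "VEHICLE").gloss)   # artifact used for transport
```

The point of the structure is the Specificity requirement above: one code per primary meaning, found by its code rather than by any natural-language word.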

The rules for associating and separating semes and seme clusters, the grammar, would encompass the work of punctuation, although much of the meaning found in natural-language punctuation could be coded in the intermediate language’s semes, unless implied by the language’s grammar. The intermediary grammar might need to designate some semes—for example, some of the two dozen or so meanings given for the term “the” in the Oxford English Dictionary—as required to be inferred if they are not present in the source language. The key professionals in this work will be translators, interpreters, and linguists.


When involved in a project to develop an intermediary language, these two groups of professionals will need to work closely together, as grammar and vocabulary are closely interdependent. In this case, both must cope with the translation of many hundreds of wildly different languages.

What role would computing professionals play in such a team? Given the project’s purpose—to make general machine translation possible—computing professionals would be of vital importance, but in a supporting role. Using different approaches to evaluate the intermediate language and its use for a variety of languages would require a succession of translation programs.

Those involved in this project will need to consider how to keep Web pages in both their original language and the intermediary so that browsers could, if necessary, translate the page easily into any user’s preferred language. Allied to this requirement would be consideration of how to index the intermediary text so that all of the Web’s content would be available to searchers. Indeed, the qualities of an intermediate language could make search engines much more effective.
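
One way to picture the indexing suggestion is sketched below: each page keeps its original text alongside an intermediary rendering, the index is built over intermediary codes, and a query in any language can then match a page written in any other. The page records, the tiny glossary, and the to_codes stand-in are assumptions for illustration, not a proposal from the column.

```python
# to_codes is a trivial stand-in for a real natural-language-to-intermediary
# translator; its glossary covers only the toy vocabulary used here.
def to_codes(text: str, language: str) -> set[str]:
    """Placeholder: map natural-language text to intermediary codes."""
    glossary = {
        "en": {"house": "DWELLING", "car": "VEHICLE", "red": "COLOR.RED"},
        "de": {"haus": "DWELLING", "auto": "VEHICLE", "rot": "COLOR.RED"},
    }
    return {glossary[language].get(word, word.upper())
            for word in text.lower().split()}

pages = [
    {"url": "example.org/a", "language": "en", "text": "red house"},
    {"url": "example.org/b", "language": "de", "text": "rot auto"},
]

# Build the index over intermediary codes, keeping the original text intact.
index = {page["url"]: to_codes(page["text"], page["language"]) for page in pages}

def search(query: str, query_language: str) -> list[str]:
    """Return URLs whose intermediary codes overlap the query's codes."""
    wanted = to_codes(query, query_language)
    return [url for url, codes in index.items() if wanted & codes]

print(search("Haus", "de"))   # finds the English page via the shared codes
```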

Translation of SMS messages and e-mail should also be studied; ultimately, use of the intermediate language in telephones for speech translation should become possible. Users would select the natural language to use on their phone. The translation might then be through text, staged with speech-to-text conversion, or the processor might convert speech directly to or from the intermediate language. In any case, intermediary codes would be transmitted between users’ phones, and thus the language of one user would be independent of another user’s.
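
A sketch of the flow that last sentence implies follows: each handset converts between its owner’s language and the intermediary, and only intermediary codes cross the network. The two converters below are trivial stand-ins for what would really be full translation and speech components, and the tiny lexicons are invented.

```python
def to_intermediary(text: str, language: str) -> list[str]:
    """Placeholder: analyze text in the caller's language into codes."""
    lexicon = {"en": {"where": "QUERY.PLACE", "station": "RAIL.STATION"}}
    return [lexicon.get(language, {}).get(w, w.upper()) for w in text.lower().split()]

def from_intermediary(codes: list[str], language: str) -> str:
    """Placeholder: render intermediary codes in the callee's language."""
    lexicon = {"fr": {"QUERY.PLACE": "où est", "RAIL.STATION": "la gare"}}
    return " ".join(lexicon.get(language, {}).get(c, c.lower()) for c in codes)

def transmit(codes: list[str]) -> list[str]:
    """The network hop: nothing but codes crosses the wire."""
    return list(codes)

# Caller's phone is set to English, callee's to French; neither end needs to
# know the other's language. (Two-word toy utterance keeps the lexicons tiny.)
codes = to_intermediary("where station", "en")
print(from_intermediary(transmit(codes), "fr"))   # "où est la gare"
```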

General use of such speech translation would trail text translation by a long way, but even general text translation would promote global cooperation, providing an excellent return on investment in the project.

I began this essay when reports from the UN Forum on the Digital Divide in Geneva first became public. The failure of this beanfeast was both predictable (“The Digital Divide, the UN, and the Computing Profession,” Computer, Dec. 2003, pp. 144, 142-143) and a scandalous waste of money given the number of poor in the world dying daily of hunger or cheaply curable illnesses.

Strategically, a much better way to use digital technology to help the poor and counter global inequity and its symptomatic digital divide would be for the UN to take responsibility for the development and use of a global intermediate translation language. International support would be essential, both to make swift development possible and, more importantly, to protect the work from intellectual-property predators.

Success would make truly global use of the Internet possible. Ultimately, with translation and speech-to-text conversion built into telephones, UN and other aid workers could talk to the economically disadvantaged without human interpreters.

However, an intermediate language project such as this could not be contemplated without the strong and active support of various professional bodies, particularly those from the fields of computing, philosophy, and language. Computing professionals should work with others to get public attention for this project and ensure that the needed professional support is made available.

Neville Holmes is an honorary research associate at the University of Tasmania’s School of Computing. Contact him at [email protected]. Details of citations in this essay, and links to further material, are at www.comp.utas.edu.au/users/nholmes/prfsn.
