
A STATISTICAL APPROACH TO MACHINE

TRANSLATION

Seminar Report

By

B HRISHIKESH

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING Rajagiri School of Engineering & Technology Rajagiri Valley, Kakkanad, Kochi – 682 039

OCTOBER 2014

RAJAGIRI SCHOOL OF ENGINEERING & TECHNOLOGY

RAJAGIRI VALLEY, KAKKANAD, KOCHI – 682 039

CERTIFICATE

Certified that this is a Bonafide Record of the Seminar on A STATISTICAL

APPROACH TO MACHINE TRANSLATION done by B HRISHIKESH

university register number 11012288 of branch Computer Science and

Engineering during the Seventh Semester in the year 2014 at Rajagiri School Of

Engineering & Technology, Kakkanad, Kochi.

Seminar-In-Charge Head of the Department


ABSTRACT

Machine Translation (MT) refers to the use of computers for the task of translating

automatically from one language to another. The differences between languages and

especially the inherent ambiguity of language make MT a very difficult problem. Traditional

approaches to MT have relied on humans supplying linguistic knowledge in the form of rules

to transform text in one language to another. Given the vastness of language, this is a highly

knowledge intensive task. Statistical MT is a radically different approach that automatically

acquires knowledge from large amounts of training data. This knowledge, which is typically

in the form of probabilities of various language features, is used to guide the translation

process. This report provides an overview of MT techniques, and looks in detail at the basic

statistical model.


ACKNOWLEDGEMENT

First and foremost, I would like to thank God Almighty for all the blessings he has showered

on me during this project time.

I would like to express my hearty thanks to our Principal, Dr. A Unnikrishanan, for the

facilities offered by him to enable me to do this project.

I am highly indebted to Mr. Ajith S, HOD, Department of Computer Science and

Engineering, RSET for his encouragement and guidance for the completion of the design

phase of the project.

My sincere thanks to Mrs Sangeetha Jamal, Assistant Professor, Department of Computer

Science and Engineering, RSET for her valuable and timely suggestion and support.

I humbly express my heartfelt thanks to all teaching and non-teaching staffs of Rajagiri

School of Engineering and Technology for their cooperation and support.

Finally, I am obliged to all my friends and family members for their suggestions and moral

support throughout the project.


TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGEMENT

LIST OF FIGURES

LIST OF ABBREVIATIONS

1 INTRODUCTION

2 MACHINE TRANSLATION: AN OVERVIEW

3 STATISTICAL MACHINE TRANSLATION: AN OVERVIEW

4 LANGUAGE MODELING USING N-GRAMS

5 ALIGNMENTS

6 TRANSLATION MODELING

7 SEARCH

8 PARAMETER ESTIMATION

9 TWO PILOT EXPERIMENTS

10 PLANS

11 QUESTIONS AND CLARIFICATIONS

12 CONCLUSION

13 REFERENCES


LIST OF FIGURES

Figure 2.1.1: Machine Translation Pyramid

Figure 3.1: The Noisy Channel Model for Machine Translation

Figure 4.1: Bigram probabilities from a corpus

Figure 5.1: An alignment with independent English words

Figure 5.2: An alignment with independent French words

Figure 5.3: A general alignment

Figure 6.1: An example alignment

Figure 9.1: Probabilities for "the"

Figure 9.2: Probabilities for "not"

Figure 9.3: Probabilities for "hear"

Figure 9.4: Translation Examples

Figure 9.5: Translation Results


LIST OF ABBREVIATIONS

1. FST: Finite State Transducer

2. FSM: Finite State Machine

3. FSA: Finite State Automaton

4. HMM: Hidden Markov Model

5. HAMT: Human Aided Machine Translation

6. MT: Machine Translation 


1. INTRODUCTION

1.1 Preamble

Statistical machine translation (SMT) treats the translation

of natural language as a machine learning problem. By examining many samples of human-produced translations, SMT algorithms automatically learn how to translate.

1.2 Scope and Objective

Machine Translation (MT) can be defined as the use of computers to automate

some or all of the process of translating from one language to another. MT is an area of applied research that draws ideas and techniques from linguistics, computer science, Artificial Intelligence (AI), translation theory, and statistics. Work began in this field as early as the late 1940s, and various approaches — some ad hoc, others based on elaborate theories — have been tried over the past five decades. This report discusses the statistical approach to MT, which was first suggested by Warren Weaver in 1949 [Weaver, 1949], but has found practical relevance only in the last decade or so. This approach has been made feasible by the vast advances in computer technology, in terms of speed and storage capacity, and the availability of large quantities of text data.

1.3 Literature Survey and Current State of Art

In 1949 Warren Weaver suggested that the problem be attacked with statistical

methods and ideas from information theory, an area which he, Claude Shannon, and others were developing at the time (Weaver 1949). Although researchers quickly abandoned this approach, advancing numerous theoretical objections, we believe that the true obstacles lay in the relative impotence of the available computers and the dearth of machine readable text from which to gather the statistics vital to such an attack. Today, computers are five orders of magnitude faster than they were in 1950 and have hundreds of millions of bytes of storage. Large, machine-readable corpora are readily available. Statistical methods have proven their value in automatic speech recognition (Bahl et al. 1983) and have recently been applied to lexicography (Sinclair 1985) and to natural language processing (Baker 1979; Ferguson 1980; Garside et al. 1987; Sampson 1986; Sharman et al. 1988).


2. MACHINE TRANSLATION: AN OVERVIEW

Machine Translation (MT) can be defined as the use of computers to automate

some or all of the process of translating from one language to another. MT is an area of applied research that draws ideas and techniques from linguistics, computer science, Artificial Intelligence (AI), translation theory, and statistics. Work began in this field as early as in the late 1940s, and various approaches — some ad hoc, others based on elaborate theories — have been tried over the past five decades. This report discusses the statistical approach to MT, which was first suggested by Warren Weaver in 1949 [Weaver, 1949], but has found practical relevance only in the last decade or so. This approach has been made feasible by the vast advances in computer technology, in terms of speed and storage capacity, and the availability of large quantities of text data.

2.1. TYPES OF MACHINE TRANSLATION SYSTEMS

Machine translation systems that produce translations between only two particular languages are called bilingual systems, and those that produce translations for any given pair of languages are called multilingual systems. Multilingual systems may be either uni-directional or bi-directional; bi-directional systems are generally preferred, since they can translate from any given language to any other given language and back again.

Figure 2.1.1: Machine Translation Pyramid

2.1.1 DIRECT MACHINE TRANSLATION APPROACH

The direct translation approach is the oldest and simplest approach. Machine translation systems that use this approach translate a language, called the source language (SL), directly into another language, called the target language (TL). The analysis of SL texts is oriented to only one TL. Direct translation systems are basically


bilingual and uni-directional. The direct translation approach needs only a little syntactic and semantic analysis; SL analysis is oriented specifically to the production of representations appropriate for one particular TL.

2.1.2 INTERLINGUA APPROACH

The interlingua approach is intended to translate SL texts into more than one target language. Translation is from the SL to an intermediate form called the interlingua (IL), and then from the IL to the TL. The interlingua may be an artificial language or an auxiliary language such as Esperanto with a universal vocabulary. The interlingua approach requires complete resolution of all ambiguities in the SL text.

2.1.3 TRANSFER APPROACH

Unlike the interlingua approach, the transfer approach involves three stages. In the first stage, SL texts are converted into abstract SL-oriented representations. In the second stage, SL-oriented representations are converted into equivalent TL-oriented representations. Final texts are generated in the third stage. In the transfer approach, complete resolution of the ambiguities of the SL text is not required; only the ambiguities inherent in the language itself are tackled. Three types of dictionaries are required: SL dictionaries, TL dictionaries and a bilingual transfer dictionary. Transfer systems have separate grammars for SL analysis, TL analysis and for the transformation of SL structures into equivalent TL forms.


3. STATISTICAL MACHINE TRANSLATION: AN OVERVIEW

One way of thinking about MT is using the noisy channel metaphor. If we want to translate a sentence f in the source language F into a sentence in the target language E, the noisy channel model describes the situation in the following way: we suppose that the sentence f to be translated was initially conceived in language E as some sentence e. During communication, e was corrupted by the channel into f. Now, we assume that each sentence in E is a translation of f with some probability, and the sentence that we choose as the translation (ê) is the one that has the highest probability. In mathematical terms [Brown et al., 1990],

ê = argmax_e P(e|f)

Intuitively, P (e|f) should depend on two factors:

1. The kind of sentences that are likely in the language E. This is known as the language model — P (e).

2. The way sentences in E get converted to sentences in F. This is called the

translation model — P (f|e).

This intuition is borne out by applying Bayes' rule:

ê = argmax_e [ P(e) P(f|e) / P(f) ]

Figure 3.1: The Noisy Channel Model for Machine Translation (the language model P(e) and the translation model P(f|e) feed a decoder that computes ê = argmax_e P(e|f))

Since f is fixed, we can omit P(f) from the maximization and get:

ê = argmax_e P(e) P(f|e)

This model for MT is illustrated in Figure 3.1.
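To make the decision rule above concrete, here is a minimal Python sketch (not part of any system described in this report) that picks, from a small hand-built candidate set, the sentence maximizing P(e) P(f|e). The candidate sentences, the probability values and the function names are all invented for illustration.

```python
# A minimal sketch of the noisy-channel decision rule e_hat = argmax_e P(e) * P(f|e),
# assuming a small set of candidate English sentences with toy probabilities.

import math

def language_model(e):
    # Toy P(e): hypothetical hand-set values, for illustration only.
    toy_lm = {
        "the boy runs": 0.002,
        "runs boy the": 0.00001,
    }
    return toy_lm.get(e, 1e-9)

def translation_model(f, e):
    # Toy P(f|e): in a real system this comes from a trained translation model.
    toy_tm = {
        ("le garcon court", "the boy runs"): 0.01,
        ("le garcon court", "runs boy the"): 0.01,  # word order ignored here
    }
    return toy_tm.get((f, e), 1e-12)

def best_translation(f, candidates):
    # Work in log space to avoid numerical underflow on longer sentences.
    return max(candidates,
               key=lambda e: math.log(language_model(e)) +
                             math.log(translation_model(f, e)))

if __name__ == "__main__":
    f = "le garcon court"
    print(best_translation(f, ["the boy runs", "runs boy the"]))
    # -> "the boy runs": P(f|e) is the same for both candidates,
    #    but P(e) prefers the grammatical word order.
```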


Given a French sentence f, we seek the English sentence e that maximizes P(e | f) (the "most likely" translation). Sometimes we write:

argmax_e P(e | f)

Read this argmax as follows: "the English sentence e, out of all such sentences, which yields the highest value for P(e | f)." If you want to think of this in terms of computer programs, you could imagine one program that takes a pair of sentences e and f, and returns a probability P(e | f). We will look at such a program later on.

  e ----> +----+
          |    | ----> P(e|f)
  f ----> +----+

Or, you could imagine another program that takes a sentence f as input, and outputs every conceivable string ei along with its P(ei | f). This program would take a long time to run, even if you limit English translations to some arbitrary length.

          +----+ ----> e1, P(e1 | f)
  f ----> |    |  ...
          +----+ ----> en, P(en | f)

Now we have a way of finding the best translation given a set of candidate translations using the equation above, but what are these candidate translations? Unlike in our earlier supposition, we cannot practically consider each sentence in the target language. Therefore, we need a heuristic search method that can efficiently find the sentence with (close to) the highest probability. Thus, statistical translation requires three key components:

1. Language model
2. Translation model
3. Search algorithm

We take up the first two components in turn in the next couple of chapters. The last problem is the standard decoding problem in AI, and variants of the Viterbi and A* algorithms are used in statistical MT to solve it.


3.1 Word Reordering in Translation

If we reason directly about translation using P(e | f), then our probability estimates had better be very good. On the other hand, if we break things apart using Bayes Rule, then we can theoretically get good translations even if the probability numbers aren't that accurate. For example, suppose we assign a high value to P(f | e) only if the words in f are generally translations of words in e. The words in f may be in any order: we don't care. Well, that's not a very accurate model of how English gets turned into French. Maybe it's an accurate model of how English gets turned into really bad French.

Now let's talk about P(e). Suppose that we assign a high value to P(e) only if e is grammatical. That's pretty reasonable, though difficult to do in practice. An interesting thing happens when we observe f and try to come up with the most likely translation e. Every e gets the score P(e) * P(f | e). The factor P(f | e) will ensure that a good e will have words that generally translate to words in f. Various "English" sentences will pass this test. For example, if the string "the boy runs" passes, then "runs boy the" will also pass. Some word orders will be grammatical and some will not. However, the factor P(e) will lower the score of ungrammatical sentences.

In effect, P(e) worries about English word order so that P(f | e) doesn't have to. That makes P(f | e) easier to build than you might have thought. It only needs to say whether or not a bag of English words corresponds to a bag of French words. This might be done with some sort of bilingual dictionary. Or to put it in algorithmic terms, this module needs to be able to turn a bag of French words into a bag of English words, and assign a score of P(f | e) to the bag-pair.


4. LANGUAGE MODELING USING N-GRAMS

Language modeling is the task of assigning a probability to each unit of text. In the context of statistical MT, as described in the previous chapter, a unit of text is a sentence. That is, given a sentence e, our task is to compute P(e). For a sentence containing the word sequence w1 w2 . . . wn, we can write, without loss of generality,

P(e) = P(w1) P(w2|w1) P(w3|w1 w2) . . . P(wn|w1 w2 . . . wn−1)

The problem here, and in fact in all language models, is that of data sparsity. Specifically, how do we calculate probabilities such as P(wn|w1 w2 . . . wn−1)? In no corpus will we find instances of all possible sequences of n words; actually we will find only a minuscule fraction of such sequences. A word can occur in just too many contexts (histories of words) for us to count off these numbers. Thus, we need to approximate the probabilities using what we can find more reliably from a corpus. N-gram models provide one way of doing this.

4.1 The N-gram approximation

In an N-gram model [Jurafsky and Martin, 2000], the probability of a word given all previous words is approximated by the probability of the word given the previous N−1 words. The approximation thus works by putting all contexts that agree in the last N−1 words into one equivalence class. With N = 2, we have what is called the bigram model, and N = 3 gives the trigram model. Though linguistically simple-minded, N-grams have been used successfully in speech recognition, spell checking, part-of-speech tagging and other tasks where language modeling is required. This is because of the ease with which such models can be built and used. More sophisticated models are possible — for example, one that gives each sentence structure a certain probability (using a probabilistic grammar). Such models, however, are much more difficult to build, basically because of the non-availability of large enough annotated corpora. N-gram models, on the other hand, can work with raw corpora, and are easier to build and use. N-gram probabilities can be computed in a straightforward manner from a corpus. For example, bigram probabilities can be calculated as

P(wn|wn−1) = count(wn−1 wn) / Σw count(wn−1 w)


Here count(wn−1 wn) denotes the number of occurrences of the sequence wn−1 wn. The denominator on the right-hand side sums over all words w in the corpus — that is, the number of times wn−1 occurs before any word. Since this is just the count of wn−1, we can write it as

P(wn|wn−1) = count(wn−1 wn) / count(wn−1)
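This maximum-likelihood estimate is easy to compute from raw text. The Python sketch below uses an invented three-sentence corpus and a <s> start symbol (an assumption for the sketch, not something prescribed by the report) to count unigrams and bigrams and to score a sentence as a product of bigram probabilities.

```python
# A minimal sketch of maximum-likelihood bigram estimation:
# P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1}).
# The tiny corpus below is invented for illustration only.

from collections import Counter

corpus = [
    "all men are equal",
    "all men are mortal",
    "men are equal",
]

unigram = Counter()
bigram = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()     # <s> marks the sentence start
    unigram.update(words[:-1])             # counts of the conditioning word
    bigram.update(zip(words[:-1], words[1:]))

def p_bigram(prev, word):
    # Maximum-likelihood estimate; zero for unseen bigrams (hence the need for smoothing).
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

def p_sentence(sentence):
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_bigram("men", "are"))               # 1.0 in this toy corpus
print(p_sentence("all men are equal"))      # product of bigram probabilities
print(p_sentence("all animals are equal"))  # 0.0: unseen bigram "all animals"
```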

Given the bigram probabilities in Figure 4.1, the probability of the sentence "all men are equal" is calculated as

P(all men are equal) = 0.16 × 0.09 × 0.24 × 0.08 ≈ 0.00028

Now consider assigning a probability to the sentence "all animals are equal", assuming that the sequence "all animals" never occurs in the corpus. That is, P(animals|all) = 0. This means that the probability of the entire sentence is zero! Intuitively, "all animals are equal" is not such an improbable sentence, and we would like our model to give it a non-zero probability, which our bigram model fails to do.

Bigram       Probability
start all    0.16
all men      0.09
men are      0.24
are equal    0.08

Figure 4.1: Bigram probabilities from a corpus

This brings us back to the problem of data sparsity that we mentioned at the beginning of this chapter. We simply cannot hope to find reliable probabilities for all possible bigrams from any given corpus, which is after all finite. The task of the language model, then, is not just to give probabilities to those sequences that are present in the corpus, but also to make reasonable estimates when faced with a new sequence. Mathematically, our model above uses the technique called Maximum Likelihood Estimation. This is so called because, given a training corpus T, the model M that is generated is such that P(T|M) is maximized.


This means that the entire probability mass is distributed among sequences that occur in the training corpus, and nothing remains for unseen sequences. The task of distributing some of the probability to unseen or zero-probability sequences is called smoothing.

4.2 Smoothing

N-gram models can assign non-zero probabilities to sentences they have never seen before. It's a good thing, like Martha Stewart says. The only way you'll get a zero probability is if the sentence contains a previously unseen bigram or trigram. That can happen. In that case, we can do smoothing. If "z" never followed "xy" in our text, we might further wonder whether "z" at least followed "y". If it did, then maybe "xyz" isn't so bad. If it didn't, we might further wonder whether "z" is even a common word or not. If it's not even a common word, then "xyz" should probably get a low probability. Instead of

b(z | x y) = number-of-occurrences("xyz") / number-of-occurrences("xy")

we can use

b(z | x y) = 0.95 * number-of-occurrences("xyz") / number-of-occurrences("xy")
           + 0.04 * number-of-occurrences("yz") / number-of-occurrences("y")
           + 0.008 * number-of-occurrences("z") / total-words-seen
           + 0.002

It's handy to use different smoothing coefficients in different situations. You might want 0.95 in the case of xy(z), but 0.85 in another case like ab(c). For example, if "ab" doesn't occur very much, then the counts of "ab" and "abc" might not be very reliable. Notice that as long as we have that "0.002" in there, then no conditional trigram probability will ever be zero, so no P(e) will ever be zero. That means we will assign some positive probability to any string of words, even if it's totally ungrammatical. Sometimes people say ungrammatical things after all. We'll have more to say about estimating those smoothing coefficients in a minute.

This n-gram business is not set in stone. You can come up with all sorts of different generative models for language. For example, imagine that people first mentally produce a verb. Then they produce a noun to the left of that verb. Then they produce adjectives to the left of that noun. Then they go back to the verb and produce a noun to the right. And so on. When the whole sentence is formed, they speak it out loud.
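The interpolated estimate above can be written down almost verbatim in code. The sketch below uses the coefficients quoted in the text as fixed constants (in practice they would be tuned, for instance on held-out data) and an invented toy word sequence for the counts.

```python
# A minimal sketch of the interpolated trigram estimate described above:
# b(z | x y) = 0.95 * count(xyz)/count(xy)
#            + 0.04 * count(yz)/count(y)
#            + 0.008 * count(z)/total_words
#            + 0.002
# Coefficients are the illustrative values from the text; counts are toy values.

from collections import Counter

words = "the boy runs and the girl runs and the boy jumps".split()

uni = Counter(words)
bi = Counter(zip(words, words[1:]))
tri = Counter(zip(words, words[1:], words[2:]))
total = len(words)

def smoothed_trigram(x, y, z):
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
    p1 = uni[z] / total
    return 0.95 * p3 + 0.04 * p2 + 0.008 * p1 + 0.002

print(smoothed_trigram("the", "boy", "runs"))    # seen trigram: large estimate
print(smoothed_trigram("the", "boy", "girl"))    # unseen trigram: small but non-zero
print(smoothed_trigram("boy", "girl", "xyzzy"))  # everything unseen: floor of 0.002
```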


5. ALIGNMENTS

We say that a pair of strings that are translations of one another form a translation, and we show this by enclosing the strings in parentheses and separating them by a vertical bar. Thus, we write the translation (Qu'aurions-nous pu faire? | What could we have done?) to show that What could we have done? is a translation of Qu'aurions-nous pu faire? When the strings end in sentences, we usually omit the final stop unless it is a question mark or an exclamation point.

Brown et al. (1990) introduce the idea of an alignment between a pair of strings as an object indicating, for each word in the French string, that word in the English string from which it arose. Alignments are shown graphically, as in Figure 5.1, by drawing lines, which we call connections, from some of the English words to some of the French words. The alignment in Figure 5.1 has seven connections: (the, Le), (program, programme), and so on. Following the notation of Brown et al., we write this alignment as (Le programme a été mis en application | And the(1) program(2) has(3) been(4) implemented(5,6,7)). The list of numbers following an English word shows the positions in the French string of the words to which it is connected. Because And is not connected to any French words here, there is no list of numbers after it. We consider every alignment to be correct with some probability, and so we find (Le programme a été mis en application | And(1,2,3,4,5,6,7) the program has been implemented) perfectly acceptable. Of course, we expect it to be much less probable than the alignment shown in Figure 5.1.

In Figure 5.1 each French word is connected to exactly one English word, but more general alignments are possible and may be appropriate for some translations. For example, we may have a French word connected to several English words, as in Figure 5.2, which we write as (Le reste appartenait aux autochtones | The(1) balance(2) was(3) the(3) territory(3) of(4) the(4) aboriginal(5) people(5)). More generally still, we may have several French words connected to several English words, as in Figure 5.3, which we write as (Les pauvres sont démunis | The(1) poor(2) don't(3,4) have(3,4) any(3,4) money(3,4)). Here, the four English words don't have any money work together to generate the two French words sont démunis.

In a figurative sense, an English passage is a web of concepts woven together according to the rules of English grammar. When we look at a passage, we cannot see the concepts directly but only the words that they leave behind. To show that these words are related to a concept but are not quite the whole story, we say that they form a cept. Some of the words in a passage may participate in more than one cept, while others may participate in none, serving only as a sort of syntactic glue to bind the whole together. When a passage is translated into French, each of its cepts contributes some French words to the translation. We formalize this use of the term cept and relate it to the idea of an alignment as follows. We call the set of English words connected to a French word in a particular alignment


the cept that generates the French word. Thus, an alignment resolves an English string into a set of possibly overlapping cepts that we call the ceptual scheme of the English string with respect to the alignment. The alignment in Figure 5.3 contains the three cepts The, poor, and don't have any money. When one or more of the French words is connected to no English words, we say that the ceptual scheme includes the empty cept and that each of these words has been generated by this empty cept.

Figure 5.1: An alignment with independent English words.

Figure 5.2: An alignment with independent French words.

Formally, a cept is a subset of the positions in the English string together with the words occupying those positions. When we write the words that make up a cept, we sometimes affix a subscript to each one showing its position. The alignment in Figure 5.2 includes the cepts the1 and of6 the7, but not the cepts of6 the1 or the7. In (J'applaudis à la décision | I(1) applaud(2) the(4) decision(5)), à is generated by the empty cept. Although the empty cept has no position, we place it by convention in position zero, and write it as e0. Thus, we may also write the previous alignment as (J'applaudis à la


décision | e0(3) I(1) applaud(2) the(4) decision(5)).

Figure 5.3: A general alignment.

We denote the set of alignments of (f|e) by A(e, f). If e has length l and f has length m, there are lm different connections that can be drawn between them, because each of the m French words can be connected to any of the l English words. Since an alignment is determined by the connections that it contains, and since a subset of the possible connections can be chosen in 2^(lm) ways, there are 2^(lm) alignments in A(e, f).
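This counting argument is easy to check mechanically. The short sketch below enumerates every subset of the l*m possible connections for an invented two-word sentence pair and confirms that there are 2^(l*m) = 16 alignments.

```python
# A tiny sketch verifying the counting argument above: with l English words
# and m French words there are l*m possible connections, and any subset of
# them forms an alignment, giving 2**(l*m) alignments in A(e, f).
# The sentence pair is a toy example.

from itertools import combinations

e = ["the", "poor"]        # l = 2
f = ["les", "pauvres"]     # m = 2

connections = [(i, j) for i in range(len(e)) for j in range(len(f))]
alignments = [subset
              for r in range(len(connections) + 1)
              for subset in combinations(connections, r)]

print(len(connections))    # 4  (= l*m)
print(len(alignments))     # 16 (= 2**(l*m))
print(alignments[5])       # one of the 16 alignments, as a set of
                           # (English position, French position) connections
```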


6. TRANSLATION MODELING

As discussed earlier, the role of the translation model is to find P(f|e), the probability of the source sentence f given the translated sentence e. Note that it is P(f|e) that is computed by the translation model and not P(e|f). The training corpus for the translation model is a sentence-aligned parallel corpus of the languages F and E.

It is obvious that we cannot compute P(f|e) from counts of the sentences f and e in the parallel corpus. Again, the problem is that of data sparsity. The solution that is immediately apparent is to find (or approximate) the sentence translation probability using the translation probabilities of the words in the sentences. The word translation probabilities in turn can be found from the parallel corpus. There is, however, a glitch — the parallel corpus gives us only the sentence alignments; it does not tell us how the words in the sentences are aligned.

A word alignment between sentences tells us exactly how each word in sentence f is translated in e. Figure 6.1 shows an example alignment. This alignment can also be written as (1, (2, 3, 6), 4, 5), to indicate the positions in the Hindi sentence with which each word in the English sentence is aligned.

How do we get the word alignment probabilities from a training corpus that is only sentence-aligned? This problem is solved by using the Expectation-Maximization (EM) algorithm.

Figure 6.1: An example alignment

6.1 Expectation Maximization: The Intuition

The key intuition behind EM is this: if we know the number of times a word aligns with another in the corpus, we can calculate the word translation probabilities easily. Conversely, if we know the word translation probabilities, it should be possible to find the probability of various alignments. Apparently we are faced with a chicken-and-egg problem! However, if we start with some uniform word translation probabilities and calculate alignment probabilities, and then use these alignment probabilities to get (hopefully) better translation probabilities,


and keep on doing this, we should converge on some good values. This iterative procedure, which is called the Expectation-Maximization algorithm, works because words that are actually translations of each other co-occur in the sentence-aligned corpus.
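The following Python sketch illustrates this iteration in the spirit of IBM Model 1, a simplification of the full translation model of Brown et al.: word-translation probabilities t(f|e) start out uniform, expected alignment counts are collected (E-step), and the probabilities are re-normalized from those counts (M-step). The three sentence pairs are invented toy data, and the code is an illustration of the idea rather than anyone's actual implementation.

```python
# A minimal sketch of the EM iteration for word-translation probabilities,
# in the spirit of IBM Model 1. Toy sentence pairs are invented.

from collections import defaultdict

pairs = [
    ("la maison".split(),  "the house".split()),
    ("la fleur".split(),   "the flower".split()),
    ("une maison".split(), "a house".split()),
]

f_vocab = {f for fs, _ in pairs for f in fs}
e_vocab = {e for _, es in pairs for e in es}

# Initial guess: every French word equally likely for every English word.
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for iteration in range(10):
    count = defaultdict(float)   # expected count of (f, e) co-translations
    total = defaultdict(float)   # expected count of e
    for fs, es in pairs:
        for f in fs:
            # E-step: how much of this occurrence of f does each e explain?
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / norm
                count[(f, e)] += c
                total[e] += c
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e) in t:
        t[(f, e)] = count[(f, e)] / total[e] if total[e] else t[(f, e)]

print(round(t[("maison", "house")], 3))  # grows toward 1.0 over the iterations
print(round(t[("la", "the")], 3))        # "the" ends up favouring "la"
```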


7. SEARCH

In searching for the sentence e that maximizes P(e) P(f|e), we face the difficulty that there are simply too many sentences to try. Instead, we must carry out a suboptimal search. We do so using a variant of the stack search that has worked so well in speech recognition (Bahl et al. 1983). In a stack search, we maintain a list of partial alignment hypotheses. Initially, this list contains only one entry corresponding to the hypothesis that the target sentence arose in some way from a sequence of source words that we do not know. In the alignment notation introduced earlier, this entry might be (Jean aime Marie | *), where the asterisk is a place holder for an unknown sequence of source words.

The search proceeds by iterations, each of which extends some of the most promising entries on the list. An entry is extended by adding one or more additional words to its hypothesis. For example, we might extend the initial entry above to one or more of the following entries: (Jean aime Marie | John(1) *), (Jean aime Marie | * loves(2) *), (Jean aime Marie | * Mary(3)), (Jean aime Marie | Jeans(1) *). The search ends when there is a complete alignment on the list that is significantly more promising than any of the incomplete alignments.

Sometimes, the sentence e' that is found in this way is not the same as the sentence e that a translator might have been working on. When e' itself is not an acceptable translation, then there is clearly a problem. If P(e')P(f|e') is greater than P(e)P(f|e), then the problem lies in our modeling of the language or of the translation process. If, however, P(e')P(f|e') is less than P(e)P(f|e), then our search has failed to find the most likely sentence. We call this latter type of failure a search error. In the case of a search error, we can be sure that our search procedure has failed to find the most probable source sentence, but we cannot be sure that were we to correct the search we would also correct the error. We might simply find an even more probable sentence that nonetheless is incorrect. Thus, while a search error is a clear indictment of the search procedure, it is not an acquittal of either the language model or the translation model.
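The sketch below is a toy illustration of extending and pruning partial hypotheses; it is not the stack search used by Brown et al. It translates the French words left to right (ignoring reordering and fertility to stay short), scores each extension with invented word-translation and bigram language-model probabilities, and keeps only the best few hypotheses at each step.

```python
# A toy beam-search sketch of hypothesis extension: each partial hypothesis
# covers some French positions and proposes English words for them.
# All probability tables below are hypothetical.

import math
import heapq

f = "Jean aime Marie".split()

# Toy word-translation probabilities, loosely P(f_word | e_word).
t = {
    "Jean":  {"John": 0.9, "Jeans": 0.1},
    "aime":  {"loves": 0.8, "likes": 0.2},
    "Marie": {"Mary": 1.0},
}

# Toy bigram language model P(word | prev), with <s> as the start symbol.
lm = {
    ("<s>", "John"): 0.4, ("<s>", "Jeans"): 0.01,
    ("John", "loves"): 0.5, ("John", "likes"): 0.3,
    ("loves", "Mary"): 0.6, ("likes", "Mary"): 0.6,
}

def lm_prob(prev, word):
    return lm.get((prev, word), 1e-4)   # small floor for unseen bigrams

BEAM = 3
# A hypothesis: (log score, english words so far, covered French positions).
beam = [(0.0, [], frozenset())]

for _ in range(len(f)):
    candidates = []
    for score, english, covered in beam:
        # Extend by translating the leftmost uncovered French word.
        j = min(i for i in range(len(f)) if i not in covered)
        prev = english[-1] if english else "<s>"
        for e_word, t_prob in t[f[j]].items():
            new_score = score + math.log(t_prob) + math.log(lm_prob(prev, e_word))
            candidates.append((new_score, english + [e_word], covered | {j}))
    beam = heapq.nlargest(BEAM, candidates, key=lambda h: h[0])

score, english, _ = beam[0]
print(" ".join(english), round(score, 2))   # -> "John loves Mary" with these toy numbers
```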


8. PARAMETER ESTIMATION

Both the language model and the translation model have many parameters that must be specified. To estimate these parameters accurately, we need a large quantity of data. For the parameters of the language model, we need only English text, which is available in computer-readable form from many sources; but for the parameters of the translation model, we need pairs of sentences that are translations of one another.

By law, the proceedings of the Canadian parliament are kept in both French and English. As members rise to address a question before the house or otherwise express themselves, their remarks are jotted down in whichever of the two languages is used. After the meeting adjourns, a collection of translators begins working to produce a complete set of the proceedings in both French and English. These proceedings are called Hansards, in remembrance of the publisher of the proceedings of the British parliament in the early 1800s. All of these proceedings are available in computer-readable form, and we have been able to obtain about 100 million words of English text and the corresponding French text from the Canadian government. Although the translations are not made sentence by sentence, we have been able to extract about three million pairs of sentences by using a statistical algorithm based on sentence length. Approximately 99% of these pairs are made up of sentences that are actually translations of one another. It is this collection of sentence pairs, or more properly various subsets of this collection, from which we have estimated the parameters of the language and translation models.

We have discussed alignments of sentence pairs. If we had a collection of aligned pairs of sentences, then we could estimate the parameters of the translation model by counting, just as we do for the language model. However, we do not have alignments but only the unaligned pairs of sentences. This is exactly analogous to the situation in speech recognition, where one has the script of a sentence and the time waveform corresponding to an utterance of it, but no indication of just what in the time waveform corresponds to what in the script. In speech recognition, this problem is attacked with the EM algorithm (Baum 1972; Dempster et al. 1977). We have adapted this algorithm to our problem in translation. In brief, it works like this: given some initial estimate of the parameters, we can compute the probability of any particular alignment. We can then re-estimate the parameters by weighting each possible alignment according to its probability as determined by the initial guess of the parameters. Repeated iterations of this process lead to parameters that assign ever greater probability to the set of sentence pairs that we actually observe. This algorithm leads to a local maximum of the probability of the observed pairs as a function of the parameters of the model. There may be many such local maxima. The particular one at which we arrive will, in general, depend on the initial choice of parameters.


9. TWO PILOT EXPERIMENTS

In our first experiment, we test our ability to estimate parameters for the translation model. We chose as our English vocabulary the 9,000 most common words in the English part of the Hansard data, and as our French vocabulary the 9,000 most common French words. For the purposes of this experiment, we replaced all other words with either the unknown English word or the unknown French word, as appropriate. We applied the iterative algorithm discussed above in order to estimate some 81 million parameters from 40,000 pairs of sentences comprising a total of about 800,000 words in each language. The algorithm requires an initial guess of the parameters. We assumed that each of the 9,000 French words was equally probable as a translation of any of the 9,000 English words; we assumed that each of the fertilities from 0 to 25 was equally probable for each of the 9,000 English words; and finally, we assumed that each target position was equally probable given each source position and target length. Thus, our initial choices contained very little information about either French or English.

Figure 9.1 shows the translation and fertility probabilities we estimated for the English word the. We see that, according to the model, the translates most frequently into the French articles le and la. This is not surprising, of course, but we emphasize that it is determined completely automatically by the estimation process. In some sense, this correspondence is inherent in the sentence pairs themselves. Figure 9.2 shows these probabilities for the English word not. As expected, the French word pas appears as a highly probable translation. Also, the fertility probabilities indicate that not translates most often into two French words, a situation consistent with the fact that negative French sentences contain the auxiliary word ne in addition to a primary negative word such as pas or rien.

For both of these words, we could easily have discovered the same information from a dictionary. In Figure 9.3, we see the trained parameters for the English word hear. As we would expect, various forms of the French word entendre appear as possible translations, but the most probable translation is the French word bravo. When we look at the fertilities here, we see that the probability is about equally divided between fertility 0 and fertility 1. The reason for this is that the English-speaking members of parliament express their approval by shouting Hear, hear!, while the French-speaking ones say Bravo! The translation model has learned that usually two hears produce one bravo by having one of them produce the bravo and the other produce nothing.

English: the

French   Probability      Fertility   Probability
le       .610             1           .871
la       .178             0           .124
l'       .083             2           .004
les      .023
ce       .013
il       .012
de       .009
à        .007
que      .007

Figure 9.1: Probabilities for "the."


A given pair of sentences has many possible alignments, since each target word can be aligned with any source word. A translation model will assign significant probability only to some of the possible alignments, and we can gain further insight about the model by examining the alignments that it considers most probable. One such alignment is shown in Figure 3 of Brown et al. (1990): observe that, quite reasonably, not is aligned with ne and pas, while implemented is aligned with the phrase mises en application. We can also see a deficiency of the model, since intuitively we feel that will and be act in concert to produce seront, while the model aligns will with seront but aligns be with nothing.

English: not

French        Probability      Fertility   Probability
pas           .469             2           .758
ne            .460             0           .133
non           .024             1           .106
pas du tout   .003
faux          .003
plus          .002

Figure 9.2: Probabilities for "not."

English: hear

French      Probability      Fertility   Probability
bravo       .992             0           .584
entendre    .005             1           .416
entendu     .002
entends     .001

Figure 9.3: Probabilities for "hear."

We used our search procedure to decode 73 new French sentences from elsewhere in the Hansard data. We assigned each of the resulting sentences a category according to the following criteria. If the decoded sentence was exactly the same as the actual Hansard translation, we assigned the sentence to the exact category. If it conveyed the same meaning as the Hansard translation but in slightly different words, we assigned it to the alternate category. If the decoded sentence was a legitimate translation of the


French sentence but did not convey the same meaning as the Hansard translation, we assigned it to the different category. If it made sense as an English sentence but could not be interpreted as a translation of the French sentence, we assigned it to the wrong category. Finally, if the decoded sentence was grammatically deficient, we assigned it to the ungrammatical category. An example from each category is shown in Figure 9.4, and our decoding results are summarized in Figure 9.5.

Exact
Ces amendements sont certainement nécessaires.
Hansard: These amendments are certainly necessary.
Decoded as: These amendments are certainly necessary.

Alternate
C'est pourtant très simple.
Hansard: Yet it is very simple.
Decoded as: It is still very simple.

Different
J'ai reçu cette demande en effet.
Hansard: Such a request was made.
Decoded as: I have received this request in effect.

Wrong
Permettez que je donne un exemple à la Chambre.
Hansard: Let me give the House one example.
Decoded as: Let me give an example in the House.

Ungrammatical
Vous avez besoin de toute l'aide disponible.
Hansard: You need all the help you can get.
Decoded as: You need of the whole benefits available.

Figure 9.4: Translation Examples

Only 5% of the sentences fell into the exact category. However, we feel that a decoded sentence that is in any of the first three categories (exact, alternate, or different) represents a reasonable translation. By this criterion, the system performed successfully 48% of the time. As an alternate measure of the system's performance, one of us corrected each of the sentences in the last three categories (different, wrong, and ungrammatical) to either the exact or the alternate category. Counting one stroke for


each letter that must be deleted and one stroke for each letter that must be inserted, 776 strokes were needed to repair all of the decoded sentences. This compares with the 1,916 strokes required to generate all of the Hansard translations from scratch. Thus, to the extent that translation time can be equated with key strokes, the system reduces the work by about 60%.

Category        Number of sentences   Percent
Exact           4                     5
Alternate       18                    25
Different       13                    18
Wrong           11                    15
Ungrammatical   27                    37
Total           73                    100

Figure 9.5: Translation Results.


10. PLANS

There are many ways in which the simple models described in this paper can be improved. We expect some improvement from estimating the parameters on more data. For the experiments described above, we estimated the parameters of the models from only a small fraction of the data we have available: for the translation model, we used only about one percent of our data, and for the language model, only about ten percent.

We have serious problems in sentences in which the translation of certain source words depends on the translation of other source words. For example, the translation model produces aller from to go by producing aller from go and nothing from to. Intuitively we feel that to go functions as a unit to produce aller. While our model allows many target words to come from the same source word, it does not allow several source words to work together to produce a single target word. In the future, we hope to address the problem of identifying groups of words in the source language that function as a unit in translation. This may take the form of a probabilistic division of the source sentence into groups of words.

At present, we assume in our translation model that words are placed into the target sentence independently of one another. Clearly, a more realistic assumption must account for the fact that words form phrases in the target sentence that are translations of phrases in the source sentence, and that the target words in these phrases will tend to stay together even if the phrase itself is moved around. We are working on a model in which the positions of the target words produced by a particular source word depend on the identity of the source word and on the positions of the target words produced by the previous source word.

We are preparing a trigram language model that we hope will substantially improve the performance of the system. A useful information-theoretic measure of the complexity of a language with respect to a model is the perplexity as defined by Bahl et al. (1983). With the bigram model that we are currently using, the source text for our 1,000-word translation task has a perplexity of about 78. With the trigram model that we are preparing, the perplexity of the source text is about 9. In addition to showing the strength of a trigram model relative to a bigram model, this also indicates that the 1,000-word task is very simple.

We treat words as unanalyzed wholes, recognizing no connection, for example, between va, vais, and vont, or between tall, taller, and tallest. As a result, we cannot improve our statistical characterization of va, say, by observation of sentences involving vont. We are working on morphologies for French and English so that we can profit from statistical regularities that our current word-based approach must overlook.

Finally, we treat the sentence as a structureless sequence of words. Sharman et al. discuss a method for deriving a probabilistic phrase structure grammar automatically from a sample of parsed sentences (1988). We hope to apply their method to construct grammars for both French and English and to base future translation models on the grammatical constructs thus defined.


11. QUESTIONS AND CLARIFICATIONS

1) Explain basic probability for computing various models in SMT.

Answer: We're going to consider that an English sentence e may translate into any French sentence f. Some translations are just more likely than others. Here are the basic notations we'll use to formalize "more likely":

P(e) -- a priori probability. The chance that e happens. For example, if e is the English string "I like snakes," then P(e) is the chance that a certain person at a certain time will say "I like snakes" as opposed to saying something else.

P(f | e) -- conditional probability. The chance of f given e. For example, if e is the English string "I like snakes," and if f is the French string "maison bleue," then P(f | e) is the chance that upon seeing e, a translator will produce f. Not bloody likely, in this case.

P(e,f) -- joint probability. The chance of e and f both happening. If e and f don't influence each other, then we can write P(e,f) = P(e) * P(f). For example, if e stands for "the first roll of the die comes up 5" and f stands for "the second roll of the die comes up 3," then P(e,f) = P(e) * P(f) = 1/6 * 1/6 = 1/36. If e and f do influence each other, then we had better write P(e,f) = P(e) * P(f | e). That means: the chance that "e happens" times the chance that "if e happens, then f happens." If e and f are strings that are mutual translations, then there's definitely some influence.

2) Explain word choice in translation.

Answer: The P(e) model can also be useful for selecting English translations of French words. For example, suppose there is a French word that either translates as “in” or “on.” Then there may be two English strings with equally good P(f | e) scores: (1) she is in the end zone, (2) she is on the end zone. (Let's ignore other strings like “zone end the in is she” which probably also get good P(f | e) scores). Well, the first sentence is much better English than the second, so it should get a better P(e) score, and therefore a better P(e) * P(f | e) score. Remember that the most likely translation is the one with the best P(e) * P(f | e) score.


12. CONCLUSION

Traditional MT techniques require large amounts of linguistic knowledge to be encoded as rules. Statistical MT provides a way of automatically finding correlations between the features of two languages from a parallel corpus, overcoming to some extent the knowledge bottleneck in MT. The model that we have looked at in this report was entirely devoid of linguistic knowledge, but similar (non-linguistic) models have achieved encouraging results. Researchers believe that introducing linguistic knowledge can further strengthen the statistical model. Such knowledge may be in the form of morphological rules, rules about word order, idiomatic usages, known word correspondences and so on. Intuitively, for translation between English and Hindi (or any other Indian language) such linguistic knowledge might be crucial because of the vast structural and lexical differences between the two languages.

A major drawback of the statistical model is that it presupposes the existence of a sentence-aligned parallel corpus. For the translation model to work well, the corpus has to be large enough that the model can derive reliable probabilities from it, and representative enough of the domain or sub-domain (weather forecasts, match reports, etc.) it is intended to work for. Another issue is that most evaluation of statistical MT has been with training documents that are very rigid translations of each other (parliamentary proceedings have been widely used). News articles and books, for example, are generally rather loosely translated — one sentence in the source language is often split into multiple sentences, multiple sentences are clubbed into one, and the same idea is conveyed in words that are not really exact translations of each other. In such situations, sentence alignment itself might be a big challenge, let alone word alignment.

Since statistical MT is in some sense word alignment (with probabilities), it can be used for lexicon acquisition as well, apart from the larger goal of MT. Statistical MT techniques have not so far been widely explored for Indian languages. It would be interesting to find out to what extent these models can contribute to the huge ongoing MT efforts in the country.


13. REFERENCES

[1] [Brown et al., 1990] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin, A Statistical Approach to Machine Translation, Computational Linguistics, 16(2), pages 79–85, June 1990.

[2] [Brown et al., 1993] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 19(2), pages 263–311, June 1993.

[3] [Dave et al., 2001] Shachi Dave, Jignashu Parikh, and Pushpak Bhattacharyya, Interlingua-based English-Hindi Machine Translation and Language Divergence, Journal of Machine Translation (JMT), 16(4), pages 251–304, 2001.

[4] [Gale and Church, 1993] W. A. Gale and K. W. Church, A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 19(1), pages 75–102, 1993.

[5] [Haruno and Yamazaki, 1996] Masahiko Haruno and Takefumi Yamazaki, High-Performance Bilingual Text Alignment using Statistical and Dictionary Information, Proceedings of the 34th Conference of the Association for Computational Linguistics, pages 131–138, 1996.

[6] [Hutchins and Somers, 1992] W. John Hutchins and Harold L. Somers, An Introduction to Machine Translation, Academic Press, 1992.

[7] [Jurafsky and Martin, 2000] Daniel Jurafsky and James H. Martin, Speech and Language Processing, Pearson Education Inc, 2000.

[8] [Kay and Roscheisen, 1993] Martin Kay and Martin Roscheisen, Text-Translation Alignment, Computational Linguistics, 19(1), pages 121–142, 1993.

[9] [Manning and Schutze, 2000] Christopher D. Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.

[10] [Nancy and Veronis, 1998] I. Nancy and J. Veronis, Word Sense Disambiguation: The State of the Art, Computational Linguistics, 24(1), 1998.

[11] [Och and Ney, 2000] Franz Josef Och and Hermann Ney, A Comparison of Alignment Models for Statistical Machine Translation, Proceedings of the 18th Conference on Computational Linguistics, pages 1086–1090, 2000.

[12] [Somers, 1999] Harold L. Somers, Example-based Machine Translation, Machine Translation, 14, pages 113–157, 1999.

[13] [Weaver, 1949] Warren Weaver, Translation, in Machine Translation of Languages: Fourteen Essays, William Locke and Donald Booth (eds), pages 15–23, 1955.


[14] [Witten, 1991] I. H. Witten and T. C. Bell, The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression, IEEE Transactions on Information Theory, 37(4), pages 1085–1094, 1991.

[15] [Yamada and Knight, 2001] Kenji Yamada and Kevin Knight, A Syntax-based Statistical Translation Model, Proceedings of the Conference of the Association for Computational Linguistics (ACL), 2001.