Training MT Model Using Structural SVM
Tiansang DU Institute of Computational Linguistics
Peking University Beijing, China
Baobao CHANG Institute of Computational Linguistics
Peking University Beijing, China
Abstract—This paper presents a training method for the log-linear model in statistical machine translation based on the structural support vector machine. The method is designed to directly optimize parameters with respect to translation quality. By adopting the maximum-margin principle of SVM, the MT model can learn from training samples with generalization capability. Experiments are carried out on a hierarchical phrase-based MT system for Chinese-to-English translation. Results show that structural SVM training can re-rank the k-best list of the MT system according to the automatic evaluation criterion BLEU, and that it enhances the average quality of the MT system outputs.
Keywords-structural support vector machine; statistical machine translation; discriminative training
I. INTRODUCTION

Statistical machine translation has made important progress in the last decades. Models have improved from word-based models to phrase-based and syntax-based models. The hierarchical phrase-based model [1] is based on synchronous context-free grammar and achieves phrase reordering within reasonable training and decoding time. The log-linear model [2] is widely used in SMT as a discriminative modeling framework. In the log-linear model, a set of feature functions defined on the source and target sentences is used to choose the highest scorer among the outputs of a translation system. The log-linear model allows different combinations of feature functions, so training the parameters can yield better results. This paper presents a discriminative training method for the log-linear model based on the structural support vector machine (structural SVM) [3], with experiments on a hierarchical phrase-based MT system.
The advantages of structural SVM training are mainly threefold: first, the maximum-margin principle of SVM [4] aims to obtain the largest separation margin between the best candidate translation and the other candidate translations, so the trained translation model can acquire generalization ability; second, the training procedure allows an arbitrary loss function, such as the MT evaluation criterion BLEU, so it can directly optimize translation quality; third, the cutting-plane algorithm [5] can compute arbitrarily close approximations to the SVM optimization problem in polynomial time.
A widely used parameter tuning method is minimum error rate training (MERT), presented by Och [6]. MERT is a training method for log-linear models that is directly related to translation quality and is solved by a line optimization algorithm. As discussed in [6], one disadvantage of MERT is that directly optimizing the error rate for many more parameters would lead to serious over-fitting; even if MERT achieves good results on the development set, it may not perform well on a totally new sample set. Structural SVM training inherits the maximum-margin principle and uses the support vectors to determine the parameters. This leads to generalization capability and robustness, and reduces the level of over-fitting.
The rest of this paper is organized as follows: Section 2 briefly describes the structural SVM and the cutting-plane algorithm. Section 3 describes how to train the SMT model with structural SVM. Section 4 presents the experiments and results on Chinese-English translation, followed by conclusions in Section 5.
II. STRUCTURAL SVM AND CUTTING-PLANE ALGORITHM

Structural SVM is used to learn a prediction function $f: \mathcal{X} \rightarrow \mathcal{Y}$ from a training sample set of input-output pairs $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space.
A discriminant function $F: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$ is defined over input-output pairs and returns a real value. In structural SVM, the discriminant function $F$ is assumed to be linear in a feature vector $\Psi(x, y)$, i.e.

$$F(x, y; w) = \langle w, \Psi(x, y) \rangle. \quad (1)$$
The prediction function $f$ is derived by maximizing $F$ over the output space for a given input $x$, i.e.

$$f(x) = \arg\max_{y \in \mathcal{Y}} F(x, y). \quad (2)$$
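As a minimal sketch of equations (1) and (2), the discriminant function is a plain dot product between the weight vector and a feature vector, and prediction is an argmax over candidate outputs. The feature map and toy candidates below are illustrative assumptions, not part of the paper.

```python
def score(w, psi_xy):
    """Equation (1): F(x, y; w) = <w, Psi(x, y)> as a dot product."""
    return sum(wi * fi for wi, fi in zip(w, psi_xy))

def predict(w, candidates, psi):
    """Equation (2): f(x) = argmax over candidate outputs y of F(x, y; w)."""
    return max(candidates, key=lambda y: score(w, psi(y)))

# Toy usage: two candidate outputs, each described by three features.
w = [0.5, 1.0, -0.2]
feats = {"y1": [1.0, 0.0, 2.0], "y2": [0.0, 2.0, 0.0]}
best = predict(w, list(feats), lambda y: feats[y])  # "y2" scores 2.0 vs 0.1
```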
Given the feature functions, training becomes a task of parameter estimation. The parameter $w$ is obtained by training $F(x, y; w)$ on the sample set so that it makes the correct predictions. The following inequalities describe these constraints:

$$\forall i \in \{1, \ldots, n\}: \quad \max_{y \in \mathcal{Y} \setminus \{y_i\}} \{ w^T \Psi(x_i, y) \} \le w^T \Psi(x_i, y_i) \quad (3)$$
2010 International Conference on Asian Language Processing
978-0-7695-4288-1/10 $26.00 © 2010 IEEE
DOI 10.1109/IALP.2010.53
There could be more than one solution $w$ that satisfies the above constraints. To determine the best one, the structural SVM adopts the maximum-margin principle and chooses the parameter $w$ that maximizes the separation margin, i.e. the minimal difference between the score of the correct output and that of the closest runner-up. When making predictions with the trained $f$, the correct output can then be selected by a clear margin.
Structural SVM training can directly optimize translation quality with respect to an arbitrary loss function. The loss function $\Delta(y, \hat{y})$ quantifies the loss associated with predicting $\hat{y}$ when the correct output is $y$.
The training procedure is supervised learning, and the aim is to learn the function $f$ with low prediction error. The prediction error appears as the empirical risk over the sample set $S$:

$$R^{\Delta}_S(f) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, f(x_i)) \quad (4)$$
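Equation (4) can be sketched directly: the empirical risk is the mean loss between each gold output and the model's prediction. The sample data and the 0/1 loss below are toy assumptions for illustration.

```python
def empirical_risk(samples, predict, loss):
    """Equation (4): R_S(f) = (1/n) * sum_i loss(y_i, f(x_i))."""
    return sum(loss(y, predict(x)) for x, y in samples) / len(samples)

# Toy usage with a 0/1 loss: the model gets x3 wrong, so the risk is 1/3.
samples = [("x1", "a"), ("x2", "b"), ("x3", "c")]
model = lambda x: {"x1": "a", "x2": "b", "x3": "d"}[x]
zero_one = lambda y, yhat: 0.0 if y == yhat else 1.0
risk = empirical_risk(samples, model, zero_one)
```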
Intuitively, a sample $(x_i, y_i)$ that violates a margin constraint with high loss $\Delta(y_i, \hat{y})$ should be penalized more severely than one with smaller loss. Slack rescaling and margin rescaling can both accomplish this idea. As proved in [3] for both methods, the optimal value of the structural SVM objective function is an upper bound on the empirical risk.
To allow noise in the training set, structural SVM also adds slack variables $\xi$ for soft-margin optimization, as in non-linear classification. The “n-slack” formulation uses one slack variable $\xi_i$ for violations of the constraints corresponding to each sample $x_i$. Because “n-slack” training is computationally expensive on large datasets, Joachims proposed an equivalent “1-slack” reformulation of the structural SVM training problem [5]. Considering the scale of the output space of the MT problem, we use the “1-slack” algorithm with margin rescaling described in [5]. The objective function and constraints are as follows:
$$\min_{w, \xi \ge 0} \; \frac{1}{2} w^T w + C\xi$$

$$\text{s.t. } \forall (\hat{y}_1, \ldots, \hat{y}_n) \in \mathcal{Y}^n: \quad \frac{1}{n} w^T \sum_{i=1}^{n} \left[ \Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i) \right] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \xi \quad (5)$$
where $C$ is a constant that controls the trade-off between training error minimization and margin maximization, and $\xi$ is the slack variable. The parameter estimation problem thus becomes a convex quadratic program with constraints.
The cutting-plane method [5] is an iterative approximation algorithm, as shown in Figure 1. With this method, only a much smaller subset of constraints needs to be explicitly examined. The algorithm iteratively constructs a working set $\mathcal{W}$ of constraints. In each iteration, it computes the solution over the current $\mathcal{W}$ (Line 4), finds the most violated constraint (Lines 5-7), and adds it to the working set (Line 8). The algorithm stops once no constraint can be found that is violated by more than the desired precision $\varepsilon$ (Line 9). The algorithm can compute arbitrarily close approximations to the optimization problem. The number of iterations does not depend on the number of training samples, and it is linear in the desired precision and the regularization parameter.
Algorithm: training structural SVMs (with margin rescaling) via the 1-slack formulation.

1: Input: $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $C$, $\varepsilon$
2: $\mathcal{W} \leftarrow \emptyset$
3: repeat
4:   $(w, \xi) \leftarrow \arg\min_{w, \xi \ge 0} \frac{1}{2} w^T w + C\xi$  s.t. all constraints in $\mathcal{W}$
5:   for $i = 1, \ldots, n$ do
6:     $\hat{y}_i \leftarrow \arg\max_{\hat{y} \in \mathcal{Y}} \{ \Delta(y_i, \hat{y}) + w^T \Psi(x_i, \hat{y}) \}$
7:   end for
8:   $\mathcal{W} \leftarrow \mathcal{W} \cup \{(\hat{y}_1, \ldots, \hat{y}_n)\}$
9: until $\frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \frac{1}{n} w^T \sum_{i=1}^{n} \left[ \Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i) \right] \le \xi + \varepsilon$
10: return $(w, \xi)$

Figure 1. Cutting-plane algorithm
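The control flow of Figure 1 can be sketched in a few dozen lines of Python. This is a toy illustration under loud assumptions: the exact QP of Line 4 is replaced here by a crude subgradient approximation over the working set, whereas a faithful implementation (e.g. Joachims') calls a real QP solver; the data structures (`psi`, `outputs`, feature vectors as lists) are also illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cutting_plane(samples, psi, loss, outputs, C=1.0, eps=1e-3, dim=2, max_iter=50):
    """1-slack cutting-plane loop: samples is a list of (x, y) pairs,
    psi(x, y) returns a feature vector, outputs(x) enumerates candidates."""
    n = len(samples)
    W = []                                  # working set: (delta_psi, margin)
    w = [0.0] * dim
    for _ in range(max_iter):
        # Line 4 (approximated): minimise 0.5*||w||^2 + C*xi over W by a
        # short subgradient loop; a faithful implementation solves this QP.
        w = [0.0] * dim
        for _ in range(200):
            if not W:
                break
            dpsi, m = max(W, key=lambda c: c[1] - dot(w, c[0]))
            if m - dot(w, dpsi) > 0:        # hinge active: step toward C*dpsi
                w = [wi - 0.05 * (wi - C * d) for wi, d in zip(w, dpsi)]
            else:                           # only the regulariser pulls on w
                w = [0.95 * wi for wi in w]
        xi = max([m - dot(w, dpsi) for dpsi, m in W] + [0.0])
        # Lines 5-7: cost-augmented inference builds one joint constraint.
        dpsi = [0.0] * dim
        margin = 0.0
        for x, y in samples:
            yhat = max(outputs(x),
                       key=lambda yb: loss(y, yb) + dot(w, psi(x, yb)))
            margin += loss(y, yhat) / n
            dpsi = [d + (a - b) / n
                    for d, a, b in zip(dpsi, psi(x, y), psi(x, yhat))]
        # Line 9: stop when the new constraint is violated by at most eps.
        if margin - dot(w, dpsi) <= xi + eps:
            break
        W.append((dpsi, margin))            # Line 8
    return w
```

On a toy problem where the correct output fires one feature and any wrong output fires another, the loop converges in a couple of iterations to a weight vector that prefers the correct feature.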
III. TRAINING MT USING STRUCTURAL SVM

As the specific MT method, we use the hierarchical phrase-based translation model [1]. The hierarchical phrase-based model treats translation as parsing the source language sentence $f$ and the target language sentence $e$ at the same time using a synchronous CFG. Because the hierarchical phrase-based translation model employs the log-linear model, the training problem naturally falls into the framework of structural SVM.
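A synchronous CFG rule pairs a source-side and a target-side right-hand side that share co-indexed nonterminals, which is how the model realizes phrase reordering. The sketch below is an illustrative assumption (the rule shape follows Chiang's well-known example from [1], with the Chinese side romanized; the dict representation and `apply_rule` helper are ours, not part of the paper).

```python
# A hierarchical rule X -> < yu X1 you X2 , have X2 with X1 >:
# nonterminals on the two sides share the indices 1 and 2.
rule = {
    "src": ["yu", ("X", 1), "you", ("X", 2)],
    "tgt": ["have", ("X", 2), "with", ("X", 1)],
}

def apply_rule(rule, fillers):
    """Substitute already-translated sub-spans for the co-indexed
    nonterminals on the target side, realizing the reordering."""
    out = []
    for tok in rule["tgt"]:
        if isinstance(tok, tuple):      # nonterminal (X, index)
            out.extend(fillers[tok[1]])
        else:
            out.append(tok)
    return out

# Toy usage: X1 and X2 swap order on the target side.
result = apply_rule(rule, {1: ["Beijing"], 2: ["diplomatic", "ties"]})
# result == ["have", "diplomatic", "ties", "with", "Beijing"]
```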
For the training problem of SMT, the input space $\mathcal{X}$ is the set of source language sentences, i.e. $f \in \mathcal{X}$, and the output space $\mathcal{Y}$ is the set of target language sentences, i.e. $e \in \mathcal{Y}$. The goal is to learn the best discriminant function. In structural SVM training, the discriminant function is determined by a set of $M$ feature functions and the corresponding weight parameters as
in (1). When training the MT system, the feature functions and parameters can be similarly formulated.
Equation (2) shows that the output is the highest scorer of the discriminant function; therefore, when training the machine translation system, the output is:

$$\hat{e}(f) = \arg\max_{e} F(f, e; w) = \arg\max_{e} w^T \Psi(f, e). \quad (6)$$
The training is carried out on a sample set. For every sentence in the training samples, we can get its k-best translation candidates from a decoding system. The sentence at the top of the k-best list is the one with the highest discriminant function score, so it is the predicted output, but it might not be the one with the best translation quality. Thus we need to train the parameters and optimize directly with respect to a criterion that reflects the error rate of the MT system. As the evaluation criterion, we use the automatic evaluation metric BLEU [7]. BLEU measures accuracy, i.e. the opposite of error rate, so larger BLEU scores are better. Therefore we set the loss function used in structural SVM training to be
$$\Delta(e, e') = sf \times (1 - \mathrm{BLEU}(e, e')) \quad (7)$$

Because the value of $1 - \mathrm{BLEU}(e, e')$ is usually very small compared with the feature function values, we use a scale factor $sf$ to adjust it.
In structural SVM training, the reference translation needs to be in the k-best list so that we can compute the loss function between a hypothesis translation and the reference translation. However, most of the time the reference translation is not in the k-best list, because the search algorithm performs pruning, which in principle limits the possible translations that can be produced for a given input sentence. So we need to select a candidate in the k-best list to be treated as the reference translation. As we choose BLEU as the metric, we compute the BLEU score of all candidates in the k-best list and label the one with the highest BLEU score as the reference translation.
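Equation (7) and the oracle selection above can be sketched together. The unigram-precision `bleu` stand-in below is an illustrative assumption, not the full BLEU metric of [7]; the scale factor value is likewise arbitrary.

```python
def bleu(ref, hyp):
    """Toy unigram-precision surrogate for sentence-level BLEU."""
    if not hyp:
        return 0.0
    matches = sum(min(hyp.count(t), ref.count(t)) for t in set(hyp))
    return matches / len(hyp)

def loss(ref, hyp, sf=100.0):
    """Equation (7): Delta(e, e') = sf * (1 - BLEU(e, e'))."""
    return sf * (1.0 - bleu(ref, hyp))

def oracle(ref, kbest):
    """Label the k-best entry with the highest BLEU as the reference."""
    return max(kbest, key=lambda hyp: bleu(ref, hyp))

ref = "the cat sat on the mat".split()
kbest = ["a cat sat on mat".split(),
         "the cat sat on the mat".split(),
         "the the the".split()]
best = oracle(ref, kbest)   # the exact match has the highest score
```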
The training problem is equivalent to solving the optimization problem in (5). The cutting-plane algorithm can compute arbitrarily close approximations to this problem. After structural SVM training, we obtain a set of support vectors, which is a small part of the training samples, along with the weight parameters derived from them. With these weight parameters, we can update the MT system and test its performance.
IV. EXPERIMENTS

The experiments are carried out on a hierarchical phrase-based machine translation system for Chinese-to-English translation.
The corpus we use is the bilingual parallel corpus provided by the Institute of Computational Linguistics, Peking University. It consists of Chinese-English parallel sentences in the fields of news, literature, and political comment. The corpus is divided into three separate sets: one for training, one for development, and one for test. The characteristics of the corpus are displayed in Table 1.
TABLE I. Characteristics of the training corpus (Train), development corpus (Develop), and test corpus (Test).

                         Chinese       English
Train     Sentences            500,000
          Words        8,722,247    10,901,295
Develop   Sentences              1,000
          Words           18,252        21,760
Test      Sentences              1,000
          Words           27,593        29,874
The feature functions of our MT system are identical to the basic features in the hierarchical phrase-based model [1]. For the alignment model in the hierarchical phrase-based machine translation system, we use GIZA++ to obtain word alignments, and the hierarchical phrase-based translation model to obtain aligned hierarchical phrase pairs. For evaluation, we use the BLEU metric; the loss function in the structural SVM training procedure is also based on BLEU. For the k-best list, we set the maximum candidate number k to 1000. The language model used in the feature functions is computed with the SRILM toolkit [8]: a 5-gram language model with interpolated Kneser-Ney discounting. The decoding system is a CKY parser with beam search. For a given input Chinese sentence, the decoding system returns the target language part of the best parsing result; for the training procedure, it also provides a k-best list of candidate translations. We use the API of the “1-slack” cutting-plane algorithm implemented by Joachims [5] to build the structural SVM training system for the SMT system. The baseline system is the hierarchical phrase-based machine translation system with the same feature functions and uniform weight parameters, i.e. $\lambda_1 = \cdots = \lambda_6 = 1$.
For every sample in the training set, we can derive its k-best list with the baseline system. After we obtain the weight parameters trained by structural SVM, we can compute a new ranking for every candidate in the original k-best list by re-evaluating the discriminant function. The expected change is that, after structural SVM training, the ranking number of the reference sentence becomes smaller, and the BLEU difference between the reference and the predicted output becomes smaller as well. We define two indicators to quantify the change:
1. “REF RANKING” is the ranking number of the reference translation, i.e. of the candidate with the highest BLEU score in the k-best list, averaged over all samples. Ideally the reference translation should be the predicted output, so the smaller this indicator, the better.
2. “DIFF BLEU” is the difference in BLEU score between the reference translation and the predicted output, i.e. between the candidate with the highest BLEU score and the candidate at the top of the k-best list. Ideally DIFF BLEU should be 0, so for this indicator too, the smaller, the better.
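The two indicators can be sketched as follows, assuming each sample's k-best list has already been scored: `bleus[i][j]` is the BLEU score of the candidate at model rank `j` for sample `i`. These names and the toy data are illustrative assumptions.

```python
def ref_ranking(bleus):
    """Average rank (1-based) of the highest-BLEU candidate per sample."""
    ranks = [scores.index(max(scores)) + 1 for scores in bleus]
    return sum(ranks) / len(ranks)

def diff_bleu(bleus):
    """Average BLEU gap between the oracle and the top-ranked candidate."""
    gaps = [max(scores) - scores[0] for scores in bleus]
    return sum(gaps) / len(gaps)

# Toy usage: two samples with 3-best lists.
bleus = [[0.20, 0.35, 0.10],   # oracle sits at rank 2
         [0.40, 0.25, 0.30]]   # oracle already on top
rr = ref_ranking(bleus)   # (2 + 1) / 2 = 1.5
db = diff_bleu(bleus)     # ((0.35 - 0.20) + 0.0) / 2 = 0.075
```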
TABLE II. Re-ranking the k-best list using structural SVM

Algorithm        BLEU     REF RANKING   DIFF BLEU
Baseline         0.2489   7.42          0.08479
Structural SVM   0.2947   4.39          0.08036
TABLE III. Results using the trained parameters

Algorithm        BLEU     VAR        DIFF BLEU
Baseline         0.2235   0.003857   0.08574
Structural SVM   0.2618   0.002139   0.08124
Table 2 shows the performance of re-ranking the k-best list using structural SVM. Training and testing are performed on the same k-best lists produced by the baseline system on the development corpus. In Table 2, the BLEU score increased by over 18% using the new parameters trained by structural SVM. The average ranking number (REF RANKING) decreases by more than 40%, so the reference translation moves forward in the k-best list. The difference in BLEU score (DIFF BLEU) also decreased, as expected. The results show that structural SVM training can help re-rank the k-best list of a machine translation system according to the automatic machine translation evaluation criterion BLEU.
To test the generalization ability of structural SVM training, we train on the development corpus and test on the test corpus. In this experiment, we define the following two indicators to quantify the average improvement of the k-best candidates in every sample:
1. “VAR” is the variance of the BLEU scores in the k-best list, averaged over all samples.
2. “DIFF BLEU” is the same as the one used before.
Table 3 shows the results using the new parameters trained by structural SVM. Because we use different corpora for training and testing, the BLEU score is lower than the one obtained in the re-ranking experiment, but within this experiment the BLEU score still increases by more than 16% after structural SVM training. In Table 3, the variance of the structural SVM trained system is lower than that of the baseline system, while its BLEU score is higher, which indicates an overall enhancement of the new k-best list. The DIFF BLEU indicator also decreases after structural SVM training, which means the difference in BLEU score between the reference translation and the predicted output is smaller; the predicted output is therefore more similar to the reference translation. The results show that the average quality of the machine translation system outputs is improved.
V. CONCLUSIONS

The structural SVM model has the ability to learn prediction functions for problems with complex structures. We apply this model to the training problem of hierarchical phrase-based machine translation. Experiments show that with the parameters trained by structural SVM, the k-best list of the MT system is re-ranked according to the automatic evaluation criterion BLEU, and the average quality of the k-best list is improved. The structural SVM's design for arbitrary output structures makes it suitable for many problems in natural language processing.
ACKNOWLEDGMENT This paper has been supported by National Natural Science
Foundation of China (#60975054) and National Social Science Fund of China (#06BYY048).
REFERENCES

[1] D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2): 201-228.
[2] F. J. Och and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In ACL 2002.
[3] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6: 1453-1484.
[4] V. Vapnik. 1998. Statistical Learning Theory. Wiley and Sons Inc., New York.
[5] T. Joachims, T. Finley, and C.-N. J. Yu. 2009. Cutting-plane training of structural SVMs. Machine Learning, 77(1): 27-59.
[6] F. J. Och. 2003. Minimum error rate training in statistical machine translation. In ACL 2003.
[7] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. Technical Report RC22176, IBM Research.
[8] A. Stolcke. 2002. SRILM: an extensible language modeling toolkit. In Proc. of the Int. Conference on Spoken Language Processing, vol. 2, pp. 901-904.