On Using Monolingual Corpora in Neural Machine Translation
Presentation by: Ander Martinez Sanchez (D1), Matsumoto Lab
Abstract
● Recent NMT has shown promising results
  ○ Because of good corpora
● Investigate how to leverage abundant monolingual corpora
● Up to +1.96 BLEU on a low-resource pair (Turkish-English)
  ○ +1.59 on a focused-domain pair (Chinese-English chat messages)
● Also benefits high-resource languages
  ○ +0.39 BLEU on Chinese-English
  ○ +0.47 BLEU on German-English
Introduction
● Goal: Improve NMT by using monolingual data
● By: Integrating a Language Model (LM) for the target language (English)
● For:
  a. Resource-limited pair: Turkish-English
  b. Domain-restricted translation: Chinese-English SMS chats
  c. High-resource pairs: German-English and Chinese-English
● Article structure:
  a. Recent work
  b. Basic model architecture
  c. Shallow and deep fusion approaches
  d. Datasets
  e. Experiments and results
Background: Neural Machine Translation

SMT
● Theory: Maximize p(y|x). By Bayes' rule this factors in p(y) ← a language model
● Reality: Systems tend to model a weighted combination of features
  ○ fj(x, y) ← a feature, such as pair-wise statistics
  ○ C is a normalization constant, often ignored.

NMT
● A single network optimizes log p(y|x), including feature extraction and C
● Typically an encoder-decoder framework
● Once the conditional distribution is learnt,
  ○ find a translation using, for instance, a beam search algorithm
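The beam-search decoding step mentioned above can be illustrated with a minimal, self-contained sketch; the `next_log_probs` callable stands in for the decoder's softmax and all names here are illustrative, not from the paper:

```python
import math

def beam_search(next_log_probs, start, eos, beam_width=2, max_len=10):
    """Find a high-scoring sequence under a conditional model.

    next_log_probs(seq) -> {token: log p(token | seq)} plays the role of
    the NMT decoder's softmax at each step (a stand-in, not a real API).
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:            # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # keep the K best (partial) hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]
```

With beam_width=1 this degenerates to greedy decoding; a wider beam trades compute for a better approximation of the argmax translation.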
Model Description

Figure from [Ling et al. 2015]

1. Word embeddings
2. Annotation vectors (encoding)
3. y word embeddings
4. Decoder hidden state
5. Alignment model
6. Context vector
7. Deep output layer and softmax

Optimize: the log-likelihood of the training data, i.e. the sum of log p(yt | y<t, x) over target words
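Steps (2), (5) and (6) above, scoring the annotation vectors against the decoder state and averaging them into a context vector, can be sketched in NumPy; parameter names and shapes are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def attention_context(annotations, decoder_state, W_a, U_a, v_a):
    """Steps (5) and (6): score each source annotation against the
    current decoder state, softmax the scores, and take the weighted
    average of the annotations.

    annotations:   (src_len, 2h) encoder annotation vectors, step (2)
    decoder_state: (d,) previous decoder hidden state, step (4)
    W_a, U_a, v_a: alignment-model parameters (illustrative shapes)
    """
    # additive (Bahdanau-style) alignment scores e_tj
    e = np.tanh(annotations @ U_a + decoder_state @ W_a) @ v_a
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()              # attention weights sum to 1
    context = alpha @ annotations     # step (6): weighted average
    return context, alpha
```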
Integrating a Language Model into the Decoder
● Two methods for integrating a LM: "shallow fusion" and "deep fusion"
● Both the Language Model (LM) and the Translation Model (TM) are pre-trained.
● The LM is based on Recurrent Neural Networks (RNNLM) [Mikolov et al. 2011]
  ○ Very similar to the TM, but without steps (5) and (6) in the previous slide
Shallow Fusion
● The NMT computes p for each candidate next word
● The new score is the sum of the word score and the score of the hypothesis at t-1
● The K top hypotheses are kept as candidates
● THEN, the hypotheses are rescored using a weighted sum with the LM score.

Deep Fusion
● Concatenate the hidden states of the LM and the TM before the deep output layer.
● The model is fine-tuned.
  ○ Only for the parameters involved.
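The shallow-fusion scoring idea, summing the TM log-probability with a weighted LM log-probability, might be sketched like this; a toy rescorer in which `beta` plays the role of the weight tuned on the dev set:

```python
import math

def shallow_fusion_score(log_p_tm, log_p_lm, beta):
    """Shallow fusion: the word score is the TM log-probability plus
    a weighted LM log-probability (beta is tuned on a dev set)."""
    return log_p_tm + beta * log_p_lm

def rescore(hypotheses, beta):
    """Rescore beam candidates, given per-word TM and LM log-probs.

    hypotheses: list of (words, tm_log_probs, lm_log_probs) triples.
    Returns the best (score, words) pair under the fused score.
    """
    scored = []
    for words, tm_lps, lm_lps in hypotheses:
        score = sum(shallow_fusion_score(t, l, beta)
                    for t, l in zip(tm_lps, lm_lps))
        scored.append((score, words))
    return max(scored)
```

Note that with beta = 0 this falls back to pure TM scoring, so the LM can only re-rank, never change, the candidate set produced by the beam.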
Deep Fusion - Balancing the LM and the TM
● In some cases the LM is more informative than in others. Examples:
  ○ Articles: because Chinese doesn't have them, in Zh-En the LM is more informative than the TM
  ○ Nouns: the LM is less informative in this case.
● A controller mechanism is added.
  ○ The hidden state of the LM is multiplied by a gate gt
  ○ vg and bg are learnt parameters.
● Intuitively, this decides the importance of the LM for each word.
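A minimal sketch of the controller gate described above, assuming the usual sigmoid parameterization gt = sigmoid(vg · s_lm + bg); variable names follow the slide, while the exact layer shapes are assumptions:

```python
import numpy as np

def controller_gate(s_lm, v_g, b_g):
    """gt = sigmoid(v_g . s_lm + b_g): a scalar in (0, 1) that scales
    the LM hidden state; v_g and b_g are the learnt parameters."""
    return 1.0 / (1.0 + np.exp(-(v_g @ s_lm + b_g)))

def deep_fusion_state(s_tm, s_lm, v_g, b_g):
    """Concatenate the TM hidden state with the gated LM hidden state,
    as fed to the deep output layer in deep fusion."""
    g = controller_gate(s_lm, v_g, b_g)
    return np.concatenate([s_tm, g * s_lm]), g
```

When the gate saturates near 0 the decoder ignores the LM for that word; near 1, the LM state passes through at full strength.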
Datasets
1. Chinese-English (Zh-En)
  a. From the NIST OpenMT'15 challenge
    i. SMS/Chat
    ii. Conversational Telephone Speech (CTS)
    iii. Newsgroups/weblogs from the DARPA BOLT Project
  b. Chinese side processed at character level
  c. Restricted to CTS
2. Turkish-English (Tr-En)
  a. WIT and SETimes parallel corpora (TEDx talks)
  b. Turkish tokenized into subword units (Zemberek)
3. German-English (De-En)
4. Czech-English (Cs-En)
  a. WMT'15; noisy sentence pairs dropped.
5. Monolingual corpora: English Gigaword (LDC)
Settings
NMT
● Vocabulary sizes for Zh-En and Tr-En: Zh (10k), Tr (30k), En (40k)
● Vocabulary sizes for Cs-En and De-En: 200k, using sampled softmax [Jean et al. 2014]
● Size of recurrent units: Zh-En (1200), Tr-En (1000); not stated for the other pairs
● Adadelta with minibatches of 80
● Clip the gradient when its L2 norm exceeds 5
● Non-recurrent layers use dropout [Hinton et al. 2012]
  ○ and Gaussian noise (mean 0, std 0.001) to prevent overfitting [Graves, 2011]
● Early stopping on development-set BLEU
● Weight matrices initialized as random orthonormal matrices
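The norm-based gradient clipping listed above can be sketched as a generic clip-by-global-norm helper (not the authors' code):

```python
import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    """Rescale a list of gradient arrays so that their joint L2 norm
    is at most `threshold`; smaller gradients pass through unchanged."""
    norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if norm <= threshold:
        return grads, norm
    scale = threshold / norm
    return [g * scale for g in grads], norm
```

Rescaling the whole gradient (rather than clipping elementwise) preserves its direction, which matters for recurrent networks where exploding gradients are common.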
Settings
LM
● For each English vocabulary (3 variations) an LM was constructed with
  ○ an LSTM of 2,400 units (Zh-En and Tr-En)
  ○ an LSTM of 2,000 units (Cs-En and De-En)
● Optimized with
  ○ RMSProp (Zh-En and Tr-En) [Tieleman and Hinton, 2012]
  ○ Adam (Cs-En and De-En) [Kingma and Ba, 2014]
● Sentences with more than 10% UNK tokens were discarded
● Early stopping on perplexity
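Since early stopping here is done on perplexity, it may help to recall how perplexity follows from per-token log-probabilities (a minimal sketch):

```python
import math

def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative log-likelihood per token
    (natural log). Lower is better; a uniform model over V tokens
    has perplexity exactly V."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```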
Settings
● Shallow Fusion
  ○ Beta (Eq. 5) selected to maximize translation performance on the dev set
  ○ Searched in the range (0.001, 0.1)
  ○ The LM softmax is renormalized without the EOS and OOV symbols
    ■ Maybe due to domain differences between the LM and the TM
● Deep Fusion
  ○ Fine-tuned the parameters of the deep output layer and the controller
    ■ RMSProp: dropout prob 0.56; std of weight noise 0.005; the level of regularization is reduced after 10K updates
    ■ Adadelta: update steps scaled down by 0.01
● Handling rare words: for De-En and Cs-En, UNK tokens are copied from the source using the attention mechanism. (Improved +1.0 BLEU)
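The rare-word handling, replacing an UNK output token by the source word that received the highest attention weight at that step, might look like this; a simplified sketch where the alignment matrix and token names are illustrative:

```python
import numpy as np

def replace_unk(target_tokens, source_tokens, attention, unk="<unk>"):
    """For every UNK in the output, copy the source token with the
    highest attention weight at that decoding step.

    attention: (tgt_len, src_len) matrix of attention weights.
    """
    out = []
    for t, tok in enumerate(target_tokens):
        if tok == unk:
            out.append(source_tokens[int(np.argmax(attention[t]))])
        else:
            out.append(tok)
    return out
```

This is a copy heuristic, not a model change: it helps most for names and numbers, which attention tends to align one-to-one.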
Results
Zh-En: OpenMT'15
● Phrase-Based (PB) SMT [Koehn et al. 2003]
  ○ Rescoring with an external neural LM (+CSLM) [Schwenk]
● Hierarchical Phrase-Based SMT (HPB) [Chiang, 2005]
  ○ +CSLM
● NMT, NMT+Shallow, NMT+Deep
● Except on CTS, +Deep helps
● NMT outperformed phrase-based SMT
Results
Tr-En: IWSLT'14
● Using Deep Fusion
  ○ +1.19 BLEU
  ○ Outperformed the best previously reported result [Yilmaz et al. 2013]
Results
Cs-En and De-En: WMT'15
● Shallow: +0.09 and +0.29 BLEU
● Deep: +0.39 and +0.47 BLEU
Analysis
● The gain depends heavily on domain similarity
● In the case of Zh the domain is very different (conversational vs. news)
  ○ This is supported by the high perplexity
● Perplexity on Tr is lower, which led to larger improvements for both shallow and deep fusion
  ○ Perplexity is even lower for De-En and Cs-En; hence the larger improvement
● For deep fusion, the weight of the LM is regulated through the controller
  ○ For more similar domains it will be more active
  ○ In the case of De-En and Cs-En the controller was more active.
    ■ Correlates with the BLEU gains
    ■ Deep fusion can adapt better to domain mismatch
Conclusion and Future Work
● Two methods were presented and empirically evaluated.
● For Chinese and Turkish, the deep fusion approach achieved better results than existing SMT systems
● Improvements were also observed for high-resource pairs
● The improvement depends heavily on the domain match between the LM and the TM
  ○ Where the domains matched, there was improvement for both the shallow and the deep approach
● Suggests that domain adaptation of the LM may improve translations