Context-Aware Smoothing for Neural Machine Translation
Kehai Chen1, Rui Wang2, Masao Utiyama2, Eiichiro Sumita2, and Tiejun Zhao1
1Harbin Institute of Technology, China
2National Institute of Information and Communications Technology, Japan
The 8th International Joint Conference on Natural Language Processing
Content
• Motivation
• Related Work
• Context-Aware Representation
• NMT with Context-Aware Smoothing
• Experimental Results
• Conclusion
Motivation 1: Polysemous Words

Two bilingual parallel sentence pairs:

Src1 (pinyin): 他们 (tamen) 想 (xiang) 通过 (tongguo) 打 (da) 比赛 (bisai) 来 (lai) 解决 (jiejue) 矛盾 (maodun)
Trg1: They want to solve the dispute by playing the game.

Src2 (pinyin): 他们 (tamen) 正在 (zhengzai) 因为 (yinwei) 争执 (zhengzhi) 而 (er) 打 (da) 对方 (duifang)
Trg2: They are beating each other over a dispute.

The same source word 打 (da) is aligned to "playing" in the first pair and to "beating" in the second, yet a single word vector $v_{da}$ must cover both senses ($v_{playing}$ vs. $v_{beating}$): the meaning of a word depends on its specific context.
With a standard RNN encoder, each hidden state is computed from a fixed word vector $v_j$:
$h_j = f_{enc}(v_j, h_{j-1})$

Motivation 1: enhance the word representations of polysemous words by learning a sentence-specific representation $v_j$; a better source representation leads to a better target translation.
Motivation 2: OOV

[Figure: an attention-based encoder-decoder NMT. The word embedding layer maps source words $x_1, \ldots, x_J$ to vectors $v_1, \ldots, v_J$; the encoder layer produces hidden states $h_1, \ldots, h_J$; through the attention weights $\alpha$, the decoder layer produces states $s_1, \ldots, s_I$ and context vectors $c_1, \ldots, c_I$, yielding the target words $y_1, \ldots, y_I$.]
Motivation 2: OOV
• The source sentence includes an OOV: a single vector $v_u$ represents all OOVs.

[Figure: the same architecture with one source word mapped to the shared OOV embedding $v_u$.]
• A single vector $v_u$ representing all OOVs:
  • breaks the structure of the sentence;
  • gives a poor source representation;
  • ... ...
  • affects the translation prediction of target words.

[Figure: the gray parts indicate the NMT parameters that are affected by the OOV.]
Related Work
• Translation granularity for NMT
--- Smaller translation granularity for OOVs: word, sub-word (BPE), and character models.
Sennrich et al. (2016), Costa-jussà and Fonollosa (2016), Li et al. (2016), ...
• Source representation for NMT
--- RNN- or CNN-based encoders learn the source representation over a sequence of fixed word vectors.
Bahdanau et al. (2015), Sutskever et al. (2014), ...
• This work focuses on enhancing the word embedding layer:
--- learning a sentence-specific representation for a polysemous or OOV word from its context words;
--- the resulting context-aware representation enhances the word embedding layer and thereby improves translation (even though an RNN encoder can capture word context).
Context-Aware Representation

x1 x2 x3 x4 unk x6 x7 x8 x9
v1 v2 v3 v4  ?  v6 v7 v8 v9   (context information around "unk")

If there is an OOV "unk" (or a polysemous word) in the sentence, the surrounding words carry the information needed to interpret it. When people understand a natural language sentence intuitively, especially one including an OOV or polysemous word, they often infer the meaning of such words from their context words.
Context-Aware Representation
• We define a context $L_j$ for a source word $x_j$ in a fixed-size window of $2n$ words:
$L_j = \{x_{j-n}, \ldots, x_{j-1}, x_{j+1}, \ldots, x_{j+n}\}$
(the $n$ historical words and the $n$ future words around $x_j$)
• Taking $x_5$ as an example with $n = 2$, its context is:
$L_5 = \{x_3, x_4, x_6, x_7\}$
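As a minimal illustration, the context extraction can be sketched in Python as follows; the helper name and the padding token are our assumptions, not from the paper:

```python
# A minimal sketch of building the context L_j: the n words before and the
# n words after position j, excluding x_j itself. The function name and the
# "<pad>" token are illustrative assumptions.
def context_window(sentence, j, n=2, pad="<pad>"):
    padded = [pad] * n + sentence + [pad] * n
    c = j + n  # position of x_j inside the padded sentence
    return padded[c - n:c] + padded[c + 1:c + 1 + n]

sent = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
print(context_window(sent, 4))  # L_5 (0-based j=4) -> ['x3', 'x4', 'x6', 'x7']
```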
Feedforward Context-of-Words Model (FCWM)

Input layer: the context words $L_j$ of $x_j$:
$L_j = \{x_{j-n}, \ldots, x_{j-1}, x_{j+1}, \ldots, x_{j+n}\}$
Concatenation of the corresponding word vectors:
$L_j = [v_{j-n}; \ldots; v_{j-1}; v_{j+1}; \ldots; v_{j+n}]$
Output layer:
$V_{L_j} = \sigma(W_1 L_j + b_1)$
In short: $V_{L_j} = \varphi_1(L_j; \theta_1)$
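A minimal PyTorch sketch of the FCWM, assuming $\sigma$ is a sigmoid; the class and parameter names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

# A minimal FCWM sketch: concatenate the 2n context vectors, then apply
# one non-linear layer. Names and the sigmoid choice are assumptions.
class FCWM(nn.Module):
    def __init__(self, dim, n):
        super().__init__()
        self.linear = nn.Linear(2 * n * dim, dim)     # W_1, b_1

    def forward(self, context_vecs):
        # context_vecs: (batch, 2n, dim), the word vectors of L_j
        L = context_vecs.flatten(start_dim=1)         # [v_{j-n}; ...; v_{j+n}]
        return torch.sigmoid(self.linear(L))          # V_{L_j} = sigma(W_1 L_j + b_1)

fcwm = FCWM(dim=620, n=2)
v_Lj = fcwm(torch.randn(8, 4, 620))                   # -> (8, 620)
```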
Convolutional Context-of-Words Model (CCWM)

Input layer: the context words $L_j$ of $x_j$ stacked as a matrix:
$\mathcal{M} = [v_{j-n}, \ldots, v_{j-1}, v_{j+1}, \ldots, v_{j+n}]$
Convolution layer (filter width $k$):
$\mathcal{L} = \psi(W_2 \mathcal{M} + b_2), \quad \mathcal{L} = [\mathcal{L}_1, \ldots, \mathcal{L}_{2n-k+1}]$
Pooling layer:
$\mathcal{P}_l = \max[\mathcal{L}_{2l-1}, \mathcal{L}_{2l}], \quad \mathcal{P} = [\mathcal{P}_1, \ldots, \mathcal{P}_{(2n-k+1)/2}]$
Non-linear output layer:
$\mathcal{V}_{L_j} = \sigma\bigl(W_3 \, \mathrm{ave}\bigl(\textstyle\sum_{l=1}^{(2n-k+1)/2} \mathcal{P}_l\bigr) + b_3\bigr)$
In short: $\mathcal{V}_{L_j} = \varphi_2(L_j; \theta_2)$
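Likewise, a minimal PyTorch sketch of the CCWM, assuming $\psi$ is tanh, $\sigma$ a sigmoid, and "ave" a mean over the pooled features; these choices and the names are ours, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal CCWM sketch: convolve over the 2n context vectors, max-pool
# adjacent pairs, average, then apply a non-linear output layer.
class CCWM(nn.Module):
    def __init__(self, dim, n, k=2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=k)  # W_2, b_2
        self.out = nn.Linear(dim, dim)                  # W_3, b_3

    def forward(self, context_vecs):
        # context_vecs: (batch, 2n, dim), the word vectors of L_j
        M = context_vecs.transpose(1, 2)                # (batch, dim, 2n)
        L = torch.tanh(self.conv(M))                    # psi(W_2 M + b_2)
        P = F.max_pool1d(L, kernel_size=2)              # P_l = max[L_{2l-1}, L_{2l}]
        avg = P.mean(dim=2)                             # ave over pooled columns
        return torch.sigmoid(self.out(avg))             # V_{L_j}

ccwm = CCWM(dim=620, n=2)
v_Lj = ccwm(torch.randn(8, 4, 620))                     # -> (8, 620)
```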
NMT for OOV Smoothing
• CARNMT-Enc: smoothing OOVs on the source side.

Standard NMT encoder:
$h_j = f_{enc}(v_j, h_{j-1})$
This work:
$h_j = \begin{cases} f_{enc}(v_j, h_{j-1}), & x_j \in V_s \\ f_{enc}(\varphi_e(L_{x_j}), h_{j-1}), & x_j \notin V_s \end{cases}$

[Figure: the encoder-decoder NMT in which the source OOV's embedding is the context-aware representation $v_{L_u} = \varphi_e(L_u)$ instead of the shared unk vector.]
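A minimal sketch of this encoder-side switch, reusing the FCWM/CCWM above as $\varphi_e$; the interface and the zero-padding at sentence borders are our assumptions:

```python
import torch

# In-vocabulary words keep their embedding v_j; the shared unk vector is
# replaced by phi_e(L_{x_j}) computed from the neighboring word vectors.
def smooth_source(token_ids, embedding, phi_e, unk_id, n=2):
    v = embedding(token_ids)                          # (batch, seq, dim)
    pad = torch.zeros(v.size(0), n, v.size(2))        # zero-pad borders (assumption)
    vp = torch.cat([pad, v, pad], dim=1)
    out = v.clone()
    for j in range(token_ids.size(1)):
        c = j + n                                     # x_j in padded coordinates
        ctx = torch.cat([vp[:, c - n:c], vp[:, c + 1:c + 1 + n]], dim=1)
        oov = token_ids[:, j] == unk_id               # x_j not in V_s
        if oov.any():
            out[oov, j] = phi_e(ctx)[oov]             # phi_e(L_{x_j})
    return out                                        # fed step by step to f_enc
```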
• CARNMT-Dec: smoothing OOVs on the target side.

Standard NMT decoder:
$p(y_i \mid y_{<i}, x) = g(v_{y_{i-1}}, s_i, c_i)$
This work:
$p(y_i \mid y_{<i}, x) = \begin{cases} g(v_{y_{i-1}}, s_i, c_i), & y_{i-1} \in V_t \\ g(\varphi_d(L_{y_{i-1}}), s_i, c_i), & y_{i-1} \notin V_t \end{cases}$

[Figure: the same architecture; when the previously generated target word is an OOV, its embedding is replaced by $\varphi_d(L_u)$ before predicting the next word.]
• CARNMT-Both: applying the encoder-side and the decoder-side smoothing together.

[Figure: the architecture with $\varphi_e(L_u)$ smoothing the source-side OOV and $\varphi_d(L_u)$ smoothing the target-side OOV simultaneously.]
NMT for Smoothing All Words
• CARNMT-ALL: the context-aware representation is applied to every word, not only to OOVs.

Standard NMT:
$h_j = f_{enc}(v_j, h_{j-1})$
$p(y_i \mid y_{<i}, x) = g(v_{y_{i-1}}, s_i, c_i)$
This work:
$h_j = f_{enc}(\varphi_e(L_{x_j}), h_{j-1})$
$p(y_i \mid y_{<i}, x) = g(\varphi_d(L_{y_{i-1}}), s_i, c_i)$

[Figure: on the source side every word vector is paired with its context-aware vector, $v_j : v_{L_j}$ with $v_{L_j} = \varphi_e(L_j)$; the target side is handled analogously with $\varphi_d(L'_i)$.]
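A minimal sketch of the source side of CARNMT-ALL, following the figure's pairing $v_j : v_{L_j}$ by concatenating each word vector with its context-aware vector; this is our reading of the figure, not the authors' exact implementation:

```python
import torch

# Every word, in-vocabulary or not, is paired with its context-aware vector:
# the encoder input at position j is the concatenation [v_j ; phi_e(L_j)].
def smooth_all(token_ids, embedding, phi_e, n=2):
    v = embedding(token_ids)                          # (batch, seq, dim)
    pad = torch.zeros(v.size(0), n, v.size(2))        # zero-pad borders (assumption)
    vp = torch.cat([pad, v, pad], dim=1)
    cols = []
    for j in range(token_ids.size(1)):
        c = j + n
        ctx = torch.cat([vp[:, c - n:c], vp[:, c + 1:c + 1 + n]], dim=1)
        cols.append(torch.cat([v[:, j], phi_e(ctx)], dim=-1))  # v_j : v_{L_j}
    return torch.stack(cols, dim=1)                   # (batch, seq, 2*dim)
```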
Experimental Settings
• The training data includes 1.42M Chinese-to-English parallel sentence pairs from the LDC corpus.
• The NIST 2002 (MT02) dataset serves as the validation set and NIST 2003-2008 (MT03-08) as the test sets. Case-insensitive 4-gram NIST BLEU (Papineni et al., 2002) is the evaluation metric.
• Vocabulary size: 30k; maximum sentence length: 80; mini-batch size: 80; word embedding dimension: 620; hidden layer dimension: 1000; dropout on all layers; Adadelta optimizer.
• The baselines include: standard attentional NMT (Bahdanau et al., 2015); sub-word-based NMT (Sennrich et al., 2016); character-based NMT (Costa-jussà and Fonollosa, 2016); replacing unk with a semantically similar in-vocabulary word (Li et al., 2016).
Experimental Results
• Results for the Chinese-to-English translation task:
  • Moses vs. NMT: both are strong baselines.
  • CARNMT-Enc/Dec vs. Bahdanau et al. (2015): our smoothing method effectively relieves the negative effect of OOVs (Motivation 2).
  • CARNMT-Both vs. CARNMT-Enc/Dec: source-side smoothing is orthogonal to target-side smoothing.
  • ALLSmooth vs. CARNMT-Both: smoothing in-vocabulary words is also beneficial for NMT (Motivation 1).
  • FCWM vs. CCWM: the CCWM learns a context-semantic representation directly for smoothing the word vector, while the FCWM predicts the semantic representation of a word from its context.
Experimental Results
• Translation quality for sentences with different numbers of OOVs
  (number of sentences in each OOV-count bucket: 2306, 1827, 1121, 678, 391, 215, 123, 59, 37, 24, 29)
• When the number of OOVs is 0:
  • ALLSmooth is better than the baseline (Bahdanau et al., 2015);
  • both CARNMT-Enc and CARNMT-Dec are similar to the baseline.
• As the number of OOVs increases, the gap between our methods and the other methods (except PBSMT) becomes larger, especially beyond five OOVs.
• When the number of OOVs is more than seven, PBSMT is better than all NMT models.
Experimental Results
• A case study of OOV translations
  • The negative effect of the OOV in baseline NMT: the OOV "jiyuqi" itself is not translated, and the phrase "lizheng yousuo zuowei" (the red part in the English reference) is not translated either.
  • After smoothing the negative effect of the OOV, the model obtains the translation "striving to accomplish something" for "lizheng yousuo zuowei".
Conclusion
• Experimental results showed that the negative effect of OOVs decreases the translation performance of NMT, and that the existing RNN encoder cannot adequately address this problem.
• The learned context-aware representation (CAR) was integrated into the encoder to smooth word representations, thereby also enhancing the decoder of NMT.
• Experimental results showed that the proposed method greatly alleviates the negative effect of OOVs and enhances the word representations of in-vocabulary words, thus improving translation quality.
Q&A
Thanks!