Context-Aware Smoothing for Neural Machine Translation
Kehai Chen1, Rui Wang2, Masao Utiyama2, Eiichiro Sumita2, and Tiejun Zhao1
1Harbin Institute of Technology, China
2National Institute of Information and Communications Technology, Japan
The 8th International Joint Conference on Natural Language Processing
Content
• Motivation
• Related Work
• Context-Aware Representation
• NMT with Context-Aware Smoothing
• Experimental Results
• Conclusion
Motivation 1: Polysemous Words

Two bilingual parallel sentence pairs:

Src1 (pinyin): 他们 (tamen) 想 (xiang) 通过 (tongguo) 打 (da) 比赛 (bisai) 来 (lai) 解决 (jiejue) 矛盾 (maodun)
Trg1: They want to solve the dispute by playing the game.

Src2 (pinyin): 他们 (tamen) 正在 (zhengzai) 因为 (yinwei) 争执 (zhengzhi) 而 (er) 打 (da) 对方 (duifang)
Trg2: They are beating each other over a dispute.

The same source word 打 (da) is aligned to "playing" in the first pair and to "beating" in the second, yet a single word vector $v_{da}$ must cover both senses ($v_{playing}$ vs. $v_{beating}$): the meaning of a word depends on its specific context.
With a standard RNN encoder, each hidden state is computed from a fixed word vector $v_j$:
$h_j = f_{enc}(v_j, h_{j-1})$

Motivation 1: enhance the word representations of polysemous words by learning a sentence-specific representation $v_j$; a better source representation leads to a better target translation.
Motivation 2: OOV

[Figure: an attention-based encoder-decoder NMT. The word embedding layer maps source words $x_1, \ldots, x_J$ to vectors $v_1, \ldots, v_J$; the encoder layer produces hidden states $h_1, \ldots, h_J$; through the attention weights $\alpha$, the decoder layer produces states $s_1, \ldots, s_I$ and context vectors $c_1, \ldots, c_I$, yielding the target words $y_1, \ldots, y_I$.]
Motivation 2: OOV
• The source sentence includes an OOV: a single vector $v_u$ represents all OOVs.

[Figure: the same architecture with one source word mapped to the shared OOV embedding $v_u$.]
• A single vector $v_u$ representing all OOVs:
  • breaks the structure of the sentence;
  • gives a poor source representation;
  • ... ...
  • affects the translation prediction of target words.

[Figure: the gray parts indicate the NMT parameters that are affected by the OOV.]
Related Work
• Translation granularity for NMT
--- Smaller translation granularity for OOVs: word, sub-word (BPE), and character models.
Sennrich et al. (2016), Costa-jussà and Fonollosa (2016), Li et al. (2016), ...
• Source representation for NMT
--- RNN- or CNN-based encoders learn the source representation over a sequence of fixed word vectors.
Bahdanau et al. (2015), Sutskever et al. (2014), ...
• This work focuses on enhancing the word embedding layer:
--- learning a sentence-specific representation for a polysemous or OOV word from its context words;
--- the resulting context-aware representation enhances the word embedding layer and thereby improves translation (even though an RNN encoder can capture word context).
Context-Aware Representation

x1 x2 x3 x4 unk x6 x7 x8 x9
v1 v2 v3 v4  ?  v6 v7 v8 v9   (context information around "unk")

If there is an OOV "unk" (or a polysemous word) in the sentence, the surrounding words carry the information needed to interpret it. When people understand a natural language sentence intuitively, especially one including an OOV or polysemous word, they often infer the meaning of such words from their context words.
Context-Aware Representation
• We define a context $L_j$ for a source word $x_j$ in a fixed-size window of $2n$ words:
$L_j = \{x_{j-n}, \ldots, x_{j-1}, x_{j+1}, \ldots, x_{j+n}\}$
(the $n$ historical words and the $n$ future words around $x_j$)
• Taking $x_5$ as an example with $n = 2$, its context is:
$L_5 = \{x_3, x_4, x_6, x_7\}$
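As a minimal illustration, the context extraction can be sketched in Python as follows; the helper name and the padding token are our assumptions, not from the paper:

```python
# A minimal sketch of building the context L_j: the n words before and the
# n words after position j, excluding x_j itself. The function name and the
# "<pad>" token are illustrative assumptions.
def context_window(sentence, j, n=2, pad="<pad>"):
    padded = [pad] * n + sentence + [pad] * n
    c = j + n  # position of x_j inside the padded sentence
    return padded[c - n:c] + padded[c + 1:c + 1 + n]

sent = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
print(context_window(sent, 4))  # L_5 (0-based j=4) -> ['x3', 'x4', 'x6', 'x7']
```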
Feedforward Context-of-Words Model (FCWM)

Input layer: the context words $L_j$ of $x_j$:
$L_j = \{x_{j-n}, \ldots, x_{j-1}, x_{j+1}, \ldots, x_{j+n}\}$
Concatenation of the corresponding word vectors:
$L_j = [v_{j-n}; \ldots; v_{j-1}; v_{j+1}; \ldots; v_{j+n}]$
Output layer:
$V_{L_j} = \sigma(W_1 L_j + b_1)$
In short: $V_{L_j} = \varphi_1(L_j; \theta_1)$
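A minimal PyTorch sketch of the FCWM, assuming $\sigma$ is a sigmoid; the class and parameter names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

# A minimal FCWM sketch: concatenate the 2n context vectors, then apply
# one non-linear layer. Names and the sigmoid choice are assumptions.
class FCWM(nn.Module):
    def __init__(self, dim, n):
        super().__init__()
        self.linear = nn.Linear(2 * n * dim, dim)     # W_1, b_1

    def forward(self, context_vecs):
        # context_vecs: (batch, 2n, dim), the word vectors of L_j
        L = context_vecs.flatten(start_dim=1)         # [v_{j-n}; ...; v_{j+n}]
        return torch.sigmoid(self.linear(L))          # V_{L_j} = sigma(W_1 L_j + b_1)

fcwm = FCWM(dim=620, n=2)
v_Lj = fcwm(torch.randn(8, 4, 620))                   # -> (8, 620)
```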
Convolutional Context-of-Words Model (CCWM)

Input layer: the context words $L_j$ of $x_j$ stacked as a matrix:
$\mathcal{M} = [v_{j-n}, \ldots, v_{j-1}, v_{j+1}, \ldots, v_{j+n}]$
Convolution layer (filter width $k$):
$\mathcal{L} = \psi(W_2 \mathcal{M} + b_2), \quad \mathcal{L} = [\mathcal{L}_1, \ldots, \mathcal{L}_{2n-k+1}]$
Pooling layer:
$\mathcal{P}_l = \max[\mathcal{L}_{2l-1}, \mathcal{L}_{2l}], \quad \mathcal{P} = [\mathcal{P}_1, \ldots, \mathcal{P}_{(2n-k+1)/2}]$
Non-linear output layer:
$\mathcal{V}_{L_j} = \sigma\bigl(W_3 \, \mathrm{ave}\bigl(\textstyle\sum_{l=1}^{(2n-k+1)/2} \mathcal{P}_l\bigr) + b_3\bigr)$
In short: $\mathcal{V}_{L_j} = \varphi_2(L_j; \theta_2)$
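Likewise, a minimal PyTorch sketch of the CCWM, assuming $\psi$ is tanh, $\sigma$ a sigmoid, and "ave" a mean over the pooled features; these choices and the names are ours, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal CCWM sketch: convolve over the 2n context vectors, max-pool
# adjacent pairs, average, then apply a non-linear output layer.
class CCWM(nn.Module):
    def __init__(self, dim, n, k=2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=k)  # W_2, b_2
        self.out = nn.Linear(dim, dim)                  # W_3, b_3

    def forward(self, context_vecs):
        # context_vecs: (batch, 2n, dim), the word vectors of L_j
        M = context_vecs.transpose(1, 2)                # (batch, dim, 2n)
        L = torch.tanh(self.conv(M))                    # psi(W_2 M + b_2)
        P = F.max_pool1d(L, kernel_size=2)              # P_l = max[L_{2l-1}, L_{2l}]
        avg = P.mean(dim=2)                             # ave over pooled columns
        return torch.sigmoid(self.out(avg))             # V_{L_j}

ccwm = CCWM(dim=620, n=2)
v_Lj = ccwm(torch.randn(8, 4, 620))                     # -> (8, 620)
```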
NMT for OOV Smoothing
• CARNMT-Enc: smoothing OOVs on the source side.

Standard NMT encoder:
$h_j = f_{enc}(v_j, h_{j-1})$
This work:
$h_j = \begin{cases} f_{enc}(v_j, h_{j-1}), & x_j \in V_s \\ f_{enc}(\varphi_e(L_{x_j}), h_{j-1}), & x_j \notin V_s \end{cases}$

[Figure: the encoder-decoder NMT in which the source OOV's embedding is the context-aware representation $v_{L_u} = \varphi_e(L_u)$ instead of the shared unk vector.]
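A minimal sketch of this encoder-side switch, reusing the FCWM/CCWM above as $\varphi_e$; the interface and the zero-padding at sentence borders are our assumptions:

```python
import torch

# In-vocabulary words keep their embedding v_j; the shared unk vector is
# replaced by phi_e(L_{x_j}) computed from the neighboring word vectors.
def smooth_source(token_ids, embedding, phi_e, unk_id, n=2):
    v = embedding(token_ids)                          # (batch, seq, dim)
    pad = torch.zeros(v.size(0), n, v.size(2))        # zero-pad borders (assumption)
    vp = torch.cat([pad, v, pad], dim=1)
    out = v.clone()
    for j in range(token_ids.size(1)):
        c = j + n                                     # x_j in padded coordinates
        ctx = torch.cat([vp[:, c - n:c], vp[:, c + 1:c + 1 + n]], dim=1)
        oov = token_ids[:, j] == unk_id               # x_j not in V_s
        if oov.any():
            out[oov, j] = phi_e(ctx)[oov]             # phi_e(L_{x_j})
    return out                                        # fed step by step to f_enc
```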
• CARNMT-Dec: smoothing OOVs on the target side.

Standard NMT decoder:
$p(y_i \mid y_{<i}, x) = g(v_{y_{i-1}}, s_i, c_i)$
This work:
$p(y_i \mid y_{<i}, x) = \begin{cases} g(v_{y_{i-1}}, s_i, c_i), & y_{i-1} \in V_t \\ g(\varphi_d(L_{y_{i-1}}), s_i, c_i), & y_{i-1} \notin V_t \end{cases}$

[Figure: the same architecture; when the previously generated target word is an OOV, its embedding is replaced by $\varphi_d(L_u)$ before predicting the next word.]
• CARNMT-Both: applying the encoder-side and the decoder-side smoothing together.

[Figure: the architecture with $\varphi_e(L_u)$ smoothing the source-side OOV and $\varphi_d(L_u)$ smoothing the target-side OOV simultaneously.]
NMT for Smoothing All Words
• CARNMT-ALL: the context-aware representation is applied to every word, not only to OOVs.

Standard NMT:
$h_j = f_{enc}(v_j, h_{j-1})$
$p(y_i \mid y_{<i}, x) = g(v_{y_{i-1}}, s_i, c_i)$
This work:
$h_j = f_{enc}(\varphi_e(L_{x_j}), h_{j-1})$
$p(y_i \mid y_{<i}, x) = g(\varphi_d(L_{y_{i-1}}), s_i, c_i)$

[Figure: on the source side every word vector is paired with its context-aware vector, $v_j : v_{L_j}$ with $v_{L_j} = \varphi_e(L_j)$; the target side is handled analogously with $\varphi_d(L'_i)$.]
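A minimal sketch of the source side of CARNMT-ALL, following the figure's pairing $v_j : v_{L_j}$ by concatenating each word vector with its context-aware vector; this is our reading of the figure, not the authors' exact implementation:

```python
import torch

# Every word, in-vocabulary or not, is paired with its context-aware vector:
# the encoder input at position j is the concatenation [v_j ; phi_e(L_j)].
def smooth_all(token_ids, embedding, phi_e, n=2):
    v = embedding(token_ids)                          # (batch, seq, dim)
    pad = torch.zeros(v.size(0), n, v.size(2))        # zero-pad borders (assumption)
    vp = torch.cat([pad, v, pad], dim=1)
    cols = []
    for j in range(token_ids.size(1)):
        c = j + n
        ctx = torch.cat([vp[:, c - n:c], vp[:, c + 1:c + 1 + n]], dim=1)
        cols.append(torch.cat([v[:, j], phi_e(ctx)], dim=-1))  # v_j : v_{L_j}
    return torch.stack(cols, dim=1)                   # (batch, seq, 2*dim)
```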
Experimental Settings
• The training data includes 1.42M Chinese-to-English parallel sentence pairs from the LDC corpus.
• The NIST 2002 (MT02) dataset serves as the validation set and NIST 2003-2008 (MT03-08) as the test sets. Case-insensitive 4-gram NIST BLEU (Papineni et al., 2002) is the evaluation metric.
• Vocabulary size: 30k; maximum sentence length: 80; mini-batch size: 80; word embedding dimension: 620; hidden layer dimension: 1000; dropout on all layers; Adadelta optimizer.
• The baselines include: standard attentional NMT (Bahdanau et al., 2015); sub-word-based NMT (Sennrich et al., 2016); character-based NMT (Costa-jussà and Fonollosa, 2016); replacing unk with a semantically similar in-vocabulary word (Li et al., 2016).
Experimental Results
• Results for the Chinese-to-English translation task:
  • Moses vs. NMT: both are strong baselines.
  • CARNMT-Enc/Dec vs. Bahdanau et al. (2015): our smoothing method effectively relieves the negative effect of OOVs (Motivation 2).
  • CARNMT-Both vs. CARNMT-Enc/Dec: source-side smoothing is orthogonal to target-side smoothing.
  • ALLSmooth vs. CARNMT-Both: smoothing in-vocabulary words is also beneficial for NMT (Motivation 1).
  • FCWM vs. CCWM: the CCWM learns a context-semantic representation directly for smoothing the word vector, while the FCWM predicts the semantic representation of a word from its context.
Experimental Results
• Translation quality for sentences with different numbers of OOVs
  (number of sentences in each OOV-count bucket: 2306, 1827, 1121, 678, 391, 215, 123, 59, 37, 24, 29)
• When the number of OOVs is 0:
  • ALLSmooth is better than the baseline (Bahdanau et al., 2015);
  • both CARNMT-Enc and CARNMT-Dec are similar to the baseline.
• As the number of OOVs increases, the gap between our methods and the other methods (except PBSMT) becomes larger, especially beyond five OOVs.
• When the number of OOVs is more than seven, PBSMT is better than all NMT models.
Experimental Results
• A case study of OOV translations
  • The negative effect of the OOV in baseline NMT: the OOV "jiyuqi" itself is not translated, and the phrase "lizheng yousuo zuowei" (the red part in the English reference) is not translated either.
  • After smoothing the negative effect of the OOV, the model obtains the translation "striving to accomplish something" for "lizheng yousuo zuowei".
Conclusion
• Experimental results showed that the negative effect of OOVs decreases the translation performance of NMT, and that the existing RNN encoder cannot adequately address this problem.
• The learned context-aware representation (CAR) was integrated into the encoder to smooth word representations, thereby also enhancing the decoder of NMT.
• Experimental results showed that the proposed method greatly alleviates the negative effect of OOVs and enhances the word representations of in-vocabulary words, thus improving translation quality.
Q&A
Thanks!