1. “Dual Learning for Machine Translation”
Di He et al.
November 2016
Toru Fujino
The University of Tokyo, Graduate School of Frontier Sciences, Human and Engineered Environmental Studies, Chen Lab, D1 (first-year doctoral student)
2. Paper information
• Authors: Di He et al. (Microsoft Research Asia)
• Conference: NIPS 2016
• Date: November 1, 2016 (arXiv)
• Times cited: 1
3. Overview
• What
• Introduce an autoencoder-like mechanism, “Dual learning”,
to utilize monolingual datasets
• Results
• Dual Learning with 10% data ≈ Baseline model with 100% data
1) “Dual Learning: A New Learning Paradigm”, https://www.youtube.com/watch?v=HzokNo3g63E&feature=youtu.be
4. Neural machine translation
• Learn the conditional probability $P(y \mid x; \Theta)$ from an input $x = \{x_1, x_2, \ldots, x_{T_x}\}$ to an output $y = \{y_1, y_2, \ldots, y_{T_y}\}$
• Maximize the log probability
$\Theta^{\ast} = \operatorname*{argmax}_{\Theta} \sum_{(x,y) \in D} \sum_{t=1}^{T_y} \log P(y_t \mid y_{<t}, x; \Theta)$
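To make the objective concrete, here is a minimal runnable sketch; `model_prob` is a hypothetical stand-in for the network's next-token distribution $P(y_t \mid y_{<t}, x; \Theta)$, not the paper's actual model:

```python
import math

# Toy sketch of the NMT training objective (illustrative only).
# `model_prob` stands in for P(y_t | y_<t, x; Theta), e.g. an
# attention-based encoder-decoder.

def model_prob(next_token, prefix, src, vocab_size=4):
    # Uniform next-token distribution, purely for illustration.
    return 1.0 / vocab_size

def sentence_log_prob(src, tgt):
    # sum_t log P(y_t | y_<t, x; Theta) over one target sentence.
    return sum(math.log(model_prob(tgt[t], tgt[:t], src))
               for t in range(len(tgt)))

def corpus_log_likelihood(pairs):
    # The quantity maximized over Theta: sum over (x, y) in D.
    return sum(sentence_log_prob(x, y) for x, y in pairs)

pairs = [([1, 2, 3], [2, 3]), ([0, 1], [1, 0, 2])]
print(corpus_log_likelihood(pairs))  # 5 tokens * log(1/4)
```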
5. Difficulty in getting large bilingual data
• Solution: utilization of monolingual data
• Train a language model of the target language, and then
integrate it with the MT model1)2)
<- does not fundamentally address the shortage of
parallel data.
• Generate pseudo bilingual data from monolingual data3)4)
<- no guarantee on the quality of the pseudo bilingual data
1) T. Brants et al., “Large language models in machine translation”, EMNLP 2007
2) C. Gulcehre et al., “On using monolingual corpora in neural machine translation”, arXiv 2015
3) R. Sennrich et al., “Improving neural machine translation models with monolingual data”, ACL 2016
4) N. Ueffing et al., “Semi-supervised model adaptation for statistical machine translation”, Machine Translation Journal 2008
6. Dual learning algorithm
• Use monolingual datasets to train translation
models through dual learning
• Things required:
$D_A$: corpus of language A
$D_B$: corpus of language B (not necessarily aligned with $D_A$)
$P(\cdot \mid s; \Theta_{AB})$: translation model from A to B
$P(\cdot \mid s; \Theta_{BA})$: translation model from B to A
$LM_A(\cdot)$: learned language model of A
$LM_B(\cdot)$: learned language model of B
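For reference in the step-by-step sketches below, these ingredients might be represented with the following hypothetical stub interfaces (the names are illustrative, not from the paper):

```python
import random

# Hypothetical stubs for the dual-learning ingredients; real systems
# would be neural translation and language models.

class TranslationModel:
    """Stands in for P(. | s; Theta_AB) or P(. | s; Theta_BA)."""
    def beam_search(self, s, k):
        # Return K candidate translations of sentence s.
        return [f"{s} <candidate {i}>" for i in range(1, k + 1)]

    def log_prob(self, target, source):
        # ln P(target | source; Theta); random stub for illustration.
        return -random.uniform(0.5, 3.0)

class LanguageModel:
    """Stands in for LM_A(.) or LM_B(.)."""
    def score(self, sentence):
        # Log-probability of `sentence` under the language model.
        return -random.uniform(0.5, 3.0)

# D_A and D_B: unaligned monolingual corpora of languages A and B.
D_A = ["a sentence in language A"]
D_B = ["une phrase en langue B"]

model_ab = TranslationModel()   # P(. | s; Theta_AB)
lm_b = LanguageModel()          # LM_B
print(model_ab.beam_search(D_A[0], 2), lm_b.score(D_B[0]))
```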
7. Dual learning algorithm
1. Generate $K$ translated sentences $s_{\mathrm{mid},1}, s_{\mathrm{mid},2}, \ldots, s_{\mathrm{mid},K}$ from $P(\cdot \mid s; \Theta_{AB})$ based on beam search.
2. Compute intermediate rewards $r_{1,1}, r_{1,2}, \ldots, r_{1,K}$ with the language model of B, one per sentence: $r_{1,k} = LM_B(s_{\mathrm{mid},k})$.
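A toy, self-contained sketch of steps 1-2, with `beam_search_ab` and `lm_b` as stand-ins for the real A->B translator and language model:

```python
# Steps 1-2 (illustrative stubs, not the paper's code): generate K
# candidates s_mid,k with beam search from the A->B model, then score
# each with the language model of B: r_1,k = LM_B(s_mid,k).

def beam_search_ab(s, k):
    # Stand-in for beam search under P(. | s; Theta_AB).
    return [f"{s} [candidate {i}]" for i in range(1, k + 1)]

def lm_b(sentence):
    # Stand-in for LM_B; here just a length-based toy score.
    return -1.0 / max(len(sentence.split()), 1)

K = 3
s = "a monolingual sentence from D_A"
s_mid = beam_search_ab(s, K)     # step 1
r1 = [lm_b(c) for c in s_mid]    # step 2: r_1,k
print(list(zip(s_mid, r1)))
```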
9. Dual learning algorithm
3. Get communication rewards $r_{2,1}, r_{2,2}, \ldots, r_{2,K}$ for each sentence as $r_{2,k} = \ln P(s \mid s_{\mathrm{mid},k}; \Theta_{BA})$.
4. Set the total reward of the $k$-th sentence as $r_k = \alpha r_{1,k} + (1 - \alpha) r_{2,k}$.
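A worked toy example of steps 3-4 with made-up reward values; only the small weight `alpha = 0.005` is taken from the paper:

```python
# Steps 3-4 with made-up numbers. r_2,k is the reconstruction
# ("communication") reward ln P(s | s_mid,k; Theta_BA); the total
# reward linearly interpolates the two signals.

alpha = 0.005  # small weight, so the reconstruction reward dominates

r1 = [-0.8, -1.2, -0.5]   # r_1,k = LM_B(s_mid,k)
r2 = [-2.0, -1.1, -3.4]   # r_2,k = ln P(s | s_mid,k; Theta_BA)

r = [alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]
print(r)  # total rewards r_k; here candidate 2 is best
```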
11. Dual learning algorithm
5. Compute the stochastic gradients with respect to $\Theta_{AB}$ and $\Theta_{BA}$:
$\nabla_{\Theta_{AB}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ r_k \nabla_{\Theta_{AB}} \ln P(s_{\mathrm{mid},k} \mid s; \Theta_{AB}) \right]$
$\nabla_{\Theta_{BA}} E[r] = \frac{1}{K} \sum_{k=1}^{K} \left[ (1 - \alpha) \nabla_{\Theta_{BA}} \ln P(s \mid s_{\mathrm{mid},k}; \Theta_{BA}) \right]$
6. Update the model parameters:
$\Theta_{AB} \leftarrow \Theta_{AB} + \gamma_1 \nabla_{\Theta_{AB}} E[r]$
$\Theta_{BA} \leftarrow \Theta_{BA} + \gamma_2 \nabla_{\Theta_{BA}} E[r]$
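A minimal sketch of steps 5-6, assuming the per-candidate gradient vectors have already been obtained by backpropagation through the two NMT models (random placeholders here):

```python
import numpy as np

# Policy-gradient updates of steps 5-6 (illustrative only); the
# gradient arrays are placeholders for backprop through real models.

K, dim = 3, 4
alpha, gamma1, gamma2 = 0.005, 0.1, 0.1   # reward weight, learning rates

theta_ab = np.zeros(dim)   # parameters Theta_AB of the A->B model
theta_ba = np.zeros(dim)   # parameters Theta_BA of the B->A model

r = np.array([-1.99, -1.10, -3.39])   # total rewards r_k from step 4
grad_ab_k = np.random.randn(K, dim)   # grad_{Theta_AB} ln P(s_mid,k | s)
grad_ba_k = np.random.randn(K, dim)   # grad_{Theta_BA} ln P(s | s_mid,k)

# Step 5: Monte-Carlo estimates of the expected-reward gradients.
grad_E_ab = (r[:, None] * grad_ab_k).mean(axis=0)
grad_E_ba = ((1 - alpha) * grad_ba_k).mean(axis=0)

# Step 6: gradient ascent with learning rates gamma_1 and gamma_2.
theta_ab += gamma1 * grad_E_ab
theta_ba += gamma2 * grad_E_ba
print(theta_ab, theta_ba)
```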
14. Experiment settings
• Baseline models
• Bahdanau et al., “Neural Machine Translation by Jointly
Learning to Align and Translate”
• Sennrich et al., “Improving Neural Machine Translation
Models with Monolingual Data”
15. Dataset
• WMTʼ14
• 12M sentence pairs
• English -> French, French -> English
• Data usage (for dual learning)
• Small
1. Train translation models with 10% bilingual data.
2. Train translation models with 10% bilingual data and
monolingual data through dual learning algorithm.
3. Train translation models only with monolingual data through dual
learning algorithm.
• Large
1. Train translation models with 100% bilingual data.
2. Train translation models with 100% bilingual data and monolingual data through the dual learning algorithm.
3. Train translation models only with monolingual data through dual
learning algorithm.
17. Results
• Outperforms the baseline models
• In Fr->En, dual learning with 10% data ≈ baseline models with 100% data.
• Dual learning is especially effective when the bilingual dataset is small.
18. Results
• For different source sentence lengths
• The improvement is especially significant for long sentences.
21. Future extensions & words
• Application in other domains
• Generalization of dual learning
• Dual -> Triple -> … -> n-loop
• Learn from scratch
• only with monolingual data
• maybe plus a lexical dictionary
Application | Primal task | Dual task
Speech processing | Speech recognition | Text to speech
Image understanding | Image captioning | Image generation
Conversation engine | Question | Response
Search engine | Search | Query/keyword suggestion
22. Summary
• What
• Introduce “Dual learning algorithm” to utilize
monolingual data
• Results
• With 100% data, the model outperforms the baseline models
• With 10% data, the model shows results comparable to the baseline models
• Future
• Dual learning mechanism can be applied to other
domains
• Learn from scratch
23. Some notes
• Does dual learning learn word-to-word correspondences?
• Is training from bilingual data a must?
• Or would a lexical dictionary suffice?