2. Parts of these slides are adapted from Philipp Koehn's slides
(www.statmt.org)
3. Agenda
• Overview of the history
• Statistical machine translation
• Recent developments in SMT
• Neural machine translation
• Some problems of NMT
• Future of MT
7. The history of machine translation
• 1629
- René Descartes proposes a universal language
- Different tongues share one set of symbols
• 1947
- The transistor is invented; it later replaces vacuum tubes in computers
• 1949 ~
- Rule-based machine translation
• 1954
- First public demo by IBM (the Georgetown-IBM experiment)
• 1993 ~
- Statistical machine translation
• 2013 ~
- Neural machine translation
8. Rule-based translation systems
• Translation rules created by linguistics experts
• Hard to maintain or update
• Performance is still at (or close to) the state of the art
10. Agenda
• Overview of the history
• Statistical machine translation
• Recent developments in SMT
• Neural machine translation
• Some problems of NMT
• Future of MT
14. Evaluation of SMT
• BLEU
- n-gram matching (usually up to 4-grams)
• NIST
- Weights content words more heavily
• RIBES (Hideki Isozaki, 2010)
- Word order also matters
- Better suited for SVO-to-SOV language pairs
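To make the BLEU bullet concrete, here is a minimal single-reference BLEU sketch (the function name and the smoothing-free behavior are my own simplification, not a reference implementation):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())      # clipped n-gram matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0                                # no smoothing in this sketch
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and any candidate shorter than `max_n` words scores 0.0 here, which is why real implementations add smoothing.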
16. A brief history of the development of SMT
• 1990 ~ 2000
- Word-based models (the IBM models)
- Brown, Och, Ney
• 2003
- Phrase-based models
- Philipp Koehn
• 2005 / 2007
- Hierarchical phrase-based models
- David Chiang
• 2010 ~
- Tree models, factored models
17. Language model
• Models p(the dog is barking)
- To determine which translation candidate is more natural
• Markov assumption
• 5-gram models are most commonly used in SMT
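Under the Markov assumption, the probability of a sentence factors into per-word probabilities conditioned only on the previous n-1 words. A bigram version (kept small for illustration; SMT systems use 5-grams as noted above) might look like:

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    # Count bigrams, with <s> / </s> sentence-boundary markers.
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def sentence_prob(counts, sent):
    # p(w_1..w_m) ~ prod_i p(w_i | w_{i-1})  (maximum likelihood, no smoothing)
    prob = 1.0
    tokens = ["<s>"] + sent + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        prob *= counts[prev][cur] / total if total else 0.0
    return prob
```

A real SMT language model would add smoothing (e.g. Kneser-Ney) so unseen n-grams do not zero out the whole sentence.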
22. How to get word alignments
• In short
- Run GIZA++ on a parallel corpus
- Wait for ~5 hours
• Technically
- The 5 IBM models, HMM alignment models, and the EM algorithm
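The "technically" bullet can be illustrated with the EM training loop of IBM Model 1, the simplest of the five models GIZA++ implements (a toy sketch, not GIZA++'s actual code):

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """EM for IBM Model 1: learn t(e|f), the probability that
    source word f translates to target word e."""
    t = defaultdict(lambda: defaultdict(lambda: 1.0))  # uniform init
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for f_sent, e_sent in pairs:
            for e in e_sent:                 # E-step: expected alignment counts
                z = sum(t[f][e] for f in f_sent)
                for f in f_sent:
                    c = t[f][e] / z
                    count[f][e] += c
                    total[f] += c
        for f in count:                      # M-step: renormalize counts
            for e in count[f]:
                t[f][e] = count[f][e] / total[f]
    return t

# Two sentence pairs are enough for EM to discover that "das" <-> "the".
pairs = [(["das", "Haus"], ["the", "house"]),
         (["das", "Buch"], ["the", "book"])]
t = train_ibm_model1(pairs)
```

The co-occurrence of "das" with both "the house" and "the book" is what lets EM pull the probability mass of t(the|das) upward.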
27. Phrase-based translation model
Example: "He goes to the curry restaurant"
- Group into phrases: [He] [goes] [to] [the curry restaurant]
- Translate each phrase: [彼は] [行く] [に] [カレー屋]
- Reorder into target word order: 彼は カレー屋 に 行く
39. Resources for SMT
• Parallel corpora
- LDC data
- www.ldc.upenn.edu
- Europarl corpus
- Danish, Dutch, English, Finnish, French,
- German, Greek, Italian, Portuguese, Spanish, Swedish
- Japanese
- NTCIR-8 (3M), ASPEC (3M)
• Word alignment software
- GIZA++, Berkeley Aligner
• Language modelling
- SRILM, Berkeley LM, KenLM
• Decoders
- Moses (maintained by Koehn's group)
- Travatar (Graham Neubig)
40. Agenda
• Overview of the history
• Statistical machine translation
• Recent developments in SMT
• Neural machine translation
• Some problems of NMT
• Future of MT
41. Recent developments in SMT
• Advances in decoders
• Super-large-scale language models
- Language model compression
• Margin Infused Relaxed Algorithm (MIRA)
- Tunes the feature weights of the log-linear model in an online, large-margin fashion
• Tree models
- Tree-to-tree translation
- String-to-tree translation
- Tree-to-string translation
- Forest-to-string translation *
- * Robust to parsing errors
• Factored models
• Pre-reordering
42. What is a parse tree?
(Figures: a context-free grammar parse tree and a dependency grammar parse tree)
44. Pre-reordering phrase-based translation model
Example: "He goes to the curry restaurant"
- Pre-reorder the source into target word order: "He the curry restaurant to goes"
- Group into phrases: [He] [the curry restaurant] [to] [goes]
- Translate monotonically: 彼は カレー屋 に 行く
45. Example of pre-reordering (Japanese → English)
- Original input: 寿命 の 向上 が 実用 化 の 大きな 課題 で あ る 。
- Reordered input (from the restructured parse tree): the life of the improvement va_nsubjpass the practical application of a large problem is .
- Reference: the improvement of the life is a large problem of the practical application.
47. Agenda
• Overview of the history
• Statistical machine translation
• Recent developments in SMT
• Neural machine translation
• Some problems of NMT
• Future of MT
48. Problem of conventional SMT
• Under-fitting (non-parametric approach)
• Solution:
- Deep recurrent neural networks
57. Evaluation result: evaluation scores

System                                            BLEU   RIBES  HUMAN  JPO
Baseline phrase-based SMT                         29.80  0.691  -      -
Baseline hierarchical phrase-based SMT            32.56  0.746  -      -
Baseline tree-to-string SMT                       33.44  0.758  30.00  -
Submitted system 1 (NMT)                          34.19  0.802  43.50  -
Submitted system 2
  (NMT + system combination)                      36.21  0.809  53.75  3.81
Best competitor 1: NAIST
  (Travatar system with NeuralMT reranking)       38.17  0.813  62.25  4.04
Best competitor 2: naver
  (SMT t2s + spell correction + NMT reranking)    36.14  0.803  53.25  4.00
58. (Optional) Findings & insights
‣ Soft-attention models outperform multi-layer encoder-decoder models
‣ Training models on pre-reordered data hurts performance
‣ NMT models tend to produce grammatically valid but incomplete translations
59. Agenda
• Overview of the history
• Statistical machine translation
• Recent developments in SMT
• Neural machine translation
• Some problems of NMT
• Future of MT
60. Can't use monolingual data
• Deep fusion (Gulcehre et al., 2015)
• Integrates a neural language model trained on a massive monolingual corpus
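A rough sketch of the deep-fusion idea: the output layer sees both the translation decoder's hidden state and a gated copy of the pre-trained LM's hidden state (all weight names below are illustrative, not from the paper's code):

```python
import numpy as np

def deep_fusion_logits(s_tm, s_lm, W_tm, W_lm, b, v_g, b_g):
    # A scalar controller gate decides how much the LM state contributes,
    # then both hidden states are projected into output-vocabulary logits.
    g = 1.0 / (1.0 + np.exp(-(v_g @ s_lm + b_g)))   # sigmoid gate in (0, 1)
    return W_tm @ s_tm + W_lm @ (g * s_lm) + b
```

The gate lets the model lean on the monolingual LM for fluent continuations while the translation state keeps it anchored to the source.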
61. The attention mechanism is not perfect
• Local attention (Minh-Thang Luong, 2015)
(Figures: global attention model vs. local attention model)
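For reference, a minimal global (soft) attention step, using dot-product scoring for simplicity:

```python
import numpy as np

def soft_attention(query, keys, values):
    # Score every source position against the decoder query, softmax the
    # scores into weights, and return the weighted sum as the context vector.
    scores = keys @ query
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values
```

Luong's local variant restricts this softmax to a small window around a predicted source position instead of attending over all positions.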
63. Translation does not cover all the words
• Coverage-based NMT model (Zhaopeng Tu et al., 2016)
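The coverage idea can be caricatured as follows: keep a running sum of past attention weights and push new attention away from source words that are already covered (a deliberate simplification of Tu et al.'s learned coverage vector):

```python
import numpy as np

def attend_with_coverage(query, keys, coverage, penalty=1.0):
    # Subtract accumulated coverage from the attention scores so
    # already-translated source positions are down-weighted.
    scores = keys @ query - penalty * coverage
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights, coverage + weights        # also return updated coverage
```

Over a whole translation, the coverage vector approaches 1 for every source word, which is what discourages dropping words.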
64. Objective function is bad
• Cross-entropy is very different from BLEU
• Solutions:
- (1) Data as Demonstrator (Bengio et al., 2015)
65. Objective function is bad (cont.)
• Cross-entropy is very different from BLEU
• Solutions:
- (2) Mixed REINFORCE (Ranzato et al., 2016)
66. Objective function is bad (cont.)
• Cross-entropy is very different from BLEU
• Solutions:
- (3) Minimum Risk Training (Shen et al., 2015)
(Figure: objective of MRT)
- ~6 BLEU gain in a Chinese-English task
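The MRT objective shown on the slide minimizes the expected loss (risk) under the model distribution; following Shen et al.'s formulation, with a sampled candidate set and a sharpness parameter α, it can be written roughly as:

```latex
% Risk: expected loss \Delta (e.g. negative sentence-level BLEU)
% over a sampled candidate set \mathcal{S}(x^{(s)})
\tilde{\mathcal{R}}(\theta)
  = \sum_{s=1}^{S} \sum_{y \in \mathcal{S}(x^{(s)})}
      Q\left(y \mid x^{(s)}; \theta, \alpha\right)\,
      \Delta\left(y, y^{(s)}\right),
\qquad
Q\left(y \mid x; \theta, \alpha\right)
  = \frac{P(y \mid x; \theta)^{\alpha}}
         {\sum_{y' \in \mathcal{S}(x)} P(y' \mid x; \theta)^{\alpha}}
```

Unlike cross-entropy, this directly optimizes the evaluation metric through Δ, which is where the BLEU gain comes from.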
67. Large vocabulary problem
• The problem
- The English vocab. has 700K words
- So I set the size of the output layer to 700K
- Then I get a memory error
• Solutions
- I still want to use the 700K vocab.
- Noise-contrastive estimation (Gutmann and Hyvarinen, 2010)
- Clustering (Mikolov et al., 2013)
- Approximate learning approach (Jean et al., 2015)
- I give up, cut it to an 80K vocab. and recover <UNK> tokens
- Positional unknown model (Minh-Thang Luong et al., 2015)
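The memory error above is easy to reproduce on paper: the output softmax alone needs a hidden-size × vocabulary weight matrix (the hidden size below is illustrative):

```python
def softmax_layer_bytes(vocab_size, hidden_size, bytes_per_float=4):
    # Weight matrix (hidden_size x vocab_size) plus one bias per word,
    # stored as 32-bit floats.
    return (hidden_size * vocab_size + vocab_size) * bytes_per_float

full = softmax_layer_bytes(700_000, 1_000)   # ~2.8 GB for one layer's weights
cut = softmax_layer_bytes(80_000, 1_000)     # ~0.32 GB after cutting to 80K
```

And that is just the parameters; gradients, optimizer state, and the per-timestep softmax activations multiply the cost further.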
68. Agenda
• Overview of the history
• Statistical machine translation
• Recent developments in SMT
• Neural machine translation
• Some problems of NMT
• Future of MT
69. Future of MT
• Semantic preserving translation
• Character/sub-word level models
• Translation in context
• Low-resource translation
- Knowledge transfer
- Multilingual translation