Promoting Science and
Technology Exchange using
Machine Translation
Toshiaki Nakazawa
Japan Science and Technology Agency
Oct. 30, 2015 @ PSLT2015
Topics Today
• Introduction
• Practical J-C MT Development Project by JST
• 2nd Workshop on Asian Translation (WAT2015)
2
Number of Patents in the World
3
http://www.meti.go.jp/press/2014/11/20141112003/20141112003.html
[Figure legend: Others, China, Korea, Europe, USA, Japan]
Number of Scientific Papers
4
[Figure: number of scientific papers per year, 1981-2011 (y-axis: 0 to 450,000), for China, Germany, France, United Kingdom, India, Japan, South Korea, United States and Singapore; the chart labels USA, Japan and China]
* JST has calculated from “Web of Science” by Thomson Reuters
Q. Who is she?
• Tu Youyou (屠 呦呦)
• The first Chinese scientist to
win a Nobel science award
(Physiology or Medicine) in
2015
• Turned to ancient texts in
China and discovered clues for
the anti-parasitic drugs
5
Photo from The New York Times
Frontrunner 5000
• Issued by the Institute of Scientific and Technical
Information of China (ISTIC)
• Selected 315 outstanding journals
among 4600 journals in China
• Further selected 5000 outstanding
papers from each scientific field
• Abstracts are written in English, but the
contents are in Chinese
– Less access from abroad
6
http://f5000.istic.ac.cn
Q. Who is he?
• Toshihide Maskawa (益川敏英)
• Professor Emeritus at Kyoto
University
• Awarded the 2008 Nobel Prize
in Physics
• Extremely poor at foreign
languages
– Made a Nobel Lecture in
Japanese
– Poorly written English papers
7
Photo from Wikipedia
“English is just one of the tools”
• Juichi Yamagiwa (山極寿一)
• World-renowned expert in the
study of gorillas
• The current president of Kyoto
University
• “Thinking faculty can be
obtained by thinking in their
mother tongue (Japanese).”
• Translate -> Think
8
Photo from Nikkei
Promoting the Information Access
• Increasing number of documents written in
languages other than English
• Important information exists among them
• MT is an essential tool for easy access to
foreign-language information
– Chinese/Korean patent translation/search by JPO
– Practical JC MT Development Project by JST
9
Topics Today
• Introduction
• Practical J-C MT Development Project by JST
– Language resource construction
• automatic dictionary construction [PACLIC2015]
– Sentence analyzers (dependency parser)
• accuracy on scientific papers
– MT engine development
• overview of KyotoEBMT
• 2nd Workshop on Asian Translation (WAT2015)
10
Project Overview
• Period: 5 years from 2013
• Participating organizations
– Japan: JST, KyotoU(supporting: Tsukuba U, NICT)
– China: ISTIC, CAS, BJTU, HIT
• Break through the language barrier between
Japan and China by MT and promote the
science and technology exchange
11
http://foresight.jst.go.jp/jazh_zhja_mt/
Goal of This Project
Language Resource Construction
MT Engine Development
Sentence Analyzers
Japanese Chinese
機械翻訳 机器翻译
アルゴリズム 算法
蓄積 积累
アセトン 丙酮
… …
4M Technical Term Dictionary
ja: 原言語の意味を正しく目的言語に再現するためには,原言語表現の意味に適した訳語の選択が必要である。
zh: 为了能够正确的再现原来语言的意思,选择适合表现原来语言意思的译语是很重要的。
5M Parallel Corpus
开发机器翻译技术
开发 机器 翻译 技术
开发
机器
翻译
技术
Word Segmentation
Dependency Analysis
作为
测量
器械
使用
了
秒表
Input: 作为测量器械使用了秒表
Translation Examples
Output: 測定機器としてはストップウォッチを用いた
作为
使用
了
变位
操作者
オペレータ
して
は
変位
と
を
用いた
機器
して
は
ストップウォッチ
と
を
用いた
測定
使用
秒表
ストップウォッチ
を
使った
输入
器械
入力
機器
测量
频率
測定
頻度
・・・・・ ・・・・・
Example-based Machine Translation
especially for Chinese
Word seg: ACL2014 (short), IJCNLP2013
Parsing: PACLIC2012
Online Example Retrieving: EMNLP2011
Decoding: EMNLP2014
Dictionary Construction by pivoting: NAACL2015, PACLIC2015
DEMO: ACL2014
12
LANGUAGE RESOURCE CONSTRUCTION
13
J-C Language Resources
• Parallel Corpus
– Scientific Paper: 2M (including ASPEC, manual
construction and automatic extraction)
• will be increased to 5M during the project
– Patent: 31M (automatic extraction)
14
ASPEC (Asian Scientific Paper Excerpt Corpus)
• One of the fruits of the Japanese-Chinese
machine translation project conducted
between 2006 and 2010 in Japan
• JE scientific paper abstract corpus
– 3M parallel sentences extracted from 2M JE
paper abstracts owned by JST
• JC scientific paper excerpt corpus
– 680K parallel sentences manually translated from
Japanese papers which are stored in the e-journal
site “J-STAGE” run by JST
15
http://lotus.kuee.kyoto-u.ac.jp/ASPEC/
J-C Language Resources
• Parallel Corpus
– Scientific Paper: 2M (including ASPEC, manual
construction and automatic extraction)
• will be increased to 5M during the project
– Patent: 31M (automatic extraction)
• Parallel Dictionary
– Automatic construction using the existing
resources
– 3.6M entries (about 90% accuracy)
16
Large-scale Dictionary Construction via
Pivot-based Statistical Machine
Translation with Significance Pruning and
Neural Network Features
Raj Dabre (1), Chenhui Chu (2), Fabien Cromieres (2),
Toshiaki Nakazawa (2), Sadao Kurohashi (1)
(1) Kyoto University, Japan
(2) JST, Japan
PACLIC2015
Overview
• What we want: High quality, large size
technical term dictionary
• Why: Can be used as additional resource for
MT or CLIR etc.
• How: pivot based SMT (baseline, Chu+ 2015)
+ significance pruning
+ reranking by NN model
+ character-based OOV translation by NN
18
Dictionary Construction via Pivot-based Statistical
Machine Translation (SMT) [Chu+ 2015]
19
Ja-En corpus, En-Zh corpus, Ja-Zh corpus
Ja-En phrase table:
アダプター ||| adapter ||| …
反応 ||| reaction ||| …
・・・
En-Zh phrase table:
reaction ||| 反应 ||| …
adapter ||| 接头 ||| …
・・・
Pivoting
Ja-Zh pivot phrase table:
アダプター ||| 接头 ||| …
反応 ||| 反应 ||| …
・・・
Ja-Zh direct phrase table:
蛋白 質 ||| 蛋白 ||| …
アセチル 化 ||| 乙酰化 ||| …
・・・
Ja-Zh SMT
Ja-En dictionary:
アダプター蛋白質 ||| adapter protein
・・・
Zh-En dictionary:
乙酰化反应 ||| acetylation reaction
・・・
Ja-Zh dictionary:
アダプター蛋白質 ||| 接头蛋白
アセチル化反応 ||| 乙酰化反应
・・・
Common Chinese characters:
Zh 雪 爱 发
Ja 雪 愛 発
Noise Problem
20
In the pivot phrase table, the average number of
translations for each source phrase is 10,451!
Source-Pivot phrase table:
アダプター ||| adapter ||| …
しかも ||| adapter ||| …
反応 ||| reaction ||| …
計算 ||| reaction ||| …
・・・
Pivot-Target phrase table:
reaction ||| 反应 ||| …
reaction ||| 合成 ||| …
adapter ||| 接头 ||| …
adapter ||| 承载鞍 ||| …
・・・
Pivoting
Pivot phrase table:
アダプター ||| 接头 ||| …
アダプター ||| 承载鞍 ||| …
しかも ||| 接头 ||| …
しかも ||| 承载鞍 ||| …
反応 ||| 反应 ||| …
反応 ||| 合成 ||| …
計算 ||| 反应 ||| …
計算 ||| 合成 ||| …
・・・
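The pivoting step and the noise it produces can be sketched in a few lines. This is a minimal sketch, assuming the common triangulation formulation in which a source-target score is the sum over pivot phrases of the two table scores; the slides do not state the exact formula used. The function name `pivot_phrase_table` and all probabilities are hypothetical; only the phrase pairs come from the slide.

```python
from collections import defaultdict

def pivot_phrase_table(src_pvt, pvt_tgt):
    """Triangulate two phrase tables through the pivot language.

    Assumed formulation: p(t|s) = sum over pivots p of p(p|s) * p(t|p).
    """
    table = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_pvt.items():
        for p, p_given_s in pivots.items():
            for t, t_given_p in pvt_tgt.get(p, {}).items():
                table[s][t] += p_given_s * t_given_p
    return {s: dict(ts) for s, ts in table.items()}

# Phrase pairs from the slide; the probabilities are made up for illustration.
src_pvt = {
    "アダプター": {"adapter": 0.9},
    "しかも": {"adapter": 0.001},    # noisy alignment
    "反応": {"reaction": 0.8},
    "計算": {"reaction": 0.002},     # noisy alignment
}
pvt_tgt = {
    "adapter": {"接头": 0.7, "承载鞍": 0.1},
    "reaction": {"反应": 0.9, "合成": 0.05},
}

pivot_table = pivot_phrase_table(src_pvt, pvt_tgt)
n_pairs = sum(len(ts) for ts in pivot_table.values())
for s, ts in pivot_table.items():
    for t, score in ts.items():
        print(f"{s} ||| {t} ||| {score:.6f}")
```

Every source phrase that shares a pivot phrase inherits all of that pivot's targets, so a single noisy entry on either side multiplies into several noisy pivot entries. This is the growth that the pruning step is meant to control.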
Significance Pruning (1/2)
[Johnson+ 2007]
• Contingency table of phrase pairs in corpus
21
– # parallel sentences containing phrase s, t
– # source sentences containing phrase s
– # target sentences containing phrase t
– # parallel sentences
Significance Pruning (2/2)
[Johnson+ 2007]
• Fisher’s exact test
22
– Hypergeometric distribution
– Phrase pairs with a p-value larger than
a threshold are pruned
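The test can be sketched directly from the four counts in the contingency table. This is a minimal sketch, assuming a one-sided test (the probability of at least the observed number of co-occurrences under independence); the names `c_st`, `c_s`, `c_t`, `n`, the example counts and the threshold are all hypothetical and do not come from the slides.

```python
from math import comb

def fisher_p_value(c_st: int, c_s: int, c_t: int, n: int) -> float:
    """One-sided p-value from the hypergeometric distribution.

    c_st: parallel sentences containing both s and t
    c_s:  source sentences containing s
    c_t:  target sentences containing t
    n:    total number of parallel sentences
    """
    upper = min(c_s, c_t)
    tail = sum(comb(c_s, k) * comb(n - c_s, c_t - k)
               for k in range(c_st, upper + 1))
    return tail / comb(n, c_t)

def prune(counts, n, threshold):
    """Keep only phrase pairs whose p-value does not exceed the threshold."""
    return {pair: c for pair, c in counts.items()
            if fisher_p_value(c[0], c[1], c[2], n) <= threshold}

# Hypothetical counts (c_st, c_s, c_t) in a corpus of n = 1000 sentence pairs.
n = 1000
counts = {
    ("反応", "反应"): (40, 50, 45),  # co-occur far more often than chance
    ("計算", "反应"): (2, 50, 45),   # co-occur about as often as chance
}
kept = prune(counts, n, threshold=1e-3)
print(list(kept))
```

A pair seen together about as often as chance predicts (here roughly 50 * 45 / 1000 = 2.25 expected co-occurrences) gets a large p-value and is dropped, while a strongly associated pair survives.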
Reranking by NN model
23
Character-based model
Reranker with neural features
アダプター蛋白質 ||| 接头蛋白
アセチル化反応 ||| 乙酰化反应
・・・
Ja-Zh parallel corpus (ASPEC, 680k)
Ja-Zh dictionary automatically constructed by the baseline method (3.6M entries)
ジアルキルアミン (Dialkyl amine)
二烷基仲胺 ||| -1.66314
二烃基胺 ||| -2.09771
・・・
二烷基酰胺 ||| -2.46545
二烃基胺 ||| -82.57215
二烷基仲胺 ||| -109.61948
・・・
二烷基酰胺 ||| -118.26405
Character-based NN Model
• Learn character-based NN translation model
for both translation directions
– Groundhog framework for learning
• Model can be used also for the translation of
OOV words
24
Dataset for Experiments
25
Bilingual dictionaries
Language        Name         Size
Ja-En (1.4M)    Wiki title   361k
                Med          54k
                EDR          491k
                JST          550k
En-Zh (4.5M)    Wiki title   151k
                Med          48k
                EDR          909k
                Wanfang      2.0M
                ISTIC        1.4M
Ja-Zh (561k)    Wiki title   175k
                Med          54k
                EDR          330k
Parallel corpora
Language        Name         Size
Ja-En (49.1M)   LCAS         3.5M
                Abst title   22.6M
                Abst JICST   19.9M
                ASPEC        3.0M
En-Zh (8.7M)    LCAS         6.0M
                LCAS title   1.0M
                ISTIC PC     1.5M
Ja-Zh (680k)    ASPEC        680k
Experimental Results
26
Method                       BLEU-4  OOV (%)  Accuracy w/ OOV    Accuracy w/o OOV
                                              1 best   20 best   1 best   20 best
1. Direct only               40.84   26       0.3721   0.5255    0.5011   0.7082
2. Pivot only                53.32   8        0.5038   0.7284    0.5470   0.7908
3. Direct+Pivot (1+2)        54.52   8        0.5136   0.7367    0.5574   0.7994
4. 3 + Statistical Pruning*  55.86   8        0.5303   0.7260    0.5755   0.7878
5. 4 + NN Reranking          58.55   8        0.5566   0.7260    0.6040   0.7878
6. 4 + SVM Reranking         55.28   8        0.5472   0.7260    0.5938   0.7878
7. 5 + OOV translation       58.00   0        0.5588   0.7300    -        -
8. 6 + OOV translation       54.85   0        0.5494   0.7300    -        -
* Only pivot-target phrase table is pruned
Evaluated on Ja-Zh Iwanami biology and life science dictionaries
(dev: 4,983 pairs, test: 4,982 pairs)
Underestimation Problem
Type 1
  Ja term: 粘質土
  References: 粘质土/黏质土
  Translations: 粘性土/软泥/黏土/粘质土/黏性土/亚粘土/粘质土壤/粘性土壤/黏性土地/粘土质
Type 2
  Ja term: チョウザメ類
  References: 鲟形目鱼类/鲟鱼类
  Translations: 鲟形目/鲟鱼/鱘科类/鲟鱼类/鲟类/鱘科亚纲/鲟鱼亚纲/鱘科化合物/鲟鱼化合物/鲟亚纲
Type 3
  Ja term: 心血管系デコンディショニング
  References: 心血管脱适应/心血管脱锻炼
  Translations: 血管脱/心血管系统去条件化/心血管去条件化/去条件化心血管系统/血管去条件化/心血管系去条件化/去条件化心血管/去条件化的心血管系统/去条件化对心血管系统/心血管系统的去条件化
27
Type 1: top 1 is correct, but not covered by the references
Type 2: correct one is listed in top 20
Type 3: correct one is *not* listed in top 20
76% (38/50) of the errors belong to Type 1
=> actual 1-best accuracy is about 90%
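The "about 90%" estimate can be reproduced with a short calculation. This is a sketch under two assumptions that the slide does not state: that the 50 inspected errors were sampled from the 1-best errors of the final system (row 7 of the results table, 1-best accuracy 0.5588), and that the Type 1 share of the sample carries over to all errors.

```python
reported_1best = 0.5588   # row 7: NN reranking + OOV translation (assumed baseline)
sampled_errors = 50
type1_errors = 38         # 1-best is correct but missing from the references

type1_share = type1_errors / sampled_errors             # 0.76
error_rate = 1 - reported_1best                         # 0.4412
adjusted_1best = reported_1best + error_rate * type1_share

print(f"Type 1 share of errors: {type1_share:.2f}")
print(f"Adjusted 1-best accuracy: {adjusted_1best:.4f}")
```

Starting from the 0.6040 figure in row 5 (accuracy without OOV terms) instead gives a value just above 0.90, so the estimate is not sensitive to which row is taken as the baseline.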
Summary of Dictionary Construction
• Using the proposed method, we constructed a
3.6M-entry dictionary by translating Ja-En and
En-Zh dictionaries
• Future work: Classify the dictionary into
different domains
• Open the dictionary to the public soon
– improve the quality by crowd power
28
abnormity
畸形 (Biology)
反常 (Business Administration)
SENTENCE ANALYZERS (DEPENDENCY
PARSER)
29
Chinese-Japanese
Scientific Paper Treebank
• Selected 1000 parallel sentences from Ja-Zh
scientific papers
• HIT created Chinese treebank and Kyoto-U
created Japanese treebank
• Not enough for training the parsers, but
useful to check the practical accuracy of
parsers for scientific sentences
• Not public now, sorry …
30
Dependency Parsing Accuracy
• Japanese: 88.3%
– Clause-level evaluation, starting from gold
segmentation and POS-tag
– Lower than that for Web or newspaper by 2-3%
• Chinese: 75.7%
– Starting from gold segmentation and POS-tag
– Root accuracy = 73.2%
– Sentence accuracy = 12.7%
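The three kinds of figures quoted for Chinese can be computed as in the sketch below. The definitions are assumptions, since the slide does not spell them out: dependency accuracy is taken as the share of tokens with the correct head, root accuracy as the share of sentences whose root is identified correctly, and sentence accuracy as the share of sentences with every head correct. The head-index encoding (0 marks the root) and the function name are hypothetical.

```python
def dependency_scores(gold, pred):
    """Score predicted head indices against gold ones.

    gold, pred: lists of sentences; each sentence is a list of head indices,
    one per token, with 0 marking the root token.
    """
    n_tokens = n_heads_ok = n_roots_ok = n_sents_ok = 0
    for g, p in zip(gold, pred):
        if len(g) != len(p):
            raise ValueError("gold and predicted sentences differ in length")
        hits = sum(gh == ph for gh, ph in zip(g, p))
        n_tokens += len(g)
        n_heads_ok += hits
        gold_roots = [i for i, h in enumerate(g) if h == 0]
        pred_roots = [i for i, h in enumerate(p) if h == 0]
        n_roots_ok += gold_roots == pred_roots
        n_sents_ok += hits == len(g)
    n_sents = len(gold)
    return {
        "dependency": n_heads_ok / n_tokens,
        "root": n_roots_ok / n_sents,
        "sentence": n_sents_ok / n_sents,
    }

# Toy data: the first sentence is parsed perfectly, the second is fully wrong.
gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 2], [2, 0]]
scores = dependency_scores(gold, pred)
print(scores)
```

Sentence accuracy falls much faster than token-level accuracy because one wrong head anywhere in a long scientific sentence makes the whole sentence count as wrong.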
31
MT ENGINE DEVELOPMENT
32
Overview of KyotoEBMT
33
Translation Examples
Input: 例えばプラスチックは石油から製造される
Output: plastic is produced from petroleum for example
例えば for example
プラスチック は 石油 から 製造 さ れる 例えば
plastic is produced from petroleum for example
the
水素 は 現在 天然ガス や 石油 から 製造 さ れる
hydrogen is produced from natural gas and petroleum at present
・・・・・ ・・・・・
プラスチック を 調査 した
We investigated plastic
raw
Specificities (1/2)
• No “phrase-table”
– all translation rules computed on-the-fly for each
input
– cons:
• possibly slower (but not so slow)
• computing significance / sparse features is more
complicated
– pros:
• full-context available for computing features
• no limit on the size of matched rules
• possibility to output perfect translation when input is
very similar to an example
34
Specificities (2/2)
• “Flexible” translation rules
– Optional words
– Alternative insertion positions
– Decoder can process flexible rules more
efficiently than a long list of alternative rules
• some “flexible rules” may actually encode > millions of
“standard rules”
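The claim that one flexible rule stands for many standard rules can be illustrated by expanding a small one. This is only a sketch: the rule representation (a token list, a set of optional token indices, and candidate gap positions per inserted variable) and the function `expand` are hypothetical and are not KyotoEBMT's internal format.

```python
from itertools import product

def expand(tokens, optional, insertions):
    """Enumerate the standard rules encoded by one flexible rule.

    tokens:     target-side tokens of the rule
    optional:   indices of tokens that may be dropped
    insertions: variable name -> candidate gap positions (0..len(tokens))
    """
    rules = []
    opt = sorted(optional)
    for keep in product([True, False], repeat=len(opt)):
        dropped = {i for i, k in zip(opt, keep) if not k}
        for gaps in product(*insertions.values()):
            placed = dict(zip(insertions.keys(), gaps))
            out = []
            for i in range(len(tokens) + 1):
                # Emit any variable inserted at gap i, then the token at i.
                out.extend(v for v, g in placed.items() if g == i)
                if i < len(tokens) and i not in dropped:
                    out.append(tokens[i])
            rules.append(" ".join(out))
    return rules

# One optional word ("raw") and three candidate positions for Y (made up).
tokens = ["X", "is", "produced", "from", "raw", "petroleum"]
rules = expand(tokens, optional={4}, insertions={"Y": [0, 1, 6]})
for r in rules:
    print(r)
```

The number of expansions is 2 to the power of the number of optional words, times the number of candidate positions of each inserted variable. Here that is 2 * 3 = 6, while 20 optional words alone would already give 2**20 = 1,048,576 standard rules.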
35
Flexible Rules Extracted on-the-fly
36
プラスチック(plastic) は 石油 から 製造 さ れる 例えば(for example)
the
水素 は 現在 天然ガス や 石油 から 製造 さ れる
hydrogen is produced from natural gas and petroleum at present
raw
X(plastic) is petroleum produced from
Y(for example) ? Y(for example) Y(for example)
raw*
Y: ambiguous insertion position
X: simple case (X has an equivalent in the source example)
“raw”: null-aligned = optional word
Improvements from Last Year
• Support forest input
– compact representation of many parses
– reduce the effect of parsing errors
• Supervised word alignment using Nile (Riesa et al., 2011)
together with the dependency tree-based
alignment model (Nakazawa and Kurohashi, 2012)
• 10 new features
• Reranking with Neural MT (Bahdanau et al., 2015)
37
BLEU Improvement
[Figure: BLEU (axis 30 to 40) for Chinese->Japanese translation at 2014/8/31 (WAT2014), 2015/3/31, 2015/7/15 and 2015/8/31 (WAT2015)]
38
Better Representation for PE
Chinese analysis:
考虑到 | 计算 | 一般人口中发生肾上腺偶发肿瘤的概率 | 的重要性 ,
我们 | 调查了 | 体检中发现肾上腺偶发肿瘤的 | 概率 。
Japanese translation in Chinese order:
の重要性を考慮して | を計算する | 一般人口に副腎偶発腫が発生する確率 ,
我々は | を調査した | 検診に副腎偶発腫を発現する | 確率 。
Japanese translation result:
の重要性 | を考慮してを計算する一般人口に副腎偶発腫が発生する確率 ,
我々は | を調査した検診に副腎偶発腫を発現する | 確率 。
[Kishimoto et al., 2014, WPTP3]
Topics Today
• Introduction
• Practical J-C MT Development Project by JST
– Language resource construction
• automatic dictionary construction [PACLIC2015]
– Sentence analyzers (dependency parser)
• accuracy on scientific papers
– MT engine development
• overview of KyotoEBMT
• 2nd Workshop on Asian Translation (WAT2015)
40
• MT evaluation campaign focusing on Asian
languages (Japanese, Chinese, Korean and English
for now)
– Workshop was held the day before yesterday
• Tasks:
– Japanese <-> English scientific paper (ASPEC)
– Japanese <-> Chinese scientific paper (ASPEC)
– Chinese, Korean -> Japanese patent (JPC)
• All the data including test set are OPEN
– contribute to continuous evolution of MT research by
freely distributing the data (like PennTreebank sec. 23)
41
http://lotus.kuee.kyoto-u.ac.jp/WAT/
Participants List of MT Tasks
42
Team ID, Organization, subtasks (ASPEC: JE, EJ, JC, CJ; JPC: CJ, KJ)
NAIST Nara Institute of Science and Technology ✓ ✓ ✓ ✓
Kyoto-U Kyoto University ✓ ✓ ✓ ✓ ✓
WEBLIO_MT Weblio, Inc. ✓
TMU Tokyo Metropolitan University ✓
BJTUNLP Beijing Jiaotong University ✓
Sense Saarland University & Nanyang Technological University ✓ ✓ ✓
NICT National Institute of Information and Communications Technology ✓ ✓
TOSHIBA Toshiba Corporation ✓ ✓ ✓ ✓ ✓ ✓
WASUIPS Waseda University ✓
naver NAVER Corporation ✓ ✓
EHR Ehara NLP Research Laboratory ✓ ✓ ✓ ✓
ntt NTT Communication Science Laboratories ✓
[Legend: outside Japan, company]
Over 50 attendees!
43
Human Evaluation in WAT2015
• Pairwise Crowdsourcing Evaluation
– System output vs. baseline output
– Evaluators judge win (1), loss (-1), or tie (0) for
the system output
– 5 evaluators assessed each translation pair
– The final judgment for each sentence is decided
by voting based on the sum of judgments:
• Win: sum ≥ 2, Loss: sum ≤ -2, Tie: otherwise
– Crowd score = 100 * (Win-Loss) / 400
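The voting rule and the score can be sketched as follows. This is a minimal sketch, assuming that the 400 in the formula is the number of evaluated sentences, so the code divides by the number of sentences it is given; the function names and the toy votes are hypothetical.

```python
def sentence_judgment(votes):
    """Combine the evaluators' votes (+1 win, -1 loss, 0 tie) for one sentence."""
    total = sum(votes)
    if total >= 2:
        return "win"
    if total <= -2:
        return "loss"
    return "tie"

def crowd_score(all_votes):
    """Crowd score = 100 * (Win - Loss) / number of sentences."""
    judgments = [sentence_judgment(v) for v in all_votes]
    win = judgments.count("win")
    loss = judgments.count("loss")
    return 100 * (win - loss) / len(all_votes)

# Toy evaluation of 400 sentences, 5 votes each: 200 wins, 100 losses, 100 ties.
votes = ([[1, 1, 0, 0, 0]] * 200
         + [[-1, -1, -1, 0, 1]] * 100
         + [[1, -1, 0, 0, 1]] * 100)
score = crowd_score(votes)
print(score)
```

The score ranges from -100 (the baseline wins every sentence) to 100 (the system wins every sentence), which matches the negative and positive bars in the crowd evaluation charts.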
44
Human Evaluation in WAT2015
• JPO Adequacy Evaluation (NEW)
– Top 3 teams of each subtask according to the
Crowd score
– 5-scale criterion defined by Japan Patent Office
45
5 All important information is transmitted correctly. (100%)
4 Almost all important information is transmitted correctly. (80%〜)
3 More than half of important information is transmitted correctly. (50%〜)
2 Some of important information is transmitted correctly. (20%〜)
1 Almost no important information is transmitted correctly. (〜20%)
Findings at WAT2015
• Neural Network based re-ranking is effective (NAIST,
Kyoto-U, naver)
• The top SMT outperformed RBMT for Chinese-
Japanese and Korean-Japanese patent translation
• Korean-Japanese patent translation achieved high
scores for both automatic and human evaluations
• A problem of automatic evaluation was found in the
Korean-Japanese evaluation
• For the detail, please visit
http://lotus.kuee.kyoto-u.ac.jp/WAT/
or search papers in ACL Anthology
46
Scientific Paper J->E
47
[Figure: Crowd Evaluation Score (axis -40 to 50) for NAIST, Kyoto-U, TOSHIBA, RBMT D, NICT, SMT S2T, Online D, Sense, TMU]
Scientific Paper E->J
48
[Figure: Crowd Evaluation Score (axis -60 to 80) for NAIST, WEBLIO MT, naver, Kyoto-U, TOSHIBA, Online A, EHR, SMT T2S, RBMT B, Sense]
Scientific Paper J->C
49
[Figure: Crowd Evaluation Score (axis -25 to 25) for TOSHIBA, Kyoto-U, SMT S2T, NAIST, RBMT B, Online D]
Scientific Paper C->J
50
[Figure: Crowd Evaluation Score (axis -40 to 50) for NAIST, EHR, Kyoto-U, TOSHIBA, SMT T2S, BJTUNLP, Online A, RBMT A]
Scientific Paper C->J
51
[Figure: Crowd Evaluation Score (axis -25 to 25) for TOSHIBA, Kyoto-U, SMT S2T, NAIST, RBMT B, Online D]
Patent C->J
52
[Figure: Crowd Evaluation Score (axis -50 to 40) for Kyoto-U, TOSHIBA, EHR, SMT T2S, ntt, Online A, WASUIPS, RBMT A]
Patent K->J
53
[Figure: Crowd Evaluation Score (axis -30 to 50) for Online A, naver, NICT, EHR, TOSHIBA, Sense, SMT Hiero, RBMT A, Sense]
JPO Adequacy Evaluation Results
54
Problem of Automatic Evaluation
The highest automatic scores
The lowest crowd score
55
Next Step
• WAT2016 will be co-located with Coling2016!
– Not decided yet…
• Include new language pair!
– Indonesian-English
• Need more investigation to acquire reliable
human evaluation results at low cost
56
Summary
• MT is an essential tool for easy access to
foreign-language information
• Our contributions
– J-C MT project to promote science and
technology exchange between China and Japan
• Constructed and exchanged language resources
• Have been developing sentence analyzers and MT
– Workshop on Asian Translation
• What’s next
– Make practical use of the developed MT system
57
THANK YOU FOR YOUR ATTENTION!
58

SBFT Tool Competition 2024 -- Python Test Case Generation Track
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
 
Chizaram's Women Tech Makers Deck. .pptx
Chizaram's Women Tech Makers Deck.  .pptxChizaram's Women Tech Makers Deck.  .pptx
Chizaram's Women Tech Makers Deck. .pptx
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
 
Genshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxGenshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptx
 

Promoting Science and Technology Exchange using Machine Translation

  • 1. Promoting Science and Technology Exchange using Machine Translation Toshiaki Nakazawa Japan Science and Technology Agency Oct. 30, 2015 @ PSLT2015
  • 2. Topics Today • Introduction • Practical J-C MT Development Project by JST • 2nd Workshop on Asian Translation (WAT2015) 2
  • 3. Number of Patents in the World 3 http://www.meti.go.jp/press/2014/11/20141112003/20141112003.html [Chart legend: Others, China, Korea, Europe, USA, Japan]
  • 4. Number of Scientific Papers 4 [Line chart, 1981 to 2011, vertical axis 0 to 450000. Highlighted series: USA, Japan, China. Legend: China, Germany, France, United Kingdom, India, Japan, South Korea, United States, Singapore] * JST has calculated from “Web of Science” by Thomson Reuters
  • 5. Q. Who is she? • Tu Youyou (屠 呦呦) • The first Chinese scientist to win a Nobel science award (Physiology or Medicine) in 2015 • Turned to ancient texts in China and discovered clues for the anti-parasitic drugs 5 Photo from The New York Times
  • 6. Frontrunner 5000 • Issued by Institute of Scientific and Technical Information of China(ISTIC) • Selected 315 outstanding journals among 4600 journals in China • Further selected 5000 outstanding papers from each scientific field • Abstracts are written in English, but the contents are in Chinese – Less access from abroad 6 http://f5000.istic.ac.cn
  • 7. Q. Who is he? • Toshihide Maskawa (益川敏英) • Professor Emeritus at Kyoto University • Awarded the 2008 Nobel Prize in Physics • Extremely poor at foreign languages – Made a Nobel Lecture in Japanese – Poorly written English papers 7 Photo from Wikipedia
  • 8. “English is just one of the tools” • Juichi Yamagiwa (山極寿一) • World-renowned expert in the study of gorillas • The current president of Kyoto University • “Thinking faculty can be obtained by thinking in their mother tongue (Japanese).” • Translate -> Think 8 Photo from Nikkei
  • 9. Promoting the Information Access • Increasing number of documents written in languages other than English • Important information exists among them • MT is an essential tool for easy access to foreign-language information – Chinese/Korean patent translation/search by JPO – Practical JC MT Development Project by JST 9
  • 10. Topics Today • Introduction • Practical J-C MT Development Project by JST – Language resource construction • automatic dictionary construction [PACLIC2015] – Sentence analyzers (dependency parser) • accuracy on scientific papers – MT engine development • overview of KyotoEBMT • 2nd Workshop on Asian Translation (WAT2015) 10
  • 11. Project Overview • Period: 5 years from 2013 • Participating organizations – Japan: JST, KyotoU(supporting: Tsukuba U, NICT) – China: ISTIC, CAS, BJTU, HIT • Break through the language barrier between Japan and China by MT and promote the science and technology exchange 11 http://foresight.jst.go.jp/jazh_zhja_mt/
  • 12. Goal of This Project • Language Resource Construction – 4M Technical Term Dictionary (Japanese / Chinese): 機械翻訳 / 机器翻译, アルゴリズム / 算法, 蓄積 / 积累, アセトン / 丙酮, … – 5M Parallel Corpus, e.g. ja: 原言語の意味を正しく目的言語に再現するためには,原言語表現の意味に適した訳語の選択が必要である。 zh: 为了能够正确的再现原来语言的意思,选择适合表现原来语言意思的译语是很重要的。 • Sentence Analyzers – Word Segmentation: 开发机器翻译技术 → 开发 机器 翻译 技术 – Dependency Analysis: 开发 机器 翻译 技术 • MT Engine Development – Example-based Machine Translation – Input: 作为测量器械使用了秒表 – Output: 測定機器としてはストップウォッチを用いた – Translation Examples (aligned tree fragments): 作为 使用 了 变位 操作者 / オペレータ して は 変位 と を 用いた; 使用 秒表 / ストップウォッチ を 使った; 输入 器械 / 入力 機器; 测量 频率 / 測定 頻度; … • Diagram label: “especially for Chinese” • Publications – Word seg: ACL2014 (short), IJCNLP2013 – Parsing: PACLIC2012 – Online Example Retrieving: EMNLP2011 – Decoding: EMNLP2014 – Dictionary Construction by pivoting: NAACL2015, PACLIC2015 – DEMO: ACL2014 12
  • 14. J-C Language Resources • Parallel Corpus – Scientific Paper: 2M (including ASPEC, manual construction and automatic extraction) • will be increased to 5M during the project – Patent: 31M (automatic extraction) 14
  • 15. ASPEC (Asian Scientific Paper Excerpt Corpus) • One of the fruits of the Japanese-Chinese machine translation project conducted between 2006 and 2010 in Japan • JE scientific paper abstract corpus – 3M parallel sentences extracted from 2M JE paper abstracts owned by JST • JC scientific paper excerpt corpus – 680K parallel sentences manually translated from Japanese papers which are stored in the e-journal site “J-STAGE” run by JST 15 http://lotus.kuee.kyoto-u.ac.jp/ASPEC/
  • 16. J-C Language Resources • Parallel Corpus – Scientific Paper: 2M (including ASPEC, manual construction and automatic extraction) • will be increased to 5M during the project – Patent: 31M (automatic extraction) • Parallel Dictionary – Automatic construction using the existing resources – 3.6M entries (about 90% accuracy) 16
  • 17. Large-scale Dictionary Construction via Pivot-based Statistical Machine Translation with Significance Pruning and Neural Network Features Raj Dabre1, Chenhui Chu2, Fabien Cromieres2, Toshiaki Nakazawa2, Sadao Kurohashi1 1: Kyoto University, Japan 2: JST, Japan PACLIC2015
  • 18. Overview • What we want: High quality, large size technical term dictionary • Why: Can be used as additional resource for MT or CLIR etc. • How: pivot based SMT (baseline, Chu+ 2015) + significance pruning + reranking by NN model + character-based OOV translation by NN 18
  • 19. Dictionary Construction via Pivot-based Statistical Machine Translation (SMT) [Chu+ 2015] 19 [Diagram components] • Ja-En corpus • Ja-En dictionary: アダプター蛋白質 ||| adapter protein, … • Ja-En phrase table: アダプター ||| adapter ||| …, 反応 ||| reaction ||| …, … • En-Zh corpus • Zh-En dictionary: 乙酰化反应 ||| acetylation reaction, … • En-Zh phrase table: reaction ||| 反应 ||| …, adapter ||| 接头 ||| …, … • Pivoting → Ja-Zh pivot phrase table: アダプター ||| 接头 ||| …, 反応 ||| 反应 ||| …, … • Ja-Zh corpus; Ja-Zh dictionary; common Chinese characters (Zh 雪 爱 发 / Ja 雪 愛 発) • Ja-Zh direct phrase table: 蛋白 質 ||| 蛋白 ||| …, アセチル 化 ||| 乙酰化 ||| …, … • Ja-Zh SMT: アダプター蛋白質 ||| 接头蛋白, アセチル化反応 ||| 乙酰化反应, …
  • 20. Noise Problem 20 • In the pivot phrase table, the average number of translations for each source phrase is 10,451! • Source-Pivot phrase table: アダプター ||| adapter ||| …, しかも ||| adapter ||| …, 反応 ||| reaction ||| …, 計算 ||| reaction ||| …, … • Pivot-Target phrase table: reaction ||| 反应 ||| …, reaction ||| 合成 ||| …, adapter ||| 接头 ||| …, adapter ||| 承载鞍 ||| …, … • Pivoting → Pivot phrase table: アダプター ||| 接头 ||| …, アダプタ ||| 承载鞍 ||| …, しかも ||| 接头 ||| …, しかも ||| 承载鞍 ||| …, 反応 ||| 反应 ||| …, 反応 ||| 合成 ||| …, 計算 ||| 反应 ||| …, 計算 ||| 合成 ||| …, …
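The pivoting step on slides 19 and 20 can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `triangulate`, the dict-based table layout, and all probability values are assumptions made for the example. It shows why a single spurious source-pivot pair (しかも ||| adapter) multiplies into several spurious source-target pairs.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Combine two phrase tables through shared pivot phrases.

    src_pivot: {source: {pivot: p(pivot | source)}}
    pivot_tgt: {pivot: {target: p(target | pivot)}}
    Returns {source: {target: score}}, summing over all shared pivots.
    """
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for piv, p_sp in pivots.items():
            for tgt, p_pt in pivot_tgt.get(piv, {}).items():
                table[src][tgt] += p_sp * p_pt
    return {src: dict(tgts) for src, tgts in table.items()}


# Toy tables; the probabilities are invented for illustration only.
ja_en = {
    "アダプター": {"adapter": 1.0},
    "しかも": {"adapter": 0.01},  # spurious pair from noisy alignment
    "反応": {"reaction": 1.0},
}
en_zh = {
    "adapter": {"接头": 0.7, "承载鞍": 0.3},
    "reaction": {"反应": 0.8, "合成": 0.2},
}

ja_zh = triangulate(ja_en, en_zh)
for src, tgts in ja_zh.items():
    print(src, tgts)
```

Every target of a pivot phrase is inherited by every source phrase linked to that pivot, so the pivot table grows multiplicatively. That is the motivation for pruning it.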
  • 21. Significance Pruning (1/2) [Johnson+ 2007] 21 • Contingency table of phrase pairs in corpus – C(s, t): # parallel sentences containing phrase s on the source side and phrase t on the target side – C(s): # source sentences containing phrase s – C(t): # target sentences containing phrase t – N: # parallel sentences
  • 22. Significance Pruning (2/2) [Johnson+ 2007] 22 • Fisher’s exact test, with the p-value computed from the hypergeometric distribution • Phrase pairs with a p-value larger than a threshold are pruned
  • 23. Reranking by NN model 23 • Character based model • Reranker with neural features • Ja-Zh parallel corpus (ASPEC, 680k) • Ja-Zh dictionary automatically constructed by the baseline method (3.6M entries): アダプター蛋白質 ||| 接头蛋白, アセチル化反応 ||| 乙酰化反应, … • Example: ジアルキルアミン (Dialkyl amine) – First candidate list: 二烷基仲胺 ||| -1.66314, 二烃基胺 ||| -2.09771, …, 二烷基酰胺 ||| -2.46545 – Second candidate list: 二烃基胺 ||| -82.57215, 二烷基仲胺 ||| -109.61948, …, 二烷基酰胺 ||| -118.26405
  • 24. Character-based NN Model • Learn character-based NN translation model for both translation directions – Groundhog framework for learning • Model can be used also for the translation of OOV words 24
  • 25. Dataset for Experiments 25 • Bilingual dictionaries – Ja-En (1.4M): Wiki title 361k, Med 54k, EDR 491k, JST 550k – En-Zh (4.5M): Wiki title 151k, Med 48k, EDR 909k, Wanfang 2.0M, ISTIC 1.4M – Ja-Zh (561k): Wiki title 175k, Med 54k, EDR 330k • Parallel corpora – Ja-En (49.1M): LCAS 3.5M, Abst title 22.6M, Abst JICST 19.9M, ASPEC 3.0M – En-Zh (8.7M): LCAS 6.0M, LCAS title 1.0M, ISTIC PC 1.5M – Ja-Zh (680k): ASPEC 680k
  • 26. Experimental Results 26
      Method | BLEU-4 | OOV (%) | Accuracy w/ OOV, 1 best | Accuracy w/ OOV, 20 best | Accuracy w/o OOV, 1 best | Accuracy w/o OOV, 20 best
      1. Direct only | 40.84 | 26 | 0.3721 | 0.5255 | 0.5011 | 0.7082
      2. Pivot only | 53.32 | 8 | 0.5038 | 0.7284 | 0.5470 | 0.7908
      3. Direct+Pivot (1+2) | 54.52 | 8 | 0.5136 | 0.7367 | 0.5574 | 0.7994
      4. 3 + Statistical Pruning* | 55.86 | 8 | 0.5303 | 0.7260 | 0.5755 | 0.7878
      5. 4 + NN Reranking | 58.55 | 8 | 0.5566 | 0.7260 | 0.6040 | 0.7878
      6. 4 + SVM Reranking | 55.28 | 8 | 0.5472 | 0.7260 | 0.5938 | 0.7878
      7. 5 + OOV translation | 58.00 | 0 | 0.5588 | 0.7300 | - | -
      8. 6 + OOV translation | 54.85 | 0 | 0.5494 | 0.7300 | - | -
      * Only pivot-target phrase table is pruned
      Evaluated on Ja-Zh Iwanami biology and life science dictionaries (dev: 4,983 pairs, test: 4,982 pairs)
  • 27. Underestimation Problem 27
      Type | Ja term | References | Translations
      1 | 粘質土 | 粘质土/黏质土 | 粘性土/软泥/黏土/粘质土/黏性土/亚粘土/粘质土壤/粘性土壤/黏性土地/粘土质
      2 | チョウザメ類 | 鲟形目鱼类/鲟鱼类 | 鲟形目/鲟鱼/鱘科类/鲟鱼类/鲟类/鱘科亚纲/鲟鱼亚纲/鱘科化合物/鲟鱼化合物/鲟亚纲
      3 | 心血管系デコンディショニング | 心血管脱适应/心血管脱锻炼 | 血管脱/心血管系统去条件化/心血管去条件化/去条件化心血管系统/血管去条件化/心血管系去条件化/去条件化心血管/去条件化的心血管系统/去条件化对心血管系统/心血管系统的去条件化
      Type 1: top 1 is correct, but not covered by the references
      Type 2: correct one is listed in top 20
      Type 3: correct one is *not* listed in top 20
      76% (38/50) of the errors belong to Type 1 => actual 1-best accuracy is about 90%
  • 28. Summary of Dictionary Construction • Using the proposed method, we constructed a 3.6M-entry dictionary by translating Ja-En and En-Zh dictionaries • Future work: classify the dictionary into different domains, e.g. abnormity: 畸形 (Biology), 反常 (Business Administration) • Open the dictionary to the public soon – improve the quality by crowd power 28
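The "about 90%" estimate on slide 27 can be reproduced with a short calculation. The sketch below rests on two assumptions that the slides do not state: that the 50 sampled errors are representative of all 1-best errors, and that the reported accuracy in question is the 0.5588 of method 7 on slide 26.

```python
# Reported 1-best accuracy (method 7, slide 26); assumed to be the system analysed.
reported_1best = 0.5588

# Manual error analysis from slide 27: 38 of 50 sampled errors were Type 1,
# i.e. the top candidate was correct but missing from the references.
type1_errors = 38
sampled_errors = 50

# Share of reported errors that are actually correct translations.
type1_rate = type1_errors / sampled_errors

# Corrected accuracy: reported hits plus the Type 1 share of reported misses.
corrected_1best = reported_1best + (1.0 - reported_1best) * type1_rate

print(f"Type 1 rate: {type1_rate:.2f}")
print(f"Corrected 1-best accuracy: {corrected_1best:.3f}")
```

Under these assumptions the corrected figure lands just under 0.90, consistent with the slide's claim.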
  • 30. Chinese-Japanese Scientific Paper Treebank • Selected 1000 parallel sentences from Ja-Zh scientific papers • HIT created Chinese treebank and Kyoto-U created Japanese treebank • Not enough for training the parsers, but useful to check the practical accuracy of parsers for scientific sentences • Not public now, sorry …  30
  • 31. Dependency Parsing Accuracy • Japanese: 88.3% – Clause-level evaluation, starting from gold segmentation and POS-tag – Lower than that for Web or newspaper by 2-3% • Chinese: 75.7% – Starting from gold segmentation and POS-tag – Root accuracy = 73.2% – Sentence accuracy = 12.7% 31
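The three Chinese figures on slide 31 are different views of the same parser output, which explains how a 75.7% attachment accuracy can coexist with a 12.7% sentence accuracy. The sketch below uses common definitions as an assumption (the slides do not define the metrics): head accuracy per token, root accuracy per gold root, and sentence accuracy as the share of fully correct sentences. The function name and the head-index encoding (0 = root) are illustrative.

```python
def parsing_scores(gold, pred):
    """Compute head, root and sentence accuracy.

    gold, pred: lists of sentences; each sentence is a list of head
    indices, one per token, where 0 marks the root.
    """
    tokens = correct = roots = roots_ok = sentences_ok = 0
    for g, p in zip(gold, pred):
        if len(g) != len(p):
            raise ValueError("gold and predicted sentences differ in length")
        matches = [gh == ph for gh, ph in zip(g, p)]
        tokens += len(g)
        correct += sum(matches)
        sentences_ok += all(matches)  # every head in the sentence is right
        for gh, ph in zip(g, p):
            if gh == 0:
                roots += 1
                roots_ok += ph == 0
    return {
        "head_accuracy": correct / tokens,
        "root_accuracy": roots_ok / roots,
        "sentence_accuracy": sentences_ok / len(gold),
    }


# Toy data: the first sentence is parsed perfectly, the second is not.
gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 2], [2, 0]]
scores = parsing_scores(gold, pred)
print(scores)
```

Because one wrong head anywhere makes the whole sentence count as wrong, sentence accuracy drops quickly on long scientific sentences even when most individual attachments are correct.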
  • 33. Overview of KyotoEBMT 33 Translation Examples Input: 例えばプラスチックは石油から製造される Output: plastic is produced from petroleum for example 例えば for example プラスチック は 石油 から 製造 さ れる 例えば plastic is produced from petroleum for example the 水素 は 現在 天然ガス や 石油 から 製造 さ れる hydrogen is produced from natural gas and petroleum at present ・・・・・ ・・・・・ プラスチック を 調査 した We investigated plastic raw
  • 34. Specificities (1/2) • No “phrase-table” – all translation rules computed on-the-fly for each input – cons: • possibly slower (but not so slow) • computing significance/ sparse features more complicated – pros: • full-context available for computing features • no limit on the size of matched rules • possibility to output perfect translation when input is very similar to an example 34
  • 35. Specificities (2/2) • “Flexible” translation rules – Optional words – Alternative insertion positions – Decoder can process flexible rules more efficiently than a long list of alternative rules • some “flexible rules” may actually encode > millions of “standard rules” 35
  • 36. Flexible Rules Extracted on-the-fly 36 [Diagram tokens] プラスチック (plastic) は 石油 から 製造 さ れる 例えば (for example) the 水素 は 現在 天然ガス や 石油 から 製造 さ れる hydrogen is produced from natural gas and petroleum at present raw X(plastic) is petroleum produced from Y(for example) ? Y(for example) Y(for example) raw* • X: simple case (X has an equivalent in the source example) • Y: ambiguous insertion position • “raw”: null-aligned = optional word
  • 37. Improvements from Last Year • Support forest input – compact representation of many parses – reduce the effect of parsing errors • Supervised word alignment using Nile (Riesa et al., 2011) together with the dependency tree-based alignment model (Nakazawa and Kurohashi, 2012) • 10 new features • Reranking with Neural MT (Bahdanau et al., 2015) 37
  • 39. Better Representation for PE [Kishimoto et al., 2014 WPTP3] (tokens in extraction order) • Chinese analysis: 的重要性 考虑到 计算 一般人口中发生肾上腺偶发肿瘤的概率 我们 调查了 体检中发现肾上腺偶发肿瘤的 概率 , 。 • Japanese translation in Chinese order: の重要性を考慮して を計算する 一般人口に副腎偶発腫が発生する確率 我々は を調査した 検診に副腎偶発腫を発現する 確率 , 。 • Japanese Translation Result: の重要性 を考慮してを計算する一般人口に副腎偶発腫が発生する確率 我々は を調査した検診に副腎偶発腫を発現する 確率 , 。
  • 40. Topics Today • Introduction • Practical J-C MT Development Project by JST – Language resource construction • automatic dictionary construction [PACLIC2015] – Sentence analyzers (dependency parser) • accuracy on scientific papers – MT engine development • overview of KyotoEBMT • 2nd Workshop on Asian Translation (WAT2015) 40
  • 41. • MT evaluation campaign focusing on Asian languages (Japanese, Chinese, Korean and English for now) – Workshop was held the day before yesterday • Tasks: – Japanese <-> English scientific paper (ASPEC) – Japanese <-> Chinese scientific paper (ASPEC) – Chinese, Korean -> Japanese patent (JPC) • All the data including test set are OPEN – contribute to continuous evolution of MT research by freely distributing the data (like Penn Treebank sec. 23) 41 http://lotus.kuee.kyoto-u.ac.jp/WAT/
  • 42. Participants List of MT Tasks 42 Columns: Team ID, Organization, ASPEC (JE, EJ, JC, CJ), JPC (CJ, KJ); the column position of each ✓ is not preserved • NAIST, Nara Institute of Science and Technology ✓ ✓ ✓ ✓ • Kyoto-U, Kyoto University ✓ ✓ ✓ ✓ ✓ • WEBLIO_MT, Weblio, Inc. ✓ • TMU, Tokyo Metropolitan University ✓ • BJTUNLP, Beijing Jiaotong University ✓ • Sense, Saarland University & Nanyang Technological University ✓ ✓ ✓ • NICT, National Institute of Information and Communication Technology ✓ ✓ • TOSHIBA, Toshiba Corporation ✓ ✓ ✓ ✓ ✓ ✓ • WASUIPS, Waseda University ✓ • naver, NAVER Corporation ✓ ✓ • EHR, Ehara NLP Research Laboratory ✓ ✓ ✓ ✓ • ntt, NTT Communication Science Laboratories ✓ [Legend: outside Japan; company]
  • 44. Human Evaluation in WAT2015 • Pairwise Crowdsourcing Evaluation – System output vs. baseline output – Evaluators judge win (1), loss (-1), or tie (0) for the system output – 5 evaluators assessed each translation pair – The final judgment for each sentence is decided by voting based on the sum of judgments: • Win: sum ≧ 2, Loss: sum ≦ -2, Tie: otherwise – Crowd score = 100 * (Win-Loss) / 400 44
  • 45. Human Evaluation in WAT2015 • JPO Adequacy Evaluation (NEW) – Top 3 teams of each subtask according to the Crowd score – 5-scale criterion defined by Japan Patent Office 45 • 5: All important information is transmitted correctly. (100%) • 4: Almost all important information is transmitted correctly. (80%〜) • 3: More than half of important information is transmitted correctly. (50%〜) • 2: Some of important information is transmitted correctly. (20%〜) • 1: Almost no important information is transmitted correctly. (〜20%)
  • 46. Findings at WAT2015 • Neural Network based re-ranking is effective (NAIST, Kyoto-U, naver) • The top SMT outperformed RBMT for Chinese-Japanese and Korean-Japanese patent translation • Korean-Japanese patent translation achieved high scores for both automatic and human evaluations • A problem of automatic evaluation was found in the Korean-Japanese evaluation • For details, please visit http://lotus.kuee.kyoto-u.ac.jp/WAT/ or search papers in ACL Anthology 46
  • 47. Scientific Paper J->E 47 [Bar chart of Crowd Evaluation Score, axis -40.00 to 50.00. Systems: NAIST, Kyoto-U, TOSHIBA, RBMT D, NICT, SMT S2T, Online D, Sense, TMU]
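The voting rule and the Crowd score on slide 44 can be written out as a short sketch. The function names are illustrative. One assumption is made explicit: the constant 400 in the slide's formula is read as the number of evaluated sentences, so the code divides by the number of sentences it is given.

```python
def sentence_verdict(judgments):
    """Aggregate the evaluators' judgments (1 = win, -1 = loss, 0 = tie)."""
    total = sum(judgments)
    if total >= 2:
        return "win"
    if total <= -2:
        return "loss"
    return "tie"


def crowd_score(all_judgments):
    """Crowd score = 100 * (Win - Loss) / number of evaluated sentences.

    With 400 evaluated sentences (assumed reading of the slide),
    this equals 100 * (Win - Loss) / 400.
    """
    verdicts = [sentence_verdict(j) for j in all_judgments]
    wins = verdicts.count("win")
    losses = verdicts.count("loss")
    return 100.0 * (wins - losses) / len(verdicts)


# Toy run: 200 wins, 100 losses and 100 ties over 400 sentences.
judgments = (
    [[1, 1, 1, 0, 0]] * 200
    + [[-1, -1, 0, 0, 0]] * 100
    + [[1, -1, 0, 0, 0]] * 100
)
print(crowd_score(judgments))
```

The score ranges from -100 (every sentence loses to the baseline) to 100 (every sentence wins), which matches the signed axes of the result charts.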
  • 48. Scientific Paper E->J 48 [Bar chart of Crowd Evaluation Score, axis -60.00 to 80.00. Systems: NAIST, WEBLIO MT, naver, Kyoto-U, TOSHIBA, Online A, EHR, SMT T2S, RBMT B, Sense]
  • 50. Scientific Paper C->J 50 [Bar chart of Crowd Evaluation Score, axis -40.00 to 50.00. Systems: NAIST, EHR, Kyoto-U, TOSHIBA, SMT T2S, BJTUNLP, Online A, RBMT A]
  • 52. Patent C->J 52 [Bar chart of Crowd Evaluation Score, axis -50.00 to 40.00. Systems: Kyoto-U, TOSHIBA, EHR, SMT T2S, ntt, Online A, WASUIPS, RBMT A]
  • 53. Patent K->J 53 [Bar chart of Crowd Evaluation Score, axis -30.00 to 50.00. Systems: Online A, naver, NICT, EHR, TOSHIBA, Sense, SMT Hiero, RBMT A, Sense]
  • 55. Problem of Automatic Evaluation 55 [Figure annotations: “The highest automatic scores”, “The lowest crowd score”]
  • 56. Next Step • WAT2016 will be co-located with Coling2016! – Not decided yet… • Include new language pair! – Indonesian-English • Need more investigation to acquire reliable human evaluation results at low cost 56
  • 57. Summary • MT is an essential tool for easy access to foreign-language information • Our contributions – J-C MT project to promote science and technology exchange between China and Japan • Constructed and exchanged language resources • Have been developing sentence analyzers and MT – Workshop on Asian Translation • What’s next – Make practical use of the developed MT system 57
  • 58. THANK YOU FOR YOUR ATTENTION! 58

Editor's Notes

  1. Binomial coefficient. This probability is interpreted as the probability of observing by chance an association that is at least as strong as the given one and hence its significance. A larger C(s, t) means more extreme cases than the observed one. a=log(N) for 1-1-1 tables
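The note above can be made concrete with a small sketch of the significance test, following the usual formulation of Fisher's exact test on the contingency counts C(s, t), C(s), C(t) and N from slide 21. The function names and the example threshold are assumptions for illustration; the slides do not give the threshold that was actually used.

```python
from math import comb, log


def fisher_p_value(c_st, c_s, c_t, n):
    """Probability of seeing at least c_st co-occurrences by chance.

    Sums the hypergeometric probabilities for k = c_st .. min(c_s, c_t),
    i.e. the observed table and all more extreme ones.
    """
    total = comb(n, c_t)
    tail = sum(
        comb(c_s, k) * comb(n - c_s, c_t - k)
        for k in range(c_st, min(c_s, c_t) + 1)
    )
    return tail / total


def keep_pair(c_st, c_s, c_t, n, threshold):
    """Keep a phrase pair only if -log(p-value) reaches the threshold."""
    return -log(fisher_p_value(c_st, c_s, c_t, n)) >= threshold


n = 1000
# A 1-1-1 table: s and t each occur once, and they occur together.
p_111 = fisher_p_value(1, 1, 1, n)
print(p_111, -log(p_111), log(n))

# A threshold slightly above log(N) removes all 1-1-1 pairs.
print(keep_pair(1, 1, 1, n, threshold=log(n) + 0.01))
print(keep_pair(5, 5, 5, n, threshold=log(n) + 0.01))
```

For a 1-1-1 table the p-value is exactly 1/N, so its negative log equals log(N). This is the value the note calls a, and it gives a natural reference point for choosing the pruning threshold.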