What is Machine Translation? 
GT 09/08/2013
Topics covered 
• Three types of Machine Translation 
• What can be translated? 
• Common MT systems 
• Which systems do our clients use? 
• Which system do we use?
Three Types of Machine Translation 
• Statistical Machine Translation (SMT) 
• Rule-Based Machine Translation (RBMT) 
• Hybrid Machine Translation 
– Rules post-processed by statistics 
– Statistics guided by rules
Statistical Machine Translation (SMT) 
• Developed by IBM in the early 1990s. 
• It is called “Statistical” because it is based on probability. 
• Two or three-step process: 
1. Training 
2. Decoding (= machine translation) 
3. [Recommended] Re-training (= Improving the engine once the files have 
been post-edited) 
• Training is the critical step of Machine Translation and takes much 
longer than the machine translation process itself.
SMT – Training Process 
1. Start by creating a Training Corpus
– Can be one or several translation memories in TMX format
– Can be a collection of source and target texts that will need to be aligned
2. Clean the corpus (automatic or semi-automatic process)
– Remove duplicates (keeping the most recent entry), identical source-target
segments and tags => Result is clean, text-only sentences
– Can involve manual cleansing depending on the level of “noise” found
3. Build a language model from the corpus (automatic process)
– Built for the target language only
– Contains n-grams (groups of n words)
– Used to find the smoothest translation = High probability of using the correct
n-gram based on its frequency in the corpus => Fluency.
4. Build a translation model from the corpus (automatic process)
– Bilingual model
– Contains n-grams
– Used to find the best translation match = High probability that a target n-gram is the
translation of a source n-gram => Accuracy.
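The cleaning and language-model steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular engine implements training: the function names (`clean_corpus`, `train_bigram_lm`, `bigram_probability`), the `(source, target)` tuple format, the tag pattern and the add-one smoothing are all choices made for this example.

```python
import re
from collections import Counter

# Assumption: inline tags look like <b>, </b> or <ph id="1"/>.
TAG = re.compile(r"<[^>]+>")


def clean_corpus(units):
    """Clean a list of (source, target) pairs ordered from oldest to newest.

    Strips tags, drops pairs whose source and target are identical, and keeps
    only the most recent target for each duplicated source segment.
    """
    latest = {}
    for source, target in units:
        source = " ".join(TAG.sub("", source).split())
        target = " ".join(TAG.sub("", target).split())
        if not source or not target or source == target:
            continue
        latest[source] = target  # a newer entry overwrites an older one
    return list(latest.items())


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over target-language sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing, so unseen bigrams are not zero."""
    vocabulary_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocabulary_size)


units = [
    ("Haga clic en <b>Aceptar</b>.", "Click <b>OK</b>."),  # older entry
    ("Haga clic en Aceptar.", "Select OK."),                # newer duplicate
    ("Windows", "Windows"),                                 # identical pair
]
cleaned = clean_corpus(units)
unigrams, bigrams = train_bigram_lm(target for _, target in cleaned)
```

Only the target side feeds the language model, which matches the "target language only" point above. A translation model would be estimated from both sides of `cleaned`.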
SMT - Decoding Process 
• This is what most people understand Machine Translation to be
• A file is processed sentence by sentence
• Each sentence is broken into n-grams
• The n-grams are translated based on the highest probability scores
in the translation model and in the language model
• The sentence is reconstructed from the best n-grams
• The file is reconstructed from all the translated sentences
SMT Example (ES-EN) 
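Source (ES): Maria no daba una bofetada a la bruja verde
Word for word: Mary not give a slap to the witch green
Other candidate translations found for parts of the sentence:
– did not / a slap / by / green witch
– no / slap / to the
– did not give / to
– the
– slap / the witch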
• The translation model tells us which translation is more likely given the source words.
• The language model tells us which translation reads best in the target language.
Possible good translations: 
• Mary did not give a slap to the green witch. 
• Mary did not slap the green witch.
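To make the interplay of the two models concrete, here is a toy scorer for the example above. Every probability in it is invented for illustration, and the names (`PHRASE_TABLE`, `BIGRAMS`, `score`, `render`) are hypothetical. A real decoder searches over a very large number of segmentations and reorderings rather than a hand-written list of candidates.

```python
import math

# Invented translation-model probabilities P(target phrase | source phrase).
PHRASE_TABLE = {
    ("Maria", "Mary"): 0.9,
    ("no", "not"): 0.4,
    ("no", "did not"): 0.5,
    ("daba", "give"): 0.6,
    ("una bofetada", "a slap"): 0.7,
    ("daba una bofetada", "slap"): 0.3,
    ("a la", "to the"): 0.6,
    ("a la", "the"): 0.3,
    ("bruja", "witch"): 0.9,
    ("verde", "green"): 0.9,
    ("bruja verde", "green witch"): 0.8,
}

# Invented language-model probabilities P(word | previous word).
# Bigrams "seen" in the target corpus get 0.5, everything else UNSEEN.
UNSEEN = 0.01
BIGRAMS = dict.fromkeys([
    ("<s>", "mary"), ("mary", "did"), ("did", "not"), ("not", "give"),
    ("not", "slap"), ("give", "a"), ("a", "slap"), ("slap", "to"),
    ("slap", "the"), ("to", "the"), ("the", "green"), ("green", "witch"),
    ("witch", "</s>"),
], 0.5)

# Each candidate is a list of (source phrase, target phrase) pairs in target order.
CANDIDATES = [
    [("Maria", "Mary"), ("no", "not"), ("daba", "give"), ("una bofetada", "a slap"),
     ("a la", "to the"), ("bruja", "witch"), ("verde", "green")],
    [("Maria", "Mary"), ("no", "did not"), ("daba", "give"), ("una bofetada", "a slap"),
     ("a la", "to the"), ("bruja verde", "green witch")],
    [("Maria", "Mary"), ("no", "did not"), ("daba una bofetada", "slap"),
     ("a la", "the"), ("bruja verde", "green witch")],
]


def render(candidate):
    """Join the target phrases of a candidate into one sentence."""
    return " ".join(target for _, target in candidate)


def score(candidate):
    """Log-probability of a candidate: translation model plus language model."""
    tm = sum(math.log(PHRASE_TABLE[pair]) for pair in candidate)
    words = ["<s>"] + render(candidate).lower().split() + ["</s>"]
    lm = sum(math.log(BIGRAMS.get(pair, UNSEEN)) for pair in zip(words, words[1:]))
    return tm + lm


best = max(CANDIDATES, key=score)
```

With these made-up numbers the word-for-word candidate scores well in the translation model but is heavily penalised by the language model, so the two fluent candidates rank above it. Note that a raw product of probabilities tends to favour shorter candidates, since each extra word multiplies in another factor below 1.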
SMT – Re-training Process 
• This is an optional but recommended step. 
• The post-edited files are converted into a new TMX file. 
• The post-editors’ feedback is used to attempt to correct frequently 
occurring errors => Modify engine settings. 
• The engine is re-trained using the previous Training Corpus as well 
as the new TMX file.
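As a rough sketch of the conversion step, post-edited segment pairs can be written out as a TMX file with Python's standard library. The function name `build_tmx`, the header values (tool name, version, language codes) and the input format are assumptions for this example. A real workflow would normally export the TMX from the CAT tool instead.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="es", tgt_lang="en"):
    """Return a TMX 1.4 document (as a string) for (source, post-edited target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "postedit-export",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


post_edited = [
    ("la bruja verde", "the green witch"),
    ("la casa blanca", "the white house"),
]
tmx_text = build_tmx(post_edited)
```

The resulting file can then be added to the previous Training Corpus before the engine is re-trained.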
SMT - Considerations 
• A large training corpus does not guarantee good quality MT output. 
• A clean and consistent training corpus must be used in order to achieve 
good quality MT output. 
• It is best to use a domain-based engine even when the client is the same, 
e.g. create one engine for UI and one for Help/Doc. 
• The quality of the MT output can vary from language to language and 
even from handoff to handoff. 
• The quality of the source text is important - Consistent terminology and 
sentence structure produce better output. 
• SMT engines can be tuned and improved with feedback. 
• SMT engines can be re-trained and improved by updating the training 
corpus with newly post-edited content.
Rule-Based Machine Translation (RBMT) 
• Based on: 
– Terminology 
• Bilingual or multilingual dictionary needed 
• Mono-lingual normalisation dictionary needed in order to standardise or correct 
source text before translation or to correct target text after translation 
– Rules representing the source sentence structure 
– Rules representing the target sentence structure 
– Rules on how the source structure and the target structure relate to each other 
• Steps: 
1. Obtain part-of-speech information for each source word (article, noun, verb etc). 
2. Obtain syntactic information about the verb (tense, person, voice). 
3. Parse the source sentence in order to identify the structure (subject, verb, object etc). 
4. Translate source words into target words. 
5. Create translated sentence by mapping dictionary entries into appropriate inflected 
forms based on target rules. 
6. [Optional but recommended] Once the post-editing is complete, update the 
dictionaries and/or rules based on the post-editors’ feedback.
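A toy version of these steps, for Spanish noun phrases only, might look like the following. The dictionary entries, the tag names and the single reordering rule are invented for this sketch. It skips verb analysis and inflection (steps 2 and 5) entirely, and an unknown word simply raises a `KeyError`.

```python
# Hypothetical bilingual dictionary: Spanish word -> (part of speech, English word).
DICTIONARY = {
    "la": ("DET", "the"),
    "una": ("DET", "a"),
    "bruja": ("NOUN", "witch"),
    "casa": ("NOUN", "house"),
    "verde": ("ADJ", "green"),
    "blanca": ("ADJ", "white"),
}


def tag(sentence):
    """Look up the part of speech of each source word."""
    return [(word, DICTIONARY[word][0]) for word in sentence.lower().split()]


def transfer(tagged):
    """Structural rule: Spanish NOUN + ADJ becomes English ADJ + NOUN."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][1] == "NOUN" and result[i + 1][1] == "ADJ":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return result


def translate(sentence):
    """Tag, reorder, then replace each source word with its dictionary entry."""
    reordered = transfer(tag(sentence))
    return " ".join(DICTIONARY[word][1] for word, _ in reordered)


result = translate("la bruja verde")
```

Improving the output here means editing `DICTIONARY` or `transfer` by hand, which is the human intervention that the optional last step refers to.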
RBMT - Considerations 
• Need very good dictionaries => Building new dictionaries is expensive 
because it needs to be done by a skilled linguist for each language. 
• The output may be accurate and grammatically correct, but not always 
very fluent. 
• RBMT engines are more expensive than SMT engines because a great 
deal of effort is required in terms of development and customisation 
before the engine produces the desired quality. 
• SMT engines can be re-trained automatically, whereas RBMT engines can 
only be updated through human intervention (update dictionaries and 
rules).
Hybrid Machine Translation 
Two types: 
• Rules post-processed by statistics 
– Translations are performed using a rules-based engine. 
– Statistics are then used in an attempt to adjust/correct the output from the 
rules engine. 
• Statistics guided by rules 
– Rules are used to pre-process data in an attempt to better guide the 
statistical engine. 
– Rules are also used to post-process the statistical output to perform 
functions such as normalization. 
– This approach offers considerably more power, flexibility and control
during translation.
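The "statistics guided by rules" idea can be sketched as a thin wrapper around a statistical engine. Everything here is hypothetical: the placeholder pattern, the `__VARn__` tokens, and `fake_engine`, which merely stands in for a call to a real engine.

```python
import re

# Assumption: variables in the source look like {0} or %s / %d.
PLACEHOLDER = re.compile(r"\{\d+\}|%[sd]")


def pre_process(text):
    """Rule: replace variables with opaque tokens the engine should copy through."""
    found = []

    def mask(match):
        found.append(match.group(0))
        return f"__VAR{len(found) - 1}__"

    return PLACEHOLDER.sub(mask, text), found


def post_process(text, found):
    """Rules: restore the variables, then normalise spacing around punctuation."""
    for index, original in enumerate(found):
        text = text.replace(f"__VAR{index}__", original)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return " ".join(text.split())


def hybrid_translate(text, engine):
    """Run rule-based pre-processing, the statistical engine, then post-processing."""
    masked, found = pre_process(text)
    return post_process(engine(masked), found)


def fake_engine(text):
    """Stand-in for a statistical engine; returns tokenised output with a spaced full stop."""
    return text.replace("Se han copiado", "Copied").replace("archivos", "files") + " ."


output = hybrid_translate("Se han copiado {0} archivos", fake_engine)
```

The statistical engine never sees the variable itself, so it cannot mistranslate or drop it, and the normalisation rule tidies the engine's tokenised output.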
What can be machine-translated?
Three File Types 
• Mono-lingual files (e.g. DOCX, HTML, TXT) 
Engines can translate mono-lingual files but this results in a mono-lingual 
translation => Very difficult to post-edit without reference to the source. 
• Translation memories in TMX format 
– The MT output is inserted into the target area of the translation unit.
– The source files for translation are processed in a CAT tool against the MT TM,
but penalties are applied to translation hits originating from the MT TM to indicate
that the translation needs to be post-edited.
• Bilingual files 
– The best option is to machine-translate XLIFF files. These are bilingual files that
can be imported into all modern CAT tools => Post-editing can be supported by the
use of a standard TM.
– Machine-translated segments are flagged with a specific status in the CAT tool.
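For the XLIFF route, a minimal sketch of filling empty targets with MT output and flagging them might look like this. It assumes XLIFF 1.2, trans-units without a seg-source element, and a hypothetical `engine` callable. How a given CAT tool displays the `state` and `state-qualifier` attributes varies from tool to tool.

```python
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", NS)  # keep XLIFF as the default namespace on output

SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="help.html" source-language="es" target-language="en" datatype="html">
    <body>
      <trans-unit id="1"><source>la bruja verde</source></trans-unit>
    </body>
  </file>
</xliff>"""


def machine_translate_xliff(xliff_text, engine):
    """Fill empty targets with MT output and flag them as MT suggestions."""
    root = ET.fromstring(xliff_text)
    for unit in root.iter(f"{{{NS}}}trans-unit"):
        source = unit.find(f"{{{NS}}}source")
        target = unit.find(f"{{{NS}}}target")
        if target is None:
            # Place the new target directly after the source element.
            target = ET.Element(f"{{{NS}}}target")
            unit.insert(list(unit).index(source) + 1, target)
        elif (target.text or "").strip():
            continue  # leave existing translations (e.g. TM matches) untouched
        target.text = engine(source.text or "")
        target.set("state", "needs-review-translation")
        target.set("state-qualifier", "mt-suggestion")
    return ET.tostring(root, encoding="unicode")


# str.upper stands in for a real MT engine in this example.
translated = machine_translate_xliff(SAMPLE, str.upper)
```

Because the source stays next to each target, the post-editor always has the reference that a mono-lingual file would lack.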
Which content? 
• Technical, structured content fares better than creative, free-flowing 
content 
– MT well suited to help systems, user guides, FAQs, Knowledge Base articles 
• UI strings not necessarily well suited to MT 
– UI strings can be difficult to interpret in standard localisation projects (omitted words 
for conciseness, variables, verb or noun?) => If UI strings are difficult for a human to 
interpret, it will be even harder for the engine 
– Short strings are not necessarily easier for the MT engine to decode than longer strings 
• Do not expect the engine to be creative 
– If words are not present in the Training Corpus or in the Dictionaries, the engine will not 
be able to come up with a translation for them => Depending on the engine, unknown 
words will be omitted, or left untranslated in the MT output 
• What level of MT output do you require? 
– Do you need to bring the MT output to human-quality level? 
– Do you simply need to be able to understand what is being said (e.g. social network 
sites, support chat lines)?
Common MT Applications (1/3) 
SMT
• Google Translate (71 languages)
– Often pivots through an intermediate language and through English before reaching the
actual target language, e.g. Catalan (ca ↔ es ↔ en ↔ other)
– CAREFUL ABOUT NDA!
• Microsoft Translator (39 languages)
– Bing Translator online
– Free API up to 2 million characters per month
– Offers Enterprise solutions
– CAREFUL ABOUT NDA!
• SDL Language Weaver (SDL BeGlobal) (54 language pairs)
– Free of charge to individual translators through Trados Studio 2011, but the engine is
not specific to their client or to their domain => CAREFUL ABOUT NDA!
– Subscription for Enterprises and LSPs
– Enterprises and LSPs may train their own engines via SDL BeGlobal Trainer (secure)
– Make MT part of the translation workflow via WorldServer
– Make MT suggestions available through the cloud via Trados Studio 2011
Common MT Applications (2/3) 
SMT
• Language Studio (by Asia Online) (over 500 direct language pairs)
– Offers on-site server installation => Licenses based on language pairs and translation
volume capacity.
– Offers Software as a Service (SaaS) => Pay as you go with 3 options (volume, fixed
monthly fee, file size).
– Offers 4 levels of MT quality, all with varying degrees of customisation (and price)
– Customisation is carried out by Asia Online
• Moses (open source, free)
– No language limitations
– Highly customisable on all levels (training and decoding) => Companies use Moses but
tailor it to their needs
– Can be turned into a Hybrid system by applying language-specific rules
Common MT Applications (3/3) 
RBMT
• PROMT (12 language pairs, no Asian support)
– Provides a free online translator tool (Online-translator.com)
– PROMT Professional (for translators) costs $265
– Offers an Enterprise solution (part of translation workflow)
• Apertium (open source, free)
– 36 language pairs
– No Asian character support
Hybrid
• SYSTRAN (52 language pairs)
– Has been around for 40+ years
– Started out as an RBMT system and has now been updated with the use of statistics
– SYSTRAN Premium Translator version lets you fully manage the dictionary (~ £700)
– SYSTRAN Enterprise Server 7 is available in three editions depending on company needs
– SYSTRAN says it is the fastest MT solution available
Timeline 
2010 - Asia Online launches Language Studio, a comprehensive MT and post-editing solution. 
- Systran launches its enhanced Enterprise 7 MT software. 
- Language Weaver launches its ‘quality confidence’ module. The company is acquired by SDL. 
2009 - Systran releases version 7, a hybrid version of its original RBMT. 
Includes an automated post-editing module. 
2007 - MOSES is launched as a downloadable kit. It begins to be used in a large-scale EU project
(Euromatrix) to speed up the MT development of new language pairs.
2004 - The OpenTrad project funded by the Spanish government begins to develop MT engines 
for Spain’s various languages. Using an existing RBMT engine, the consortium builds 
Apertium. 
2002 - Language Weaver is founded in California to develop SMT systems. 
2001 - IBM launches its WebSphere translation engine for 8 languages. 
- The National Institute of Standards and Technology (NIST) launches its first round of MT
system benchmarking.
1997 - The AltaVista Babelfish service launched on the web using Systran.
Which MT systems do our clients use? 
SMT 
• Adobe: Moses – Carried out initial tests in 2009 using PROMT for Russian and Language Weaver
for French and Spanish
• Autodesk: Moses
• HP: Language Weaver – Also have access to Microsoft Translator
• Oracle: Moses – Switched from Language Weaver in 2012
• Sybase: Moses – Trained by Pangeanic in Spain
RBMT
• PTC: PROMT
Hybrid
• Symantec: SYSTRAN
Which system do we use? 
Moses hybrid (Statistics guided by rules) 
To be continued…
Thank you!

More Related Content

What's hot

Translation Technology - a brief but useful introduction
Translation Technology -  a brief but useful introductionTranslation Technology -  a brief but useful introduction
Translation Technology - a brief but useful introductionFerris Translations e.U.
 
The translation of metaphor
The translation of metaphorThe translation of metaphor
The translation of metaphorAmer Minhas
 
Trasnlation shift
Trasnlation shiftTrasnlation shift
Trasnlation shiftBuhsra
 
Philosophical approaches to translation
Philosophical approaches to translationPhilosophical approaches to translation
Philosophical approaches to translationHabibeh khosravi
 
Types of corpus linguistics Parallel ,aligned...
 Types of corpus linguistics Parallel ,aligned... Types of corpus linguistics Parallel ,aligned...
Types of corpus linguistics Parallel ,aligned...RajpootBhatti5
 
Types of translation
Types of translationTypes of translation
Types of translationAshish Pal
 
Computer Aided Translation
Computer Aided TranslationComputer Aided Translation
Computer Aided TranslationPhilipp Koehn
 
Globalization and translation
Globalization and translationGlobalization and translation
Globalization and translationPankaj Dwivedi
 
Morphological Analysis
Morphological AnalysisMorphological Analysis
Morphological AnalysisAkshat Pandey
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguisticsAdnanBaloch15
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpusThennarasuSakkan
 
Translation methods
Translation methodsTranslation methods
Translation methodsAuver2012
 
Translation 1st lecture
Translation 1st lectureTranslation 1st lecture
Translation 1st lectureBahra Salah
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine TranslationJaganadh Gopinadhan
 
key Terms in translation studies
key Terms in translation studieskey Terms in translation studies
key Terms in translation studiesBuhsra
 
Techniques in translation, computer assisted, machine translation, subtitling...
Techniques in translation, computer assisted, machine translation, subtitling...Techniques in translation, computer assisted, machine translation, subtitling...
Techniques in translation, computer assisted, machine translation, subtitling...Moses Altovar
 
Translation history
Translation historyTranslation history
Translation historybjigjidsuren
 

What's hot (20)

Translation Technology - a brief but useful introduction
Translation Technology -  a brief but useful introductionTranslation Technology -  a brief but useful introduction
Translation Technology - a brief but useful introduction
 
The translation of metaphor
The translation of metaphorThe translation of metaphor
The translation of metaphor
 
Trasnlation shift
Trasnlation shiftTrasnlation shift
Trasnlation shift
 
Philosophical approaches to translation
Philosophical approaches to translationPhilosophical approaches to translation
Philosophical approaches to translation
 
Types of corpus linguistics Parallel ,aligned...
 Types of corpus linguistics Parallel ,aligned... Types of corpus linguistics Parallel ,aligned...
Types of corpus linguistics Parallel ,aligned...
 
translation method
translation methodtranslation method
translation method
 
Types of translation
Types of translationTypes of translation
Types of translation
 
Computer Aided Translation
Computer Aided TranslationComputer Aided Translation
Computer Aided Translation
 
Globalization and translation
Globalization and translationGlobalization and translation
Globalization and translation
 
Techniques in Translation
Techniques in TranslationTechniques in Translation
Techniques in Translation
 
Morphological Analysis
Morphological AnalysisMorphological Analysis
Morphological Analysis
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpus
 
Translation methods
Translation methodsTranslation methods
Translation methods
 
Translation 1st lecture
Translation 1st lectureTranslation 1st lecture
Translation 1st lecture
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine Translation
 
Translation theory
Translation theoryTranslation theory
Translation theory
 
key Terms in translation studies
key Terms in translation studieskey Terms in translation studies
key Terms in translation studies
 
Techniques in translation, computer assisted, machine translation, subtitling...
Techniques in translation, computer assisted, machine translation, subtitling...Techniques in translation, computer assisted, machine translation, subtitling...
Techniques in translation, computer assisted, machine translation, subtitling...
 
Translation history
Translation historyTranslation history
Translation history
 

Viewers also liked

Introduction to Machine translation - AEM
Introduction to Machine translation - AEMIntroduction to Machine translation - AEM
Introduction to Machine translation - AEMVivek Sachdeva
 
Introducing cat tools
Introducing cat toolsIntroducing cat tools
Introducing cat toolsAdrian Brand
 
Extending Machine Translation in AEM
Extending Machine Translation in AEMExtending Machine Translation in AEM
Extending Machine Translation in AEMVivek Sachdeva
 
Machine Translation: Latest Innovations and their Impact on Commercial Transl...
Machine Translation: Latest Innovations and their Impact on Commercial Transl...Machine Translation: Latest Innovations and their Impact on Commercial Transl...
Machine Translation: Latest Innovations and their Impact on Commercial Transl...SDL
 
Types of translation
Types of translationTypes of translation
Types of translationAzhar Bhatti
 
Translation Types
Translation TypesTranslation Types
Translation TypesElena Shapa
 
Basic steps in the research process
Basic steps in the research processBasic steps in the research process
Basic steps in the research processyen_dbsk
 
Deep Learning for Machine Translation, by Jean Senellart, SYSTRAN
Deep Learning for Machine Translation, by Jean Senellart, SYSTRANDeep Learning for Machine Translation, by Jean Senellart, SYSTRAN
Deep Learning for Machine Translation, by Jean Senellart, SYSTRANTAUS - The Language Data Network
 
Good Applications of Bad Machine Translation
Good Applications of Bad Machine TranslationGood Applications of Bad Machine Translation
Good Applications of Bad Machine Translationbdonaldson
 
Human vs machine translation
Human  vs machine translationHuman  vs machine translation
Human vs machine translationJeff Hernandez
 
Indianapolis - Wikipedia and the Cultural Sector
Indianapolis - Wikipedia and the Cultural SectorIndianapolis - Wikipedia and the Cultural Sector
Indianapolis - Wikipedia and the Cultural Sectorwittylama
 
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityEffective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityIDES Editor
 
Semantic Text Processing Powered by Wikipedia
Semantic Text Processing Powered by WikipediaSemantic Text Processing Powered by Wikipedia
Semantic Text Processing Powered by WikipediaMaxim Grinev
 
Machine translation vs human translation
Machine translation vs human translationMachine translation vs human translation
Machine translation vs human translationLanguages Pro
 
Natural Language Generation: New Automation and Personalization Opportunities
Natural Language Generation: New Automation and Personalization OpportunitiesNatural Language Generation: New Automation and Personalization Opportunities
Natural Language Generation: New Automation and Personalization OpportunitiesAutomated Insights
 
Online Character Recognition
Online Character RecognitionOnline Character Recognition
Online Character RecognitionKamakhya Gupta
 

Viewers also liked (18)

Introduction to Machine translation - AEM
Introduction to Machine translation - AEMIntroduction to Machine translation - AEM
Introduction to Machine translation - AEM
 
Introducing cat tools
Introducing cat toolsIntroducing cat tools
Introducing cat tools
 
Extending Machine Translation in AEM
Extending Machine Translation in AEMExtending Machine Translation in AEM
Extending Machine Translation in AEM
 
Machine Translation: Latest Innovations and their Impact on Commercial Transl...
Machine Translation: Latest Innovations and their Impact on Commercial Transl...Machine Translation: Latest Innovations and their Impact on Commercial Transl...
Machine Translation: Latest Innovations and their Impact on Commercial Transl...
 
Types of translation
Types of translationTypes of translation
Types of translation
 
Translation Types
Translation TypesTranslation Types
Translation Types
 
Basic steps in the research process
Basic steps in the research processBasic steps in the research process
Basic steps in the research process
 
Analysis
AnalysisAnalysis
Analysis
 
Deep Learning for Machine Translation, by Jean Senellart, SYSTRAN
Deep Learning for Machine Translation, by Jean Senellart, SYSTRANDeep Learning for Machine Translation, by Jean Senellart, SYSTRAN
Deep Learning for Machine Translation, by Jean Senellart, SYSTRAN
 
Microsoft - SEO - TAUS Tokyo Forum 2015
Microsoft - SEO - TAUS Tokyo Forum 2015Microsoft - SEO - TAUS Tokyo Forum 2015
Microsoft - SEO - TAUS Tokyo Forum 2015
 
Good Applications of Bad Machine Translation
Good Applications of Bad Machine TranslationGood Applications of Bad Machine Translation
Good Applications of Bad Machine Translation
 
Human vs machine translation
Human  vs machine translationHuman  vs machine translation
Human vs machine translation
 
Indianapolis - Wikipedia and the Cultural Sector
Indianapolis - Wikipedia and the Cultural SectorIndianapolis - Wikipedia and the Cultural Sector
Indianapolis - Wikipedia and the Cultural Sector
 
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic AmbiguityEffective Approach for Disambiguating Chinese Polyphonic Ambiguity
Effective Approach for Disambiguating Chinese Polyphonic Ambiguity
 
Semantic Text Processing Powered by Wikipedia
Semantic Text Processing Powered by WikipediaSemantic Text Processing Powered by Wikipedia
Semantic Text Processing Powered by Wikipedia
 
Machine translation vs human translation
Machine translation vs human translationMachine translation vs human translation
Machine translation vs human translation
 
Natural Language Generation: New Automation and Personalization Opportunities
Natural Language Generation: New Automation and Personalization OpportunitiesNatural Language Generation: New Automation and Personalization Opportunities
Natural Language Generation: New Automation and Personalization Opportunities
 
Online Character Recognition
Online Character RecognitionOnline Character Recognition
Online Character Recognition
 

Similar to What is machine translation

Lexcelera MT Breaking Compromises
Lexcelera MT Breaking CompromisesLexcelera MT Breaking Compromises
Lexcelera MT Breaking CompromisesLoriThicke
 
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Sheeyam Shellvacumar
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationChamani Shiranthika
 
Topic 4: The Magician's Hat: Turning Data into Business Intelligence (3)
Topic 4: The Magician's Hat: Turning Data into Business Intelligence (3)Topic 4: The Magician's Hat: Turning Data into Business Intelligence (3)
Topic 4: The Magician's Hat: Turning Data into Business Intelligence (3)TAUS - The Language Data Network
 
mt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPTmt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPTRamdan43
 
EAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTEAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTkantanmt
 
Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Jaya Mathew
 
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...Welocalize
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
 
What machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happyWhat machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happyIconic Translation Machines
 
PCSG_Computer_Science_Unit_1_Lecture_2.pptx
PCSG_Computer_Science_Unit_1_Lecture_2.pptxPCSG_Computer_Science_Unit_1_Lecture_2.pptx
PCSG_Computer_Science_Unit_1_Lecture_2.pptxAliyahAli19
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translationkhyati gupta
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translationkhyati gupta
 
Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones RIILP
 

Similar to What is machine translation (20)

Build your own statistical engines
Build your own statistical enginesBuild your own statistical engines
Build your own statistical engines
 
Lexcelera MT Breaking Compromises
Lexcelera MT Breaking CompromisesLexcelera MT Breaking Compromises
Lexcelera MT Breaking Compromises
 
Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.Real-time DirectTranslation System for Sinhala and Tamil Languages.
Real-time DirectTranslation System for Sinhala and Tamil Languages.
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
Topic 4: The Magician's Hat: Turning Data into Business Intelligence (3)
Topic 4: The Magician's Hat: Turning Data into Business Intelligence (3)Topic 4: The Magician's Hat: Turning Data into Business Intelligence (3)
Topic 4: The Magician's Hat: Turning Data into Business Intelligence (3)
 
mt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPTmt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPT
 
EAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTEAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMT
 
Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...
 
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization W...
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
What machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happyWhat machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happy
 
Translationusing moses1
Translationusing moses1Translationusing moses1
Translationusing moses1
 
PCSG_Computer_Science_Unit_1_Lecture_2.pptx
PCSG_Computer_Science_Unit_1_Lecture_2.pptxPCSG_Computer_Science_Unit_1_Lecture_2.pptx
PCSG_Computer_Science_Unit_1_Lecture_2.pptx
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translation
 
project present
project presentproject present
project present
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translation
 
Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones
 
Translation Memory
Translation MemoryTranslation Memory
Translation Memory
 
SDL Trados Studio 2017, Jocelyn He (SDL)
SDL Trados Studio 2017, Jocelyn He (SDL)SDL Trados Studio 2017, Jocelyn He (SDL)
SDL Trados Studio 2017, Jocelyn He (SDL)
 
The Impact of Corpora Qulality on Neural Machine Translation
The Impact of Corpora Qulality on Neural Machine TranslationThe Impact of Corpora Qulality on Neural Machine Translation
The Impact of Corpora Qulality on Neural Machine Translation
 

What is machine translation

  • 1. What is Machine Translation? GT 09/08/2013
  • 2. Topics covered
      • Three types of Machine Translation
      • What can be translated?
      • Common MT systems
      • Which systems do our clients use?
      • Which system do we use?
  • 3. Three Types of Machine Translation
      • Statistical Machine Translation (SMT)
      • Rule-Based Machine Translation (RBMT)
      • Hybrid Machine Translation
          – Rules post-processed by statistics
          – Statistics guided by rules
  • 4. Statistical Machine Translation (SMT)
      • Developed by IBM in the early 1990s.
      • It is called “Statistical” because it is based on probability.
      • Two- or three-step process:
          1. Training
          2. Decoding (= machine translation)
          3. [Recommended] Re-training (= improving the engine once the files have been post-edited)
      • Training is the critical step of Machine Translation and takes much longer than the machine translation process itself.
  • 5. SMT – Training Process
      1. Start by creating a Training Corpus
          – Can be one or several translation memories in TMX format
          – Can be a collection of source and target texts that will need to be aligned
      2. Clean the corpus (automatic or semi-automatic process)
          – Remove duplicates (keeping the most recent entry), identical source-target segments, tags => Result is clean, text-only sentences
          – Can involve manual cleansing depending on the level of “noise” found
      3. Build a language model from the corpus (automatic process)
          – Built for the target language only
          – Contains n-grams (groups of n words)
          – Used to find the smoothest translation = High probability of using the correct n-gram based on its frequency in the corpus => Fluency.
      4. Build a translation model from the corpus (automatic process)
          – Bilingual model
          – Contains n-grams
          – Used to find the best translation match = High probability that a target n-gram is the translation of a source n-gram => Accuracy.
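The cleaning step (step 2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` record, its `changed` date field and the crude tag pattern are hypothetical and not taken from any particular TM tool, and real TMX inline tags would need a proper parser.

```python
import re
from dataclasses import dataclass

# Crude inline-tag pattern (assumption): anything between < and >.
TAG_RE = re.compile(r"<[^>]+>")


@dataclass(frozen=True)
class Entry:
    """Hypothetical translation-memory entry."""
    source: str
    target: str
    changed: str  # ISO date string such as "2013-08-09"


def strip_tags(text: str) -> str:
    """Remove inline tags and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub("", text).split())


def clean_corpus(entries):
    """Return text-only (source, target) pairs without duplicates.

    - tags are stripped from both sides
    - segments whose source and target are identical are dropped
    - for duplicate sources, the most recent entry wins
    """
    latest = {}
    for entry in entries:
        src, tgt = strip_tags(entry.source), strip_tags(entry.target)
        if not src or not tgt or src == tgt:
            continue
        # ISO dates compare correctly as plain strings.
        if src not in latest or entry.changed > latest[src][1]:
            latest[src] = (tgt, entry.changed)
    return [(src, tgt) for src, (tgt, _) in latest.items()]


corpus = [
    Entry("Click <b>OK</b>.", "Haga clic en <b>Aceptar</b>.", "2012-01-01"),
    Entry("Click OK.", "Pulse Aceptar.", "2013-01-01"),
    Entry("SDL Trados", "SDL Trados", "2013-01-01"),
]
cleaned = clean_corpus(corpus)
```

Here the two "Click OK." entries collapse into the more recent one and the untranslated product name is dropped. Anything this automatic pass cannot judge (misaligned segments, inconsistent terminology) still needs the manual cleansing mentioned above.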
  • 6. SMT - Decoding Process
      • What people understand Machine Translation to be
      • A file is processed sentence by sentence
      • Each sentence is broken into n-grams
      • The n-grams are translated based on the highest probability scores in the phrase model and in the language model
      • The phrase is re-constructed based on the best n-grams
      • The file is re-constructed from all the translated phrases
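As a rough illustration of the decoding steps, the sketch below splits a sentence into known n-grams and picks the highest-probability target for each. It is a deliberately simplified stand-in, not how Moses or any production decoder works: it segments greedily from left to right, never reorders, and ignores the language model. The phrase table and its probabilities are invented for the example.

```python
# Hypothetical phrase table: source n-gram -> [(target n-gram, probability)].
PHRASE_TABLE = {
    ("maria",): [("Mary", 0.9)],
    ("no", "daba"): [("did not give", 0.6), ("not give", 0.2)],
    ("una", "bofetada"): [("a slap", 0.7), ("slap", 0.2)],
    ("a", "la"): [("to the", 0.8), ("by the", 0.1)],
    ("bruja", "verde"): [("green witch", 0.9), ("witch green", 0.05)],
    ("bruja",): [("witch", 0.9)],
    ("verde",): [("green", 0.9)],
}


def decode(sentence, phrase_table, max_len=3):
    """Greedy, monotone decoding: longest known n-gram first."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest n-gram starting at position i, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in phrase_table:
                best_target, _ = max(phrase_table[key], key=lambda opt: opt[1])
                output.append(best_target)
                i += n
                break
        else:
            # No entry at all: leave the word untranslated.
            output.append(words[i])
            i += 1
    return " ".join(output)


translation = decode("Maria no daba una bofetada a la bruja verde", PHRASE_TABLE)
```

A real decoder explores many competing segmentations and word orders and scores each hypothesis with both the translation model and the language model; the greedy shortcut here only shows the break-translate-reassemble flow.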
  • 7. SMT Example (ES-EN)
      Source: Maria no daba una bofetada a la bruja verde
      Word-for-word options: Mary not give a slap to the witch green
      Other n-gram options (flattened from the slide's diagram): did not a slap by green witch no slap to the did not give to the slap the witch
      • The translation model tells us which translation is more likely given the source words.
      • The language model tells us which translation is the best linguistically.
      Possible good translations:
      • Mary did not give a slap to the green witch.
      • Mary did not slap the green witch.
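To make the language model's role in this example concrete, here is a minimal sketch of a bigram model with add-one smoothing that ranks three candidate outputs by fluency. The three-sentence target-language corpus is invented for illustration; a real language model is built from the full cleaned training corpus and usually uses longer n-grams and better smoothing.

```python
import math
from collections import Counter

# Invented toy target-language corpus (assumption, for illustration only).
TOY_CORPUS = [
    "mary did not slap the green witch",
    "she did not give a slap to the green witch",
    "the green witch is here",
]


def train_bigram_lm(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Sum of add-one-smoothed bigram log-probabilities."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


unigrams, bigrams = train_bigram_lm(TOY_CORPUS)
candidates = [
    "mary not give a slap to the witch green",
    "mary did not give a slap to the green witch",
    "mary did not slap the green witch",
]
ranked = sorted(candidates, key=lambda c: log_prob(c, unigrams, bigrams), reverse=True)
```

On this toy corpus the word-for-word rendering ranks last because several of its bigrams ("mary not", "witch green") never occur. Note that a raw sum of log-probabilities also favours shorter sentences, which is one reason decoders combine this score with the translation model rather than using it alone.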
  • 8. SMT – Re-training Process
      • This is an optional but recommended step.
      • The post-edited files are converted into a new TMX file.
      • The post-editors’ feedback is used to attempt to correct frequently occurring errors => Modify engine settings.
      • The engine is re-trained using the previous Training Corpus as well as the new TMX file.
  • 9. SMT - Considerations
      • A large training corpus does not guarantee good quality MT output.
      • A clean and consistent training corpus must be used in order to achieve good quality MT output.
      • It is best to use a domain-based engine even when the client is the same, e.g. create one engine for UI and one for Help/Doc.
      • The quality of the MT output can vary from language to language and even from handoff to handoff.
      • The quality of the source text is important: consistent terminology and sentence structure produce better output.
      • SMT engines can be tuned and improved with feedback.
      • SMT engines can be re-trained and improved by updating the training corpus with newly post-edited content.
  • 10. Rule-Based Machine Translation (RBMT)
      • Based on:
          – Terminology
              • Bilingual or multilingual dictionary needed
              • Mono-lingual normalisation dictionary needed in order to standardise or correct source text before translation or to correct target text after translation
          – Rules representing the source sentence structure
          – Rules representing the target sentence structure
          – Rules on how the source structure and the target structure relate to each other
      • Steps:
          1. Obtain part-of-speech information for each source word (article, noun, verb etc).
          2. Obtain syntactic information about the verb (tense, person, voice).
          3. Parse the source sentence in order to identify the structure (subject, verb, object etc).
          4. Translate source words into target words.
          5. Create translated sentence by mapping dictionary entries into appropriate inflected forms based on target rules.
          6. [Optional but recommended] Once the post-editing is complete, update the dictionaries and/or rules based on the post-editors’ feedback.
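A toy version of the RBMT steps can be sketched as a dictionary lookup followed by one structural transfer rule. Everything here is hypothetical: the five-word lexicon, the part-of-speech labels and the single rule (Spanish noun + adjective becomes English adjective + noun) stand in for the large dictionaries and rule sets a real engine needs, and verb analysis and inflection (steps 2 and 5) are left out entirely.

```python
# Hypothetical mini bilingual dictionary: source word -> (part of speech, target word).
LEXICON = {
    "la": ("DET", "the"),
    "bruja": ("NOUN", "witch"),
    "verde": ("ADJ", "green"),
    "casa": ("NOUN", "house"),
    "blanca": ("ADJ", "white"),
}


def translate(phrase):
    """Tag each word, translate it, then apply the NOUN ADJ -> ADJ NOUN rule."""
    # Part-of-speech lookup and word translation; unknown words pass through.
    tagged = [LEXICON.get(word, ("UNK", word)) for word in phrase.lower().split()]

    output = []
    i = 0
    while i < len(tagged):
        is_noun_adj = (
            i + 1 < len(tagged)
            and tagged[i][0] == "NOUN"
            and tagged[i + 1][0] == "ADJ"
        )
        if is_noun_adj:
            # Transfer rule: the adjective precedes the noun in the target.
            output.extend([tagged[i + 1][1], tagged[i][1]])
            i += 2
        else:
            output.append(tagged[i][1])
            i += 1
    return " ".join(output)


result = translate("la bruja verde")
```

Because "azul" is missing from the lexicon, `translate("la bruja azul")` leaves it untranslated and unmoved, which hints at why dictionary coverage matters so much for this approach.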
  • 11. RBMT - Considerations
      • Need very good dictionaries => Building new dictionaries is expensive because it needs to be done by a skilled linguist for each language.
      • The output may be accurate and grammatically correct, but not always very fluent.
      • RBMT engines are more expensive than SMT engines because a great deal of effort is required in terms of development and customisation before the engine produces the desired quality.
      • SMT engines can be re-trained automatically, whereas RBMT engines can only be updated through human intervention (update dictionaries and rules).
  • 12. Hybrid Machine Translation
      Two types:
      • Rules post-processed by statistics
          – Translations are performed using a rules-based engine.
          – Statistics are then used in an attempt to adjust/correct the output from the rules engine.
      • Statistics guided by rules
          – Rules are used to pre-process data in an attempt to better guide the statistical engine.
          – Rules are also used to post-process the statistical output to perform functions such as normalization.
          – This approach has a lot more power, flexibility and control when translating.
  • 13. What can be machine-translated?
  • 14. Three File Types
      • Mono-lingual files (e.g. DOCX, HTML, TXT)
          – Engines can translate mono-lingual files but this results in a mono-lingual translation => Very difficult to post-edit without reference to the source.
      • Translation memories in TMX format
          – The MT output is inserted into the target area of the translation unit.
          – The source files for translation are processed in a CAT tool against the MT TM, but penalties are applied to translation hits originating from the MT TM to indicate that the translation needs to be post-edited.
      • Bilingual files
          – The best option is to machine-translate XLIFF files. These are bilingual files that can be imported into all modern CAT tools => Post-editing can be supported by the use of a standard TM.
          – Machine-translated segments are flagged with a specific status in the CAT tool.
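For the TMX option, the sketch below shows what "inserting the MT output into the target area of the translation unit" might look like when the file is built with Python's standard `xml.etree.ElementTree`. It is a minimal illustration, not a validated TMX writer: the header values (including the tool name `mt-sketch`) are placeholders, and the language codes and sample segment are assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Build a TMX document from (source, MT output) pairs and return it as a string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "mt-sketch",       # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, mt_output in pairs:
        # One translation unit per segment: source variant, then MT target variant.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, mt_output)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


tmx_text = build_tmx([("Click OK.", "Haga clic en Aceptar.")])
```

The resulting TM would then be attached in the CAT tool with a match penalty so that hits coming from it are clearly marked as needing post-editing; how that penalty is configured depends on the tool.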
  • 15. Which content?
      • Technical, structured content fares better than creative, free-flowing content
          – MT well suited to help systems, user guides, FAQs, Knowledge Base articles
      • UI strings not necessarily well suited to MT
          – UI strings can be difficult to interpret in standard localisation projects (omitted words for conciseness, variables, verb or noun?) => If UI strings are difficult for a human to interpret, it will be even harder for the engine
          – Short strings are not necessarily easier for the MT engine to decode than longer strings
      • Do not expect the engine to be creative
          – If words are not present in the Training Corpus or in the Dictionaries, the engine will not be able to come up with a translation for them => Depending on the engine, unknown words will be omitted, or left untranslated in the MT output
      • What level of MT output do you require?
          – Do you need to bring the MT output to human-quality level?
          – Do you simply need to be able to understand what is being said (e.g. social network sites, support chat lines)?
  • 16. Common MT Applications (1/3)
      SMT
      • Google Translate: 71 languages
          – Often translates into an intermediate language and into English first to arrive at the real target language, e.g. Catalan (ca ↔ es ↔ en ↔ other)
          – CAREFUL ABOUT NDA!
      • Microsoft Translator: 39 languages
          – Bing Translator online
          – Free API up to 2 million characters per month
          – Offer Enterprise solutions
          – CAREFUL ABOUT NDA!
      • SDL Language Weaver (SDL BeGlobal): 54 language pairs
          – Free of charge to individual translators through Trados Studio 2011, but the engine is not specific to their client or to their domain => CAREFUL ABOUT NDA!
          – Subscription for Enterprises and LSPs
          – Enterprises and LSPs may train their own engines via SDL BeGlobal Trainer (secure)
          – Make MT part of the translation workflow via WorldServer
          – Make MT suggestions available through the cloud via Trados Studio 2011
  • 17. Common MT Applications (2/3)
      SMT
      • Language Studio (by Asia Online): over 500 direct language pairs
          – Offer on-site server installation => Licenses based on language pairs and translation volume capacity.
          – Offer Software as a Service (SaaS) => Pay as you go with 3 options (volume, fixed monthly fee, file size).
          – Offer 4 levels of MT quality, all with varying degrees of customisation (and price)
          – Customisation is carried out by Asia Online
      • Moses: Open Source (free)
          – No language limitations
          – Highly customisable on all levels (training and decoding) => Companies use Moses but tailor it to their needs
          – Possible to turn it into a Hybrid system with the application of language-specific rules
  • 18. Common MT Applications (3/3)
      RBMT
      • PROMT: 12 language pairs (no Asian support)
          – Provide a free online translator tool (Online-translator.com)
          – PROMT Professional (for translators) costs $265
          – Offer Enterprise solution (part of translation workflow)
      • Apertium: Open Source (free)
          – 36 language pairs
          – No Asian character support
      Hybrid
      • SYSTRAN
          – Have been around for 40+ years
          – Started out as an RBMT system and has now been updated with the use of statistics
          – 52 language pairs
          – SYSTRAN Premium Translator version lets you fully manage the dictionary (~ £700)
          – SYSTRAN Enterprise Server 7 available in three editions depending on company needs
          – Systran say it is the fastest MT solution available
  • 19. Timeline
      2010
          – Asia Online launches Language Studio, a comprehensive MT and post-editing solution.
          – Systran launches its enhanced Enterprise 7 MT software.
          – Language Weaver launches its ‘quality confidence’ module. The company is acquired by SDL.
      2009
          – Systran releases version 7, a hybrid version of its original RBMT. Includes an automated post-editing module.
      2007
          – MOSES is launched as a downloadable kit. It begins to be used in a large scale EU project (Euromatrix) to speed up the MT development of new language pairs.
      2004
          – The OpenTrad project funded by the Spanish government begins to develop MT engines for Spain’s various languages. Using an existing RBMT engine, the consortium builds Apertium.
      2002
          – Language Weaver is founded in California to develop SMT systems.
      2001
          – IBM launches its WebSphere translation engine for 8 languages.
          – The National Institute of Standards and Technology (NIST) launches its first round of MT system benchmarking.
      1997
          – The AltaVista Babelfish service is launched on the web using Systran.
  • 20. Which MT systems do our clients use?
      SMT
      • Adobe: Moses
          – Carried out initial tests in 2009 using PROMT for Russian and Language Weaver for French and Spanish
      • Autodesk: Moses
      • HP: Language Weaver
          – Also have access to Microsoft Translator
      • Oracle: Moses
          – Switched from Language Weaver in 2012
      • Sybase: Moses
          – Trained by Pangeanic in Spain
      RBMT
      • PTC: PROMT
      Hybrid
      • Symantec: SYSTRAN
  • 21. Which system do we use?
      • Moses hybrid (Statistics guided by rules)
      To be continued…