2. Topics covered
• Three types of Machine Translation
• What can be translated?
• Common MT systems
• Which systems do our clients use?
• Which system do we use?
3. Three Types of Machine Translation
• Statistical Machine Translation (SMT)
• Rule-Based Machine Translation (RBMT)
• Hybrid Machine Translation
– Rules post-processed by statistics
– Statistics guided by rules
4. Statistical Machine Translation (SMT)
• Developed by IBM in the early 1990s.
• It is called “Statistical” because it is based on probability.
• Two- or three-step process:
1. Training
2. Decoding (= machine translation)
3. [Recommended] Re-training (= Improving the engine once the files have
been post-edited)
• Training is the critical step of Machine Translation and takes much
longer than the machine translation process itself.
5. SMT – Training Process
1. Start by creating a Training Corpus
– Can be one or several translation memories in TMX format
– Can be a collection of source and target texts that will need to be aligned
2. Clean the corpus (automatic or semi-automatic process)
– Remove duplicates (keeping the most recent entry), identical source-target
segments and tags => Result is clean, text-only sentences
– Can involve manual cleansing depending on the level of “noise” found
3. Build a language model from the corpus (automatic process)
– Built for the target language only
– Contains n-grams (groups of n consecutive words)
– Used to find the smoothest translation = High probability of using the correct
n-gram based on its frequency in the corpus => Fluency.
4. Build a translation model from the corpus (automatic process)
– Bilingual model
– Contains n-grams
– Used to find the best translation match = High probability that a target n-gram is the
translation of a source n-gram => Accuracy.
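The cleaning and language-model steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the corpus is a list of (source, target) pairs ordered oldest first, tags look like `<...>`, and the language model is a plain bigram model without smoothing. The function names and sample segments are hypothetical and not taken from any particular MT toolkit.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")  # assumption: inline tags look like <b>, </b>, <ph/>


def clean_corpus(units):
    """Strip tags, drop identical source-target segments, keep the newest duplicate.

    `units` is a list of (source, target) pairs ordered oldest first.
    """
    latest = {}
    for source, target in units:
        source = TAG_RE.sub("", source).strip()
        target = TAG_RE.sub("", target).strip()
        if not source or not target or source == target:
            continue  # empty or untranslated segment: no training value
        latest[source] = target  # a later entry overwrites an earlier duplicate
    return list(latest.items())


def bigram_model(sentences):
    """Estimate P(word | previous word) from target-language sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return {pair: count / contexts[pair[0]] for pair, count in bigrams.items()}


raw = [
    ("<b>Guardar</b> archivo", "Save file"),
    ("Guardar archivo", "<b>Save</b> the file"),  # newer duplicate of the first unit
    ("OK", "OK"),                                  # identical source and target
    ("Abrir archivo", "Open the file"),
]
corpus = clean_corpus(raw)
lm = bigram_model(target for _, target in corpus)
print(corpus)
print(lm[("the", "file")])
```

A real engine would use longer n-grams and smoothing so that unseen word sequences do not receive a probability of zero.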
6. SMT - Decoding Process
• What people understand Machine Translation to be
• A file is processed sentence by sentence
• Each sentence is broken into n-grams
• The n-grams are translated based on the highest probability scores
in the translation model and in the language model
• The sentence is re-constructed based on the best n-grams
• The file is re-constructed from all the translated sentences
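The decoding steps above can be illustrated with a toy decoder. This is a minimal sketch, not how a production engine works: the phrase table, the bigram probabilities and the floor value for unseen bigrams are all invented for the example, and the search only tries segmentations in source word order (no reordering).

```python
# Hypothetical translation model: P(target phrase | source n-gram)
PHRASE_TABLE = {
    ("la",): {"the": 0.9},
    ("bruja",): {"witch": 0.8},
    ("verde",): {"green": 0.9},
    ("bruja", "verde"): {"green witch": 0.7, "witch green": 0.1},
}

# Hypothetical target language model: P(word | previous word)
BIGRAM_PROB = {
    ("<s>", "the"): 0.5,
    ("the", "green"): 0.2,
    ("the", "witch"): 0.2,
    ("green", "witch"): 0.5,
    ("witch", "</s>"): 0.4,
}
UNSEEN = 0.01  # assumed floor for bigrams missing from the model


def candidates(words):
    """Yield (target text, translation probability) for each way of
    splitting `words` into n-grams found in the phrase table."""
    if not words:
        yield "", 1.0
        return
    for n in range(1, len(words) + 1):
        for target, p in PHRASE_TABLE.get(tuple(words[:n]), {}).items():
            for rest, p_rest in candidates(words[n:]):
                yield (target + " " + rest).strip(), p * p_rest


def lm_score(text):
    """Fluency score: product of bigram probabilities over the sentence."""
    tokens = ["<s>"] + text.split() + ["</s>"]
    score = 1.0
    for pair in zip(tokens, tokens[1:]):
        score *= BIGRAM_PROB.get(pair, UNSEEN)
    return score


def decode(sentence):
    """Return the candidate with the best combined score.
    Raises ValueError if a word is not covered by the phrase table."""
    words = sentence.lower().split()
    best = max(candidates(words), key=lambda c: c[1] * lm_score(c[0]))
    return best[0]


print(decode("la bruja verde"))
```

With these made-up numbers the word-by-word option "the witch green" has the highest translation-model score, but the language model pushes the decoder towards "the green witch".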
7. SMT Example (ES-EN)
Source: Maria no daba una bofetada a la bruja verde
Word-by-word candidates: Mary / not / give / a / slap / to / the / witch / green
Other candidate phrases: did not, no, did not give, a slap, slap, by, to, to the, the,
the witch, green witch
• The translation model tells us which translation is most likely given the source words.
• The language model tells us which translation is the best linguistically.
Possible good translations:
• Mary did not give a slap to the green witch.
• Mary did not slap the green witch.
8. SMT – Re-training Process
• This is an optional but recommended step.
• The post-edited files are converted into a new TMX file.
• The post-editors’ feedback is used to attempt to correct frequently
occurring errors => Modify engine settings.
• The engine is re-trained using the previous Training Corpus as well
as the new TMX file.
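As a rough illustration of the re-training input, the sketch below reads translation units from a TMX string and merges them with the previous corpus so that post-edited translations replace older ones for the same source. The function names and sample data are hypothetical, and the TMX is cut down to the bare minimum (a real TMX header carries required attributes that are omitted here).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(tmx_text, source_lang, target_lang):
    """Return (source, target) pairs for the requested language pair."""
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if source_lang in segs and target_lang in segs:
            pairs.append((segs[source_lang], segs[target_lang]))
    return pairs


def retraining_corpus(previous_corpus, post_edited):
    """Merge both corpora; post-edited units override older translations."""
    merged = dict(previous_corpus)
    merged.update(post_edited)
    return list(merged.items())


POST_EDITED_TMX = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="es"><seg>Guardar archivo</seg></tuv>
  <tuv xml:lang="en"><seg>Save the file</seg></tuv>
</tu>
</body></tmx>"""

previous = [("Guardar archivo", "Save file"), ("Abrir archivo", "Open the file")]
merged = retraining_corpus(previous, read_tmx(POST_EDITED_TMX, "es", "en"))
print(merged)
```

The merged list would then go through the same cleaning and training steps as the original corpus.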
9. SMT - Considerations
• A large training corpus does not guarantee good quality MT output.
• A clean and consistent training corpus must be used in order to achieve
good quality MT output.
• It is best to use a domain-based engine even when the client is the same,
e.g. create one engine for UI and one for Help/Doc.
• The quality of the MT output can vary from language to language and
even from handoff to handoff.
• The quality of the source text is important - Consistent terminology and
sentence structure produce better output.
• SMT engines can be tuned and improved with feedback.
• SMT engines can be re-trained and improved by updating the training
corpus with newly post-edited content.
10. Rule-Based Machine Translation (RBMT)
• Based on:
– Terminology
• Bilingual or multilingual dictionary needed
• Mono-lingual normalisation dictionary needed in order to standardise or correct
source text before translation or to correct target text after translation
– Rules representing the source sentence structure
– Rules representing the target sentence structure
– Rules on how the source structure and the target structure relate to each other
• Steps:
1. Obtain part-of-speech information for each source word (article, noun, verb etc.).
2. Obtain syntactic information about the verb (tense, person, voice).
3. Parse the source sentence in order to identify the structure (subject, verb, object etc.).
4. Translate source words into target words.
5. Create translated sentence by mapping dictionary entries into appropriate inflected
forms based on target rules.
6. [Optional but recommended] Once the post-editing is complete, update the
dictionaries and/or rules based on the post-editors’ feedback.
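The steps above can be sketched with a toy Spanish-to-English example. This is a heavily simplified illustration, assuming a three-word hypothetical dictionary and a single structural rule (a Spanish noun followed by an adjective becomes adjective plus noun in English). It does not handle verbs or inflection, which a real RBMT engine must.

```python
# Hypothetical bilingual dictionary: source word -> (part of speech, target word)
LEXICON = {
    "la": ("DET", "the"),
    "bruja": ("NOUN", "witch"),
    "verde": ("ADJ", "green"),
}


def tag(sentence):
    """Look up the part of speech and translation of each source word.
    Raises KeyError for words missing from the dictionary."""
    return [(word, *LEXICON[word.lower()]) for word in sentence.split()]


def transfer(tagged):
    """Structural rule: source NOUN + ADJ becomes target ADJ + NOUN."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "NOUN" and out[i + 1][1] == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just reordered
        else:
            i += 1
    return out


def translate(sentence):
    """Tag, reorder, then emit the target words from the dictionary."""
    return " ".join(target for _, _, target in transfer(tag(sentence)))


print(translate("la bruja verde"))
```

Updating the engine after post-editing (step 6) would mean editing `LEXICON` or the rule in `transfer` by hand, which is why RBMT maintenance depends on human intervention.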
11. RBMT - Considerations
• Need very good dictionaries => Building new dictionaries is expensive
because it needs to be done by a skilled linguist for each language.
• The output may be accurate and grammatically correct, but not always
very fluent.
• RBMT engines are more expensive than SMT engines because a great
deal of effort is required in terms of development and customisation
before the engine produces the desired quality.
• SMT engines can be re-trained automatically, whereas RBMT engines can
only be updated through human intervention (update dictionaries and
rules).
12. Hybrid Machine Translation
Two types:
• Rules post-processed by statistics
– Translations are performed using a rules-based engine.
– Statistics are then used in an attempt to adjust/correct the output from the
rules engine.
• Statistics guided by rules
– Rules are used to pre-process data in an attempt to better guide the
statistical engine.
– Rules are also used to post-process the statistical output to perform
functions such as normalization.
– This approach has a lot more power, flexibility and control when
translating.
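A "statistics guided by rules" pipeline might look roughly like the sketch below. Everything here is an assumption made for illustration: the pre-processing rule masks placeholders such as {0} so the statistical engine cannot alter them, the post-processing rule restores them and capitalises the first letter, and `statistical_engine` is a stand-in lookup rather than a real decoder.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\d+\}")
MASK = "__VAR__"  # hypothetical token the engine is assumed to pass through unchanged


def pre_process(text):
    """Rule: mask placeholders before the text reaches the statistical engine."""
    placeholders = PLACEHOLDER_RE.findall(text)
    return PLACEHOLDER_RE.sub(MASK, text), placeholders


def statistical_engine(text):
    """Stand-in for a trained SMT engine (a fixed lookup, for illustration only)."""
    memory = {"Abrir __VAR__": "open __VAR__"}
    return memory.get(text, text)


def post_process(text, placeholders):
    """Rules: restore placeholders in order, then normalise capitalisation."""
    for placeholder in placeholders:
        text = text.replace(MASK, placeholder, 1)
    return text[:1].upper() + text[1:]


def hybrid_translate(text):
    masked, placeholders = pre_process(text)
    return post_process(statistical_engine(masked), placeholders)


print(hybrid_translate("Abrir {0}"))
```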
14. Three File Types
• Mono-lingual files (e.g. DOCX, HTML, TXT)
– Engines can translate mono-lingual files, but this results in a mono-lingual
translation => Very difficult to post-edit without reference to the source.
• Translation memories in TMX format
– The MT output is inserted into the target area of the translation unit.
– The source files for translation are processed in a CAT tool against the MT TM,
but penalties are applied to translation hits originating from the MT TM to indicate
that the translation needs to be post-edited.
• Bilingual files
– The best option is to machine-translate XLIFF files. These are bilingual files that
can be imported into all modern CAT tools => Post-editing can be supported by the
use of a standard TM.
– Machine-translated segments are flagged with a specific status in the CAT tool.
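The penalty mechanism for hits coming from an MT TM can be sketched as follows. The penalty value, the `Match` structure and the origin labels are assumptions for the example. Each CAT tool has its own settings and data model for this.

```python
from dataclasses import dataclass

MT_PENALTY = 15  # hypothetical penalty, in match-percentage points


@dataclass
class Match:
    target: str
    score: int   # match percentage reported for the hit
    origin: str  # "human" for a standard TM, "mt" for the MT TM


def apply_penalty(match):
    """Lower the score of hits that originate from the MT TM."""
    if match.origin == "mt":
        return Match(match.target, max(0, match.score - MT_PENALTY), match.origin)
    return match


def needs_post_editing(match):
    """After the penalty, an MT hit can no longer count as a 100% match."""
    return apply_penalty(match).score < 100


mt_hit = Match("Open the file", 100, "mt")
tm_hit = Match("Open the file", 100, "human")
print(apply_penalty(mt_hit).score, apply_penalty(tm_hit).score)
```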
15. Which content?
• Technical, structured content fares better than creative, free-flowing
content
– MT well suited to help systems, user guides, FAQs, Knowledge Base articles
• UI strings not necessarily well suited to MT
– UI strings can be difficult to interpret in standard localisation projects (omitted words
for conciseness, variables, verb or noun?) => If UI strings are difficult for a human to
interpret, it will be even harder for the engine
– Short strings are not necessarily easier for the MT engine to decode than longer strings
• Do not expect the engine to be creative
– If words are not present in the Training Corpus or in the Dictionaries, the engine will not
be able to come up with a translation for them => Depending on the engine, unknown
words will be omitted, or left untranslated in the MT output
• What level of MT output do you require?
– Do you need to bring the MT output to human-quality level?
– Do you simply need to be able to understand what is being said (e.g. social network
sites, support chat lines)?
16. Common MT Applications (1/3)
SMT
• Google Translate: 71 languages
– Often translates via an intermediate language and via English to reach the
target language, e.g. Catalan (ca ↔ es ↔ en ↔ other)
– CAREFUL ABOUT NDA!
• Microsoft Translator: 39 languages
– Bing Translator online
– Free API up to 2 million characters per month
– Offers Enterprise solutions
– CAREFUL ABOUT NDA!
• SDL Language Weaver (SDL BeGlobal): 54 language pairs
– Free of charge to individual translators through Trados Studio 2011, but the engine is
not specific to their client or to their domain => CAREFUL ABOUT NDA!
– Subscription for Enterprises and LSPs
– Enterprises and LSPs may train their own engines via SDL BeGlobal Trainer (secure)
– Make MT part of the translation workflow via WorldServer
– Make MT suggestions available through the cloud via Trados Studio 2011
17. Common MT Applications (2/3)
SMT
• Language Studio (by Asia Online): over 500 direct language pairs
– Offers on-site server installation => Licenses based on language pairs and translation
volume capacity.
– Offers Software as a Service (SaaS) => Pay as you go with 3 options (volume, fixed
monthly fee, file size).
– Offers 4 levels of MT quality, all with varying degrees of customisation (and price)
– Customisation is carried out by Asia Online
• Moses: Open Source (free)
– No language limitations
– Highly customisable on all levels (training and decoding) => Companies use Moses but
tailor it to their needs
– Possible to turn it into a Hybrid system with the application of language-specific rules
18. Common MT Applications (3/3)
RBMT
• PROMT: 12 language pairs (no Asian support)
– Provides a free online translator tool (Online-translator.com)
– PROMT Professional (for translators) costs $265
– Offers Enterprise solution (part of translation workflow)
• Apertium: Open Source (free)
– 36 language pairs
– No Asian character support
Hybrid
• SYSTRAN: 52 language pairs
– Has been around for 40+ years
– Started out as an RBMT system and has now been updated with the use of statistics
– SYSTRAN Premium Translator version lets you fully manage the dictionary (~ £700)
– SYSTRAN Enterprise Server 7 available in three editions depending on company needs
– SYSTRAN says it is the fastest MT solution available
19. Timeline
2010 - Asia Online launches Language Studio, a comprehensive MT and post-editing solution.
- Systran launches its enhanced Enterprise 7 MT software.
- Language Weaver launches its ‘quality confidence’ module. The company is acquired by SDL.
2009 - Systran releases version 7, a hybrid version of its original RBMT.
Includes an automated post-editing module.
2007 - Moses is launched as a downloadable kit. It begins to be used in a large-scale EU project
(Euromatrix) to speed up the MT development of new language pairs.
2004 - The OpenTrad project funded by the Spanish government begins to develop MT engines
for Spain’s various languages. Using an existing RBMT engine, the consortium builds
Apertium.
2002 - Language Weaver is founded in California to develop SMT systems.
2001 - IBM launches its WebSphere translation engine for 8 languages.
- The National Institute of Standards and Technology (NIST) launches its first round of MT
system benchmarking.
1997 - The AltaVista Babelfish service launched on the web using Systran.
20. Which MT systems do our clients use?
SMT
• Adobe: Moses – Carried out initial tests in 2009 using PROMT for Russian and Language Weaver
for French and Spanish
• Autodesk: Moses
• HP: Language Weaver – Also has access to Microsoft Translator
• Oracle: Moses – Switched from Language Weaver in 2012
• Sybase: Moses – Trained by Pangeanic in Spain
RBMT
• PTC: PROMT
Hybrid
• Symantec: SYSTRAN
21. Which system do we use?
Moses hybrid (Statistics guided by rules)
To be continued…