23/10/2023
Terminology for machine translation: effective approaches to improve the quality of translation output
As good as the results of machine translation are, they are still far from perfect. Above all, it’s the rendering of specialised terminology that plays a decisive role in the quality of the output. This is because specialised terms are often translated incorrectly, inconsistently or at least in a way that deviates from a company’s corporate language. So we explain when and how terminology can be integrated into machine translation and what needs to be taken into account in the process.
MT systems generally translate texts and the terminology they contain “as learnt”, i.e. as learnt from the training material. This is because neural machine translation works on the basis of statistical probability, analysing how often a foreign-language equivalent for a word occurs in the training corpus and what other words occur in the same sentence.
As far as specialised terms are concerned, this means that standardised terms for which there are few or no common synonyms are usually rendered correctly and consistently by MT systems. This is because standardisation means consistent use across many sources. On the other hand, terms that occur rarely or very inconsistently in the training material will be translated freely, incorrectly or differently from one instance to the next. This is because a proliferation of different terms, either within a company or across many companies or even industries, ultimately manifests in the form of variants in MT output. Depending on the standardisation, subject area, text content and company specifications, the translation offered by an MT system may therefore run counter to expectations.
In a recent article on domain-specific MT systems, we showed that there is no significant difference between the use of generic and domain-specific MT engines when it comes to the rendering of specialised terminology. Specialised terms were translated inconsistently in both types of system. However, it’s precisely this inconsistency that many companies have been battling against for years or decades by undertaking systematic terminology work and managing their terminology in the form of databases or Excel files. And since specialised vocabulary plays such an important role in the quality of the translation, when using machine translation, we have to ask: how can we get the desired terminology into the MT output?
There are three approaches to integration and implementation: before, during and after machine translation. In the discussion below, we work through the options, starting with the last one, and take a look at the details.
After machine translation: correcting terminology during post-editing
Even though correcting terminology is one of the most time-consuming aspects of post-editing machine translation output, the integration of specified terminology in the form of glossaries or databases is a decisive factor in text quality. This is because where MT systems do not translate specialised terms or fail to translate them consistently, post-editors need the clearest possible specifications in order to correct the machine output.
In practice, this means that all important specialised terms must be defined, preferably in a database format, and integrated into the post-editor’s working environment. Integrating this information into CAT tools means that source-language terms are recognised and the target-language equivalents are suggested directly. This allows source and target terms to be clearly linked and prevents the disorganised proliferation of terminology variants in the MT output.
But even terms that have already been translated uniformly by an engine can contradict a company’s corporate language. In one of our practical tests, a German text from the automotive sector contained the term “Spurhalteassistent”. The term is often not found in dictionaries and if you look at the websites of various manufacturers you will find many variations in English, from “lane keeping assist” to “lane guard system”. You can therefore expect an equivalent wealth of variants in machine-translated text.
However, even if an MT system consistently selects one of the possible translations, it may still contradict the company’s own terminology and must then be replaced throughout the text during post-editing. This means that both the post-editors and the quality of the specified terminology play a decisive role in the correct rendering of terminology after machine translation.
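The kind of check that a post-editing environment runs against a termbase can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular CAT tool works: the termbase structure, the function names and the choice of “lane keeping assist” as the preferred term are all assumptions made for the example.

```python
import re

# Hypothetical termbase entry. Which English variant is "preferred" is an
# assumption for illustration; a real company would define this itself.
TERMBASE = {
    "Spurhalteassistent": {
        "preferred": "lane keeping assist",
        "variants": ["lane guard system"],
    },
}

def contains_term(text, term):
    """Case-insensitive whole-word match. Inflected forms are not handled."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def find_term_issues(source, target, termbase=TERMBASE):
    """Return (source_term, preferred_term, variants_found) for each
    termbase entry that occurs in the source but whose preferred
    equivalent is missing from the target."""
    issues = []
    for src_term, entry in termbase.items():
        if not contains_term(source, src_term):
            continue  # term not used in this segment
        if contains_term(target, entry["preferred"]):
            continue  # MT output already follows the termbase
        found = [v for v in entry["variants"] if contains_term(target, v)]
        issues.append((src_term, entry["preferred"], found))
    return issues

issues = find_term_issues(
    "Der Spurhalteassistent warnt den Fahrer.",
    "The lane guard system warns the driver.",
)
print(issues)
```

A segment flagged in this way still has to be corrected by the post-editor; the check only points to where the machine output deviates from the specified terminology.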
During machine translation: integrating terminology by means of a glossary
Many providers of MT systems allow you to integrate specified terminology in the form of glossaries, which are made available to the engine during translation. The advantage of the glossary function clearly lies in its dynamic nature. If specified terms change or new ones are added, the glossary can be extended, and it can be trimmed again if terms become superfluous.
Providers differ in terms of the supported language pairs and the file formats in which the glossaries need to be provided. While some providers only support list formats such as .csv, others can work with the terminology database standard .tbx.
However, although the entire database could theoretically be quickly exported and used as a glossary, this makes little practical sense for at least two reasons.
Firstly, MT systems usually cannot work with additional information such as usage, i.e. information on whether a term is preferred, permitted or prohibited. Either the information is ignored and forbidden terms are interpreted as valid translations, or the system completely rejects entries that appear ambiguous. Ideally, the MT system should be provided with positive one-to-one terminology – i.e. only unambiguous equivalents and only preferred terms for the target language.
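Reducing a termbase export to positive one-to-one terminology could look roughly like the sketch below. The entry fields (`source`, `target`, `status`), the status values and the two-column CSV layout are assumptions for illustration; real termbase exports and the glossary formats expected by MT providers differ and need to be checked individually.

```python
import csv
import io
from collections import defaultdict

def build_glossary(entries):
    """Keep only preferred target terms, and only source terms that end up
    with exactly one preferred equivalent."""
    preferred = defaultdict(set)
    for entry in entries:
        if entry["status"] == "preferred":
            preferred[entry["source"]].add(entry["target"])
    return {
        src: next(iter(targets))
        for src, targets in preferred.items()
        if len(targets) == 1  # drop ambiguous one-to-many entries
    }

def to_csv(glossary):
    """Serialise the glossary as a simple two-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for src, tgt in sorted(glossary.items()):
        writer.writerow([src, tgt])
    return buffer.getvalue()

# Hypothetical termbase export with usage information.
entries = [
    {"source": "Spurhalteassistent", "target": "lane keeping assist", "status": "preferred"},
    {"source": "Spurhalteassistent", "target": "lane guard system", "status": "prohibited"},
    {"source": "Leitung", "target": "cable", "status": "preferred"},
    {"source": "Leitung", "target": "pipe", "status": "preferred"},
]

print(to_csv(build_glossary(entries)))
```

In this example the prohibited variant is discarded and “Leitung” is left out entirely, because two preferred equivalents would give the engine no unambiguous instruction.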
Secondly, when creating a glossary, the number of defined terms must also be carefully considered. A large number of terms may not affect the turnaround time of the machine translation, but it does have a significant impact on its quality. The more specified terms the machine receives, the more the neural approach is cancelled out and the MT system is narrowly constrained. The result is often a string of terms from the database that produces clumsy-sounding sentences. Best practice is therefore a “bottom-up” strategy, in which glossaries are populated with a limited number of important core terms. The machine output is then checked for terminology errors and any terms that are regularly associated with errors can be added to the glossary.
It’s important to look at the native output of translation engines. Depending on the subject area and language, many terms from a company database may not need to be included in a glossary as MT systems already use the desired terminology because of its statistical frequency in the training material. For example, a company has an entry in its terminology database for “Schraubendreher” (screwdriver) because the use of this standardised term and its synonym “Schraubenzieher” has been defined. From a terminological point of view, this entry makes perfect sense, but it is superfluous for a glossary for the German-English language direction. It can be safely assumed that every MT system will natively translate both “Schraubendreher” and “Schraubenzieher” as “screwdriver” as this is by far the most common and standardised English equivalent.
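The bottom-up strategy can be sketched as follows: the glossary starts small, and a term from the termbase is only added once it has repeatedly been flagged as a terminology error in the native MT output. The error log format, the function name and the threshold of three errors are assumptions chosen for illustration, not recommended values.

```python
from collections import Counter

def extend_glossary(glossary, termbase, error_log, min_errors=3):
    """Return a copy of the glossary extended with termbase entries whose
    source term was flagged at least `min_errors` times in MT output."""
    counts = Counter(error_log)
    extended = dict(glossary)
    for term, count in counts.items():
        if count >= min_errors and term in termbase and term not in extended:
            extended[term] = termbase[term]
    return extended

# Hypothetical termbase (source term -> preferred target term).
termbase = {
    "Spurhalteassistent": "lane keeping assist",
    "Schraubendreher": "screwdriver",
    "Schraubenzieher": "screwdriver",
}

# Source terms flagged as terminology errors during post-editing.
# The screwdriver terms never appear because the engine already
# renders them as desired.
error_log = ["Spurhalteassistent"] * 4

glossary = extend_glossary({}, termbase, error_log)
print(glossary)
```

Terms that the engine already translates as desired never reach the threshold, so they stay out of the glossary and the engine is constrained only where it actually needs to be.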
Before machine translation: consistent terminology in the training material
MT training promises to raise the quality of machine output to a new level. Through targeted training using company data, the appropriately trained MT system should translate technical details not only correctly but also in accordance with the defined style and corporate language. This requires comprehensive but linguistically clean training material for the desired language pair, which is then used to train a customised engine.
For the purposes of specifying terminology, this means that all training material must contain the desired terminology in a clear, consistent form. As a general principle, a source text should contain plenty of variants while the target language should be kept consistent. The above example of “Schraubendreher” (screwdriver) is a good illustration of this. The source language material must contain both German terms, i.e. “Schraubendreher” and “Schraubenzieher”. If possible, both terms should be used in different inflections and in a variety of sentences. In the target language – in our example English – only “screwdriver” should be included to ensure consistent translation.
As the training material is usually put together from translation data acquired over many years or decades, consistency in terminology is usually wishful thinking. This makes it all the more important to clean up the data before training so that the machine can learn the desired terminology.
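One simple form of such a clean-up is to separate out segment pairs in which a source-language variant occurs but the desired target term does not. The sketch below assumes that the training material is available as (source, target) sentence pairs; the rule structure and function names are hypothetical, and the prefix matching is only a rough approximation of inflection handling.

```python
import re

# Hypothetical rule: all source variants must map to one target term.
TERM_RULES = [
    {
        "source_variants": ["Schraubendreher", "Schraubenzieher"],
        "target": "screwdriver",
    },
]

def has_term(text, term):
    """Match the term at the start of a word so that simple inflections
    such as 'Schraubendrehers' or 'screwdrivers' are also found."""
    return re.search(r"\b" + re.escape(term), text, flags=re.IGNORECASE) is not None

def split_training_pairs(pairs, rules=TERM_RULES):
    """Split segment pairs into (clean, flagged). A pair is flagged if a
    source variant occurs but the desired target term is missing."""
    clean, flagged = [], []
    for source, target in pairs:
        consistent = all(
            has_term(target, rule["target"])
            for rule in rules
            if any(has_term(source, v) for v in rule["source_variants"])
        )
        (clean if consistent else flagged).append((source, target))
    return clean, flagged

pairs = [
    ("Ziehen Sie die Schraube mit dem Schraubenzieher an.",
     "Tighten the screw with the screwdriver."),
    ("Verwenden Sie einen Schraubendreher.",
     "Use a turnscrew."),
    ("Schalten Sie das Gerät aus.",
     "Switch off the device."),
]

clean, flagged = split_training_pairs(pairs)
print(len(clean), len(flagged))
```

Flagged pairs can then be corrected or excluded before training, so that the engine only sees the desired target term in context.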
Incidentally, it would be inappropriate to simply transfer the desired terminology to the MT system in list form as training material. A simple word list lacks the context that is so important for NMT systems and which is used for statistical comparison: with what other words does the word you are looking for often occur, and how is it translated in different sentences? An NMT system cannot learn anything from the list format without context.
In summary
It’s imperative that machine translation has specified terminology to work with if it is to match human translations in terms of quality. Whether this is integrated into the translation output before, during or after the use of MT depends on the process and the capabilities of the systems. However, the quality and preparation of terminology play a decisive role.
Would you like more information about the interaction between terminology and machine translation? Or would you like to know how your organisation’s terminology can be integrated into machine translation? Then get in touch with us at mtpe@oneword.de.