07/06/2024

oneCleanup: cleaning up data made easy

Data is the new gold: it’s valuable and required for a whole range of processes, applications and developments. Generative AI in particular is showing once again what large amounts of data can create. But data is also the new rubbish: it appears in a wide variety of places and in large quantities, accumulates quickly, never shrinks and sometimes grows entirely unchecked. And the bigger the mountain of data, the more difficult it becomes to use it meaningfully. Our oneCleanup service takes on this challenge and helps to uncover the shimmering gold beneath the layer of dirt. We present the background and details, and we show why it’s high time that databases were seen not as a tangled mess but as treasure troves.

Terminology and translation memory – what’s the difference?

The kind of data that is relevant differs for each area of life and each area of business. With oneCleanup, we concentrate on language data and focus on the two most important types of data in the translation sector: translation memories and terminology databases.

In a translation memory (TM), the source text and the corresponding translation are stored segment by segment. The TM is therefore the translator’s digital memory. Each new text to be translated is compared with all previous projects stored in the memory, and identical or similar segments are recognised. These segments then do not have to be translated from scratch – which could well result in a different translation. Instead, the existing translation can simply be reused, or adapted as necessary. This clearly saves time and money, as existing segments are not charged at the full rate.
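To make the matching idea concrete, here is a minimal sketch of how fuzzy matching against a TM might work. The toy data, the use of Python’s difflib and the 75% threshold are illustrative assumptions; real TM systems use their own, more sophisticated matching and scoring algorithms.

    from difflib import SequenceMatcher

    # Toy translation memory: source segments mapped to stored translations.
    tm = {
        "Press the start button.": "Drücken Sie die Starttaste.",
        "Close the cover before use.": "Schließen Sie die Abdeckung vor Gebrauch.",
    }

    def best_match(new_segment, memory, threshold=0.75):
        """Return the most similar stored segment, its translation and the
        match score, if the score reaches the illustrative fuzzy threshold."""
        best = max(memory, key=lambda s: SequenceMatcher(None, new_segment, s).ratio())
        score = SequenceMatcher(None, new_segment, best).ratio()
        if score >= threshold:
            return best, memory[best], score
        return None  # no usable match: translate from scratch

    # A near-identical segment is recognised and its translation reused.
    print(best_match("Press the start button!", tm))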

A terminology database, however, is different, as it does not contain complete sentences. Instead, it contains entries for individual terms with the matching terms in the target language, plus illustrations, definitions and additional information. In the translation process, the terminology database is given priority over the TM: while a translation is being produced, the system identifies the terms that occur within a segment and displays their foreign-language equivalents. During or after translation, a terminology check then verifies that the specified terms have been used correctly.
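In much simplified form, such a terminology check could look like the following sketch. The term pairs and the plain substring matching are illustrative assumptions; real tools also handle inflection, casing rules and multi-word terms.

    # Toy terminology database: approved source terms and their target equivalents.
    termbase = {
        "cover": "Abdeckung",
        "start button": "Starttaste",
    }

    def check_terminology(source_segment, target_segment):
        """Report source terms whose prescribed target term is missing."""
        violations = []
        for src_term, tgt_term in termbase.items():
            if src_term in source_segment.lower() and tgt_term not in target_segment:
                violations.append((src_term, tgt_term))
        return violations

    # 'Deckel' was used instead of the prescribed 'Abdeckung'.
    print(check_terminology("Close the cover before use.",
                            "Schließen Sie den Deckel vor Gebrauch."))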

Historical growth

The two databases therefore contain different types of language data. What both have in common is that the data accumulates quickly: the number of segments stored in the TM grows with every translation project, and the terminology database grows with every new entry or term. Large projects or imports of prepared terminology lists can lead to substantial and sometimes uncontrolled growth. In day-to-day translation work, however, only a few companies have established processes for regular checks and data maintenance, let alone for targeted data clean-ups. After all, you might well think: the more data, the better. Every segment available in the TM could be required again elsewhere; every recorded term could save research time and increase consistency. So do large amounts of data save time and costs?

In practice, the opposite is often the case: large amounts of data quickly become confusing and therefore more difficult to handle. Databases then continue to grow in an uncontrolled manner and the data becomes unclean. And unclean data is much more difficult to use in a meaningful way. If, for example, a terminology database contains duplicates with different information or if a TM contains two different translations for an almost identical source segment, this disrupts the translation process and leads to increased effort spent researching and selecting the correct data. If the corresponding segments in the TM or entries in the terminology database are not corrected, this effort is repeated each time they appear during translation projects. However, the data from the TM and the terminology database is also becoming increasingly relevant outside of the translation process. This is because language data can be used for processes and applications in very different scenarios.
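The TM inconsistency just described – an almost identical source segment with two different stored translations – is exactly the kind of problem that can be surfaced automatically. A minimal sketch, assuming the TM has been exported as simple (source, target) pairs:

    from collections import defaultdict

    # TM export as (source, target) pairs; the sample data is an illustrative assumption.
    pairs = [
        ("Press the start button.", "Drücken Sie die Starttaste."),
        ("Press the start button.", "Drücken Sie den Startknopf."),
        ("Close the cover.", "Schließen Sie die Abdeckung."),
    ]

    def normalise(segment):
        # Fold casing and trailing punctuation so near-duplicates group together.
        return segment.lower().rstrip(".!?:")

    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[normalise(source)].add(target)

    # Sources with more than one stored translation need human review.
    inconsistent = {s: t for s, t in targets_by_source.items() if len(t) > 1}
    print(inconsistent)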

Language data for a wide range of applications

For a number of reasons, language data is finally being valued as it should have been long ago, whether in knowledge management, the targeted use of Large Language Models (LLMs) or machine translation. Here are two example scenarios that use language data.

In scenario 1, a company wants to train a chatbot for German and English to respond to support requests in both languages. The AI-supported assistant is to be based on existing manuals, which are used to generate responses. Translations from the last ten years are used to provide sufficient input for the training. However, the TM data has never been adapted to changes in terminology, and the interface texts have also changed in the meantime. The chatbot could therefore display outdated information or refer to buttons that no longer exist. The TM used for training also contains numerous duplicates and fragments, as segmentation was not always optimal during the translation process. This means that the AI system has a lot of data input but can learn nothing at all, or nothing meaningful, from it. On top of that, users of many models are charged on a token basis, i.e. per smallest unit into which the text is split for processing. Both the quantity and the quality of the input are therefore decisive.
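To illustrate why duplicates drive up token-based costs, the following sketch uses a simple whitespace split as a stand-in for a real tokenizer; actual LLM tokenizers work on subword units, and the price per token here is a made-up figure.

    segments = [
        "Press the start button.",
        "Press the start button.",   # duplicate inherited from the TM
        "Close the cover before use.",
    ]

    def rough_token_count(text):
        # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
        return len(text.split())

    PRICE_PER_1K_TOKENS = 0.01  # hypothetical price, for illustration only

    raw = sum(rough_token_count(s) for s in segments)
    deduplicated = sum(rough_token_count(s) for s in set(segments))
    print(raw, deduplicated)                 # -> 13 9
    print(raw * PRICE_PER_1K_TOKENS / 1000)  # cost grows with every duplicate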

In scenario 2, the content of the terminology database is to be used as a glossary for machine translation. In an ideal world, all entries would be transferred to the MT system and implemented correctly and consistently by the MT engine. In reality, however, terminology databases often contain thousands of entries that are supposed to serve as input. These entries may be contradictory or ambiguous, or may mix terms from different subject areas. Converting an extensive database into a glossary can also mean that every second word being translated is suddenly dictated by the glossary. The MT system’s good, fluent translation then quickly degenerates into a string of prescribed terms, which can significantly change and degrade the output. In this scenario, too, the quantity and quality of the data determine whether it can be used in a meaningful way. It’s clear that in both scenarios the data must be cleaned up. This is where our oneCleanup service comes into play.
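Before moving on: one simple way to spot such an over-broad glossary in advance is to measure how much of a sample text it would constrain. A hedged sketch, in which the glossary entries, the sample text and any sensible coverage limit are assumptions:

    glossary = {"cover": "Abdeckung", "use": "Gebrauch", "button": "Taste"}

    sample_text = "Close the cover before use and press the button."
    words = [w.strip(".,!?").lower() for w in sample_text.split()]

    constrained = sum(1 for w in words if w in glossary)
    coverage = constrained / len(words)
    print(f"{coverage:.0%} of the running text is dictated by the glossary")
    # If (say) every second word is constrained, the glossary leaves the
    # MT engine too little room for fluent output.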

Potential for cleaning up data and practical implementation

The aim of cleaning up TM and terminology data is therefore to obtain a reduced and clean database. To analyse where there is potential for cleaning up data, i.e. what can be cleaned up, we use automation and scripting so that we can evaluate the large volumes of data quickly and effectively.

For both types of data, we consider five key points:

  • Incorrect forms
  • Incorrect source-target pairings
  • Duplicates and similar data
  • Missing information
  • Outdated data

What each check specifically targets varies greatly depending on the type of data. In the terminology database, ‘incorrect forms’ includes, for instance, terms that have been capitalised even though they should be in lower case. When analysing TM data, by contrast, the same criterion returns segment pairs whose source and target end with different punctuation marks, for example – two variants that the sketch below illustrates.
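A deliberately basic sketch of these two variants of the check; the sample data, the part-of-speech flag and the punctuation set are illustrative assumptions, and the actual checks are configured per database:

    # Variant 1: terminology entries capitalised although they should be
    # lower case (here: German verbs, flagged via an assumed POS field).
    terms = [("Abdeckung", "noun"), ("drücken", "verb"), ("Schließen", "verb")]
    wrong_case = [t for t, pos in terms if pos == "verb" and t[0].isupper()]
    print(wrong_case)  # -> ['Schließen']

    # Variant 2: TM segment pairs whose source and target end with
    # different punctuation marks.
    pairs = [
        ("Press the start button.", "Drücken Sie die Starttaste!"),
        ("Close the cover.", "Schließen Sie die Abdeckung."),
    ]
    PUNCTUATION = ".!?:;"

    def final_punct(segment):
        return segment[-1] if segment and segment[-1] in PUNCTUATION else ""

    mismatched = [(s, t) for s, t in pairs if final_punct(s) != final_punct(t)]
    print(mismatched)  # flags the pair ending in '.' vs '!'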

With oneCleanup, we can analyse databases of any size. The steps in the check can be extended on a case-by-case basis to fulfil all company requirements. This is because no two TMs and terminology databases are structured or filled in exactly the same way.

The potential for cleaning up the data is presented as clear analysis results. Our oneCleanup service is highly automated, making it possible to rapidly assess the actual effort required to clean up the data. As always where quality and informed decisions are required, people are then involved to evaluate the results and determine and implement the necessary measures. Changes and corrections can be made immediately or data can be marked for deletion. The results of the analyses also enable an iterative approach in order to implement the clean-up steps gradually.

Conclusion: Clean data through expertise

Data is only the new gold if it is regularly checked and cleaned up. Because in all areas in which language data can be used, quality is more important than quantity. With oneCleanup, we draw on our decades of language and process expertise to analyse data from TMs and terminology databases efficiently and with minimal use of resources, and to leverage the full potential of a data clean-up.

Would you like to find out more about oneCleanup? If so, our experts will be happy to arrange a consultation.
