White paper

Unveiling the meaning of MT quality measures

Understand the different approaches to evaluate machine translation quality – and what they mean for your translation and localization needs.

Machine translation (MT) is growing in popularity and sophistication as the technology matures, and expectations about quality are rising accordingly. The question facing translation customers is whether MT quality is sufficient for their purposes or whether additional human expert post-editing and review processes are necessary.

Every post-editor and language service provider must establish whether they can improve raw MT results to meet customer expectations – and at what cost. This can be a tricky problem to solve, but one that can definitely be overcome with the right methodology and know-how. 

We know that MT is not always reliable and that its raw output needs to be revised, but how do we know if it will actually save us work while we’re required to maintain human quality standards? How do we determine the amount of effort it saves compared to a human translation from scratch? We don’t want to be wasting time and effort, after all.

With that in mind, we need methods to evaluate raw MT quality. People usually expect that MT, an automatically produced translation, also comes with an automatically produced indication of its correctness or reliability – or at least that there are tools to automatically rate MT quality and indicate the effort involved in post-editing. Unfortunately, it’s not that easy.

How do we evaluate translation quality, anyway?

In order to better understand how we might evaluate machine translation quality, it makes sense to look at how we currently evaluate human translation quality.

Scoring standards for human translation include (but are not limited to) the Multidimensional Quality Metrics (MQM), the Dynamic Quality Framework (DQF), and the J2450 Translation Quality Metric. These standards are used to evaluate quality criteria like linguistic correctness, understandability, fluency, cultural appropriateness, and so on.

These evaluation methods usually produce a unified score that reflects the number of mistakes and their severity in relation to the volume of a given text. Such scores can be tuned to the relevant use case (using adjusted thresholds, for instance) so that you can decide whether a translation is good or bad – that is, suitable to your purposes or not. So far, so good.
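To illustrate the principle, a unified score of this kind can be sketched as severity-weighted error penalties normalized by text volume. The categories, weights, and per-1,000-words normalization below are hypothetical placeholders – MQM, DQF, and J2450 each define their own – but the mechanism is the same:

```python
# Hypothetical severity weights for illustration only; real frameworks
# (MQM, DQF, J2450) define their own categories and penalty values.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def quality_score(errors, word_count, per_words=1000):
    """Return weighted error penalties per `per_words` words (lower is better).

    `errors` is a list of (category, severity) tuples produced by a
    human reviewer; the category is kept for reporting, not scoring.
    """
    penalty = sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)
    return penalty * per_words / word_count

# One minor terminology issue and one major accuracy issue in 500 words:
errors = [("terminology", "minor"), ("accuracy", "major")]
print(quality_score(errors, word_count=500))  # 12.0 penalty points per 1,000 words
```

A project-specific threshold on this score is then what turns the number into a pass/fail verdict for a given use case.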

But whatever standard you choose – and however you define your thresholds – the task of detecting and classifying errors according to those metrics relies entirely on human reviewers.

And here’s the bad news you’ve been waiting for: This task remains a manual, human task even when you’re assessing machine translation quality.

So, what’s the point of automatic scoring of MT quality, then?

The answer is simple: Automated scores are useful – it’s just that their usefulness depends on what answer you expect.

The challenges in assessing actual translation quality don’t magically disappear when moving from human to machine translation. Furthermore, there are various metrics for measuring MT quality, and the one you should use depends on what you want to know.

For instance, if you want to assess whether machine translated content can be used without post-editing for a given use case, you would essentially use the same quality assessment as you would for human translation: A qualified linguist reviews the translation and its source, classifies errors and then obtains a score that indicates whether the raw MT passed or failed in the relevant context. There’s no magic shortcut or way around it: If you want to be sure that a given machine-translated text meets your quality expectations, you need to apply human review.

But what if you have a different question? What if you want to compare MT to MT – that is, to get a general idea of how well a specific MT engine works for a given test set when compared to other engines? For comparative evaluations, the bilingual evaluation understudy (BLEU) method might fit your needs best.

And finally, what about the question that matters most in a post-editing context: Are we saving effort in translation by post-editing MT compared to translating from scratch? And if so, how much? In this case, if you want to make sure you’re not spinning your wheels, post-edit distance (PED) could be the measurement method you’re looking for.

Let’s take a closer look at BLEU and similar methods and PED to better understand what they actually measure.

BLEU and similar methods – There’s only one right answer

The bilingual evaluation understudy (BLEU) scoring methodology and similar methods such as HTER (Human-targeted Translation Error Rate) or LEPOR (Length Penalty, Precision, n-gram Position difference Penalty and Recall) were developed by MT engineers as a quick and inexpensive way to evaluate the success of their engine tuning, because they don’t require the involvement of a human evaluator. However, this also means that they don’t provide the same answers that a human evaluator might.


How BLEU works

BLEU is based on the assumption that there is only one good translation for a text, and that MT quality is the degree to which an MT output is similar to that translation. The “good translation” is called the reference translation: a sample of text available in both the source language and the target language – in other words, a sample that was previously human-translated and is considered to be of good quality.

Measurement is then performed against exactly that reference text: The source text is translated by one or several MT engines, and an algorithm calculates the difference between each MT result and the reference translation. The result is the so-called BLEU score, expressed as a number between 0 and 1, or between 0% and 100%: The higher the BLEU score, the more similar the two texts are.
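The core idea behind the calculation – modified n-gram precision combined with a brevity penalty – can be sketched in a few lines. This is a deliberately simplified single-reference version; production implementations such as sacreBLEU add smoothing and support multiple references:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each candidate n-gram counts only as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Avoid log(0); real implementations use proper smoothing instead.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

An identical candidate and reference yield a perfect score of 1.0; every divergence in word choice or word order pulls the score down, regardless of whether the divergent translation is actually wrong.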

The shortcomings of BLEU

While the way in which this method calculates similarity is quite sophisticated, the primary issue with BLEU and similar metrics is that they assume there is only one good translation for each text. However, professional linguists generally understand that there may be several fitting translations for any given source text.

As such, BLEU does not truly measure translation quality, but rather the degree to which a given engine can imitate certain reference texts.

It’s easy to understand that BLEU scores for the same MT engine will differ depending on the reference text. It is also clear that a BLEU score obtained with a poor-quality reference text will not reflect MT quality at all. Moreover, the score will depend on the size of the sample you use, the character set of the languages measured, and other factors. Not so straightforward now, is it?

It is also clear that BLEU will not deliver a quality verdict on new texts because it requires a test scenario with an established (human-translated) reference text. You can’t use BLEU to measure the quality of machine translations that have never been translated by humans before, which makes it unsuitable for a predictive application.

BLEU is, however, a valid instrument for measuring the effect of engine training and – to some extent – for comparing the engines of different MT providers. It is important to note, though, that a BLEU score is not a fixed characteristic of an MT engine but rather of a test scenario: The same engine will score differently depending on the reference translation.

The BLEU verdict

While BLEU usually does correlate with human judgment on MT quality, it does not actually answer the quality question for a given text. It merely indicates how probable it is that a text similar to the reference translation will be correct. Additionally, there is growing evidence that even in this limited scope of application, BLEU might be nearing the end of its usable life.

PE Distance – Measuring under real-world conditions

How PED works

Post-edit distance (PED) measures the amount of editing that a machine-translated text requires in order to meet quality expectations. The primary difference compared to BLEU is that the human reference translation is actually produced by post-editing the MT output, which increases the probability that MT and human translation are similar or identical – translators with a solid post-editing background will not introduce unnecessary changes to the MT. Therefore, assuming that the translators did their job correctly, PED reflects MT suitability for post-editing much better than BLEU does.

So, can any linguist with post-editing experience do the post-editing for a PED analysis? Not quite. The important factor here is that the translator actually understands the customer’s quality expectations for the text. A machine translation can sound fluent, without any apparent errors in meaning, and still not meet quality requirements. For instance, customer-specific terminology or style might not have been applied, texts might exceed length limitations, or formatting information might have been lost. In short, you’ll want a linguist with both post-editing experience and customer know-how.

With PED, real-world conditions are required to obtain reliable figures, and post-edit distance can be calculated only based on post-editing that meets quality expectations. An algorithm calculates the difference between the raw MT and post-edited translation and issues a value per segment and per text sample. This value indicates the percentage of raw MT that was reused by the translator, starting from 100% (translator made no changes to the segment or text) and decreasing from there. High PED scores indicate a real gain in efficiency for the translator.
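To make the mechanics concrete, here is a simplified sketch that derives a PED-style percentage from a character-level edit distance. Production tools typically work on tokens and handle segmentation and tags, but the principle is the same:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]

def ped_score(raw_mt, post_edited):
    """Percentage of raw MT reused: 100 means the translator changed nothing."""
    if not raw_mt and not post_edited:
        return 100.0
    dist = levenshtein(raw_mt, post_edited)
    return 100.0 * (1 - dist / max(len(raw_mt), len(post_edited)))

print(ped_score("The engine performs well.", "The engine performs well."))  # 100.0
```

A segment the post-editor left untouched scores 100; heavier rewrites push the score toward 0, and the per-text average across segments is what feeds the effort analysis.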

How do PED scores relate to post-editing effort?

The rule of thumb here is that the higher the PED score, the lower the effort. However, as with translation memory matches, there’s a certain percentage threshold that must be reached to represent real gains in efficiency. If the overall PED value for a given text type is consistently below this threshold, MT doesn’t save time.
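This rule of thumb can be expressed as a simple decision function. The 70% threshold below is purely illustrative – the real cutoff depends on your language pair, content type, and rates:

```python
def mt_saves_time(ped_values, threshold=70.0):
    """Hypothetical decision rule: MT pays off for a text type only if the
    average PED across enough samples clears the threshold. The default of
    70.0 is an illustrative value, not an industry constant."""
    return sum(ped_values) / len(ped_values) >= threshold

print(mt_saves_time([82.0, 75.5, 91.0]))  # True
print(mt_saves_time([52.0, 61.0]))        # False
```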

So, does a high PED value mean that the translator expended no effort, and do you still have to pay for post-editing if PED is close to 100%? The answer is: If you want post-editing, it will have a cost. Even with a very high post-edit distance value, the translator’s effort is not zero: They have performed a full review of the target text and compared it to the source text, validated that the terminology applied by the MT system is the right one, potentially performed additional research or obtained clarification, and so on. The effort of post-editing is never zero, even when there are almost no edits. This is comparable to a second opinion by a physician: The fact that both doctors come to the same conclusion doesn’t mean the second one didn’t have to examine the patient thoroughly.

Reliable post-editing effort predictions

By assessing PED values across large enough volumes of similar text, you can get a reliable indication of the effort involved and quantify efficiency gains. Small, anecdotal samples are not a suitable basis for this kind of analysis, as they might produce PED figures that are too positive or too negative and ultimately not representative of average real-world results. Thankfully, testing with suitable volumes does not mean adding cost to your normal translation process. We know our stuff on this one, so don’t hesitate to ask your contact at Amplexor for a Machine Translation Pilot and learn how to calculate your savings potential.

Machine translation quality – Which system is best?

At Amplexor, we know what it takes to produce high-quality translations, and we choose our human translators and MT engines accordingly.

Would we work with a human translator who delivers superior quality but is notorious for sharing their customers’ content on social platforms and disclosing business secrets? Or one who is unable to adhere to technical requirements and regularly introduces errors into XML structures and formats? We would have to be crazy! And you would be crazy too if you stayed with any LSP that permitted such behavior.

Furthermore, would we ask a single translator to perform translations into all our target languages and for all subject matters? Again, we’d have to be crazy.

The same considerations are relevant for MT, and we have developed a decidedly non-crazy approach to the challenge: We apply a range of criteria when it comes to MT engine selection, and not all of the criteria are strictly about linguistic output quality – though it is a crucial piece of the puzzle. In order to safely and efficiently apply machine translation in our processes, we also consider confidentiality, availability of a sustainable service offering (including API), overall cost, and general robustness of the system.

We define robustness as the ability to produce good linguistic quality outside of laboratory conditions, which includes tolerance of source text typos, incomplete sentences, creative formatting and foreign language phrases in source files. Furthermore, we assess the quality of integration in the relevant translation memory tool.

Ultimately, there is no one-size-fits-all solution, and a concrete context is required to answer the question of which MT system is “best”. Technology evolves rapidly, and our preferred technologies from last year might not be the best options today. We keep up with the state of the art in the industry so that you don’t need to be an MT expert, and we monitor the market so that you can select the best possible engine for your scenario.

Conclusion

So, it turns out that so-called MT quality indicators like BLEU, LEPOR, HTER or PED don’t actually measure quality as such. But there’s good news: They do provide us with the KPIs we need to make quality decisions.

In practical terms, measuring actual linguistic quality in translation – whether human or machine-generated – still remains a manual exercise. There’s currently no such thing as an automated quality score, which is why having the right experts for all relevant target languages at hand is a great advantage when it comes to picking the right system and assessing new technologies.

Given the pace of technological evolution, we may see more automated solutions for assessing translation quality on the horizon. Until then, Amplexor has everything well in hand.

If you want to learn more about MT quality indicators and test our "MT to fit" approach, get in touch with your contact at Amplexor.