Quality Estimation Scores Explained: Beyond Numbers and Metrics
Adria Crangasu
January 15, 2025 · 5 min
As Large Language Models (LLMs) continue to evolve, machine-generated translations are becoming the default translation method for many industries and use cases. Advanced neural machine translation (NMT) engines allow businesses to save money and translate more content, but evaluating the quality of machine-generated translations can still be costly and cumbersome – defeating the purpose of using NMT.
Fortunately, quality estimation (QE) is on hand to help us measure the accuracy and overall quality of machine translations without the need for a reference translation. Below, we’ll explore the fundamentals of quality estimation and how it differs from quality evaluation. We will also take a closer look at the key metrics involved in calculating QE scores.
Understanding the Basics of Quality Estimation Scores
Quality estimation is an easy way to anticipate the quality level of machine-generated translations before investing in human post-editing or risking it all and publishing raw MT. The beauty of QE is that it allows you to measure quality without taking reference translations into account. Additionally, the process doesn’t depend on the input or oversight of human linguists.
QE takes several factors into account to generate a score that indicates the reliability and accuracy of machine-translated text. Key metrics like fluency, style, accuracy, and more are individually assessed, with the translated text compared against a multitude of linguistic models and other language resources.
While QE scores are a fairly reliable benchmark, it’s worth remembering that they’re only an estimation. The value of QE scores can vary depending on the context of the text in question. Different metrics and tailored calculations can be used to make scores more relevant to specific scenarios.
What’s the Difference Between Quality Estimation and Quality Evaluation?
While quality estimation and quality evaluation sound similar, they’re actually quite different. Quality evaluation involves assessing content after the translation process is over. Once delivered, machine-translated text is compared against reference translations written by human linguists to establish things like relevance, reliability, and readability of MT.
While quality evaluation takes place after translation, quality estimation is a part of the translation process and doesn’t require reference translations. It’s used as a predictive tool to establish the quality level of machine translations without human involvement. Quality estimation scores serve as a useful guide to streamline translation and localization workflows, determining if MT use is appropriate for the source text and whether machine-generated texts will require any post-editing intervention by human translators.
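As a simple illustration of this routing role, the sketch below shows how a segment-level QE score might decide the next step in a localization workflow. The thresholds and function name are hypothetical and would be tuned for each project and language pair.

```python
# Minimal sketch: routing machine-translated segments based on a QE score.
# The thresholds below are hypothetical and would be tuned per project.
def route_segment(qe_score: float) -> str:
    """Decide the next workflow step for a machine-translated segment."""
    if qe_score >= 0.9:
        return "publish raw MT"            # high confidence, no human touch needed
    if qe_score >= 0.7:
        return "send to human post-editing"
    return "retranslate from scratch"      # MT likely unsuitable for this segment

# Example: three segments with different estimated quality levels
for score in (0.95, 0.78, 0.42):
    print(score, "->", route_segment(score))
```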
How Quality Estimation Scores Are Generated
Quality estimation scores are calculated automatically, with several factors taken into account. Generally speaking, the most important of these factors is semantic similarity – in other words, how similar two text segments are in meaning.
Another factor taken into account is sentence embedding, a natural language processing technique that captures the meaning and context of a sentence and how its words relate to each other. When working out QE scores, sentence embeddings help determine the similarity between source and target segments.
Each segment is then given its own quality estimation score. In most cases, scores are displayed in numerical form on a scale ranging from 0 to 1. The higher the score, the better the quality level. However, with custom models, there’s plenty of scope for customization of scoring labels.
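To make the idea more tangible, here is a minimal sketch of how segment-level similarity might be scored with sentence embeddings. It assumes the open-source sentence-transformers library and a general-purpose multilingual model; production QE systems rely on dedicated models trained on human quality judgments, so treat this purely as an illustration of the underlying principle.

```python
# Minimal sketch: estimating segment quality via embedding similarity.
# Assumes the open-source sentence-transformers library; real QE engines
# use dedicated models trained on human quality annotations.
from sentence_transformers import SentenceTransformer, util

# A multilingual model maps source and target segments into a shared vector space.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

source = "The invoice must be paid within 30 days."
target = "La facture doit être réglée sous 30 jours."  # machine translation

# Encode both segments and compare them with cosine similarity.
src_emb = model.encode(source, convert_to_tensor=True)
tgt_emb = model.encode(target, convert_to_tensor=True)

# Scores for related segments land roughly between 0 and 1 – the higher
# the score, the more of the source meaning the target appears to preserve.
score = util.cos_sim(src_emb, tgt_emb).item()
print(f"Estimated quality score: {score:.2f}")
```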
Key Metrics in Quality Estimation and Their Importance
Quality estimation is defined by a handful of key metrics. Below, we’ll take a look at some of the most important ones.
Fluency: Ensuring Readable Translations
Fluency forms an important part of quality estimation. As well as being accurate and maintaining the meaning of the original, translated text segments need to be readable. The more readable and accessible the text is, the higher the overall QE score.
Adequacy: Maintaining Content Accuracy
Another crucial metric to consider when calculating QE scores is adequacy. Here, we’re referring to the overall accuracy of the translated text. If the translated text correctly represents the content of the original, adequacy ratings – and the overall QE score – will be higher.
Error Prediction: Identifying Potential Issues in MT Output
Machine translations are far from perfect. More basic MT tools can struggle with nuanced and complex language, leading to mistranslations in the target-language output. Quality estimation looks at the potential for such errors, underlining likely issues that may arise during the machine translation process. The higher this likelihood is, the lower the final QE score will be.
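A rough sketch of how word-level error prediction can feed into a score is shown below. The per-token error probabilities are invented for illustration – in practice they would come from a trained QE model.

```python
# Minimal sketch of word-level error prediction.
# The per-token error probabilities are hypothetical, standing in for the
# output of a trained QE model.
target_tokens = ["La", "facture", "doit", "être", "payée", "dans", "30", "jours"]
error_probs   = [0.02, 0.05, 0.03, 0.04, 0.35, 0.62, 0.08, 0.06]  # illustrative values

THRESHOLD = 0.5  # tokens above this are flagged for post-editing review

tags = ["BAD" if p > THRESHOLD else "OK" for p in error_probs]
flagged = [tok for tok, tag in zip(target_tokens, tags) if tag == "BAD"]

# The more likely errors a segment contains, the lower its segment-level score.
segment_score = 1 - sum(error_probs) / len(error_probs)
print(list(zip(target_tokens, tags)))
print(f"Flagged for review: {flagged}, segment score ≈ {segment_score:.2f}")
```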
Similarities and Differences Between QE Scores and Translation Memory (TM) Matches
While QE scores and TM matches have the same goal (making the translation process more efficient), they aren’t created equal. Translation memory matches compare two source text segments – a newly added segment and a previously translated one stored in the TM. Maintaining a quality TM of verified past translations makes it a great tool for speeding up the translation process and reducing costs.
QE scores, however, are calculated based on similarities between the source text string and target string without the need for a reference translation.
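For a concrete feel of the difference, the sketch below scores a TM fuzzy match using simple character-level similarity from Python’s standard library – real TM engines use more sophisticated fuzzy-matching algorithms. A QE score, by contrast, would compare the source segment with its machine translation, as in the embedding sketch above.

```python
# Minimal sketch: a TM fuzzy match compares two SOURCE segments.
# Character-level similarity is used here for illustration only.
from difflib import SequenceMatcher

new_source = "Click the button to save your changes."
tm_source  = "Click the button to save your settings."  # already translated and stored in the TM

fuzzy_match = SequenceMatcher(None, new_source, tm_source).ratio()
print(f"TM fuzzy match: {fuzzy_match:.0%}")  # a high match suggests reusing the stored translation
```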
Reliability of QE Scores: What You Need to Know
It’s important to remember that QE scores are an approximation, rather than an ironclad statistic. Context needs to be taken into account when deciding how reliable an estimation actually is. Ultimately, QE scores serve as a guideline and should be interpreted as such. However, by customizing models, estimates can be made more specific.
Customizing QE Scores for Specific Domains and Languages
Looking to improve the accuracy of QE scores? Custom QE models tailored to things like specific language pairs can deliver the results you’re looking for. You can adjust training data labels to best suit specific scenarios, while individual models can be trained for particular language pairs or industries to better align scores with your content.
Future of Quality Estimation: AI and Customizable Scoring Models
Currently, quality estimation only delivers an approximation when determining scores for machine translations. As previously discussed, the reliability of these scores can vary and is largely dependent on context. However, quality estimation is set to become increasingly reliable in the future.
Advancements in artificial intelligence are certain to increase the reliability of quality estimation scoring. The rise of artificial neural networks will allow for deep learning to enrich quality estimation tools and technologies. Meanwhile, improvements to predictive analysis will allow for machine learning, data analysis, and new statistical models to produce increasingly reliable quality estimates.
What’s more, customizable scoring models will become more commonplace, helping overcome context-specific drawbacks. With advanced customization, models can be heavily tailored to provide scores that are better aligned with highly specific scenarios, significantly increasing the value of QE scores.
Improve Quality Estimation Scores with BLEND
If you are looking for ways to increase localization efficiency, reduce costs, and take advantage of AI developments, quality estimation is a great place to start. Essential for ensuring accuracy and high-quality output of NMT, QE scores can be readily tailored to better suit the needs of your industry, producing reliable results without the need for reference translations.
Flexi by BLEND offers the best quality estimation engines for your needs, along with expert AI and machine translation services and professional human editing when you need it. Get in touch with our team today to find out how Flexi’s QE solutions can connect to your workflows and save you 50-80% on localization costs.
With over 10 years of experience in the localization industry, Adria is an expert in designing localization and translation solutions for global businesses in every industry. Adria is BLEND’s Solutions Architect and the Chapter Manager of Women in Localization Romania.