A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore

Aniruddha Mukherjee, Vikas Hassija, Vinay Chamola, Karunesh Kumar Gupta
{"title":"A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore","authors":"Aniruddha Mukherjee;Vikas Hassija;Vinay Chamola;Karunesh Kumar Gupta","doi":"10.1109/OJCS.2025.3560333","DOIUrl":null,"url":null,"abstract":"<sc><b>Bleurt</b></small> is a recently introduced metric that employs <sc>Bert</small>, a potent pre-trained language model to assess how well candidate translations compare to a reference translation in the context of machine translation outputs. While traditional metrics like<sc>Bleu</small> rely on lexical similarities, <sc>Bleurt</small> leverages <sc>Bert</small>’s semantic and syntactic capabilities to provide more robust evaluation through complex text representations. However, studies have shown that <sc>Bert</small>, despite its impressive performance in natural language processing tasks can sometimes deviate from human judgment, particularly in specific syntactic and semantic scenarios. Through systematic experimental analysis at the word level, including categorization of errors such as lexical mismatches, untranslated terms, and structural inconsistencies, we investigate how <sc>Bleurt</small> handles various translation challenges. Our study addresses three central questions: What are the strengths and weaknesses of <sc>Bleurt</small>, how do they align with <sc>Bert</small>’s known limitations, and how does it compare with the similar automatic neural metric for machine translation, <sc>BERTScore</small>? Using manually annotated datasets that emphasize different error types and linguistic phenomena, we find that <sc>Bleurt</small> excels at identifying nuanced differences between sentences with high overlap, an area where <sc>BERTScore</small> shows limitations. 
Our systematic experiments, provide insights for their effective application in machine translation evaluation.","PeriodicalId":13205,"journal":{"name":"IEEE Open Journal of the Computer Society","volume":"6 ","pages":"658-668"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10964149","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10964149/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Bleurt is a recently introduced metric that employs Bert, a powerful pre-trained language model, to assess how well candidate translations compare to a reference translation in the context of machine translation outputs. While traditional metrics like Bleu rely on lexical similarity, Bleurt leverages Bert's semantic and syntactic capabilities to provide a more robust evaluation through rich text representations. However, studies have shown that Bert, despite its impressive performance on natural language processing tasks, can sometimes deviate from human judgment, particularly in specific syntactic and semantic scenarios. Through systematic experimental analysis at the word level, including categorization of errors such as lexical mismatches, untranslated terms, and structural inconsistencies, we investigate how Bleurt handles various translation challenges. Our study addresses three central questions: What are the strengths and weaknesses of Bleurt? How do they align with Bert's known limitations? And how does Bleurt compare with BERTScore, a similar automatic neural metric for machine translation? Using manually annotated datasets that emphasize different error types and linguistic phenomena, we find that Bleurt excels at identifying nuanced differences between sentences with high overlap, an area where BERTScore shows limitations. Our systematic experiments provide insights for the effective application of both metrics in machine translation evaluation.
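To make the contrast concrete: BERTScore matches each token of the candidate against its most similar token in the reference (and vice versa) in contextual-embedding space, then combines the two directions into an F1. The sketch below illustrates this greedy-matching mechanism with toy 2-D vectors standing in for Bert embeddings; the function name and the toy vectors are illustrative, not part of the paper or of the official `bert-score` package.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Toy BERTScore: greedy cosine matching between token embedding rows."""
    # L2-normalize rows so dot products become cosine similarities
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # (n_cand, n_ref) similarity matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Toy 2-D "embeddings" for a two-token reference
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect = bertscore_f1(ref, ref)                          # identical tokens
degraded = bertscore_f1(np.array([[1.0, 0.0],
                                  [1.0, 0.0]]), ref)      # one token repeated
print(perfect, degraded)  # 1.0 vs. roughly 0.667
```

Note that the repeated-token candidate still gets precision 1.0 (every candidate token finds a perfect match), and only recall drops; this symmetric max-matching is precisely why BERTScore can struggle to separate sentences with very high token overlap, the regime where the paper finds Bleurt's learned regression head more discriminative.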