A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore

Aniruddha Mukherjee, Vikas Hassija, Vinay Chamola, Karunesh Kumar Gupta
{"title":"A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore","authors":"Aniruddha Mukherjee;Vikas Hassija;Vinay Chamola;Karunesh Kumar Gupta","doi":"10.1109/OJCS.2025.3560333","DOIUrl":null,"url":null,"abstract":"<sc><b>Bleurt</b></small> is a recently introduced metric that employs <sc>Bert</small>, a potent pre-trained language model to assess how well candidate translations compare to a reference translation in the context of machine translation outputs. While traditional metrics like<sc>Bleu</small> rely on lexical similarities, <sc>Bleurt</small> leverages <sc>Bert</small>’s semantic and syntactic capabilities to provide more robust evaluation through complex text representations. However, studies have shown that <sc>Bert</small>, despite its impressive performance in natural language processing tasks can sometimes deviate from human judgment, particularly in specific syntactic and semantic scenarios. Through systematic experimental analysis at the word level, including categorization of errors such as lexical mismatches, untranslated terms, and structural inconsistencies, we investigate how <sc>Bleurt</small> handles various translation challenges. Our study addresses three central questions: What are the strengths and weaknesses of <sc>Bleurt</small>, how do they align with <sc>Bert</small>’s known limitations, and how does it compare with the similar automatic neural metric for machine translation, <sc>BERTScore</small>? Using manually annotated datasets that emphasize different error types and linguistic phenomena, we find that <sc>Bleurt</small> excels at identifying nuanced differences between sentences with high overlap, an area where <sc>BERTScore</small> shows limitations. 
Our systematic experiments, provide insights for their effective application in machine translation evaluation.","PeriodicalId":13205,"journal":{"name":"IEEE Open Journal of the Computer Society","volume":"6 ","pages":"658-668"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10964149","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10964149/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Bleurt is a recently introduced metric that employs Bert, a powerful pre-trained language model, to assess how well candidate translations compare to a reference translation in the context of machine translation outputs. While traditional metrics like Bleu rely on lexical similarity, Bleurt leverages Bert's semantic and syntactic capabilities to provide a more robust evaluation through rich text representations. However, studies have shown that Bert, despite its impressive performance on natural language processing tasks, can sometimes deviate from human judgment, particularly in specific syntactic and semantic scenarios. Through systematic experimental analysis at the word level, including categorization of errors such as lexical mismatches, untranslated terms, and structural inconsistencies, we investigate how Bleurt handles various translation challenges. Our study addresses three central questions: What are the strengths and weaknesses of Bleurt? How do they align with Bert's known limitations? And how does Bleurt compare with BERTScore, a similar automatic neural metric for machine translation? Using manually annotated datasets that emphasize different error types and linguistic phenomena, we find that Bleurt excels at identifying nuanced differences between sentences with high overlap, an area where BERTScore shows limitations. Our systematic experiments provide insights for the effective application of both metrics in machine translation evaluation.
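To make the contrast concrete: BERTScore matches each token of the candidate against its most similar token in the reference (and vice versa) in contextual-embedding space, then combines the two directions into an F1. The sketch below illustrates this greedy-matching mechanism with toy 2-D vectors standing in for Bert embeddings; the function name and the toy vectors are illustrative, not part of the paper or of the official `bert-score` package.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Toy BERTScore: greedy cosine matching between token embedding rows."""
    # L2-normalize rows so dot products become cosine similarities
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # (n_cand, n_ref) similarity matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Toy 2-D "embeddings" for a two-token reference
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect = bertscore_f1(ref, ref)                          # identical tokens
degraded = bertscore_f1(np.array([[1.0, 0.0],
                                  [1.0, 0.0]]), ref)      # one token repeated
print(perfect, degraded)  # 1.0 vs. roughly 0.667
```

Note that the repeated-token candidate still gets precision 1.0 (every candidate token finds a perfect match), and only recall drops; this symmetric max-matching is precisely why BERTScore can struggle to separate sentences with very high token overlap, the regime where the paper finds Bleurt's learned regression head more discriminative.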