{"title":"迷失在翻译中?在评价中发现:句子级翻译评价综述","authors":"Ananya Mukherjee, Manish Shrivastava","doi":"10.1145/3735970","DOIUrl":null,"url":null,"abstract":"Machine Translation (MT) revolutionizes cross-lingual communication but is prone to errors, necessitating thorough evaluation for enhancement. Translation quality can be assessed by humans and automatic evaluation metrics. Human evaluation, though valuable, is costly and subject to limitations in scalability and consistency. While automated metrics supplement manual evaluations, this field still has considerable potential for development. However, there exists prior survey work on automatic evaluation metrics, it is worth noting that most of these are focused on resource-rich languages, leaving a significant gap in evaluating MT outputs across other language families. To bridge this gap, we present an exhaustive survey, encompassing discussions on MT meta-evaluation datasets, human assessments, and diverse metrics. We categorize both human and automatic evaluation approaches, and offer decision trees to aid in selecting the appropriate approach. Additionally, we evaluate sentences across languages, domains and linguistic features, and further meta-evaluate the metrics by correlating them with human scores. We critically examine the limitations and challenges inherent in current datasets and evaluation approaches. We propose suggestions for future research aimed at enhancing MT evaluation, including the importance of diverse and well-distributed datasets, the refinement of human evaluation methodologies, and the development of robust metrics that closely align with human judgments.","PeriodicalId":50926,"journal":{"name":"ACM Computing Surveys","volume":"18 1","pages":""},"PeriodicalIF":23.8000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Lost in Translation? Found in Evaluation: A Comprehensive Survey on Sentence-Level Translation Evaluation\",\"authors\":\"Ananya Mukherjee, Manish Shrivastava\",\"doi\":\"10.1145/3735970\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Machine Translation (MT) revolutionizes cross-lingual communication but is prone to errors, necessitating thorough evaluation for enhancement. Translation quality can be assessed by humans and automatic evaluation metrics. Human evaluation, though valuable, is costly and subject to limitations in scalability and consistency. While automated metrics supplement manual evaluations, this field still has considerable potential for development. However, there exists prior survey work on automatic evaluation metrics, it is worth noting that most of these are focused on resource-rich languages, leaving a significant gap in evaluating MT outputs across other language families. To bridge this gap, we present an exhaustive survey, encompassing discussions on MT meta-evaluation datasets, human assessments, and diverse metrics. We categorize both human and automatic evaluation approaches, and offer decision trees to aid in selecting the appropriate approach. Additionally, we evaluate sentences across languages, domains and linguistic features, and further meta-evaluate the metrics by correlating them with human scores. We critically examine the limitations and challenges inherent in current datasets and evaluation approaches. 
We propose suggestions for future research aimed at enhancing MT evaluation, including the importance of diverse and well-distributed datasets, the refinement of human evaluation methodologies, and the development of robust metrics that closely align with human judgments.\",\"PeriodicalId\":50926,\"journal\":{\"name\":\"ACM Computing Surveys\",\"volume\":\"18 1\",\"pages\":\"\"},\"PeriodicalIF\":23.8000,\"publicationDate\":\"2025-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Computing Surveys\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3735970\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Computing Surveys","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3735970","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Lost in Translation? Found in Evaluation: A Comprehensive Survey on Sentence-Level Translation Evaluation
Machine Translation (MT) revolutionizes cross-lingual communication but is prone to errors, necessitating thorough evaluation for enhancement. Translation quality can be assessed by humans and by automatic evaluation metrics. Human evaluation, though valuable, is costly and subject to limitations in scalability and consistency. While automated metrics supplement manual evaluations, this field still has considerable potential for development. Although prior survey work on automatic evaluation metrics exists, most of it focuses on resource-rich languages, leaving a significant gap in the evaluation of MT outputs across other language families. To bridge this gap, we present an exhaustive survey encompassing discussions on MT meta-evaluation datasets, human assessments, and diverse metrics. We categorize both human and automatic evaluation approaches, and offer decision trees to aid in selecting the appropriate approach. Additionally, we evaluate sentences across languages, domains, and linguistic features, and further meta-evaluate the metrics by correlating them with human scores. We critically examine the limitations and challenges inherent in current datasets and evaluation approaches. We propose suggestions for future research aimed at enhancing MT evaluation, including the importance of diverse and well-distributed datasets, the refinement of human evaluation methodologies, and the development of robust metrics that closely align with human judgments.
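For readers unfamiliar with how metric meta-evaluation is typically carried out, the sketch below illustrates the general idea of correlating sentence-level automatic metric scores with human judgments. It is a minimal illustration, not the survey's own procedure: the scores are made-up placeholders, and the choice of a 0-100 human rating scale is only an assumption (echoing common direct-assessment practice).

```python
# Minimal sketch of sentence-level metric meta-evaluation: correlate a
# metric's segment scores with human judgments of the same segments.
# All scores below are hypothetical placeholders, not data from the survey.
from scipy.stats import pearsonr, spearmanr, kendalltau

# One entry per translated sentence (segment).
metric_scores = [0.71, 0.55, 0.82, 0.40, 0.93]   # hypothetical automatic metric scores
human_scores  = [78.0, 60.0, 85.0, 35.0, 90.0]   # hypothetical human ratings (assumed 0-100 scale)

# Pearson captures linear agreement; Spearman and Kendall capture rank
# agreement. Meta-evaluation studies commonly report more than one of these.
pearson_r, _ = pearsonr(metric_scores, human_scores)
spearman_rho, _ = spearmanr(metric_scores, human_scores)
kendall_tau, _ = kendalltau(metric_scores, human_scores)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
print(f"Kendall tau:  {kendall_tau:.3f}")
```

For system-level meta-evaluation, the same correlations would instead be computed over per-system averages rather than individual segments.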
Journal Introduction:
ACM Computing Surveys (CSUR) is an academic journal that publishes surveys and tutorials across areas of computing research and practice. The journal aims to provide comprehensive and easily understandable articles that guide readers through the literature and help them understand topics outside their specialties. In terms of impact, CSUR has a high reputation, with a 2022 Impact Factor of 16.6, and is ranked 3rd out of 111 journals in the field of Computer Science Theory & Methods.
ACM Computing Surveys is indexed and abstracted in a range of services, including AI2 Semantic Scholar, Baidu, Clarivate/ISI: JCR, CNKI, DeepDyve, DTU, EBSCO: EDS/HOST, and IET Inspec.