{"title":"迷失在翻译中?在评价中发现:句子级翻译评价综述","authors":"Ananya Mukherjee, Manish Shrivastava","doi":"10.1145/3735970","DOIUrl":null,"url":null,"abstract":"Machine Translation (MT) revolutionizes cross-lingual communication but is prone to errors, necessitating thorough evaluation for enhancement. Translation quality can be assessed by humans and automatic evaluation metrics. Human evaluation, though valuable, is costly and subject to limitations in scalability and consistency. While automated metrics supplement manual evaluations, this field still has considerable potential for development. However, there exists prior survey work on automatic evaluation metrics, it is worth noting that most of these are focused on resource-rich languages, leaving a significant gap in evaluating MT outputs across other language families. To bridge this gap, we present an exhaustive survey, encompassing discussions on MT meta-evaluation datasets, human assessments, and diverse metrics. We categorize both human and automatic evaluation approaches, and offer decision trees to aid in selecting the appropriate approach. Additionally, we evaluate sentences across languages, domains and linguistic features, and further meta-evaluate the metrics by correlating them with human scores. We critically examine the limitations and challenges inherent in current datasets and evaluation approaches. We propose suggestions for future research aimed at enhancing MT evaluation, including the importance of diverse and well-distributed datasets, the refinement of human evaluation methodologies, and the development of robust metrics that closely align with human judgments.","PeriodicalId":50926,"journal":{"name":"ACM Computing Surveys","volume":"18 1","pages":""},"PeriodicalIF":23.8000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Lost in Translation? Found in Evaluation: A Comprehensive Survey on Sentence-Level Translation Evaluation\",\"authors\":\"Ananya Mukherjee, Manish Shrivastava\",\"doi\":\"10.1145/3735970\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Machine Translation (MT) revolutionizes cross-lingual communication but is prone to errors, necessitating thorough evaluation for enhancement. Translation quality can be assessed by humans and automatic evaluation metrics. Human evaluation, though valuable, is costly and subject to limitations in scalability and consistency. While automated metrics supplement manual evaluations, this field still has considerable potential for development. However, there exists prior survey work on automatic evaluation metrics, it is worth noting that most of these are focused on resource-rich languages, leaving a significant gap in evaluating MT outputs across other language families. To bridge this gap, we present an exhaustive survey, encompassing discussions on MT meta-evaluation datasets, human assessments, and diverse metrics. We categorize both human and automatic evaluation approaches, and offer decision trees to aid in selecting the appropriate approach. Additionally, we evaluate sentences across languages, domains and linguistic features, and further meta-evaluate the metrics by correlating them with human scores. We critically examine the limitations and challenges inherent in current datasets and evaluation approaches. 
We propose suggestions for future research aimed at enhancing MT evaluation, including the importance of diverse and well-distributed datasets, the refinement of human evaluation methodologies, and the development of robust metrics that closely align with human judgments.\",\"PeriodicalId\":50926,\"journal\":{\"name\":\"ACM Computing Surveys\",\"volume\":\"18 1\",\"pages\":\"\"},\"PeriodicalIF\":23.8000,\"publicationDate\":\"2025-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Computing Surveys\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3735970\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Computing Surveys","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3735970","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Lost in Translation? Found in Evaluation: A Comprehensive Survey on Sentence-Level Translation Evaluation
Machine Translation (MT) revolutionizes cross-lingual communication but is prone to errors, necessitating thorough evaluation for enhancement. Translation quality can be assessed by humans and by automatic evaluation metrics. Human evaluation, though valuable, is costly and subject to limitations in scalability and consistency. While automated metrics supplement manual evaluations, this field still has considerable potential for development. Although prior survey work on automatic evaluation metrics exists, most of it focuses on resource-rich languages, leaving a significant gap in the evaluation of MT outputs across other language families. To bridge this gap, we present an exhaustive survey encompassing discussions on MT meta-evaluation datasets, human assessments, and diverse metrics. We categorize both human and automatic evaluation approaches, and offer decision trees to aid in selecting the appropriate approach. Additionally, we evaluate sentences across languages, domains, and linguistic features, and further meta-evaluate the metrics by correlating them with human scores. We critically examine the limitations and challenges inherent in current datasets and evaluation approaches. We propose suggestions for future research aimed at enhancing MT evaluation, including the importance of diverse and well-distributed datasets, the refinement of human evaluation methodologies, and the development of robust metrics that closely align with human judgments.
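For readers unfamiliar with how metric meta-evaluation is typically carried out, the sketch below illustrates the general idea of correlating sentence-level automatic metric scores with human judgments. It is a minimal illustration, not the survey's own procedure: the scores are made-up placeholders, and the choice of a 0-100 human rating scale is only an assumption (echoing common direct-assessment practice).

```python
# Minimal sketch of sentence-level metric meta-evaluation: correlate a
# metric's segment scores with human judgments of the same segments.
# All scores below are hypothetical placeholders, not data from the survey.
from scipy.stats import pearsonr, spearmanr, kendalltau

# One entry per translated sentence (segment).
metric_scores = [0.71, 0.55, 0.82, 0.40, 0.93]   # hypothetical automatic metric scores
human_scores  = [78.0, 60.0, 85.0, 35.0, 90.0]   # hypothetical human ratings (assumed 0-100 scale)

# Pearson captures linear agreement; Spearman and Kendall capture rank
# agreement. Meta-evaluation studies commonly report more than one of these.
pearson_r, _ = pearsonr(metric_scores, human_scores)
spearman_rho, _ = spearmanr(metric_scores, human_scores)
kendall_tau, _ = kendalltau(metric_scores, human_scores)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
print(f"Kendall tau:  {kendall_tau:.3f}")
```

For system-level meta-evaluation, the same correlations would instead be computed over per-system averages rather than individual segments.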
Journal Introduction:
ACM Computing Surveys (CSUR) is an academic journal that publishes surveys and tutorials across areas of computing research and practice. The journal aims to provide comprehensive and easily understandable articles that guide readers through the literature and help them understand topics outside their specialties. In terms of impact, CSUR has a high reputation, with a 2022 Impact Factor of 16.6, and is ranked 3rd out of 111 journals in the field of Computer Science Theory & Methods.
ACM Computing Surveys is indexed and abstracted in a range of services, including AI2 Semantic Scholar, Baidu, Clarivate/ISI: JCR, CNKI, DeepDyve, DTU, EBSCO: EDS/HOST, and IET Inspec.