On the Evaluation of Machine Translation n-best Lists

Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. Pub Date: 2020-11-01. DOI: 10.18653/v1/2020.eval4nlp-1.7

Jacob Bremerman, Huda Khayrallah, Douglas W. Oard, Matt Post


Abstract

The standard machine translation evaluation framework measures the single-best output of machine translation systems. There are, however, many situations where n-best lists are needed, yet there is no established way of evaluating them. This paper establishes a framework for addressing n-best evaluation by outlining three different questions one could consider when determining how one would define a ‘good’ n-best list and proposing evaluation measures for each question. The first and principal contribution is an evaluation measure that characterizes the translation quality of an entire n-best list by asking whether many of the valid translations are placed near the top of the list. The second is a measure that uses gold translations with preference annotations to ask to what degree systems can produce ranked lists in preference order. The third is a measure that rewards partial matches, evaluating the closeness of the many items in an n-best list to a set of many valid references. These three perspectives make clear that having access to many references can be useful when n-best evaluation is the goal.
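To make the first question concrete, the following is a minimal sketch of one way to reward n-best lists that place valid translations near the top. It is not the measure proposed in the paper, whose definition the abstract does not give. It assumes, purely for illustration, that a hypothesis is valid only if it exactly matches one of the references after lowercasing and whitespace normalization, and it scores the list with an average-precision-style sum. The function name `average_precision_at_n` and its normalization rule are hypothetical.

```python
from typing import Iterable, Sequence


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an illustrative matching rule)."""
    return " ".join(text.lower().split())


def average_precision_at_n(nbest: Sequence[str], valid: Iterable[str]) -> float:
    """Average-precision-style score of a ranked n-best list.

    ASSUMPTION: a hypothesis is 'valid' only on an exact normalized match
    with a reference. This is an illustration, not the paper's measure.
    """
    valid_set = {_normalize(v) for v in valid}
    if not valid_set or not nbest:
        return 0.0

    seen = set()          # valid translations already credited
    hits = 0              # number of distinct valid translations found so far
    precision_sum = 0.0   # running sum of precision at each hit

    for rank, hypothesis in enumerate(nbest, start=1):
        key = _normalize(hypothesis)
        if key in valid_set and key not in seen:
            seen.add(key)
            hits += 1
            precision_sum += hits / rank

    # Normalize by the best achievable number of hits for this list length.
    return precision_sum / min(len(valid_set), len(nbest))


if __name__ == "__main__":
    references = {"the cat sat", "a cat was sitting"}
    valid_first = ["The cat sat", "a cat was sitting", "cat the sat"]
    valid_last = ["cat the sat", "The cat sat", "a cat was sitting"]
    print(average_precision_at_n(valid_first, references))
    print(average_precision_at_n(valid_last, references))
```

Under these assumptions, a list with the same hypotheses scores higher when the valid ones appear earlier. Exact matching gives no credit for near misses, which is the gap the third, partial-match question in the abstract is concerned with.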
