Pre-trained language models evaluating themselves - A comparative study

First Workshop on Insights from Negative Results in NLP Pub Date : 1900-01-01 DOI:10.18653/v1/2022.insights-1.25

Philipp Koch, M. Aßenmacher, C. Heumann

Citations: 1

Abstract

The evaluation of generated text has received renewed attention in recent years with the introduction of model-based metrics. These metrics correlate more strongly with human judgments and seemingly overcome many issues of earlier n-gram-based metrics from the symbolic age. In this work, we examine the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate (Petersen). We investigate their sensitivity to different types of semantic deterioration (part-of-speech drop and negation), word order perturbations, word drop, and the common problem of repetition. No metric showed appropriate behaviour for negation, and none of them was sensitive overall to the other issues mentioned above.
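To make the perturbation types named in the abstract concrete, the following is a minimal sketch of how word drop, word order perturbation, and repetition could be applied to a tokenized sentence. All function names and parameters here are hypothetical and are not taken from the paper; `unigram_overlap` is a toy stand-in score used only to show what insensitivity to a perturbation looks like, not one of the metrics studied.

```python
import random


def drop_words(tokens, rate, rng):
    """Drop each token independently with probability `rate` (hypothetical helper).

    Falls back to the first token so the result is never empty.
    """
    kept = [t for t in tokens if rng.random() >= rate]
    return kept if kept else tokens[:1]


def shuffle_words(tokens, rng):
    """Return a copy of `tokens` in random order (word order perturbation)."""
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    return shuffled


def repeat_tail(tokens, n_tokens, times):
    """Append the last `n_tokens` tokens `times` more times (assumes n_tokens >= 1)."""
    tail = tokens[-n_tokens:]
    return tokens + tail * times


def unigram_overlap(reference, candidate):
    """Toy score: fraction of distinct reference tokens that occur in the candidate."""
    ref = set(reference)
    return len(ref & set(candidate)) / len(ref)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the perturbations are reproducible
    reference = "the cat sat on the mat".split()
    candidates = {
        "word drop": drop_words(reference, 0.3, rng),
        "word order": shuffle_words(reference, rng),
        "repetition": repeat_tail(reference, 2, 2),
    }
    for name, cand in candidates.items():
        print(f"{name:>10}: {' '.join(cand)}  score={unigram_overlap(reference, cand):.2f}")
```

Because the toy score ignores order and token counts, it assigns the shuffled and repeated sentences the same score as the unperturbed reference. A sensitivity study of the kind described above would swap a real metric in for `unigram_overlap` and check whether its score drops as the perturbation becomes stronger.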
