Item Response Theory for Efficient Human Evaluation of Chatbots

Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
Pub Date: 2020-11-01
DOI: 10.18653/v1/2020.eval4nlp-1.3

João Sedoc, L. Ungar

Citations: 25

Abstract

Conversational agent quality is currently assessed using human evaluation, which often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly well suited for chatbot evaluation, since it allows the assessment of both the models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality systems (those nearer to human performance) than for comparing low-quality systems. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.
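The idea that different prompts suit systems of different quality can be illustrated with a minimal sketch. The abstract does not say which IRT variant the authors use, so this assumes the standard two-parameter logistic (2PL) model, in which the probability that a system of ability `theta` is preferred on a prompt depends on the prompt's discrimination `a` and difficulty `b`. The prompt names and parameter values below are hypothetical and serve only to show the mechanics.

```python
import math


def p_win(theta, a, b):
    """2PL IRT: probability that a system with ability `theta` is preferred
    on a prompt with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability `theta`: a^2 * p * (1 - p)."""
    p = p_win(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical prompts, given as (discrimination, difficulty). Assumed values.
prompts = {
    "easy_prompt": (1.5, -1.0),
    "hard_prompt": (1.5, 2.0),
    "noisy_prompt": (0.2, 0.0),
}


def most_informative(theta, prompts, k=1):
    """Return the names of the k prompts that carry the most information
    about a system whose ability is near `theta`."""
    ranked = sorted(
        prompts,
        key=lambda name: item_information(theta, *prompts[name]),
        reverse=True,
    )
    return ranked[:k]


low_quality_pick = most_informative(-1.0, prompts)
high_quality_pick = most_informative(2.0, prompts)
```

Under these assumed parameters, the easy prompt is the most informative for a weak system and the hard prompt for a strong one, while the low-discrimination prompt contributes little in either case. Dropping such prompts is one way an evaluation set could shrink without losing much discriminative power.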



