Evaluation of Question Answering Systems: Complexity of Judging a Natural Language

Impact Factor: 23.8 · Tier 1 (Computer Science) · JCR Q1: Computer Science, Theory & Methods
Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, Frank Emmert-Streib
{"title":"Evaluation of Question Answering Systems: Complexity of Judging a Natural Language","authors":"Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, Frank Emmert-Streib","doi":"10.1145/3744663","DOIUrl":null,"url":null,"abstract":"Question answering (QA) systems are a leading and rapidly advancing field of natural language processing (NLP) research. One of their key advantages is that they enable more natural interactions between humans and machines, such as in virtual assistants or search engines. Over the past few decades, many QA systems have been developed to handle diverse QA tasks. However, the evaluation of these systems is intricate, as many of the available evaluation scores are not task-agnostic. Furthermore, translating human judgment into measurable metrics continues to be an open issue. These complexities add challenges to their assessment. This survey provides a systematic overview of evaluation scores and introduces a taxonomy with two main branches: Human-Centric Evaluation Scores (HCES) and Automatic Evaluation Scores (AES). Since many of these scores were originally designed for specific tasks but have been applied more generally, we also cover the basics of QA frameworks and core paradigms to provide a deeper understanding of their capabilities and limitations. Lastly, we discuss benchmark datasets that are critical for conducting systematic evaluations across various QA tasks.","PeriodicalId":50926,"journal":{"name":"ACM Computing Surveys","volume":"10 1","pages":""},"PeriodicalIF":23.8000,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Computing Surveys","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3744663","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Question answering (QA) systems are a leading and rapidly advancing field of natural language processing (NLP) research. One of their key advantages is that they enable more natural interactions between humans and machines, such as in virtual assistants or search engines. Over the past few decades, many QA systems have been developed to handle diverse QA tasks. However, the evaluation of these systems is intricate, as many of the available evaluation scores are not task-agnostic. Furthermore, translating human judgment into measurable metrics continues to be an open issue. These complexities add challenges to their assessment. This survey provides a systematic overview of evaluation scores and introduces a taxonomy with two main branches: Human-Centric Evaluation Scores (HCES) and Automatic Evaluation Scores (AES). Since many of these scores were originally designed for specific tasks but have been applied more generally, we also cover the basics of QA frameworks and core paradigms to provide a deeper understanding of their capabilities and limitations. Lastly, we discuss benchmark datasets that are critical for conducting systematic evaluations across various QA tasks.
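To make the notion of an Automatic Evaluation Score (AES) concrete, the sketch below implements two scores widely used for extractive QA, exact match (EM) and token-level F1, in the style of the SQuAD evaluation script. The abstract does not name these particular metrics, so the normalization rules and the example strings here are illustrative assumptions rather than definitions taken from the survey itself.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical strings, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer span."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # If either side normalizes to nothing (e.g., the answer was only an article),
    # score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = "the Eiffel Tower in Paris"   # hypothetical system answer
    gold = "Eiffel Tower"                # hypothetical reference answer
    print(f"EM = {exact_match(pred, gold):.2f}")  # 0.00: strings differ after normalization
    print(f"F1 = {token_f1(pred, gold):.2f}")     # 0.67: partial token overlap
```

Human-Centric Evaluation Scores (HCES), by contrast, rely on annotators judging answers directly, which is exactly the kind of human judgment the abstract notes is difficult to translate into measurable metrics.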
Source Journal

ACM Computing Surveys (Engineering & Technology – Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 0.60%
Articles published: 372
Review time: 12 months
Journal description: ACM Computing Surveys is an academic journal that publishes surveys and tutorials across areas of computing research and practice. The journal aims to provide comprehensive, accessible articles that guide readers through the literature and help them understand topics outside their specialties. In terms of impact, CSUR has a strong reputation, with a 2022 Impact Factor of 16.6 and a ranking of 3rd out of 111 journals in Computer Science, Theory & Methods. ACM Computing Surveys is indexed and abstracted in services including AI2 Semantic Scholar, Baidu, Clarivate/ISI: JCR, CNKI, DeepDyve, DTU, EBSCO: EDS/HOST, and IET Inspec, among others.