Evaluation of Question Answering Systems: Complexity of Judging a Natural Language

Impact Factor: 23.8 · Tier 1 (Computer Science) · JCR Q1: Computer Science, Theory & Methods
Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, Frank Emmert-Streib
{"title":"Evaluation of Question Answering Systems: Complexity of Judging a Natural Language","authors":"Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, Frank Emmert-Streib","doi":"10.1145/3744663","DOIUrl":null,"url":null,"abstract":"Question answering (QA) systems are a leading and rapidly advancing field of natural language processing (NLP) research. One of their key advantages is that they enable more natural interactions between humans and machines, such as in virtual assistants or search engines. Over the past few decades, many QA systems have been developed to handle diverse QA tasks. However, the evaluation of these systems is intricate, as many of the available evaluation scores are not task-agnostic. Furthermore, translating human judgment into measurable metrics continues to be an open issue. These complexities add challenges to their assessment. This survey provides a systematic overview of evaluation scores and introduces a taxonomy with two main branches: Human-Centric Evaluation Scores (HCES) and Automatic Evaluation Scores (AES). Since many of these scores were originally designed for specific tasks but have been applied more generally, we also cover the basics of QA frameworks and core paradigms to provide a deeper understanding of their capabilities and limitations. Lastly, we discuss benchmark datasets that are critical for conducting systematic evaluations across various QA tasks.","PeriodicalId":50926,"journal":{"name":"ACM Computing Surveys","volume":"10 1","pages":""},"PeriodicalIF":23.8000,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Computing Surveys","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3744663","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Question answering (QA) systems are a leading and rapidly advancing field of natural language processing (NLP) research. One of their key advantages is that they enable more natural interactions between humans and machines, such as in virtual assistants or search engines. Over the past few decades, many QA systems have been developed to handle diverse QA tasks. However, the evaluation of these systems is intricate, as many of the available evaluation scores are not task-agnostic. Furthermore, translating human judgment into measurable metrics continues to be an open issue. These complexities add challenges to their assessment. This survey provides a systematic overview of evaluation scores and introduces a taxonomy with two main branches: Human-Centric Evaluation Scores (HCES) and Automatic Evaluation Scores (AES). Since many of these scores were originally designed for specific tasks but have been applied more generally, we also cover the basics of QA frameworks and core paradigms to provide a deeper understanding of their capabilities and limitations. Lastly, we discuss benchmark datasets that are critical for conducting systematic evaluations across various QA tasks.
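To make the notion of an Automatic Evaluation Score (AES) concrete, the sketch below implements two scores widely used for extractive QA, exact match (EM) and token-level F1, in the style of the SQuAD evaluation script. The abstract does not name these particular metrics, so the normalization rules and the example strings here are illustrative assumptions rather than definitions taken from the survey itself.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical strings, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer span."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # If either side normalizes to nothing (e.g., the answer was only an article),
    # score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = "the Eiffel Tower in Paris"   # hypothetical system answer
    gold = "Eiffel Tower"                # hypothetical reference answer
    print(f"EM = {exact_match(pred, gold):.2f}")  # 0.00: strings differ after normalization
    print(f"F1 = {token_f1(pred, gold):.2f}")     # 0.67: partial token overlap
```

Human-Centric Evaluation Scores (HCES), by contrast, rely on annotators judging answers directly, which is exactly the kind of human judgment the abstract notes is difficult to translate into measurable metrics.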
Source Journal

ACM Computing Surveys (Engineering & Technology – Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 0.60%
Articles published: 372
Review time: 12 months
Journal description: ACM Computing Surveys is an academic journal that publishes surveys and tutorials across areas of computing research and practice. The journal aims to provide comprehensive, accessible articles that guide readers through the literature and help them understand topics outside their specialties. In terms of impact, CSUR has a strong reputation, with a 2022 Impact Factor of 16.6 and a ranking of 3rd out of 111 journals in Computer Science, Theory & Methods. ACM Computing Surveys is indexed and abstracted in services including AI2 Semantic Scholar, Baidu, Clarivate/ISI: JCR, CNKI, DeepDyve, DTU, EBSCO: EDS/HOST, and IET Inspec, among others.