qaAskeR\(^+\): a novel testing method for question answering software via asking recursive questions

Impact Factor: 2.0 · CAS Tier 2 (Computer Science) · JCR Q3, Computer Science, Software Engineering
Xiaoyuan Xie, Shuo Jin, Songqiang Chen
DOI: 10.1007/s10515-023-00380-2
Journal: Automated Software Engineering, vol. 30, no. 1
Published: 2023-03-28 (Journal Article)
Full text: https://link.springer.com/article/10.1007/s10515-023-00380-2
PDF: https://link.springer.com/content/pdf/10.1007/s10515-023-00380-2.pdf
Citations: 2

Abstract

Question Answering (QA) is an attractive and challenging area in the NLP community. With the development of QA techniques, plenty of QA software has been deployed in daily life to provide convenient access to information. To investigate the performance of QA software, many benchmark datasets have been constructed to provide various test cases. However, current QA software is mainly tested in a reference-based paradigm, in which the expected outputs (labels) of test cases must be annotated with considerable human effort before testing. As a result, neither just-in-time testing during usage nor extensible testing on massive unlabeled real-life data is feasible, which keeps the current testing of QA software from being flexible and sufficient. In this work, we propose a novel testing method, qaAskeR \(^+\), with five new Metamorphic Relations for QA software. qaAskeR \(^+\) does not refer to the annotated labels of test cases. Instead, based on the idea that a correct answer should imply a piece of reliable knowledge that always conforms with any other correct answer, qaAskeR \(^+\) tests QA software by inspecting its behavior on multiple recursively asked questions that are relevant to the same or further enriched knowledge. Experimental results show that qaAskeR \(^+\) can reveal quite a few violations that indicate actual answering issues in various mainstream QA software, without using any pre-annotated labels.
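The core idea above, that two answers about the same piece of knowledge must agree even when no reference label exists, can be sketched as a simple metamorphic check. The sketch below is illustrative only: `qa_model` is a hypothetical stand-in for any extractive QA system, and the single relation shown is a simplification of the paper's five Metamorphic Relations, not their actual implementation.

```python
def qa_model(question: str, context: str) -> str:
    """Toy stand-in QA system; a real system would be a trained model.

    It only 'knows' one fact, enough to demonstrate the check below.
    """
    if "Paris" in context and ("capital" in question.lower()
                               or "which city" in question.lower()):
        return "Paris"
    return "unknown"


def recursive_consistency_check(question: str, context: str) -> bool:
    """Return True if the model answers consistently; False flags a
    metamorphic-relation violation. No reference label is needed."""
    answer1 = qa_model(question, context)

    # Fold the knowledge implied by the first answer back into the
    # context, then ask a recursive follow-up question about it.
    enriched_context = context + f" The answer to '{question}' is {answer1}."
    follow_up = f"Which city is the answer to '{question}'?"
    answer2 = qa_model(follow_up, enriched_context)

    # The two answers only have to agree with each other.
    return answer1 == answer2


context = "Paris is the capital of France."
print(recursive_consistency_check("What is the capital of France?", context))
# → True (the toy model is self-consistent on this fact)
```

A disagreement between the two answers reveals a defect without any human-annotated expected output, which is what makes this style of testing applicable to massive unlabeled real-life data.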


Source journal: Automated Software Engineering (Engineering & Technology – Computer Science, Software Engineering)
CiteScore: 4.80
Self-citation rate: 11.80%
Articles per year: 51
Review time: >12 weeks
Journal description: This journal publishes research papers, tutorials, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems, as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.