Hybrid mutation driven testing for natural language inference

IF 1.7 · Zone 4, Computer Science · JCR Q3, Computer Science, Software Engineering

Journal of Software: Evolution and Process, vol. 36, no. 10 · Pub Date: 2024-06-17 · DOI: 10.1002/smr.2694

Linghan Meng, Yanhui Li, Lin Chen, Mingliang Ma, Yuming Zhou, Baowen Xu



Abstract

Natural language inference (NLI) is the task of inferring the relationship between a premise sentence and a hypothesis sentence. NLI models have essential applications in many natural language processing (NLP) fields, for example, machine reading comprehension and recognizing textual entailment. Because these models follow a data-driven programming paradigm, bugs inevitably occur when they are applied, which calls for novel automatic testing techniques tailored to NLI. The main difficulty in testing NLI models automatically is the oracle problem: labeling NLI model inputs manually may be too expensive, which makes it hard to verify the correctness of model outputs. To tackle the oracle problem, this study proposes a novel automatic testing method, hybrid mutation driven testing (HMT), which extends the mutation idea that has been applied successfully in other NLP domains. Specifically, because there are two sets of sentences to mutate, premises and hypotheses, we propose four mutation operators that realize a hybrid mutation strategy by mutating the premise and hypothesis sentences jointly or individually. We assume that a mutation does not affect the output; hence, if the original and mutated outputs are inconsistent, an inconsistency bug is detected without knowing the true labels. To evaluate HMT, we conduct experiments on two widely used datasets with two advanced models and generate more than 520,000 mutations by applying our mutation operators. Our experimental results show that (a) HMT can effectively generate mutated testing samples, (b) HMT can effectively trigger inconsistency bugs in the NLI models, and (c) all four mutation operators can independently trigger inconsistency bugs.
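The consistency check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the four mutation operators, so the synonym table, the `substitute` mutation, and the deliberately brittle `toy_model` below are all hypothetical stand-ins. The sketch only shows the general idea of mutating the premise, the hypothesis, or both, and flagging a mutation whenever the predicted label changes, with no ground-truth label involved.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]             # (premise, hypothesis)
Model = Callable[[str, str], str]  # returns a predicted label

# Hypothetical synonym table, assumed to be meaning-preserving.
SYNONYMS: Dict[str, str] = {"big": "large", "quick": "fast"}


def substitute(sentence: str) -> str:
    """Illustrative mutation: replace each known word with its synonym."""
    return " ".join(SYNONYMS.get(word, word) for word in sentence.split())


def mutate_pair(premise: str, hypothesis: str) -> Dict[str, Pair]:
    """Mutate the premise only, the hypothesis only, or both jointly."""
    p2, h2 = substitute(premise), substitute(hypothesis)
    candidates = {
        "premise_only": (p2, hypothesis),
        "hypothesis_only": (premise, h2),
        "joint": (p2, h2),
    }
    # Drop candidates that are identical to the original pair.
    return {k: v for k, v in candidates.items() if v != (premise, hypothesis)}


def find_inconsistencies(model: Model, premise: str, hypothesis: str) -> List[str]:
    """Return the mutation kinds whose prediction differs from the original."""
    original = model(premise, hypothesis)
    return [
        kind
        for kind, (p, h) in mutate_pair(premise, hypothesis).items()
        if model(p, h) != original
    ]


def toy_model(premise: str, hypothesis: str) -> str:
    """Hypothetical brittle model: pure word overlap, no notion of synonyms."""
    if set(hypothesis.split()) <= set(premise.split()):
        return "entailment"
    return "neutral"


if __name__ == "__main__":
    print(find_inconsistencies(toy_model, "a big dog runs", "a big dog"))
```

With this toy model, mutating only one side breaks the word overlap and flips the label, while the joint mutation keeps both sides aligned and stays consistent. That is one plausible reason for mutating the two sentences both individually and jointly: each strategy can expose behavior the other misses.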

