Predicting Patch Correctness Based on the Similarity of Failing Test Cases
Haoye Tian, Yinghua Li, Weiguo Pian, Abdoul Kader Kaboré, Kui Liu, Andrew Habib, Jacques Klein, Tegawendé F. Bissyandé
ACM Transactions on Software Engineering and Methodology (TOSEM), pages 1-30
DOI: 10.1145/3511096
Published: 2021-07-28
Citations: 17
Abstract
How do we know a generated patch is correct? This is a key and challenging question that automated program repair (APR) systems struggle to address, given the incompleteness of available test suites. Our intuition is that we can triage correct patches by checking whether each generated patch implements code changes (i.e., behavior) that are relevant to the bug it addresses. Such a bug is commonly specified by a failing test case. Towards predicting patch correctness in APR, we propose a novel yet simple hypothesis on how the link between patch behavior and failing test specifications can be drawn: similar failing test cases should require similar patches. We then propose BATS, an unsupervised learning-based approach that predicts patch correctness by checking patch Behavior Against failing Test Specification. BATS exploits deep representation learning models for code and patches: for a given failing test case, the resulting embedding is used to compute similarity metrics in a search for similar historical test cases and to identify their associated developer patches, which then serve as a proxy for assessing the correctness of APR-generated patches. Experimentally, we first validate our hypothesis by assessing whether ground-truth developer patches cluster together in the same way that their associated failing test cases are clustered. Then, after collecting a large dataset of 1,278 plausible patches (written by developers or generated by 32 APR tools), we use BATS to predict correct patches: BATS achieves an AUC between 0.557 and 0.718 and a recall between 0.562 and 0.854 in identifying correct patches. Our approach outperforms state-of-the-art techniques for identifying correct patches without requiring the large labeled patch datasets that machine learning-based approaches need. While BATS is constrained by the availability of similar test cases, we show that it can still complement existing approaches: when combined with a recent approach that relies on supervised learning, BATS improves the overall recall in detecting correct patches. We finally show that BATS is complementary to the state-of-the-art PATCH-SIM dynamic approach for identifying correct patches generated by APR tools.
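To make the retrieve-and-compare idea in the abstract concrete, the sketch below illustrates the core hypothesis: embed the new failing test, find similar historical failing tests, and judge the APR-generated patch by how much it resembles the developer patches that fixed those similar bugs. This is a minimal illustrative sketch, not the paper's implementation: the function name `predict_patch_correctness`, the cosine-similarity metric, the `k`, `test_sim_cutoff`, and `patch_sim_threshold` parameters, and the random placeholder embeddings are all assumptions for illustration; BATS's actual encoders, similarity measures, and decision rule are described in the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_patch_correctness(failing_test_emb, candidate_patch_emb,
                              history, k=5, test_sim_cutoff=0.8,
                              patch_sim_threshold=0.5):
    """Retrieve-and-compare sketch of the BATS idea (illustrative only).

    history: list of (test_embedding, developer_patch_embedding) pairs
             collected from previously fixed bugs.
    Returns True/False for the correctness prediction, or None when no
    sufficiently similar historical test exists (BATS is constrained by
    the availability of similar test cases).
    """
    # 1. Rank historical failing tests by similarity to the new failing test.
    ranked = sorted(history,
                    key=lambda pair: cosine(failing_test_emb, pair[0]),
                    reverse=True)
    neighbours = [(t, p) for t, p in ranked[:k]
                  if cosine(failing_test_emb, t) >= test_sim_cutoff]
    if not neighbours:
        return None  # abstain rather than guess

    # 2. Similar failing tests should require similar patches: compare the
    #    APR-generated patch with the patches that fixed the similar bugs.
    patch_sims = [cosine(candidate_patch_emb, p) for _, p in neighbours]

    # 3. Predict "correct" when the candidate patch resembles those patches.
    return float(np.mean(patch_sims)) >= patch_sim_threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 64
    # Toy history of (test, patch) embedding pairs standing in for real
    # representations produced by a learned code/patch encoder.
    history = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(50)]
    new_test = history[0][0] + 0.05 * rng.normal(size=dim)   # near a known test
    candidate = history[0][1] + 0.05 * rng.normal(size=dim)  # near its patch
    print(predict_patch_correctness(new_test, candidate, history))
```

In practice the embeddings would come from deep representation learning models for test code and patches, as the abstract states, and the thresholds would be tuned on historical data; the abstention branch reflects the paper's observation that BATS depends on the availability of similar test cases.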