Hunting bugs with accelerated optimal graph vertex matching

Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis Pub Date : 2022-07-18 DOI:10.1145/3533767.3534393

Xiaohui Zhang, Yuanjun Gong, Bin Liang, Jianjun Huang, Wei You, Wenchang Shi, Jian Zhang

{"title":"Hunting bugs with accelerated optimal graph vertex matching","authors":"Xiaohui Zhang, Yuanjun Gong, Bin Liang, Jianjun Huang, Wei You, Wenchang Shi, Jian Zhang","doi":"10.1145/3533767.3534393","DOIUrl":null,"url":null,"abstract":"Various techniques based on code similarity measurement have been proposed to detect bugs. Essentially, the code fragment can be regarded as a kind of graph. Performing code graph similarity comparison to identify the potential bugs is a natural choice. However, the logic of a bug often involves only a few statements in the code fragment, while others are bug-irrelevant. They can be considered as a kind of noise, and can heavily interfere with the code similarity measurement. In theory, performing optimal vertex matching can address the problem well, but the task is NP-complete and cannot be applied to a large-scale code base. In this paper, we propose a two-phase strategy to accelerate code graph vertex matching for detecting bugs. In the first phase, a vertex matching embedding model is trained and used to rapidly filter a limited number of candidate code graphs from the target code base, which are likely to have a high vertex matching degree with the seed, i.e., the known buggy code. As a result, the number of code graphs needed to be further analyzed is dramatically reduced. In the second phase, a high-order similarity embedding model based on graph convolutional neural network is built to efficiently get the approximately optimal vertex matching between the seed and candidates. On this basis, the code graph similarity is calculated to identify the potential buggy code. The proposed method is applied to five open source projects. In total, 31 unknown bugs were successfully detected and confirmed by developers. Comparative experiments demonstrate that our method can effectively mitigate the noise problem, and the detection efficiency can be improved dozens of times with the two-phase strategy.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"199 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3533767.3534393","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

Abstract

Various techniques based on code similarity measurement have been proposed to detect bugs. Essentially, the code fragment can be regarded as a kind of graph. Performing code graph similarity comparison to identify the potential bugs is a natural choice. However, the logic of a bug often involves only a few statements in the code fragment, while others are bug-irrelevant. They can be considered as a kind of noise, and can heavily interfere with the code similarity measurement. In theory, performing optimal vertex matching can address the problem well, but the task is NP-complete and cannot be applied to a large-scale code base. In this paper, we propose a two-phase strategy to accelerate code graph vertex matching for detecting bugs. In the first phase, a vertex matching embedding model is trained and used to rapidly filter a limited number of candidate code graphs from the target code base, which are likely to have a high vertex matching degree with the seed, i.e., the known buggy code. As a result, the number of code graphs needed to be further analyzed is dramatically reduced. In the second phase, a high-order similarity embedding model based on graph convolutional neural network is built to efficiently get the approximately optimal vertex matching between the seed and candidates. On this basis, the code graph similarity is calculated to identify the potential buggy code. The proposed method is applied to five open source projects. In total, 31 unknown bugs were successfully detected and confirmed by developers. Comparative experiments demonstrate that our method can effectively mitigate the noise problem, and the detection efficiency can be improved dozens of times with the two-phase strategy.

查看原文本刊更多论文

用加速最优图顶点匹配寻找bug

人们提出了各种基于代码相似度度量的技术来检测bug。从本质上讲，代码片段可以看作是一种图。执行代码图相似性比较来识别潜在的bug是一种自然的选择。然而，错误的逻辑通常只涉及代码片段中的几个语句，而其他语句与错误无关。它们可以看作是一种噪声，严重干扰代码相似度的测量。理论上，执行最优顶点匹配可以很好地解决这个问题，但这个任务是np完备的，不能应用于大规模的代码库。在本文中，我们提出了一种两阶段策略来加速代码图顶点匹配以检测bug。在第一阶段，训练一个顶点匹配嵌入模型，从目标代码库中快速筛选有限数量的候选代码图，这些候选代码图可能与种子(即已知的bug代码)具有较高的顶点匹配度。因此，需要进一步分析的代码图的数量大大减少了。第二阶段，建立基于图卷积神经网络的高阶相似度嵌入模型，有效地获得种子与候选点之间的近似最优顶点匹配;在此基础上，计算代码图相似度来识别潜在的bug代码。将该方法应用于五个开源项目。总共有31个未知bug被开发人员成功检测并确认。对比实验表明，该方法可以有效地缓解噪声问题，并且两相策略的检测效率可以提高数十倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

自引率

0.00%

发文量