学习基于图的代码表示，用于源级功能相似性检测

2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) Pub Date : 2023-05-01 DOI:10.1109/ICSE48619.2023.00040

Jiahao Liu, Jun Zeng, Xiang Wang, Zhenkai Liang

{"title":"学习基于图的代码表示，用于源级功能相似性检测","authors":"Jiahao Liu, Jun Zeng, Xiang Wang, Zhenkai Liang","doi":"10.1109/ICSE48619.2023.00040","DOIUrl":null,"url":null,"abstract":"Detecting code functional similarity forms the basis of various software engineering tasks. However, the detection is challenging as functionally similar code fragments can be implemented differently, e.g., with irrelevant syntax. Recent studies incorporate program dependencies as semantics to identify syntactically different yet semantically similar programs, but they often focus only on local neighborhoods (e.g., one-hop dependencies), limiting the expressiveness of program semantics in modeling functionalities. In this paper, we present Tailor that explicitly exploits deep graph-structured code features for functional similarity detection. Given source-level programs, Tailor first represents them into code property graphs (CPGs) - which combine abstract syntax trees, control flow graphs, and data flow graphs - to collectively reason about program syntax and semantics. Then, Tailor learns representations of CPGs by applying a CPG-based neural network (CPGNN) to iteratively propagate information on them. It improves over prior work on code representation learning through a new graph neural network (GNN) tailored to CPG structures instead of the off-the-shelf GNNs used previously. We systematically evaluate Tailor on C and Java programs using two public benchmarks. Experimental results show that Tailor outperforms the state-of-the-art approaches, achieving 99.8% and 99.9% F-scores in code clone detection and 98.3% accuracy in source code classification.","PeriodicalId":376379,"journal":{"name":"2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Learning Graph-based Code Representations for Source-level Functional Similarity Detection\",\"authors\":\"Jiahao Liu, Jun Zeng, Xiang Wang, Zhenkai Liang\",\"doi\":\"10.1109/ICSE48619.2023.00040\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Detecting code functional similarity forms the basis of various software engineering tasks. However, the detection is challenging as functionally similar code fragments can be implemented differently, e.g., with irrelevant syntax. Recent studies incorporate program dependencies as semantics to identify syntactically different yet semantically similar programs, but they often focus only on local neighborhoods (e.g., one-hop dependencies), limiting the expressiveness of program semantics in modeling functionalities. In this paper, we present Tailor that explicitly exploits deep graph-structured code features for functional similarity detection. Given source-level programs, Tailor first represents them into code property graphs (CPGs) - which combine abstract syntax trees, control flow graphs, and data flow graphs - to collectively reason about program syntax and semantics. Then, Tailor learns representations of CPGs by applying a CPG-based neural network (CPGNN) to iteratively propagate information on them. It improves over prior work on code representation learning through a new graph neural network (GNN) tailored to CPG structures instead of the off-the-shelf GNNs used previously. We systematically evaluate Tailor on C and Java programs using two public benchmarks. Experimental results show that Tailor outperforms the state-of-the-art approaches, achieving 99.8% and 99.9% F-scores in code clone detection and 98.3% accuracy in source code classification.\",\"PeriodicalId\":376379,\"journal\":{\"name\":\"2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)\",\"volume\":\"116 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSE48619.2023.00040\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSE48619.2023.00040","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

检测代码功能相似性是各种软件工程任务的基础。然而，检测是具有挑战性的，因为功能相似的代码片段可以以不同的方式实现，例如，使用不相关的语法。最近的研究将程序依赖关系作为语义来识别语法上不同但语义上相似的程序，但它们通常只关注局部邻域(例如，单跳依赖关系)，限制了程序语义在建模功能中的表达性。在本文中，我们提出了明确利用深度图结构代码特征进行功能相似性检测的Tailor。给定源级程序，Tailor首先将它们表示为代码属性图(cpg)， cpg结合了抽象语法树、控制流图和数据流图，以共同推断程序语法和语义。然后，Tailor通过应用基于cpg的神经网络(CPGNN)在cpg上迭代传播信息来学习cpg的表示。它通过针对CPG结构定制的新图神经网络(GNN)而不是之前使用的现成GNN，改进了之前在代码表示学习方面的工作。我们使用两个公共基准系统地评估了C和Java程序上的Tailor。实验结果表明，Tailor算法在代码克隆检测方面的f值分别达到99.8%和99.9%，在源代码分类方面的准确率达到98.3%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Learning Graph-based Code Representations for Source-level Functional Similarity Detection

Detecting code functional similarity forms the basis of various software engineering tasks. However, the detection is challenging as functionally similar code fragments can be implemented differently, e.g., with irrelevant syntax. Recent studies incorporate program dependencies as semantics to identify syntactically different yet semantically similar programs, but they often focus only on local neighborhoods (e.g., one-hop dependencies), limiting the expressiveness of program semantics in modeling functionalities. In this paper, we present Tailor that explicitly exploits deep graph-structured code features for functional similarity detection. Given source-level programs, Tailor first represents them into code property graphs (CPGs) - which combine abstract syntax trees, control flow graphs, and data flow graphs - to collectively reason about program syntax and semantics. Then, Tailor learns representations of CPGs by applying a CPG-based neural network (CPGNN) to iteratively propagate information on them. It improves over prior work on code representation learning through a new graph neural network (GNN) tailored to CPG structures instead of the off-the-shelf GNNs used previously. We systematically evaluate Tailor on C and Java programs using two public benchmarks. Experimental results show that Tailor outperforms the state-of-the-art approaches, achieving 99.8% and 99.9% F-scores in code clone detection and 98.3% accuracy in source code classification.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)

自引率

0.00%

发文量