TreeCen: Building Tree Graph for Scalable Semantic Code Clone Detection

Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. Pub Date: 2022-10-10. DOI: 10.1145/3551349.3556927

Yutao Hu, Deqing Zou, Junru Peng, Yueming Wu, Junjie Shan, Hai Jin


Citations: 8

Abstract


TreeCen: Building Tree Graph for Scalable Semantic Code Clone Detection

Code clone detection is an important research problem that has attracted wide attention in software engineering. Many methods have been proposed for detecting code clones. Among them, text-based and token-based approaches are scalable but do not consider code semantics, so they cannot detect semantic code clones. Methods based on intermediate representations of code can detect semantic clones. However, graph-based methods are impractical because they require the code to be compiled, and existing tree-based approaches scale poorly because of the size of the trees. In this paper, we propose TreeCen, a scalable tree-based code clone detector that detects semantic clones effectively. Given the source code of a method, we first extract its abstract syntax tree (AST) by static analysis and transform it into a simple graph representation (i.e., a tree graph) according to node type, rather than using traditional heavyweight tree matching. We then treat the tree graph as a social network and apply centrality analysis to each node to preserve the details of the tree. In this way, the original complex tree is converted into a 72-dimensional vector that retains comprehensive structural information about the AST. Finally, these vectors are fed into a machine learning model to train a detector, which is then used to find code clones. We conduct comparative evaluations of effectiveness and scalability. The experimental results show that TreeCen performs best in a comparison with six other state-of-the-art methods (SourcererCC, RtvNN, DeepSim, SCDetector, Deckard, and ASTNN), with F1 scores of 0.99 and 0.95 on the BigCloneBench and Google Code Jam datasets, respectively. In terms of scalability, TreeCen is about 79 times faster than the other state-of-the-art tree-based semantic code clone detector (ASTNN), about 13 times faster than the fastest graph-based approach (SCDetector), and about 22 times faster than the one-time trained token-based detector (RtvNN).
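
The pipeline described in the abstract (AST, then a graph keyed by node type, then per-node centrality, then a fixed-length vector) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes Python's standard `ast` module as a stand-in parser, plain degree centrality as the only centrality measure, and a small hypothetical vocabulary `NODE_TYPES` in place of the 72 dimensions mentioned above. The function names are likewise illustrative.

```python
import ast

# Hypothetical vocabulary of AST node types. It fixes the layout of the
# output vector; the abstract's 72 dimensions are NOT reproduced here.
NODE_TYPES = [
    "FunctionDef", "arguments", "arg", "For", "While", "If", "Assign",
    "AugAssign", "Return", "BinOp", "Compare", "Call", "Name", "Constant",
]


def build_tree_graph(source):
    """Collapse an AST into an undirected graph whose nodes are node types.

    Returns a dict mapping each node type to the set of distinct node types
    it touches through a parent-child link. Links between two nodes of the
    same type are ignored (a simplifying assumption).
    """
    neighbours = {}
    for parent in ast.walk(ast.parse(source)):
        parent_type = type(parent).__name__
        neighbours.setdefault(parent_type, set())
        for child in ast.iter_child_nodes(parent):
            child_type = type(child).__name__
            if child_type != parent_type:
                neighbours[parent_type].add(child_type)
                neighbours.setdefault(child_type, set()).add(parent_type)
    return neighbours


def degree_centrality(neighbours):
    """Degree centrality: number of neighbours divided by (n - 1)."""
    n = len(neighbours)
    if n < 2:
        return {node_type: 0.0 for node_type in neighbours}
    return {t: len(adj) / (n - 1) for t, adj in neighbours.items()}


def to_vector(source, vocabulary=NODE_TYPES):
    """Map source code to a fixed-length vector of per-type centralities."""
    centrality = degree_centrality(build_tree_graph(source))
    return [centrality.get(node_type, 0.0) for node_type in vocabulary]


if __name__ == "__main__":
    first = "def f(x):\n    return x + 1\n"
    renamed = "def g(y):\n    return y + 2\n"
    looped = "def h(n):\n    t = 0\n    for i in range(n):\n        t += i\n    return t\n"
    print(to_vector(first) == to_vector(renamed))  # identifiers do not matter
    print(to_vector(first) == to_vector(looped))   # structure does
```

Because the graph is keyed by node type rather than by individual AST nodes, its size is bounded by the number of node types regardless of how large the method is, which is one plausible reading of why this representation scales. In a full pipeline, the vectors for a pair of methods would then be passed to a trained classifier, as the abstract describes.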


Source journal

Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
