Neural Detection of Semantic Code Clones Via Tree-Based Convolution

2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pp. 70-80. Pub Date: 2019-05-01. DOI: 10.1109/ICPC.2019.00021

Hao Yu, Wing Lam, Long Chen, Ge Li, Tao Xie, Qianxiang Wang


Citations: 91

Abstract

Code clones are similar code fragments that share the same semantics but may differ syntactically to various degrees. Detecting code clones helps reduce the cost of software maintenance and prevent faults. Various approaches to detecting code clones have been proposed over the last two decades, but few of them can detect semantic clones, i.e., code clones with dissimilar syntax. Recent research has attempted to adopt deep learning for detecting code clones, such as using a tree-based LSTM over the Abstract Syntax Tree (AST). However, such work does not fully leverage the structural information of code fragments, thereby limiting its clone-detection capability. To fully unleash the power of deep learning for detecting code clones, we propose a new approach that uses tree-based convolution to detect semantic clones, by capturing both the structural information of a code fragment from its AST and lexical information from code tokens. Additionally, our approach addresses the limitation that source code has an unlimited vocabulary of tokens and models, and thus exploiting lexical information from code tokens is often ineffective when dealing with unseen tokens. In particular, we propose a new embedding technique called position-aware character embedding (PACE), which essentially treats any token as a position-weighted combination of character one-hot embeddings. Our experimental results show that our approach substantially outperforms an existing state-of-the-art approach, with increases of 0.42 and 0.15 in F1-score on two popular code-clone benchmarks (OJClone and BigCloneBench), respectively, while being more computationally efficient. Our experimental results also show that PACE enables our approach to be substantially more effective when code clones contain unseen tokens.
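The PACE idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the linear decay used here (the character at 0-based position i of an n-character token gets weight (n - i) / n) is an assumption, as are the alphabet and the names `ALPHABET`, `pace_embedding`, and `cosine`. The sketch only shows why an unseen token still receives a meaningful vector; it is not the authors' implementation.

```python
import string
from typing import List

# Assumed alphabet for the character one-hot vectors (hypothetical choice).
ALPHABET = string.ascii_lowercase + string.digits + "_"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def pace_embedding(token: str) -> List[float]:
    """Position-weighted sum of character one-hot vectors.

    Assumption: the character at 0-based position i in a token of
    length n gets weight (n - i) / n, so earlier characters count more.
    """
    vec = [0.0] * len(ALPHABET)
    n = len(token)
    for i, ch in enumerate(token.lower()):
        idx = CHAR_INDEX.get(ch)
        if idx is None:
            # Characters outside the assumed alphabet are skipped.
            continue
        # Adding weight * one_hot(ch) only changes the entry at idx.
        vec[idx] += (n - i) / n
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Tokens never seen in training still map to a vector, and
    # lexically similar tokens end up close to each other.
    print(cosine(pace_embedding("count"), pace_embedding("counter")))
    print(cosine(pace_embedding("count"), pace_embedding("index")))
```

Because the vector is computed from characters alone, no vocabulary lookup is needed, which is the property the abstract credits for handling unseen tokens. The position weights also make the embedding order-sensitive, so `ab` and `ba` map to different vectors.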
