Folding Repeated Instructions for Improving Token-Based Code Clone Detection

2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation. Pub Date: 2012-09-23. DOI: 10.1109/SCAM.2012.21

Hiroaki Murakami, Keisuke Hotta, Yoshiki Higo, H. Igaki, S. Kusumoto


Citations: 29

Abstract

A variety of code clone detection methods have been proposed. However, only a small number of them are widely used. The widely used methods are line-based and token-based. They scale well because they require neither deep source code analysis nor the construction of complex intermediate structures for detection. High scalability is a major advantage for a code clone detection tool. On the other hand, line-based and token-based detection yields many false positives. One cause is the presence of repeated instructions in the source code. For example, suppose that C source code contains three consecutive printf statements. If we apply token-based detection to them, the first two statements are detected as a code clone of the last two. Such overlapping code clones are redundant and therefore not useful to developers. In this paper, we propose a new detection method that is not affected by the presence of repeated instructions. The proposed method transforms repeated instructions into a special form and then detects code clones using a suffix array algorithm. The transformation prevents many false positives from being detected, and detection speed is preserved. The proposed method has been implemented as a software tool, FRISC. We confirmed the usefulness of the proposed method through a quantitative evaluation of FRISC with Bellon's oracle.
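The printf example can be made concrete with a small sketch. This is not FRISC: the abstract does not specify the "special form", so collapsing a run of identical statements into a single copy is an assumption made here for illustration. The quadratic matcher `clone_pairs` also stands in for the suffix array algorithm, and the normalized statement strings (with `$` replacing literals) are hypothetical.

```python
def fold_repeats(statements):
    """Collapse each run of consecutive identical statements into one copy.

    Assumption: this stands in for the paper's unspecified "special form".
    """
    folded = []
    for stmt in statements:
        if not folded or folded[-1] != stmt:
            folded.append(stmt)
    return folded


def clone_pairs(statements, min_len=2):
    """Naively list (i, j, length) where the runs starting at i and j match.

    Only left-maximal matches of at least min_len statements are reported.
    A real token-based detector would use a suffix array instead of this
    quadratic scan.
    """
    n = len(statements)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip matches that merely continue a match starting one step earlier.
            if i > 0 and statements[i - 1] == statements[j - 1]:
                continue
            length = 0
            while j + length < n and statements[i + length] == statements[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs


# Hypothetical normalized statements: literals are replaced by "$".
code = [
    "int x = $ ;",
    "printf ( $ ) ;",
    "printf ( $ ) ;",
    "printf ( $ ) ;",
    "return $ ;",
]

print(clone_pairs(code))                # statements 1-2 match statements 2-3
print(clone_pairs(fold_repeats(code)))  # no pair remains after folding
```

Without folding, the only reported pair is the overlapping one described in the abstract: the first two printf statements against the last two. After folding, the run is a single statement and the spurious pair disappears.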

