Finding repeated strings in code repositories and its applications to code-clone detection

2021 28th Asia-Pacific Software Engineering Conference (APSEC) Pub Date : 2021-12-01 DOI:10.1109/APSEC53868.2021.00057

Yoriyuki Yamagata, Fabien Hervé, Yuji Fujiwara, Katsuro Inoue


Abstract

Although researchers have created many advanced code-clone detection techniques, more effort is required before these techniques are widely adopted in industry. One reason is that these advanced techniques rely on lexing and parsing programs. Modern programming languages have complex lexical conventions and grammars, which evolve constantly. Therefore, using advanced code-clone detection techniques requires substantial and continuous effort. This paper proposes a lightweight, language-independent method that detects code clones by simply finding repeated strings in a code repository, relying on neither lexing nor parsing. The proposed method is based on an efficient technique for finding repeated strings that was developed in a bioinformatics context. We refer to the repeated strings in source code as weak Type-1 clones. Because the proposed technique normalizes newlines, tabs, and white spaces into a single white space, it can find clones in which newline positions or indentation have changed, as is often the case when code is copied and pasted. Although the proposed method finds only verbatim copies, it also yields interesting observations about repository structures. Many developers may prefer the proposed simple approach because it is easier to understand than more advanced techniques that use heuristics, approximation, and machine learning.
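The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the bioinformatics technique, so the sketch substitutes a naive sorted-suffix comparison that is only practical for small inputs. The function names `normalize` and `repeated_strings` and the `min_len` parameter are assumptions made for illustration.

```python
import re


def normalize(text):
    """Collapse every run of newlines, tabs, and spaces into one space."""
    return re.sub(r"\s+", " ", text)


def repeated_strings(text, min_len):
    """Return strings of length >= min_len that occur at least twice.

    Naive stand-in for an efficient repeat finder (an assumption, not
    the paper's algorithm): sort all suffix start positions, then take
    the longest common prefix of each pair of adjacent suffixes.
    """
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    found = set()
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        # Extend the match while both suffixes agree character by character.
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n >= min_len:
            found.add(text[a:a + n])
    return found


# The same two statements appear twice with different line breaks
# and indentation; normalization makes the copies identical.
source = "foo(x);\n\tbar(y);\nbaz();\nfoo(x);  bar(y);"
clones = repeated_strings(normalize(source), min_len=10)
```

Here `clones` contains `"foo(x); bar(y);"` even though the two copies are laid out differently in `source`, which is the effect the abstract attributes to whitespace normalization. The result also includes shorter repeats such as suffixes of that string; a real tool would filter these down to maximal repeats.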


