SourcererCC: Scaling Code Clone Detection to Big-Code

2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE); Pub Date: 2015-12-20; DOI: 10.1145/2884781.2884877

Hitesh Sajnani, V. Saini, Jeffrey Svajlenko, C. Roy, C. Lopes

Pages: 1157-1168

Citations: 456

Abstract

Despite a decade of active research, there has been a marked lack of clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation. It exploits an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall, and precision of SourcererCC, and compare it to four publicly available, state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find that SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (25K projects, 250 MLOC) using a standard workstation.
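
The abstract names the ingredients (a token-based representation, an inverted index, and filtering by token ordering) without giving details. As a rough illustration only, the sketch below shows one common way such pieces fit together: prefix filtering over a global rarest-first token ordering, with a token-overlap threshold. The function names, the 80% default threshold, and the overlap measure are assumptions made for this example and should not be read as the authors' implementation.

```python
from collections import Counter, defaultdict


def required_overlap(size, percent):
    """Smallest shared-token count reaching `percent` of `size` (integer ceiling)."""
    return (percent * size + 99) // 100


def detect_clones(blocks, percent=80):
    """blocks: {block_id: [token, ...]}. Returns a set of (earlier_id, later_id) pairs.

    Illustrative sketch: two blocks count as clones when they share at least
    `percent` of the tokens of the larger block (an assumed similarity measure).
    """
    # One global ordering for all blocks: rarest tokens first.
    freq = Counter(tok for toks in blocks.values() for tok in toks)

    ordered = {}
    for bid, toks in blocks.items():
        # Number repeated tokens so that a bag of tokens becomes a plain set.
        seen = Counter()
        items = []
        for tok in toks:
            items.append((tok, seen[tok]))
            seen[tok] += 1
        items.sort(key=lambda it: (freq[it[0]], it[0], it[1]))
        ordered[bid] = items

    index = defaultdict(list)  # token item -> ids of blocks indexed under it
    clones = set()
    for bid, items in ordered.items():
        n = len(items)
        if n == 0:
            continue
        # Only this prefix is indexed and queried; a pair that meets the
        # threshold must share at least one item inside both prefixes.
        prefix = items[: n - required_overlap(n, percent) + 1]
        candidates = {other for it in prefix for other in index.get(it, ())}
        for other in candidates:
            a, b = set(items), set(ordered[other])
            # Full comparison is needed only for the surviving candidates.
            if len(a & b) >= required_overlap(max(len(a), len(b)), percent):
                clones.add((other, bid))
        for it in prefix:
            index[it].append(bid)
    return clones
```

With five-token blocks and the 80% threshold, each block contributes only its two rarest tokens to the index, so a block that shares just one common token with the others is never even compared. That is the sense in which ordering-based filtering can shrink both the index and the number of comparisons.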
