Nearest‐neighbor, BERT‐based, scalable clone detection: A practical approach for large‐scale industrial code bases

Gul Aftab Ahmed, James Patten, Yuanhua Han, Guoxian Lu, Wei Hou, David Gregg, Jim Buckley, Muslim Chochlov
Journal: Software: Practice and Experience
Published: 2024-06-12
DOI: 10.1002/spe.3355 (https://doi.org/10.1002/spe.3355)
Citations: 0

Abstract

Hidden code clones negatively impact software maintenance, but manually detecting them in large codebases is impractical. Automated approaches, in turn, find syntactically divergent clones very hard to detect. Recent deep neural networks (for example, BERT-based artificial neural networks) seem more effective at detecting such clones, but they compare every pair of code fragments in the target system(s), which is inefficient and scales poorly to large codebases. We present SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 (T3/T4) clones at very large scale, in line with our industrial partner's requirements. It computes a representative embedding for each code fragment and finds similar fragments using a nearest-neighbor search. SSCD thus avoids the pairwise-comparison bottleneck of other neural network approaches, and it uses a parallel, GPU-accelerated search to tackle scalability. This article describes the approach, and proposes and evaluates several refinements to improve T3/T4 clone detection at scale. It provides a substantial empirical evaluation of the technique, including a speed/efficacy comparison against SourcererCC and against Oreo, the only other neural-network approach currently capable of scaling to hundreds of millions of LOC. It also includes a large in-situ evaluation on our industrial collaborator's code base that assesses the original technique and the impact of the proposed refinements, and illustrates the impact of incremental, active learning on its efficacy. We find that SSCD is significantly faster and more accurate than SourcererCC and Oreo. SAGA, a GPU-accelerated traditional clone detection approach, is slightly better than SSCD for Type 1 and Type 2 (T1/T2) clones, but substantially worse for T3/T4 clones. SSCD is therefore both scalable to industrial code sizes and more accurate than existing approaches for difficult T3/T4 clone searching. In-situ evaluation on company datasets shows that SSCD outperforms the baseline approach (CCFinderX) for T3/T4 clones. Whitespace removal and active learning further improve SSCD's effectiveness.
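
The embed-then-search idea in the abstract can be illustrated with a small sketch. The abstract does not give SSCD's model, search index, or similarity threshold, so everything here is an assumption for illustration only: `toy_embed` is a normalised bag-of-tokens stand-in for a BERT encoder, `nearest_neighbors` is a brute-force CPU search standing in for the parallel, GPU-accelerated one, and the `0.8` threshold is arbitrary. The point it shows is structural: the encoder runs once per fragment, and clone candidates come from comparing cheap vectors rather than from running a model on every pair.

```python
import math
import re
from collections import Counter

# Hypothetical tokeniser: identifiers, numbers, or single symbols.
# Whitespace is never captured, so layout differences disappear.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def toy_embed(fragment):
    """Stand-in for a BERT encoder: L2-normalised token counts (assumption)."""
    counts = Counter(TOKEN.findall(fragment))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def nearest_neighbors(embeddings, k=1, threshold=0.8):
    """Return, per fragment, up to k (index, similarity) pairs above threshold."""
    result = {}
    for i, a in enumerate(embeddings):
        scored = sorted(
            ((cosine(a, b), j) for j, b in enumerate(embeddings) if j != i),
            reverse=True,
        )
        result[i] = [(j, s) for s, j in scored[:k] if s >= threshold]
    return result


fragments = [
    "int add(int a, int b) { return a + b; }",
    "int add(int a,int b){\n    return a+b;\n}",
    "void log(String msg) { System.out.println(msg); }",
]
embeddings = [toy_embed(f) for f in fragments]  # one encoder pass per fragment
clones = nearest_neighbors(embeddings)
```

In this toy run the first two fragments differ only in whitespace, so they map to the same vector and each is reported as the other's nearest neighbour, while the third fragment has no neighbour above the threshold. The brute-force loop is only for readability; at the scale the abstract describes, the search step would need an index built for large vector collections.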