基于近邻 BERT 的可扩展克隆检测：大规模工业代码库的实用方法

Software: Practice and Experience Pub Date : 2024-06-12 DOI:10.1002/spe.3355

Gul Aftab Ahmed, James Patten, Yuanhua Han, Guoxian Lu, Wei Hou, David Gregg, Jim Buckley, Muslim Chochlov

{"title":"基于近邻 BERT 的可扩展克隆检测：大规模工业代码库的实用方法","authors":"Gul Aftab Ahmed, James Patten, Yuanhua Han, Guoxian Lu, Wei Hou, David Gregg, Jim Buckley, Muslim Chochlov","doi":"10.1002/spe.3355","DOIUrl":null,"url":null,"abstract":"Hidden code clones negatively impact software maintenance, but manually detecting them in large codebases is impractical. Additionally, automated approaches find detection of syntactically‐divergent clones very challenging. While recent deep neural networks (for example BERT‐based artificial neural networks) seem more effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We present SSCD, a BERT‐based clone detection approach that targets high recall of Type 3 and Type 4 clones at a very large scale (in line with our industrial partner's requirements). It computes a representative embedding for each code fragment and finds similar fragments using a nearest neighbor search. Thus, SSCD avoids the pairwise‐comparison bottleneck of other neural network approaches, while also using a parallel, GPU‐accelerated search to tackle scalability. This article describes the approach, proposing and evaluating several refinements to improve Type 3/4 clone detection at scale. It provides a substantial empirical evaluation of the technique, including a speed/efficacy comparison of the approach against SourcererCC and Oreo, the only other neural‐network approach currently capable of scaling to hundreds of millions of LOC. It also includes a large in‐situ evaluation on our industrial collaborator's code base that assesses the original technique, the impact of the proposed refinements and illustrates the impact of incremental, active learning on its efficacy. We find that SSCD is significantly faster and more accurate than SourcererCC and Oreo. SAGA, a GPU‐accelerated traditional clone detection approach, is a little better than SSCD for T1/T2 clones, but substantially worse for T3/T4 clones. Thus, SSCD is both scalable to industrial code sizes, and comparatively more accurate than existing approaches for difficult T3/T4 clone searching. In‐situ evaluation on company datasets shows that SSCD outperforms the baseline approach (CCFinderX) for T3/T4 clones. Whitespace removal and active learning further improve SSCD effectiveness.","PeriodicalId":21899,"journal":{"name":"Software: Practice and Experience","volume":"10 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Nearest‐neighbor, BERT‐based, scalable clone detection: A practical approach for large‐scale industrial code bases\",\"authors\":\"Gul Aftab Ahmed, James Patten, Yuanhua Han, Guoxian Lu, Wei Hou, David Gregg, Jim Buckley, Muslim Chochlov\",\"doi\":\"10.1002/spe.3355\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hidden code clones negatively impact software maintenance, but manually detecting them in large codebases is impractical. Additionally, automated approaches find detection of syntactically‐divergent clones very challenging. While recent deep neural networks (for example BERT‐based artificial neural networks) seem more effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We present SSCD, a BERT‐based clone detection approach that targets high recall of Type 3 and Type 4 clones at a very large scale (in line with our industrial partner's requirements). It computes a representative embedding for each code fragment and finds similar fragments using a nearest neighbor search. Thus, SSCD avoids the pairwise‐comparison bottleneck of other neural network approaches, while also using a parallel, GPU‐accelerated search to tackle scalability. This article describes the approach, proposing and evaluating several refinements to improve Type 3/4 clone detection at scale. It provides a substantial empirical evaluation of the technique, including a speed/efficacy comparison of the approach against SourcererCC and Oreo, the only other neural‐network approach currently capable of scaling to hundreds of millions of LOC. It also includes a large in‐situ evaluation on our industrial collaborator's code base that assesses the original technique, the impact of the proposed refinements and illustrates the impact of incremental, active learning on its efficacy. We find that SSCD is significantly faster and more accurate than SourcererCC and Oreo. SAGA, a GPU‐accelerated traditional clone detection approach, is a little better than SSCD for T1/T2 clones, but substantially worse for T3/T4 clones. Thus, SSCD is both scalable to industrial code sizes, and comparatively more accurate than existing approaches for difficult T3/T4 clone searching. In‐situ evaluation on company datasets shows that SSCD outperforms the baseline approach (CCFinderX) for T3/T4 clones. Whitespace removal and active learning further improve SSCD effectiveness.\",\"PeriodicalId\":21899,\"journal\":{\"name\":\"Software: Practice and Experience\",\"volume\":\"10 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Software: Practice and Experience\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1002/spe.3355\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Software: Practice and Experience","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/spe.3355","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

隐藏的代码克隆会对软件维护造成负面影响，但在大型代码库中手动检测它们并不现实。此外，自动方法发现检测语法不同的克隆非常具有挑战性。虽然最近的深度神经网络（例如基于 BERT 的人工神经网络）在检测此类克隆方面似乎更有效，但它们对目标系统中的每个代码对进行成对比较的效率很低，而且在大型代码库中的扩展性也很差。我们提出的 SSCD 是一种基于 BERT 的克隆检测方法，其目标是在非常大的范围内（符合我们行业合作伙伴的要求）实现对第 3 类和第 4 类克隆的高召回率。它为每个代码片段计算代表性嵌入，并使用近邻搜索找到相似片段。因此，SSCD 避免了其他神经网络方法的成对比较瓶颈，同时还使用 GPU 加速的并行搜索来解决可扩展性问题。本文介绍了这种方法，提出并评估了几项改进措施，以提高 3/4 型克隆检测的规模。文章对该技术进行了大量实证评估，包括与 SourcererCC 和 Oreo（目前唯一能扩展到数亿 LOC 的神经网络方法）的速度/功效比较。报告还包括对我们工业合作者代码库的大型现场评估，评估了原始技术、建议改进的影响，并说明了增量主动学习对其功效的影响。我们发现，SSCD 比 SourcererCC 和 Oreo 更快、更准确。在 T1/T2 克隆方面，GPU 加速的传统克隆检测方法 SAGA 比 SSCD 略胜一筹，但在 T3/T4 克隆方面则差得多。因此，SSCD 既可扩展到工业代码大小，又比现有的 T3/T4 克隆搜索方法更准确。在公司数据集上进行的现场评估表明，在 T3/T4 克隆方面，SSCD 优于基准方法（CCFinderX）。空白去除和主动学习进一步提高了 SSCD 的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Nearest‐neighbor, BERT‐based, scalable clone detection: A practical approach for large‐scale industrial code bases

Hidden code clones negatively impact software maintenance, but manually detecting them in large codebases is impractical. Additionally, automated approaches find detection of syntactically‐divergent clones very challenging. While recent deep neural networks (for example BERT‐based artificial neural networks) seem more effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We present SSCD, a BERT‐based clone detection approach that targets high recall of Type 3 and Type 4 clones at a very large scale (in line with our industrial partner's requirements). It computes a representative embedding for each code fragment and finds similar fragments using a nearest neighbor search. Thus, SSCD avoids the pairwise‐comparison bottleneck of other neural network approaches, while also using a parallel, GPU‐accelerated search to tackle scalability. This article describes the approach, proposing and evaluating several refinements to improve Type 3/4 clone detection at scale. It provides a substantial empirical evaluation of the technique, including a speed/efficacy comparison of the approach against SourcererCC and Oreo, the only other neural‐network approach currently capable of scaling to hundreds of millions of LOC. It also includes a large in‐situ evaluation on our industrial collaborator's code base that assesses the original technique, the impact of the proposed refinements and illustrates the impact of incremental, active learning on its efficacy. We find that SSCD is significantly faster and more accurate than SourcererCC and Oreo. SAGA, a GPU‐accelerated traditional clone detection approach, is a little better than SSCD for T1/T2 clones, but substantially worse for T3/T4 clones. Thus, SSCD is both scalable to industrial code sizes, and comparatively more accurate than existing approaches for difficult T3/T4 clone searching. In‐situ evaluation on company datasets shows that SSCD outperforms the baseline approach (CCFinderX) for T3/T4 clones. Whitespace removal and active learning further improve SSCD effectiveness.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Software: Practice and Experience

自引率

0.00%

发文量