A privacy-preserving standardized model for large-scale source code fingerprint extraction and clone detection

Impact factor 4.1 · Zone 2, Computer Science · JCR Q1, Computer Science, Hardware & Architecture

Computer Standards & Interfaces Pub Date : 2025-03-19 DOI:10.1016/j.csi.2025.103998

Ming Yang , Yu-an Tan , Ning Shi , Yajie Wang , Ziqi Wang , Qi Liang

Volume 94, Article 103998. Full text: https://www.sciencedirect.com/science/article/pii/S0920548925000273

Citations: 0

Abstract

With the rapid advancement of software technology, developers often copy or modify existing code (code cloning) to improve development efficiency. However, the widespread use of open-source code may lead to intellectual property disputes and infringement risks. Additionally, repeated use of cloned code can exacerbate vulnerabilities, increasing system fragility and maintenance costs, especially when cloned fragments require synchronized modifications during software evolution. To address these challenges, this paper proposes Ringer, a privacy-preserving large-scale code fingerprint extraction model. The model decouples feature extraction from clone detection, enabling efficient clone detection without direct access to the source code. Ringer uses syntax trees for lexical and syntactic analysis to extract code features comprehensively, and generates irreversible code fingerprints through token replacement and the Metro-128 hash algorithm, preserving the privacy of the source code while still detecting clones effectively. Experimental results show that Ringer performs well on datasets in multiple programming languages (e.g., Java, C++, and Python), maintaining consistently high accuracy based on the characteristics of each language. On the Python dataset, Ringer achieves detection accuracies of 94%, 94%, and 97% for Type-1, Type-2, and Type-3 clones, respectively, further validating its efficiency and reliability in practical applications. Compared with mainstream detection tools (e.g., Moss and NiCad), Ringer performs better in cross-language detection, demonstrating robust adaptability and superior accuracy. These results support the broad applicability of Ringer for privacy-preserving clone detection in large-scale, multi-language codebases.
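The fingerprinting idea in the abstract (token replacement followed by a 128-bit hash, with detection run on fingerprints only) can be illustrated with a minimal sketch. This is not Ringer's implementation: the function names `normalize`, `fingerprints`, and `similarity`, the sliding window of 5 tokens, and the Jaccard comparison are all assumptions made for illustration. Python's standard `tokenize` module stands in for the paper's syntax-tree analysis, and `hashlib.blake2b` with a 16-byte digest stands in for Metro-128, which is not in the standard library.

```python
import hashlib
import io
import keyword
import tokenize

# Layout-only token types, dropped so formatting and comments do not matter.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalize(source: str) -> list[str]:
    """Token replacement: identifiers and literals become placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")      # variable, function, and class names
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)  # keywords, operators, punctuation
    return out


def fingerprints(source: str, window: int = 5) -> set[str]:
    """Hash every window of normalized tokens into a 128-bit digest.

    Assumption: BLAKE2b (16-byte digest) replaces Metro-128 here.
    """
    tokens = normalize(source)
    count = max(len(tokens) - window + 1, 1)
    return {
        hashlib.blake2b(" ".join(tokens[i:i + window]).encode(),
                        digest_size=16).hexdigest()
        for i in range(count)
    }


def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two fingerprint sets (an assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = "def add(x, y):\n    return x + y  # sum\n"
renamed = "def plus(left, right):\n    return left + right\n"
unrelated = "for i in range(10):\n    print(i)\n"

fp_original = fingerprints(original)
fp_renamed = fingerprints(renamed)
fp_unrelated = fingerprints(unrelated)
```

In this sketch, only the fingerprint sets would need to leave the code owner's machine; `similarity` never sees source text. Renamed identifiers and removed comments normalize to the same token stream, which is why the first two snippets share all of their fingerprints while the third shares none.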

Source journal

Computer Standards & Interfaces · Engineering & Technology, Computer Science: Software Engineering

CiteScore

11.90

Self-citation rate

16.00%

Publication volume

Review time

6 months

About the journal: The quality of software, well-defined interfaces (hardware and software), the process of digitalisation, and accepted standards in these fields are essential for building and exploiting complex computing, communication, multimedia and measuring systems. Standards can simplify the design and construction of individual hardware and software components and help to ensure satisfactory interworking. Computer Standards & Interfaces is an international journal dealing specifically with these topics. The journal:
• Provides information about activities and progress on the definition of computer standards, software quality, interfaces and methods, at national, European and international levels
• Publishes critical comments on standards and standards activities
• Disseminates users' experiences and case studies in the application and exploitation of established or emerging standards, interfaces and methods
• Offers a forum for discussion on actual projects, standards, interfaces and methods by recognised experts
• Stimulates relevant research by providing a specialised refereed medium.