A systematic literature review on cross-language source code clone detection

IF 12.7 · Zone 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Computer Science Review · Pub Date: 2025-06-27 · DOI: 10.1016/j.cosrev.2025.100786

Asra Sulaiman Alshabib, Sajjad Mahmood, Mohammad Alshayeb

Computer Science Review, volume 58, Article 100786. https://www.sciencedirect.com/science/article/pii/S1574013725000620

Citations: 0

Abstract

Context

Cross-language code clone detection (CLCCD) is crucial to maintaining consistency and minimizing redundancy in modern software development, where similar code may appear in different projects written in various programming languages. While previous reviews have explored code clone detection in general, none has focused exclusively on CLCCD.

Objective

This study aims to bridge this gap by reviewing existing CLCCD approaches, focusing on the detection techniques, preprocessing methods, feature extraction approaches, datasets, and evaluation metrics they use.

Method

A systematic literature review (SLR) was conducted, analyzing 26 studies published in journals, conferences, and workshops up to May 2025. Both quantitative and qualitative data were systematically analyzed to derive the findings.

Results

CLCCD has evolved from traditional techniques to deep learning models, but fully automated tools remain unavailable. Parsing (73%), normalization (35%), and tokenization (27%) are widely used preprocessing techniques in CLCCD methods. The largest share of studies (38.5%) employs hybrid feature extraction, which combines tree-based and graph-based methods to capture code structure and semantics. However, the datasets, which are sourced primarily from programming competition platforms, lack diversity and standardization. Performance evaluation relies largely on precision, recall, and F1-score; incorporating additional evaluation metrics could provide more insight into detection performance.
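As an illustration of the metrics named above, the following minimal Python sketch computes precision, recall, and F1-score for a set of predicted cross-language clone pairs against a reference set. It is not taken from any of the reviewed studies: the function name `evaluate_clone_pairs`, the file names, and the convention of treating a clone pair as an unordered pair of file identifiers are all assumptions made for this example.

```python
def evaluate_clone_pairs(predicted, actual):
    """Return (precision, recall, f1) for predicted vs. reference clone pairs.

    Each pair is treated as unordered, so ("a.py", "A.java") and
    ("A.java", "a.py") count as the same clone pair (an assumption
    of this sketch, not a convention stated in the review).
    """
    predicted_set = {frozenset(pair) for pair in predicted}
    actual_set = {frozenset(pair) for pair in actual}

    true_positives = len(predicted_set & actual_set)
    false_positives = len(predicted_set - actual_set)
    false_negatives = len(actual_set - predicted_set)

    # Guard against division by zero when a set is empty.
    reported = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / expected if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical detector output and hypothetical ground truth.
predicted_pairs = [("a.py", "A.java"), ("b.py", "B.java"), ("c.py", "D.java")]
reference_pairs = [
    ("A.java", "a.py"),
    ("b.py", "B.java"),
    ("e.py", "E.java"),
    ("f.py", "F.java"),
]

precision, recall, f1 = evaluate_clone_pairs(predicted_pairs, reference_pairs)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the three reported pairs are correct and two of the four reference pairs are found, so precision is 2/3, recall is 1/2, and F1 is 4/7. The three metrics only describe pair-level agreement with a reference set, which is one reason additional metrics can add insight.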

Conclusion

This SLR summarizes current CLCCD research, highlighting advancements and challenges. Significant gaps include the absence of diverse and standardized datasets and the limited exploration of advanced feature extraction techniques. Future research should focus on creating better datasets, adopting novel detection techniques, and exploring feature extraction methods to improve CLCCD performance for modern multi-language systems.




Source journal

Computer Science Review (Computer Science: General Computer Science)

CiteScore: 32.70

Self-citation rate: 0.00%

Articles published:

Review time: 51 days

About the journal: Computer Science Review, a publication dedicated to research surveys and expository overviews of open problems in computer science, targets a broad audience within the field seeking comprehensive insights into the latest developments. The journal welcomes articles from various fields as long as their content impacts the advancement of computer science. In particular, articles that review the application of well-known computer science methods to other areas are in scope only if they advance the fundamental understanding of those methods.