A systematic literature review on cross-language source code clone detection
Asra Sulaiman Alshabib, Sajjad Mahmood, Mohammad Alshayeb
Computer Science Review, Volume 58, Article 100786 (published 2025-06-27)
DOI: 10.1016/j.cosrev.2025.100786
URL: https://www.sciencedirect.com/science/article/pii/S1574013725000620
Impact factor: 12.7; JCR Q1 (Computer Science, Information Systems)
Citation count: 0
Abstract
Context
Cross-Language Code Clone Detection (CLCCD) is crucial to maintaining consistency and minimizing redundancy in modern software development, where similar code may appear in different projects written in different programming languages. Previous reviews have explored code clone detection in general, but none has focused exclusively on CLCCD.
Objective
This study aims to bridge this gap by reviewing the existing CLCCD approaches, focusing on detection techniques, preprocessing methods, feature extraction approaches, datasets, and evaluation metrics used.
Method
A systematic literature review (SLR) was conducted, analyzing 26 studies published in journals, conferences, and workshops up to May 2025. Both quantitative and qualitative data were systematically analyzed to derive the findings.
Results
CLCCD has evolved from traditional techniques to deep learning models, but fully automated tools remain unavailable. The most widely used preprocessing techniques in CLCCD methods are parsing (73 %), normalization (35 %), and tokenization (27 %). Hybrid feature extraction, which combines tree-based and graph-based methods to capture code structure and semantics, is the most common choice, used in 38.5 % of studies. However, the datasets, which are sourced primarily from programming competition platforms, lack diversity and standardization. Performance evaluation relies largely on precision, recall, and F1-score; incorporating additional evaluation metrics could provide more insight into detection performance.
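As an illustration of the evaluation metrics named above, the following minimal Python sketch computes precision, recall, and F1-score for a set of predicted cross-language clone pairs against a ground-truth set. The function name, the pair representation (two file names), and the example file names are assumptions made for illustration; they are not taken from the reviewed studies.

```python
from typing import Dict, Set, Tuple

# A clone pair is represented as two fragment identifiers (hypothetical: file names).
Pair = Tuple[str, str]


def canonical(pair: Pair) -> Pair:
    """Order a pair so that (a, b) and (b, a) count as the same clone pair."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def clone_detection_scores(predicted: Set[Pair], actual: Set[Pair]) -> Dict[str, float]:
    """Compute precision, recall, and F1 over unordered clone pairs."""
    pred = {canonical(p) for p in predicted}
    gold = {canonical(p) for p in actual}

    true_positives = len(pred & gold)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth: four Java/Python clone pairs.
ground_truth = {
    ("Sort.java", "sort.py"),
    ("Gcd.java", "gcd.py"),
    ("Fib.java", "fib.py"),
    ("Bfs.java", "bfs.py"),
}

# Hypothetical detector output: two correct pairs and one false positive.
detected = {
    ("sort.py", "Sort.java"),  # same pair as in the ground truth, reversed order
    ("Gcd.java", "gcd.py"),
    ("Fib.java", "bfs.py"),    # not a true clone pair
}

scores = clone_detection_scores(detected, ground_truth)
```

In this example two of the three detected pairs are correct and two of the four true pairs are found, so precision is 2/3, recall is 1/2, and F1 is 4/7. Pairs are treated as unordered, which is one possible design choice; a benchmark that distinguishes source and target language direction would need ordered pairs instead.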
Conclusion
This SLR summarizes current CLCCD research, highlighting advancements and challenges. Significant gaps include the absence of diverse and standardized datasets and the limited exploration of advanced feature extraction techniques. Future research should focus on creating better datasets, adopting novel detection techniques, and exploring feature extraction methods to improve CLCCD performance for modern multi-language systems.
About the journal:
Computer Science Review, a publication dedicated to research surveys and expository overviews of open problems in computer science, targets a broad audience within the field seeking comprehensive insights into the latest developments. The journal welcomes articles from various fields as long as their content impacts the advancement of computer science. In particular, articles that review the application of well-known Computer Science methods to other areas are in scope only if these articles advance the fundamental understanding of those methods.