Analyzing the dependability of Large Language Models for code clone generation

IF 4.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING
Azeeza Eagal, Kathryn T. Stolee, John-Paul Ore
{"title":"Analyzing the dependability of Large Language Models for code clone generation","authors":"Azeeza Eagal,&nbsp;Kathryn T. Stolee,&nbsp;John-Paul Ore","doi":"10.1016/j.jss.2025.112548","DOIUrl":null,"url":null,"abstract":"<div><div>The ability to generate multiple equivalent versions of the same code segment across different programming languages and within the same language is valuable for code translation, language migration, and code comprehension in education. However, current avenues for generating code clones – through manual creation or specialized software tools – often fail to consistently generate a variety of behaviorally equivalent code clones. Large Language Models (LLMs) offer a promising solution by leveraging their extensive training on diverse codebases to automatically generate code. Unlike traditional methods, LLMs can produce code across a wide variety of programming languages with minimal user effort. Using LLMs for code clone generation could significantly reduce the time and resources needed to create code clones while enhancing their syntactic diversity.</div><div>In this quantitative empirical study, we investigate the dependability of LLMs as potential generators of code clones. We gathered equivalent code solutions (i.e., behavioral clones) in C++, Java, and Python from thirty-six programming problems from the well-known technical interview practice platform, LeetCode. We query OpenAI’s GPT-3.5, GPT-4, and CodeLlama to generate code clones of the LeetCode solutions. We measure the behavioral equivalence of the LLM-generated clones using a behavioral similarity clustering technique inspired by the code clone detection tool, Simion-based Language Agnostic Code Clones (SLACC). This study reveals that, despite LLMs demonstrating the potential for code generation, their capacity to consistently generate syntactically diverse but behaviorally equivalent code clones is limited. At lower temperature settings, LLMs are more successful in producing behaviorally consistent, syntactically similar code clones within the same language. However, for cross-language cloning tasks and at higher temperature settings and programming difficulties, LLMs introduce greater syntactic diversity and lead to higher rates of compilation and runtime errors, resulting in a decline in behavioral consistency. These findings indicate a need for further quality assurance measures for the use of LLMs for code clone generation. All the data and scripts associated with this paper can be found <span><span>https://zenodo.org/records/14968618</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"230 ","pages":"Article 112548"},"PeriodicalIF":4.1000,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121225002171","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

The ability to generate multiple equivalent versions of the same code segment across different programming languages and within the same language is valuable for code translation, language migration, and code comprehension in education. However, current avenues for generating code clones – through manual creation or specialized software tools – often fail to consistently generate a variety of behaviorally equivalent code clones. Large Language Models (LLMs) offer a promising solution by leveraging their extensive training on diverse codebases to automatically generate code. Unlike traditional methods, LLMs can produce code across a wide variety of programming languages with minimal user effort. Using LLMs for code clone generation could significantly reduce the time and resources needed to create code clones while enhancing their syntactic diversity.
In this quantitative empirical study, we investigate the dependability of LLMs as potential generators of code clones. We gathered equivalent code solutions (i.e., behavioral clones) in C++, Java, and Python for thirty-six programming problems from the well-known technical interview practice platform, LeetCode. We query OpenAI’s GPT-3.5 and GPT-4, as well as CodeLlama, to generate code clones of the LeetCode solutions. We measure the behavioral equivalence of the LLM-generated clones using a behavioral similarity clustering technique inspired by the code clone detection tool Simion-based Language Agnostic Code Clones (SLACC). This study reveals that, although LLMs demonstrate potential for code generation, their capacity to consistently generate syntactically diverse but behaviorally equivalent code clones is limited. At lower temperature settings, LLMs are more successful at producing behaviorally consistent, syntactically similar code clones within the same language. However, for cross-language cloning tasks, and at higher temperature settings and problem difficulties, LLMs introduce greater syntactic diversity and produce higher rates of compilation and runtime errors, resulting in a decline in behavioral consistency. These findings indicate a need for further quality assurance measures when using LLMs for code clone generation. All the data and scripts associated with this paper can be found at https://zenodo.org/records/14968618.
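To illustrate the kind of behavioral-equivalence measurement the abstract describes, below is a minimal Python sketch in the spirit of SLACC-style input/output clustering; it is not the authors' implementation, and all clone functions and test inputs are hypothetical. Each candidate clone is executed on a shared set of inputs, and clones whose observed outputs (including error behavior) match on every input fall into the same behavioral cluster.

# Minimal sketch of behavioral-similarity clustering in the spirit of SLACC;
# illustrative only, not the authors' pipeline. Clone names and tests are
# hypothetical.
from typing import Any, Callable, Dict, List, Tuple

Signature = Tuple[Tuple[str, str], ...]

def behavioral_signature(fn: Callable[..., Any], inputs: List[tuple]) -> Signature:
    """Record a clone's output (or exception type) for each test input."""
    observations = []
    for args in inputs:
        try:
            # repr() makes unhashable results (e.g. lists) usable as dict keys
            observations.append(("ok", repr(fn(*args))))
        except Exception as exc:  # a crashing clone behaves differently
            observations.append(("error", type(exc).__name__))
    return tuple(observations)

def cluster_by_behavior(candidates: Dict[str, Callable[..., Any]],
                        inputs: List[tuple]) -> List[List[str]]:
    """Group candidate clones with identical input/output behavior."""
    clusters: Dict[Signature, List[str]] = {}
    for name, fn in candidates.items():
        clusters.setdefault(behavioral_signature(fn, inputs), []).append(name)
    return list(clusters.values())

# Hypothetical LLM-generated clones of a "two sum" style LeetCode problem.
def clone_a(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i

def clone_b(nums, target):
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]

def clone_c(nums, target):  # subtly buggy: checks the wrong target
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target + 1:
                return [i, j]

tests = [([2, 7, 11, 15], 9), ([3, 2, 4], 6)]
print(cluster_by_behavior({"a": clone_a, "b": clone_b, "c": clone_c}, tests))
# -> [['a', 'b'], ['c']]: a and b are behaviorally equivalent; c is not.

A clustering like this is language-agnostic in principle: because clones are compared only through their input/output behavior, the same grouping can be applied to C++, Java, and Python candidates that share a test harness.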
Source journal
Journal of Systems and Software (Engineering/Technology - Computer Science: Theory & Methods)
CiteScore: 8.60
Self-citation rate: 5.70%
Annual publications: 193
Review time: 16 weeks
Journal description: The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.