Exploring the potential of general purpose LLMs in automated software refactoring: an empirical study

IF 3.1 2区计算机科学 Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Automated Software Engineering Pub Date : 2025-03-01 DOI:10.1007/s10515-025-00500-0

Bo Liu, Yanjie Jiang, Yuxia Zhang, Nan Niu, Guangjie Li, Hui Liu

{"title":"Exploring the potential of general purpose LLMs in automated software refactoring: an empirical study","authors":"Bo Liu, Yanjie Jiang, Yuxia Zhang, Nan Niu, Guangjie Li, Hui Liu","doi":"10.1007/s10515-025-00500-0","DOIUrl":null,"url":null,"abstract":"<div><p>Software refactoring is an essential activity for improving the readability, maintainability, and reusability of software projects. To this end, a large number of automated or semi-automated approaches/tools have been proposed to locate poorly designed code, recommend refactoring solutions, and conduct specified refactorings. However, even equipped with such tools, it remains challenging for developers to decide where and what kind of refactorings should be applied. Recent advances in deep learning techniques, especially in large language models (LLMs), make it potentially feasible to automatically refactor source code with LLMs. However, it remains unclear how well LLMs perform compared to human experts in conducting refactorings automatically and accurately. To fill this gap, in this paper, we conduct an empirical study to investigate the potential of LLMs in automated software refactoring, focusing on the identification of refactoring opportunities and the recommendation of refactoring solutions. We first construct a high-quality refactoring dataset comprising 180 real-world refactorings from 20 projects, and conduct the empirical study on the dataset. With the to-be-refactored Java documents as input, ChatGPT and Gemini identified only 28 and 7 respectively out of the 180 refactoring opportunities. The evaluation results suggested that the performance of LLMs in identifying refactoring opportunities is generally low and remains an open problem. However, explaining the expected refactoring subcategories and narrowing the search space in the prompts substantially increased the success rate of ChatGPT from 15.6 to 86.7%. Concerning the recommendation of refactoring solutions, ChatGPT recommended 176 refactoring solutions for the 180 refactorings, and 63.6% of the recommended solutions were comparable to (even better than) those constructed by human experts. However, 13 out of the 176 solutions suggested by ChatGPT and 9 out of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors, which indicate the risk of LLM-based refactoring.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Automated Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10515-025-00500-0","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Software refactoring is an essential activity for improving the readability, maintainability, and reusability of software projects. To this end, a large number of automated or semi-automated approaches/tools have been proposed to locate poorly designed code, recommend refactoring solutions, and conduct specified refactorings. However, even equipped with such tools, it remains challenging for developers to decide where and what kind of refactorings should be applied. Recent advances in deep learning techniques, especially in large language models (LLMs), make it potentially feasible to automatically refactor source code with LLMs. However, it remains unclear how well LLMs perform compared to human experts in conducting refactorings automatically and accurately. To fill this gap, in this paper, we conduct an empirical study to investigate the potential of LLMs in automated software refactoring, focusing on the identification of refactoring opportunities and the recommendation of refactoring solutions. We first construct a high-quality refactoring dataset comprising 180 real-world refactorings from 20 projects, and conduct the empirical study on the dataset. With the to-be-refactored Java documents as input, ChatGPT and Gemini identified only 28 and 7 respectively out of the 180 refactoring opportunities. The evaluation results suggested that the performance of LLMs in identifying refactoring opportunities is generally low and remains an open problem. However, explaining the expected refactoring subcategories and narrowing the search space in the prompts substantially increased the success rate of ChatGPT from 15.6 to 86.7%. Concerning the recommendation of refactoring solutions, ChatGPT recommended 176 refactoring solutions for the 180 refactorings, and 63.6% of the recommended solutions were comparable to (even better than) those constructed by human experts. However, 13 out of the 176 solutions suggested by ChatGPT and 9 out of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors, which indicate the risk of LLM-based refactoring.

查看原文本刊更多论文

探索通用法学硕士在自动化软件重构中的潜力：一项实证研究

软件重构是提高软件项目的可读性、可维护性和可重用性的基本活动。为此，已经提出了大量自动化或半自动化的方法/工具来定位设计不良的代码，推荐重构解决方案，并执行指定的重构。然而，即使配备了这样的工具，对于开发人员来说，决定应该在哪里以及应用什么样的重构仍然是一个挑战。深度学习技术的最新进展，特别是在大型语言模型（llm）中，使得使用llm自动重构源代码成为可能。然而，与人类专家相比，llm在自动准确地进行重构方面的表现如何尚不清楚。为了填补这一空白，在本文中，我们进行了一项实证研究，以调查llm在自动化软件重构中的潜力，重点关注重构机会的识别和重构解决方案的推荐。我们首先构建了一个高质量的重构数据集，包括来自20个项目的180个真实重构，并对该数据集进行了实证研究。以待重构的Java文档为输入，ChatGPT和Gemini在180个重构机会中分别只识别了28个和7个。评估结果表明，法学硕士在识别重构机会方面的表现普遍较低，仍然是一个有待解决的问题。然而，在提示中解释预期的重构子类别并缩小搜索空间，大大提高了ChatGPT的成功率，从15.6%提高到86.7%。在重构方案的推荐方面，ChatGPT为180次重构推荐了176个重构方案，其中63.6%的推荐方案与人类专家构建的方案相当（甚至更好）。然而，ChatGPT建议的176个解决方案中的13个和Gemini建议的137个解决方案中的9个是不安全的，因为它们要么改变了源代码的功能，要么引入了语法错误，这表明了基于llm的重构的风险。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Automated Software Engineering 工程技术-计算机：软件工程

CiteScore

4.80

自引率

11.80%

发文量

审稿时长

>12 weeks

期刊介绍： This journal details research, tutorial papers, survey and accounts of significant industrial experience in the foundations, techniques, tools and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.