Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation
Sixiang Ye;Zeyu Sun;Guoqing Wang;Liwei Guo;Qingyuan Liang;Zheng Li;Yong Liu
IEEE Transactions on Software Engineering, vol. 51, no. 9, pp. 2472-2493. Journal Article, published 2025-07-16.
DOI: 10.1109/TSE.2025.3589634
URL: https://ieeexplore.ieee.org/document/11082010/
Impact factor: 5.6. JCR: Q1 (Computer Science, Software Engineering).
Citations: 0
Abstract
Code generation has gained increasing attention as a task to automate software development by transforming high-level descriptions into executable code. While large language models (LLMs) are effective in generating code, their performance relies heavily on the quality of input prompts. Current prompt engineering methods require manual effort to design prompts, which is time-consuming and yields inconsistent results, potentially limiting the efficacy of LLMs in practical applications. This paper introduces Prochemy, a novel approach that automatically and iteratively refines prompts to enhance code generation. Prochemy addresses the limitations of manual prompt engineering by automating the optimization process, ensuring prompt consistency during inference, and aligning with multi-agent systems. It iteratively refines prompts based on model performance, then uses the optimized final prompt to improve consistency and reliability across tasks. We evaluate Prochemy on both natural language-based code generation and code translation tasks using three series of LLMs. Results show that Prochemy, when combined with existing approaches, outperforms baseline prompting methods. It achieves improvements of 5.0% (GPT-3.5-Turbo) and 1.9% (GPT-4o) over zero-shot baselines on HumanEval. For the state-of-the-art LDB, Prochemy + LDB outperforms standalone methods by 1.2–1.8%. For code translation, Prochemy raises GPT-4o’s performance on Java-to-Python (AVATAR) from 74.5 to 84.1 (+12.9% relative) and on Python-to-Java from 66.8 to 78.2 (+17.1% relative). Furthermore, even though the o1-mini model already integrates prompt engineering techniques, Prochemy continues to perform well with it, further validating its effectiveness in code generation and translation tasks. Additionally, Prochemy is designed to be plug-and-play, optimizing prompts with minimal human intervention and bridging the gap between simple prompts and complex frameworks.
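The abstract describes the refinement loop only at a high level (refine prompts iteratively based on model performance, then reuse one final prompt at inference). The following is a minimal sketch of that general idea as a greedy search over prompt variants, not the paper's actual algorithm. Every name here (`refine_prompt`, `toy_generate`, `make_mutator`, `HINTS`, `TASKS`) is hypothetical, and `toy_generate` is a deterministic stand-in for a real LLM call so the example runs offline.

```python
import itertools
from typing import Callable, List, Tuple

# A task is (description, checker); the checker receives generated code.
Task = Tuple[str, Callable[[str], bool]]


def passes(code: str, cases) -> bool:
    """Execute generated code and check function f against (arg, expected) pairs."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        return all(namespace["f"](arg) == expected for arg, expected in cases)
    except Exception:
        return False


def toy_generate(prompt: str, description: str) -> str:
    """Hypothetical stand-in for an LLM: handles empty input only when told to."""
    if description == "sum a list":
        return "def f(xs): return sum(xs)"
    if "edge cases" in prompt:
        return "def f(xs): return max(xs) if xs else None"
    return "def f(xs): return max(xs)"  # raises ValueError on an empty list


def score_prompt(prompt: str, tasks: List[Task], generate) -> float:
    """Fraction of tasks whose generated code passes its checker."""
    passed = sum(1 for desc, check in tasks if check(generate(prompt, desc)))
    return passed / len(tasks)


def refine_prompt(seed, tasks, generate, mutate, rounds=5, pool_size=3):
    """Greedy refinement: keep whichever prompt variant scores best so far."""
    best_prompt, best_score = seed, score_prompt(seed, tasks, generate)
    for _ in range(rounds):
        candidates = [mutate(best_prompt) for _ in range(pool_size)]
        for candidate in candidates:
            score = score_prompt(candidate, tasks, generate)
            if score > best_score:
                best_prompt, best_score = candidate, score
        if best_score == 1.0:  # nothing left to improve on this task set
            break
    return best_prompt, best_score


def make_mutator(hints: List[str]) -> Callable[[str], str]:
    """Deterministic mutator (an assumption): append the next hint in a cycle."""
    hint_cycle = itertools.cycle(hints)
    return lambda prompt: prompt + " " + next(hint_cycle)


HINTS = [
    "Think step by step.",
    "Handle edge cases such as empty input.",
    "Return the result instead of printing it.",
]
TASKS: List[Task] = [
    ("sum a list", lambda code: passes(code, [([1, 2], 3), ([], 0)])),
    ("max of a list, None if empty",
     lambda code: passes(code, [([3, 1], 3), ([], None)])),
]

seed_prompt = "Write a Python function f."
seed_score = score_prompt(seed_prompt, TASKS, toy_generate)
best_prompt, best_score = refine_prompt(
    seed_prompt, TASKS, toy_generate, make_mutator(HINTS)
)
print(seed_score, best_score, best_prompt)
```

In this toy setup the seed prompt solves one of the two tasks, and the search settles on the variant that mentions edge cases. In a real setting, `toy_generate` would be replaced by a model call and `mutate` by a model-driven rewrite of the prompt; the fixed final prompt is then reused unchanged at inference time, which is what gives the consistency the abstract refers to.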
About the journal:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impact on software construction, analysis, or management. The journal's scope extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.