Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation

IF 5.6 · Zone 1 (Computer Science) · JCR Q1 (Computer Science, Software Engineering)

IEEE Transactions on Software Engineering Pub Date : 2025-07-16 DOI:10.1109/TSE.2025.3589634

Sixiang Ye;Zeyu Sun;Guoqing Wang;Liwei Guo;Qingyuan Liang;Zheng Li;Yong Liu

IEEE Transactions on Software Engineering, vol. 51, no. 9, pp. 2472-2493. Journal article; not open access. Full text: https://ieeexplore.ieee.org/document/11082010/

Citations: 0

Abstract

Code generation has gained increasing attention as a task to automate software development by transforming high-level descriptions into executable code. While large language models (LLMs) are effective in generating code, their performance relies heavily on the quality of input prompts. Current prompt engineering methods involve manual effort in designing prompts, which can be time-consuming and yield inconsistent results, potentially constraining the efficacy of LLMs in practical applications. This paper introduces Prochemy, a novel approach that automatically and iteratively refines prompts to enhance code generation. Prochemy addresses the limitations of manual prompt engineering by automating the optimization process, ensuring prompt consistency during inference, and aligning with multi-agent systems. It iteratively refines prompts based on model performance, using an optimized final prompt to improve consistency and reliability across tasks. We evaluate Prochemy on both natural language-based code generation and code translation tasks using three series of LLMs. Results show that Prochemy, when combined with existing approaches, outperforms baseline prompting methods. It achieves improvements of 5.0% (GPT-3.5-Turbo) and 1.9% (GPT-4o) over zero-shot baselines on HumanEval. For the state-of-the-art LDB, Prochemy + LDB outperforms standalone methods by 1.2–1.8%. For code translation, Prochemy elevates GPT-4o’s performance on Java-to-Python (AVATAR) from 74.5 to 84.1 (+12.9%) and Python-to-Java from 66.8 to 78.2 (+17.1%). Furthermore, although the o1-mini model already integrates prompt engineering techniques, Prochemy continues to perform well with it, further validating its effectiveness in code generation and translation tasks. Additionally, Prochemy is designed to be plug-and-play, optimizing prompts with minimal human intervention and seamlessly bridging the gap between simple prompts and complex frameworks.
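The abstract describes the refinement only at a high level: prompts are refined iteratively based on model performance, and a single optimized prompt is then reused at inference time. The following is a minimal sketch of one way such a loop could look, not the authors' implementation. The function names (`refine_prompt`, `toy_mutate`, `toy_score`), the greedy keep-the-best strategy, and the early-stopping rule are all assumptions made for illustration. In a real setting, `mutate` would presumably call an LLM to rewrite the prompt and `score` would run generated code against test cases; here both are replaced by deterministic toy stand-ins so the example runs on its own.

```python
from typing import Callable, List, Tuple


def refine_prompt(
    seed_prompt: str,
    mutate: Callable[[str], List[str]],
    score: Callable[[str], float],
    max_rounds: int = 5,
) -> Tuple[str, float]:
    """Greedy iterative refinement (illustrative assumption, not the paper's algorithm).

    Each round proposes candidate rewrites of the current best prompt,
    scores them, and keeps the best one only if it beats the incumbent.
    """
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(max_rounds):
        candidates = mutate(best_prompt)
        if not candidates:
            break  # nothing left to try
        scored = [(score(c), c) for c in candidates]
        round_score, round_prompt = max(scored, key=lambda pair: pair[0])
        if round_score <= best_score:
            break  # no candidate improved on the incumbent; stop early
        best_prompt, best_score = round_prompt, round_score
    return best_prompt, best_score


# Hypothetical hints a rewriter might add to a prompt.
HINTS = ["Think step by step.", "Handle edge cases.", "Return only code."]


def toy_mutate(prompt: str) -> List[str]:
    # Stand-in for an LLM-based rewriter: append one hint not yet present.
    return [f"{prompt} {hint}" for hint in HINTS if hint not in prompt]


def toy_score(prompt: str) -> float:
    # Stand-in for evaluating generated code on a small task set:
    # the fraction of hints the prompt contains.
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


final_prompt, final_score = refine_prompt(
    "Write a Python function for the task.", toy_mutate, toy_score
)
print(final_prompt)
print(final_score)
```

The optimized prompt returned by the loop is fixed once and reused for every task, which is how this sketch reflects the abstract's point about prompt consistency during inference. Passing `mutate` and `score` as callables keeps the loop independent of any particular model or benchmark.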




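Source journal

IEEE Transactions on Software Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.70

Self-citation rate: 10.80%

Articles published: 724

Review time: 6 months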

About the journal: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:

a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.

b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.

c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.

d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.

e) System issues: Hardware-software trade-offs.

f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.