{"title":"Directional Diffusion-Style Code Editing Pre-Training","authors":"Qingyuan Liang;Zeyu Sun;Qihao Zhu;Junhao Hu;Yifan Zhao;Yizhou Chen;Mingxuan Zhu;Guoqing Wang;Lu Zhang","doi":"10.1109/TSE.2025.3592841","DOIUrl":null,"url":null,"abstract":"Code pre-trained models have shown promising effectiveness in various software engineering tasks. Among these tasks, many tasks are related to software evolution and/or code editing. However, existing code pre-trained models often overlook the real-world code editing data and the evolutionary nature of the editing process. In this paper, to simulate the step-by-step code editing process of human developers, we propose DivoT5, a pre-trained model based on directional diffusion at the data level. In DivoT5, we adopt two categories of pre-training tasks. The first category is mask and denoising tasks augmented with a diffusion direction representing code evolution. That is, we first apply a noising process to the code snippets before evolution, and then ask the pre-training process to restore the snippets with noise into the code snippets after evolution. The second category is tasks aiming to reinforce the evolutionary direction. That is, we first generate various intermediate versions for each pair of snippets before and after evolution, and then ask the pre-training process to transform the intermediate versions into the snippet after evolution for each pair. We evaluate DivoT5 for two code-editing scenarios (including a number of tasks) and one non-editing scenario using four downstream tasks. For each downstream task, we fine-tune the pre-trained DivoT5 on multiple corresponding datasets and evaluate its effectiveness across diverse scenarios Our experimental results show that ivoT5 achieves state-of-the-art (SOTA) performance on most tasks in comparison to models of the same scale (220M), large-scale (770M, 6.7B) models in fine-tuning, and billion-scale (6.7B, 8B, ChatGPT) instruct models in few-shot settings. For one code-editing task (i.e., CodeReview in NL-based CodeRefinement task), DivoT5 pre-trained on top of CodeT5-small (60M) can even outperform CodeT5-base (220M) and other pre-trained models with 220M parameters except for DivoT5 pre-trained on top of CodeT5-base (220M).","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 9","pages":"2583-2600"},"PeriodicalIF":5.6000,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11096907/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
Code pre-trained models have shown promising effectiveness in various software engineering tasks, many of which are related to software evolution and/or code editing. However, existing code pre-trained models often overlook real-world code editing data and the evolutionary nature of the editing process. In this paper, to simulate the step-by-step code editing process of human developers, we propose DivoT5, a pre-trained model based on directional diffusion at the data level. In DivoT5, we adopt two categories of pre-training tasks. The first category consists of mask and denoising tasks augmented with a diffusion direction representing code evolution. That is, we first apply a noising process to the code snippets before evolution, and then ask the pre-training process to restore the noised snippets into the code snippets after evolution. The second category consists of tasks aiming to reinforce the evolutionary direction. That is, we first generate various intermediate versions for each pair of snippets before and after evolution, and then ask the pre-training process to transform the intermediate versions into the snippet after evolution for each pair. We evaluate DivoT5 on two code-editing scenarios (including a number of tasks) and one non-editing scenario using four downstream tasks. For each downstream task, we fine-tune the pre-trained DivoT5 on multiple corresponding datasets and evaluate its effectiveness across diverse scenarios. Our experimental results show that DivoT5 achieves state-of-the-art (SOTA) performance on most tasks in comparison to models of the same scale (220M) and larger-scale (770M, 6.7B) models under fine-tuning, as well as billion-scale (6.7B, 8B, ChatGPT) instruction-tuned models in few-shot settings. For one code-editing task (i.e., CodeReview in the NL-based CodeRefinement task), DivoT5 pre-trained on top of CodeT5-small (60M) can even outperform CodeT5-base (220M) and other pre-trained models with 220M parameters, except for DivoT5 pre-trained on top of CodeT5-base (220M).
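To make the two categories of pre-training data more concrete, the sketch below illustrates how such directional training pairs could be constructed. This is a minimal, hypothetical illustration, not the authors' implementation: the token-level masking, the mask sentinel, the mask/apply probabilities, and the difflib-based extraction of edit operations are all assumptions made for the example. In both cases the training target is the post-edit snippet, which is what gives the data its evolutionary direction.

```python
import random
import difflib

MASK_TOKEN = "<mask>"  # hypothetical sentinel; DivoT5 builds on CodeT5, which has its own sentinel tokens


def noise_pre_edit(pre_edit_tokens, mask_rate=0.15):
    """Category 1 (directional denoising): corrupt the *pre-edit* code with
    random token masking. The training target is the *post-edit* code, so the
    model must denoise and simultaneously move along the evolution direction."""
    return [MASK_TOKEN if random.random() < mask_rate else tok
            for tok in pre_edit_tokens]


def intermediate_versions(pre_edit_tokens, post_edit_tokens, n_versions=3, apply_prob=0.5):
    """Category 2 (reinforcing the evolutionary direction): build intermediate
    versions by applying only a random subset of the edit operations that turn
    the pre-edit code into the post-edit code. Each intermediate version is
    paired with the post-edit code as its training target."""
    ops = difflib.SequenceMatcher(None, pre_edit_tokens, post_edit_tokens).get_opcodes()
    versions = []
    for _ in range(n_versions):
        out = []
        for tag, i1, i2, j1, j2 in ops:
            if tag == "equal":
                out.extend(pre_edit_tokens[i1:i2])    # unchanged region
            elif random.random() < apply_prob:
                out.extend(post_edit_tokens[j1:j2])   # apply this edit
            else:
                out.extend(pre_edit_tokens[i1:i2])    # defer this edit for now
        versions.append(out)
    return versions


if __name__ == "__main__":
    pre = "def add ( a , b ) : return a".split()
    post = "def add ( a , b ) : return a + b".split()
    print(" ".join(noise_pre_edit(pre)))
    for v in intermediate_versions(pre, post):
        print(" ".join(v))
```

Each generated pair (noised or intermediate source, post-edit target) would then be fed to a sequence-to-sequence model in the usual encoder-decoder fashion; the specific noising schedule and the number of intermediate versions per edit pair are choices the paper itself would specify.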
Journal Introduction:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.