An empirical study of best practices for code pre-trained models on software engineering classification tasks

IF 7.5 · Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yu Zhao, Lina Gong, Yaoshen Yu, Zhiqiu Huang, Mingqiang Wei
{"title":"软件工程分类任务中代码预训练模型最佳实践的实证研究","authors":"Yu Zhao,&nbsp;Lina Gong,&nbsp;Yaoshen Yu,&nbsp;Zhiqiu Huang,&nbsp;Mingqiang Wei","doi":"10.1016/j.eswa.2025.126762","DOIUrl":null,"url":null,"abstract":"<div><div>Tackling code-specific classification challenges like detecting code vulnerabilities and identifying code clones is pivotal in software engineering (SE) practice. The utilization of pre-trained models (PTMs) from the natural language processing (NLP) field shows profound benefits in text classification by generating contextual token embeddings. Similarly, for code-specific classification tasks, there is a growing trend among researchers and practitioners to leverage code-oriented PTMs to create embeddings for code snippets or directly apply the code PTMs to the downstream tasks based on the pre-training and fine-tuning paradigm. Nonetheless, we observe that SE researchers and practitioners often treat the code and text in the same way as NLP strategies when employing these code PTMs. However, despite previous studies in the SE field indicating similarities between programming languages and natural languages, it may not be entirely appropriate for current researchers to directly apply NLP knowledge to assume similar behavior in code. Therefore, in order to derive best practices for researchers and practitioners to use code PTMs for SE classification tasks, we first conduct an empirical analysis on six distinct code PTMs, namely CodeBERT, StarEncoder, CodeT5, PLBART, CodeGPT, and CodeGen, across three architectural frameworks (encoder-only, decoder-only, and encoder–decoder) in the context of four SE classification tasks: code vulnerability detection, code clone identification, just-in-time defect prediction, and function docstring mismatch detection under two scenarios of code embedding and task model. Our findings reveal several insights on the use of code PTMs for code-specific classification tasks endeavors: (1) Emphasizing the vector representation of individual code tokens leads to better code embedding quality and task model performance than those generated through specific tokens techniques in both the code embedding scenario and task model scenario. (2) Larger-sized code PTMs do not necessarily lead to superior code embedding quality in the code embedding scenario and better task performance in the task model scenario. (3) Adopting the ways to handle code and text data same as the pre-training phrase cannot guarantee the acquisition of high-quality code embeddings in the code embedding scenario while in the task model scenario, it can most likely acquire better task performance.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"272 ","pages":"Article 126762"},"PeriodicalIF":7.5000,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An empirical study of best practices for code pre-trained models on software engineering classification tasks\",\"authors\":\"Yu Zhao,&nbsp;Lina Gong,&nbsp;Yaoshen Yu,&nbsp;Zhiqiu Huang,&nbsp;Mingqiang Wei\",\"doi\":\"10.1016/j.eswa.2025.126762\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Tackling code-specific classification challenges like detecting code vulnerabilities and identifying code clones is pivotal in software engineering (SE) practice. 
The utilization of pre-trained models (PTMs) from the natural language processing (NLP) field shows profound benefits in text classification by generating contextual token embeddings. Similarly, for code-specific classification tasks, there is a growing trend among researchers and practitioners to leverage code-oriented PTMs to create embeddings for code snippets or directly apply the code PTMs to the downstream tasks based on the pre-training and fine-tuning paradigm. Nonetheless, we observe that SE researchers and practitioners often treat the code and text in the same way as NLP strategies when employing these code PTMs. However, despite previous studies in the SE field indicating similarities between programming languages and natural languages, it may not be entirely appropriate for current researchers to directly apply NLP knowledge to assume similar behavior in code. Therefore, in order to derive best practices for researchers and practitioners to use code PTMs for SE classification tasks, we first conduct an empirical analysis on six distinct code PTMs, namely CodeBERT, StarEncoder, CodeT5, PLBART, CodeGPT, and CodeGen, across three architectural frameworks (encoder-only, decoder-only, and encoder–decoder) in the context of four SE classification tasks: code vulnerability detection, code clone identification, just-in-time defect prediction, and function docstring mismatch detection under two scenarios of code embedding and task model. Our findings reveal several insights on the use of code PTMs for code-specific classification tasks endeavors: (1) Emphasizing the vector representation of individual code tokens leads to better code embedding quality and task model performance than those generated through specific tokens techniques in both the code embedding scenario and task model scenario. (2) Larger-sized code PTMs do not necessarily lead to superior code embedding quality in the code embedding scenario and better task performance in the task model scenario. (3) Adopting the ways to handle code and text data same as the pre-training phrase cannot guarantee the acquisition of high-quality code embeddings in the code embedding scenario while in the task model scenario, it can most likely acquire better task performance.</div></div>\",\"PeriodicalId\":50461,\"journal\":{\"name\":\"Expert Systems with Applications\",\"volume\":\"272 \",\"pages\":\"Article 126762\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-02-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Systems with Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0957417425003847\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425003847","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Tackling code-specific classification challenges such as detecting code vulnerabilities and identifying code clones is pivotal in software engineering (SE) practice. Pre-trained models (PTMs) from the natural language processing (NLP) field have shown substantial benefits in text classification by generating contextual token embeddings. Similarly, for code-specific classification tasks, researchers and practitioners increasingly leverage code-oriented PTMs either to create embeddings for code snippets or to apply the code PTMs directly to downstream tasks under the pre-training and fine-tuning paradigm. We observe, however, that SE researchers and practitioners often handle code and text exactly as NLP strategies would when employing these code PTMs. Although previous SE studies indicate similarities between programming languages and natural languages, it may not be entirely appropriate to transfer NLP practices directly and assume code will behave the same way. Therefore, to derive best practices for researchers and practitioners using code PTMs on SE classification tasks, we conduct an empirical analysis of six distinct code PTMs, namely CodeBERT, StarEncoder, CodeT5, PLBART, CodeGPT, and CodeGen, spanning three architectural frameworks (encoder-only, decoder-only, and encoder–decoder), on four SE classification tasks: code vulnerability detection, code clone identification, just-in-time defect prediction, and function–docstring mismatch detection, under two usage scenarios: code embedding and task model. Our findings reveal several insights on the use of code PTMs for code-specific classification tasks: (1) Emphasizing the vector representations of individual code tokens yields better code embedding quality and task-model performance than embeddings generated through specific-token techniques, in both the code embedding scenario and the task model scenario. (2) Larger code PTMs do not necessarily produce higher-quality code embeddings in the code embedding scenario or better task performance in the task model scenario. (3) Handling code and text data in the same way as during the pre-training phase does not guarantee high-quality code embeddings in the code embedding scenario, whereas in the task model scenario it most likely leads to better task performance.
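To make finding (1) concrete, the sketch below contrasts the two embedding strategies in the code embedding scenario: averaging the vectors of individual code tokens versus taking a single designated (CLS-style) token's vector as the snippet representation. This is a minimal illustration, assuming the Hugging Face transformers library and the public microsoft/codebert-base checkpoint; the pooling choices and variable names are illustrative and not the paper's exact pipeline.

```python
# Minimal sketch: two ways to obtain a code-snippet embedding from an
# encoder-only code PTM (assumes Hugging Face transformers and the public
# microsoft/codebert-base checkpoint; illustrative only).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
hidden = outputs.last_hidden_state            # (1, seq_len, hidden_size)

# Strategy A: emphasize individual code tokens -- average all token vectors,
# masking out padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
mean_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Strategy B: a "specific token" embedding -- use only the first
# (CLS-style) token's vector as the whole-snippet representation.
cls_embedding = hidden[:, 0, :]

print(mean_embedding.shape, cls_embedding.shape)  # both (1, 768)
```

Per finding (1), a snippet representation built from the individual token vectors (Strategy A) would be expected to serve downstream classifiers better than the single-token representation (Strategy B), in both the code embedding and the task model scenario.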
Source journal
Expert Systems with Applications
Category: Engineering & Technology – Engineering, Electrical & Electronic
CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months
Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.