Applying large language models to issue classification: Revisiting with extended data and new models

IF 1.4 · Zone 4 (Computer Science) · JCR Q3 (Computer Science, Software Engineering)

Science of Computer Programming · Pub Date: 2025-05-20 · DOI: 10.1016/j.scico.2025.103333

Gabriel Aracena, Kyle Luster, Fabio Santos, Igor Steinmacher, Marco A. Gerosa

Volume 246, Article 103333 · https://www.sciencedirect.com/science/article/pii/S0167642325000723

Citations: 0

Abstract



Effective prioritization of issue reports in software engineering helps to optimize resource allocation and information recovery. However, manual issue classification is laborious and lacks scalability. As an alternative, many open source software (OSS) projects employ automated processes for this task, yet this method often relies on large datasets for adequate training. Traditionally, machine learning techniques have been used for issue classification. More recently, large language models (LLMs) have emerged as powerful tools for addressing a range of software engineering challenges, including code and test generation, mapping new requirements to legacy software endpoints, and conducting code reviews. The following research investigates an automated approach to issue classification based on LLMs. By leveraging the capabilities of such models, we aim to develop a robust system for prioritizing issue reports, mitigating the necessity for extensive training data while maintaining classification reliability. In our research, we developed an LLM-based approach for accurately labeling issues by selecting two of the most prominent large language models. We then compared their performance across multiple datasets. Our findings show that GPT-4o achieved the best results in classifying issues from the NLBSE 2024 competition. Moreover, GPT-4o outperformed DeepSeek R1, achieving an F1 score 20% higher when both models were trained on the same dataset from the NLBSE 2023 competition, which was ten times larger than the NLBSE 2024 dataset. The fine-tuned GPT-4o model attained an average F1 score of 80.7%, while the fine-tuned DeepSeek R1 model achieved 59.33%. Increasing the dataset size did not improve the F1 score, reducing the dependence on massive datasets for building an efficient solution to issue classification. Notably, in individual repositories, some of our models predicted issue labels with a precision greater than 98%, a recall of 97%, and an F1 score of 90%.
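The abstract reports precision, recall, and F1 figures without showing how they are computed. The following is a minimal sketch of the standard per-label and macro-averaged definitions, not the authors' evaluation code. The label names (`bug`, `feature`, `question`), the toy predictions, and the helper names `per_label_scores` and `macro_f1` are assumptions made for illustration only.

```python
from collections import Counter


def per_label_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical toy data: gold labels versus model predictions.
y_true = ["bug", "bug", "feature", "question", "feature", "bug"]
y_pred = ["bug", "feature", "feature", "question", "feature", "bug"]

scores = per_label_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"macro F1 = {macro_f1(scores):.3f}")
```

Because F1 is a harmonic mean, it always lies between precision and recall: a single model with precision 0.98 and recall 0.97 would score about 0.975. The three per-repository figures quoted at the end of the abstract are therefore presumably best values taken from different models or repositories rather than one joint result.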


Source journal

Science of Computer Programming (Engineering and Technology: Computer Science, Software Engineering)

CiteScore: 3.80

Self-citation rate: 0.00%

Publication volume: (not listed)

Review time: 67 days

About the journal: Science of Computer Programming is dedicated to the distribution of research results in the areas of software systems development, use and maintenance, including the software aspects of hardware design. The journal has a wide scope ranging from the many facets of methodological foundations to the details of technical issues and the aspects of industrial practice. The subjects of interest to SCP cover the entire spectrum of methods for the entire life cycle of software systems, including:
• Requirements, specification, design, validation, verification, coding, testing, maintenance, metrics and renovation of software;
• Design, implementation and evaluation of programming languages;
• Programming environments, development tools, visualisation and animation;
• Management of the development process;
• Human factors in software, software for social interaction, software for social computing;
• Cyber physical systems, and software for the interaction between the physical and the machine;
• Software aspects of infrastructure services, system administration, and network management.