GPTs are not the silver bullet: Performance and challenges of using GPTs for security bug report identification

Impact factor: 4.3; Zone 2 (Computer Science); JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Information and Software Technology. Pub Date: 2025-05-17. DOI: 10.1016/j.infsof.2025.107778

Horácio L. França, Katerina Goseva-Popstojanova, César Teixeira, Nuno Laranjeiro

Information and Software Technology, volume 185, Article 107778. https://www.sciencedirect.com/science/article/pii/S095058492500117X

Citations: 0

Abstract

Context:

Identifying security bugs in software is critical to minimizing vulnerability windows. Traditionally, bug reports are submitted through issue trackers and analyzed manually, which is time-consuming. Challenges such as data scarcity and class imbalance generally hinder the development of effective machine learning models that could automate this task. Generative Pre-trained Transformer (GPT) models do not require training and are less affected by the imbalance problem. They have therefore gained popularity for various text-based classification tasks and appear to be a natural, highly promising solution to this problem.

Objective:

This paper explores the potential of using GPT models to identify security bug reports from the perspective of a user of this type of model. We aim to assess their classification performance on this task compared to traditional machine learning (ML) methods, while also investigating how different factors, such as the prompt used and the characteristics of the datasets, affect their results.

Methods:

We evaluate the performance of four state-of-the-art GPT models (i.e., GPT4All-Falcon, Wizard, Instruct, OpenOrca) on the task of security bug report identification. We use three different prompts for each GPT model and compare the results with traditional ML models. The empirical results are based on using bug report data from seven projects (i.e., Ambari, Camel, Derby, Wicket, Nova, OpenStack, and Ubuntu).
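The evaluation setup described above (several prompts per model, each scored against labeled bug reports) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' harness: the prompt wording, the `generate` callable, the yes/no answer parsing, and the choice of precision, recall, and F1 as metrics are all hypothetical, since the abstract does not specify them.

```python
from typing import Callable, Dict, Sequence

# Hypothetical prompt templates; the paper's actual prompts are not given in the abstract.
PROMPTS: Dict[str, str] = {
    "direct": "Is the following bug report security-related? Answer yes or no.\n\n{report}",
    "role": "You are a security analyst. Answer yes or no: does this bug report "
            "describe a security issue?\n\n{report}",
}


def parse_answer(reply: str) -> bool:
    """Map a free-text model reply to a binary label (True = security bug report)."""
    return reply.strip().lower().startswith("yes")


def evaluate(generate: Callable[[str], str], template: str,
             reports: Sequence[str], labels: Sequence[bool]) -> Dict[str, float]:
    """Score one (model, prompt) pair. `generate` stands in for any local model call."""
    tp = fp = fn = 0
    for report, is_security in zip(reports, labels):
        predicted = parse_answer(generate(template.format(report=report)))
        if predicted and is_security:
            tp += 1
        elif predicted and not is_security:
            fp += 1
        elif not predicted and is_security:
            fn += 1
    # Guard against division by zero when a model never answers "yes".
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def keyword_stub(prompt: str) -> str:
    """Toy stand-in for a GPT model so the sketch runs without any model installed."""
    return "yes" if "overflow" in prompt.lower() else "no"


# Tiny made-up sample; real datasets are far larger and heavily imbalanced.
reports = ["Buffer overflow in parser", "Typo in docs",
           "SQL injection in login form", "Button misaligned"]
labels = [True, False, True, False]

results = {name: evaluate(keyword_stub, tpl, reports, labels)
           for name, tpl in PROMPTS.items()}
```

Swapping `keyword_stub` for a function that calls an actual model, and looping over models and datasets as well as prompts, would give the kind of model-by-prompt-by-dataset comparison the study reports.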

Results:

GPT models show noticeable difficulties in identifying security bug reports, with performance levels generally lower than those of traditional ML models. The effectiveness of the GPT models is quite variable, depending on the specific model and prompt used, as well as on the particular dataset.

Conclusion:

Although GPT models are now used in many types of tasks, including classification, their current performance in security bug report identification is surprisingly insufficient and inferior to that of traditional ML models. Further research is needed to address the challenges identified in this paper before GPT models can be applied effectively in this domain.


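Source journal

Information and Software Technology (Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 9.10

Self-citation rate: 7.70%

Articles published: 164

Review time: 9.6 weeks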

Journal description: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:

• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.