GPTs are not the silver bullet: Performance and challenges of using GPTs for security bug report identification

IF 4.3 · CAS Tier 2 (Computer Science) · JCR Q2 (Computer Science, Information Systems)
Horácio L. França, Katerina Goseva-Popstojanova, César Teixeira, Nuno Laranjeiro
{"title":"GPTs are not the silver bullet: Performance and challenges of using GPTs for security bug report identification","authors":"Horácio L. França ,&nbsp;Katerina Goseva-Popstojanova ,&nbsp;César Teixeira ,&nbsp;Nuno Laranjeiro","doi":"10.1016/j.infsof.2025.107778","DOIUrl":null,"url":null,"abstract":"<div><h3>Context:</h3><div>Identifying security bugs in software is critical to minimize vulnerability windows. Traditionally, bug reports are submitted through issue trackers and manually analyzed, which is time-consuming. Challenges such as data scarcity and imbalance generally hinder the development of effective machine learning models that could be used to automate this task. Generative Pre-trained Transformer (GPT) models do not require training and are less affected by the imbalance problem. Therefore, they have gained popularity for various text-based classification tasks, apparently becoming a natural highly promising solution for this problem.</div></div><div><h3>Objective:</h3><div>This paper explores the potential of using GPT models to identify security bug reports from the perspective of a user of this type of models. We aim to assess their classification performance in this task compared to traditional machine learning (ML) methods, while also investigating how different factors, such as the prompt used and datasets’ characteristics, affect their results.</div></div><div><h3>Methods:</h3><div>We evaluate the performance of four state-of-the-art GPT models (i.e., GPT4All-Falcon, Wizard, Instruct, OpenOrca) on the task of security bug report identification. We use three different prompts for each GPT model and compare the results with traditional ML models. The empirical results are based on using bug report data from seven projects (i.e., Ambari, Camel, Derby, Wicket, Nova, OpenStack, and Ubuntu).</div></div><div><h3>Results:</h3><div>GPT models show noticeable difficulties in identifying security bug reports, with performance levels generally lower than traditional ML models. The effectiveness of the GPT models is quite variable, depending on the specific model and prompt used, as well as the particular dataset.</div></div><div><h3>Conclusion:</h3><div>Although GPT models are nowadays used in many types of tasks, including classification, their current performance in security bug report identification is surprisingly insufficient and inferior to traditional ML models. Further research is needed to address the challenges identified in this paper in order to effectively apply GPT models to this particular domain.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107778"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Software Technology","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095058492500117X","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Context:

Identifying security bugs in software is critical to minimizing vulnerability windows. Traditionally, bug reports are submitted through issue trackers and analyzed manually, which is time-consuming. Challenges such as data scarcity and class imbalance generally hinder the development of effective machine learning models that could automate this task. Generative Pre-trained Transformer (GPT) models do not require task-specific training and are less affected by the imbalance problem. They have therefore gained popularity for various text-based classification tasks, making them appear to be a natural and highly promising solution to this problem.

Objective:

This paper explores the potential of GPT models for identifying security bug reports from the perspective of a user of this type of model. We aim to assess their classification performance on this task compared to traditional machine learning (ML) methods, while also investigating how different factors, such as the prompt used and the datasets' characteristics, affect their results.

Methods:

We evaluate the performance of four state-of-the-art GPT models (GPT4All-Falcon, Wizard, Instruct, and OpenOrca) on the task of security bug report identification. We use three different prompts for each GPT model and compare the results with those of traditional ML models. The empirical results are based on bug report data from seven projects (Ambari, Camel, Derby, Wicket, Nova, OpenStack, and Ubuntu).
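The abstract does not reproduce the exact prompts or tooling used in the study. As a rough illustration only, the sketch below shows what prompt-based (zero-shot) classification of a bug report with a locally run GPT4All model could look like, assuming the gpt4all Python bindings; the model file name, the three prompt templates, and the naive answer parsing are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch: zero-shot security bug report classification with a local
# GPT4All model. Model file name, prompts, and answer parsing are assumptions.
from gpt4all import GPT4All

# Any locally available GPT4All model could be substituted here (assumed file name).
model = GPT4All("gpt4all-falcon-newbpe-q4_0.gguf")

# Three illustrative prompt variants (the paper evaluates three prompts per
# model, but their exact wording is not given in the abstract).
PROMPTS = [
    "Is the following bug report security-related? Answer yes or no.\n\n{report}",
    "You are a security analyst. Does this bug report describe a security "
    "issue? Answer yes or no.\n\n{report}",
    "Classify the bug report below as SECURITY or NOT SECURITY.\n\n{report}",
]


def classify(report_text: str, prompt_template: str) -> bool:
    """Return True if the model labels the report as security-related."""
    prompt = prompt_template.format(report=report_text)
    with model.chat_session():
        answer = model.generate(prompt, max_tokens=8, temp=0.0)
    answer = answer.strip().lower()
    # Naive parsing of the free-form answer; real pipelines need more care.
    return answer.startswith("yes") or answer.startswith("security")
```

Because no fine-tuning is involved, each report is classified directly from the prompt; in practice the free-form answers usually need more robust parsing than the single prefix check used here.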

Results:

GPT models show noticeable difficulties in identifying security bug reports, with performance generally lower than that of traditional ML models. The effectiveness of the GPT models varies considerably, depending on the specific model and prompt used, as well as on the particular dataset.
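To make such a comparison concrete, the following sketch shows one way it could be organized: a simple traditional ML baseline (TF-IDF features with class-weighted logistic regression, a common choice for imbalanced text classification but not necessarily the baseline used in the paper) is trained on a labeled split, while the prompted GPT classifier is applied directly to the held-out reports, and both are scored with precision, recall, and F1. The function name, the 70/30 split, and the metric choice are assumptions for illustration.

```python
# Hedged sketch: comparing a traditional ML baseline with a prompted GPT
# classifier on one bug-report dataset. Baseline, split, and metrics are
# illustrative assumptions, not the paper's exact pipeline.
from typing import Callable

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split


def compare_on_dataset(
    reports: list[str],                  # bug report texts from one project
    labels: list[int],                   # 1 = security bug report, 0 = otherwise
    gpt_classify: Callable[[str], int],  # wraps a prompted GPT model
) -> None:
    """Print precision/recall/F1 for a TF-IDF baseline and a GPT classifier."""
    X_train, X_test, y_train, y_test = train_test_split(
        reports, labels, test_size=0.3, stratify=labels, random_state=42
    )

    # Traditional ML baseline: TF-IDF features + class-weighted logistic
    # regression (class weighting mitigates the security/non-security imbalance).
    vec = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)
    baseline_preds = clf.predict(vec.transform(X_test))

    # GPT model: no training step, each held-out report is classified directly.
    gpt_preds = [gpt_classify(r) for r in X_test]

    for name, preds in [("TF-IDF + LogReg", baseline_preds), ("GPT", gpt_preds)]:
        p, r, f1, _ = precision_recall_fscore_support(
            y_test, preds, average="binary", zero_division=0
        )
        print(f"{name}: precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

With the earlier sketch, the GPT side could be passed as `lambda r: int(classify(r, PROMPTS[0]))` and repeated per model and prompt, which is the kind of grid that exposes the variability across models, prompts, and datasets noted above.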

Conclusion:

Although GPT models are nowadays used for many types of tasks, including classification, their current performance in security bug report identification is surprisingly insufficient and inferior to that of traditional ML models. Further research is needed to address the challenges identified in this paper in order to effectively apply GPT models to this particular domain.
Source journal
Information and Software Technology (Engineering & Technology - Computer Science: Software Engineering)
CiteScore: 9.10
Self-citation rate: 7.70%
Annual publications: 164
Review time: 9.6 weeks
Journal description: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premiere outlet for systematic literature studies in software engineering.