Information and Software Technology: Latest Articles

Assessing output reliability and similarity of large language models in software development: A comparative case study approach
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-29 DOI: 10.1016/j.infsof.2025.107787
Dae-Kyoo Kim, Hua Ming
Context: Generative large language models (LLMs) are increasingly used across various activities in software development, offering significant potential to enhance productivity. However, systematic studies examining the reliability and similarity of the outputs of these models are lacking.
Objective: This work presents a comparative analysis of the reliability (defined as the consistency and correctness of software artifacts) and similarity of LLM outputs in software development.
Method: To accomplish the objective, we introduce a structured approach for assessing the reliability and similarity of outputs from five prominent LLMs (ChatGPT, Claude, Copilot, Gemini, and Meta) and apply it within two case studies: developing a food order and delivery system and a smart wallet system.
Results: The overall output reliability of the models is rated at 0.82, with Claude outperforming the other models at 0.92, followed by ChatGPT at 0.90, Copilot at 0.80, Meta at 0.75, and Gemini at 0.71. The models demonstrated 57% overall similarity and 43% variability in their outputs, highlighting the uniqueness of each model.
Conclusions: While LLMs overall exhibit decent output reliability, to varying degrees, they still require human oversight and review of their outputs before implementation. LLMs present unique characteristics that practitioners should consider before adoption.
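The headline figure can be reproduced from the per-model scores the abstract reports; a minimal sketch (model names and scores are taken from the abstract, aggregation by unweighted averaging is an assumption):

```python
# Per-model reliability scores as reported in the abstract.
scores = {"Claude": 0.92, "ChatGPT": 0.90, "Copilot": 0.80, "Meta": 0.75, "Gemini": 0.71}

# Assumed aggregation: the overall rating is the unweighted mean across models.
overall = round(sum(scores.values()) / len(scores), 2)

# Rank models from most to least reliable.
ranking = sorted(scores, key=scores.get, reverse=True)

print(overall)  # 0.82, matching the overall rating in the abstract
print(ranking)
```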
Citations: 0
Generating vulnerability security fixes with Code Language Models
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-28 DOI: 10.1016/j.infsof.2025.107786
Guru Bhandari, Nikola Gavric, Andrii Shalaginov
Existing Code Language Models (CLMs) have demonstrated significant potential in several coding tasks, including automated code generation in software engineering. Similarly, Automated Program Repair (APR) has shown considerable progress in addressing general software vulnerabilities, yet its application of recently developed CLMs remains unexplored. In this paper, we introduce the Patch Language Model (PatchLM), a novel CLM fine-tuned to fix security vulnerabilities in code blocks retrieved from commit hunks associated with Common Vulnerabilities and Exposures (CVE) records. Our proposed model leverages CLMs to understand secure coding practices and generate accurate patches. The study aims to address the diverse nature of security flaws across multiple programming languages. Our experimental evaluation demonstrated that PatchLM significantly outperforms the baseline CodeT5 and CodeLlama models in generating effective security patches, as reflected in the performance metrics: PatchLM achieves improvements of up to 48.35% in CodeBLEU and 28.9% in ROUGE scores compared to the baseline models. Our study demonstrates the practicality and significance of PatchLM in generating vulnerability repairs, providing valuable support for under-resourced security analysts and paving the way for future research in automated vulnerability fixing with CLMs.
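The reported gains follow the usual relative-improvement formula over a baseline; a small sketch (the helper name and the sample metric values are illustrative, not from the paper's data):

```python
def relative_improvement(model_score: float, baseline_score: float) -> float:
    """Percentage gain of a model's metric over a baseline (standard relative gain)."""
    return (model_score - baseline_score) / baseline_score * 100

# Illustrative values only: a CodeBLEU of 0.445 against a baseline of 0.300
# corresponds to a ~48.3% relative improvement, the magnitude reported for PatchLM.
gain = relative_improvement(0.445, 0.300)
print(f"{gain:.1f}%")  # 48.3%
```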
Citations: 0
A systematic review of shared personal informatics
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-27 DOI: 10.1016/j.infsof.2025.107759
Mengru Xue, Pengcheng An, Zengrong Guo, Rong-Hao Liang, Jun Hu, Preben Hansen, Loe Feijs
Personal informatics (PI) has gained great attention and become ubiquitous in people's everyday lives. Although an increasing number of studies set out to explore the social aspects of PI, there remains an unaddressed opportunity for a structured, systematic review to understand why and how the sharing happened, to inform future design and research. This systematic review summarizes the last 13 years of research on the diverse cases of shared PI practice from ACM, PubMed, and IEEE. 100 papers were analyzed, and four types of sharing were identified: interpersonal targeting, public broadcasting, group monitoring, and community exchanging. Notably, sharing extends beyond data exchange, evolving into a collaborative process across different PI stages. The review offers a taxonomy of shared PI practices and delineates design possibilities, facilitating future exploration in the field. Additionally, it identifies trends and patterns within existing work, suggesting design opportunities for future explorations of shared personal informatics.
Citations: 0
How do practitioners gain confidence in assurance cases?
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-26 DOI: 10.1016/j.infsof.2025.107767
Simon Diemert, Caleb Shortt, Jens H. Weber
CONTEXT: Assurance Cases (ACs) are prepared to argue that a system's desired quality attributes (e.g., safety or security) are satisfied. While ACs are widely adopted, practitioners are often left asking an important question: are we confident that the claims made by the case are true? While many confidence assessment methods (CAMs) exist, little is known about the use of these methods in practice.
OBJECTIVE: Develop an understanding of the current state of practice for AC confidence assessment: what methods are used in practice and what barriers exist to their use?
METHOD: Structured interviews and an email questionnaire were used to gather data from practitioners with experience contributing to real-world ACs. Open coding was performed on transcripts. A description of the current state of AC practice and future considerations for researchers was synthesized from the results.
RESULTS: A total of n = 19 practitioners were interviewed. The most common CAMs were (peer) review of ACs, dialectic reasoning ("defeaters"), and comparison against checklists. Some practitioners also used models to gain confidence in an AC. Participants preferred qualitative methods and expressed concerns about quantitative CAMs. Barriers to using CAMs included additional work, inadequate guidance, subjectivity and interpretation of results, and trustworthiness of methods.
CONCLUSION: While many CAMs are described in the literature, there is a gap between the proposed methods and the needs of practitioners. Researchers working in this area should consider the need to connect CAMs to established practices, use CAMs to communicate with interest holders, crystallize the details of CAM application, curate accessible guidance, and confirm that methods are trustworthy.
Citations: 0
A survey of selected characteristics and contexts of the analysis and planning phase of software development projects and their connections to project success
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-23 DOI: 10.1016/j.infsof.2025.107798
Magne Jørgensen
Context: The initial planning and analysis phase of software development projects has received limited research attention.
Objective: This paper aims to improve our knowledge about this phase and its connection to project outcomes.
Method: Information about 116 software projects was collected and analyzed.
Results: Public sector projects performed similarly to private sector projects regarding project efficiency (project performance) but worse regarding effectiveness (benefits realized). More than half (54%) of the software projects were modernization projects, replacing or improving old software systems. The outcomes of the modernization projects were positively connected with more agility, and negatively connected with insufficient planning and analysis and with being motivated by technical debt. The outcomes of the new services and product projects were negatively connected with insufficient planning and analysis and with being motivated by market pressure or opportunity.
Conclusion: There are characteristics and contexts of the analysis and planning phase that are potentially useful as indicators of project success.
Citations: 0
An empirical study on capability of Large Language Models in understanding code semantics
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-22 DOI: 10.1016/j.infsof.2025.107780
Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, Son Nguyen
Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, increasing the application of code LLMs in software development. Despite this success, significant concerns remain about the actual capabilities and reliability of these models: "whether these models really learn the semantics of code from the training data and leverage the learned knowledge to perform the SE tasks". In this paper, we introduce Empica, a comprehensive framework designed to systematically and empirically evaluate the capabilities of code LLMs in understanding code semantics. Specifically, Empica systematically introduces controlled modifications/transformations into the input code and examines the models' responses. In general, code LLMs must be robust to semantically equivalent code inputs and sensitive to non-equivalent ones. That is, for every SE task, given an input code snippet c and its semantically equivalent variants, code LLMs must robustly produce consistent/equivalent outputs, while they are expected to generate different outputs for c and its semantically non-equivalent variants. Our experimental results with eight state-of-the-art code LLMs on six representative code understanding tasks reveal that the robustness and sensitivity of code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, code LLMs exhibit better robustness to semantic-preserving transformations than sensitivity to semantic non-preserving ones. These results highlight the need to enhance the models' capabilities of understanding code semantics, especially the sensitivity property.
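The robustness and sensitivity properties described above can be illustrated with a toy sketch (the function names and the stub "model" are hypothetical; the actual framework evaluates real code LLMs on real SE tasks):

```python
def is_robust(model, code, equivalent_variants):
    # Robustness: identical output for every semantically equivalent variant.
    base = model(code)
    return all(model(v) == base for v in equivalent_variants)

def is_sensitive(model, code, non_equivalent_variants):
    # Sensitivity: a different output for every semantically non-equivalent variant.
    base = model(code)
    return all(model(v) != base for v in non_equivalent_variants)

# Stub "model" for a trivial code-understanding task: does the snippet return a value?
stub = lambda c: "returns" if "return " in c else "void"

code    = "def f(x):\n    return x + 1"
renamed = "def f(y):\n    return y + 1"    # semantic-preserving: variable renaming
gutted  = "def f(x):\n    print(x + 1)"    # semantic non-preserving: return removed

print(is_robust(stub, code, [renamed]))    # True
print(is_sensitive(stub, code, [gutted]))  # True
```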
Citations: 0
A multiple case study on reuse in Game Software Engineering
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-22 DOI: 10.1016/j.infsof.2025.107781
Jose Ignacio Trasobares, África Domingo, Rodrigo Casamayor, Daniel Blasco, Carlos Cetina
Context: Game Software Engineering (GSE) is a specialized field at the intersection of software engineering and video game development. Reuse in GSE is particularly complex due to the iterative nature of game development and the technical needs that arise in creating interactive digital experiences.
Objective: This paper presents the first multi-case study on reuse in GSE, focusing on how reusable components are developed and maintained in game projects. The study aims to investigate reuse practices by analyzing multiple sources, including access to game projects, interviews with developers, focus groups, studio visits, and code analysis.
Method: The study integrates various evidence sources to gain a comprehensive view of reuse in GSE. Data were gathered from interviews and focus groups, supplemented by direct observations during studio visits. Additionally, a recent proposal on software phylogenetics was applied to analyze source code, providing insights into reuse in game projects.
Results: Our findings highlight the significance of prefabs in promoting reuse, especially in managing complex game objects. Prefabs emerged as a widely used element, as confirmed by developer feedback and repository analysis. Software phylogenetics also revealed certain drawbacks.
Conclusion: While prefabs play a relevant role in enhancing reusability, they can introduce redundancy, bugs, and unused components (dead prefabs). Understanding these limitations could inspire future research addressing such issues. Prefab-related practices in GSE could benefit other software engineering areas, encouraging broader reuse strategies.
Citations: 0
Multi-source cross-domain vulnerability detection based on code pre-trained model
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-22 DOI: 10.1016/j.infsof.2025.107764
Yang Cao, Yunwei Dong
Context: In recent years, deep learning-based vulnerability detection methods have achieved significant success. These methods predict vulnerabilities by automatically learning patterns from code annotated with vulnerability information. However, labeled data is usually concentrated in a few software projects and programming languages. In practice, due to distribution discrepancies in vulnerabilities across software projects or programming languages, vulnerability detection models trained on limited projects or a specific language often struggle to generalize to new projects or languages. Current cross-domain vulnerability detection methods utilize domain adaptation to reduce the distribution discrepancy between the labeled source domain and the target domain being tested. However, the language models used in existing methods limit the expressive power of feature vectors, and they employ only single-source domain adaptation.
Objective: To address the limitations of current cross-domain vulnerability detection methods, we propose a new method for Multi-Source cross-domain Vulnerability Detection (MSVD).
Method: MSVD combines two knowledge transfer methods, fine-tuning and domain adaptation. The fine-tuned code pre-trained model extracts code features, generating more meaningful code vector representations. The adversarial-based multi-source domain adaptation method aligns features between multiple source domains and the target domain, leveraging richer knowledge from multiple source domains.
Results: We conducted experiments on real datasets comprising various languages and projects to evaluate the effectiveness of MSVD. Experiment results show that, compared to the baselines in the target domain, MSVD improves F1-score, accuracy, and AUC in the cross-language scenario by 2.95% to 112.90%, 4.37% to 27.65%, and 4.19% to 57.83%, respectively. Additionally, in the cross-project scenario, MSVD achieves the highest F1-score and shows superior performance in terms of accuracy and AUC.
Conclusion: These results indicate that, compared to current state-of-the-art methods, MSVD significantly improves vulnerability detection performance in two cross-domain settings, cross-language and cross-project, when the target domain is unlabeled.
Citations: 0
On the understandability of coupling-related practices in infrastructure-as-code based deployments
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-21 DOI: 10.1016/j.infsof.2025.107761
Pierre-Jean Quéval, Nicole Elisabeth Hörner, Evangelos Ntentos, Uwe Zdun
Infrastructure as Code (IaC) empowers software developers and operations teams to automate the deployment and management of IT infrastructure through code. This is particularly valuable for continuously released deployments such as microservices and cloud-based systems. IaC technologies offer flexibility in provisioning and deploying application architectures. However, if the structure is not well designed, it can lead to severe coupling-related issues. Unfortunately, the lack of comprehensive coupling guidelines for IaC makes ensuring adherence to best practices challenging. Leveraging IaC-based models, metrics, and source code can enhance the comprehension and implementation of coupling measures.
Our objective was to investigate how developers understand information derived from system source code and compare it to formal IaC system diagrams and metrics. We conducted a controlled experiment involving a group of participants to evaluate the understandability of IaC system architecture descriptions through source code inspection and formal representations. We hypothesized that providing formal IaC system diagrams and metrics as supplementary materials would improve the understanding of IaC coupling-related practices, measured by task correctness. We also expected that these supplementary resources would lead to a significant increase in task duration, and that there would be a notable correlation between correctness and duration.
The results suggest that including formal IaC system diagrams and metrics as supplementary materials significantly enhances the comprehension of IaC coupling-related practices, as indicated by task correctness. Moreover, providing these formal representations does not significantly prolong task duration, indicating that they do not hinder understanding. A substantial correlation between task correctness and duration is evident when formal IaC system diagrams and metrics are available.
Citations: 0
GPTs are not the silver bullet: Performance and challenges of using GPTs for security bug report identification
IF 3.8 · Zone 2 (Computer Science)
Information and Software Technology Pub Date: 2025-05-17 DOI: 10.1016/j.infsof.2025.107778
Horácio L. França, Katerina Goseva-Popstojanova, César Teixeira, Nuno Laranjeiro
Context: Identifying security bugs in software is critical to minimizing vulnerability windows. Traditionally, bug reports are submitted through issue trackers and analyzed manually, which is time-consuming. Challenges such as data scarcity and imbalance generally hinder the development of effective machine learning models that could automate this task. Generative Pre-trained Transformer (GPT) models do not require training and are less affected by the imbalance problem; they have therefore gained popularity for various text-based classification tasks, apparently becoming a natural and highly promising solution for this problem.
Objective: This paper explores the potential of using GPT models to identify security bug reports from the perspective of a user of this type of model. We aim to assess their classification performance on this task compared to traditional machine learning (ML) methods, while also investigating how different factors, such as the prompt used and the datasets' characteristics, affect their results.
Methods: We evaluate the performance of four state-of-the-art GPT models (GPT4All-Falcon, Wizard, Instruct, and OpenOrca) on the task of security bug report identification. We use three different prompts for each GPT model and compare the results with traditional ML models. The empirical results are based on bug report data from seven projects (Ambari, Camel, Derby, Wicket, Nova, OpenStack, and Ubuntu).
Results: GPT models show noticeable difficulties in identifying security bug reports, with performance generally lower than traditional ML models. The effectiveness of the GPT models varies considerably, depending on the specific model and prompt used, as well as the particular dataset.
Conclusion: Although GPT models are nowadays used for many types of tasks, including classification, their current performance in security bug report identification is surprisingly insufficient and inferior to traditional ML models. Further research is needed to address the challenges identified in this paper in order to effectively apply GPT models to this particular domain.
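A zero-shot classification setup of the kind evaluated here can be sketched as follows (the prompt wording, helper names, and the stub model are hypothetical; the study used GPT4All-Falcon, Wizard, Instruct, and OpenOrca with three prompt variants each):

```python
# Hypothetical prompt template for zero-shot security bug report classification.
PROMPT = (
    "You are a software security analyst. Decide whether the following bug report "
    "describes a security-related bug. Answer with exactly one word, "
    "SECURITY or NOT_SECURITY.\n\nBug report: {report}"
)

def classify(llm, report: str) -> bool:
    """Returns True when the model labels the report as security-related."""
    answer = llm(PROMPT.format(report=report))
    return answer.strip().upper().startswith("SECURITY")

# Stub LLM so the sketch runs without a real model: flags a few security keywords.
def stub_llm(prompt: str) -> str:
    keywords = ("overflow", "injection", "xss", "csrf", "privilege")
    report = prompt.split("Bug report:", 1)[1].lower()
    return "SECURITY" if any(k in report for k in keywords) else "NOT_SECURITY"

print(classify(stub_llm, "SQL injection possible in login form"))  # True
print(classify(stub_llm, "Typo in the settings dialog title"))     # False
```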
Citations: 0