{"title":"GPTs are not the silver bullet: Performance and challenges of using GPTs for security bug report identification","authors":"Horácio L. França, Katerina Goseva-Popstojanova, César Teixeira, Nuno Laranjeiro","doi":"10.1016/j.infsof.2025.107778","DOIUrl":"10.1016/j.infsof.2025.107778","url":null,"abstract":"<div><h3>Context:</h3><div>Identifying security bugs in software is critical to minimize vulnerability windows. Traditionally, bug reports are submitted through issue trackers and manually analyzed, which is time-consuming. Challenges such as data scarcity and imbalance generally hinder the development of effective machine learning models that could be used to automate this task. Generative Pre-trained Transformer (GPT) models do not require training and are less affected by the imbalance problem. Therefore, they have gained popularity for various text-based classification tasks, making them an apparently natural and highly promising solution to this problem.</div></div><div><h3>Objective:</h3><div>This paper explores the potential of using GPT models to identify security bug reports from the perspective of a user of such models. We aim to assess their classification performance in this task compared to traditional machine learning (ML) methods, while also investigating how different factors, such as the prompt used and datasets’ characteristics, affect their results.</div></div><div><h3>Methods:</h3><div>We evaluate the performance of four state-of-the-art GPT models (i.e., GPT4All-Falcon, Wizard, Instruct, OpenOrca) on the task of security bug report identification. We use three different prompts for each GPT model and compare the results with traditional ML models. The empirical results are based on bug report data from seven projects (i.e., Ambari, Camel, Derby, Wicket, Nova, OpenStack, and Ubuntu).</div></div><div><h3>Results:</h3><div>GPT models show noticeable difficulties in identifying security bug reports, with performance levels generally lower than those of traditional ML models. The effectiveness of the GPT models is quite variable, depending on the specific model and prompt used, as well as the particular dataset.</div></div><div><h3>Conclusion:</h3><div>Although GPT models are nowadays used in many types of tasks, including classification, their current performance in security bug report identification is surprisingly insufficient and inferior to that of traditional ML models. Further research is needed to address the challenges identified in this paper in order to effectively apply GPT models to this particular domain.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107778"},"PeriodicalIF":3.8,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144116798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
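The comparison above hinges on precision, recall, and F1 under heavy class imbalance, where most bug reports are not security-related. A minimal stdlib-only sketch of that evaluation setup follows; the keyword list and sample reports are invented for illustration and are not from the study:

```python
# Illustrative only: a trivial keyword baseline for flagging security bug
# reports, plus the precision/recall/F1 metrics such comparisons rely on.
# The keyword set and sample reports below are invented for this sketch.

SECURITY_KEYWORDS = {"overflow", "injection", "xss", "csrf",
                     "authentication", "privilege", "vulnerability"}

def is_security_report(summary: str) -> bool:
    # Flag a report as security-related if any keyword appears in it.
    text = summary.lower()
    return any(kw in text for kw in SECURITY_KEYWORDS)

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

reports = [
    ("SQL injection possible in login form", True),
    ("Buffer overflow when parsing config", True),
    ("Typo in documentation page", False),
    ("UI freezes on large file upload", False),
    ("Crash on malformed input", True),  # missed: no keyword present
]
y_true = [label for _, label in reports]
y_pred = [is_security_report(summary) for summary, _ in reports]
print(precision_recall_f1(y_true, y_pred))
```

The deliberately missed last report shows why recall, not accuracy, is the metric that matters here: a classifier that never flags anything would still score high accuracy on an imbalanced dataset.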
{"title":"Exploring continual learning in code intelligence with domain-wise distilled prompts","authors":"Shuo Liu, Jacky Keung, Zhen Yang, Fang Liu, Fengji Zhang, Yicheng Sun","doi":"10.1016/j.infsof.2025.107775","DOIUrl":"10.1016/j.infsof.2025.107775","url":null,"abstract":"<div><h3>Context:</h3><div>Software programs evolve constantly in practice, leading to domain shifts that cannot be accommodated by traditional offline training. Recently, a few Continual Learning (CL) studies on code intelligence emerged, which learn a sequence of datasets one by one. We criticize existing rehearsal-based CL methods for relying heavily on retraining historical samples, which brings an extra training burden and a risk of data disclosure.</div></div><div><h3>Objective:</h3><div>To overcome the above limitations, in this paper, we leverage the superiority of prompts in eliciting pre-trained knowledge to realize a rehearsal-free method.</div></div><div><h3>Methods:</h3><div>We first explore the performance of vanilla prompt tuning in the CL scenario, finding that inheriting the previous Pre-trained Language Model (PLM) parameters is appropriate and that prompt stability should be emphasized. Therefore, we propose an effective method named Prompt Tuning with Domain-wise Distillation (PTDD), which can distill prompts and optimize PLMs with a two-sided learning objective, thus improving PLMs’ performance in diverse domains.</div></div><div><h3>Results:</h3><div>We conduct experiments on three widely-studied code intelligence tasks, including Code Summarization, Code Vulnerability Detection, and Code Clone Detection. We evaluate PTDD in comparison with a series of baselines. Experimental results indicate the effectiveness of PTDD. For instance, PTDD surpasses fine-tuning by 2.55%, 11.12%, and 2.25% in the three tasks, respectively. Moreover, we interpret the effectiveness of PTDD by prompt visualization, and discuss its performance in the low-resource scenario, where the improvement of PTDD becomes more pronounced with fewer training samples and can reach up to 69.09%.</div></div><div><h3>Conclusion:</h3><div>To the best of our knowledge, our work conducts the first experimental study to explore the performance of prompt tuning within the CL setting in the code intelligence field. The research findings indicate the effectiveness of PTDD and contribute to a deeper understanding of the capability of prompts.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107775"},"PeriodicalIF":3.8,"publicationDate":"2025-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"What problems are MLOps practitioners talking about? A study of discussions in Stack Overflow forum and GitHub projects","authors":"Yang Zhang, Yiwen Wu, Tao Wang, Bo Ding, Huaimin Wang","doi":"10.1016/j.infsof.2025.107768","DOIUrl":"10.1016/j.infsof.2025.107768","url":null,"abstract":"<div><h3>Context:</h3><div>Machine Learning Operations (MLOps) has emerged as a crucial technology for addressing the challenges of designing and maintaining productive ML applications. The widespread adoption of MLOps makes it essential to identify the problems faced by MLOps practitioners. However, there has been relatively little research in this area.</div></div><div><h3>Objectives:</h3><div>We aim to fill this research gap and gain an understanding of the interests and difficulties encountered by MLOps practitioners.</div></div><div><h3>Methods:</h3><div>We mine discussion data from the online Q&A forum Stack Overflow and from GitHub projects, and analyze 6345 posts and 2103 issues.</div></div><div><h3>Results:</h3><div>We construct the first taxonomy of MLOps problems in practice, consisting of 5 categories and 19 topics. We also investigate the evolution and characteristics (difficulty and sentiment) of these topics, distill 12 frequent solutions for different MLOps problems, and design an MLOps knowledge exploration tool, MLOps-KET.</div></div><div><h3>Conclusion:</h3><div>We find that practitioners face diverse challenges when performing MLOps practices and that the focus of their discussions has changed over time. Our study contributes to the MLOps research and development community by providing implications for different audiences and guidance for future support of relevant techniques and tools.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107768"},"PeriodicalIF":3.8,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144089615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Systematic Review of Software Product Value: Perspectives Beyond Functionality","authors":"C.R. Oruthotaarachchi, W.M.J.I. Wijayanayake","doi":"10.1016/j.infsof.2025.107784","DOIUrl":"10.1016/j.infsof.2025.107784","url":null,"abstract":"<div><h3>Context</h3><div>Developing software products that effectively address customer needs while offering high business value is essential. Traditional software value assessments focus on technical performance, cost-effectiveness, and business impact, prioritizing security and reliability. Current definitions have emerged primarily from technical and economic perspectives, omitting people-oriented perspectives from the discussion. A gap remains in understanding how people-oriented domains such as management and marketing relate to software product value.</div></div><div><h3>Objective</h3><div>This paper presents a systematic literature review that investigates different perspectives of software product value, combining insights from the management, marketing, design, and software engineering domains to provide a holistic view.</div></div><div><h3>Method</h3><div>The study was conducted based on an established systematic review methodology, searching for articles published from 2004 to 2024 in five academic databases. A qualitative data analysis approach was used to answer the research questions, and the PRISMA statement was followed to ensure the rigorous reporting of this research.</div></div><div><h3>Results</h3><div>The search process yielded 67 articles, which provide valuable insights into the existing discussions of software product value. The findings emphasize that, in addition to functional and non-functional requirements, software product managers must prioritize psychological and social requirements, provide seamless customer relationship management, and connect the software product with both the software and client organizations’ strategic ambitions.</div></div><div><h3>Conclusion</h3><div>The value of software products is not limited to their performance; it also encompasses the perception of benefits, emotions, and brand identity. Aligning software development with precise customer objectives, organizational goals, and market demands significantly maximizes perceived software value. This integrated strategy is critical for increasing value throughout the product's lifecycle and ensuring product market sustainability.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107784"},"PeriodicalIF":3.8,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144089670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A systematic literature review on transformation for testability techniques in software systems","authors":"Fateme Bagheri-Galle, Saeed Parsa, Morteza Zakeri","doi":"10.1016/j.infsof.2025.107788","DOIUrl":"10.1016/j.infsof.2025.107788","url":null,"abstract":"<div><h3>Context</h3><div>Software testability is a critical aspect of software development, enabling efficient error identification during testing. Program transformation techniques, mainly refactoring, play a key role in enhancing testability by simplifying the process of identifying and addressing potential issues. By improving testability, developers empower themselves to create more dependable software products.</div></div><div><h3>Objective</h3><div>Our study aims to conduct a systematic literature review focused on transformation techniques for improving testability in software systems. By analyzing existing research, we seek to provide insights into effective strategies for enhancing testability and addressing critical issues in software development.</div></div><div><h3>Method</h3><div>We queried six digital libraries, resulting in over 5000 articles. After rigorous analysis, we narrowed our focus to 39 primary research papers. Based on a novel hierarchical classification of the approaches used to enhance testability, the selected articles were analyzed considering the refactoring techniques, software metrics, and code smells affecting testability at the design and code levels.</div></div><div><h3>Results</h3><div>Our investigation revealed that 53.8% of the papers specifically employed refactoring for testability, while 46.2% utilized testability transformation techniques. Only one study provided structured sequences of refactoring for testability. The studies primarily focused on three testing levels: unit testing, regression testing, and graphical user interface (GUI) testing. Notably, unit testing received the most attention, appearing in 71.8% of the studies. About 64.1% of the studies involved software projects written in the Java programming language. The results suggest that removing code smells and anti-patterns through refactoring would increase testability.</div></div><div><h3>Conclusion</h3><div>While transformation techniques are essential for increasing testability, more research is needed to address this critical issue. Additionally, it is essential to explore levels of testing beyond unit testing and to study software projects in languages beyond Java. It is also necessary to provide more structured refactoring sequences aimed at improving testability.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107788"},"PeriodicalIF":3.8,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144116720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SolBERT: Advancing solidity smart contract similarity analysis via self-supervised pre-training and contrastive fine-tuning","authors":"Zhenzhou Tian, Yudong Teng, Xianqun Ke, Yanping Chen, Lingwei Chen","doi":"10.1016/j.infsof.2025.107766","DOIUrl":"10.1016/j.infsof.2025.107766","url":null,"abstract":"<div><h3>Context:</h3><div>Reliable and effective similarity analysis for smart contracts facilitates the maintenance and quality assurance of the smart contract ecosystem. However, existing signature-based methods and code representation learning-based methods suffer from limitations such as heavy-weight program analysis payloads or suboptimal contract encodings.</div></div><div><h3>Objective:</h3><div>This paper aims to design a fully unsupervised language model that better captures the syntactic and semantic richness of Solidity code, and to utilize it to advance the effectiveness of smart contract similarity analysis.</div></div><div><h3>Methods:</h3><div>Inspired by the impressive semantic learning capability of pre-trained language models (PLMs), we propose SolBERT, a PLM specifically tailored for enhancing Solidity smart contract similarity detection. To ensure it produces high-quality encodings, SolBERT learns a base model through BERT-style pre-training with the masked language modeling (MLM) and token type prediction (TTP) tasks, applied to code-structure-aware token sequences derived from the contracts’ abstract syntax trees (ASTs) via structure-retaining tree linearization and light-weight normalization. On this basis, self-supervised contrastive fine-tuning and unsupervised whitening operations are further performed to optimize contract encoding generation.</div></div><div><h3>Results:</h3><div>Experiments are conducted on three contract similarity-related tasks, including contract clone detection, bug detection, and code clustering. The results indicate that SolBERT significantly outperforms state-of-the-art approaches with average absolute gains of 21.33% and 21.50% in terms of F1, and 17.78% and 26.60% in terms of accuracy for the clone detection and bug detection tasks, respectively, and an average absolute gain of 17.97% for the code clustering task. When applying both contrastive fine-tuning and whitening optimizations, SolBERT also shows superior performance compared with configurations lacking either of them.</div></div><div><h3>Conclusion:</h3><div>The proposed approach, SolBERT, can serve as a reliable and powerful smart contract encoder, better capturing the syntactic and semantic aspects of Solidity code. The results and findings also validate the effectiveness and positive synergistic effect of SolBERT’s encoding optimization operations.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"184 ","pages":"Article 107766"},"PeriodicalIF":3.8,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143946831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the use of LLMs for the selection phase in systematic literature studies","authors":"Lukas Thode, Umar Iftikhar, Daniel Mendez","doi":"10.1016/j.infsof.2025.107757","DOIUrl":"10.1016/j.infsof.2025.107757","url":null,"abstract":"<div><h3>Context:</h3><div>Systematic literature studies, such as secondary studies, are crucial to aggregate evidence. An essential part of these studies is the selection of relevant studies. This, however, is time-consuming, resource-intensive, and error-prone, as it depends heavily on manual labor and domain expertise. The increasing popularity of Large Language Models (LLMs) raises the question of to what extent these manual study selection tasks could be supported in an automated manner.</div></div><div><h3>Objectives:</h3><div>In this manuscript, we report on our effort to explore and evaluate the use of state-of-the-art LLMs to automate the selection phase in systematic literature studies.</div></div><div><h3>Method:</h3><div>We evaluated LLMs for the selection phase using two published systematic literature studies in software engineering as ground truth. Three prompts were designed and applied across five LLMs to the studies’ titles and abstracts based on their inclusion and exclusion criteria. Additionally, we analyzed combining two LLMs to replicate a practical selection phase. We analyzed recall and precision, and reflected upon the accuracy of the LLMs and whether the ground truth studies were conducted by early career scholars or by more advanced ones.</div></div><div><h3>Results:</h3><div>Our results show a high average recall of up to 98% combined with a precision of 27% in a single-LLM approach, and an average recall of 99% with a precision of 27% in a two-model approach replicating a two-reviewer procedure. Furthermore, the Llama 2 models showed the highest average recall (98%) across all prompt templates and datasets, while GPT4-turbo had the highest average precision (72%).</div></div><div><h3>Conclusions:</h3><div>Our results demonstrate how LLMs could support the selection phase in the future. We recommend a two-LLM approach to achieve higher recall. However, we also note that further studies using other models and prompts on more datasets are required to strengthen confidence in the presented approach.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"184 ","pages":"Article 107757"},"PeriodicalIF":3.8,"publicationDate":"2025-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
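The two-model procedure described in this abstract retains a paper for full-text review if either LLM votes to include it, which can only raise recall (possibly at the cost of precision). A minimal sketch of that union rule; the votes and ground truth below are invented for illustration:

```python
# Sketch of a two-model "union" screening step: a paper advances to
# full-text review if either LLM votes to include it. The votes and
# ground-truth labels are invented for this illustration.

def screen_union(votes_a, votes_b):
    # Include a paper if either model includes it (two-reviewer style).
    return [a or b for a, b in zip(votes_a, votes_b)]

def recall_precision(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Ground truth: which of 8 candidate papers are actually relevant.
relevant = [True, True, True, False, False, False, False, False]
model_a  = [True, True, False, True, False, True, False, False]
model_b  = [True, False, True, True, True, False, False, False]

combined = screen_union(model_a, model_b)
print(recall_precision(relevant, model_a))   # single model
print(recall_precision(relevant, combined))  # two-model union
```

In this toy example the union recovers the relevant paper that model A alone misses, lifting recall from 2/3 to 1.0 while precision stays at 0.5, mirroring the high-recall/low-precision trade-off the study reports.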
{"title":"An analysis of privacy regulations and user concerns of finance mobile applications","authors":"Omar Haggag, Alessandro Pedace, Shidong Pan, John Grundy","doi":"10.1016/j.infsof.2025.107756","DOIUrl":"10.1016/j.infsof.2025.107756","url":null,"abstract":"<div><h3>Context:</h3><div>Financial applications handle sensitive data, including personal details, banking information, and transaction histories, making them prime targets for cyber-attacks. As privacy concerns grow, users and regulators are increasingly analyzing how these apps manage data in different legal contexts.</div></div><div><h3>Objective:</h3><div>This study examines user privacy concerns and assesses the impact of privacy regulations on mobile financial applications in Germany, Australia, and the United States. It aims to evaluate how laws such as the GDPR in the EU, the Privacy Act in Australia, and various U.S. state and federal laws shape app privacy policies. Additionally, the study explores the readability and accessibility of privacy policies.</div></div><div><h3>Methods:</h3><div>User reviews from app stores were analyzed to identify recurring privacy issues and regional differences in concerns. The study also reviewed privacy laws in the EU, Australia, and the U.S. to assess their influence on financial app policies. To analyze the user-friendliness of privacy documents, a readability analysis was conducted using the Flesch Reading Ease score and estimated reading times.</div></div><div><h3>Results:</h3><div>The findings revealed that users are highly concerned about the handling of their data, with significant demand for greater transparency and more robust privacy protections. Regional differences in privacy concerns were identified, with varying levels of engagement with privacy issues in each region. The study also found significant discrepancies in the readability of privacy policies, with many policies proving too complex for the average user to understand.</div></div><div><h3>Conclusion:</h3><div>The study concludes that financial app developers need to simplify their privacy policies and improve transparency to build user trust. It also emphasizes the need for stronger regulatory frameworks to address evolving privacy challenges. Recommendations are made for developers and policymakers to enhance data protection and improve user experience in financial services.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"184 ","pages":"Article 107756"},"PeriodicalIF":3.8,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143891892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
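The readability analysis this abstract describes can be sketched with the published Flesch Reading Ease formula, 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words), plus an estimated reading time. The vowel-group syllable counter is a naive approximation, the 200 words-per-minute speed is an assumed average, and the sample policy text is invented:

```python
import re

# Minimal sketch of a readability check: Flesch Reading Ease plus an
# estimated reading time. The syllable counter is a rough vowel-group
# heuristic, and 200 wpm is an assumed average reading speed.

def count_syllables(word: str) -> int:
    # Count runs of consecutive vowels as syllables (approximation).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def reading_time_minutes(text: str, wpm: int = 200) -> float:
    return len(re.findall(r"[A-Za-z']+", text)) / wpm

policy = ("We collect your data. We may share it with partners. "
          "You can opt out at any time.")
print(round(flesch_reading_ease(policy), 1))
print(reading_time_minutes(policy))
```

Higher scores mean easier text (roughly 60–70 is "plain English"); real privacy policies often score far lower, which is the discrepancy the study highlights.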
{"title":"Promoting social sustainability within software development through the lens of organizational readiness for change theory","authors":"Ana Carolina Moises de Souza, Daniela Soares Cruzes, Letizia Jaccheri, Tangni Cunningham Dahl-Jørgensen","doi":"10.1016/j.infsof.2025.107755","DOIUrl":"10.1016/j.infsof.2025.107755","url":null,"abstract":"<div><h3>Context:</h3><div>Software’s negative impact on society underscores the need to integrate social sustainability into software development. However, effective implementation and practitioners’ readiness for this change remain unclear, requiring further investigation.</div></div><div><h3>Objective:</h3><div>This research aims to understand the conditions that promote organizational readiness for change in the integration of social sustainability into software development from the perspective of software practitioners.</div></div><div><h3>Methods:</h3><div>We conducted a multiple case study comprising three cases: (A) an exploratory study with 11 practitioners from four organizations; (B) the proposal and validation of a Walkthrough intervention with 9 students (pilot) and 19 practitioners (questionnaire); and (C) a focus group with 6 practitioners in one organization providing feedback on the Walkthrough.</div></div><div><h3>Results:</h3><div>Four facilitators and barriers were identified as key preconditions for social sustainability integration. Statistical analysis showed that the perceived usefulness of the Walkthrough was significantly higher than intentional behavior, indicating strong perceived value despite a moderate intention to adopt the practices.</div></div><div><h3>Conclusion:</h3><div>This study identified the key determinants that promote organizational readiness to integrate social sustainability into software development. By proposing a conceptual model, it helps organizations leverage facilitators and overcome barriers, and offers actionable recommendations for both practice and research.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"184 ","pages":"Article 107755"},"PeriodicalIF":3.8,"publicationDate":"2025-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143891910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking large language models for automated labeling: The case of issue report classification","authors":"Giuseppe Colavito, Filippo Lanubile, Nicole Novielli","doi":"10.1016/j.infsof.2025.107758","DOIUrl":"10.1016/j.infsof.2025.107758","url":null,"abstract":"<div><h3>Context:</h3><div>Issue labeling is a fundamental task for software development as it is critical for the effective management of software projects. This practice involves assigning a label to issues, such as <em>bug</em> or <em>feature request</em>, denoting a task relevant to project management. To date, large language models (LLMs) have been proposed to automate this task, including both fine-tuned BERT-like models and zero-shot GPT-like models.</div></div><div><h3>Objectives:</h3><div>In this paper, we investigate which LLMs offer the best trade-off between performance, response time, hardware requirements, and quality of the responses for issue report classification.</div></div><div><h3>Methods:</h3><div>We design and execute a comprehensive benchmark study to assess 22 generative decoder-only LLMs and 2 baseline BERT-like encoder-only models, which we evaluate on two different datasets of GitHub issues.</div></div><div><h3>Results:</h3><div>Generative LLMs demonstrate potential for zero-shot classification. However, their performance varies significantly across datasets and they require substantial computational resources for deployment. In contrast, BERT-like models show more consistent performance and lower resource requirements.</div></div><div><h3>Conclusions:</h3><div>Based on the empirical evidence provided in this study, we discuss implications for researchers and practitioners. In particular, our results suggest that fine-tuning BERT-like encoder-only models achieves consistent, state-of-the-art performance across datasets, even when only a small amount of labeled data is available for training.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"184 ","pages":"Article 107758"},"PeriodicalIF":3.8,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143891909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}