{"title":"Assessing output reliability and similarity of large language models in software development: A comparative case study approach","authors":"Dae-Kyoo Kim , Hua Ming","doi":"10.1016/j.infsof.2025.107787","DOIUrl":"10.1016/j.infsof.2025.107787","url":null,"abstract":"<div><h3>Context:</h3><div>Generative large language models (LLMs) are increasingly used across various activities in software development, offering significant potential to enhance productivity. However, there is a lack of systematic study examining the reliability and similarity of the outputs from these models.</div></div><div><h3>Objective:</h3><div>This work presents a comparative analysis of the reliability – defined as the consistency and correctness of software artifacts – and similarity of LLM outputs in software development.</div></div><div><h3>Method:</h3><div>To accomplish the objective, we introduce a structured approach for assessing the reliability and similarity of outputs from five prominent LLMs – ChatGPT, Claude, Copilot, Gemini, and Meta – and apply it within two case studies focused on developing a food order and delivery system and a smart wallet system.</div></div><div><h3>Results:</h3><div>The study found that the overall output reliability of the models is rated at 0.82, with Claude outperforming other models at 0.92, followed by ChatGPT at 0.90, Copilot at 0.80, Meta at 0.75, and Gemini at 0.71. The models demonstrated an overall 57% similarity and 43% variability in their outputs, highlighting the uniqueness of the models.</div></div><div><h3>Conclusions:</h3><div>While LLMs overall exhibit decent reliability in their outputs, to varying degrees, they still require human oversight and review before implementation.
LLMs present unique characteristics that practitioners should consider before adoption.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107787"},"PeriodicalIF":3.8,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating vulnerability security fixes with Code Language Models","authors":"Guru Bhandari, Nikola Gavric, Andrii Shalaginov","doi":"10.1016/j.infsof.2025.107786","DOIUrl":"10.1016/j.infsof.2025.107786","url":null,"abstract":"<div><div>Existing Code Language Models (CLM) have demonstrated significant potential in several coding tasks, including automated code generation in software engineering. Similarly, Automated Program Repair (APR) has shown considerable progress in addressing general software vulnerabilities, yet its application utilizing the recently developed CLM remains unexplored. In this paper, we introduce Patch Language Model (PatchLM), a novel CLM fine-tuned to fix security vulnerabilities in code blocks retrieved from commit hunks associated with Common Vulnerabilities and Exposures (CVE) records. Our proposed model leverages CLM to understand secure coding practices and generate accurate patches. The study aims to address the diverse nature of security flaws across multiple programming languages. Our experimental evaluation demonstrated that PatchLM significantly outperforms the baseline <em>CodeT5</em> and <em>CodeLlama</em> models in generating effective security patches, as reflected in the performance metrics. Specifically, PatchLM achieves improvements of up to 48.35% in CodeBLEU and 28.9% in ROUGE scores compared to the baseline models. 
Our study demonstrates the practicality and significance of PatchLM in generating vulnerability repairs, providing valuable support for under-resourced security analysts, and paving the way for future research in automated vulnerability fixing with CLM.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107786"},"PeriodicalIF":3.8,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A systematic review of shared personal informatics","authors":"Mengru Xue , Pengcheng An , Zengrong Guo , Rong-Hao Liang , Jun Hu , Preben Hansen , Loe Feijs","doi":"10.1016/j.infsof.2025.107759","DOIUrl":"10.1016/j.infsof.2025.107759","url":null,"abstract":"<div><div>Personal informatics (PI) has gained great attention and become ubiquitous in people’s everyday lives. Although an increasing number of studies set out to explore the social aspects of PI, there remains an unaddressed opportunity for a structured, systematic review to understand why and how the sharing happened, to inform future design and research. This systematic review summarizes the last 13 years of research on the diverse cases of shared PI practice from ACM, PubMed, and IEEE. 100 papers were analyzed, and four types of sharing were identified: Interpersonal targeting, Public broadcasting, Group monitoring, and Community exchanging. Notably, sharing extends beyond data exchange, evolving into a collaborative process across different PI stages. The review offers a taxonomy of shared PI practices and delineates design possibilities, facilitating future exploration in the field. Additionally, it identifies trends and patterns within existing work, suggesting design opportunities for future explorations of shared personal informatics.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107759"},"PeriodicalIF":3.8,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144190039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How do practitioners gain confidence in assurance cases?","authors":"Simon Diemert, Caleb Shortt, Jens H. Weber","doi":"10.1016/j.infsof.2025.107767","DOIUrl":"10.1016/j.infsof.2025.107767","url":null,"abstract":"<div><h3>CONTEXT:</h3><div>Assurance Cases (ACs) are prepared to argue that the system’s desired quality attributes (e.g., safety or security) are satisfied. While there is strong adoption of ACs, practitioners are often left asking an important question: are we confident that the claims made by the case are true? While many confidence assessment methods (CAMs) exist, little is known about the use of these methods in practice.</div></div><div><h3>OBJECTIVE:</h3><div>Develop an understanding of the current state of practice for AC confidence assessment: what methods are used in practice and what barriers exist for their use?</div></div><div><h3>METHOD:</h3><div>Structured interviews and an email questionnaire were used to gather data from practitioners with experience contributing to real-world ACs. Open-coding was performed on transcripts. A description of the current state of AC practice and future considerations for researchers was synthesized from the results.</div></div><div><h3>RESULTS:</h3><div>A total of n = 19 practitioners were interviewed. The most common CAMs were (peer-)review of ACs, dialectic reasoning (“defeaters”), and comparing against checklists. Some practitioners also used models to gain confidence in an AC. Participants preferred qualitative methods and expressed concerns about quantitative CAMs. Barriers to using CAMs included additional work, inadequate guidance, subjectivity and interpretation of results, and trustworthiness of methods.</div></div><div><h3>CONCLUSION:</h3><div>While many CAMs are described in the literature there is a gap between the proposed methods and needs of practitioners. 
Researchers working in this area should consider the need to: connect CAMs to established practices, use CAMs to communicate with interest holders, crystallize the details of CAM application, curate accessible guidance, and confirm that methods are trustworthy.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107767"},"PeriodicalIF":3.8,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144147927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey of selected characteristics and contexts of the analysis and planning phase of software development projects and their connections to project success","authors":"Magne Jørgensen","doi":"10.1016/j.infsof.2025.107798","DOIUrl":"10.1016/j.infsof.2025.107798","url":null,"abstract":"<div><h3>Context</h3><div>The initial planning and analysis phase of software development projects has received limited research attention.</div></div><div><h3>Objective</h3><div>This paper aims to improve our knowledge about this phase and its connection to the project outcomes.</div></div><div><h3>Method</h3><div>Information about 116 software projects was collected and analyzed.</div></div><div><h3>Results</h3><div>Public sector projects performed similarly to private sector projects regarding project efficiency (project performance) but worse regarding effectiveness (benefits realized). More than half (54%) of the software projects were modernization projects, replacing or improving old software systems. The outcomes of the modernization projects were positively connected with greater agility and negatively connected with insufficient planning and analysis and with motivation by technical debt.
The outcomes of the new services and product projects were negatively connected with insufficient planning and analysis and with motivation by market pressure or opportunity.</div></div><div><h3>Conclusion</h3><div>There are characteristics and contexts of the analysis and planning phase that are potentially useful as indicators of project success.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107798"},"PeriodicalIF":3.8,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144190398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study on capability of Large Language Models in understanding code semantics","authors":"Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, Son Nguyen","doi":"10.1016/j.infsof.2025.107780","DOIUrl":"10.1016/j.infsof.2025.107780","url":null,"abstract":"<div><div>Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, increasing the application of code LLMs in software development. Despite the success of code LLMs, there remain significant concerns about the actual capabilities and reliability of these models, <em>“whether these models really learn the semantics of code from the training data and leverage the learned knowledge to perform the SE tasks”</em>. In this paper, we introduce <span>Empica</span>, a comprehensive framework designed to systematically and empirically evaluate the capabilities of code LLMs in understanding code semantics. Specifically, <span>Empica</span> systematically introduces controlled modifications/transformations into the input code and examines the models’ responses. In general, code LLMs must be <em>robust to semantically equivalent code inputs</em> and be <em>sensitive to non-equivalent ones</em>. Specifically, for every SE task, given an input code snippet <span><math><mi>c</mi></math></span> and its semantically equivalent variants, code LLMs must robustly produce consistent/equivalent outputs, while they are expected to generate different outputs for <span><math><mi>c</mi></math></span> and its semantically non-equivalent variants. Our experimental results with eight state-of-the-art code LLMs on six representative code understanding tasks reveal that the robustness and sensitivity of code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, code LLMs exhibit greater robustness to semantic-preserving transformations than sensitivity to semantic non-preserving ones.
These results highlight a need to enhance these models’ capability to understand code semantics, especially their sensitivity.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107780"},"PeriodicalIF":3.8,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144130880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multiple case study on reuse in Game Software Engineering","authors":"Jose Ignacio Trasobares , África Domingo , Rodrigo Casamayor , Daniel Blasco , Carlos Cetina","doi":"10.1016/j.infsof.2025.107781","DOIUrl":"10.1016/j.infsof.2025.107781","url":null,"abstract":"<div><h3>Context:</h3><div>Game Software Engineering (GSE) is a specialized field at the intersection of software engineering and video game development. Reuse in GSE is particularly complex due to the iterative nature of game development and technical needs that arise in creating interactive digital experiences.</div></div><div><h3>Objective:</h3><div>This paper presents the first multi-case study on reuse in GSE, focusing on how reusable components are developed and maintained in game projects. The study aims to investigate reuse practices by analyzing multiple sources, including access to game projects, interviews with developers, focus groups, studio visits, and code analysis.</div></div><div><h3>Method:</h3><div>The study integrates various evidence sources to gain a comprehensive view of reuse in GSE. Data were gathered from interviews and focus groups, supplemented by direct observations during visits. Additionally, a recent proposal on software phylogenetics was applied to analyze source code, providing insights into reuse in game projects.</div></div><div><h3>Results:</h3><div>Our findings highlight the significance of prefabs in promoting reuse, especially in managing complex game objects. Prefabs emerged as a widely used element, confirmed by developer feedback and repository analysis. Software phylogenetics also revealed certain drawbacks.</div></div><div><h3>Conclusion:</h3><div>While prefabs play a relevant role in enhancing reusability, they can introduce redundancy, bugs, and unused components (dead prefabs). Understanding these limitations could inspire future research addressing such issues.
Prefab-related practices in GSE could benefit other software engineering areas, encouraging broader reuse strategies.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107781"},"PeriodicalIF":3.8,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144116721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-source cross-domain vulnerability detection based on code pre-trained model","authors":"Yang Cao, Yunwei Dong","doi":"10.1016/j.infsof.2025.107764","DOIUrl":"10.1016/j.infsof.2025.107764","url":null,"abstract":"<div><h3>Context:</h3><div>In recent years, deep learning-based vulnerability detection methods have achieved significant success. These methods predict vulnerabilities by automatically learning patterns from code annotated with vulnerability information. However, labeled data is usually concentrated in a few software projects and programming languages. In practice, due to distribution discrepancy in vulnerabilities across different software projects or programming languages, vulnerability detection models trained on limited projects or a specific language often struggle to generalize to new projects or languages. Currently, cross-domain vulnerability detection methods utilize domain adaptation to reduce the distribution discrepancy between the labeled source domain and the target domain being tested. However, the language models used in existing methods limit the expressive power of feature vectors, and they only employ single-source domain adaptation methods.</div></div><div><h3>Objective:</h3><div>To address the limitations of current cross-domain vulnerability detection methods, we propose a new method for <u>M</u>ulti-<u>S</u>ource cross-domain <u>V</u>ulnerability <u>D</u>etection (<em>MSVD</em>).</div></div><div><h3>Method:</h3><div>MSVD combines two knowledge transfer methods, fine-tuning and domain adaptation. The fine-tuned code pre-trained model extracts code features, generating more meaningful code vector representations. 
The adversarial-based multi-source domain adaptation method aligns features between multiple source domains and the target domain, leveraging richer knowledge from multiple source domains.</div></div><div><h3>Results:</h3><div>We conducted experiments on real datasets comprising various languages and projects to evaluate the effectiveness of MSVD. Experiment results show that, compared to the baselines in the target domain, MSVD improves F1-score, accuracy, and AUC in the cross-language scenario by 2.95%<span><math><mo>∼</mo></math></span>112.90%, 4.37%<span><math><mo>∼</mo></math></span>27.65%, and 4.19%<span><math><mo>∼</mo></math></span>57.83%, respectively. Additionally, in the cross-project scenario, MSVD achieves the highest F1-score and shows superior performance in terms of accuracy and AUC.</div></div><div><h3>Conclusion:</h3><div>These results indicate that compared to the current state-of-the-art methods, MSVD significantly improves vulnerability detection performance in two cross-domain settings: cross-language and cross-project, when the target domain is unlabeled.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107764"},"PeriodicalIF":3.8,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144123312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the understandability of coupling-related practices in infrastructure-as-code based deployments","authors":"Pierre-Jean Quéval , Nicole Elisabeth Hörner , Evangelos Ntentos , Uwe Zdun","doi":"10.1016/j.infsof.2025.107761","DOIUrl":"10.1016/j.infsof.2025.107761","url":null,"abstract":"<div><div>Infrastructure as Code (IaC) empowers software developers and operations teams to automate the deployment and management of IT infrastructure through code. This is particularly valuable for continuously released deployments such as microservices and cloud-based systems. IaC technologies offer flexibility in provisioning and deploying application architectures. However, if the structure is not well-designed, it can lead to severe issues related to coupling aspects. Unfortunately, a lack of comprehensive coupling guidelines for IaC makes ensuring adherence to best practices challenging. Leveraging IaC-based models, metrics, and source code can enhance the comprehension and implementation of coupling measures.</div><div>Our objective was to investigate how developers understand information derived from system source code and compare it to formal IaC system diagrams and metrics. We conducted a controlled experiment involving a group of participants to evaluate the understandability of IaC system architecture descriptions through source code inspection and formal representations.</div><div>We hypothesized that providing formal IaC system diagrams and metrics as supplementary materials would improve the understanding of IaC coupling-related practices measured by task <em>correctness</em>. 
We also expected that these supplementary resources would lead to a significant increase in task <em>duration</em> and that there would be a notable correlation between <em>correctness</em> and <em>duration</em>.</div><div>The results suggest that including formal IaC system diagrams and metrics as supplementary materials significantly enhances the comprehension of IaC coupling-related practices, as indicated by task <em>correctness</em>. Moreover, providing these formal representations does not significantly prolong task <em>duration</em>, indicating that they do not hinder understanding. A substantial correlation between task <em>correctness</em> and <em>duration</em> is evident when formal IaC system diagrams and metrics are available.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107761"},"PeriodicalIF":3.8,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPTs are not the silver bullet: Performance and challenges of using GPTs for security bug report identification","authors":"Horácio L. França , Katerina Goseva-Popstojanova , César Teixeira , Nuno Laranjeiro","doi":"10.1016/j.infsof.2025.107778","DOIUrl":"10.1016/j.infsof.2025.107778","url":null,"abstract":"<div><h3>Context:</h3><div>Identifying security bugs in software is critical to minimize vulnerability windows. Traditionally, bug reports are submitted through issue trackers and manually analyzed, which is time-consuming. Challenges such as data scarcity and imbalance generally hinder the development of effective machine learning models that could be used to automate this task. Generative Pre-trained Transformer (GPT) models do not require training and are less affected by the imbalance problem. Therefore, they have gained popularity for various text-based classification tasks, apparently becoming a natural, highly promising solution for this problem.</div></div><div><h3>Objective:</h3><div>This paper explores the potential of using GPT models to identify security bug reports from the perspective of a user of this type of model. We aim to assess their classification performance in this task compared to traditional machine learning (ML) methods, while also investigating how different factors, such as the prompt used and datasets’ characteristics, affect their results.</div></div><div><h3>Methods:</h3><div>We evaluate the performance of four state-of-the-art GPT models (i.e., GPT4All-Falcon, Wizard, Instruct, OpenOrca) on the task of security bug report identification. We use three different prompts for each GPT model and compare the results with traditional ML models.
The empirical results are based on using bug report data from seven projects (i.e., Ambari, Camel, Derby, Wicket, Nova, OpenStack, and Ubuntu).</div></div><div><h3>Results:</h3><div>GPT models show noticeable difficulties in identifying security bug reports, with performance levels generally lower than traditional ML models. The effectiveness of the GPT models is quite variable, depending on the specific model and prompt used, as well as the particular dataset.</div></div><div><h3>Conclusion:</h3><div>Although GPT models are nowadays used in many types of tasks, including classification, their current performance in security bug report identification is surprisingly insufficient and inferior to traditional ML models. Further research is needed to address the challenges identified in this paper in order to effectively apply GPT models to this particular domain.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107778"},"PeriodicalIF":3.8,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144116798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}