{"title":"On-demand generation of high-quality software engineering datasets using large language models and ontologies","authors":"George Bishop, Suranjan Chakraborty, Honghe Zhou, Josh Dehlinger, Lin Deng, Jonah Lin, Benjamin Kist","doi":"10.1007/s10515-026-00617-w","DOIUrl":"10.1007/s10515-026-00617-w","url":null,"abstract":"<div>\u0000 \u0000 <p>Recent advances in generative artificial intelligence (AI) and machine learning (ML) have renewed interest in realizing the long-standing goal of computer-aided software engineering by improving software quality and productivity. Although these techniques have been applied across many software engineering (SE) tasks, their effectiveness depends heavily on access to large, high-quality, labeled, domain-specific datasets, which remain limited, particularly in requirements engineering (RE) where research often relies on natural language artifacts. Existing, public datasets are typically small, contain labeling ambiguities, and show substantial class imbalance, which restricts the development, evaluation, and reproducibility of AI-driven SE approaches. To address these challenges, this paper presents the O3DG approach, a repeatable method for generating on-demand, high-quality, ontology-aligned datasets using large language models (LLMs). O3DG integrates prompt engineering strategies, domain-specific seed examples, and ML-based validation to synthesize diverse and cohesive datasets suitable for SE research. The approach is demonstrated through two representative RE case studies involving the classification of non-functional requirements and the detection of ambiguity in software requirements. For each case, the paper details the O3DG pipeline, ontology mappings, and validation steps that ensure dataset reliability and practical utility. Results show that O3DG produces datasets with strong category cohesion, improved balance across classes, and effective support for ML training. More broadly, the study illustrates how LLM-assisted dataset synthesis can help overcome persistent data limitations and provides a transferable process for producing high-quality datasets across additional SE domains.</p>\u0000 </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-026-00617-w.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advancing research software engineering with AI: a research framework","authors":"Siamak Farshidi, Kwabena Ebo Bennin, Önder Babur, June Sallou, Ayalew Kassahun, Bedir Tekinerdogan","doi":"10.1007/s10515-026-00621-0","DOIUrl":"10.1007/s10515-026-00621-0","url":null,"abstract":"<div>\u0000 \u0000 <p>Research software has become a central pillar of scientific discovery, yet its engineering quality, sustainability, and reproducibility vary widely across projects. At the same time, advances in artificial intelligence (AI), particularly generative AI (GenAI), are rapidly transforming how software is developed. While these tools promise productivity gains, their broader impact on research software engineering practices remains poorly understood at scale. In this study, we present a large-scale empirical analysis of AI-assisted research software engineering. We analyzed 1,510 open-source research software repositories retrieved from Zenodo using the IEEE Taxonomy 2025 top-level categories (598 query terms), restricted to records labeled Software and created after November 2022 (post-GenAI emergence), with duplicate and incomplete entries removed. To distinguish archival dissemination from active development, we separate Zenodo-only artifacts from records linked to evolving GitHub repositories and enrich the latter with repository-level development indicators. Our analysis integrates multiple dimensions, including software engineering maturity (e.g., documentation, automation, testing, and releases), FAIRness for research software (FAIR4RS metadata indicators), inferred AI and GenAI usage, and operational signals related to AIOps and MLOps practices. Based on these indicators, we propose and empirically ground a quadrant-based model that characterizes research software development modes along the axes of engineering maturity and AI integration. The results show that AI-assisted practices are increasingly present in research software, but their adoption remains uneven and often decoupled from established engineering disciplines. Repositories classified as AI4RSE exhibit longer active lifespans, stronger maintenance signals, and higher FAIR alignment than exploratory or informally developed projects. At the same time, a substantial fraction of Zenodo artifacts represent archival snapshots rather than evolving software, highlighting the importance of interpreting engineering indicators in light of dissemination intent. This work contributes (i) a large-scale empirical characterization based on 1,510 repositories of AI-assisted research software development, (ii) an integrated analytical framework combining software engineering, FAIRness, AI usage, and operational practices, and (iii) evidence-based insights into the opportunities and challenges of responsible and sustainable AI4RSE. 
Together, these contributions provide a foundation for future research and practical guidance on integrating AI into research software engineering.</p>\u0000 </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-026-00621-0.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
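
A tiny sketch of the quadrant classification, assuming two normalized scores in [0, 1] and a simple 0.5 threshold. The paper derives its indicators from documentation, automation, testing, FAIR4RS metadata, and AI-usage signals; the labels for the three quadrants other than AI4RSE below are illustrative, not the paper's terminology.

def quadrant(engineering_maturity: float, ai_integration: float,
             threshold: float = 0.5) -> str:
    # Place a repository in one of four development modes by thresholding
    # the two axes of the quadrant model.
    if engineering_maturity >= threshold:
        return "AI4RSE" if ai_integration >= threshold else "traditional RSE"
    return "AI-assisted exploratory" if ai_integration >= threshold else "informal/exploratory"

print(quadrant(0.8, 0.7))  # -> "AI4RSE"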
{"title":"Measuring security posture of NAS third-party packages ecosystem: an empirical analysis","authors":"Jianbin Xu, Cheng Huang, Yutong Zeng, Jianguo Zhao, Tao Leng, Pin Yang","doi":"10.1007/s10515-026-00615-y","DOIUrl":"10.1007/s10515-026-00615-y","url":null,"abstract":"<div>\u0000 \u0000 <p>Network-Attached Storage (NAS) devices are essential in the IoT ecosystem, widely used for enterprise data exchange and personal cloud storage. Managed via web-based interfaces and network file-sharing protocols, they are increasingly integrated with cloud services, making them vulnerable to cyber threats. While previous research has focused on NAS firmware and public port security, the security of NAS third-party packages remains largely unexplored. These packages, integrated through web services and APIs, introduce new attack surfaces. To address this gap, we propose NASScanner, an analysis framework for automated package collection, preprocessing, and security assessment. Using NASScanner, we conducted the first large-scale security measurement of NAS third-party packages, analyzing 1,489 packages—the largest dataset of its kind. Our study examined third-party component security, attack mitigation measures, and sensitive information exposure. Leveraging LLM-powered binary analysis (BinaryAI) performs semantic-level function similarity detection, enabling accurate identification of insecure third-party components. Our findings reveal critical security concerns: ① Extensive vulnerabilities. 689 packages contain 36,162 vulnerabilities linked to 4,167 distinct CVEs. ② Low mitigation implementation. Only 22.3% of packages employ Position Independent Executable for security. ③ Sensitive data exposure. 45.87% of packages risk data leaks, with 23,821 instances of direct exposure on the open internet. Our findings highlight significant security risks in NAS third-party packages and provide valuable insights to enhance NAS device security.</p>\u0000 </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond syntax: enhancing automated documentation with data differences","authors":"Giacomo Fantino, Antonio Vetro’, Marco Torchiano, Federica Cappelluti","doi":"10.1007/s10515-026-00623-y","DOIUrl":"10.1007/s10515-026-00623-y","url":null,"abstract":"<div><p>Modern software development automation is mostly based on AI, covering every aspect of code production and maintenance, throughout the entire software development lifecycle, from requirements and code writing to testing and maintenance. Code commenting is no exception. Automated code comment generation methods rely on static syntactic and lexical features of source code. However, these approaches frequently underperform in data-centric software applications, where understanding the effect of code on data is essential. We explore an execution-aware extension to automatic documentation generation. In this exploratory work, we aim at capturing post-execution data transformations (i.e., <i>semantic data differences)</i> that reveal the code’s effect on data, and use it as a complementary signal alongside existing code representations to automate explanatory comments for data wrangling code. We build a curated dataset of Python notebooks from Kaggle and apply a lightweight execution tracer to extract structured descriptions of runtime data transformations. We define a formal grammar for capturing these effects and integrate them into a multimodal encoder-decoder model using co-attention mechanisms. Multiple training strategies are explored to assess the impact of this new modality on comment generation. Our evaluation reveals that models incorporating this modality performed competitively with code-only baselines. Notably, in cases where no observable data transformation occurred, the presence of symbolic <span>(langle mathsf {no_diff} rangle)</span> signals led to improved robustness and higher comment quality, as measured by both automatic and human evaluation metrics. However, we did not observe improvements in comment quality in semantically rich scenarios, suggesting possible paths of improvement for future research direction. Qualitative analysis of generated comments supports this pattern, indicating that the modality helps stabilize comments by reducing unnecessary or speculative details in neutral cases, but does not provide yet consistent guidance when meaningful data transformations occur. These trends are less pronounced on a larger, noisier extended test set, suggesting sensitivity to comment–code alignment. Our study demonstrates the feasibility and potential of using execution-derived feedback as a complementary signal in automated comment generation. While the current approach is limited by dataset size and modality noise, it demonstrates that post-execution state changes can guide more context-aware and stable code summarization. 
This suggests a promising direction for execution-sensitive models in assisting data-centric software development and its documentation.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-026-00623-y.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
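
The core mechanism, capturing a post-execution data difference and falling back to a symbolic ⟨no_diff⟩ token, can be sketched in a few lines of pandas. The paper defines a formal grammar covering far more transformation types; the facts emitted below are illustrative only.

import pandas as pd

def data_diff(before: pd.DataFrame, after: pd.DataFrame) -> str:
    # Summarize observable structural changes between the pre- and
    # post-execution states of a DataFrame.
    facts = []
    added = set(after.columns) - set(before.columns)
    dropped = set(before.columns) - set(after.columns)
    if added:
        facts.append(f"added_columns={sorted(added)}")
    if dropped:
        facts.append(f"dropped_columns={sorted(dropped)}")
    if len(after) != len(before):
        facts.append(f"rows:{len(before)}->{len(after)}")
    return "; ".join(facts) if facts else "<no_diff>"

df = pd.DataFrame({"a": [1, 2, None]})
print(data_diff(df, df.dropna()))  # -> "rows:3->2"
print(data_diff(df, df))           # -> "<no_diff>"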
{"title":"DIR-SMOTE: a density-influence resampling framework for imbalanced code smell detection","authors":"Ruchika Malhotra, Bhawna Jain, Marouane Kessentini","doi":"10.1007/s10515-026-00624-x","DOIUrl":"10.1007/s10515-026-00624-x","url":null,"abstract":"<div>\u0000 \u0000 <p>Code smell detection is vital for ensuring software quality, but the imbalance between smelly and non-smelly code instances impairs detection, especially for minority smells like Data Class and Feature Envy. Existing oversampling techniques, such as Synthetic Minority Oversampling Technique (SMOTE), Borderline-SMOTE (BL-SMOTE), and Adaptive Synthetic (ADASYN), attempt to mitigate this issue but often introduce noise or semantically irrelevant samples. This study proposes DIR-SMOTE (Density and Influence-based Resampling using SMOTE), a density and explanation-guided resampling framework that integrates local density estimation and SHapley Additive exPlanations (SHAP)-based feature importance to improve the quality of synthetic minority samples. Initially, DIR-SMOTE filters out noisy or isolated minority instances using density metrics. It then employs SHAP to identify the most influential features per instance. Synthetic samples are generated by interpolating between dense neighbors while perturbing only top-ranked SHAP features, thereby preserving semantic integrity. DIR-SMOTE is evaluated on five benchmark datasets, namely, Apache, jEdit, EDTForCSD, DesigniteJava, and MLCQ, across multiple smells such as Long Method, Feature Envy, and Data Class. Compared to nine standard resampling methods, DIR-SMOTE achieves up to 6.7% improvement in F1-score and 5.1% in precision, consistently enhancing smelly code detection in both binary and multiclass settings. Rather than relying on complex generative models, DIR-SMOTE focuses on explanation-guided and density-aware sample generation that remains transparent and computationally efficient. Overall, it offers a lightweight and robust solution that can be seamlessly integrated into practical quality assurance workflows, including automated smell detection tools and IDE-based analyzers.</p>\u0000 </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing the effectiveness of large language models for Java vulnerability repair: A comparative study","authors":"Obieda Ananbeh, Wala Alnozami, Dae-Kyoo Kim","doi":"10.1007/s10515-026-00622-z","DOIUrl":"10.1007/s10515-026-00622-z","url":null,"abstract":"<div>\u0000 \u0000 <p>Automated software vulnerability repair (SVR) has emerged as a critical area of research, driven by the increasing complexity and security risks inherent in modern software systems. Large Language Models (LLMs), such as ChatGPT-4, Claude 3.5 Sonnet, Gemini 2.0 Flash, and Llama 3.2, have demonstrated remarkable capabilities in software engineering tasks, yet their effectiveness and reliability in repairing vulnerabilities in Java applications have not been thoroughly evaluated. To bridge this gap, this study conducts an extensive comparative evaluation of these prominent LLMs using a novel benchmark comprising 2,362 rigorously validated Java vulnerabilities from 20 diverse real-world projects, categorized across 32 distinct CWE types. Each vulnerability was carefully assessed and validated using automated tools CodeQL and Snyk and expert review, ensuring a high-confidence evaluation dataset. The evaluation covers three prompting configurations one-shot baseline, chain-of-thought (CoT), and retrieval-augmented generation (RAG) and benchmarks model performance against two specialized repair systems, RepairLLaMA and RAP-Gen. The results demonstrate that ChatGPT-4 significantly outperforms other models, achieving the highest fix rate of 70% and a balanced F1-score of 77.66%, highlighting its solid capability to repair vulnerabilities accurately. While Llama 3.2 showed the highest precision 84.23%, it exhibited lower recall 56.05%, indicating a conservative repair strategy. Detailed project-level analysis reveals substantial performance variations, influenced by project complexity and vulnerability type, with recurring difficulties identified in addressing specific CWEs such as hard-coded credentials (CWE-798) and path traversal (CWE-23). Under RAG prompting, ChatGPT-4 reaches a fix rate of 76.84%, matching or surpassing both RepairLLaMA and RAP-Gen, while CoT prompting yields intermediate gains of 4–5 percentage points across all models. This study underscores critical insights into the strengths and limitations of LLM-based vulnerability repair, emphasizing the necessity of tailored model selection and adaptation strategies. Future research should address identified persistent challenges, particularly contextual and complex vulnerability patterns, to further enhance the practicality and reliability of LLM-driven automated software repair.</p>\u0000 </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated testing of prevalent 3D user interactions in virtual reality applications","authors":"Ruizhen Gu, José Miguel Rojas, Donghwan Shin","doi":"10.1007/s10515-026-00620-1","DOIUrl":"10.1007/s10515-026-00620-1","url":null,"abstract":"<div>\u0000 \u0000 <p>Virtual Reality (VR) technologies offer immersive user experiences across various domains, but present unique testing challenges compared to traditional software. Existing VR testing approaches enable scene navigation and interaction activation, but lack the ability to automatically synthesise realistic 3D user inputs (e.g, grab and trigger actions via hand-held controllers). Automated testing that generates and executes such input remains an unresolved challenge. Furthermore, existing metrics fail to robustly capture diverse interaction coverage. This paper addresses these gaps through four key contributions. First, we empirically identify four prevalent interaction types in nine open-source VR projects: <i>fire</i>, <i>manipulate</i>, <i>socket</i>, and <i>custom</i>. Second, we introduce the <i>Interaction Flow Graph</i>, a novel abstraction that systematically models 3D user interactions by identifying targets, actions, and conditions. Third, we construct <span>XRBench3D</span>, a benchmark comprising ten VR scenes that encompass 456 distinct user interactions for evaluating VR interaction testing. Finally, we present <span>XRintTest</span>, an automated testing approach that leverages this graph for dynamic scene exploration and interaction execution. Evaluation on <span>XRBench3D</span> shows that <span>XRintTest</span> achieves great effectiveness, reaching 93% coverage of <i>fire</i>, <i>manipulate</i> and <i>socket</i> interactions across all scenes, and performing 12x more effectively and 6x more efficiently than random exploration. Moreover, <span>XRintTest</span> can detect runtime exceptions and non-exception interaction issues, including subtle configuration defects. In addition, the Interaction Flow Graph can reveal potential interaction design smells that may compromise intended functionality and hinder testing performance for VR applications.</p>\u0000 </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-026-00620-1.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New criteria for test case prioritization for software product lines. A replication and extension study","authors":"Andrada Georgia Tiutin, Andreea Vescan","doi":"10.1007/s10515-026-00619-8","DOIUrl":"10.1007/s10515-026-00619-8","url":null,"abstract":"<div><p>Testing software product lines represents a challenging task mainly because there are many derivable products. To facilitate this issue, multiple solutions were developed to reduce the number of products that are tested while maintaining a good percentage of coverage. However, the order of testing products has received little consideration. The purpose of this research is twofold: first, to replicate the results of a previous study (which uses two specific metrics for prioritization, namely, Variability Coverage & Cyclomatic Complexity - VC&CC, and Coefficient of Connectivity-Density - CoC), and second, to investigate two new metrics to be used as prioritization criteria (Ratio of Variability - RoV, and Flexibility of Configuration - FoC). The APFD (Average Percentage of Faults Detected) metric is used to evaluate the results obtained. In the investigation, a set of 9 feature models with various numbers of features, grouped in three intervals, was used. The results show that the original findings are confirmed for all feature models used. Regarding the new criteria used, FoC and RoV outperformed the CoC metric in 6 out of 9 cases, and also obtained the best results in 3 out of 9 cases. In the other 6 out of 9 cases the VC&CC criterion obtained the best results.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-026-00619-8.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic techniques for issue report classification: A systematic mapping study","authors":"Muhammad Laiq, Felix Dobslaw","doi":"10.1007/s10515-026-00616-x","DOIUrl":"10.1007/s10515-026-00616-x","url":null,"abstract":"<div>\u0000 \u0000 <p>Several studies have evaluated automatic techniques for classifying software issue reports into bugs and non-bugs to assist practitioners in effectively assigning relevant resources based on the type of issue. Currently, no comprehensive overview of this area has been published. A comprehensive overview will help identify future research directions and provide an extensive collection of potentially relevant existing solutions. This study aims to provide a comprehensive overview of the use of automatic techniques to classify issue reports. We conducted a systematic mapping study and identified 46 studies on the topic. The study results indicate that the existing literature applies various techniques for classifying issue reports, including traditional machine learning and deep learning-based techniques, and more advanced large language models. Furthermore, we observe that these studies (a) lack the involvement of practitioners, (b) do not consider other potentially relevant adoption factors beyond prediction accuracy, such as the explainability, scalability, and generalizability of the techniques, and (c) mainly rely on archival data from open-source repositories only. Therefore, future research should focus on real industrial evaluations, consider other potentially relevant adoption factors, and actively involve practitioners.</p>\u0000 </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 3","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10515-026-00616-x.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147807620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hallucination detection in LLM code generation: A sampling-based consensus verification approach","authors":"Taicheng Huang, Zhanhui Ren, Yuan Huang, Xiangping Chen, Yi Liu, Zibin Zheng","doi":"10.1007/s10515-026-00605-0","DOIUrl":"10.1007/s10515-026-00605-0","url":null,"abstract":"<div>\u0000 \u0000 <p>Large Language Models (LLMs) have revolutionized the code generation task, but their output often contains \"hallucinations\" - code snippets that look reasonable but are actually wrong (such as API misuse or logic errors). Existing detection methods mainly rely on dynamic code execution, which requires complex runtime environment configurations. This paper proposes HalluCodeDetector, a new static analysis framework based on sampling consistency verification. The method is based on the following assumption: when LLM correctly understands the problem, its random output shows high consistency in syntactic structure, data flow, and API usage patterns. The process of the method is as follows: for a given problem, we let LLM repeatedly generate multiple code samples and evaluate their semantic/functional consistency, a new metric (MRCM) is used to calculate the average similarity between candidate response and other samples to quantify the possibility of hallucination. Experiments on HumanEval+ and MBPP benchmarks demonstrate that HalluCodeDetector achieves AUROC=0.76, outperforming baseline methods like LYNX by 15.2%, and with lower time overhead. Our method provides a secure, efficient, and generalizable solution for improving the reliability of LLM-generated code.</p>\u0000 </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 2","pages":""},"PeriodicalIF":3.1,"publicationDate":"2026-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147561684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}