Title: Assessing the effectiveness of large language models for Java vulnerability repair: A comparative study
Authors: Obieda Ananbeh, Wala Alnozami, Dae-Kyoo Kim
DOI: 10.1007/s10515-026-00622-z
Journal: Automated Software Engineering, 33(3)
Impact factor: 3.1; JCR Q3 (Computer Science, Software Engineering)
Publication date: 2026-05-04
URL: https://link.springer.com/article/10.1007/s10515-026-00622-z
Citations: 0
Abstract
Automated software vulnerability repair (SVR) has emerged as a critical area of research, driven by the increasing complexity and security risks inherent in modern software systems. Large Language Models (LLMs), such as ChatGPT-4, Claude 3.5 Sonnet, Gemini 2.0 Flash, and Llama 3.2, have demonstrated remarkable capabilities in software engineering tasks, yet their effectiveness and reliability in repairing vulnerabilities in Java applications have not been thoroughly evaluated. To bridge this gap, this study conducts an extensive comparative evaluation of these prominent LLMs using a novel benchmark comprising 2,362 rigorously validated Java vulnerabilities from 20 diverse real-world projects, categorized across 32 distinct CWE types. Each vulnerability was carefully assessed and validated using the automated tools CodeQL and Snyk together with expert review, ensuring a high-confidence evaluation dataset. The evaluation covers three prompting configurations, namely a one-shot baseline, chain-of-thought (CoT), and retrieval-augmented generation (RAG), and benchmarks model performance against two specialized repair systems, RepairLLaMA and RAP-Gen. The results demonstrate that ChatGPT-4 significantly outperforms the other models, achieving the highest fix rate of 70% and a balanced F1-score of 77.66%, highlighting its solid capability to repair vulnerabilities accurately. While Llama 3.2 showed the highest precision (84.23%), it exhibited lower recall (56.05%), indicating a conservative repair strategy. Detailed project-level analysis reveals substantial performance variations, influenced by project complexity and vulnerability type, with recurring difficulties identified in addressing specific CWEs such as hard-coded credentials (CWE-798) and path traversal (CWE-23). Under RAG prompting, ChatGPT-4 reaches a fix rate of 76.84%, matching or surpassing both RepairLLaMA and RAP-Gen, while CoT prompting yields intermediate gains of 4–5 percentage points across all models.
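The trade-off between Llama 3.2's high precision and lower recall can be made concrete with the standard F1 formula (the harmonic mean of precision and recall). A minimal sketch, using the precision and recall values reported above (the function name is my own):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Llama 3.2's reported precision (84.23%) and recall (56.05%):
llama_f1 = f1_score(0.8423, 0.5605)
print(f"{llama_f1:.4f}")  # 0.6731
```

The implied F1 of about 67% sits well below ChatGPT-4's reported 77.66%, which is why a conservative, high-precision repair strategy does not win on the balanced metric.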
This study offers critical insights into the strengths and limitations of LLM-based vulnerability repair, underscoring the necessity of tailored model selection and adaptation strategies. Future research should address the persistent challenges identified here, particularly contextual and complex vulnerability patterns, to further enhance the practicality and reliability of LLM-driven automated software repair.
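The three prompting configurations evaluated in the study can be sketched as prompt builders. The paper does not publish its exact templates, so all wording below is hypothetical; only the configuration names (one-shot baseline, CoT, RAG) come from the abstract:

```python
def one_shot_prompt(vulnerable_code: str, cwe_id: str) -> str:
    """Baseline: a single instruction plus the vulnerable snippet."""
    return (
        f"The following Java method contains a {cwe_id} vulnerability.\n"
        f"Return a fixed version of the method.\n\n{vulnerable_code}"
    )

def cot_prompt(vulnerable_code: str, cwe_id: str) -> str:
    """Chain-of-thought: ask the model to reason before patching."""
    return (
        f"The following Java method contains a {cwe_id} vulnerability.\n"
        "First explain step by step why the code is vulnerable, "
        "then return a fixed version of the method.\n\n" + vulnerable_code
    )

def rag_prompt(vulnerable_code: str, cwe_id: str, retrieved_fixes: list) -> str:
    """RAG: prepend retrieved fix examples for similar vulnerabilities."""
    context = "\n\n".join(retrieved_fixes)
    return (
        "Here are past fixes for similar vulnerabilities:\n\n" + context +
        f"\n\nNow fix this {cwe_id} vulnerability:\n\n" + vulnerable_code
    )
```

Under this framing, CoT adds only an instruction to reason before patching, while RAG changes the input itself by injecting retrieved repair examples, which is consistent with RAG producing the largest gains in the reported results.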
Journal overview:
This journal publishes research papers, tutorials, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic and collaborative systems, as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, along with formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.