Title: Assessing the effectiveness of large language models for Java vulnerability repair: A comparative study
Authors: Obieda Ananbeh, Wala Alnozami, Dae-Kyoo Kim
DOI: 10.1007/s10515-026-00622-z
Journal: Automated Software Engineering, 33(3)
Impact factor: 3.1; JCR Q3 (Computer Science, Software Engineering)
Publication date: 2026-05-04
URL: https://link.springer.com/article/10.1007/s10515-026-00622-z
Citations: 0
Abstract
Automated software vulnerability repair (SVR) has emerged as a critical area of research, driven by the increasing complexity and security risks inherent in modern software systems. Large Language Models (LLMs), such as ChatGPT-4, Claude 3.5 Sonnet, Gemini 2.0 Flash, and Llama 3.2, have demonstrated remarkable capabilities in software engineering tasks, yet their effectiveness and reliability in repairing vulnerabilities in Java applications have not been thoroughly evaluated. To bridge this gap, this study conducts an extensive comparative evaluation of these prominent LLMs using a novel benchmark comprising 2,362 rigorously validated Java vulnerabilities from 20 diverse real-world projects, categorized across 32 distinct CWE types. Each vulnerability was carefully assessed and validated using the automated tools CodeQL and Snyk together with expert review, ensuring a high-confidence evaluation dataset. The evaluation covers three prompting configurations, namely a one-shot baseline, chain-of-thought (CoT), and retrieval-augmented generation (RAG), and benchmarks model performance against two specialized repair systems, RepairLLaMA and RAP-Gen. The results demonstrate that ChatGPT-4 significantly outperforms the other models, achieving the highest fix rate of 70% and a balanced F1-score of 77.66%, highlighting its solid capability to repair vulnerabilities accurately. While Llama 3.2 showed the highest precision (84.23%), it exhibited lower recall (56.05%), indicating a conservative repair strategy. Detailed project-level analysis reveals substantial performance variations, influenced by project complexity and vulnerability type, with recurring difficulties identified in addressing specific CWEs such as hard-coded credentials (CWE-798) and path traversal (CWE-23). Under RAG prompting, ChatGPT-4 reaches a fix rate of 76.84%, matching or surpassing both RepairLLaMA and RAP-Gen, while CoT prompting yields intermediate gains of 4–5 percentage points across all models.
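The trade-off between Llama 3.2's high precision and lower recall can be made concrete with the standard F1 formula (the harmonic mean of precision and recall). A minimal sketch, using the precision and recall values reported above (the function name is my own):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Llama 3.2's reported precision (84.23%) and recall (56.05%):
llama_f1 = f1_score(0.8423, 0.5605)
print(f"{llama_f1:.4f}")  # 0.6731
```

The implied F1 of about 67% sits well below ChatGPT-4's reported 77.66%, which is why a conservative, high-precision repair strategy does not win on the balanced metric.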
This study offers critical insights into the strengths and limitations of LLM-based vulnerability repair, underscoring the necessity of tailored model selection and adaptation strategies. Future research should address the persistent challenges identified here, particularly contextual and complex vulnerability patterns, to further enhance the practicality and reliability of LLM-driven automated software repair.
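The three prompting configurations evaluated in the study can be sketched as prompt builders. The paper does not publish its exact templates, so all wording below is hypothetical; only the configuration names (one-shot baseline, CoT, RAG) come from the abstract:

```python
def one_shot_prompt(vulnerable_code: str, cwe_id: str) -> str:
    """Baseline: a single instruction plus the vulnerable snippet."""
    return (
        f"The following Java method contains a {cwe_id} vulnerability.\n"
        f"Return a fixed version of the method.\n\n{vulnerable_code}"
    )

def cot_prompt(vulnerable_code: str, cwe_id: str) -> str:
    """Chain-of-thought: ask the model to reason before patching."""
    return (
        f"The following Java method contains a {cwe_id} vulnerability.\n"
        "First explain step by step why the code is vulnerable, "
        "then return a fixed version of the method.\n\n" + vulnerable_code
    )

def rag_prompt(vulnerable_code: str, cwe_id: str, retrieved_fixes: list) -> str:
    """RAG: prepend retrieved fix examples for similar vulnerabilities."""
    context = "\n\n".join(retrieved_fixes)
    return (
        "Here are past fixes for similar vulnerabilities:\n\n" + context +
        f"\n\nNow fix this {cwe_id} vulnerability:\n\n" + vulnerable_code
    )
```

Under this framing, CoT adds only an instruction to reason before patching, while RAG changes the input itself by injecting retrieved repair examples, which is consistent with RAG producing the largest gains in the reported results.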
Journal overview:
This journal publishes research papers, tutorials, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic and collaborative systems, as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, along with formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.