Hallucination detection in LLM code generation: A sampling-based consensus verification approach

IF 3.1 | CAS Tier 2 (Computer Science) | JCR Q3 (Computer Science, Software Engineering)
Taicheng Huang, Zhanhui Ren, Yuan Huang, Xiangping Chen, Yi Liu, Zibin Zheng
{"title":"Hallucination detection in LLM code generation: A sampling-based consensus verification approach","authors":"Taicheng Huang,&nbsp;Zhanhui Ren,&nbsp;Yuan Huang,&nbsp;Xiangping Chen,&nbsp;Yi Liu,&nbsp;Zibin Zheng","doi":"10.1007/s10515-026-00605-0","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Large Language Models (LLMs) have revolutionized the code generation task, but their output often contains \"hallucinations\" - code snippets that look reasonable but are actually wrong (such as API misuse or logic errors). Existing detection methods mainly rely on dynamic code execution, which requires complex runtime environment configurations. This paper proposes HalluCodeDetector, a new static analysis framework based on sampling consistency verification. The method is based on the following assumption: when LLM correctly understands the problem, its random output shows high consistency in syntactic structure, data flow, and API usage patterns. The process of the method is as follows: for a given problem, we let LLM repeatedly generate multiple code samples and evaluate their semantic/functional consistency, a new metric (MRCM) is used to calculate the average similarity between candidate response and other samples to quantify the possibility of hallucination. Experiments on HumanEval+ and MBPP benchmarks demonstrate that HalluCodeDetector achieves AUROC=0.76, outperforming baseline methods like LYNX by 15.2%, and with lower time overhead. Our method provides a secure, efficient, and generalizable solution for improving the reliability of LLM-generated code.</p>\n </div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"33 2","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2026-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Automated Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10515-026-00605-0","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

Large Language Models (LLMs) have revolutionized code generation, but their output often contains "hallucinations": code snippets that look reasonable but are actually wrong (such as API misuse or logic errors). Existing detection methods rely mainly on dynamic code execution, which requires complex runtime environment configurations. This paper proposes HalluCodeDetector, a new static analysis framework based on sampling consistency verification. The method rests on the following assumption: when an LLM correctly understands a problem, its randomly sampled outputs show high consistency in syntactic structure, data flow, and API usage patterns. The method proceeds as follows: for a given problem, the LLM repeatedly generates multiple code samples and their semantic and functional consistency is evaluated; a new metric (MRCM) quantifies the likelihood of hallucination by computing the average similarity between a candidate response and the other samples. Experiments on the HumanEval+ and MBPP benchmarks demonstrate that HalluCodeDetector achieves an AUROC of 0.76, outperforming baseline methods such as LYNX by 15.2% while incurring lower time overhead. Our method provides a safe, efficient, and generalizable solution for improving the reliability of LLM-generated code.
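To make the consensus idea concrete, below is a minimal sketch of sampling-based scoring under two stated assumptions: MRCM is treated as the mean pairwise similarity of a candidate to the other sampled outputs, and token-level Jaccard similarity stands in for the paper's actual comparison over syntactic structure, data flow, and API usage patterns (the abstract does not specify the similarity function). All function names here are illustrative, not from the paper.

```python
import re

def token_set(code: str) -> set:
    """Crude lexical tokenization: identifiers plus single punctuation
    characters. A stand-in for the paper's structural features."""
    return set(re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code))

def similarity(a: str, b: str) -> float:
    """Jaccard similarity over token sets -- an assumed proxy, not the
    paper's structural/data-flow/API-pattern comparison."""
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def mrcm_score(candidate: str, others: list) -> float:
    """Average similarity of one candidate to the other sampled outputs;
    a low score signals weak consensus, i.e., a likely hallucination."""
    if not others:
        return 1.0
    return sum(similarity(candidate, s) for s in others) / len(others)

# Toy usage: three sampled completions for the same prompt. The third
# diverges from the consensus and receives the lowest score.
samples = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    result = a + b\n    return result",
    "def add(a, b):\n    import os\n    return os.getpid()",
]
for i, cand in enumerate(samples):
    rest = samples[:i] + samples[i + 1:]
    print(f"sample {i}: MRCM = {mrcm_score(cand, rest):.2f}")
```

In practice a threshold on the score (or a ranking across candidates) would decide which generation to flag or discard; the abstract reports that this consensus signal reaches an AUROC of 0.76 with the paper's richer similarity features.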


Source journal
Automated Software Engineering
Engineering & Technology - Computer Science: Software Engineering
CiteScore: 4.80
Self-citation rate: 11.80%
Articles per year: 51
Review time: >12 weeks
Journal description: This journal presents research papers, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems, as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.