Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding.

Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, Wenxue Li, Yulong Li, Wenxuan Song, Shiyan Su, Wei Feng, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, Zongyuan Ge
{"title":"Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding.","authors":"Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, Wenxue Li, Yulong Li, Wenxuan Song, Shiyan Su, Wei Feng, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, Zongyuan Ge","doi":"10.1109/cvpr52734.2025.02435","DOIUrl":null,"url":null,"abstract":"<p><p>Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"26147-26159"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12425127/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/cvpr52734.2025.02435","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/8/13 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
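The abstract describes the mechanism only at a high level. The sketch below is a minimal, hypothetical Python/PyTorch rendering of the general idea, not the paper's implementation: a causal attention mask whose upper triangle carries a finite "register" bias (rather than -inf) that can absorb attention mass otherwise drawn to outlier tokens, and whose penalty on past tokens diminishes with distance so far-preceding tokens remain reachable. The function farsight_style_mask and its parameters register_bias and decay are illustrative assumptions, as is the specific decay schedule.

```python
# Hypothetical sketch of a register-style causal mask; not the FarSight code.
import math
import torch


def farsight_style_mask(seq_len: int,
                        register_bias: float = -4.0,
                        decay: float = 0.5) -> torch.Tensor:
    """Additive attention-mask sketch of shape (seq_len, seq_len)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    dist = (i - j).clamp(min=0).float()      # how far back each key lies

    mask = torch.zeros(seq_len, seq_len)

    # Upper triangle as an "attention register": a large-but-finite negative
    # bias instead of -inf, so a small amount of probability mass can be
    # parked there rather than concentrating on outlier tokens.
    mask = mask.masked_fill(j > i, register_bias)

    # Lower triangle: a masking penalty whose rate diminishes with distance,
    # so tokens far in the past are not progressively suppressed.
    past = j <= i
    mask[past] = -decay / (1.0 + dist[past])

    return mask


def masked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    # q, k, v: (seq_len, d_model); mask: (seq_len, seq_len) additive bias.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)) + mask
    return torch.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    L, d = 8, 16
    q, k, v = (torch.randn(L, d) for _ in range(3))
    out = masked_attention(q, k, v, farsight_style_mask(L))
    print(out.shape)  # torch.Size([8, 16])
```

Because such a bias is only added to the attention scores at decoding time and requires no retraining, it illustrates why a mask-level intervention of this kind can be plug-and-play across different MLLMs.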
