CAT+: Investigating and Enhancing Audio-visual Understanding in Large Language Models
Qilang Ye, Zitong Yu, Rui Shao, Yawen Cui, Xiangui Kang, Xin Liu, Philip Torr, Xiaochun Cao
IEEE Transactions on Pattern Analysis and Machine Intelligence (published 2025-06-25)
DOI: 10.1109/tpami.2025.3582389
Citations: 0
Abstract
Multimodal Large Language Models (MLLMs) have gained significant attention due to their rich internal implicit knowledge for cross-modal learning. Although advances in bringing audio-visuals into LLMs have resulted in boosts for a variety of Audio-Visual Question Answering (AVQA) tasks, they still face two crucial challenges: 1) audio-visual ambiguity, and 2) audio-visual hallucination. Existing MLLMs can respond to audio-visual content, yet sometimes fail to describe specific objects due to the ambiguity or hallucination of responses. To overcome the two aforementioned issues, we introduce the CAT+, which enhances MLLM to ensure more robust multimodal understanding. We first propose the Sequential Question-guided Module (SQM), which combines tiny transformer layers and cascades Q-Formers to realize a solid audio-visual grounding. After feature alignment and high-quality instruction tuning, we introduce Ambiguity Scoring Direct Preference Optimization (AS-DPO) to correct the problem of CAT+ bias toward ambiguous descriptions. To explore the hallucinatory deficits of MLLMs in dynamic audio-visual scenes, we build a new Audio-visual Hallucination Benchmark, named AVHbench. This benchmark detects the extent of MLLM's hallucinations across three different protocols in the perceptual object, counting, and holistic description tasks. Extensive experiments across video-based understanding, open-ended, and close-ended AVQA demonstrate the superior performance of our method.
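To make the Ambiguity Scoring Direct Preference Optimization (AS-DPO) idea concrete, the sketch below is a minimal, assumption-laden illustration rather than the paper's actual formulation: it applies a hypothetical per-pair ambiguity score in [0, 1] as a weight on an otherwise standard DPO loss, so preference pairs whose rejected response is judged more ambiguous contribute more strongly to the objective. All names and the weighting scheme are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def as_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                ambiguity_scores, beta=0.1):
    """Illustrative DPO-style loss with an ambiguity weight (not the paper's exact AS-DPO).

    Inputs are per-pair summed log-probabilities of the preferred (precise) and
    rejected (ambiguous) responses under the policy and a frozen reference model.
    `ambiguity_scores` is a hypothetical tensor in [0, 1] scoring how ambiguous
    each rejected response is; higher scores upweight the pair.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps      # log pi/pi_ref for preferred
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    per_pair = -F.logsigmoid(logits)                            # standard DPO term per pair
    return (ambiguity_scores * per_pair).mean()                 # ambiguity-weighted average

# Example usage with dummy batch values
if __name__ == "__main__":
    b = 4
    loss = as_dpo_loss(torch.randn(b), torch.randn(b),
                       torch.randn(b), torch.randn(b),
                       ambiguity_scores=torch.rand(b))
    print(loss.item())
```

In the paper's setting, the log-probabilities would presumably come from the CAT+ policy and a frozen reference copy scored on preferred (precise) versus dispreferred (ambiguous) audio-visual descriptions; the exact scoring and weighting used by AS-DPO may differ from this sketch.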
Journal introduction:
The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.