Coherence-aware and snap-triggered: A novel mechanism for audio-visual cooperative tasks
Authors: Cunhan Guo, Heyan Huang, Ruiqi Hu, Danjie Han
Journal: Expert Systems with Applications, Volume 313, Article 131559 (Impact Factor 7.5; JCR Q1, Computer Science, Artificial Intelligence)
Publication date: 2026-06-01 (Epub 2026-02-07)
DOI: 10.1016/j.eswa.2026.131559
URL: https://www.sciencedirect.com/science/article/pii/S0957417426004720
Citations: 0
Abstract
Audio-Visual Cooperative (AVC) tasks underpin multimodal scene understanding and compel models to reconcile continuous temporal evolution with abrupt sensory transitions. We propose the Coherence-Aware and Snap-Triggered (CAST) mechanism, a plug-in temporal refinement layer that neither perturbs backbone parameters nor demands additional modalities. The Exponential Memory based Coherence-Aware module attenuates the contributions of distant frames through an exponentially decaying weight envelope, thereby preventing obsolete disruptions from exerting persistent influence. Complementarily, the Optical Flow based Snap-Triggered module registers instantaneous motion discontinuities and reallocates attention toward nascent events. Operating in concert, these modules yield a representation that remains coherent across smooth transitions yet responsive to sudden perturbations. Empirical evaluation across multiple AVC benchmarks demonstrates consistent superiority over established baselines, corroborating that CAST enhances temporal fidelity and, by extension, the reliability of downstream multimodal decisions.
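The two ideas named in the abstract, an exponentially decaying weight envelope over past frames and a trigger that reacts to abrupt motion changes, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `exponential_weights` and `refine`, the parameters `decay` and `snap_threshold`, the use of scalar per-frame features, and the choice to reset the memory window on a trigger are all assumptions made for illustration. The per-frame motion magnitude is assumed to be precomputed (for example, from an optical flow estimator) and is simply passed in as a list.

```python
from typing import List, Sequence


def exponential_weights(num_frames: int, decay: float) -> List[float]:
    """Return normalized weights; index 0 is the current frame, higher is older."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    raw = [decay ** age for age in range(num_frames)]
    total = sum(raw)
    return [w / total for w in raw]


def refine(
    features: Sequence[float],
    motion: Sequence[float],
    decay: float = 0.5,
    snap_threshold: float = 1.0,
) -> List[float]:
    """Smooth scalar per-frame features with an exponential memory.

    Hypothetical trigger rule (an assumption, not taken from the paper):
    if the motion magnitude changes by more than snap_threshold between
    consecutive frames, older frames are dropped from the memory so the
    new event dominates the refined value.
    """
    if len(features) != len(motion):
        raise ValueError("features and motion must have the same length")
    refined: List[float] = []
    start = 0  # index of the oldest frame still held in memory
    for t in range(len(features)):
        if t > 0 and abs(motion[t] - motion[t - 1]) > snap_threshold:
            start = t  # snap: forget everything before the discontinuity
        window = features[start : t + 1]
        weights = exponential_weights(len(window), decay)
        # weights[0] belongs to the current frame, so walk the window backwards
        refined.append(sum(w * x for w, x in zip(weights, reversed(window))))
    return refined


if __name__ == "__main__":
    feats = [1.0, 1.0, 1.0, 5.0, 5.0]
    flow_magnitude = [0.1, 0.1, 0.1, 3.0, 3.0]  # assumed precomputed
    print(refine(feats, flow_magnitude))
    print(refine(feats, flow_magnitude, snap_threshold=100.0))
```

With the trigger enabled, the refined value follows the feature jump at the fourth frame immediately; with the threshold set high enough to disable it, the same frame is still pulled toward the older values by the decaying memory. A real system would operate on feature vectors and would likely modulate attention rather than hard-reset a window.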
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.