{"title":"Cross-modality-enhanced visual Scene Graph Generation","authors":"Fei Yu , Hui Ji , Yuehua Li","doi":"10.1016/j.inffus.2025.103430","DOIUrl":null,"url":null,"abstract":"<div><div>Humans perceive scenes through multisensory cues, yet existing Scene Graph Generation (SGG) methods predominantly rely on visual input alone, neglecting the complementary information provided by auditory signals and cross-modal interactions. To overcome this limitation, we propose Audio-Enhanced Scene Graph Generation (AESGG), a novel framework that integrates audio cues to enhance both object detection and relation prediction. AESGG improves visual object proposals by incorporating aligned audio features, thereby reducing ambiguity in detection. It further employs a spatio-temporal transformer to model dynamic inter-object relationships over time. A self-supervised learning strategy is introduced to capture relation transitions across video frames effectively. To facilitate research in audio-visual scene understanding, we also present the VALM dataset. Experimental results demonstrate that AESGG consistently outperforms state-of-the-art baselines, achieving up to a 2.0 percentage point improvement in relation prediction metrics (R@50, PredCls, with constraints), reflecting its robust and generalizable performance gains.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"126 ","pages":"Article 103430"},"PeriodicalIF":15.5000,"publicationDate":"2025-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525005032","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Humans perceive scenes through multisensory cues, yet existing Scene Graph Generation (SGG) methods predominantly rely on visual input alone, neglecting the complementary information provided by auditory signals and cross-modal interactions. To overcome this limitation, we propose Audio-Enhanced Scene Graph Generation (AESGG), a novel framework that integrates audio cues to enhance both object detection and relation prediction. AESGG improves visual object proposals by incorporating aligned audio features, thereby reducing ambiguity in detection. It further employs a spatio-temporal transformer to model dynamic inter-object relationships over time. A self-supervised learning strategy is introduced to capture relation transitions across video frames effectively. To facilitate research in audio-visual scene understanding, we also present the VALM dataset. Experimental results demonstrate that AESGG consistently outperforms state-of-the-art baselines, achieving up to a 2.0 percentage point improvement in relation prediction metrics (R@50, PredCls, with constraints), reflecting its robust and generalizable performance gains.
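The abstract describes two fusion steps: enriching visual object proposals with aligned audio features, and modeling inter-object relations over time with a spatio-temporal transformer. The paper's actual architecture is not available here, so the following is a minimal, hypothetical PyTorch sketch of those two ideas only; all module names, dimensions, and the choice of cross-attention as the fusion operator are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of the two ideas named in the abstract:
# (1) audio-enhanced object proposals, (2) a spatio-temporal transformer
# over per-frame proposal features. Shapes and modules are illustrative.
import torch
import torch.nn as nn


class AudioEnhancedProposals(nn.Module):
    """Cross-attend visual proposal features to aligned audio features."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, proposals: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # proposals: (B, N, D) visual proposal features for one frame
        # audio:     (B, M, D) audio features aligned to the same clip
        attended, _ = self.cross_attn(proposals, audio, audio)
        return self.norm(proposals + attended)  # residual fusion


class SpatioTemporalRelationHead(nn.Module):
    """Transformer encoder over proposal tokens flattened across time."""

    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -> one token sequence of length T*N per clip
        b, t, n, d = x.shape
        out = self.encoder(x.reshape(b, t * n, d))
        return out.reshape(b, t, n, d)


if __name__ == "__main__":
    fuse = AudioEnhancedProposals()
    relate = SpatioTemporalRelationHead()
    vis = torch.randn(2, 8, 10, 256)  # B=2 clips, T=8 frames, N=10 proposals
    aud = torch.randn(2, 16, 256)     # M=16 audio tokens per clip
    fused = torch.stack([fuse(vis[:, t], aud) for t in range(vis.shape[1])], dim=1)
    print(relate(fused).shape)        # torch.Size([2, 8, 10, 256])
```

Cross-attention with a residual connection is one common way to let audio tokens disambiguate visual proposals; the paper may well use a different fusion operator, temporal encoding, or relation decoder.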
Journal Introduction
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, and multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.