{"title":"基于多模态注意的视听检索深度相关学习","authors":"Jiwei Zhang, Hirotaka Hachiya","doi":"10.1016/j.mlwa.2025.100695","DOIUrl":null,"url":null,"abstract":"<div><div>The cross-modal retrieval task aims to retrieve audio modality information from the database that best matches the visual modality and vice versa. One of the key challenges in this field is the inconsistency of audio and visual features, which increases the complexity of capturing cross-modal information, making it difficult for machines to accurately understand visual content and retrieve suitable audio data. In this work, we propose a novel deep correlation learning with multi-modal attention (DCLMA) for visual-audio retrieval, which selectively focuses on relevant information fragments through multi-modal attention, and effectively integrates audio-visual information to enhance modal interaction and correlation representation learning capabilities. First, to achieve accurate retrieval of associated multi-modal data, we utilize multiple attention-composed models to interactively learn the complex correlation of audio and visual multi-scale features. Second, cross-modal attention is exploited to mine inter-modal correlations at the global level. Finally, we combine multi-scale and global-level representations to obtain modality-integrated representations, which enhance the representation capabilities of inputs. Furthermore, our objective function supervised model learns discriminative and modality-invariant features between samples from different semantic categories in the mutual latent space. Experimental results on cross-modal retrieval on two widely used benchmark datasets demonstrate that our proposed approach is superior in learning effective representations and significantly outperforms state-of-the-art cross-modal retrieval methods. Code is available at <span><span>https://github.com/zhangjiwei-japan/cross-modal-visual-audio-retrieval</span><svg><path></path></svg></span></div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"21 ","pages":"Article 100695"},"PeriodicalIF":4.9000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DCLMA: Deep correlation learning with multi-modal attention for visual-audio retrieval\",\"authors\":\"Jiwei Zhang, Hirotaka Hachiya\",\"doi\":\"10.1016/j.mlwa.2025.100695\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The cross-modal retrieval task aims to retrieve audio modality information from the database that best matches the visual modality and vice versa. One of the key challenges in this field is the inconsistency of audio and visual features, which increases the complexity of capturing cross-modal information, making it difficult for machines to accurately understand visual content and retrieve suitable audio data. In this work, we propose a novel deep correlation learning with multi-modal attention (DCLMA) for visual-audio retrieval, which selectively focuses on relevant information fragments through multi-modal attention, and effectively integrates audio-visual information to enhance modal interaction and correlation representation learning capabilities. First, to achieve accurate retrieval of associated multi-modal data, we utilize multiple attention-composed models to interactively learn the complex correlation of audio and visual multi-scale features. 
Second, cross-modal attention is exploited to mine inter-modal correlations at the global level. Finally, we combine multi-scale and global-level representations to obtain modality-integrated representations, which enhance the representation capabilities of inputs. Furthermore, our objective function supervised model learns discriminative and modality-invariant features between samples from different semantic categories in the mutual latent space. Experimental results on cross-modal retrieval on two widely used benchmark datasets demonstrate that our proposed approach is superior in learning effective representations and significantly outperforms state-of-the-art cross-modal retrieval methods. Code is available at <span><span>https://github.com/zhangjiwei-japan/cross-modal-visual-audio-retrieval</span><svg><path></path></svg></span></div></div>\",\"PeriodicalId\":74093,\"journal\":{\"name\":\"Machine learning with applications\",\"volume\":\"21 \",\"pages\":\"Article 100695\"},\"PeriodicalIF\":4.9000,\"publicationDate\":\"2025-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine learning with applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666827025000787\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning with applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666827025000787","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The cross-modal retrieval task aims to retrieve the audio data in a database that best matches a given visual query, and vice versa. One of the key challenges in this field is the inconsistency between audio and visual features, which increases the complexity of capturing cross-modal information and makes it difficult for machines to accurately understand visual content and retrieve suitable audio data. In this work, we propose deep correlation learning with multi-modal attention (DCLMA), a novel method for visual-audio retrieval that selectively focuses on relevant information fragments through multi-modal attention and effectively integrates audio-visual information to strengthen modal interaction and correlation representation learning. First, to achieve accurate retrieval of associated multi-modal data, we use multiple attention-based modules to interactively learn the complex correlations among multi-scale audio and visual features. Second, cross-modal attention is exploited to mine inter-modal correlations at the global level. Finally, we combine the multi-scale and global-level representations into modality-integrated representations, which enhance the representational capability of the inputs. Furthermore, our objective function supervises the model to learn discriminative and modality-invariant features between samples from different semantic categories in a shared latent space. Experimental results on two widely used cross-modal retrieval benchmark datasets demonstrate that the proposed approach learns effective representations and significantly outperforms state-of-the-art cross-modal retrieval methods. Code is available at https://github.com/zhangjiwei-japan/cross-modal-visual-audio-retrieval.
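
To make the pipeline described in the abstract more concrete, the following is a minimal, hypothetical PyTorch sketch of cross-modal attention combined with a pooled per-modality branch and a fused, modality-integrated embedding. All module names (CrossModalAttention, DCLMASketch), feature dimensions, and the mean-pooling fusion scheme are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

# Illustrative sketch only: names, dimensions, and fusion choices are assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Global-level cross-modal attention: one modality queries the other."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (B, Nq, D), context_feats: (B, Nc, D)
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual connection + layer norm


class DCLMASketch(nn.Module):
    """Fuses per-modality summaries with global cross-modal representations."""

    def __init__(self, visual_dim: int = 1024, audio_dim: int = 128, dim: int = 512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.v2a_attn = CrossModalAttention(dim)   # visual queries attend to audio
        self.a2v_attn = CrossModalAttention(dim)   # audio queries attend to visual
        self.fuse = nn.Linear(2 * dim, dim)        # shared projection into the common space

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # visual: (B, Nv, visual_dim), audio: (B, Na, audio_dim)
        v = self.visual_proj(visual)
        a = self.audio_proj(audio)
        # Global-level inter-modal correlations via cross attention.
        v_global = self.v2a_attn(v, a).mean(dim=1)   # (B, D)
        a_global = self.a2v_attn(a, v).mean(dim=1)   # (B, D)
        # Pooled per-modality summaries (stand-in for the multi-scale branch).
        v_local = v.mean(dim=1)
        a_local = a.mean(dim=1)
        # Concatenate the two views and project to modality-integrated embeddings.
        v_embed = self.fuse(torch.cat([v_local, v_global], dim=-1))
        a_embed = self.fuse(torch.cat([a_local, a_global], dim=-1))
        return v_embed, a_embed


if __name__ == "__main__":
    model = DCLMASketch()
    v = torch.randn(4, 10, 1024)   # e.g. 10 visual frames per clip
    a = torch.randn(4, 20, 128)    # e.g. 20 audio frames per clip
    v_embed, a_embed = model(v, a)
    print(v_embed.shape, a_embed.shape)  # torch.Size([4, 512]) torch.Size([4, 512])

In this sketch, each modality attends to the other at the global level while keeping a pooled summary of its own features, and the two views are concatenated and projected into a single embedding per clip. Such embeddings could then be trained with a discriminative, modality-invariant objective (for example a cross-modal contrastive or triplet loss) so that matched visual-audio pairs lie close together in the shared latent space.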