Scalable multi-modal representation learning networks

IF 10.7 · Zone 2 (Computer Science) · JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Artificial Intelligence Review · Pub Date: 2025-04-17 · DOI: 10.1007/s10462-025-11224-8

Zihan Fang, Ying Zou, Shiyang Lan, Shide Du, Yanchao Tan, Shiping Wang

Artificial Intelligence Review, volume 58, issue 7. Article page: https://link.springer.com/article/10.1007/s10462-025-11224-8 · PDF: https://link.springer.com/content/pdf/10.1007/s10462-025-11224-8.pdf

Citations: 0

Abstract

Multi-modal representation learning is recognized for its comprehensive interpretation across diverse modalities. Although existing approaches have yielded favorable results, they face challenges in high-order information preservation and out-of-sample data generalization. To tackle these issues, we propose a scalable multi-modal representation learning networks framework, which aims to learn optimal modality-specific projection matrices that project multi-modal features into a shared representation space. Specifically, weight-guided modality-wise and row-sparsity-driven feature-wise measures are considered to achieve adaptive hierarchical feature selection from the original data. Then, within the unified latent representation space, we employ hypergraph embedding to preserve the intricate high-order local geometric structures within the modality-specific high-dimensional spaces. Finally, we propose a proximal operator-inspired network architecture to resolve the optimization objectives, streamlining the process of feature auto-weighted selection and representation learning. The experimental results highlight the effectiveness and superiority of the proposed method, while online testing on out-of-sample data further demonstrates robust generalization. The code of the proposed method is publicly available at: https://github.com/ZihanFang11/SMMRL.
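The abstract mentions row-sparsity-driven feature selection and a proximal operator-inspired architecture but does not state the regularizer. As a minimal sketch, assuming the row sparsity comes from an l2,1-norm penalty on a projection matrix (an assumption, not confirmed by the abstract), the corresponding proximal operator is row-wise soft-thresholding. The names `prox_l21` and `lam` are hypothetical and are not taken from the authors' repository.

```python
import math


def prox_l21(rows, lam):
    """Proximal operator of lam * ||W||_{2,1}: shrink each row toward zero.

    `rows` is a matrix given as a list of rows (one row per input feature).
    Assumption: an l2,1 penalty is used; the abstract does not confirm this.
    """
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        # Rows with norm <= lam are zeroed, which discards that feature.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0.0 else 0.0
        out.append([scale * x for x in row])
    return out


W = [
    [3.0, 4.0],  # row norm 5.0: kept, scaled by 1 - 1/5
    [0.3, 0.4],  # row norm about 0.5, below lam: zeroed
]
shrunk = prox_l21(W, lam=1.0)
```

Under this assumption, one such shrinkage step could serve as the nonlinearity of a network layer, with each zeroed row corresponding to a feature dropped from that modality.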

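Source journal

Artificial Intelligence Review (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 22.00

Self-citation rate: 3.30%

Articles published: 194

Review time: 5.3 months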

About the journal: Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.