Scalable multi-modal representation learning networks
Zihan Fang, Ying Zou, Shiyang Lan, Shide Du, Yanchao Tan, Shiping Wang
Artificial Intelligence Review 58(7), published 2025-04-17. DOI: 10.1007/s10462-025-11224-8
Multi-modal representation learning is valued for its ability to interpret data comprehensively across diverse modalities. Although existing approaches have yielded favorable results, they struggle to preserve high-order information and to generalize to out-of-sample data. To tackle these issues, we propose a scalable multi-modal representation learning network framework that learns optimal modality-specific projection matrices to map multi-modal features into a shared representation space. Specifically, weight-guided modality-wise and row-sparsity-driven feature-wise measures are used to achieve adaptive hierarchical feature selection from the original data. Then, within the unified latent representation space, we employ hypergraph embedding to preserve the intricate high-order local geometric structures of the modality-specific high-dimensional spaces. Finally, we propose a proximal operator-inspired network architecture to solve the optimization objective, streamlining feature auto-weighted selection and representation learning. Experimental results demonstrate the effectiveness and superiority of the proposed method, and online testing on out-of-sample data further demonstrates robust generalization. The code of the proposed method is publicly available at: https://github.com/ZihanFang11/SMMRL.
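To make the link between row sparsity, feature selection, and proximal operators more concrete, the sketch below shows the standard proximal operator of the ℓ2,1 norm (row-wise soft-thresholding) applied to a small projection matrix. It is only an illustration under assumptions: the abstract does not state which row-sparsity penalty is used, so the ℓ2,1 norm, the function names `prox_l21` and `selected_features`, and the toy matrix are all hypothetical and are not taken from the authors' released code.

```python
import math


def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1} (row-wise soft-thresholding).

    W is a list of rows, one row per input feature (assumed layout).
    Each row is shrunk toward zero by tau in Euclidean norm; rows whose
    norm is at most tau become exactly zero, which drops that feature.
    """
    out = []
    for row in W:
        norm = math.sqrt(sum(x * x for x in row))
        # A zero row stays zero; otherwise scale by max(0, 1 - tau / norm).
        scale = max(0.0, 1.0 - tau / norm) if norm > 0.0 else 0.0
        out.append([scale * x for x in row])
    return out


def selected_features(W, tol=1e-12):
    """Indices of rows that are not (numerically) zero."""
    return [i for i, row in enumerate(W) if any(abs(x) > tol for x in row)]


# Toy projection matrix: 3 input features mapped to a 2-dimensional space.
W = [[3.0, 4.0],   # strong feature, row norm 5.0
     [0.3, 0.4],   # weak feature, row norm 0.5
     [0.0, 0.0]]   # already unused

W_new = prox_l21(W, tau=1.0)
kept = selected_features(W_new)
```

With a threshold of 1.0, the weak row falls below the threshold and is zeroed, so only the first feature is kept. In a proximal operator-inspired network, a step of this kind would typically alternate with a gradient-style update, although the exact layer design is described in the paper rather than here.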
About the journal:
Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.