Multimodal Disentangled Fusion Network via VAEs for Multimodal Zero-Shot Learning
Yutian Li, Zhuopan Yang, Zhenguo Yang, Xiaoping Li, Wenyin Liu, Qing Li
IEEE Transactions on Computational Social Systems, vol. 12, no. 5, pp. 3684-3697. Published 2025-07-08. Journal Article.
DOI: 10.1109/TCSS.2025.3575939
URL: https://ieeexplore.ieee.org/document/11073778/
Impact factor: 4.5 (JCR Q1, Computer Science, Cybernetics)
Citations: 0
Abstract
Addressing the bias problem in multimodal zero-shot learning is challenging because of the domain shift between seen and unseen classes and the semantic gap across modalities. To tackle these challenges, we propose a multimodal disentangled fusion network (MDFN) that unifies the class embedding space for multimodal zero-shot learning. MDFN uses a feature-disentangled variational autoencoder (FD-VAE) in two branches to disentangle unimodal features into modality-specific representations that are either semantically consistent or semantically unrelated, where semantics are shared within classes. In particular, the semantically consistent representations are integrated with the unimodal features in the form of residuals, so that the semantics of the original features are retained. Furthermore, a multimodal conditional VAE (MC-VAE) in two branches is adopted to learn cross-modal interactions under modality-specific conditions. Finally, the complementary multimodal representations produced by the MC-VAE are encoded by a fusion network (FN) trained with a self-adaptive margin center loss (SAMC-loss) to predict target class labels in embedding form. By learning the distances among domain samples, SAMC-loss promotes intraclass compactness and interclass separability. Experiments on zero-shot and news event datasets demonstrate the superior performance of MDFN, with the harmonic mean improved by 27.2% on the MMED dataset and 5.1% on the SUN dataset.
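The harmonic mean reported above is not defined in the abstract. A minimal sketch, assuming the definition conventional in generalized zero-shot learning, H = 2SU / (S + U), where S and U are the accuracies on seen and unseen classes, is given below. The function name and the accuracy values are hypothetical and are not taken from the paper.

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """Harmonic mean H = 2*S*U / (S + U) of seen and unseen class accuracies.

    Assumption: this is the conventional generalized zero-shot learning
    metric; the abstract does not state the exact definition it uses.
    """
    if acc_seen < 0 or acc_unseen < 0:
        raise ValueError("accuracies must be non-negative")
    total = acc_seen + acc_unseen
    if total == 0:
        return 0.0  # both accuracies are zero, so H is defined as zero
    return 2.0 * acc_seen * acc_unseen / total


# Hypothetical accuracies chosen for illustration only.
balanced = harmonic_mean(0.60, 0.60)  # equal performance on both class sets
biased = harmonic_mean(0.90, 0.10)    # strong bias toward seen classes

print(f"balanced: {balanced:.2f}")
print(f"biased:   {biased:.2f}")
```

Under this definition, a model biased toward seen classes scores far lower than its arithmetic mean accuracy would suggest, which is why the metric is commonly used to expose the bias problem the abstract describes.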
About the journal:
IEEE Transactions on Computational Social Systems focuses on the modeling, simulation, analysis, and understanding of social systems from a quantitative and/or computational perspective. "Systems" include man-man, man-machine, and machine-machine organizations and adversarial situations, as well as social media structures and their dynamics. More specifically, the journal publishes articles on modeling the dynamics of social systems, methodologies for incorporating and representing socio-cultural and behavioral aspects in computational modeling, analysis of social system behavior and structure, and paradigms for social systems modeling and simulation. The journal also features articles on social network dynamics, social intelligence and cognition, social systems design and architectures, socio-cultural modeling and representation, computational behavior modeling, and their applications.