Multimodal Disentangled Fusion Network via VAEs for Multimodal Zero-Shot Learning

IF 4.5 2区 计算机科学 Q1 COMPUTER SCIENCE, CYBERNETICS
Yutian Li;Zhuopan Yang;Zhenguo Yang;Xiaoping Li;Wenyin Liu;Qing Li
{"title":"Multimodal Disentangled Fusion Network via VAEs for Multimodal Zero-Shot Learning","authors":"Yutian Li;Zhuopan Yang;Zhenguo Yang;Xiaoping Li;Wenyin Liu;Qing Li","doi":"10.1109/TCSS.2025.3575939","DOIUrl":null,"url":null,"abstract":"Addressing the bias problem in multimodal zero-shot learning tasks is challenging due to the domain shift between seen and unseen classes, as well as the semantic gap across different modalities. To tackle these challenges, we propose a multimodal disentangled fusion network (MDFN) that unifies the class embedding space for multimodal zero-shot learning. MDFN exploits feature disentangled variational autoencoder (FD-VAE) in two branches to distangle unimodal features into modality-specific representations that are semantically consistent and unrelated, where semantics are shared within classes. In particular, semantically consistent representations and unimodal features are integrated to retain the semantics of the original features in the form of residuals. Furthermore, multimodal conditional VAE (MC-VAE) in two branches is adopted to learn cross-modal interactions with modality-specific conditions. Finally, the complementary multimodal representations achieved by MC-VAE are encoded into a fusion network (FN) with a self-adaptive margin center loss (SAMC-loss) to predict target class labels in embedding forms. By learning the distance among domain samples, SAMC-loss promotes intraclass compactness and interclass separability. Experiments on zero-shot and news event datasets demonstrate the superior performance of MDFN, with the harmonic mean improved by 27.2% on the MMED dataset and 5.1% on the SUN dataset.","PeriodicalId":13044,"journal":{"name":"IEEE Transactions on Computational Social Systems","volume":"12 5","pages":"3684-3697"},"PeriodicalIF":4.5000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computational Social Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11073778/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, CYBERNETICS","Score":null,"Total":0}
引用次数: 0

Abstract

Addressing the bias problem in multimodal zero-shot learning tasks is challenging due to the domain shift between seen and unseen classes, as well as the semantic gap across different modalities. To tackle these challenges, we propose a multimodal disentangled fusion network (MDFN) that unifies the class embedding space for multimodal zero-shot learning. MDFN exploits feature disentangled variational autoencoder (FD-VAE) in two branches to distangle unimodal features into modality-specific representations that are semantically consistent and unrelated, where semantics are shared within classes. In particular, semantically consistent representations and unimodal features are integrated to retain the semantics of the original features in the form of residuals. Furthermore, multimodal conditional VAE (MC-VAE) in two branches is adopted to learn cross-modal interactions with modality-specific conditions. Finally, the complementary multimodal representations achieved by MC-VAE are encoded into a fusion network (FN) with a self-adaptive margin center loss (SAMC-loss) to predict target class labels in embedding forms. By learning the distance among domain samples, SAMC-loss promotes intraclass compactness and interclass separability. Experiments on zero-shot and news event datasets demonstrate the superior performance of MDFN, with the harmonic mean improved by 27.2% on the MMED dataset and 5.1% on the SUN dataset.
基于vae的多模态解纠缠融合网络用于多模态零射击学习
由于可见类和未见类之间的域转移以及不同模态之间的语义差距,解决多模态零射击学习任务中的偏差问题具有挑战性。为了解决这些挑战,我们提出了一种多模态解纠缠融合网络(MDFN),该网络统一了多模态零射击学习的类嵌入空间。mfn利用两个分支中的特征解纠缠变分自编码器(FD-VAE)将单模态特征分离为特定于模态的表示,这些表示在语义上一致且不相关,其中语义在类中共享。特别地,将语义一致的表示和单峰特征结合起来,以残差的形式保留原始特征的语义。此外,采用两个分支的多模态条件VAE (MC-VAE)来学习具有模态特定条件的跨模态交互。最后,将MC-VAE获得的互补多模态表示编码到一个融合网络(FN)中,该网络具有自适应边缘中心损失(SAMC-loss),用于预测嵌入形式中的目标类标签。SAMC-loss通过学习域样本之间的距离,提高了类内紧密性和类间可分离性。在零射击和新闻事件数据集上的实验证明了mfn的优越性能,在MMED数据集上谐波均值提高了27.2%,在SUN数据集上提高了5.1%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
IEEE Transactions on Computational Social Systems
IEEE Transactions on Computational Social Systems Social Sciences-Social Sciences (miscellaneous)
CiteScore
10.00
自引率
20.00%
发文量
316
期刊介绍: IEEE Transactions on Computational Social Systems focuses on such topics as modeling, simulation, analysis and understanding of social systems from the quantitative and/or computational perspective. "Systems" include man-man, man-machine and machine-machine organizations and adversarial situations as well as social media structures and their dynamics. More specifically, the proposed transactions publishes articles on modeling the dynamics of social systems, methodologies for incorporating and representing socio-cultural and behavioral aspects in computational modeling, analysis of social system behavior and structure, and paradigms for social systems modeling and simulation. The journal also features articles on social network dynamics, social intelligence and cognition, social systems design and architectures, socio-cultural modeling and representation, and computational behavior modeling, and their applications.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信