Disentanglement Translation Network for multimodal sentiment analysis

IF 15.5 · CAS Zone 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Ying Zeng, Wenjun Yan, Sijie Mai, Haifeng Hu
{"title":"Disentanglement Translation Network for multimodal sentiment analysis","authors":"Ying Zeng,&nbsp;Wenjun Yan,&nbsp;Sijie Mai,&nbsp;Haifeng Hu","doi":"10.1016/j.inffus.2023.102031","DOIUrl":null,"url":null,"abstract":"<div><p>Obtaining an effective joint representation has always been the goal for multimodal tasks. However, distributional gap inevitably exists due to the heterogeneous nature of different modalities, which poses burden on the fusion process and the learning of multimodal representation. The imbalance of modality dominance further aggravates this problem, where inferior modalities may contain much redundancy that introduces additional variations. To address the aforementioned issues, we propose a Disentanglement Translation Network (DTN) with Slack Reconstruction to capture desirable information properties, obtain a unified feature distribution and reduce redundancy. Specifically, the encoder–decoder-based disentanglement framework is adopted to decouple the unimodal representations into modality-common and modality-specific subspaces, which explores the cross-modal commonality and diversity, respectively. In the encoding stage, to narrow down the discrepancy, a two-stage translation is devised to incorporate with the disentanglement learning framework. The first stage targets at learning modality-invariant embedding for modality-common information with adversarial learning strategy, capturing the commonality shared across modalities. The second stage considers the modality-specific information that reveals diversity. To relieve the burden of multimodal fusion, we realize Specific-Common Distribution Matching to further unify the distribution of the desirable information. As for the decoding and reconstruction stage, we propose Slack Reconstruction to seek a balance between retaining discriminative information and reducing redundancy. Although the existing commonly-used reconstruction loss with strict constraint lowers the risk of information loss, it easily leads to the preservation of information redundancy. In contrast, Slack Reconstruction imposes a more relaxed constraint so that the redundancy is not forced to be retained, and simultaneously explores the inter-sample relationships. The proposed method aids multimodal fusion by learning the exact properties and obtaining a more uniform distribution for cross-modal data, and manages to reduce information redundancy to further ensure feature effectiveness. Extensive experiments on the task of multimodal sentiment analysis indicate the effectiveness of the proposed method. The codes are available at <span>https://github.com/zengy268/DTN</span><svg><path></path></svg>.</p></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"102 ","pages":"Article 102031"},"PeriodicalIF":15.5000,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253523003470","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Obtaining an effective joint representation has always been the goal of multimodal tasks. However, a distributional gap inevitably exists due to the heterogeneous nature of different modalities, which burdens both the fusion process and the learning of the multimodal representation. The imbalance of modality dominance further aggravates this problem, as inferior modalities may contain considerable redundancy that introduces additional variation. To address these issues, we propose a Disentanglement Translation Network (DTN) with Slack Reconstruction to capture the desirable information properties, obtain a unified feature distribution, and reduce redundancy. Specifically, an encoder-decoder-based disentanglement framework decouples the unimodal representations into modality-common and modality-specific subspaces, which capture cross-modal commonality and diversity, respectively. In the encoding stage, to narrow the discrepancy, a two-stage translation is devised and integrated with the disentanglement learning framework. The first stage learns a modality-invariant embedding for modality-common information with an adversarial learning strategy, capturing the commonality shared across modalities. The second stage handles the modality-specific information that reveals diversity. To relieve the burden on multimodal fusion, we apply Specific-Common Distribution Matching to further unify the distribution of the desirable information. In the decoding and reconstruction stage, we propose Slack Reconstruction to balance retaining discriminative information against reducing redundancy. Although the commonly used reconstruction loss with its strict constraint lowers the risk of information loss, it easily leads to the preservation of redundant information. In contrast, Slack Reconstruction imposes a more relaxed constraint so that redundancy is not forced to be retained, while simultaneously exploiting inter-sample relationships. The proposed method aids multimodal fusion by learning the desired properties and obtaining a more uniform distribution for cross-modal data, and it reduces information redundancy to further ensure feature effectiveness. Extensive experiments on multimodal sentiment analysis demonstrate the effectiveness of the proposed method. The code is available at https://github.com/zengy268/DTN.
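The abstract describes an encoder that disentangles each unimodal representation into modality-common and modality-specific subspaces, with adversarial learning pulling the common embeddings toward a modality-invariant distribution. The authors' code is at the linked repository; the snippet below is only a minimal PyTorch-style sketch of that idea for orientation. The class names (DisentangleEncoder, GradReverse), the gradient-reversal trick, the hidden size, and the per-modality input dimensions are all assumptions, not taken from the released implementation.

```python
# Minimal sketch (assumed, not the authors' released code): split each unimodal
# feature into a modality-common and a modality-specific embedding, and use an
# adversarial modality discriminator with gradient reversal to push the common
# embeddings toward a modality-invariant space.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class DisentangleEncoder(nn.Module):
    """Projects one modality into common and specific subspaces."""

    def __init__(self, in_dim, hid_dim=128):
        super().__init__()
        self.common = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.specific = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())

    def forward(self, x):
        return self.common(x), self.specific(x)


# One encoder per modality; the input dimensions here are placeholders.
encoders = nn.ModuleDict({
    "text": DisentangleEncoder(768),
    "audio": DisentangleEncoder(74),
    "vision": DisentangleEncoder(47),
})

# The discriminator tries to tell which modality a common embedding came from;
# gradient reversal makes the encoders fool it, i.e. learn a shared distribution.
discriminator = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))
ce = nn.CrossEntropyLoss()


def adversarial_common_loss(features):
    """features: dict mapping modality name -> (batch, in_dim) tensor."""
    losses = []
    for idx, (name, x) in enumerate(features.items()):
        common, _specific = encoders[name](x)
        logits = discriminator(GradReverse.apply(common))
        labels = torch.full((x.size(0),), idx, dtype=torch.long)
        losses.append(ce(logits, labels))
    return torch.stack(losses).mean()
```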

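The abstract contrasts the usual strict reconstruction loss, which ties the decoder output to the original feature element by element, with Slack Reconstruction, which relaxes this constraint and exploits inter-sample relationships. The exact formulation is not given in the abstract, so the sketch below illustrates only one plausible relaxation consistent with that description: matching the batch's pairwise cosine-similarity structure rather than every element. It is an assumption for illustration, not the loss defined in the paper.

```python
# Sketch of a relaxed, relation-based reconstruction objective (an assumed
# reading of "Slack Reconstruction", not the paper's exact loss).
import torch
import torch.nn.functional as F


def strict_reconstruction_loss(x_rec, x):
    # Conventional element-wise constraint: the decoder must reproduce every
    # component of x, including redundant ones.
    return F.mse_loss(x_rec, x)


def slack_reconstruction_loss(x_rec, x):
    # Relaxed constraint: only require the reconstructions to preserve the
    # inter-sample similarity structure of the batch, so redundant per-element
    # detail need not be retained.
    sim_rec = F.cosine_similarity(x_rec.unsqueeze(1), x_rec.unsqueeze(0), dim=-1)
    sim_org = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)
    return F.mse_loss(sim_rec, sim_org)


# Toy usage: a batch of 8 reconstructed vs. original 128-d features.
x = torch.randn(8, 128)
x_rec = torch.randn(8, 128)
print(strict_reconstruction_loss(x_rec, x).item(),
      slack_reconstruction_loss(x_rec, x).item())
```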
Source journal: Information Fusion (Engineering & Technology – Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Annual articles: 161
Review time: 7.9 months
Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.