学习语言引导的自适应超模态表征，用于多模态情感分析

Conference on Empirical Methods in Natural Language Processing Pub Date : 2023-10-09 DOI:10.18653/v1/2023.emnlp-main.49

Haoyu Zhang, Yu Wang, Guanghao Yin, Kejun Liu, Yuanyuan Liu, Tianshu Yu

{"title":"学习语言引导的自适应超模态表征，用于多模态情感分析","authors":"Haoyu Zhang, Yu Wang, Guanghao Yin, Kejun Liu, Yuanyuan Liu, Tianshu Yu","doi":"10.18653/v1/2023.emnlp-main.49","DOIUrl":null,"url":null,"abstract":"Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential sentiment-irrelevant and conflicting information across modalities may hinder the performance from being further improved. To alleviate this, we present Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can obtain a complementary and joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI and CH-SIMS) and an abundance of ablation demonstrates the validity and necessity of our irrelevance/conflict suppression mechanism.","PeriodicalId":505350,"journal":{"name":"Conference on Empirical Methods in Natural Language Processing","volume":"33 1","pages":"756-767"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis\",\"authors\":\"Haoyu Zhang, Yu Wang, Guanghao Yin, Kejun Liu, Yuanyuan Liu, Tianshu Yu\",\"doi\":\"10.18653/v1/2023.emnlp-main.49\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential sentiment-irrelevant and conflicting information across modalities may hinder the performance from being further improved. To alleviate this, we present Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can obtain a complementary and joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI and CH-SIMS) and an abundance of ablation demonstrates the validity and necessity of our irrelevance/conflict suppression mechanism.\",\"PeriodicalId\":505350,\"journal\":{\"name\":\"Conference on Empirical Methods in Natural Language Processing\",\"volume\":\"33 1\",\"pages\":\"756-767\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-10-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Conference on Empirical Methods in Natural Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2023.emnlp-main.49\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Conference on Empirical Methods in Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2023.emnlp-main.49","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

尽管多模态情感分析（MSA）通过利用来自多个来源（如语言、视频和音频）的丰富信息证明是有效的，但跨模态的潜在情感无关信息和冲突信息可能会阻碍性能的进一步提高。为了缓解这一问题，我们提出了自适应语言引导多模态转换器（ALMT），它结合了自适应超模态学习（AHL）模块，在不同尺度语言特征的引导下，从视觉和音频特征中学习不相关/冲突抑制表征。利用所获得的超模态表征，该模型可以通过多模态融合获得互补的联合表征，从而实现有效的 MSA。在实践中，ALMT 在几个流行的数据集（如 MOSI、MOSEI 和 CH-SIMS）上取得了一流的性能，大量的消减证明了我们的无关性/冲突抑制机制的有效性和必要性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis

Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential sentiment-irrelevant and conflicting information across modalities may hinder the performance from being further improved. To alleviate this, we present Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can obtain a complementary and joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI and CH-SIMS) and an abundance of ablation demonstrates the validity and necessity of our irrelevance/conflict suppression mechanism.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Conference on Empirical Methods in Natural Language Processing

自引率

0.00%

发文量