Learning Modality-Specific and -Agnostic Representations for Asynchronous Multimodal Language Sequences

Dingkang Yang, Haopeng Kuang, Shuai Huang, Lihua Zhang
{"title":"Learning Modality-Specific and -Agnostic Representations for Asynchronous Multimodal Language Sequences","authors":"Dingkang Yang, Haopeng Kuang, Shuai Huang, Lihua Zhang","doi":"10.1145/3503161.3547755","DOIUrl":null,"url":null,"abstract":"Understanding human behaviors and intents from videos is a challenging task. Video flows usually involve time-series data from different modalities, such as natural language, facial gestures, and acoustic information. Due to the variable receiving frequency for sequences from each modality, the collected multimodal streams are usually unaligned. For multimodal fusion of asynchronous sequences, the existing methods focus on projecting multiple modalities into a common latent space and learning the hybrid representations, which neglects the diversity of each modality and the commonality across different modalities. Motivated by this observation, we propose a Multimodal Fusion approach for learning modality-Specific and modality-Agnostic representations (MFSA) to refine multimodal representations and leverage the complementarity across different modalities. Specifically, a predictive self-attention module is used to capture reliable contextual dependencies and enhance the unique features over the modality-specific spaces. Meanwhile, we propose a hierarchical cross-modal attention module to explore the correlations between cross-modal elements over the modality-agnostic space. In this case, a double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner. Eventually, the modality-specific and -agnostic multimodal representations are used together for downstream tasks. Comprehensive experiments on three multimodal datasets clearly demonstrate the superiority of our approach.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3503161.3547755","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

Understanding human behaviors and intents from videos is a challenging task. Video streams usually involve time-series data from different modalities, such as natural language, facial gestures, and acoustic information. Because sequences from each modality are received at different frequencies, the collected multimodal streams are usually unaligned. For multimodal fusion of asynchronous sequences, existing methods focus on projecting multiple modalities into a common latent space and learning hybrid representations, which neglects the diversity of each modality and the commonality across different modalities. Motivated by this observation, we propose a Multimodal Fusion approach for learning modality-Specific and modality-Agnostic representations (MFSA) to refine multimodal representations and leverage the complementarity across different modalities. Specifically, a predictive self-attention module captures reliable contextual dependencies and enhances the unique features over the modality-specific spaces. Meanwhile, we propose a hierarchical cross-modal attention module to explore the correlations between cross-modal elements over the modality-agnostic space. A double-discriminator strategy is then presented to ensure that distinct representations are produced in an adversarial manner. Finally, the modality-specific and -agnostic multimodal representations are used together for downstream tasks. Comprehensive experiments on three multimodal datasets clearly demonstrate the superiority of our approach.
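
The abstract describes the pipeline only at a high level. The following is a minimal, illustrative PyTorch sketch of that kind of fusion, not the authors' implementation: the predictive self-attention and hierarchical cross-modal attention modules are replaced by plain Transformer / multi-head attention stand-ins, the adversarial double-discriminator is omitted, and all module names, dimensions, and the regression head are assumptions.

```python
# Illustrative sketch of an MFSA-style fusion pipeline (assumptions throughout;
# not the authors' code). Each modality keeps a modality-specific representation
# from self-attention, and a modality-agnostic one from cross-modal attention.
import torch
import torch.nn as nn


class ModalitySpecificEncoder(nn.Module):
    """Self-attention over one modality's sequence (stand-in for the
    predictive self-attention module; here a plain Transformer encoder)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        return self.encoder(x)


class CrossModalAttention(nn.Module):
    """One cross-modal attention step: the target modality queries the other
    modalities' sequences (simplified stand-in for the hierarchical module)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target, source):
        out, _ = self.attn(query=target, key=source, value=source)
        return out


class MFSASketch(nn.Module):
    """Fuses modality-specific and modality-agnostic representations."""
    def __init__(self, dims, shared_dim=64):
        super().__init__()
        # Project unaligned text/audio/vision streams to a common width.
        self.proj = nn.ModuleDict({m: nn.Linear(d, shared_dim) for m, d in dims.items()})
        self.specific = nn.ModuleDict({m: ModalitySpecificEncoder(shared_dim) for m in dims})
        self.cross = nn.ModuleDict({m: CrossModalAttention(shared_dim) for m in dims})
        # The paper's double-discriminator (adversarial separation of the two
        # representation types) is omitted in this sketch.
        self.head = nn.Linear(2 * len(dims) * shared_dim, 1)  # e.g. sentiment regression

    def forward(self, inputs):                  # inputs: dict of (batch, len_m, dim_m)
        proj = {m: self.proj[m](x) for m, x in inputs.items()}
        specific = {m: self.specific[m](h).mean(dim=1) for m, h in proj.items()}
        agnostic = {}
        for m in proj:                           # each modality attends to the others
            others = torch.cat([proj[o] for o in proj if o != m], dim=1)
            agnostic[m] = self.cross[m](proj[m], others).mean(dim=1)
        fused = torch.cat([specific[m] for m in proj] + [agnostic[m] for m in proj], dim=-1)
        return self.head(fused)


# Usage with unaligned sequence lengths per modality (hypothetical shapes):
dims = {"text": 300, "audio": 74, "vision": 35}
model = MFSASketch(dims)
batch = {"text": torch.randn(2, 50, 300),
         "audio": torch.randn(2, 375, 74),
         "vision": torch.randn(2, 500, 35)}
print(model(batch).shape)                        # torch.Size([2, 1])
```

Note that no word-level alignment is assumed: each modality keeps its own sequence length, and cross-modal attention handles the asynchrony, which is the setting the abstract targets.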