Multimodal Transformer for Unaligned Multimodal Language Sequences.

Proceedings of the conference. Association for Computational Linguistics. Meeting. Pub Date: 2019-07-01. DOI: 10.18653/v1/p19-1656

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

Pages: 6558-6569. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7195022/pdf/nihms-1570579.pdf


Abstract

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, modeling such multimodal human language time-series data poses two major challenges: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to address both issues generically and end to end, without explicitly aligning the data. At the heart of our model is directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism in MulT can capture correlated crossmodal signals.
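The directional crossmodal attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a standard scaled dot-product formulation in which queries come from the target modality and keys and values come from the source modality. The function name, the random projection matrices, the feature sizes, and the sequence lengths (20 text steps, 150 audio frames) are all hypothetical and chosen only to show that the two sequences need not be aligned.

```python
import numpy as np

def crossmodal_attention(target, source, w_q, w_k, w_v):
    """Adapt `source` (T_src, d_src) onto the time axis of `target` (T_tgt, d_tgt).

    Assumed formulation: plain scaled dot-product attention with
    queries from the target and keys/values from the source.
    """
    q = target @ w_q                          # (T_tgt, d_k)
    k = source @ w_k                          # (T_src, d_k)
    v = source @ w_v                          # (T_src, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_tgt, T_src)
    # Numerically stable softmax over the source time steps.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical unaligned inputs: the two modalities have different lengths.
rng = np.random.default_rng(0)
text = rng.normal(size=(20, 32))     # 20 word-level steps, 32 features
audio = rng.normal(size=(150, 16))   # 150 acoustic frames, 16 features

d_k = d_v = 8
w_q = rng.normal(size=(32, d_k))     # projects the target (text)
w_k = rng.normal(size=(16, d_k))     # projects the source (audio)
w_v = rng.normal(size=(16, d_v))

# Direction audio -> text: each text step attends over all audio frames.
adapted, weights = crossmodal_attention(text, audio, w_q, w_k, w_v)
print(adapted.shape)   # (20, 8)
print(weights.shape)   # (20, 150)
```

Each row of `weights` is a distribution over every source time step, so no frame-to-word alignment is required beforehand. Swapping the `target` and `source` arguments (with matching projection shapes) gives the opposite direction, which is what makes the mechanism directional and pairwise.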


