Deep Multimodal Fusion: Combining Discrete Events and Continuous Signals

H. P. Martínez, Georgios N. Yannakakis
{"title":"Deep Multimodal Fusion: Combining Discrete Events and Continuous Signals","authors":"H. P. Martínez, Georgios N. Yannakakis","doi":"10.1145/2663204.2663236","DOIUrl":null,"url":null,"abstract":"Multimodal datasets often feature a combination of continuous signals and a series of discrete events. For instance, when studying human behaviour it is common to annotate actions performed by the participant over several other modalities such as video recordings of the face or physiological signals. These events are nominal, not frequent and are not sampled at a continuous rate while signals are numeric and often sampled at short fixed intervals. This fundamentally different nature complicates the analysis of the relation among these modalities which is often studied after each modality has been summarised or reduced. This paper investigates a novel approach to model the relation between such modality types bypassing the need for summarising each modality independently of each other. For that purpose, we introduce a deep learning model based on convolutional neural networks that is adapted to process multiple modalities at different time resolutions we name deep multimodal fusion. Furthermore, we introduce and compare three alternative methods (convolution, training and pooling fusion) to integrate sequences of events with continuous signals within this model. We evaluate deep multimodal fusion using a game user dataset where player physiological signals are recorded in parallel with game events. Results suggest that the proposed architecture can appropriately capture multimodal information as it yields higher prediction accuracies compared to single-modality models. In addition, it appears that pooling fusion, based on a novel filter-pooling method provides the more effective fusion approach for the investigated types of data.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"54","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2663204.2663236","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 54

Abstract

Multimodal datasets often feature a combination of continuous signals and a series of discrete events. For instance, when studying human behaviour it is common to annotate actions performed by the participant over several other modalities, such as video recordings of the face or physiological signals. These events are nominal, infrequent and not sampled at a continuous rate, while signals are numeric and often sampled at short fixed intervals. This fundamentally different nature complicates the analysis of the relation among these modalities, which is often studied only after each modality has been summarised or reduced. This paper investigates a novel approach to modelling the relation between such modality types that bypasses the need to summarise each modality independently. For that purpose, we introduce a deep learning model based on convolutional neural networks that is adapted to process multiple modalities at different time resolutions, which we name deep multimodal fusion. Furthermore, we introduce and compare three alternative methods (convolution, training and pooling fusion) to integrate sequences of events with continuous signals within this model. We evaluate deep multimodal fusion using a game user dataset where player physiological signals are recorded in parallel with game events. Results suggest that the proposed architecture can appropriately capture multimodal information, as it yields higher prediction accuracies than single-modality models. In addition, it appears that pooling fusion, based on a novel filter-pooling method, provides the most effective fusion approach for the investigated types of data.
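To make the idea concrete, the sketch below shows one way to fuse a continuous physiological signal with a sparse event stream inside a small 1-D convolutional network, loosely in the spirit of the pooling-fusion variant described above. It is a minimal PyTorch sketch under stated assumptions: the layer sizes, the binary per-channel encoding of events at the signal's sampling rate, and the event-gated pooling step are illustrative choices, not the architecture or the exact filter-pooling operation reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalFusionNet(nn.Module):
    """Toy fusion of a continuous signal and a discrete event stream (illustrative only)."""

    def __init__(self, signal_channels=1, event_channels=4, hidden=16):
        super().__init__()
        # 1-D convolutional feature extractor for the continuous signal.
        self.signal_conv = nn.Conv1d(signal_channels, hidden, kernel_size=9, padding=4)
        # The event stream is assumed to be up-sampled to the signal rate as a
        # sparse multi-channel indicator sequence (one channel per event type).
        self.event_conv = nn.Conv1d(event_channels, hidden, kernel_size=9, padding=4)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, signal, events):
        # signal: (batch, signal_channels, T); events: (batch, event_channels, T)
        s = F.relu(self.signal_conv(signal))
        e = F.relu(self.event_conv(events))
        # Event-gated pooling (an illustrative stand-in for filter-pooling):
        # average the signal feature maps only at time steps where at least one
        # event is active; clamp avoids division by zero for event-free windows.
        mask = (events.sum(dim=1, keepdim=True) > 0).float()        # (batch, 1, T)
        gated = (s * mask).sum(dim=2) / mask.sum(dim=2).clamp(min=1.0)
        pooled_events = e.mean(dim=2)
        fused = torch.cat([gated, pooled_events], dim=1)            # concatenate both streams
        return self.head(fused)                                     # e.g. a scalar affect score


# Hypothetical usage: a 60-second window of a physiological signal at 32 Hz
# (1920 samples) alongside four event types firing sparsely.
net = MultimodalFusionNet()
signal = torch.randn(8, 1, 1920)
events = (torch.rand(8, 4, 1920) > 0.99).float()
print(net(signal, events).shape)   # -> torch.Size([8, 1])
```

The design choice illustrated here is that the discrete events, once aligned to the signal's time axis, can steer how the continuous features are pooled rather than being concatenated as just another input channel; the paper compares this kind of pooling-level integration against convolution-level and training-level fusion.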