Residual multimodal Transformer for expression-EEG fusion continuous emotion recognition

IF 7.3, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

CAAI Transactions on Intelligence Technology, Pub Date: 2024-05-08, DOI: 10.1049/cit2.12346

Xiaofang Jin, Jieyu Xiao, Libiao Jin, Xinruo Zhang

Volume 9, Issue 5, pages 1290-1304. Full text: https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cit2.12346

Citations: 0

Abstract

Continuous emotion recognition predicts emotional states from affective information, with an emphasis on how emotion varies continuously over time. The fusion of electroencephalography (EEG) and facial expression videos has been used in this field, but current research has limitations, such as reliance on hand-engineered features and simple integration approaches. Hence, a new continuous emotion recognition model based on the fusion of EEG and facial expression videos, named the residual multimodal Transformer (RMMT), is proposed. First, ResNet50 and a temporal convolutional network (TCN) are utilised to extract spatiotemporal features from videos, and a TCN is also applied to the computed EEG frequency power to acquire spatiotemporal features of the EEG. Then, a multimodal Transformer is used to fuse the spatiotemporal features from the two modalities. Furthermore, a residual connection is introduced to fuse shallow features with deep features, which experiments verify to be effective for continuous emotion recognition. Inspired by knowledge distillation, the authors incorporate a feature-level loss into the loss function to further enhance network performance. Experimental results show that the RMMT outperforms other methods on the MAHNOB-HCI dataset. Ablation studies on the residual connection and the loss function in the RMMT demonstrate that both are effective.
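The two ideas the abstract singles out, a residual connection that adds shallow features back onto deep ones and a loss with an extra feature-level term, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the use of mean squared error for both terms, the `weight` parameter, and the reference features `ref_feat` are all assumptions made for illustration, since the abstract does not specify them. The `deep_fn` argument stands in for whatever deeper fusion stage (such as the multimodal Transformer) produces the deep features.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]  # one feature vector per time step


def residual_fuse(shallow: Sequence,
                  deep_fn: Callable[[Sequence], Sequence]) -> Sequence:
    """Add shallow features element-wise onto the output of a deeper stage.

    deep_fn is a placeholder (assumption) for the deep fusion stage; it must
    return a sequence with the same shape as its input.
    """
    deep = deep_fn(shallow)
    return [[s + d for s, d in zip(s_t, d_t)]
            for s_t, d_t in zip(shallow, deep)]


def mse(a: Sequence, b: Sequence) -> float:
    """Mean squared error over all time steps and feature dimensions."""
    diffs = [(x - y) ** 2 for a_t, b_t in zip(a, b) for x, y in zip(a_t, b_t)]
    return sum(diffs) / len(diffs)


def total_loss(pred: Sequence, target: Sequence,
               feat: Sequence, ref_feat: Sequence,
               weight: float = 0.5) -> float:
    """Prediction loss plus a weighted feature-level term.

    The MSE form and the weight are assumptions; the abstract only says a
    feature-level loss is incorporated into the loss function.
    """
    return mse(pred, target) + weight * mse(feat, ref_feat)


# Toy data: two time steps, two feature dimensions.
shallow = [[1.0, 2.0], [3.0, 4.0]]


def halve(seq: Sequence) -> Sequence:
    """Toy stand-in for a deep fusion stage: scales every value by 0.5."""
    return [[0.5 * v for v in step] for step in seq]


fused = residual_fuse(shallow, halve)

pred = [[0.0], [1.0]]                   # per-time-step emotion prediction
target = [[0.0], [0.0]]                 # per-time-step label
ref_feat = [[1.5, 3.0], [4.5, 5.0]]     # hypothetical reference features
loss = total_loss(pred, target, fused, ref_feat, weight=0.5)
```

With this toy data, `fused` is the element-wise sum of the shallow features and their halved copy, and `loss` combines a prediction error of 0.5 with a feature-level error of 0.25 scaled by the weight. In a real model both terms would be computed on tensors inside a training loop, and the balance between them would be tuned.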

Source journal

CAAI Transactions on Intelligence Technology (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore

11.00

Self-citation rate

3.90%

Articles published

134

Review time

35 weeks

About the journal: CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. It is a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), providing research that is openly accessible to read and share worldwide.